linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* EEH + hotplug fixes
@ 2019-09-03 10:15 Oliver O'Halloran
  2019-09-03 10:15 ` [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes Oliver O'Halloran
                   ` (13 more replies)
  0 siblings, 14 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff

Fixes various random crashes and other bad behaviour when hot
pluggable slots are used with EEH, namely:

1) Random crashes due to eeh_pe and pci_dn lifecycle mis-management
2) Hotplug slots tearing down devices you are trying to recover due
   to the reset that occurs while recovering a PE / bus.
3) Hot-remove causing spurious EEH events.

And some others. This series also enables pnv_php on Power9
since various people were carrying around hacks to make it
work and with the above fixes it seems to be fairly stable
now.

The series also adds the beginnings of a platform-independent test
infrastructure for EEH and a selftest script that exercises the
basic recovery path.

Oliver



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-17  0:43   ` Sam Bobroff
  2019-09-19 10:25   ` Michael Ellerman
  2019-09-03 10:15 ` [PATCH 02/14] powerpc/eeh: Fix race when freeing PDNs Oliver O'Halloran
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

When the last device in an eeh_pe is removed the eeh_pe structure itself
(and any empty parents) are freed since they are no longer needed. This
results in a crash when a hotplug driver is involved since the following
may occur:

1. Device is suprise removed.
2. Driver performs an MMIO, which fails and queues and eeh_event.
3. Hotplug driver receives a hotplug interrupt and removes any
   pci_devs that were under the slot.
4. pci_dev is torn down and the eeh_pe is freed.
5. The EEH event handler thread processes the eeh_event and crashes
   since the eeh_pe pointer in the eeh_event structure is no
   longer valid.

Crashing is generally considered poor form. Instead of doing that use
the fact PEs are marked as EEH_PE_INVALID to keep them around until the
end of the recovery cycle, at which point we can safely prune any empty
PEs.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
Sam Bobroff is working on implementing proper refcounting for EEH PEs,
but that's not going to be ready for a while and this is less broken
than what we have now.
---
 arch/powerpc/kernel/eeh_driver.c | 36 ++++++++++++++++++++++++++++++--
 arch/powerpc/kernel/eeh_event.c  |  8 +++++++
 arch/powerpc/kernel/eeh_pe.c     | 23 +++++++++++++++++++-
 3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index a31cd32c4ce9..75266156943f 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -734,6 +734,33 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
  */
 #define MAX_WAIT_FOR_RECOVERY 300
 
+
+/* Walks the PE tree after processing an event to remove any stale PEs.
+ *
+ * NB: This needs to be recursive to ensure the leaf PEs get removed
+ * before their parents do. Although this is possible to do recursively
+ * we don't since this is easier to read and we need to garantee
+ * the leaf nodes will be handled first.
+ */
+static void eeh_pe_cleanup(struct eeh_pe *pe)
+{
+	struct eeh_pe *child_pe, *tmp;
+
+	list_for_each_entry_safe(child_pe, tmp, &pe->child_list, child)
+		eeh_pe_cleanup(child_pe);
+
+	if (pe->state & EEH_PE_KEEP)
+		return;
+
+	if (!(pe->state & EEH_PE_INVALID))
+		return;
+
+	if (list_empty(&pe->edevs) && list_empty(&pe->child_list)) {
+		list_del(&pe->child);
+		kfree(pe);
+	}
+}
+
 /**
  * eeh_handle_normal_event - Handle EEH events on a specific PE
  * @pe: EEH PE - which should not be used after we return, as it may
@@ -772,8 +799,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
-	eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
-
 	eeh_pe_update_time_stamp(pe);
 	pe->freeze_count++;
 	if (pe->freeze_count > eeh_max_freezes) {
@@ -963,6 +988,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			return;
 		}
 	}
+
+	/*
+	 * Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
+	 * we don't want to modify the PE tree structure so we do it here.
+	 */
+	eeh_pe_cleanup(pe);
 	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 }
 
@@ -1035,6 +1066,7 @@ void eeh_handle_special_event(void)
 		 */
 		if (rc == EEH_NEXT_ERR_FROZEN_PE ||
 		    rc == EEH_NEXT_ERR_FENCED_PHB) {
+			eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
 			eeh_handle_normal_event(pe);
 		} else {
 			pci_lock_rescan_remove();
diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
index 64cfbe41174b..e36653e5f76b 100644
--- a/arch/powerpc/kernel/eeh_event.c
+++ b/arch/powerpc/kernel/eeh_event.c
@@ -121,6 +121,14 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
 	}
 	event->pe = pe;
 
+	/*
+	 * Mark the PE as recovering before inserting it in the queue.
+	 * This prevents the PE from being free()ed by a hotplug driver
+	 * while the PE is sitting in the event queue.
+	 */
+	if (pe)
+		eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
+
 	/* We may or may not be called in an interrupt context */
 	spin_lock_irqsave(&eeh_eventlist_lock, flags);
 	list_add(&event->list, &eeh_eventlist);
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 1a6254bcf056..177852e39a25 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -470,6 +470,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 {
 	struct eeh_pe *pe, *parent, *child;
+	bool keep, recover;
 	int cnt;
 
 	pe = eeh_dev_to_pe(edev);
@@ -490,10 +491,21 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 	 */
 	while (1) {
 		parent = pe->parent;
+
+		/* PHB PEs should never be removed */
 		if (pe->type & EEH_PE_PHB)
 			break;
 
-		if (!(pe->state & EEH_PE_KEEP)) {
+		/*
+		 * XXX: KEEP is set while resetting a PE. I don't think it's
+		 * ever set without RECOVERING also being set. I could
+		 * be wrong though so catch that with a WARN.
+		 */
+		keep = !!(pe->state & EEH_PE_KEEP);
+		recover = !!(pe->state & EEH_PE_RECOVERING);
+		WARN_ON(keep && !recover);
+
+		if (!keep && !recover) {
 			if (list_empty(&pe->edevs) &&
 			    list_empty(&pe->child_list)) {
 				list_del(&pe->child);
@@ -502,6 +514,15 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 				break;
 			}
 		} else {
+			/*
+			 * Mark the PE as invalid. At the end of the recovery
+			 * process any invalid PEs will be garbage collected.
+			 *
+			 * We need to delay the free()ing of them since we can
+			 * remove edev's while traversing the PE tree which
+			 * might trigger the removal of a PE and we can't
+			 * deal with that (yet).
+			 */
 			if (list_empty(&pe->edevs)) {
 				cnt = 0;
 				list_for_each_entry(child, &pe->child_list, child) {
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 02/14] powerpc/eeh: Fix race when freeing PDNs
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
  2019-09-03 10:15 ` [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-17  0:50   ` Sam Bobroff
  2019-09-03 10:15 ` [PATCH 03/14] powerpc/eeh: Make permanently failed devices non-actionable Oliver O'Halloran
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

When hot-adding devices we rely on the hotplug driver to create pci_dn's
for the devices under the hotplug slot. Converse, when hot-removing the
driver will remove the pci_dn's that it created. This is a problem because
the pci_dev is still live until it's refcount drops to zero. This can
happen if the driver is slow to tear down it's internal state. Ideally, the
driver would not attempt to perform any config accesses to the device once
it's been marked as removed, but sometimes it happens. As a result, we
might attempt to access the pci_dn for a device that has been torn down and
the kernel may crash as a result.

To fix this, don't free the pci_dn unless the corresponding pci_dev has
been released.  If the pci_dev is still live, then we mark the pci_dn with
a flag that indicates the pci_dev's release function should free it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/include/asm/pci-bridge.h |  1 +
 arch/powerpc/kernel/pci-hotplug.c     |  7 +++++++
 arch/powerpc/kernel/pci_dn.c          | 21 +++++++++++++++++++--
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 54a9de01c2a3..69f4cb3b7c56 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -183,6 +183,7 @@ struct iommu_table;
 struct pci_dn {
 	int     flags;
 #define PCI_DN_FLAG_IOV_VF	0x01
+#define PCI_DN_FLAG_DEAD	0x02    /* Device has been hot-removed */
 
 	int	busno;			/* pci bus number */
 	int	devfn;			/* pci device and function number */
diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 0b0cf8168b47..fc62c4bc47b1 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -55,11 +55,18 @@ EXPORT_SYMBOL_GPL(pci_find_bus_by_node);
 void pcibios_release_device(struct pci_dev *dev)
 {
 	struct pci_controller *phb = pci_bus_to_host(dev->bus);
+	struct pci_dn *pdn = pci_get_pdn(dev);
 
 	eeh_remove_device(dev);
 
 	if (phb->controller_ops.release_device)
 		phb->controller_ops.release_device(dev);
+
+	/* free()ing the pci_dn has been deferred to us, do it now */
+	if (pdn && (pdn->flags & PCI_DN_FLAG_DEAD)) {
+		pci_dbg(dev, "freeing dead pdn\n");
+		kfree(pdn);
+	}
 }
 
 /**
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 5614ca0940ca..4e654df55969 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -320,6 +320,7 @@ void pci_remove_device_node_info(struct device_node *dn)
 {
 	struct pci_dn *pdn = dn ? PCI_DN(dn) : NULL;
 	struct device_node *parent;
+	struct pci_dev *pdev;
 #ifdef CONFIG_EEH
 	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
 
@@ -333,12 +334,28 @@ void pci_remove_device_node_info(struct device_node *dn)
 	WARN_ON(!list_empty(&pdn->child_list));
 	list_del(&pdn->list);
 
+	/* Drop the parent pci_dn's ref to our backing dt node */
 	parent = of_get_parent(dn);
 	if (parent)
 		of_node_put(parent);
 
-	dn->data = NULL;
-	kfree(pdn);
+	/*
+	 * At this point we *might* still have a pci_dev that was
+	 * instantiated from this pci_dn. So defer free()ing it until
+	 * the pci_dev's release function is called.
+	 */
+	pdev = pci_get_domain_bus_and_slot(pdn->phb->global_number,
+			pdn->busno, pdn->devfn);
+	if (pdev) {
+		/* NB: pdev has a ref to dn */
+		pci_dbg(pdev, "marked pdn (from %pOF) as dead\n", dn);
+		pdn->flags |= PCI_DN_FLAG_DEAD;
+	} else {
+		dn->data = NULL;
+		kfree(pdn);
+	}
+
+	pci_dev_put(pdev);
 }
 EXPORT_SYMBOL_GPL(pci_remove_device_node_info);
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 03/14] powerpc/eeh: Make permanently failed devices non-actionable
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
  2019-09-03 10:15 ` [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes Oliver O'Halloran
  2019-09-03 10:15 ` [PATCH 02/14] powerpc/eeh: Fix race when freeing PDNs Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-17  0:51   ` Sam Bobroff
  2019-09-03 10:15 ` [PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event() Oliver O'Halloran
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

If a device is torn down by a hotplug slot driver it's marked as removed
and marked as permaantly failed. There's no point in trying to recover a
permernantly failed device so it should be considered un-actionable.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/kernel/eeh_driver.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 75266156943f..18a69fac4d80 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -96,8 +96,16 @@ static bool eeh_dev_removed(struct eeh_dev *edev)
 
 static bool eeh_edev_actionable(struct eeh_dev *edev)
 {
-	return (edev->pdev && !eeh_dev_removed(edev) &&
-		!eeh_pe_passed(edev->pe));
+	if (!edev->pdev)
+		return false;
+	if (edev->pdev->error_state == pci_channel_io_perm_failure)
+		return false;
+	if (eeh_dev_removed(edev))
+		return false;
+	if (eeh_pe_passed(edev->pe))
+		return false;
+
+	return true;
 }
 
 /**
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event()
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (2 preceding siblings ...)
  2019-09-03 10:15 ` [PATCH 03/14] powerpc/eeh: Make permanently failed devices non-actionable Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-17  1:00   ` Sam Bobroff
  2019-09-03 10:15 ` [PATCH 05/14] powerpc/eeh: Defer printing stack trace Oliver O'Halloran
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

When a device is surprise removed while undergoing IO we will probably
get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze
is detected we send a recovery event to the EEH worker thread which will
notify drivers, and perform recovery as needed.

In the event of a hot-remove we don't want recovery to occur since there
isn't a device to recover. The recovery process is fairly long due to
the number of wait states (required by PCIe) which causes problems when
devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).

To determine if we need to skip the recovery process we can use the
get_adapter_state() operation of the hotplug_slot to determine if the
slot contains a device or not, and if the slot is empty we can skip
recovery entirely.

One thing to note is that the slot being EEH frozen does not prevent the
hotplug driver from working. We don't have the EEH recovery thread
remove any of the devices since it's assumed that the hotplug driver
will handle tearing down the slot state.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/kernel/eeh_driver.c | 60 ++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 18a69fac4d80..52ce7584af43 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -27,6 +27,7 @@
 #include <linux/irq.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <linux/pci_hotplug.h>
 #include <asm/eeh.h>
 #include <asm/eeh_event.h>
 #include <asm/ppc-pci.h>
@@ -769,6 +770,46 @@ static void eeh_pe_cleanup(struct eeh_pe *pe)
 	}
 }
 
+/**
+ * eeh_check_slot_presence - Check if a device is still present in a slot
+ * @pdev: pci_dev to check
+ *
+ * This function may return a false positive if we can't determine the slot's
+ * presence state. This might happen for for PCIe slots if the PE containing
+ * the upstream bridge is also frozen, or the bridge is part of the same PE
+ * as the device.
+ *
+ * This shouldn't happen often, but you might see it if you hotplug a PCIe
+ * switch.
+ */
+static bool eeh_slot_presence_check(struct pci_dev *pdev)
+{
+	const struct hotplug_slot_ops *ops;
+	struct pci_slot *slot;
+	u8 state;
+	int rc;
+
+	if (!pdev)
+		return false;
+
+	if (pdev->error_state == pci_channel_io_perm_failure)
+		return false;
+
+	slot = pdev->slot;
+	if (!slot || !slot->hotplug)
+		return true;
+
+	ops = slot->hotplug->ops;
+	if (!ops || !ops->get_adapter_status)
+		return true;
+
+	rc = ops->get_adapter_status(slot->hotplug, &state);
+	if (rc)
+		return true;
+
+	return !!state;
+}
+
 /**
  * eeh_handle_normal_event - Handle EEH events on a specific PE
  * @pe: EEH PE - which should not be used after we return, as it may
@@ -799,6 +840,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	enum pci_ers_result result = PCI_ERS_RESULT_NONE;
 	struct eeh_rmv_data rmv_data =
 		{LIST_HEAD_INIT(rmv_data.removed_vf_list), 0};
+	int devices = 0;
 
 	bus = eeh_pe_bus_get(pe);
 	if (!bus) {
@@ -807,6 +849,23 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
+	/*
+	 * When devices are hot-removed we might get an EEH due to
+	 * a driver attempting to touch the MMIO space of a removed
+	 * device. In this case we don't have a device to recover
+	 * so suppress the event if we can't find any present devices.
+	 *
+	 * The hotplug driver should take care of tearing down the
+	 * device itself.
+	 */
+	eeh_for_each_pe(pe, tmp_pe)
+		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
+			if (eeh_slot_presence_check(edev->pdev))
+				devices++;
+
+	if (!devices)
+		goto out; /* nothing to recover */
+
 	eeh_pe_update_time_stamp(pe);
 	pe->freeze_count++;
 	if (pe->freeze_count > eeh_max_freezes) {
@@ -997,6 +1056,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		}
 	}
 
+out:
 	/*
 	 * Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
 	 * we don't want to modify the PE tree structure so we do it here.
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 05/14] powerpc/eeh: Defer printing stack trace
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (3 preceding siblings ...)
  2019-09-03 10:15 ` [PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event() Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-17  1:04   ` Sam Bobroff
  2019-09-03 10:15 ` [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment Oliver O'Halloran
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

Currently we print a stack trace in the event handler to help with
debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
so we want to prevent printing the stack trace unless we know it's due to
an actual device error. To accomplish this, we can save a stack trace at
the point of detection and only print it once the EEH recovery handler has
determined the freeze was due to an actual error.

Since the whole point of this is to prevent spurious EEH output we also
move a few prints out of the detection thread, or mark them as pr_debug
so anyone interested can get output from the eeh_check_dev_failure()
if they want.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/include/asm/eeh.h   | 11 +++++++++
 arch/powerpc/kernel/eeh.c        | 15 ++++---------
 arch/powerpc/kernel/eeh_driver.c | 38 +++++++++++++++++++++++++++++++-
 arch/powerpc/kernel/eeh_event.c  | 26 ++++++++++------------
 4 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index c13119a5e69b..9d0e1694a94d 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -88,6 +88,17 @@ struct eeh_pe {
 	struct list_head child_list;	/* List of PEs below this PE	*/
 	struct list_head child;		/* Memb. child_list/eeh_phb_pe	*/
 	struct list_head edevs;		/* List of eeh_dev in this PE	*/
+
+	/*
+	 * Saved stack trace. When we find a PE freeze in eeh_dev_check_failure
+	 * the stack trace is saved here so we can print it in the recovery
+	 * thread if it turns out to due to a real problem rather than
+	 * a hot-remove.
+	 *
+	 * A max of 64 entries might be overkill, but it also might not be.
+	 */
+	unsigned long stack_trace[64];
+	int trace_entries;
 };
 
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c468e79d13c..46d17817b438 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -420,11 +420,9 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
 	eeh_pe_mark_isolated(phb_pe);
 	eeh_serialize_unlock(flags);
 
-	pr_err("EEH: PHB#%x failure detected, location: %s\n",
+	pr_debug("EEH: PHB#%x failure detected, location: %s\n",
 		phb_pe->phb->global_number, eeh_pe_loc_get(phb_pe));
-	dump_stack();
 	eeh_send_failure_event(phb_pe);
-
 	return 1;
 out:
 	eeh_serialize_unlock(flags);
@@ -451,7 +449,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	unsigned long flags;
 	struct device_node *dn;
 	struct pci_dev *dev;
-	struct eeh_pe *pe, *parent_pe, *phb_pe;
+	struct eeh_pe *pe, *parent_pe;
 	int rc = 0;
 	const char *location = NULL;
 
@@ -581,13 +579,8 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	 * a stack trace will help the device-driver authors figure
 	 * out what happened.  So print that out.
 	 */
-	phb_pe = eeh_phb_pe_get(pe->phb);
-	pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
-	       pe->phb->global_number, pe->addr);
-	pr_err("EEH: PE location: %s, PHB location: %s\n",
-	       eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
-	dump_stack();
-
+	pr_debug("EEH: %s: Frozen PHB#%x-PE#%x detected\n",
+		__func__, pe->phb->global_number, pe->addr);
 	eeh_send_failure_event(pe);
 
 	return 1;
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 52ce7584af43..0d34cc12c529 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -863,8 +863,44 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			if (eeh_slot_presence_check(edev->pdev))
 				devices++;
 
-	if (!devices)
+	if (!devices) {
+		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
+			pe->phb->global_number, pe->addr);
 		goto out; /* nothing to recover */
+	}
+
+	/* Log the event */
+	if (pe->type & EEH_PE_PHB) {
+		pr_err("EEH: PHB#%x failure detected, location: %s\n",
+			pe->phb->global_number, eeh_pe_loc_get(pe));
+	} else {
+		struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
+
+		pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
+		       pe->phb->global_number, pe->addr);
+		pr_err("EEH: PE location: %s, PHB location: %s\n",
+		       eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
+	}
+
+	/*
+	 * Print the saved stack trace now that we've verified there's
+	 * something to recover.
+	 */
+	if (pe->trace_entries) {
+		void **ptrs = (void **) pe->stack_trace;
+		int i;
+
+		pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
+		       pe->phb->global_number, pe->addr);
+
+		/* FIXME: Use the same format as dump_stack() */
+		pr_err("EEH: Call Trace:\n");
+		for (i = 0; i < pe->trace_entries; i++)
+			pr_err("EEH: [%pK] %pS\n", ptrs[i], ptrs[i]);
+
+		pe->trace_entries = 0;
+	}
+
 
 	eeh_pe_update_time_stamp(pe);
 	pe->freeze_count++;
diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
index e36653e5f76b..1d55486adb0f 100644
--- a/arch/powerpc/kernel/eeh_event.c
+++ b/arch/powerpc/kernel/eeh_event.c
@@ -40,7 +40,6 @@ static int eeh_event_handler(void * dummy)
 {
 	unsigned long flags;
 	struct eeh_event *event;
-	struct eeh_pe *pe;
 
 	while (!kthread_should_stop()) {
 		if (wait_for_completion_interruptible(&eeh_eventlist_event))
@@ -59,19 +58,10 @@ static int eeh_event_handler(void * dummy)
 			continue;
 
 		/* We might have event without binding PE */
-		pe = event->pe;
-		if (pe) {
-			if (pe->type & EEH_PE_PHB)
-				pr_info("EEH: Detected error on PHB#%x\n",
-					 pe->phb->global_number);
-			else
-				pr_info("EEH: Detected PCI bus error on "
-					"PHB#%x-PE#%x\n",
-					pe->phb->global_number, pe->addr);
-			eeh_handle_normal_event(pe);
-		} else {
+		if (event->pe)
+			eeh_handle_normal_event(event->pe);
+		else
 			eeh_handle_special_event();
-		}
 
 		kfree(event);
 	}
@@ -126,8 +116,16 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
 	 * This prevents the PE from being free()ed by a hotplug driver
 	 * while the PE is sitting in the event queue.
 	 */
-	if (pe)
+	if (pe) {
+		/*
+		 * Save the current stack trace so we can dump it from the
+		 * event handler thread.
+		 */
+		pe->trace_entries = stack_trace_save(pe->stack_trace,
+					 ARRAY_SIZE(pe->stack_trace), 0);
+
 		eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
+	}
 
 	/* We may or may not be called in an interrupt context */
 	spin_lock_irqsave(&eeh_eventlist_lock, flags);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (4 preceding siblings ...)
  2019-09-03 10:15 ` [PATCH 05/14] powerpc/eeh: Defer printing stack trace Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-03 14:09   ` Andrew Donnellan
  2019-09-17  1:04   ` Sam Bobroff
  2019-09-03 10:15 ` [PATCH 07/14] powernv/eeh: Use generic code to handle hot resets Oliver O'Halloran
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran, Andrew Donnellan

Support for switching CAPI cards into and out of CAPI mode was removed a
while ago. Drop the comment since it's no longer relevant.

Cc: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index e7b867912f24..94e26d56ecd2 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1125,13 +1125,6 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
 		return -EIO;
 	}
 
-	/*
-	 * If dealing with the root bus (or the bus underneath the
-	 * root port), we reset the bus underneath the root port.
-	 *
-	 * The cxl driver depends on this behaviour for bi-modal card
-	 * switching.
-	 */
 	if (pci_is_root_bus(bus) ||
 	    pci_is_root_bus(bus->parent))
 		return pnv_eeh_root_reset(hose, option);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 07/14] powernv/eeh: Use generic code to handle hot resets
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (5 preceding siblings ...)
  2019-09-03 10:15 ` [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-17  1:15   ` Sam Bobroff
  2019-09-03 10:15 ` [PATCH 08/14] pci-hotplug/pnv_php: Add a reset_slot() callback Oliver O'Halloran
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

When we reset PCI devices managed by a hotplug driver the reset may
generate spurious hotplug events that cause the PCI device we're resetting
to be torn down accidently. This is a problem for EEH (when the driver is
EEH aware) since we want to leave the OS PCI device state intact so that
the device can be re-set without losing any resources (network, disks,
etc) provided by the driver.

Generic PCI code provides the pci_bus_error_reset() function to handle
resetting a PCI Device (or bus) by using the reset method provided by the
hotplug slot driver. We can use this function if the EEH core has
requested a hot reset (common case) without tripping over the hotplug
driver.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
I know that include is a bit gross, but:

a) We're already doing it in pci-ioda.c, and in pseries/pci.
b) It's pci_bus_error_reset() isn't really a function that
   should be provided to non-pci core code.
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 38 ++++++++++++++++++--
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 94e26d56ecd2..6bc24a47e9ef 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -34,6 +34,7 @@
 
 #include "powernv.h"
 #include "pci.h"
+#include "../../../../drivers/pci/pci.h"
 
 static int eeh_event_irq = -EINVAL;
 
@@ -849,7 +850,7 @@ static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
 	int aer = edev ? edev->aer_cap : 0;
 	u32 ctrl;
 
-	pr_debug("%s: Reset PCI bus %04x:%02x with option %d\n",
+	pr_debug("%s: Secondary Reset PCI bus %04x:%02x with option %d\n",
 		 __func__, pci_domain_nr(dev->bus),
 		 dev->bus->number, option);
 
@@ -907,6 +908,10 @@ static int pnv_eeh_bridge_reset(struct pci_dev *pdev, int option)
 	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
 		return __pnv_eeh_bridge_reset(pdev, option);
 
+	pr_debug("%s: FW reset PCI bus %04x:%02x with option %d\n",
+		 __func__, pci_domain_nr(pdev->bus),
+		 pdev->bus->number, option);
+
 	switch (option) {
 	case EEH_RESET_FUNDAMENTAL:
 		scope = OPAL_RESET_PCI_FUNDAMENTAL;
@@ -1125,10 +1130,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
 		return -EIO;
 	}
 
-	if (pci_is_root_bus(bus) ||
-	    pci_is_root_bus(bus->parent))
+	if (pci_is_root_bus(bus))
 		return pnv_eeh_root_reset(hose, option);
 
+	/*
+	 * For hot resets try use the generic PCI error recovery reset
+	 * functions. These correctly handles the case where the secondary
+	 * bus is behind a hotplug slot and it will use the slot provided
+	 * reset methods to prevent spurious hotplug events during the reset.
+	 *
+	 * Fundemental resets need to be handled internally to EEH since the
+	 * PCI core doesn't really have a concept of a fundemental reset,
+	 * mainly because there's no standard way to generate one. Only a
+	 * few devices require an FRESET so it should be fine.
+	 */
+	if (option != EEH_RESET_FUNDAMENTAL) {
+		/*
+		 * NB: Skiboot and pnv_eeh_bridge_reset() also no-op the
+		 *     de-assert step. It's like the OPAL reset API was
+		 *     poorly designed or something...
+		 */
+		if (option == EEH_RESET_DEACTIVATE)
+			return 0;
+
+		rc = pci_bus_error_reset(bus->self);
+		if (!rc)
+			return 0;
+	}
+
+	/* otherwise, use the generic bridge reset. this might call into FW */
+	if (pci_is_root_bus(bus->parent))
+		return pnv_eeh_root_reset(hose, option);
 	return pnv_eeh_bridge_reset(bus->self, option);
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 08/14] pci-hotplug/pnv_php: Add a reset_slot() callback
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (6 preceding siblings ...)
  2019-09-03 10:15 ` [PATCH 07/14] powernv/eeh: Use generic code to handle hot resets Oliver O'Halloran
@ 2019-09-03 10:15 ` Oliver O'Halloran
  2019-09-03 10:16 ` [PATCH 09/14] pci-hotplug/pnv_php: Add support for IODA3 Power9 PHBs Oliver O'Halloran
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

When performing EEH recovery of devices in a hotplug slot we need to use
the slot driver's ->reset_slot() callback to prevent spurious hotplug
events due to spurious DLActive and PresDet change interrupts. Add a
reset_slot() callback to pnv_php so we can handle recovery of devices
in pnv_php managed slots.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 drivers/pci/hotplug/pnv_php.c | 39 +++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 6758fd7c382e..b0e243dabf77 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -511,6 +511,37 @@ static int pnv_php_enable(struct pnv_php_slot *php_slot, bool rescan)
 	return 0;
 }
 
+static int pnv_php_reset_slot(struct hotplug_slot *slot, int probe)
+{
+	struct pnv_php_slot *php_slot = to_pnv_php_slot(slot);
+	struct pci_dev *bridge = php_slot->pdev;
+	uint16_t sts;
+
+	/*
+	 * The CAPI folks want pnv_php to drive OpenCAPI slots
+	 * which don't have a bridge. Only claim to support
+	 * reset_slot() if we have a bridge device (for now...)
+	 */
+	if (probe)
+		return !bridge;
+
+	/* mask our interrupt while resetting the bridge */
+	if (php_slot->irq > 0)
+		disable_irq(php_slot->irq);
+
+	pci_bridge_secondary_bus_reset(bridge);
+
+	/* clear any state changes that happened due to the reset */
+	pcie_capability_read_word(php_slot->pdev, PCI_EXP_SLTSTA, &sts);
+	sts &= (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC);
+	pcie_capability_write_word(php_slot->pdev, PCI_EXP_SLTSTA, sts);
+
+	if (php_slot->irq > 0)
+		enable_irq(php_slot->irq);
+
+	return 0;
+}
+
 static int pnv_php_enable_slot(struct hotplug_slot *slot)
 {
 	struct pnv_php_slot *php_slot = to_pnv_php_slot(slot);
@@ -548,6 +579,7 @@ static const struct hotplug_slot_ops php_slot_ops = {
 	.set_attention_status	= pnv_php_set_attention_state,
 	.enable_slot		= pnv_php_enable_slot,
 	.disable_slot		= pnv_php_disable_slot,
+	.reset_slot		= pnv_php_reset_slot,
 };
 
 static void pnv_php_release(struct pnv_php_slot *php_slot)
@@ -721,6 +753,12 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
 	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &sts);
 	sts &= (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC);
 	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, sts);
+
+	pci_dbg(pdev, "PCI slot [%s]: HP int! DLAct: %d, PresDet: %d\n",
+			php_slot->name,
+			!!(sts & PCI_EXP_SLTSTA_DLLSC),
+			!!(sts & PCI_EXP_SLTSTA_PDC));
+
 	if (sts & PCI_EXP_SLTSTA_DLLSC) {
 		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lsts);
 		added = !!(lsts & PCI_EXP_LNKSTA_DLLLA);
@@ -735,6 +773,7 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
 
 		added = !!(presence == OPAL_PCI_SLOT_PRESENT);
 	} else {
+		pci_dbg(pdev, "PCI slot [%s]: Spurious IRQ?\n", php_slot->name);
 		return IRQ_NONE;
 	}
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 09/14] pci-hotplug/pnv_php: Add support for IODA3 Power9 PHBs
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (7 preceding siblings ...)
  2019-09-03 10:15 ` [PATCH 08/14] pci-hotplug/pnv_php: Add a reset_slot() callback Oliver O'Halloran
@ 2019-09-03 10:16 ` Oliver O'Halloran
  2019-09-03 10:16 ` [PATCH 10/14] pci-hotplug/pnv_php: Add attention indicator support Oliver O'Halloran
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

Currently we check that an IODA2 compatible PHB is upstream of this slot.
This is mainly to avoid pnv_php creating slots for the various "virtual
PHBs" that we create for NVLink. There's no real need for this restriction
so allow it on IODA3.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/platforms/powernv/pci.c | 3 ++-
 drivers/pci/hotplug/pnv_php.c        | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 6104418c9ad5..2825d004dece 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -54,7 +54,8 @@ int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id)
 			break;
 		}
 
-		if (!of_device_is_compatible(parent, "ibm,ioda2-phb")) {
+		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
+		    !of_device_is_compatible(parent, "ibm,ioda3-phb")) {
 			of_node_put(parent);
 			continue;
 		}
diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index b0e243dabf77..6fdf8b74cb0a 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -994,6 +994,9 @@ static int __init pnv_php_init(void)
 	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
 		pnv_php_register(dn);
 
+	for_each_compatible_node(dn, NULL, "ibm,ioda3-phb")
+		pnv_php_register(dn);
+
 	return 0;
 }
 
@@ -1003,6 +1006,9 @@ static void __exit pnv_php_exit(void)
 
 	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
 		pnv_php_unregister(dn);
+
+	for_each_compatible_node(dn, NULL, "ibm,ioda3-phb")
+		pnv_php_unregister(dn);
 }
 
 module_init(pnv_php_init);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 10/14] pci-hotplug/pnv_php: Add attention indicator support
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (8 preceding siblings ...)
  2019-09-03 10:16 ` [PATCH 09/14] pci-hotplug/pnv_php: Add support for IODA3 Power9 PHBs Oliver O'Halloran
@ 2019-09-03 10:16 ` Oliver O'Halloran
  2019-09-03 10:16 ` [PATCH 11/14] powerpc/eeh: Set attention indicator while recovering Oliver O'Halloran
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

pnv_php is generally used with PCIe bridges which provide a native
interface for setting the attention and power indicator LEDs. Wire up
those interfaces even if firmware does not have support for them (yet...)

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 drivers/pci/hotplug/pnv_php.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 6fdf8b74cb0a..d7b2b47bc33e 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -419,9 +419,21 @@ static int pnv_php_get_attention_state(struct hotplug_slot *slot, u8 *state)
 static int pnv_php_set_attention_state(struct hotplug_slot *slot, u8 state)
 {
 	struct pnv_php_slot *php_slot = to_pnv_php_slot(slot);
+	struct pci_dev *bridge = php_slot->pdev;
+	u16 new, mask;
 
-	/* FIXME: Make it real once firmware supports it */
 	php_slot->attention_state = state;
+	if (!bridge)
+		return 0;
+
+	mask = PCI_EXP_SLTCTL_AIC;
+
+	if (state)
+		new = PCI_EXP_SLTCTL_ATTN_IND_ON;
+	else
+		new = PCI_EXP_SLTCTL_ATTN_IND_OFF;
+
+	pcie_capability_clear_and_set_word(bridge, PCI_EXP_SLTCTL, mask, new);
 
 	return 0;
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 11/14] powerpc/eeh: Set attention indicator while recovering
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (9 preceding siblings ...)
  2019-09-03 10:16 ` [PATCH 10/14] pci-hotplug/pnv_php: Add attention indicator support Oliver O'Halloran
@ 2019-09-03 10:16 ` Oliver O'Halloran
  2019-09-17  1:23   ` Sam Bobroff
  2019-09-03 10:16 ` [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check Oliver O'Halloran
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

I am the RAS team. Hear me roar.

Roar.

On a more serious note, being able to locate failed devices can be helpful.
Set the attention indicator if the slot supports it once we've determined
the device is present and only clear it if the device is fully recovered.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/kernel/eeh_driver.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 0d34cc12c529..80bd157fcb45 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -803,6 +803,10 @@ static bool eeh_slot_presence_check(struct pci_dev *pdev)
 	if (!ops || !ops->get_adapter_status)
 		return true;
 
+	/* set the attention indicator while we've got the slot ops */
+	if (ops->set_attention_status)
+		ops->set_attention_status(slot->hotplug, 1);
+
 	rc = ops->get_adapter_status(slot->hotplug, &state);
 	if (rc)
 		return true;
@@ -810,6 +814,28 @@ static bool eeh_slot_presence_check(struct pci_dev *pdev)
 	return !!state;
 }
 
+static void eeh_clear_slot_attention(struct pci_dev *pdev)
+{
+	const struct hotplug_slot_ops *ops;
+	struct pci_slot *slot;
+
+	if (!pdev)
+		return;
+
+	if (pdev->error_state == pci_channel_io_perm_failure)
+		return;
+
+	slot = pdev->slot;
+	if (!slot || !slot->hotplug)
+		return;
+
+	ops = slot->hotplug->ops;
+	if (!ops || !ops->set_attention_status)
+		return;
+
+	ops->set_attention_status(slot->hotplug, 0);
+}
+
 /**
  * eeh_handle_normal_event - Handle EEH events on a specific PE
  * @pe: EEH PE - which should not be used after we return, as it may
@@ -1098,6 +1124,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	 * we don't want to modify the PE tree structure so we do it here.
 	 */
 	eeh_pe_cleanup(pe);
+
+	/* clear the slot attention LED for all recovered devices */
+	eeh_for_each_pe(pe, tmp_pe)
+		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
+			eeh_clear_slot_attention(edev->pdev);
+
 	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (10 preceding siblings ...)
  2019-09-03 10:16 ` [PATCH 11/14] powerpc/eeh: Set attention indicator while recovering Oliver O'Halloran
@ 2019-09-03 10:16 ` Oliver O'Halloran
  2019-09-17  3:15   ` Sam Bobroff
  2019-09-03 10:16 ` [PATCH 13/14] powerpc/eeh: Add a eeh_dev_break debugfs interface Oliver O'Halloran
  2019-09-03 10:16 ` [PATCH 14/14] selftests/powerpc: Add basic EEH selftest Oliver O'Halloran
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

Detecting an frozen EEH PE usually occurs when an MMIO load returns a 0xFFs
response. When performing EEH testing using the EEH error injection feature
available on some platforms there is no simple way to kick-off the kernel's
recovery process since any accesses from userspace (usually /dev/mem) will
bypass the MMIO helpers in the kernel which check if a 0xFF response is due
to an EEH freeze or not.

If a device contains a 0xFF byte in it's config space it's possible to
trigger the recovery process via config space read from userspace, but this
is not a reliable method. If a driver is bound to the device an in use it
will frequently trigger the MMIO check, but this is also inconsistent.

To solve these problems this patch adds a debugfs file called
"eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs
eeh_dev_check_failure() on it. This is the same check that's done when the
kernel gets a 0xFF result from an config or MMIO read with the added
benifit that it can be reliably triggered from userspace.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/kernel/eeh.c | 61 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 46d17817b438..ace1c5a6b8ed 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1873,6 +1873,64 @@ static const struct file_operations eeh_force_recover_fops = {
 	.llseek	= no_llseek,
 	.write	= eeh_force_recover_write,
 };
+
+static ssize_t eeh_debugfs_dev_usage(struct file *filp,
+				char __user *user_buf,
+				size_t count, loff_t *ppos)
+{
+	static const char usage[] = "input format: <domain>:<bus>:<dev>.<fn>\n";
+
+	return simple_read_from_buffer(user_buf, count, ppos,
+				       usage, sizeof(usage) - 1);
+}
+
+static ssize_t eeh_dev_check_write(struct file *filp,
+				const char __user *user_buf,
+				size_t count, loff_t *ppos)
+{
+	uint32_t domain, bus, dev, fn;
+	struct pci_dev *pdev;
+	struct eeh_dev *edev;
+	char buf[20];
+	int ret;
+
+	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
+	if (!ret)
+		return -EFAULT;
+
+	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
+	if (ret != 4) {
+		pr_err("%s: expected 4 args, got %d\n", __func__, ret);
+		return -EINVAL;
+	}
+
+	pdev = pci_get_domain_bus_and_slot(domain, bus, (dev << 3) | fn);
+	if (!pdev)
+		return -ENODEV;
+
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (!edev) {
+		pci_err(pdev, "No eeh_dev for this device!\n");
+		pci_dev_put(pdev);
+		return -ENODEV;
+	}
+
+	ret = eeh_dev_check_failure(edev);
+	pci_info(pdev, "eeh_dev_check_failure(%04x:%02x:%02x.%01x) = %d\n",
+			domain, bus, dev, fn, ret);
+
+	pci_dev_put(pdev);
+
+	return count;
+}
+
+static const struct file_operations eeh_dev_check_fops = {
+	.open	= simple_open,
+	.llseek	= no_llseek,
+	.write	= eeh_dev_check_write,
+	.read   = eeh_debugfs_dev_usage,
+};
+
 #endif
 
 static int __init eeh_init_proc(void)
@@ -1888,6 +1946,9 @@ static int __init eeh_init_proc(void)
 		debugfs_create_bool("eeh_disable_recovery", 0600,
 				powerpc_debugfs_root,
 				&eeh_debugfs_no_recover);
+		debugfs_create_file_unsafe("eeh_dev_check", 0600,
+				powerpc_debugfs_root, NULL,
+				&eeh_dev_check_fops);
 		debugfs_create_file_unsafe("eeh_force_recover", 0600,
 				powerpc_debugfs_root, NULL,
 				&eeh_force_recover_fops);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 13/14] powerpc/eeh: Add a eeh_dev_break debugfs interface
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (11 preceding siblings ...)
  2019-09-03 10:16 ` [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check Oliver O'Halloran
@ 2019-09-03 10:16 ` Oliver O'Halloran
  2019-09-17  3:19   ` Sam Bobroff
  2019-09-03 10:16 ` [PATCH 14/14] selftests/powerpc: Add basic EEH selftest Oliver O'Halloran
  13 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

Add an interface to debugfs for generating an EEH event on a given device.
This works by disabling memory accesses to and from the device by setting
the PCI_COMMAND register (or the VF Memory Space Enable on the parent PF).

This is a somewhat portable alternative to using the platform specific
error injection mechanisms since those tend to be either hard to use, or
straight up broken. For pseries the interfaces also requires the use of
/dev/mem which is probably going to go away in a post-LOCKDOWN world
(and it's a horrific hack to begin with) so moving to a kernel-provided
interface makes sense and provides a sane, cross-platform interface for
userspace so we can write more generic testing scripts.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/kernel/eeh.c | 139 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index ace1c5a6b8ed..a55d2f01da1d 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1894,7 +1894,8 @@ static ssize_t eeh_dev_check_write(struct file *filp,
 	char buf[20];
 	int ret;
 
-	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
+	memset(buf, 0, sizeof(buf));
+	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
 	if (!ret)
 		return -EFAULT;
 
@@ -1931,6 +1932,139 @@ static const struct file_operations eeh_dev_check_fops = {
 	.read   = eeh_debugfs_dev_usage,
 };
 
+static int eeh_debugfs_break_device(struct pci_dev *pdev)
+{
+	struct resource *bar = NULL;
+	void __iomem *mapped;
+	u16 old, bit;
+	int i, pos;
+
+	/* Do we have an MMIO BAR to disable? */
+	for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
+		struct resource *r = &pdev->resource[i];
+
+		if (!r->flags || !r->start)
+			continue;
+		if (r->flags & IORESOURCE_IO)
+			continue;
+		if (r->flags & IORESOURCE_UNSET)
+			continue;
+
+		bar = r;
+		break;
+	}
+
+	if (!bar) {
+		pci_err(pdev, "Unable to find Memory BAR to cause EEH with\n");
+		return -ENXIO;
+	}
+
+	pci_err(pdev, "Going to break: %pR\n", bar);
+
+	if (pdev->is_virtfn) {
+#ifndef CONFIG_IOV
+		return -ENXIO;
+#else
+		/*
+		 * VFs don't have a per-function COMMAND register, so the best
+		 * we can do is clear the Memory Space Enable bit in the PF's
+		 * SRIOV control reg.
+		 *
+		 * Unfortunately, this requires that we have a PF (i.e doesn't
+		 * work for a passed-through VF) and it has the potential side
+		 * effect of also causing an EEH on every other VF under the
+		 * PF. Oh well.
+		 */
+		pdev = pdev->physfn;
+		if (!pdev)
+			return -ENXIO; /* passed through VFs have no PF */
+
+		pos  = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+		pos += PCI_SRIOV_CTRL;
+		bit  = PCI_SRIOV_CTRL_MSE;
+#endif /* !CONFIG_IOV */
+	} else {
+		bit = PCI_COMMAND_MEMORY;
+		pos = PCI_COMMAND;
+	}
+
+	/*
+	 * Process here is:
+	 *
+	 * 1. Disable Memory space.
+	 *
+	 * 2. Perform an MMIO to the device. This should result in an error
+	 *    (CA  / UR) being raised by the device which results in an EEH
+	 *    PE freeze. Using the in_8() accessor skips the eeh detection hook
+	 *    so the freeze hook so the EEH Detection machinery won't be
+	 *    triggered here. This is to match the usual behaviour of EEH
+	 *    where the HW will asyncronously freeze a PE and it's up to
+	 *    the kernel to notice and deal with it.
+	 *
+	 * 3. Turn Memory space back on. This is more important for VFs
+	 *    since recovery will probably fail if we don't. For normal
+	 *    the COMMAND register is reset as a part of re-initialising
+	 *    the device.
+	 *
+	 * Breaking stuff is the point so who cares if it's racy ;)
+	 */
+	pci_read_config_word(pdev, pos, &old);
+
+	mapped = ioremap(bar->start, PAGE_SIZE);
+	if (!mapped) {
+		pci_err(pdev, "Unable to map MMIO BAR %pR\n", bar);
+		return -ENXIO;
+	}
+
+	pci_write_config_word(pdev, pos, old & ~bit);
+	in_8(mapped);
+	pci_write_config_word(pdev, pos, old);
+
+	iounmap(mapped);
+
+	return 0;
+}
+
+static ssize_t eeh_dev_break_write(struct file *filp,
+				const char __user *user_buf,
+				size_t count, loff_t *ppos)
+{
+	uint32_t domain, bus, dev, fn;
+	struct pci_dev *pdev;
+	char buf[20];
+	int ret;
+
+	memset(buf, 0, sizeof(buf));
+	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+	if (!ret)
+		return -EFAULT;
+
+	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
+	if (ret != 4) {
+		pr_err("%s: expected 4 args, got %d\n", __func__, ret);
+		return -EINVAL;
+	}
+
+	pdev = pci_get_domain_bus_and_slot(domain, bus, (dev << 3) | fn);
+	if (!pdev)
+		return -ENODEV;
+
+	ret = eeh_debugfs_break_device(pdev);
+	pci_dev_put(pdev);
+
+	if (ret < 0)
+		return ret;
+
+	return count;
+}
+
+static const struct file_operations eeh_dev_break_fops = {
+	.open	= simple_open,
+	.llseek	= no_llseek,
+	.write	= eeh_dev_break_write,
+	.read   = eeh_debugfs_dev_usage,
+};
+
 #endif
 
 static int __init eeh_init_proc(void)
@@ -1949,6 +2083,9 @@ static int __init eeh_init_proc(void)
 		debugfs_create_file_unsafe("eeh_dev_check", 0600,
 				powerpc_debugfs_root, NULL,
 				&eeh_dev_check_fops);
+		debugfs_create_file_unsafe("eeh_dev_break", 0600,
+				powerpc_debugfs_root, NULL,
+				&eeh_dev_break_fops);
 		debugfs_create_file_unsafe("eeh_force_recover", 0600,
 				powerpc_debugfs_root, NULL,
 				&eeh_force_recover_fops);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 14/14] selftests/powerpc: Add basic EEH selftest
  2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
                   ` (12 preceding siblings ...)
  2019-09-03 10:16 ` [PATCH 13/14] powerpc/eeh: Add a eeh_dev_break debugfs interface Oliver O'Halloran
@ 2019-09-03 10:16 ` Oliver O'Halloran
  13 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-03 10:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

Use the new eeh_dev_check and eeh_dev_break interfaces to test EEH
recovery.  Historically this has been done manually using platform specific
EEH error injection facilities (e.g. via RTAS). However, documentation on
how to use these facilities is haphazard at best and non-existent at worst
so it's hard to develop a cross-platform test.

The new debugfs interfaces allow the kernel to handle the platform specific
details so we can write a more generic set of sets. This patch adds the
most basic of recovery tests where:

a) Errors are injected and recovered from sequentially,
b) Errors are not injected into PCI-PCI bridges, such as PCIe switches.
c) Errors are only injected into device function zero.
d) No errors are injected into Virtual Functions.

a), b) and c) are largely due to limitations of Linux's EEH support.  EEH
recovery is serialised in the EEH recovery thread which forces a).
Similarly, multi-function PCI devices are almost always grouped into the
same PE so injecting an error on one function exercises the same code
paths. c) is because we currently more or less ignore PCI bridges during
recovery and assume that the recovered topology will be the same as the
original.

d) is due to the limits of the eeh_dev_break interface. With the current
implementation we can't inject an error into a specific VF without
potentially causing additional errors on other VFs. Due to the serialised
recovery process we might end up timing out waiting for another function to
recover before the function of interest is recovered. The platform specific
error injection facilities are finer-grained and allow this capability, but
doing that requires working out how to use those facilities first.

Basicly, it's better than nothing and it's a base to build on.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 tools/testing/selftests/powerpc/Makefile      |  1 +
 tools/testing/selftests/powerpc/eeh/Makefile  |  9 ++
 .../selftests/powerpc/eeh/eeh-basic.sh        | 82 +++++++++++++++++++
 .../selftests/powerpc/eeh/eeh-functions.sh    | 76 +++++++++++++++++
 4 files changed, 168 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/eeh/Makefile
 create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-basic.sh
 create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-functions.sh

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index b3ad909aefbc..644770c3b754 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -26,6 +26,7 @@ SUB_DIRS = alignment		\
 	   switch_endian	\
 	   syscalls		\
 	   tm			\
+	   eeh			\
 	   vphn         \
 	   math		\
 	   ptrace	\
diff --git a/tools/testing/selftests/powerpc/eeh/Makefile b/tools/testing/selftests/powerpc/eeh/Makefile
new file mode 100644
index 000000000000..b397babd569b
--- /dev/null
+++ b/tools/testing/selftests/powerpc/eeh/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+noarg:
+	$(MAKE) -C ../
+
+TEST_PROGS := eeh-basic.sh
+TEST_FILES := eeh-functions.sh
+
+top_srcdir = ../../../../..
+include ../../lib.mk
diff --git a/tools/testing/selftests/powerpc/eeh/eeh-basic.sh b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
new file mode 100755
index 000000000000..f988d2f42e8f
--- /dev/null
+++ b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-only
+
+. ./eeh-functions.sh
+
+if ! eeh_supported ; then
+	echo "EEH not supported on this system, skipping"
+	exit 0;
+fi
+
+if [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_check" ] && \
+   [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_break" ] ; then
+	echo "debugfs EEH testing files are missing. Is debugfs mounted?"
+	exit 1;
+fi
+
+pre_lspci=`mktemp`
+lspci > $pre_lspci
+
+# Bump the max freeze count to something absurd so we don't
+# trip over it while breaking things.
+echo 5000 > /sys/kernel/debug/powerpc/eeh_max_freezes
+
+# record the devices that we break in here. Assuming everything
+# goes to plan we should get them back once the recover process
+# is finished.
+devices=""
+
+# Build up a list of candidate devices.
+for dev in `ls -1 /sys/bus/pci/devices/ | grep '\.0$'` ; do
+	# skip bridges since we can't recover them (yet...)
+	if [ -e "/sys/bus/pci/devices/$dev/pci_bus" ] ; then
+		echo "$dev, Skipped: bridge"
+		continue;
+	fi
+
+	# Skip VFs for now since we don't have a reliable way
+	# to break them.
+	if [ -e "/sys/bus/pci/devices/$dev/physfn" ] ; then
+		echo "$dev, Skipped: virtfn"
+		continue;
+	fi
+
+	# Don't inject errosr into an already-frozen PE. This happens with
+	# PEs that contain multiple PCI devices (e.g. multi-function cards)
+	# and injecting new errors during the recovery process will probably
+	# result in the recovery failing and the device being marked as
+	# failed.
+	if ! pe_ok $dev ; then
+		echo "$dev, Skipped: Bad initial PE state"
+		continue;
+	fi
+
+	echo "$dev, Added"
+
+	# Add to this list of device to check
+	devices="$devices $dev"
+done
+
+dev_count="$(echo $devices | wc -w)"
+echo "Found ${dev_count} breakable devices..."
+
+failed=0
+for dev in $devices ; do
+	echo "Breaking $dev..."
+
+	if ! pe_ok $dev ; then
+		echo "Skipping $dev, Initial PE state is not ok"
+		failed="$((failed + 1))"
+		continue;
+	fi
+
+	if ! eeh_one_dev $dev ; then
+		failed="$((failed + 1))"
+	fi
+done
+
+echo "$failed devices failed to recover ($dev_count tested)"
+lspci | diff -u $pre_lspci -
+rm -f $pre_lspci
+
+exit $failed
diff --git a/tools/testing/selftests/powerpc/eeh/eeh-functions.sh b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
new file mode 100755
index 000000000000..26112ab5cdf4
--- /dev/null
+++ b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
@@ -0,0 +1,76 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-only
+
+pe_ok() {
+	local dev="$1"
+	local path="/sys/bus/pci/devices/$dev/eeh_pe_state"
+
+	if ! [ -e "$path" ] ; then
+		return 1;
+	fi
+
+	local fw_state="$(cut -d' ' -f1 < $path)"
+	local sw_state="$(cut -d' ' -f2 < $path)"
+
+	# If EEH_PE_ISOLATED or EEH_PE_RECOVERING are set then the PE is in an
+	# error state or being recovered. Either way, not ok.
+	if [ "$((sw_state & 0x3))" -ne 0 ] ; then
+		return 1
+	fi
+
+	# A functioning PE should have the EEH_STATE_MMIO_ACTIVE and
+	# EEH_STATE_DMA_ACTIVE flags set. For some goddamn stupid reason
+	# the platform backends set these when the PE is in reset. The
+	# RECOVERING check above should stop any false positives though.
+	if [ "$((fw_state & 0x18))" -ne "$((0x18))" ] ; then
+		return 1
+	fi
+
+	return 0;
+}
+
+eeh_supported() {
+	test -e /proc/powerpc/eeh && \
+	grep -q 'EEH Subsystem is enabled' /proc/powerpc/eeh
+}
+
+eeh_one_dev() {
+	local dev="$1"
+
+	# Using this function from the command line is sometimes useful for
+	# testing so check that the argument is a well-formed sysfs device
+	# name.
+	if ! test -e /sys/bus/pci/devices/$dev/ ; then
+		echo "Error: '$dev' must be a sysfs device name (DDDD:BB:DD.F)"
+		return 1;
+	fi
+
+	# Break it
+	echo $dev >/sys/kernel/debug/powerpc/eeh_dev_break
+
+	# Force an EEH device check. If the kernel has already
+	# noticed the EEH (due to a driver poll or whatever), this
+	# is a no-op.
+	echo $dev >/sys/kernel/debug/powerpc/eeh_dev_check
+
+	# Enforce a 30s timeout for recovery. Even the IPR, which is infamously
+	# slow to reset, should recover within 30s.
+	max_wait=30
+
+	for i in `seq 0 ${max_wait}` ; do
+		if pe_ok $dev ; then
+			break;
+		fi
+		echo "$dev, waited $i/${max_wait}"
+		sleep 1
+	done
+
+	if ! pe_ok $dev ; then
+		echo "$dev, Failed to recover!"
+		return 1;
+	fi
+
+	echo "$dev, Recovered after $i seconds"
+	return 0;
+}
+
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment
  2019-09-03 10:15 ` [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment Oliver O'Halloran
@ 2019-09-03 14:09   ` Andrew Donnellan
  2019-09-17  1:04   ` Sam Bobroff
  1 sibling, 0 replies; 34+ messages in thread
From: Andrew Donnellan @ 2019-09-03 14:09 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: sbobroff

On 3/9/19 8:15 pm, Oliver O'Halloran wrote:
> Support for switching CAPI cards into and out of CAPI mode was removed a
> while ago. Drop the comment since it's no longer relevant.
> 
> Cc: Andrew Donnellan <ajd@linux.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Oliver looked... unimpressed with the hackiness of the design of our 
mode-switching as he yelled at me to come explain this comment.

Acked-by: Andrew Donnellan <ajd@linux.ibm.com>

> ---
>   arch/powerpc/platforms/powernv/eeh-powernv.c | 7 -------
>   1 file changed, 7 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index e7b867912f24..94e26d56ecd2 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -1125,13 +1125,6 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>   		return -EIO;
>   	}
>   
> -	/*
> -	 * If dealing with the root bus (or the bus underneath the
> -	 * root port), we reset the bus underneath the root port.
> -	 *
> -	 * The cxl driver depends on this behaviour for bi-modal card
> -	 * switching.
> -	 */
>   	if (pci_is_root_bus(bus) ||
>   	    pci_is_root_bus(bus->parent))
>   		return pnv_eeh_root_reset(hose, option);
> 

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes
  2019-09-03 10:15 ` [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes Oliver O'Halloran
@ 2019-09-17  0:43   ` Sam Bobroff
  2019-09-19 10:25   ` Michael Ellerman
  1 sibling, 0 replies; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  0:43 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 7469 bytes --]

On Tue, Sep 03, 2019 at 08:15:52PM +1000, Oliver O'Halloran wrote:
> When the last device in an eeh_pe is removed the eeh_pe structure itself
> (and any empty parents) are freed since they are no longer needed. This
> results in a crash when a hotplug driver is involved since the following
> may occur:
> 
> 1. Device is suprise removed.
> 2. Driver performs an MMIO, which fails and queues and eeh_event.
> 3. Hotplug driver receives a hotplug interrupt and removes any
>    pci_devs that were under the slot.
> 4. pci_dev is torn down and the eeh_pe is freed.
> 5. The EEH event handler thread processes the eeh_event and crashes
>    since the eeh_pe pointer in the eeh_event structure is no
>    longer valid.
> 
> Crashing is generally considered poor form. Instead of doing that use
> the fact PEs are marked as EEH_PE_INVALID to keep them around until the
> end of the recovery cycle, at which point we can safely prune any empty
> PEs.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

I think this could use a bit more explanation about how the change
correct the problem, but if I understand it correctly, what it's doing
is (for the hotplug problem case):
- When the (problmatic) freeeze is detected during hot unplug, change
  eeh_dev_checK_failure() so that it marks PEs with EEH_PE_RECOVERING
  immediately rather than waiting until error recovery has started.
- Then as the hot unplug continues and the device is removed, and
  eeh_rmv_from_parent_pe() is called, change it so that it does not
  remove PEs that are marked EEH_PE_RECOVERING (which will be the ones
  just marked).
- Then as recovery starts, PE removal won't race between the hotplug
  code and the recovery code, and the crash is avoided.

If that's right, it looks fine to me, although as you said we might want
to revert it if/when we have something better.

I haven't got a test setup for the error case, but I've tried my usual
test cases and they all seem to be OK. The only place I can see that
might be disturbed by this change is where EEH_PE_RECOVERING affects
eeh_pe_reset_and_recover() (used when a PE is passed back from
a guest to the host), but the test case doesn't seem to be any worse.

Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
> ---
> Sam Bobroff is working on implementing proper refcounting for EEH PEs,
> but that's not going to be ready for a while and this is less broken
> than what we have now.
> ---
>  arch/powerpc/kernel/eeh_driver.c | 36 ++++++++++++++++++++++++++++++--
>  arch/powerpc/kernel/eeh_event.c  |  8 +++++++
>  arch/powerpc/kernel/eeh_pe.c     | 23 +++++++++++++++++++-
>  3 files changed, 64 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index a31cd32c4ce9..75266156943f 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -734,6 +734,33 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
>   */
>  #define MAX_WAIT_FOR_RECOVERY 300
>  
> +
> +/* Walks the PE tree after processing an event to remove any stale PEs.
> + *
> + * NB: This needs to be recursive to ensure the leaf PEs get removed
> + * before their parents do. Although this is possible to do recursively
> + * we don't since this is easier to read and we need to garantee
> + * the leaf nodes will be handled first.
> + */
> +static void eeh_pe_cleanup(struct eeh_pe *pe)
> +{
> +	struct eeh_pe *child_pe, *tmp;
> +
> +	list_for_each_entry_safe(child_pe, tmp, &pe->child_list, child)
> +		eeh_pe_cleanup(child_pe);
> +
> +	if (pe->state & EEH_PE_KEEP)
> +		return;
> +
> +	if (!(pe->state & EEH_PE_INVALID))
> +		return;
> +
> +	if (list_empty(&pe->edevs) && list_empty(&pe->child_list)) {
> +		list_del(&pe->child);
> +		kfree(pe);
> +	}
> +}
> +
>  /**
>   * eeh_handle_normal_event - Handle EEH events on a specific PE
>   * @pe: EEH PE - which should not be used after we return, as it may
> @@ -772,8 +799,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  		return;
>  	}
>  
> -	eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
> -
>  	eeh_pe_update_time_stamp(pe);
>  	pe->freeze_count++;
>  	if (pe->freeze_count > eeh_max_freezes) {
> @@ -963,6 +988,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  			return;
>  		}
>  	}
> +
> +	/*
> +	 * Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
> +	 * we don't want to modify the PE tree structure so we do it here.
> +	 */
> +	eeh_pe_cleanup(pe);
>  	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
>  }
>  
> @@ -1035,6 +1066,7 @@ void eeh_handle_special_event(void)
>  		 */
>  		if (rc == EEH_NEXT_ERR_FROZEN_PE ||
>  		    rc == EEH_NEXT_ERR_FENCED_PHB) {
> +			eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
>  			eeh_handle_normal_event(pe);
>  		} else {
>  			pci_lock_rescan_remove();
> diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
> index 64cfbe41174b..e36653e5f76b 100644
> --- a/arch/powerpc/kernel/eeh_event.c
> +++ b/arch/powerpc/kernel/eeh_event.c
> @@ -121,6 +121,14 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
>  	}
>  	event->pe = pe;
>  
> +	/*
> +	 * Mark the PE as recovering before inserting it in the queue.
> +	 * This prevents the PE from being free()ed by a hotplug driver
> +	 * while the PE is sitting in the event queue.
> +	 */
> +	if (pe)
> +		eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
> +
>  	/* We may or may not be called in an interrupt context */
>  	spin_lock_irqsave(&eeh_eventlist_lock, flags);
>  	list_add(&event->list, &eeh_eventlist);
> diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
> index 1a6254bcf056..177852e39a25 100644
> --- a/arch/powerpc/kernel/eeh_pe.c
> +++ b/arch/powerpc/kernel/eeh_pe.c
> @@ -470,6 +470,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
>  int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  {
>  	struct eeh_pe *pe, *parent, *child;
> +	bool keep, recover;
>  	int cnt;
>  
>  	pe = eeh_dev_to_pe(edev);
> @@ -490,10 +491,21 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  	 */
>  	while (1) {
>  		parent = pe->parent;
> +
> +		/* PHB PEs should never be removed */
>  		if (pe->type & EEH_PE_PHB)
>  			break;
>  
> -		if (!(pe->state & EEH_PE_KEEP)) {
> +		/*
> +		 * XXX: KEEP is set while resetting a PE. I don't think it's
> +		 * ever set without RECOVERING also being set. I could
> +		 * be wrong though so catch that with a WARN.
> +		 */
> +		keep = !!(pe->state & EEH_PE_KEEP);
> +		recover = !!(pe->state & EEH_PE_RECOVERING);
> +		WARN_ON(keep && !recover);
> +
> +		if (!keep && !recover) {
>  			if (list_empty(&pe->edevs) &&
>  			    list_empty(&pe->child_list)) {
>  				list_del(&pe->child);
> @@ -502,6 +514,15 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  				break;
>  			}
>  		} else {
> +			/*
> +			 * Mark the PE as invalid. At the end of the recovery
> +			 * process any invalid PEs will be garbage collected.
> +			 *
> +			 * We need to delay the free()ing of them since we can
> +			 * remove edev's while traversing the PE tree which
> +			 * might trigger the removal of a PE and we can't
> +			 * deal with that (yet).
> +			 */
>  			if (list_empty(&pe->edevs)) {
>  				cnt = 0;
>  				list_for_each_entry(child, &pe->child_list, child) {
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/14] powerpc/eeh: Fix race when freeing PDNs
  2019-09-03 10:15 ` [PATCH 02/14] powerpc/eeh: Fix race when freeing PDNs Oliver O'Halloran
@ 2019-09-17  0:50   ` Sam Bobroff
  0 siblings, 0 replies; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  0:50 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4478 bytes --]

On Tue, Sep 03, 2019 at 08:15:53PM +1000, Oliver O'Halloran wrote:
> When hot-adding devices we rely on the hotplug driver to create pci_dn's
> for the devices under the hotplug slot. Converse, when hot-removing the
> driver will remove the pci_dn's that it created. This is a problem because
> the pci_dev is still live until it's refcount drops to zero. This can
> happen if the driver is slow to tear down it's internal state. Ideally, the
> driver would not attempt to perform any config accesses to the device once
> it's been marked as removed, but sometimes it happens. As a result, we
> might attempt to access the pci_dn for a device that has been torn down and
> the kernel may crash as a result.
> 
> To fix this, don't free the pci_dn unless the corresponding pci_dev has
> been released.  If the pci_dev is still live, then we mark the pci_dn with
> a flag that indicates the pci_dev's release function should free it.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

If I understand this, it works because when
pci_remove_device_node_info() calls pci_get_domain_bus_and_slot(),
it either:
a) returns null, meaning the pci_dev is already gone, the release
handler is already called, and the PDN can be removed there, or
b) returns non-null and atomically increases the refcount and the
release handler won't be called until after we've set the DEAD flag and
released our reference.

Looks good to me,

Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>

> ---
>  arch/powerpc/include/asm/pci-bridge.h |  1 +
>  arch/powerpc/kernel/pci-hotplug.c     |  7 +++++++
>  arch/powerpc/kernel/pci_dn.c          | 21 +++++++++++++++++++--
>  3 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index 54a9de01c2a3..69f4cb3b7c56 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -183,6 +183,7 @@ struct iommu_table;
>  struct pci_dn {
>  	int     flags;
>  #define PCI_DN_FLAG_IOV_VF	0x01
> +#define PCI_DN_FLAG_DEAD	0x02    /* Device has been hot-removed */
>  
>  	int	busno;			/* pci bus number */
>  	int	devfn;			/* pci device and function number */
> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
> index 0b0cf8168b47..fc62c4bc47b1 100644
> --- a/arch/powerpc/kernel/pci-hotplug.c
> +++ b/arch/powerpc/kernel/pci-hotplug.c
> @@ -55,11 +55,18 @@ EXPORT_SYMBOL_GPL(pci_find_bus_by_node);
>  void pcibios_release_device(struct pci_dev *dev)
>  {
>  	struct pci_controller *phb = pci_bus_to_host(dev->bus);
> +	struct pci_dn *pdn = pci_get_pdn(dev);
>  
>  	eeh_remove_device(dev);
>  
>  	if (phb->controller_ops.release_device)
>  		phb->controller_ops.release_device(dev);
> +
> +	/* free()ing the pci_dn has been deferred to us, do it now */
> +	if (pdn && (pdn->flags & PCI_DN_FLAG_DEAD)) {
> +		pci_dbg(dev, "freeing dead pdn\n");
> +		kfree(pdn);
> +	}
>  }
>  
>  /**
> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index 5614ca0940ca..4e654df55969 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -320,6 +320,7 @@ void pci_remove_device_node_info(struct device_node *dn)
>  {
>  	struct pci_dn *pdn = dn ? PCI_DN(dn) : NULL;
>  	struct device_node *parent;
> +	struct pci_dev *pdev;
>  #ifdef CONFIG_EEH
>  	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>  
> @@ -333,12 +334,28 @@ void pci_remove_device_node_info(struct device_node *dn)
>  	WARN_ON(!list_empty(&pdn->child_list));
>  	list_del(&pdn->list);
>  
> +	/* Drop the parent pci_dn's ref to our backing dt node */
>  	parent = of_get_parent(dn);
>  	if (parent)
>  		of_node_put(parent);
>  
> -	dn->data = NULL;
> -	kfree(pdn);
> +	/*
> +	 * At this point we *might* still have a pci_dev that was
> +	 * instantiated from this pci_dn. So defer free()ing it until
> +	 * the pci_dev's release function is called.
> +	 */
> +	pdev = pci_get_domain_bus_and_slot(pdn->phb->global_number,
> +			pdn->busno, pdn->devfn);
> +	if (pdev) {
> +		/* NB: pdev has a ref to dn */
> +		pci_dbg(pdev, "marked pdn (from %pOF) as dead\n", dn);
> +		pdn->flags |= PCI_DN_FLAG_DEAD;
> +	} else {
> +		dn->data = NULL;
> +		kfree(pdn);
> +	}
> +
> +	pci_dev_put(pdev);
>  }
>  EXPORT_SYMBOL_GPL(pci_remove_device_node_info);
>  
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 03/14] powerpc/eeh: Make permanently failed devices non-actionable
  2019-09-03 10:15 ` [PATCH 03/14] powerpc/eeh: Make permanently failed devices non-actionable Oliver O'Halloran
@ 2019-09-17  0:51   ` Sam Bobroff
  0 siblings, 0 replies; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  0:51 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1354 bytes --]

On Tue, Sep 03, 2019 at 08:15:54PM +1000, Oliver O'Halloran wrote:
> If a device is torn down by a hotplug slot driver it's marked as removed
> and marked as permaantly failed. There's no point in trying to recover a
permanently
> permernantly failed device so it should be considered un-actionable.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Other than the typo, looks good (I think it should always have been like
this):

Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
> ---
>  arch/powerpc/kernel/eeh_driver.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 75266156943f..18a69fac4d80 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -96,8 +96,16 @@ static bool eeh_dev_removed(struct eeh_dev *edev)
>  
>  static bool eeh_edev_actionable(struct eeh_dev *edev)
>  {
> -	return (edev->pdev && !eeh_dev_removed(edev) &&
> -		!eeh_pe_passed(edev->pe));
> +	if (!edev->pdev)
> +		return false;
> +	if (edev->pdev->error_state == pci_channel_io_perm_failure)
> +		return false;
> +	if (eeh_dev_removed(edev))
> +		return false;
> +	if (eeh_pe_passed(edev->pe))
> +		return false;
> +
> +	return true;
>  }
>  
>  /**
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event()
  2019-09-03 10:15 ` [PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event() Oliver O'Halloran
@ 2019-09-17  1:00   ` Sam Bobroff
  2019-09-17  4:20     ` Oliver O'Halloran
  0 siblings, 1 reply; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  1:00 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 5051 bytes --]

On Tue, Sep 03, 2019 at 08:15:55PM +1000, Oliver O'Halloran wrote:
> When a device is surprise removed while undergoing IO we will probably
> get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze
> is detected we send a recovery event to the EEH worker thread which will
> notify drivers, and perform recovery as needed.
> 
> In the event of a hot-remove we don't want recovery to occur since there
> isn't a device to recover. The recovery process is fairly long due to
> the number of wait states (required by PCIe) which causes problems when
> devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).
> 
> To determine if we need to skip the recovery process we can use the
> get_adapter_state() operation of the hotplug_slot to determine if the
> slot contains a device or not, and if the slot is empty we can skip
> recovery entirely.
> 
> One thing to note is that the slot being EEH frozen does not prevent the
> hotplug driver from working. We don't have the EEH recovery thread
> remove any of the devices since it's assumed that the hotplug driver
> will handle tearing down the slot state.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks good, but some comments, below.

> ---
>  arch/powerpc/kernel/eeh_driver.c | 60 ++++++++++++++++++++++++++++++++
>  1 file changed, 60 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 18a69fac4d80..52ce7584af43 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -27,6 +27,7 @@
>  #include <linux/irq.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> +#include <linux/pci_hotplug.h>
>  #include <asm/eeh.h>
>  #include <asm/eeh_event.h>
>  #include <asm/ppc-pci.h>
> @@ -769,6 +770,46 @@ static void eeh_pe_cleanup(struct eeh_pe *pe)
>  	}
>  }
>  
> +/**
> + * eeh_check_slot_presence - Check if a device is still present in a slot
> + * @pdev: pci_dev to check
> + *
> + * This function may return a false positive if we can't determine the slot's
> + * presence state. This might happen for for PCIe slots if the PE containing
> + * the upstream bridge is also frozen, or the bridge is part of the same PE
> + * as the device.
> + *
> + * This shouldn't happen often, but you might see it if you hotplug a PCIe
> + * switch.
> + */

I don't think the function name is very good; it does check the slot but
it doesn't tell you what a true result means -- but I don't see an
obviously great alternative either. If it can return false positives, it's
really testing for empty so maybe 'eeh_slot_definitely_empty()' or
'eeh_slot_maybe_populated()'?

> +static bool eeh_slot_presence_check(struct pci_dev *pdev)
> +{
> +	const struct hotplug_slot_ops *ops;
> +	struct pci_slot *slot;
> +	u8 state;
> +	int rc;
> +
> +	if (!pdev)
> +		return false;
> +
> +	if (pdev->error_state == pci_channel_io_perm_failure)
> +		return false;
> +
> +	slot = pdev->slot;
> +	if (!slot || !slot->hotplug)
> +		return true;
> +
> +	ops = slot->hotplug->ops;
> +	if (!ops || !ops->get_adapter_status)
> +		return true;
> +
> +	rc = ops->get_adapter_status(slot->hotplug, &state);
> +	if (rc)
> +		return true;
> +
> +	return !!state;
> +}
> +
>  /**
>   * eeh_handle_normal_event - Handle EEH events on a specific PE
>   * @pe: EEH PE - which should not be used after we return, as it may
> @@ -799,6 +840,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  	enum pci_ers_result result = PCI_ERS_RESULT_NONE;
>  	struct eeh_rmv_data rmv_data =
>  		{LIST_HEAD_INIT(rmv_data.removed_vf_list), 0};
> +	int devices = 0;
>  
>  	bus = eeh_pe_bus_get(pe);
>  	if (!bus) {
> @@ -807,6 +849,23 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  		return;
>  	}
>  
> +	/*
> +	 * When devices are hot-removed we might get an EEH due to
> +	 * a driver attempting to touch the MMIO space of a removed
> +	 * device. In this case we don't have a device to recover
> +	 * so suppress the event if we can't find any present devices.
> +	 *
> +	 * The hotplug driver should take care of tearing down the
> +	 * device itself.
> +	 */
> +	eeh_for_each_pe(pe, tmp_pe)
> +		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
> +			if (eeh_slot_presence_check(edev->pdev))
> +				devices++;

In other parts of the EEH code we do a get_device() on edev->pdev before
passing it around, it might be good to do that here too.

> +
> +	if (!devices)
> +		goto out; /* nothing to recover */

Does this handle an empty, but frozen, PHB correctly? (Can that happen?)

> +
>  	eeh_pe_update_time_stamp(pe);
>  	pe->freeze_count++;
>  	if (pe->freeze_count > eeh_max_freezes) {
> @@ -997,6 +1056,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  		}
>  	}
>  
> +out:
>  	/*
>  	 * Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
>  	 * we don't want to modify the PE tree structure so we do it here.
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/14] powerpc/eeh: Defer printing stack trace
  2019-09-03 10:15 ` [PATCH 05/14] powerpc/eeh: Defer printing stack trace Oliver O'Halloran
@ 2019-09-17  1:04   ` Sam Bobroff
  2019-09-17  1:45     ` Oliver O'Halloran
  0 siblings, 1 reply; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  1:04 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 7303 bytes --]

On Tue, Sep 03, 2019 at 08:15:56PM +1000, Oliver O'Halloran wrote:
> Currently we print a stack trace in the event handler to help with
> debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
> so we want to prevent printing the stack trace unless we know it's due to
> an actual device error. To accomplish this, we can save a stack trace at
> the point of detection and only print it once the EEH recovery handler has
> determined the freeze was due to an actual error.
> 
> Since the whole point of this is to prevent spurious EEH output we also
> move a few prints out of the detection thread, or mark them as pr_debug
> so anyone interested can get output from the eeh_check_dev_failure()
> if they want.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

I think this is a good change, and even in the normal case it will place
the stacktrace closer to the rest of the recovery information.

But, I think it would make more sense to put the stacktrace into the
struct eeh_event, rather than the struct eeh_pe. Is there some reason
we can't do that? (It would save a fair bit of memory!)

Cheers,
Sam.

> ---
>  arch/powerpc/include/asm/eeh.h   | 11 +++++++++
>  arch/powerpc/kernel/eeh.c        | 15 ++++---------
>  arch/powerpc/kernel/eeh_driver.c | 38 +++++++++++++++++++++++++++++++-
>  arch/powerpc/kernel/eeh_event.c  | 26 ++++++++++------------
>  4 files changed, 64 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index c13119a5e69b..9d0e1694a94d 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -88,6 +88,17 @@ struct eeh_pe {
>  	struct list_head child_list;	/* List of PEs below this PE	*/
>  	struct list_head child;		/* Memb. child_list/eeh_phb_pe	*/
>  	struct list_head edevs;		/* List of eeh_dev in this PE	*/
> +
> +	/*
> +	 * Saved stack trace. When we find a PE freeze in eeh_dev_check_failure
> +	 * the stack trace is saved here so we can print it in the recovery
> +	 * thread if it turns out to due to a real problem rather than
> +	 * a hot-remove.
> +	 *
> +	 * A max of 64 entries might be overkill, but it also might not be.
> +	 */
> +	unsigned long stack_trace[64];
> +	int trace_entries;
>  };
>  
>  #define eeh_pe_for_each_dev(pe, edev, tmp) \
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 9c468e79d13c..46d17817b438 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -420,11 +420,9 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
>  	eeh_pe_mark_isolated(phb_pe);
>  	eeh_serialize_unlock(flags);
>  
> -	pr_err("EEH: PHB#%x failure detected, location: %s\n",
> +	pr_debug("EEH: PHB#%x failure detected, location: %s\n",
>  		phb_pe->phb->global_number, eeh_pe_loc_get(phb_pe));
> -	dump_stack();
>  	eeh_send_failure_event(phb_pe);
> -
>  	return 1;
>  out:
>  	eeh_serialize_unlock(flags);
> @@ -451,7 +449,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>  	unsigned long flags;
>  	struct device_node *dn;
>  	struct pci_dev *dev;
> -	struct eeh_pe *pe, *parent_pe, *phb_pe;
> +	struct eeh_pe *pe, *parent_pe;
>  	int rc = 0;
>  	const char *location = NULL;
>  
> @@ -581,13 +579,8 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>  	 * a stack trace will help the device-driver authors figure
>  	 * out what happened.  So print that out.
>  	 */
> -	phb_pe = eeh_phb_pe_get(pe->phb);
> -	pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
> -	       pe->phb->global_number, pe->addr);
> -	pr_err("EEH: PE location: %s, PHB location: %s\n",
> -	       eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
> -	dump_stack();
> -
> +	pr_debug("EEH: %s: Frozen PHB#%x-PE#%x detected\n",
> +		__func__, pe->phb->global_number, pe->addr);
>  	eeh_send_failure_event(pe);
>  
>  	return 1;
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 52ce7584af43..0d34cc12c529 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -863,8 +863,44 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  			if (eeh_slot_presence_check(edev->pdev))
>  				devices++;
>  
> -	if (!devices)
> +	if (!devices) {
> +		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
> +			pe->phb->global_number, pe->addr);
>  		goto out; /* nothing to recover */
> +	}
> +
> +	/* Log the event */
> +	if (pe->type & EEH_PE_PHB) {
> +		pr_err("EEH: PHB#%x failure detected, location: %s\n",
> +			pe->phb->global_number, eeh_pe_loc_get(pe));
> +	} else {
> +		struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
> +
> +		pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
> +		       pe->phb->global_number, pe->addr);
> +		pr_err("EEH: PE location: %s, PHB location: %s\n",
> +		       eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
> +	}
> +
> +	/*
> +	 * Print the saved stack trace now that we've verified there's
> +	 * something to recover.
> +	 */
> +	if (pe->trace_entries) {
> +		void **ptrs = (void **) pe->stack_trace;
> +		int i;
> +
> +		pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
> +		       pe->phb->global_number, pe->addr);
> +
> +		/* FIXME: Use the same format as dump_stack() */
> +		pr_err("EEH: Call Trace:\n");
> +		for (i = 0; i < pe->trace_entries; i++)
> +			pr_err("EEH: [%pK] %pS\n", ptrs[i], ptrs[i]);
> +
> +		pe->trace_entries = 0;
> +	}
> +
>  
>  	eeh_pe_update_time_stamp(pe);
>  	pe->freeze_count++;
> diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
> index e36653e5f76b..1d55486adb0f 100644
> --- a/arch/powerpc/kernel/eeh_event.c
> +++ b/arch/powerpc/kernel/eeh_event.c
> @@ -40,7 +40,6 @@ static int eeh_event_handler(void * dummy)
>  {
>  	unsigned long flags;
>  	struct eeh_event *event;
> -	struct eeh_pe *pe;
>  
>  	while (!kthread_should_stop()) {
>  		if (wait_for_completion_interruptible(&eeh_eventlist_event))
> @@ -59,19 +58,10 @@ static int eeh_event_handler(void * dummy)
>  			continue;
>  
>  		/* We might have event without binding PE */
> -		pe = event->pe;
> -		if (pe) {
> -			if (pe->type & EEH_PE_PHB)
> -				pr_info("EEH: Detected error on PHB#%x\n",
> -					 pe->phb->global_number);
> -			else
> -				pr_info("EEH: Detected PCI bus error on "
> -					"PHB#%x-PE#%x\n",
> -					pe->phb->global_number, pe->addr);
> -			eeh_handle_normal_event(pe);
> -		} else {
> +		if (event->pe)
> +			eeh_handle_normal_event(event->pe);
> +		else
>  			eeh_handle_special_event();
> -		}
>  
>  		kfree(event);
>  	}
> @@ -126,8 +116,16 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
>  	 * This prevents the PE from being free()ed by a hotplug driver
>  	 * while the PE is sitting in the event queue.
>  	 */
> -	if (pe)
> +	if (pe) {
> +		/*
> +		 * Save the current stack trace so we can dump it from the
> +		 * event handler thread.
> +		 */
> +		pe->trace_entries = stack_trace_save(pe->stack_trace,
> +					 ARRAY_SIZE(pe->stack_trace), 0);
> +
>  		eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
> +	}
>  
>  	/* We may or may not be called in an interrupt context */
>  	spin_lock_irqsave(&eeh_eventlist_lock, flags);
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment
  2019-09-03 10:15 ` [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment Oliver O'Halloran
  2019-09-03 14:09   ` Andrew Donnellan
@ 2019-09-17  1:04   ` Sam Bobroff
  1 sibling, 0 replies; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  1:04 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev, Andrew Donnellan

[-- Attachment #1: Type: text/plain, Size: 1227 bytes --]

On Tue, Sep 03, 2019 at 08:15:57PM +1000, Oliver O'Halloran wrote:
> Support for switching CAPI cards into and out of CAPI mode was removed a
> while ago. Drop the comment since it's no longer relevant.
> 
> Cc: Andrew Donnellan <ajd@linux.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

LGTM
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>

> ---
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index e7b867912f24..94e26d56ecd2 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -1125,13 +1125,6 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>  		return -EIO;
>  	}
>  
> -	/*
> -	 * If dealing with the root bus (or the bus underneath the
> -	 * root port), we reset the bus underneath the root port.
> -	 *
> -	 * The cxl driver depends on this behaviour for bi-modal card
> -	 * switching.
> -	 */
>  	if (pci_is_root_bus(bus) ||
>  	    pci_is_root_bus(bus->parent))
>  		return pnv_eeh_root_reset(hose, option);
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/14] powernv/eeh: Use generic code to handle hot resets
  2019-09-03 10:15 ` [PATCH 07/14] powernv/eeh: Use generic code to handle hot resets Oliver O'Halloran
@ 2019-09-17  1:15   ` Sam Bobroff
  2019-09-17  7:30     ` Oliver O'Halloran
  0 siblings, 1 reply; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  1:15 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4387 bytes --]

On Tue, Sep 03, 2019 at 08:15:58PM +1000, Oliver O'Halloran wrote:
> When we reset PCI devices managed by a hotplug driver the reset may
> generate spurious hotplug events that cause the PCI device we're resetting
> to be torn down accidently. This is a problem for EEH (when the driver is
> EEH aware) since we want to leave the OS PCI device state intact so that
> the device can be re-set without losing any resources (network, disks,
> etc) provided by the driver.
> 
> Generic PCI code provides the pci_bus_error_reset() function to handle
> resetting a PCI Device (or bus) by using the reset method provided by the
> hotplug slot driver. We can use this function if the EEH core has
> requested a hot reset (common case) without tripping over the hotplug
> driver.

Could you explain a bit more about how this change solves the problem?
Is it that the hotplug driver's reset method doesn't cause spurious
hotplug events?

(Some other comments below)

> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> I know that include is a bit gross, but:
> 
> a) We're already doing it in pci-ioda.c, and in pseries/pci.
> b) It's pci_bus_error_reset() isn't really a function that
>    should be provided to non-pci core code.
> ---
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 38 ++++++++++++++++++--
>  1 file changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 94e26d56ecd2..6bc24a47e9ef 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -34,6 +34,7 @@
>  
>  #include "powernv.h"
>  #include "pci.h"
> +#include "../../../../drivers/pci/pci.h"
>  
>  static int eeh_event_irq = -EINVAL;
>  
> @@ -849,7 +850,7 @@ static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>  	int aer = edev ? edev->aer_cap : 0;
>  	u32 ctrl;
>  
> -	pr_debug("%s: Reset PCI bus %04x:%02x with option %d\n",
> +	pr_debug("%s: Secondary Reset PCI bus %04x:%02x with option %d\n",
>  		 __func__, pci_domain_nr(dev->bus),
>  		 dev->bus->number, option);
>  
> @@ -907,6 +908,10 @@ static int pnv_eeh_bridge_reset(struct pci_dev *pdev, int option)
>  	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
>  		return __pnv_eeh_bridge_reset(pdev, option);
>  
> +	pr_debug("%s: FW reset PCI bus %04x:%02x with option %d\n",
> +		 __func__, pci_domain_nr(pdev->bus),
> +		 pdev->bus->number, option);
> +
>  	switch (option) {
>  	case EEH_RESET_FUNDAMENTAL:
>  		scope = OPAL_RESET_PCI_FUNDAMENTAL;
> @@ -1125,10 +1130,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>  		return -EIO;
>  	}
>  
> -	if (pci_is_root_bus(bus) ||
> -	    pci_is_root_bus(bus->parent))
> +	if (pci_is_root_bus(bus))
>  		return pnv_eeh_root_reset(hose, option);
>  
> +	/*
> +	 * For hot resets try use the generic PCI error recovery reset
> +	 * functions. These correctly handles the case where the secondary
> +	 * bus is behind a hotplug slot and it will use the slot provided
> +	 * reset methods to prevent spurious hotplug events during the reset.
> +	 *
> +	 * Fundemental resets need to be handled internally to EEH since the
> +	 * PCI core doesn't really have a concept of a fundemental reset,
> +	 * mainly because there's no standard way to generate one. Only a
> +	 * few devices require an FRESET so it should be fine.
> +	 */
> +	if (option != EEH_RESET_FUNDAMENTAL) {
> +		/*
> +		 * NB: Skiboot and pnv_eeh_bridge_reset() also no-op the
> +		 *     de-assert step. It's like the OPAL reset API was
> +		 *     poorly designed or something...
> +		 */
> +		if (option == EEH_RESET_DEACTIVATE)
> +			return 0;

It looks like this will prevent pnv_eeh_root_reset(bus->parent) (below)
from being called for EEH_RESET_DEACTIVATE, when it was before the
patch. Is that right?

> +
> +		rc = pci_bus_error_reset(bus->self);
> +		if (!rc)
> +			return 0;

Is it correct to fall through and try a different reset if this fails?

> +	}
> +
> +	/* otherwise, use the generic bridge reset. this might call into FW */
> +	if (pci_is_root_bus(bus->parent))
> +		return pnv_eeh_root_reset(hose, option);
>  	return pnv_eeh_bridge_reset(bus->self, option);
>  }
>  
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 11/14] powerpc/eeh: Set attention indicator while recovering
  2019-09-03 10:16 ` [PATCH 11/14] powerpc/eeh: Set attention indicator while recovering Oliver O'Halloran
@ 2019-09-17  1:23   ` Sam Bobroff
  0 siblings, 0 replies; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  1:23 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2535 bytes --]

On Tue, Sep 03, 2019 at 08:16:02PM +1000, Oliver O'Halloran wrote:
> I am the RAS team. Hear me roar.
> 
> Roar.
> 
> On a more serious note, being able to locate failed devices can be helpful.
> Set the attention indicator if the slot supports it once we've determined
> the device is present and only clear it if the device is fully recovered.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks good, although I think it would be clearer if you could separate
checking the slot from raising the alert.

Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>

> ---
>  arch/powerpc/kernel/eeh_driver.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 0d34cc12c529..80bd157fcb45 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -803,6 +803,10 @@ static bool eeh_slot_presence_check(struct pci_dev *pdev)
>  	if (!ops || !ops->get_adapter_status)
>  		return true;
>  
> +	/* set the attention indicator while we've got the slot ops */
> +	if (ops->set_attention_status)
> +		ops->set_attention_status(slot->hotplug, 1);
> +
>  	rc = ops->get_adapter_status(slot->hotplug, &state);
>  	if (rc)
>  		return true;
> @@ -810,6 +814,28 @@ static bool eeh_slot_presence_check(struct pci_dev *pdev)
>  	return !!state;
>  }
>  
> +static void eeh_clear_slot_attention(struct pci_dev *pdev)
> +{
> +	const struct hotplug_slot_ops *ops;
> +	struct pci_slot *slot;
> +
> +	if (!pdev)
> +		return;
> +
> +	if (pdev->error_state == pci_channel_io_perm_failure)
> +		return;
> +
> +	slot = pdev->slot;
> +	if (!slot || !slot->hotplug)
> +		return;
> +
> +	ops = slot->hotplug->ops;
> +	if (!ops || !ops->set_attention_status)
> +		return;
> +
> +	ops->set_attention_status(slot->hotplug, 0);
> +}
> +
>  /**
>   * eeh_handle_normal_event - Handle EEH events on a specific PE
>   * @pe: EEH PE - which should not be used after we return, as it may
> @@ -1098,6 +1124,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  	 * we don't want to modify the PE tree structure so we do it here.
>  	 */
>  	eeh_pe_cleanup(pe);
> +
> +	/* clear the slot attention LED for all recovered devices */
> +	eeh_for_each_pe(pe, tmp_pe)
> +		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
> +			eeh_clear_slot_attention(edev->pdev);
> +
>  	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
>  }
>  
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/14] powerpc/eeh: Defer printing stack trace
  2019-09-17  1:04   ` Sam Bobroff
@ 2019-09-17  1:45     ` Oliver O'Halloran
  2019-09-17  3:35       ` Sam Bobroff
  0 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-17  1:45 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev

On Tue, Sep 17, 2019 at 11:04 AM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
>
> On Tue, Sep 03, 2019 at 08:15:56PM +1000, Oliver O'Halloran wrote:
> > Currently we print a stack trace in the event handler to help with
> > debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
> > so we want to prevent printing the stack trace unless we know it's due to
> > an actual device error. To accomplish this, we can save a stack trace at
> > the point of detection and only print it once the EEH recovery handler has
> > determined the freeze was due to an actual error.
> >
> > Since the whole point of this is to prevent spurious EEH output we also
> > move a few prints out of the detection thread, or mark them as pr_debug
> > so anyone interested can get output from the eeh_check_dev_failure()
> > if they want.
> >
> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
>
> I think this is a good change, and even in the normal case it will place
> the stacktrace closer to the rest of the recovery information.
>
> But, I think it would make more sense to put the stacktrace into the
> struct eeh_event, rather than the struct eeh_pe. Is there some reason
> we can't do that? (It would save a fair bit of memory!)

Two reasons:

1) the eeh_event structures are allocated with GFP_ATOMIC since
eeh_dev_check_failure() can be called from any context. Minimising the
number of atomic allocations we do is a good idea as a matter of
course.
2) We don't pass the eeh_event structure to the event handler
function. I guess we could, but... eh

I don't see the memory saving as hugely significant either. There's
always fewer eeh_pe structures than there are PCI devices since some
will share PEs (e.g. switches, multifunction cards) so you'd be saving
a dozen KB at most.

root@zaius1:~# lspci | wc -l
59
root@zaius1:~# echo $(( $(lspci | wc -l) * 64 * 8))
30208

I think we'll live...

>
> Cheers,
> Sam.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check
  2019-09-03 10:16 ` [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check Oliver O'Halloran
@ 2019-09-17  3:15   ` Sam Bobroff
  2019-09-17  3:36     ` Oliver O'Halloran
  0 siblings, 1 reply; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  3:15 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4171 bytes --]

On Tue, Sep 03, 2019 at 08:16:03PM +1000, Oliver O'Halloran wrote:
> Detecting an frozen EEH PE usually occurs when an MMIO load returns a 0xFFs
> response. When performing EEH testing using the EEH error injection feature
> available on some platforms there is no simple way to kick-off the kernel's
> recovery process since any accesses from userspace (usually /dev/mem) will
> bypass the MMIO helpers in the kernel which check if a 0xFF response is due
> to an EEH freeze or not.
> 
> If a device contains a 0xFF byte in it's config space it's possible to
> trigger the recovery process via config space read from userspace, but this
> is not a reliable method. If a driver is bound to the device an in use it
> will frequently trigger the MMIO check, but this is also inconsistent.
> 
> To solve these problems this patch adds a debugfs file called
> "eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs
> eeh_dev_check_failure() on it. This is the same check that's done when the
> kernel gets a 0xFF result from an config or MMIO read with the added
> benifit that it can be reliably triggered from userspace.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks good, and I tested it with the next patch and it seems to work.

But I think you should make it clear that this does not work with
the hardware "EEH error injection" facility accessible via debugfs in
err_injct (that doesn't seem clear to me from the commit message).

Tested-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
> ---
>  arch/powerpc/kernel/eeh.c | 61 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 46d17817b438..ace1c5a6b8ed 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1873,6 +1873,64 @@ static const struct file_operations eeh_force_recover_fops = {
>  	.llseek	= no_llseek,
>  	.write	= eeh_force_recover_write,
>  };
> +
> +static ssize_t eeh_debugfs_dev_usage(struct file *filp,
> +				char __user *user_buf,
> +				size_t count, loff_t *ppos)
> +{
> +	static const char usage[] = "input format: <domain>:<bus>:<dev>.<fn>\n";
> +
> +	return simple_read_from_buffer(user_buf, count, ppos,
> +				       usage, sizeof(usage) - 1);
> +}
> +
> +static ssize_t eeh_dev_check_write(struct file *filp,
> +				const char __user *user_buf,
> +				size_t count, loff_t *ppos)
> +{
> +	uint32_t domain, bus, dev, fn;
> +	struct pci_dev *pdev;
> +	struct eeh_dev *edev;
> +	char buf[20];
> +	int ret;
> +
> +	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> +	if (!ret)
> +		return -EFAULT;
> +
> +	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
> +	if (ret != 4) {
> +		pr_err("%s: expected 4 args, got %d\n", __func__, ret);
> +		return -EINVAL;
> +	}
> +
> +	pdev = pci_get_domain_bus_and_slot(domain, bus, (dev << 3) | fn);
> +	if (!pdev)
> +		return -ENODEV;
> +
> +	edev = pci_dev_to_eeh_dev(pdev);
> +	if (!edev) {
> +		pci_err(pdev, "No eeh_dev for this device!\n");
> +		pci_dev_put(pdev);
> +		return -ENODEV;
> +	}
> +
> +	ret = eeh_dev_check_failure(edev);
> +	pci_info(pdev, "eeh_dev_check_failure(%04x:%02x:%02x.%01x) = %d\n",
> +			domain, bus, dev, fn, ret);
> +
> +	pci_dev_put(pdev);
> +
> +	return count;
> +}
> +
> +static const struct file_operations eeh_dev_check_fops = {
> +	.open	= simple_open,
> +	.llseek	= no_llseek,
> +	.write	= eeh_dev_check_write,
> +	.read   = eeh_debugfs_dev_usage,
> +};
> +
>  #endif
>  
>  static int __init eeh_init_proc(void)
> @@ -1888,6 +1946,9 @@ static int __init eeh_init_proc(void)
>  		debugfs_create_bool("eeh_disable_recovery", 0600,
>  				powerpc_debugfs_root,
>  				&eeh_debugfs_no_recover);
> +		debugfs_create_file_unsafe("eeh_dev_check", 0600,
> +				powerpc_debugfs_root, NULL,
> +				&eeh_dev_check_fops);
>  		debugfs_create_file_unsafe("eeh_force_recover", 0600,
>  				powerpc_debugfs_root, NULL,
>  				&eeh_force_recover_fops);
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 13/14] powerpc/eeh: Add a eeh_dev_break debugfs interface
  2019-09-03 10:16 ` [PATCH 13/14] powerpc/eeh: Add a eeh_dev_break debugfs interface Oliver O'Halloran
@ 2019-09-17  3:19   ` Sam Bobroff
  0 siblings, 0 replies; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  3:19 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 6192 bytes --]

On Tue, Sep 03, 2019 at 08:16:04PM +1000, Oliver O'Halloran wrote:
> Add an interface to debugfs for generating an EEH event on a given device.
> This works by disabling memory accesses to and from the device by setting
> the PCI_COMMAND register (or the VF Memory Space Enable on the parent PF).
> 
> This is a somewhat portable alternative to using the platform specific
> error injection mechanisms since those tend to be either hard to use, or
> straight up broken. For pseries the interfaces also requires the use of
> /dev/mem which is probably going to go away in a post-LOCKDOWN world
> (and it's a horrific hack to begin with) so moving to a kernel-provided
> interface makes sense and provides a sane, cross-platform interface for
> userspace so we can write more generic testing scripts.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks good to me. Tested with the previous patch.

Tested-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>

> ---
>  arch/powerpc/kernel/eeh.c | 139 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 138 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index ace1c5a6b8ed..a55d2f01da1d 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1894,7 +1894,8 @@ static ssize_t eeh_dev_check_write(struct file *filp,
>  	char buf[20];
>  	int ret;
>  
> -	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> +	memset(buf, 0, sizeof(buf));
> +	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
>  	if (!ret)
>  		return -EFAULT;
>  
> @@ -1931,6 +1932,139 @@ static const struct file_operations eeh_dev_check_fops = {
>  	.read   = eeh_debugfs_dev_usage,
>  };
>  
> +static int eeh_debugfs_break_device(struct pci_dev *pdev)
> +{
> +	struct resource *bar = NULL;
> +	void __iomem *mapped;
> +	u16 old, bit;
> +	int i, pos;
> +
> +	/* Do we have an MMIO BAR to disable? */
> +	for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
> +		struct resource *r = &pdev->resource[i];
> +
> +		if (!r->flags || !r->start)
> +			continue;
> +		if (r->flags & IORESOURCE_IO)
> +			continue;
> +		if (r->flags & IORESOURCE_UNSET)
> +			continue;
> +
> +		bar = r;
> +		break;
> +	}
> +
> +	if (!bar) {
> +		pci_err(pdev, "Unable to find Memory BAR to cause EEH with\n");
> +		return -ENXIO;
> +	}
> +
> +	pci_err(pdev, "Going to break: %pR\n", bar);
> +
> +	if (pdev->is_virtfn) {
> +#ifndef CONFIG_IOV
> +		return -ENXIO;
> +#else
> +		/*
> +		 * VFs don't have a per-function COMMAND register, so the best
> +		 * we can do is clear the Memory Space Enable bit in the PF's
> +		 * SRIOV control reg.
> +		 *
> +		 * Unfortunately, this requires that we have a PF (i.e doesn't
> +		 * work for a passed-through VF) and it has the potential side
> +		 * effect of also causing an EEH on every other VF under the
> +		 * PF. Oh well.
> +		 */
> +		pdev = pdev->physfn;
> +		if (!pdev)
> +			return -ENXIO; /* passed through VFs have no PF */
> +
> +		pos  = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> +		pos += PCI_SRIOV_CTRL;
> +		bit  = PCI_SRIOV_CTRL_MSE;
> +#endif /* !CONFIG_IOV */
> +	} else {
> +		bit = PCI_COMMAND_MEMORY;
> +		pos = PCI_COMMAND;
> +	}
> +
> +	/*
> +	 * Process here is:
> +	 *
> +	 * 1. Disable Memory space.
> +	 *
> +	 * 2. Perform an MMIO to the device. This should result in an error
> +	 *    (CA  / UR) being raised by the device which results in an EEH
> +	 *    PE freeze. Using the in_8() accessor skips the eeh detection hook
> +	 *    so the freeze hook so the EEH Detection machinery won't be
> +	 *    triggered here. This is to match the usual behaviour of EEH
> +	 *    where the HW will asyncronously freeze a PE and it's up to
> +	 *    the kernel to notice and deal with it.
> +	 *
> +	 * 3. Turn Memory space back on. This is more important for VFs
> +	 *    since recovery will probably fail if we don't. For normal
> +	 *    the COMMAND register is reset as a part of re-initialising
> +	 *    the device.
> +	 *
> +	 * Breaking stuff is the point so who cares if it's racy ;)
> +	 */
> +	pci_read_config_word(pdev, pos, &old);
> +
> +	mapped = ioremap(bar->start, PAGE_SIZE);
> +	if (!mapped) {
> +		pci_err(pdev, "Unable to map MMIO BAR %pR\n", bar);
> +		return -ENXIO;
> +	}
> +
> +	pci_write_config_word(pdev, pos, old & ~bit);
> +	in_8(mapped);
> +	pci_write_config_word(pdev, pos, old);
> +
> +	iounmap(mapped);
> +
> +	return 0;
> +}
> +
> +static ssize_t eeh_dev_break_write(struct file *filp,
> +				const char __user *user_buf,
> +				size_t count, loff_t *ppos)
> +{
> +	uint32_t domain, bus, dev, fn;
> +	struct pci_dev *pdev;
> +	char buf[20];
> +	int ret;
> +
> +	memset(buf, 0, sizeof(buf));
> +	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
> +	if (!ret)
> +		return -EFAULT;
> +
> +	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
> +	if (ret != 4) {
> +		pr_err("%s: expected 4 args, got %d\n", __func__, ret);
> +		return -EINVAL;
> +	}
> +
> +	pdev = pci_get_domain_bus_and_slot(domain, bus, (dev << 3) | fn);
> +	if (!pdev)
> +		return -ENODEV;
> +
> +	ret = eeh_debugfs_break_device(pdev);
> +	pci_dev_put(pdev);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	return count;
> +}
> +
> +static const struct file_operations eeh_dev_break_fops = {
> +	.open	= simple_open,
> +	.llseek	= no_llseek,
> +	.write	= eeh_dev_break_write,
> +	.read   = eeh_debugfs_dev_usage,
> +};
> +
>  #endif
>  
>  static int __init eeh_init_proc(void)
> @@ -1949,6 +2083,9 @@ static int __init eeh_init_proc(void)
>  		debugfs_create_file_unsafe("eeh_dev_check", 0600,
>  				powerpc_debugfs_root, NULL,
>  				&eeh_dev_check_fops);
> +		debugfs_create_file_unsafe("eeh_dev_break", 0600,
> +				powerpc_debugfs_root, NULL,
> +				&eeh_dev_break_fops);
>  		debugfs_create_file_unsafe("eeh_force_recover", 0600,
>  				powerpc_debugfs_root, NULL,
>  				&eeh_force_recover_fops);
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/14] powerpc/eeh: Defer printing stack trace
  2019-09-17  1:45     ` Oliver O'Halloran
@ 2019-09-17  3:35       ` Sam Bobroff
  2019-09-17  3:38         ` Oliver O'Halloran
  0 siblings, 1 reply; 34+ messages in thread
From: Sam Bobroff @ 2019-09-17  3:35 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2319 bytes --]

On Tue, Sep 17, 2019 at 11:45:14AM +1000, Oliver O'Halloran wrote:
> On Tue, Sep 17, 2019 at 11:04 AM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
> >
> > On Tue, Sep 03, 2019 at 08:15:56PM +1000, Oliver O'Halloran wrote:
> > > Currently we print a stack trace in the event handler to help with
> > > debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
> > > so we want to prevent printing the stack trace unless we know it's due to
> > > an actual device error. To accomplish this, we can save a stack trace at
> > > the point of detection and only print it once the EEH recovery handler has
> > > determined the freeze was due to an actual error.
> > >
> > > Since the whole point of this is to prevent spurious EEH output we also
> > > move a few prints out of the detection thread, or mark them as pr_debug
> > > so anyone interested can get output from the eeh_check_dev_failure()
> > > if they want.
> > >
> > > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> >
> > I think this is a good change, and even in the normal case it will place
> > the stacktrace closer to the rest of the recovery information.
> >
> > But, I think it would make more sense to put the stacktrace into the
> > struct eeh_event, rather than the struct eeh_pe. Is there some reason
> > we can't do that? (It would save a fair bit of memory!)
> 
> Two reasons:
> 
> 1) the eeh_event structures are allocated with GFP_ATOMIC since
> eeh_dev_check_failure() can be called from any context. Minimising the
> number of atomic allocations we do is a good idea as a matter of
> course.

Yes, but I meant directly inside eeh_event so there wouldn't be a second
allocation. It would just be a bit bigger.

> 2) We don't pass the eeh_event structure to the event handler
> function. I guess we could, but... eh
> 
> I don't see the memory saving as hugely significant either. There's
> always fewer eeh_pe structures than there are PCI devices since some
> will share PEs (e.g. switches, multifunction cards) so you'd be saving
> a dozen KB at most.
> 
> root@zaius1:~# lspci | wc -l
> 59
> root@zaius1:~# echo $(( $(lspci | wc -l) * 64 * 8))
> 30208
> 
> I think we'll live...

Sure, I don't have very strong feelings about it either way.

> >
> > Cheers,
> > Sam.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check
  2019-09-17  3:15   ` Sam Bobroff
@ 2019-09-17  3:36     ` Oliver O'Halloran
  2019-09-17  4:23       ` Oliver O'Halloran
  0 siblings, 1 reply; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-17  3:36 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev

On Tue, Sep 17, 2019 at 1:16 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
>
> On Tue, Sep 03, 2019 at 08:16:03PM +1000, Oliver O'Halloran wrote:
> > Detecting an frozen EEH PE usually occurs when an MMIO load returns a 0xFFs
> > response. When performing EEH testing using the EEH error injection feature
> > available on some platforms there is no simple way to kick-off the kernel's
> > recovery process since any accesses from userspace (usually /dev/mem) will
> > bypass the MMIO helpers in the kernel which check if a 0xFF response is due
> > to an EEH freeze or not.
> >
> > If a device contains a 0xFF byte in it's config space it's possible to
> > trigger the recovery process via config space read from userspace, but this
> > is not a reliable method. If a driver is bound to the device an in use it
> > will frequently trigger the MMIO check, but this is also inconsistent.
> >
> > To solve these problems this patch adds a debugfs file called
> > "eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs
> > eeh_dev_check_failure() on it. This is the same check that's done when the
> > kernel gets a 0xFF result from an config or MMIO read with the added
> > benifit that it can be reliably triggered from userspace.
> >
> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
>
> Looks good, and I tested it with the next patch and it seems to work.
>
> But I think you should make it clear that this does not work with
> the hardware "EEH error injection" facility accessible via debugfs in
> err_injct (that doesn't seem clear to me from the commit message).

It's not intended to be a separate mechanisms in the long term. I'm
planning on converting this interface to make use the platform defined
error injection mechanism once I can find how to use the PAPR ones
reliably. The idea is to use this as a generic "cause an EEH to happen
on this device" interface for userspace which we can use in test
scripts and the like.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/14] powerpc/eeh: Defer printing stack trace
  2019-09-17  3:35       ` Sam Bobroff
@ 2019-09-17  3:38         ` Oliver O'Halloran
  0 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-17  3:38 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev

On Tue, Sep 17, 2019 at 1:35 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
>
> On Tue, Sep 17, 2019 at 11:45:14AM +1000, Oliver O'Halloran wrote:
> >
> > Two reasons:
> >
> > 1) the eeh_event structures are allocated with GFP_ATOMIC since
> > eeh_dev_check_failure() can be called from any context. Minimising the
> > number of atomic allocations we do is a good idea as a matter of
> > course.
>
> Yes, but I meant directly inside eeh_event so there wouldn't be a second
> allocation. It would just be a bit bigger.

Right, my point was that increasing the allocation size also increases
the chance of an allocation failure.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event()
  2019-09-17  1:00   ` Sam Bobroff
@ 2019-09-17  4:20     ` Oliver O'Halloran
  0 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-17  4:20 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev

On Tue, Sep 17, 2019 at 11:01 AM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
>
> On Tue, Sep 03, 2019 at 08:15:55PM +1000, Oliver O'Halloran wrote:
> > When a device is surprise removed while undergoing IO we will probably
> > get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze
> > is detected we send a recovery event to the EEH worker thread which will
> > notify drivers, and perform recovery as needed.
> >
> > In the event of a hot-remove we don't want recovery to occur since there
> > isn't a device to recover. The recovery process is fairly long due to
> > the number of wait states (required by PCIe) which causes problems when
> > devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).
> >
> > To determine if we need to skip the recovery process we can use the
> > get_adapter_state() operation of the hotplug_slot to determine if the
> > slot contains a device or not, and if the slot is empty we can skip
> > recovery entirely.
> >
> > One thing to note is that the slot being EEH frozen does not prevent the
> > hotplug driver from working. We don't have the EEH recovery thread
> > remove any of the devices since it's assumed that the hotplug driver
> > will handle tearing down the slot state.
> >
> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
>
> Looks good, but some comments, below.
>
> > ---
> >  arch/powerpc/kernel/eeh_driver.c | 60 ++++++++++++++++++++++++++++++++
> >  1 file changed, 60 insertions(+)
> >
> > diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> > index 18a69fac4d80..52ce7584af43 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/irq.h>
> >  #include <linux/module.h>
> >  #include <linux/pci.h>
> > +#include <linux/pci_hotplug.h>
> >  #include <asm/eeh.h>
> >  #include <asm/eeh_event.h>
> >  #include <asm/ppc-pci.h>
> > @@ -769,6 +770,46 @@ static void eeh_pe_cleanup(struct eeh_pe *pe)
> >       }
> >  }
> >
> > +/**
> > + * eeh_check_slot_presence - Check if a device is still present in a slot
> > + * @pdev: pci_dev to check
> > + *
> > + * This function may return a false positive if we can't determine the slot's
> > + * presence state. This might happen for for PCIe slots if the PE containing
> > + * the upstream bridge is also frozen, or the bridge is part of the same PE
> > + * as the device.
> > + *
> > + * This shouldn't happen often, but you might see it if you hotplug a PCIe
> > + * switch.
> > + */
>
> I don't think the function name is very good; it does check the slot but
> it doesn't tell you what a true result means -- but I don't see an
> obviously great alternative either. If it can return false positives, it's
> really testing for empty so maybe 'eeh_slot_definitely_empty()' or
> 'eeh_slot_maybe_populated()'?

I don't see a better name either. I thought the meaning was fairly
clear when looked at in the context of the caller though.

> > +     /*
> > +      * When devices are hot-removed we might get an EEH due to
> > +      * a driver attempting to touch the MMIO space of a removed
> > +      * device. In this case we don't have a device to recover
> > +      * so suppress the event if we can't find any present devices.
> > +      *
> > +      * The hotplug driver should take care of tearing down the
> > +      * device itself.
> > +      */
> > +     eeh_for_each_pe(pe, tmp_pe)
> > +             eeh_pe_for_each_dev(tmp_pe, edev, tmp)
> > +                     if (eeh_slot_presence_check(edev->pdev))
> > +                             devices++;
>
> In other parts of the EEH code we do a get_device() on edev->pdev before
> passing it around, it might be good to do that here too.

I don't see any calls to get_device (or pci_dev_get) in arch/powerpc/kernel/

I agree that we probably should be taking a ref to the pci_dev, but
IIRC you we're working on a series to do just that so I figured I
should keep things in line with what's there currently.

> > +     if (!devices)
> > +             goto out; /* nothing to recover */
>
> Does this handle an empty, but frozen, PHB correctly? (Can that happen?)

Probably not, but I don't think we handle the case well (at all?)
currently. In order to start the recovery process we need something to
flag that an error has occurred on the PHB and without a device being
present I don't see where that would come from. It might work on P8
where we have PHB error interrupts, but I don't think those can fire
if the PHB is fenced.

Oliver

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check
  2019-09-17  3:36     ` Oliver O'Halloran
@ 2019-09-17  4:23       ` Oliver O'Halloran
  0 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-17  4:23 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev

On Tue, Sep 17, 2019 at 1:36 PM Oliver O'Halloran <oohall@gmail.com> wrote:
>
> On Tue, Sep 17, 2019 at 1:16 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
> >
> > On Tue, Sep 03, 2019 at 08:16:03PM +1000, Oliver O'Halloran wrote:
> > > Detecting an frozen EEH PE usually occurs when an MMIO load returns a 0xFFs
> > > response. When performing EEH testing using the EEH error injection feature
> > > available on some platforms there is no simple way to kick-off the kernel's
> > > recovery process since any accesses from userspace (usually /dev/mem) will
> > > bypass the MMIO helpers in the kernel which check if a 0xFF response is due
> > > to an EEH freeze or not.
> > >
> > > If a device contains a 0xFF byte in it's config space it's possible to
> > > trigger the recovery process via config space read from userspace, but this
> > > is not a reliable method. If a driver is bound to the device an in use it
> > > will frequently trigger the MMIO check, but this is also inconsistent.
> > >
> > > To solve these problems this patch adds a debugfs file called
> > > "eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs
> > > eeh_dev_check_failure() on it. This is the same check that's done when the
> > > kernel gets a 0xFF result from an config or MMIO read with the added
> > > benifit that it can be reliably triggered from userspace.
> > >
> > > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> >
> > Looks good, and I tested it with the next patch and it seems to work.
> >
> > But I think you should make it clear that this does not work with
> > the hardware "EEH error injection" facility accessible via debugfs in
> > err_injct (that doesn't seem clear to me from the commit message).
>
> It's not intended to be a separate mechanisms in the long term. I'm
> planning on converting this interface to make use the platform defined
> error injection mechanism once I can find how to use the PAPR ones
> reliably. The idea is to use this as a generic "cause an EEH to happen
> on this device" interface for userspace which we can use in test
> scripts and the like.

Urgh, I'm tired and thought this was the eeh_debugfs_break patch.

This (the _check) debugfs interface does work with the HW error
injection facilities. After the HW injects an error the PE is frozen,
but the kernel doesn't notice until something runs
eeh_dev_check_failure(). This interface gives userspace a reliable way
to do that rather than relying on drivers doing MMIO, or somewhere in
config space containing a 0xFF.

Oliver

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/14] powernv/eeh: Use generic code to handle hot resets
  2019-09-17  1:15   ` Sam Bobroff
@ 2019-09-17  7:30     ` Oliver O'Halloran
  0 siblings, 0 replies; 34+ messages in thread
From: Oliver O'Halloran @ 2019-09-17  7:30 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev

On Tue, Sep 17, 2019 at 11:15 AM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
>
> On Tue, Sep 03, 2019 at 08:15:58PM +1000, Oliver O'Halloran wrote:
> > When we reset PCI devices managed by a hotplug driver the reset may
> > generate spurious hotplug events that cause the PCI device we're resetting
> > to be torn down accidently. This is a problem for EEH (when the driver is
> > EEH aware) since we want to leave the OS PCI device state intact so that
> > the device can be re-set without losing any resources (network, disks,
> > etc) provided by the driver.
> >
> > Generic PCI code provides the pci_bus_error_reset() function to handle
> > resetting a PCI Device (or bus) by using the reset method provided by the
> > hotplug slot driver. We can use this function if the EEH core has
> > requested a hot reset (common case) without tripping over the hotplug
> > driver.

> Could you explain a bit more about how this change solves the problem?
> Is it that the hotplug driver's reset method doesn't cause spurious
> hotplug events?

Yes, see the comment below.

> > -     if (pci_is_root_bus(bus) ||
> > -         pci_is_root_bus(bus->parent))
> > +     if (pci_is_root_bus(bus))
> >               return pnv_eeh_root_reset(hose, option);
> >
> > +     /*
> > +      * For hot resets try use the generic PCI error recovery reset
> > +      * functions. These correctly handles the case where the secondary
> > +      * bus is behind a hotplug slot and it will use the slot provided
> > +      * reset methods to prevent spurious hotplug events during the reset.
> > +      *
> > +      * Fundemental resets need to be handled internally to EEH since the
> > +      * PCI core doesn't really have a concept of a fundemental reset,
> > +      * mainly because there's no standard way to generate one. Only a
> > +      * few devices require an FRESET so it should be fine.
> > +      */
> > +     if (option != EEH_RESET_FUNDAMENTAL) {
> > +             /*
> > +              * NB: Skiboot and pnv_eeh_bridge_reset() also no-op the
> > +              *     de-assert step. It's like the OPAL reset API was
> > +              *     poorly designed or something...
> > +              */
> > +             if (option == EEH_RESET_DEACTIVATE)
> > +                     return 0;
>
> It looks like this will prevent pnv_eeh_root_reset(bus->parent) (below)
> from being called for EEH_RESET_DEACTIVATE, when it was before the
> patch. Is that right?

I agree it's a little awkward, but it works fine. OPAL has always
treated the resets defined by opal_pci_reset() as being edge-triggered
rather than level triggered since the de-assert step has always been
implemented as a no-op. This behaviour is effectively part of the ABI
between OPAL and the kernel since the kernel skips the de-assert step
in pnv_eeh_bridge_reset(). Although pnv_eeh_reset() uses
pnv_eeh_reset_root() to reset the secondary bus of the root port
pnv_pci_reset_secondary_bus() still uses the bridge reset.

I should probably update the OPAL API docs to mention that. Oh well.

> > +             rc = pci_bus_error_reset(bus->self);
> > +             if (!rc)
> > +                     return 0;
>
> Is it correct to fall through and try a different reset if this fails?

The only reason I can see for the generic code failing is when config
space to the bridge is blocked by the EEH core. The internal
pnv_eeh_bridge_reset() function has the option of calling
opal_pci_reset() or using the internal EEH config accessors (which
aren't filtered) so falling back makes sense to me.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes
  2019-09-03 10:15 ` [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes Oliver O'Halloran
  2019-09-17  0:43   ` Sam Bobroff
@ 2019-09-19 10:25   ` Michael Ellerman
  1 sibling, 0 replies; 34+ messages in thread
From: Michael Ellerman @ 2019-09-19 10:25 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: sbobroff, Oliver O'Halloran

On Tue, 2019-09-03 at 10:15:52 UTC, Oliver O'Halloran wrote:
> When the last device in an eeh_pe is removed the eeh_pe structure itself
> (and any empty parents) are freed since they are no longer needed. This
> results in a crash when a hotplug driver is involved since the following
> may occur:
> 
> 1. Device is suprise removed.
> 2. Driver performs an MMIO, which fails and queues and eeh_event.
> 3. Hotplug driver receives a hotplug interrupt and removes any
>    pci_devs that were under the slot.
> 4. pci_dev is torn down and the eeh_pe is freed.
> 5. The EEH event handler thread processes the eeh_event and crashes
>    since the eeh_pe pointer in the eeh_event structure is no
>    longer valid.
> 
> Crashing is generally considered poor form. Instead of doing that use
> the fact PEs are marked as EEH_PE_INVALID to keep them around until the
> end of the recovery cycle, at which point we can safely prune any empty
> PEs.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/799abe283e5103d48e079149579b4f167c95ea0e

cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-09-19 10:40 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-03 10:15 EEH + hotplug fixes Oliver O'Halloran
2019-09-03 10:15 ` [PATCH 01/14] powerpc/eeh: Clean up EEH PEs after recovery finishes Oliver O'Halloran
2019-09-17  0:43   ` Sam Bobroff
2019-09-19 10:25   ` Michael Ellerman
2019-09-03 10:15 ` [PATCH 02/14] powerpc/eeh: Fix race when freeing PDNs Oliver O'Halloran
2019-09-17  0:50   ` Sam Bobroff
2019-09-03 10:15 ` [PATCH 03/14] powerpc/eeh: Make permanently failed devices non-actionable Oliver O'Halloran
2019-09-17  0:51   ` Sam Bobroff
2019-09-03 10:15 ` [PATCH 04/14] powerpc/eeh: Check slot presence state in eeh_handle_normal_event() Oliver O'Halloran
2019-09-17  1:00   ` Sam Bobroff
2019-09-17  4:20     ` Oliver O'Halloran
2019-09-03 10:15 ` [PATCH 05/14] powerpc/eeh: Defer printing stack trace Oliver O'Halloran
2019-09-17  1:04   ` Sam Bobroff
2019-09-17  1:45     ` Oliver O'Halloran
2019-09-17  3:35       ` Sam Bobroff
2019-09-17  3:38         ` Oliver O'Halloran
2019-09-03 10:15 ` [PATCH 06/14] powerpc/eeh: Remove stale CAPI comment Oliver O'Halloran
2019-09-03 14:09   ` Andrew Donnellan
2019-09-17  1:04   ` Sam Bobroff
2019-09-03 10:15 ` [PATCH 07/14] powernv/eeh: Use generic code to handle hot resets Oliver O'Halloran
2019-09-17  1:15   ` Sam Bobroff
2019-09-17  7:30     ` Oliver O'Halloran
2019-09-03 10:15 ` [PATCH 08/14] pci-hotplug/pnv_php: Add a reset_slot() callback Oliver O'Halloran
2019-09-03 10:16 ` [PATCH 09/14] pci-hotplug/pnv_php: Add support for IODA3 Power9 PHBs Oliver O'Halloran
2019-09-03 10:16 ` [PATCH 10/14] pci-hotplug/pnv_php: Add attention indicator support Oliver O'Halloran
2019-09-03 10:16 ` [PATCH 11/14] powerpc/eeh: Set attention indicator while recovering Oliver O'Halloran
2019-09-17  1:23   ` Sam Bobroff
2019-09-03 10:16 ` [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check Oliver O'Halloran
2019-09-17  3:15   ` Sam Bobroff
2019-09-17  3:36     ` Oliver O'Halloran
2019-09-17  4:23       ` Oliver O'Halloran
2019-09-03 10:16 ` [PATCH 13/14] powerpc/eeh: Add a eeh_dev_break debugfs interface Oliver O'Halloran
2019-09-17  3:19   ` Sam Bobroff
2019-09-03 10:16 ` [PATCH 14/14] selftests/powerpc: Add basic EEH selftest Oliver O'Halloran

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).