All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST
@ 2017-03-01 11:08 Vaibhav Jain
  2017-03-01 11:08 ` [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:08 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey, Frederic Barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

v2 changes:
* Moved definition of eeh_pe_reset_freeze_counter() from eeh.h to eeh_pe.c to
  avoid adding a header dependency to 'pci-bridge.h'. The function is now
  marked as an exported gpl symbol.

* Incorporated changes as suggested by Russell Currey:
- Inserted logging for PHB and PE number inside eeh_pe_reset_freeze_counter()
- Suffixed all the function names used in comments/patch-descriptions with '()'
- Removed an un-needed conditional check of '<0' in eeh_handle_normal_event()
- Rephrased the function comment for eeh_pe_update_freeze_counter() and
  eeh_pe_reset_freeze_counter()
- Brace-wrapped a single line statement at end of eeh_pe_update_freeze_counter()

v1:
Presently to flash a cxl adapter with a new FPGA image a warm pcie reset is
requested on the adapter, once the bitstream is loaded to card flash memory.
This issues a pci-fundamental reset to the card slot signaling the card
controller to reconfigure the fpga with the new bitstream. However
pci-fundamental reset of the slot also results in a fenced PHB that raises an
eeh event triggering the core eeh flow.

The core eeh also maintains a counter named freeze_count for each PE inside
struct eeh_pe. The counter is incremented every time an eeh error is reported on
the PE domain and if the counter reaches the threshold limit, the device is
permanently disabled. The threshold limit is enforced by the variable
eeh_max_freeze variable that can be manipulated via debugfs.

This creates problem for cxl adapters as:

* This puts a limit on number of times a fpga image can be re-flashed which is
  by default 5-time/Hour.

* Since after each reset the adapter can potentially acquire a new personality,
  the freeze_count of older fpga image shouldn't be carried over to newer image.

To fix these problems the proposed patch-set introduces a new function named
eeh_pe_reset_freeze_counter that resets freeze counter for the eeh_pe struct.
This function can then be called by the cxl module before issuing a
pci-fundamental reset to the card slot for loading the new fpga image.

Test Runs
==========
* Without the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
bash: /sys/class/cxl/card0/reset: No such file or directory
# dmesg
...
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
EEH: PHB#22-PE#0 has failed 2 times in the last hour
...
EEH: PHB#22-PE#0 has failed 3 times in the last hour
...
EEH: PHB#22-PE#0 has failed 4 times in the last hour
...
EEH: PHB#22-PE#0 has failed 5 times in the last hour
...
EEH: PHB#22-PE#0 has failed 6 times in the last hour
and has been permanently disabled.

* With the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
# dmesg
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour

---
Vaibhav Jain (3):
  powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update
    freeze_count
  powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  cxl: Reset freeze counters before adapter PERST for flashing new image

 arch/powerpc/include/asm/eeh.h   |  7 ++++-
 arch/powerpc/kernel/eeh_driver.c | 20 ++-----------
 arch/powerpc/kernel/eeh_pe.c     | 64 ++++++++++++++++++++++++++++++----------
 drivers/misc/cxl/pci.c           | 14 +++++++++
 4 files changed, 72 insertions(+), 33 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
  2017-03-01 11:08 [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
@ 2017-03-01 11:08 ` Vaibhav Jain
  2017-03-01 11:08 ` [RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
  2017-03-01 11:08 ` [RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
  2 siblings, 0 replies; 4+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:08 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey
  Cc: Vaibhav Jain, Frederic Barrat, Andrew Donnellan, Ian Munsie,
	Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan

This patch introduces a new function eeh_pe_update_freeze_counter()
replacing existing function eeh_pe_update_time_stamp(). The new
function also manages the value of reeze_count along with tstamp to
track the number of times the PE roze in last one hour and if the
freeze_count > eeh_max_freezes then eports an error(-ENOTRECOVERABLE)
to indicate that the PE should be ermanently disabled.

This patch should not introduce any behavioral change.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:

v1 -> v2
Changes as suggested by Russell Currey:
- Suffixed function names with '()'
- Dropped '<0' conditional check inside eeh_handle_normal_event()
- Rephrased the comment for function eeh_pe_update_freeze_counter()
- Brace-wrapped a single line statement at end of
  eeh_pe_update_freeze_counter()
---
 arch/powerpc/include/asm/eeh.h   |  2 +-
 arch/powerpc/kernel/eeh_driver.c | 20 +++--------------
 arch/powerpc/kernel/eeh_pe.c     | 47 +++++++++++++++++++++++++++-------------
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8e37b71..68806be 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
 struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
-void eeh_pe_update_time_stamp(struct eeh_pe *pe);
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
 void *eeh_pe_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index b948871..8a088ea 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
-	eeh_pe_update_time_stamp(pe);
-	pe->freeze_count++;
-	if (pe->freeze_count > eeh_max_freezes)
-		goto excess_failures;
+	/* Update freeze counters and see if we have tripped max-freeze limit */
+	if (eeh_pe_update_freeze_counter(pe))
+		goto perm_error;
 	pr_warn("EEH: This PCI device has failed %d times in the last hour\n",
 		pe->freeze_count);
 
@@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 
 	return;
 
-excess_failures:
-	/*
-	 * About 90% of all real-life EEH failures in the field
-	 * are due to poorly seated PCI cards. Only 10% or so are
-	 * due to actual, failed cards.
-	 */
-	pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
-	       "last hour and has been permanently disabled.\n"
-	       "Please try reseating or replacing it.\n",
-		pe->phb->global_number, pe->addr,
-		pe->freeze_count);
-	goto perm_error;
-
 hard_fail:
 	pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
 	       "Please try reseating or replacing it\n",
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index cc4b206..d367c16 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,30 +504,47 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 }
 
 /**
- * eeh_pe_update_time_stamp - Update PE's frozen time stamp
+ * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
+ * and freeze counter
  * @pe: EEH PE
  *
- * We have time stamp for each PE to trace its time of getting
- * frozen in last hour. The function should be called to update
- * the time stamp on first error of the specific PE. On the other
- * handle, we needn't account for errors happened in last hour.
+ * We have a freeze counter and time stamp for each PE to trace
+ * number of times the PE was frozen in the last hour. This function
+ * updates the PE's freeze counter and returns an error if its greater
+ * than eeh_max_freezes. The function should be called to once every
+ * time a specific PE freezes.
  */
-void eeh_pe_update_time_stamp(struct eeh_pe *pe)
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
 {
 	struct timeval tstamp;
 
-	if (!pe) return;
+	if (!pe)
+		return -EINVAL;
+
+	do_gettimeofday(&tstamp);
+	if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
+		pe->tstamp = tstamp;
+		pe->freeze_count = 1;
+
+	} else if (pe->freeze_count >= eeh_max_freezes) {
+		pe->freeze_count++;
+		/*
+		 * About 90% of all real-life EEH failures in the field
+		 * are due to poorly seated PCI cards. Only 10% or so are
+		 * due to actual, failed cards.
+		 */
+		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
+		       "last hour and has been permanently disabled.\n"
+		       "Please try reseating or replacing it.\n",
+		       pe->phb->global_number, pe->addr,
+		       pe->freeze_count);
+		return -ENOTRECOVERABLE;
 
-	if (pe->freeze_count <= 0) {
-		pe->freeze_count = 0;
-		do_gettimeofday(&pe->tstamp);
 	} else {
-		do_gettimeofday(&tstamp);
-		if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
-			pe->tstamp = tstamp;
-			pe->freeze_count = 0;
-		}
+		pe->freeze_count++;
 	}
+
+	return 0;
 }
 
 /**
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-01 11:08 [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
  2017-03-01 11:08 ` [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
@ 2017-03-01 11:08 ` Vaibhav Jain
  2017-03-01 11:08 ` [RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
  2 siblings, 0 replies; 4+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:08 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey
  Cc: Vaibhav Jain, Frederic Barrat, Andrew Donnellan, Ian Munsie,
	Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan

This patch introduces function eeh_pe_reset_freeze_counter() which can
be used to reset the PE's freeze count variable outside eeh code. This
is useful for devices that can acquire a different personality after
a PERST event (e.g FPGA Adapters). Presently an existing freeze
count for an adapter with personality N will be taken into account
when the adapter acquired personality N+1.

By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
counter for an adapter once it has acquired a new personality and
ideally wont be plagued by the failures similar to the one before.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:

v1 -> v2
* Changes as suggested by Russell Currey:
- Suffixed function names with '()'
- Rephrased the description comment for functon eeh_pe_reset_freeze_counter()
- Inserted logging for PHB and PE number inside eeh_pe_reset_freeze_counter()

* Moved definition of eeh_pe_reset_freeze_counter() from eeh.h to eeh_pe.c to
  avoid adding a header dependency to 'pci-bridge.h'. The function is
  now marked as an exported gpl symbol.
---
 arch/powerpc/include/asm/eeh.h |  5 +++++
 arch/powerpc/kernel/eeh_pe.c   | 17 +++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 68806be..8dcfb88 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -266,6 +266,9 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
 int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
+
+void eeh_pe_reset_freeze_counter(struct eeh_pe *pe);
+
 void *eeh_pe_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
@@ -339,6 +342,8 @@ static inline int eeh_check_failure(const volatile void __iomem *token)
 	return 0;
 }
 
+static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe) { }
+
 #define eeh_dev_check_failure(x) (0)
 
 static inline void eeh_addr_cache_build(void) { }
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index d367c16..75c781f 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,6 +504,23 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 }
 
 /**
+ * eeh_pe_reset_freeze_counter - Resets the PE freeze counter
+ * @pe: EEH PE
+ *
+ * This function is useful while re-configuring an FPGA adapter
+ * as its about to acquire new a personality and you don't want
+ * freeze count to be carry forwarded. As such calling this function
+ * for regular pci devices might be a bad idea.
+ */
+void eeh_pe_reset_freeze_counter(struct eeh_pe *pe)
+{
+	pr_info("Resetting freeze count for PHB#%x-PE#%x\n",
+		pe->phb->global_number, pe->addr);
+	pe->freeze_count = 0;
+}
+EXPORT_SYMBOL_GPL(eeh_pe_reset_freeze_counter);
+
+/**
  * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
  * and freeze counter
  * @pe: EEH PE
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image
  2017-03-01 11:08 [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
  2017-03-01 11:08 ` [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
  2017-03-01 11:08 ` [RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
@ 2017-03-01 11:08 ` Vaibhav Jain
  2 siblings, 0 replies; 4+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:08 UTC (permalink / raw)
  To: linuxppc-dev, Frederic Barrat, Russell Currey
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

The patch resets the freeze counter on eeh_pe struct for PHB
associated with the cxl pci adapter. This would enable re-flashing of
the cxl-adapter beyond the default limit of 5.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:

v1 -> v2
Changes as suggested by Russell Currey:
- Changed new variable names inline with eeh code nomenclature.
- Removed the dev_info logging the PHB being reset. The log message is now
  moved to eeh_handle_normal_event()
---
 drivers/misc/cxl/pci.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index ce54dab..7960fd64 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -22,6 +22,7 @@
 #include <asm/pnv-pci.h>
 #include <asm/io.h>
 #include <asm/reg.h>
+#include <asm/eeh.h>
 
 #include "cxl.h"
 #include <misc/cxl.h>
@@ -1231,6 +1232,8 @@ static void cxl_pci_remove_afu(struct cxl_afu *afu)
 int cxl_pci_reset(struct cxl *adapter)
 {
 	struct pci_dev *dev = to_pci_dev(adapter->dev.parent);
+	struct eeh_dev *edev = pci_dev_to_eeh_dev(dev);
+	struct eeh_pe *pe = eeh_dev_to_pe(edev);
 	int rc;
 
 	if (adapter->perst_same_image) {
@@ -1244,6 +1247,17 @@ int cxl_pci_reset(struct cxl *adapter)
 	/* the adapter is about to be reset, so ignore errors */
 	cxl_data_cache_flush(adapter);
 
+	/* If loading a new image, reset freeze counters for the PHB
+	 * associated with the adapter.
+	 */
+	if (pe && adapter->perst_loads_image) {
+		/* Find the pe associated with the device PHB */
+		while (pe->parent != NULL && (pe->type & EEH_PE_PHB) == 0)
+			pe = pe->parent;
+
+		eeh_pe_reset_freeze_counter(pe);
+	}
+
 	/* pcie_warm_reset requests a fundamental pci reset which includes a
 	 * PERST assert/deassert.  PERST triggers a loading of the image
 	 * if "user" or "factory" is selected in sysfs */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-03-01 12:27 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 11:08 [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.