All of lore.kernel.org
 help / color / mirror / Atom feed
* [RESEND-RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST
@ 2017-03-01 11:24 Vaibhav Jain
  2017-03-01 11:24 ` [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:24 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey, Frederic Barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

Resend:
Update the Cc recipients list.

v2 changes:
* Moved definition of eeh_pe_reset_freeze_counter() from eeh.h to eeh_pe.c to
  avoid adding a header dependency to 'pci-bridge.h'. The function is now
  marked as an exported gpl symbol.

* Incorporated changes as suggested by Russell Currey:
- Inserted logging for PHB and PE number inside eeh_pe_reset_freeze_counter()
- Suffixed all the function names used in comments/patch-descriptions with '()'
- Removed an un-needed conditional check of '<0' in eeh_handle_normal_event()
- Rephrased the function comment for eeh_pe_update_freeze_counter() and
  eeh_pe_reset_freeze_counter()
- Brace-wrapped a single line statement at end of eeh_pe_update_freeze_counter()

v1:
Presently to flash a cxl adapter with a new FPGA image a warm pcie reset is
requested on the adapter, once the bitstream is loaded to card flash memory.
This issues a pci-fundamental reset to the card slot signaling the card
controller to reconfigure the fpga with the new bitstream. However
pci-fundamental reset of the slot also results in a fenced PHB that raises an
eeh event triggering the core eeh flow.

The core eeh also maintains a counter named freeze_count for each PE inside
struct eeh_pe. The counter is incremented every time an eeh error is reported on
the PE domain and if the counter reaches the threshold limit, the device is
permanently disabled. The threshold limit is enforced by the variable
eeh_max_freeze variable that can be manipulated via debugfs.

This creates problem for cxl adapters as:

* This puts a limit on number of times a fpga image can be re-flashed which is
  by default 5-time/Hour.

* Since after each reset the adapter can potentially acquire a new personality,
  the freeze_count of older fpga image shouldn't be carried over to newer image.

To fix these problems the proposed patch-set introduces a new function named
eeh_pe_reset_freeze_counter that resets freeze counter for the eeh_pe struct.
This function can then be called by the cxl module before issuing a
pci-fundamental reset to the card slot for loading the new fpga image.

Test Runs
==========
* Without the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
bash: /sys/class/cxl/card0/reset: No such file or directory
# dmesg
...
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
EEH: PHB#22-PE#0 has failed 2 times in the last hour
...
EEH: PHB#22-PE#0 has failed 3 times in the last hour
...
EEH: PHB#22-PE#0 has failed 4 times in the last hour
...
EEH: PHB#22-PE#0 has failed 5 times in the last hour
...
EEH: PHB#22-PE#0 has failed 6 times in the last hour
and has been permanently disabled.

* With the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
# dmesg
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour

---
Vaibhav Jain (3):
  powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update
    freeze_count
  powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  cxl: Reset freeze counters before adapter PERST for flashing new image

 arch/powerpc/include/asm/eeh.h   |  7 ++++-
 arch/powerpc/kernel/eeh_driver.c | 20 ++-----------
 arch/powerpc/kernel/eeh_pe.c     | 64 ++++++++++++++++++++++++++++++----------
 drivers/misc/cxl/pci.c           | 14 +++++++++
 4 files changed, 72 insertions(+), 33 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
  2017-03-01 11:24 [RESEND-RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
@ 2017-03-01 11:24 ` Vaibhav Jain
  2017-03-01 14:58   ` Guilherme G. Piccoli
  2017-03-02  1:03   ` Andrew Donnellan
  2017-03-01 11:24 ` [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
  2017-03-01 11:24 ` [RESEND-RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
  2 siblings, 2 replies; 16+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:24 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey
  Cc: Vaibhav Jain, Frederic Barrat, Andrew Donnellan, Ian Munsie,
	Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan

This patch introduces a new function eeh_pe_update_freeze_counter()
replacing existing function eeh_pe_update_time_stamp(). The new
function also manages the value of reeze_count along with tstamp to
track the number of times the PE roze in last one hour and if the
freeze_count > eeh_max_freezes then eports an error(-ENOTRECOVERABLE)
to indicate that the PE should be ermanently disabled.

This patch should not introduce any behavioral change.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:

v1 -> v2
Changes as suggested by Russell Currey:
- Suffixed function names with '()'
- Dropped '<0' conditional check inside eeh_handle_normal_event()
- Rephrased the comment for function eeh_pe_update_freeze_counter()
- Brace-wrapped a single line statement at end of
  eeh_pe_update_freeze_counter()
---
 arch/powerpc/include/asm/eeh.h   |  2 +-
 arch/powerpc/kernel/eeh_driver.c | 20 +++--------------
 arch/powerpc/kernel/eeh_pe.c     | 47 +++++++++++++++++++++++++++-------------
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8e37b71..68806be 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
 struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
-void eeh_pe_update_time_stamp(struct eeh_pe *pe);
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
 void *eeh_pe_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index b948871..8a088ea 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
-	eeh_pe_update_time_stamp(pe);
-	pe->freeze_count++;
-	if (pe->freeze_count > eeh_max_freezes)
-		goto excess_failures;
+	/* Update freeze counters and see if we have tripped max-freeze limit */
+	if (eeh_pe_update_freeze_counter(pe))
+		goto perm_error;
 	pr_warn("EEH: This PCI device has failed %d times in the last hour\n",
 		pe->freeze_count);
 
@@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 
 	return;
 
-excess_failures:
-	/*
-	 * About 90% of all real-life EEH failures in the field
-	 * are due to poorly seated PCI cards. Only 10% or so are
-	 * due to actual, failed cards.
-	 */
-	pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
-	       "last hour and has been permanently disabled.\n"
-	       "Please try reseating or replacing it.\n",
-		pe->phb->global_number, pe->addr,
-		pe->freeze_count);
-	goto perm_error;
-
 hard_fail:
 	pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
 	       "Please try reseating or replacing it\n",
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index cc4b206..d367c16 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,30 +504,47 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 }
 
 /**
- * eeh_pe_update_time_stamp - Update PE's frozen time stamp
+ * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
+ * and freeze counter
  * @pe: EEH PE
  *
- * We have time stamp for each PE to trace its time of getting
- * frozen in last hour. The function should be called to update
- * the time stamp on first error of the specific PE. On the other
- * handle, we needn't account for errors happened in last hour.
+ * We have a freeze counter and time stamp for each PE to trace
+ * number of times the PE was frozen in the last hour. This function
+ * updates the PE's freeze counter and returns an error if its greater
+ * than eeh_max_freezes. The function should be called to once every
+ * time a specific PE freezes.
  */
-void eeh_pe_update_time_stamp(struct eeh_pe *pe)
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
 {
 	struct timeval tstamp;
 
-	if (!pe) return;
+	if (!pe)
+		return -EINVAL;
+
+	do_gettimeofday(&tstamp);
+	if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
+		pe->tstamp = tstamp;
+		pe->freeze_count = 1;
+
+	} else if (pe->freeze_count >= eeh_max_freezes) {
+		pe->freeze_count++;
+		/*
+		 * About 90% of all real-life EEH failures in the field
+		 * are due to poorly seated PCI cards. Only 10% or so are
+		 * due to actual, failed cards.
+		 */
+		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
+		       "last hour and has been permanently disabled.\n"
+		       "Please try reseating or replacing it.\n",
+		       pe->phb->global_number, pe->addr,
+		       pe->freeze_count);
+		return -ENOTRECOVERABLE;
 
-	if (pe->freeze_count <= 0) {
-		pe->freeze_count = 0;
-		do_gettimeofday(&pe->tstamp);
 	} else {
-		do_gettimeofday(&tstamp);
-		if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
-			pe->tstamp = tstamp;
-			pe->freeze_count = 0;
-		}
+		pe->freeze_count++;
 	}
+
+	return 0;
 }
 
 /**
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-01 11:24 [RESEND-RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
  2017-03-01 11:24 ` [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
@ 2017-03-01 11:24 ` Vaibhav Jain
  2017-03-02  1:10   ` Andrew Donnellan
  2017-03-03  4:21   ` Vaibhav Jain
  2017-03-01 11:24 ` [RESEND-RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
  2 siblings, 2 replies; 16+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:24 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey
  Cc: Vaibhav Jain, Frederic Barrat, Andrew Donnellan, Ian Munsie,
	Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan

This patch introduces function eeh_pe_reset_freeze_counter() which can
be used to reset the PE's freeze count variable outside eeh code. This
is useful for devices that can acquire a different personality after
a PERST event (e.g FPGA Adapters). Presently an existing freeze
count for an adapter with personality N will be taken into account
when the adapter acquired personality N+1.

By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
counter for an adapter once it has acquired a new personality and
ideally wont be plagued by the failures similar to the one before.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:

v1 -> v2
* Changes as suggested by Russell Currey:
- Suffixed function names with '()'
- Rephrased the description comment for functon eeh_pe_reset_freeze_counter()
- Inserted logging for PHB and PE number inside eeh_pe_reset_freeze_counter()

* Moved definition of eeh_pe_reset_freeze_counter() from eeh.h to eeh_pe.c to
  avoid adding a header dependency to 'pci-bridge.h'. The function is
  now marked as an exported gpl symbol.
---
 arch/powerpc/include/asm/eeh.h |  5 +++++
 arch/powerpc/kernel/eeh_pe.c   | 17 +++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 68806be..8dcfb88 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -266,6 +266,9 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
 int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
+
+void eeh_pe_reset_freeze_counter(struct eeh_pe *pe);
+
 void *eeh_pe_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
@@ -339,6 +342,8 @@ static inline int eeh_check_failure(const volatile void __iomem *token)
 	return 0;
 }
 
+static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe) { }
+
 #define eeh_dev_check_failure(x) (0)
 
 static inline void eeh_addr_cache_build(void) { }
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index d367c16..75c781f 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,6 +504,23 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 }
 
 /**
+ * eeh_pe_reset_freeze_counter - Resets the PE freeze counter
+ * @pe: EEH PE
+ *
+ * This function is useful while re-configuring an FPGA adapter
+ * as its about to acquire new a personality and you don't want
+ * freeze count to be carry forwarded. As such calling this function
+ * for regular pci devices might be a bad idea.
+ */
+void eeh_pe_reset_freeze_counter(struct eeh_pe *pe)
+{
+	pr_info("Resetting freeze count for PHB#%x-PE#%x\n",
+		pe->phb->global_number, pe->addr);
+	pe->freeze_count = 0;
+}
+EXPORT_SYMBOL_GPL(eeh_pe_reset_freeze_counter);
+
+/**
  * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
  * and freeze counter
  * @pe: EEH PE
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RESEND-RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image
  2017-03-01 11:24 [RESEND-RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
  2017-03-01 11:24 ` [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
  2017-03-01 11:24 ` [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
@ 2017-03-01 11:24 ` Vaibhav Jain
  2017-03-02  1:46   ` Andrew Donnellan
  2 siblings, 1 reply; 16+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:24 UTC (permalink / raw)
  To: capi-linux, linuxppc-dev, Frederic Barrat, Russell Currey
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

The patch resets the freeze counter on eeh_pe struct for PHB
associated with the cxl pci adapter. This would enable re-flashing of
the cxl-adapter beyond the default limit of 5.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:

v1 -> v2
Changes as suggested by Russell Currey:
- Changed new variable names inline with eeh code nomenclature.
- Removed the dev_info logging the PHB being reset. The log message is now
  moved to eeh_handle_normal_event()
---
 drivers/misc/cxl/pci.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index ce54dab..7960fd64 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -22,6 +22,7 @@
 #include <asm/pnv-pci.h>
 #include <asm/io.h>
 #include <asm/reg.h>
+#include <asm/eeh.h>
 
 #include "cxl.h"
 #include <misc/cxl.h>
@@ -1231,6 +1232,8 @@ static void cxl_pci_remove_afu(struct cxl_afu *afu)
 int cxl_pci_reset(struct cxl *adapter)
 {
 	struct pci_dev *dev = to_pci_dev(adapter->dev.parent);
+	struct eeh_dev *edev = pci_dev_to_eeh_dev(dev);
+	struct eeh_pe *pe = eeh_dev_to_pe(edev);
 	int rc;
 
 	if (adapter->perst_same_image) {
@@ -1244,6 +1247,17 @@ int cxl_pci_reset(struct cxl *adapter)
 	/* the adapter is about to be reset, so ignore errors */
 	cxl_data_cache_flush(adapter);
 
+	/* If loading a new image, reset freeze counters for the PHB
+	 * associated with the adapter.
+	 */
+	if (pe && adapter->perst_loads_image) {
+		/* Find the pe associated with the device PHB */
+		while (pe->parent != NULL && (pe->type & EEH_PE_PHB) == 0)
+			pe = pe->parent;
+
+		eeh_pe_reset_freeze_counter(pe);
+	}
+
 	/* pcie_warm_reset requests a fundamental pci reset which includes a
 	 * PERST assert/deassert.  PERST triggers a loading of the image
 	 * if "user" or "factory" is selected in sysfs */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
  2017-03-01 11:24 ` [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
@ 2017-03-01 14:58   ` Guilherme G. Piccoli
  2017-03-01 16:26     ` Vaibhav Jain
  2017-03-02  1:03   ` Andrew Donnellan
  1 sibling, 1 reply; 16+ messages in thread
From: Guilherme G. Piccoli @ 2017-03-01 14:58 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Russell Currey
  Cc: Philippe Bergheaud, Frederic Barrat, Gavin Shan, Ian Munsie,
	Andrew Donnellan, Christophe Lombard, Greg Kurz

Hi Vaibhav, nice patch! Some comments below:

On 03/01/2017 08:24 AM, Vaibhav Jain wrote:
> This patch introduces a new function eeh_pe_update_freeze_counter()
> replacing existing function eeh_pe_update_time_stamp(). The new
> function also manages the value of reeze_count along with tstamp to
> track the number of times the PE roze in last one hour and if the
> freeze_count > eeh_max_freezes then eports an error(-ENOTRECOVERABLE)
> to indicate that the PE should be ermanently disabled.

Not sure why, but many of the words in commit message are missing their
first letter. See, for example:
reeze_count,  roze,  eports,  ermanently


> 
> This patch should not introduce any behavioral change.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
> Change-log:
> 
> v1 -> v2
> Changes as suggested by Russell Currey:
> - Suffixed function names with '()'
> - Dropped '<0' conditional check inside eeh_handle_normal_event()
> - Rephrased the comment for function eeh_pe_update_freeze_counter()
> - Brace-wrapped a single line statement at end of
>   eeh_pe_update_freeze_counter()
> ---
>  arch/powerpc/include/asm/eeh.h   |  2 +-
>  arch/powerpc/kernel/eeh_driver.c | 20 +++--------------
>  arch/powerpc/kernel/eeh_pe.c     | 47 +++++++++++++++++++++++++++-------------
>  3 files changed, 36 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 8e37b71..68806be 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
>  struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
>  int eeh_add_to_parent_pe(struct eeh_dev *edev);
>  int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
> -void eeh_pe_update_time_stamp(struct eeh_pe *pe);
> +int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
>  void *eeh_pe_traverse(struct eeh_pe *root,
>  		eeh_traverse_func fn, void *flag);
>  void *eeh_pe_dev_traverse(struct eeh_pe *root,
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index b948871..8a088ea 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
>  		return;
>  	}
> 
> -	eeh_pe_update_time_stamp(pe);
> -	pe->freeze_count++;
> -	if (pe->freeze_count > eeh_max_freezes)
> -		goto excess_failures;
> +	/* Update freeze counters and see if we have tripped max-freeze limit */
> +	if (eeh_pe_update_freeze_counter(pe))
> +		goto perm_error;
>  	pr_warn("EEH: This PCI device has failed %d times in the last hour\n",
>  		pe->freeze_count);
> 
> @@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
> 
>  	return;
> 
> -excess_failures:
> -	/*
> -	 * About 90% of all real-life EEH failures in the field
> -	 * are due to poorly seated PCI cards. Only 10% or so are
> -	 * due to actual, failed cards.
> -	 */
> -	pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
> -	       "last hour and has been permanently disabled.\n"
> -	       "Please try reseating or replacing it.\n",
> -		pe->phb->global_number, pe->addr,
> -		pe->freeze_count);
> -	goto perm_error;
> -
>  hard_fail:
>  	pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
>  	       "Please try reseating or replacing it\n",
> diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
> index cc4b206..d367c16 100644
> --- a/arch/powerpc/kernel/eeh_pe.c
> +++ b/arch/powerpc/kernel/eeh_pe.c
> @@ -504,30 +504,47 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  }
> 
>  /**
> - * eeh_pe_update_time_stamp - Update PE's frozen time stamp
> + * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
> + * and freeze counter
>   * @pe: EEH PE
>   *
> - * We have time stamp for each PE to trace its time of getting
> - * frozen in last hour. The function should be called to update
> - * the time stamp on first error of the specific PE. On the other
> - * handle, we needn't account for errors happened in last hour.

s/handle/hand? "On the other hand..."

Thanks,


Guilherme


> + * We have a freeze counter and time stamp for each PE to trace
> + * number of times the PE was frozen in the last hour. This function
> + * updates the PE's freeze counter and returns an error if its greater
> + * than eeh_max_freezes. The function should be called to once every
> + * time a specific PE freezes.
>   */
> -void eeh_pe_update_time_stamp(struct eeh_pe *pe)
> +int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
>  {
>  	struct timeval tstamp;
> 
> -	if (!pe) return;
> +	if (!pe)
> +		return -EINVAL;
> +
> +	do_gettimeofday(&tstamp);
> +	if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
> +		pe->tstamp = tstamp;
> +		pe->freeze_count = 1;
> +
> +	} else if (pe->freeze_count >= eeh_max_freezes) {
> +		pe->freeze_count++;
> +		/*
> +		 * About 90% of all real-life EEH failures in the field
> +		 * are due to poorly seated PCI cards. Only 10% or so are
> +		 * due to actual, failed cards.
> +		 */
> +		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
> +		       "last hour and has been permanently disabled.\n"
> +		       "Please try reseating or replacing it.\n",
> +		       pe->phb->global_number, pe->addr,
> +		       pe->freeze_count);
> +		return -ENOTRECOVERABLE;
> 
> -	if (pe->freeze_count <= 0) {
> -		pe->freeze_count = 0;
> -		do_gettimeofday(&pe->tstamp);
>  	} else {
> -		do_gettimeofday(&tstamp);
> -		if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
> -			pe->tstamp = tstamp;
> -			pe->freeze_count = 0;
> -		}
> +		pe->freeze_count++;
>  	}
> +
> +	return 0;
>  }
> 
>  /**
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
  2017-03-01 14:58   ` Guilherme G. Piccoli
@ 2017-03-01 16:26     ` Vaibhav Jain
  2017-03-01 17:39       ` Guilherme G. Piccoli
  2017-03-01 23:52       ` Russell Currey
  0 siblings, 2 replies; 16+ messages in thread
From: Vaibhav Jain @ 2017-03-01 16:26 UTC (permalink / raw)
  To: Guilherme G. Piccoli, linuxppc-dev, Russell Currey
  Cc: Philippe Bergheaud, Frederic Barrat, Gavin Shan, Ian Munsie,
	Andrew Donnellan, Christophe Lombard, Greg Kurz


Thanks for reviewing the patch !!

"Guilherme G. Piccoli" <gpiccoli@linux.vnet.ibm.com> writes:
>
> Not sure why, but many of the words in commit message are missing their
> first letter. See, for example:
> reeze_count,  roze,  eports,  ermanently
Thanks for pointing this out. Will fix this in the subsequent patch
revision.

>>  /**
>> - * eeh_pe_update_time_stamp - Update PE's frozen time stamp
>> + * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
>> + * and freeze counter
>>   * @pe: EEH PE
>>   *
>> - * We have time stamp for each PE to trace its time of getting
>> - * frozen in last hour. The function should be called to update
>> - * the time stamp on first error of the specific PE. On the other
>> - * handle, we needn't account for errors happened in last hour.
>
> s/handle/hand? "On the other hand..."
Thats part of original comment which this patch is replacing.

-- 
Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Linux Technology Center, IBM India Pvt. Ltd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
  2017-03-01 16:26     ` Vaibhav Jain
@ 2017-03-01 17:39       ` Guilherme G. Piccoli
  2017-03-01 23:52       ` Russell Currey
  1 sibling, 0 replies; 16+ messages in thread
From: Guilherme G. Piccoli @ 2017-03-01 17:39 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Russell Currey
  Cc: Philippe Bergheaud, Frederic Barrat, Gavin Shan, Ian Munsie,
	Andrew Donnellan, Christophe Lombard, Greg Kurz

On 03/01/2017 01:26 PM, Vaibhav Jain wrote:
> 
> Thanks for reviewing the patch !!

Yw =)


> 
> "Guilherme G. Piccoli" <gpiccoli@linux.vnet.ibm.com> writes:
>>
>> Not sure why, but many of the words in commit message are missing their
>> first letter. See, for example:
>> reeze_count,  roze,  eports,  ermanently
> Thanks for pointing this out. Will fix this in the subsequent patch
> revision.
> 
>>>  /**
>>> - * eeh_pe_update_time_stamp - Update PE's frozen time stamp
>>> + * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
>>> + * and freeze counter
>>>   * @pe: EEH PE
>>>   *
>>> - * We have time stamp for each PE to trace its time of getting
>>> - * frozen in last hour. The function should be called to update
>>> - * the time stamp on first error of the specific PE. On the other
>>> - * handle, we needn't account for errors happened in last hour.
>>
>> s/handle/hand? "On the other hand..."
> Thats part of original comment which this patch is replacing.

.....hehehe
How didn't I notice?
Sorry, my bad!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
  2017-03-01 16:26     ` Vaibhav Jain
  2017-03-01 17:39       ` Guilherme G. Piccoli
@ 2017-03-01 23:52       ` Russell Currey
  1 sibling, 0 replies; 16+ messages in thread
From: Russell Currey @ 2017-03-01 23:52 UTC (permalink / raw)
  To: Vaibhav Jain, Guilherme G. Piccoli, linuxppc-dev
  Cc: Philippe Bergheaud, Frederic Barrat, Gavin Shan, Ian Munsie,
	Andrew Donnellan, Christophe Lombard, Greg Kurz

On Wed, 2017-03-01 at 21:56 +0530, Vaibhav Jain wrote:
> Thanks for reviewing the patch !!
> 
> "Guilherme G. Piccoli" <gpiccoli@linux.vnet.ibm.com> writes:
> > 
> > Not sure why, but many of the words in commit message are missing their
> > first letter. See, for example:
> > reeze_count,  roze,  eports,  ermanently
> 
> Thanks for pointing this out. Will fix this in the subsequent patch
> revision.
> 
> > >  /**
> > > - * eeh_pe_update_time_stamp - Update PE's frozen time stamp
> > > + * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
> > > + * and freeze counter
> > >   * @pe: EEH PE
> > >   *
> > > - * We have time stamp for each PE to trace its time of getting
> > > - * frozen in last hour. The function should be called to update
> > > - * the time stamp on first error of the specific PE. On the other
> > > - * handle, we needn't account for errors happened in last hour.
> > 
> > s/handle/hand? "On the other hand..."
> 
> Thats part of original comment which this patch is replacing.

It's still worth fixing. Maybe "On the other hand, we don't need to account for
errors that happened in the last hour."

I really need to get around to fixing up a lot of these old comments...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
  2017-03-01 11:24 ` [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
  2017-03-01 14:58   ` Guilherme G. Piccoli
@ 2017-03-02  1:03   ` Andrew Donnellan
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Donnellan @ 2017-03-02  1:03 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Russell Currey
  Cc: Frederic Barrat, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

On 01/03/17 22:24, Vaibhav Jain wrote:
> This patch introduces a new function eeh_pe_update_freeze_counter()
> replacing existing function eeh_pe_update_time_stamp(). The new
> function also manages the value of reeze_count along with tstamp to
> track the number of times the PE roze in last one hour and if the
> freeze_count > eeh_max_freezes then eports an error(-ENOTRECOVERABLE)
> to indicate that the PE should be ermanently disabled.
>
> This patch should not introduce any behavioral change.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>

Thanks for addressing Russell's comments. Per Guilherme, your commit 
message is missing a few letters, a couple of minor style points below, 
otherwise:

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

>  /**
> - * eeh_pe_update_time_stamp - Update PE's frozen time stamp
> + * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
> + * and freeze counter
>   * @pe: EEH PE
>   *
> - * We have time stamp for each PE to trace its time of getting
> - * frozen in last hour. The function should be called to update
> - * the time stamp on first error of the specific PE. On the other
> - * handle, we needn't account for errors happened in last hour.
> + * We have a freeze counter and time stamp for each PE to trace
> + * number of times the PE was frozen in the last hour. This function
> + * updates the PE's freeze counter and returns an error if its greater

it's

> + * than eeh_max_freezes. The function should be called to once every
> + * time a specific PE freezes.

"The function should be called every time the PE freezes"


-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-01 11:24 ` [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
@ 2017-03-02  1:10   ` Andrew Donnellan
  2017-03-03  4:21   ` Vaibhav Jain
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Donnellan @ 2017-03-02  1:10 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Russell Currey
  Cc: Frederic Barrat, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

On 01/03/17 22:24, Vaibhav Jain wrote:
> This patch introduces function eeh_pe_reset_freeze_counter() which can
> be used to reset the PE's freeze count variable outside eeh code. This
> is useful for devices that can acquire a different personality after
> a PERST event (e.g FPGA Adapters). Presently an existing freeze
> count for an adapter with personality N will be taken into account
> when the adapter acquired personality N+1.
>
> By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
> counter for an adapter once it has acquired a new personality and
> ideally wont be plagued by the failures similar to the one before.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>

LGTM

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

> ---
> Change-log:
>
> v1 -> v2
> * Changes as suggested by Russell Currey:
> - Suffixed function names with '()'
> - Rephrased the description comment for functon eeh_pe_reset_freeze_counter()
> - Inserted logging for PHB and PE number inside eeh_pe_reset_freeze_counter()
>
> * Moved definition of eeh_pe_reset_freeze_counter() from eeh.h to eeh_pe.c to
>   avoid adding a header dependency to 'pci-bridge.h'. The function is
>   now marked as an exported gpl symbol.
> ---
>  arch/powerpc/include/asm/eeh.h |  5 +++++
>  arch/powerpc/kernel/eeh_pe.c   | 17 +++++++++++++++++
>  2 files changed, 22 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 68806be..8dcfb88 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -266,6 +266,9 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
>  int eeh_add_to_parent_pe(struct eeh_dev *edev);
>  int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
>  int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
> +
> +void eeh_pe_reset_freeze_counter(struct eeh_pe *pe);
> +
>  void *eeh_pe_traverse(struct eeh_pe *root,
>  		eeh_traverse_func fn, void *flag);
>  void *eeh_pe_dev_traverse(struct eeh_pe *root,
> @@ -339,6 +342,8 @@ static inline int eeh_check_failure(const volatile void __iomem *token)
>  	return 0;
>  }
>
> +static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe) { }
> +
>  #define eeh_dev_check_failure(x) (0)
>
>  static inline void eeh_addr_cache_build(void) { }
> diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
> index d367c16..75c781f 100644
> --- a/arch/powerpc/kernel/eeh_pe.c
> +++ b/arch/powerpc/kernel/eeh_pe.c
> @@ -504,6 +504,23 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  }
>
>  /**
> + * eeh_pe_reset_freeze_counter - Resets the PE freeze counter
> + * @pe: EEH PE
> + *
> + * This function is useful while re-configuring an FPGA adapter
> + * as its about to acquire new a personality and you don't want
> + * freeze count to be carry forwarded. As such calling this function
> + * for regular pci devices might be a bad idea.
> + */
> +void eeh_pe_reset_freeze_counter(struct eeh_pe *pe)
> +{
> +	pr_info("Resetting freeze count for PHB#%x-PE#%x\n",
> +		pe->phb->global_number, pe->addr);
> +	pe->freeze_count = 0;
> +}
> +EXPORT_SYMBOL_GPL(eeh_pe_reset_freeze_counter);
> +
> +/**
>   * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
>   * and freeze counter
>   * @pe: EEH PE
>

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image
  2017-03-01 11:24 ` [RESEND-RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
@ 2017-03-02  1:46   ` Andrew Donnellan
  0 siblings, 0 replies; 16+ messages in thread
From: Andrew Donnellan @ 2017-03-02  1:46 UTC (permalink / raw)
  To: Vaibhav Jain, capi-linux, linuxppc-dev, Frederic Barrat, Russell Currey
  Cc: Ian Munsie, Christophe Lombard, Philippe Bergheaud, Greg Kurz,
	Gavin Shan

On 01/03/17 22:24, Vaibhav Jain wrote:
> The patch resets the freeze counter on eeh_pe struct for PHB
> associated with the cxl pci adapter. This would enable re-flashing of
> the cxl-adapter beyond the default limit of 5.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

> ---
> Change-log:
>
> v1 -> v2
> Changes as suggested by Russell Currey:
> - Changed new variable names inline with eeh code nomenclature.
> - Removed the dev_info logging the PHB being reset. The log message is now
>   moved to eeh_handle_normal_event()

In eeh_pe_reset_freeze_counter() to be precise.

> ---
>  drivers/misc/cxl/pci.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
> index ce54dab..7960fd64 100644
> --- a/drivers/misc/cxl/pci.c
> +++ b/drivers/misc/cxl/pci.c
> @@ -22,6 +22,7 @@
>  #include <asm/pnv-pci.h>
>  #include <asm/io.h>
>  #include <asm/reg.h>
> +#include <asm/eeh.h>
>
>  #include "cxl.h"
>  #include <misc/cxl.h>
> @@ -1231,6 +1232,8 @@ static void cxl_pci_remove_afu(struct cxl_afu *afu)
>  int cxl_pci_reset(struct cxl *adapter)
>  {
>  	struct pci_dev *dev = to_pci_dev(adapter->dev.parent);
> +	struct eeh_dev *edev = pci_dev_to_eeh_dev(dev);
> +	struct eeh_pe *pe = eeh_dev_to_pe(edev);
>  	int rc;
>
>  	if (adapter->perst_same_image) {
> @@ -1244,6 +1247,17 @@ int cxl_pci_reset(struct cxl *adapter)
>  	/* the adapter is about to be reset, so ignore errors */
>  	cxl_data_cache_flush(adapter);
>
> +	/* If loading a new image, reset freeze counters for the PHB
> +	 * associated with the adapter.
> +	 */
> +	if (pe && adapter->perst_loads_image) {
> +		/* Find the pe associated with the device PHB */
> +		while (pe->parent != NULL && (pe->type & EEH_PE_PHB) == 0)
> +			pe = pe->parent;
> +
> +		eeh_pe_reset_freeze_counter(pe);
> +	}
> +
>  	/* pcie_warm_reset requests a fundamental pci reset which includes a
>  	 * PERST assert/deassert.  PERST triggers a loading of the image
>  	 * if "user" or "factory" is selected in sysfs */
>

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-01 11:24 ` [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
  2017-03-02  1:10   ` Andrew Donnellan
@ 2017-03-03  4:21   ` Vaibhav Jain
  2017-03-03  4:34     ` Andrew Donnellan
  2017-03-03  4:35     ` Russell Currey
  1 sibling, 2 replies; 16+ messages in thread
From: Vaibhav Jain @ 2017-03-03  4:21 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey
  Cc: Frederic Barrat, Andrew Donnellan, Ian Munsie,
	Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan

Hi Russell,

Vaibhav Jain <vaibhav@linux.vnet.ibm.com> writes:

> This patch introduces function eeh_pe_reset_freeze_counter() which can
> be used to reset the PE's freeze count variable outside eeh code. This
> is useful for devices that can acquire a different personality after
> a PERST event (e.g FPGA Adapters). Presently an existing freeze
> count for an adapter with personality N will be taken into account
> when the adapter acquired personality N+1.
>
> By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
> counter for an adapter once it has acquired a new personality and
> ideally wont be plagued by the failures similar to the one before.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
Had a short chat discussion with Gavin Shan on this patchset and he
preffers restoring the freeze_count on the eeh_pe once FRESET is done.
He expects a the flow to be similar to one below

1. module caches the value of freeze_count and resets it
2. Issue warm reset
3. During eeh error-detected callback module restores the freeze_count
from the cached value.

Russell, what do you think? 

-- 
Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Linux Technology Center, IBM India Pvt. Ltd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-03  4:21   ` Vaibhav Jain
@ 2017-03-03  4:34     ` Andrew Donnellan
  2017-03-03  4:35     ` Russell Currey
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Donnellan @ 2017-03-03  4:34 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Russell Currey
  Cc: Frederic Barrat, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

On 03/03/17 15:21, Vaibhav Jain wrote:
> Had a short chat discussion with Gavin Shan on this patchset and he
> preffers restoring the freeze_count on the eeh_pe once FRESET is done.
> He expects a the flow to be similar to one below
>
> 1. module caches the value of freeze_count and resets it
> 2. Issue warm reset
> 3. During eeh error-detected callback module restores the freeze_count
> from the cached value.

I'm inclined to think that's unnecessarily complex for relatively 
limited gain - I can't imagine there's that many people reflashing their 
CAPI devices so many times in quick succession that the counter won't 
hit 5 in a real failure case.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-03  4:21   ` Vaibhav Jain
  2017-03-03  4:34     ` Andrew Donnellan
@ 2017-03-03  4:35     ` Russell Currey
  2017-03-03  4:39       ` Andrew Donnellan
  2017-03-03  5:45       ` Gavin Shan
  1 sibling, 2 replies; 16+ messages in thread
From: Russell Currey @ 2017-03-03  4:35 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev
  Cc: Frederic Barrat, Andrew Donnellan, Ian Munsie,
	Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan

On Fri, 2017-03-03 at 09:51 +0530, Vaibhav Jain wrote:
> Hi Russell,
> 
> Vaibhav Jain <vaibhav@linux.vnet.ibm.com> writes:
> 
> > This patch introduces function eeh_pe_reset_freeze_counter() which can
> > be used to reset the PE's freeze count variable outside eeh code. This
> > is useful for devices that can acquire a different personality after
> > a PERST event (e.g FPGA Adapters). Presently an existing freeze
> > count for an adapter with personality N will be taken into account
> > when the adapter acquired personality N+1.
> > 
> > By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
> > counter for an adapter once it has acquired a new personality and
> > ideally wont be plagued by the failures similar to the one before.
> > 
> > Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> > ---
> 
> Had a short chat discussion with Gavin Shan on this patchset and he
> preffers restoring the freeze_count on the eeh_pe once FRESET is done.
> He expects a the flow to be similar to one below
> 
> 1. module caches the value of freeze_count and resets it
> 2. Issue warm reset
> 3. During eeh error-detected callback module restores the freeze_count
> from the cached value.
> 
> Russell, what do you think? 
> 
I thought about this but figured it didn't really make sense from a CAPI
perspective.  If you're flashing the device, it is going to have different
behaviour to before it was flashed, and that it should be treated differently as
a result (and thus restoring the freeze_count doesn't make much sense).

Consider a case where there's a buggy FPGA image on an adapter that's failed 4
times in the past hour, and generally has frequent errors.  You decide to update
it to something that's less buggy, so you flash the adapter.  The freeze_count
gets cached and thus is restored to 4 after the flash.  Now even if the new
image is less buggy and may only fail once an hour instead of multiple times, if
it happens to fail within an hour of the earlier failures the device is now
fenced and you need to reboot.

I don't mind either way - I just don't get the logic of restoring the count.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-03  4:35     ` Russell Currey
@ 2017-03-03  4:39       ` Andrew Donnellan
  2017-03-03  5:45       ` Gavin Shan
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Donnellan @ 2017-03-03  4:39 UTC (permalink / raw)
  To: Russell Currey, Vaibhav Jain, linuxppc-dev
  Cc: Frederic Barrat, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

On 03/03/17 15:35, Russell Currey wrote:
> I thought about this but figured it didn't really make sense from a CAPI
> perspective.  If you're flashing the device, it is going to have different
> behaviour to before it was flashed, and that it should be treated differently as
> a result (and thus restoring the freeze_count doesn't make much sense).
>
> Consider a case where there's a buggy FPGA image on an adapter that's failed 4
> times in the past hour, and generally has frequent errors.  You decide to update
> it to something that's less buggy, so you flash the adapter.  The freeze_count
> gets cached and thus is restored to 4 after the flash.  Now even if the new
> image is less buggy and may only fail once an hour instead of multiple times, if
> it happens to fail within an hour of the earlier failures the device is now
> fenced and you need to reboot.
>
> I don't mind either way - I just don't get the logic of restoring the count.

I agree with this logic.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  2017-03-03  4:35     ` Russell Currey
  2017-03-03  4:39       ` Andrew Donnellan
@ 2017-03-03  5:45       ` Gavin Shan
  1 sibling, 0 replies; 16+ messages in thread
From: Gavin Shan @ 2017-03-03  5:45 UTC (permalink / raw)
  To: Russell Currey
  Cc: Vaibhav Jain, linuxppc-dev, Frederic Barrat, Andrew Donnellan,
	Ian Munsie, Christophe Lombard, Philippe Bergheaud, Greg Kurz,
	Gavin Shan

On Fri, Mar 03, 2017 at 03:35:05PM +1100, Russell Currey wrote:
>On Fri, 2017-03-03 at 09:51 +0530, Vaibhav Jain wrote:
>> Hi Russell,
>> 
>> Vaibhav Jain <vaibhav@linux.vnet.ibm.com> writes:
>> 
>> > This patch introduces function eeh_pe_reset_freeze_counter() which can
>> > be used to reset the PE's freeze count variable outside eeh code. This
>> > is useful for devices that can acquire a different personality after
>> > a PERST event (e.g FPGA Adapters). Presently an existing freeze
>> > count for an adapter with personality N will be taken into account
>> > when the adapter acquired personality N+1.
>> > 
>> > By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
>> > counter for an adapter once it has acquired a new personality and
>> > ideally wont be plagued by the failures similar to the one before.
>> > 
>> > Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
>> > ---
>> 
>> Had a short chat discussion with Gavin Shan on this patchset and he
>> preffers restoring the freeze_count on the eeh_pe once FRESET is done.
>> He expects a the flow to be similar to one below
>> 
>> 1. module caches the value of freeze_count and resets it
>> 2. Issue warm reset
>> 3. During eeh error-detected callback module restores the freeze_count
>> from the cached value.
>> 
>> Russell, what do you think? 
>> 
>I thought about this but figured it didn't really make sense from a CAPI
>perspective.  If you're flashing the device, it is going to have different
>behaviour to before it was flashed, and that it should be treated differently as
>a result (and thus restoring the freeze_count doesn't make much sense).
>

There are nothing changed on the PHB. This patch is clearing the error count
of PHB PE, not the PE for the CAPI device. We shouldn't clear the error count
of the PHB PE. Otherwise, it's not consistent.

>Consider a case where there's a buggy FPGA image on an adapter that's failed 4
>times in the past hour, and generally has frequent errors.  You decide to update
>it to something that's less buggy, so you flash the adapter.  The freeze_count
>gets cached and thus is restored to 4 after the flash.  Now even if the new
>image is less buggy and may only fail once an hour instead of multiple times, if
>it happens to fail within an hour of the earlier failures the device is now
>fenced and you need to reboot.
>
>I don't mind either way - I just don't get the logic of restoring the count.
>

I don't get your point. FPGA image isn't the only source of EEH error. Also,
it's not related the PHB PE's error count, which the patch is to clear.

Cheers,
Gavin

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-03-03  5:46 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 11:24 [RESEND-RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
2017-03-01 11:24 ` [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
2017-03-01 14:58   ` Guilherme G. Piccoli
2017-03-01 16:26     ` Vaibhav Jain
2017-03-01 17:39       ` Guilherme G. Piccoli
2017-03-01 23:52       ` Russell Currey
2017-03-02  1:03   ` Andrew Donnellan
2017-03-01 11:24 ` [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
2017-03-02  1:10   ` Andrew Donnellan
2017-03-03  4:21   ` Vaibhav Jain
2017-03-03  4:34     ` Andrew Donnellan
2017-03-03  4:35     ` Russell Currey
2017-03-03  4:39       ` Andrew Donnellan
2017-03-03  5:45       ` Gavin Shan
2017-03-01 11:24 ` [RESEND-RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
2017-03-02  1:46   ` Andrew Donnellan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.