All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/3] cxl: Reset freeze counter for the adapter before PERST
@ 2017-02-28  7:02 Vaibhav Jain
  2017-02-28  7:02 ` [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count Vaibhav Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Vaibhav Jain @ 2017-02-28  7:02 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey, frederic.barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Gregory Kurz, Gavin Shan

Presently to flash a cxl adapter with a new FPGA image a warm pcie reset is
requested on the adapter, once the bitstream is loaded to card flash memory.
This issues a pci-fundamental reset to the card slot signaling the card
controller to reconfigure the fpga with the new bitstream. However
pci-fundamental reset of the slot also results in a fenced PHB that raises an
eeh event triggering the core eeh flow.

The core eeh also maintains a counter named freeze_count for each PE inside
struct eeh_pe. The counter is incremented every time an eeh error is reported on
the PE domain and if the counter reaches the threshold limit, the device is
permanently disabled. The threshold limit is enforced by the variable
eeh_max_freeze variable that can be manipulated via debugfs.

This creates problem for cxl adapters as:

* This puts a limit on number of times a fpga image can be re-flashed which is
  by default 5-time/Hour.

* Since after each reset the adapter can potentially acquire a new personality,
  the freeze_count of older fpga image shouldn't be carried over to newer image.

To fix these problems the proposed patch-set introduces a new function named
eeh_pe_reset_freeze_counter that resets freeze counter for the eeh_pe struct.
This function can then be called by the cxl module the cxl module before issuing
pci-fundamental reset to the card slot for loading the new fpga image.

Test Runs
==========
* Without the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
bash: /sys/class/cxl/card0/reset: No such file or directory
# dmesg
...
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
EEH: PHB#22-PE#0 has failed 2 times in the last hour
...
EEH: PHB#22-PE#0 has failed 3 times in the last hour
...
EEH: PHB#22-PE#0 has failed 4 times in the last hour
...
EEH: PHB#22-PE#0 has failed 5 times in the last hour
...
EEH: PHB#22-PE#0 has failed 6 times in the last hour
and has been permanently disabled.


* With the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
# dmesg
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour

---
Vaibhav Jain (3):
  powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count
  powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter
  cxl: Reset freeze counters before adapter PERST for flashing new image

 arch/powerpc/include/asm/eeh.h   | 11 ++++++++-
 arch/powerpc/kernel/eeh_driver.c | 20 +++-------------
 arch/powerpc/kernel/eeh_pe.c     | 50 ++++++++++++++++++++++++++--------------
 drivers/misc/cxl/pci.c           | 15 ++++++++++++
 4 files changed, 61 insertions(+), 35 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count
  2017-02-28  7:02 [RFC 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
@ 2017-02-28  7:02 ` Vaibhav Jain
  2017-02-28 23:44   ` Russell Currey
  2017-02-28  7:02 ` [RFC 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter Vaibhav Jain
  2017-02-28  7:22 ` [RFC 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
  2 siblings, 1 reply; 7+ messages in thread
From: Vaibhav Jain @ 2017-02-28  7:02 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey, frederic.barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Gregory Kurz, Gavin Shan

This patch introduces a new function named
eeh_pe_update_freeze_counter replacing existing function
eeh_pe_update_time_stamp. The new function also manages the value of
freeze_count along with tstamp to track the number of times the PE
froze in last one hour and if the freeze_count > eeh_max_freezes then
reports an error(-ENOTRECOVERABLE) to indicate that the PE should be
permanently disabled.

This patch should not introduce any behavioral change.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |  2 +-
 arch/powerpc/kernel/eeh_driver.c | 20 +++-------------
 arch/powerpc/kernel/eeh_pe.c     | 50 ++++++++++++++++++++++++++--------------
 3 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8e37b71..68806be 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
 struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
-void eeh_pe_update_time_stamp(struct eeh_pe *pe);
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
 void *eeh_pe_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index b948871..326b4e4 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
-	eeh_pe_update_time_stamp(pe);
-	pe->freeze_count++;
-	if (pe->freeze_count > eeh_max_freezes)
-		goto excess_failures;
+	/* Update freeze counters and see if we have tripped max-freeze limit */
+	if (eeh_pe_update_freeze_counter(pe) < 0)
+		goto perm_error;
 	pr_warn("EEH: This PCI device has failed %d times in the last hour\n",
 		pe->freeze_count);
 
@@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 
 	return;
 
-excess_failures:
-	/*
-	 * About 90% of all real-life EEH failures in the field
-	 * are due to poorly seated PCI cards. Only 10% or so are
-	 * due to actual, failed cards.
-	 */
-	pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
-	       "last hour and has been permanently disabled.\n"
-	       "Please try reseating or replacing it.\n",
-		pe->phb->global_number, pe->addr,
-		pe->freeze_count);
-	goto perm_error;
-
 hard_fail:
 	pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
 	       "Please try reseating or replacing it\n",
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index cc4b206..cf70a8b 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,30 +504,46 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 }
 
 /**
- * eeh_pe_update_time_stamp - Update PE's frozen time stamp
+ * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
+ * and freeze counter
  * @pe: EEH PE
  *
- * We have time stamp for each PE to trace its time of getting
- * frozen in last hour. The function should be called to update
- * the time stamp on first error of the specific PE. On the other
- * handle, we needn't account for errors happened in last hour.
+ * We have a freeze-counter and time stamp for each PE to trace
+ * number of times the PE was frozen in last one hour. This function
+ * updates the PE's freeze counter and if its > eeh_max_freezes then
+ * returns an error. The function should be called to once every-time
+ * a specific PE freezes.
  */
-void eeh_pe_update_time_stamp(struct eeh_pe *pe)
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
 {
 	struct timeval tstamp;
 
-	if (!pe) return;
+	if (!pe)
+		return -EINVAL;
 
-	if (pe->freeze_count <= 0) {
-		pe->freeze_count = 0;
-		do_gettimeofday(&pe->tstamp);
-	} else {
-		do_gettimeofday(&tstamp);
-		if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
-			pe->tstamp = tstamp;
-			pe->freeze_count = 0;
-		}
-	}
+	do_gettimeofday(&tstamp);
+	if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
+		pe->tstamp = tstamp;
+		pe->freeze_count = 1;
+
+	} else if (pe->freeze_count >= eeh_max_freezes) {
+		pe->freeze_count++;
+		/*
+		 * About 90% of all real-life EEH failures in the field
+		 * are due to poorly seated PCI cards. Only 10% or so are
+		 * due to actual, failed cards.
+		 */
+		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
+		       "last hour and has been permanently disabled.\n"
+		       "Please try reseating or replacing it.\n",
+		       pe->phb->global_number, pe->addr,
+		       pe->freeze_count);
+		return -ENOTRECOVERABLE;
+
+	} else
+		pe->freeze_count++;
+
+	return 0;
 }
 
 /**
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter
  2017-02-28  7:02 [RFC 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
  2017-02-28  7:02 ` [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count Vaibhav Jain
@ 2017-02-28  7:02 ` Vaibhav Jain
  2017-02-28 23:54   ` Russell Currey
  2017-02-28  7:22 ` [RFC 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
  2 siblings, 1 reply; 7+ messages in thread
From: Vaibhav Jain @ 2017-02-28  7:02 UTC (permalink / raw)
  To: linuxppc-dev, Russell Currey, frederic.barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Gregory Kurz, Gavin Shan

This patch introduces function eeh_pe_reset_freeze_counter which can
be used to reset the PE's freeze count variable outside eeh code. This
is useful for devices that can acquire a different personality after
a PERST event (e.g FPGA Adapters). Presently an existing freeze
count for an adapter with personality N will be taken into account
when the adapter acquired personality N+1.

By calling eeh_pe_reset_freeze_counter drivers can reset the freeze
counter for an adapter once it has acquired a new personality and
ideally wont be plagued by the failures similar to the one before.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 68806be..19ac6d0 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -266,6 +266,13 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
 int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
+
+/* Reset the PE freeze counter */
+static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe)
+{
+	pe->freeze_count = 0;
+}
+
 void *eeh_pe_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
@@ -339,6 +346,8 @@ static inline int eeh_check_failure(const volatile void __iomem *token)
 	return 0;
 }
 
+static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe) { }
+
 #define eeh_dev_check_failure(x) (0)
 
 static inline void eeh_addr_cache_build(void) { }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image
  2017-02-28  7:02 [RFC 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
  2017-02-28  7:02 ` [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count Vaibhav Jain
  2017-02-28  7:02 ` [RFC 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter Vaibhav Jain
@ 2017-02-28  7:22 ` Vaibhav Jain
  2017-03-01  0:06   ` Russell Currey
  2 siblings, 1 reply; 7+ messages in thread
From: Vaibhav Jain @ 2017-02-28  7:22 UTC (permalink / raw)
  To: linuxppc-dev, frederic.barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Gregory Kurz, Gavin Shan, Russell Currey

The patch resets the freeze counter on eeh_pe struct for PHB
associated with the cxl pci adapter. This would enable re-flashing of
the cxl-adapter beyond the default limit of 5.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
 drivers/misc/cxl/pci.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 679afc9..3b14688 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -22,6 +22,7 @@
 #include <asm/pnv-pci.h>
 #include <asm/io.h>
 #include <asm/reg.h>
+#include <asm/eeh.h>
 
 #include "cxl.h"
 #include <misc/cxl.h>
@@ -1229,6 +1230,8 @@ static void cxl_pci_remove_afu(struct cxl_afu *afu)
 int cxl_pci_reset(struct cxl *adapter)
 {
 	struct pci_dev *dev = to_pci_dev(adapter->dev.parent);
+	struct eeh_dev *eehdev = pci_dev_to_eeh_dev(dev);
+	struct eeh_pe *devpe = eeh_dev_to_pe(eehdev);
 	int rc;
 
 	if (adapter->perst_same_image) {
@@ -1242,6 +1245,18 @@ int cxl_pci_reset(struct cxl *adapter)
 	/* the adapter is about to be reset, so ignore errors */
 	cxl_data_cache_flush(adapter);
 
+	/* If loading a new image, reset freeze counters for the PHB
+	 * associated with the adapter.
+	 */
+	if (devpe && adapter->perst_loads_image) {
+		/* Find the pe associated with the device PHB */
+		while (devpe->parent != NULL && (devpe->type & EEH_PE_PHB) == 0)
+			devpe = devpe->parent;
+
+		dev_info(&dev->dev, "Resetting freeze counters for the PHB\n");
+		eeh_pe_reset_freeze_counter(devpe);
+	}
+
 	/* pcie_warm_reset requests a fundamental pci reset which includes a
 	 * PERST assert/deassert.  PERST triggers a loading of the image
 	 * if "user" or "factory" is selected in sysfs */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count
  2017-02-28  7:02 ` [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count Vaibhav Jain
@ 2017-02-28 23:44   ` Russell Currey
  0 siblings, 0 replies; 7+ messages in thread
From: Russell Currey @ 2017-02-28 23:44 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, frederic.barrat
  Cc: Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Gregory Kurz, Gavin Shan

On Tue, 2017-02-28 at 12:32 +0530, Vaibhav Jain wrote:
> This patch introduces a new function named
> eeh_pe_update_freeze_counter replacing existing function
> eeh_pe_update_time_stamp. The new function also manages the value of
> freeze_count along with tstamp to track the number of times the PE
> froze in last one hour and if the freeze_count > eeh_max_freezes then
> reports an error(-ENOTRECOVERABLE) to indicate that the PE should be
> permanently disabled.

You should add parens to the end of function names in commit messages, i.e.
eeh_pe_update_freeze_counter().  Same goes for the title

> 
> This patch should not introduce any behavioral change.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/eeh.h   |  2 +-
>  arch/powerpc/kernel/eeh_driver.c | 20 +++-------------
>  arch/powerpc/kernel/eeh_pe.c     | 50 ++++++++++++++++++++++++++-------------
> -
>  3 files changed, 37 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 8e37b71..68806be 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
>  struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
>  int eeh_add_to_parent_pe(struct eeh_dev *edev);
>  int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
> -void eeh_pe_update_time_stamp(struct eeh_pe *pe);
> +int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
>  void *eeh_pe_traverse(struct eeh_pe *root,
>  		eeh_traverse_func fn, void *flag);
>  void *eeh_pe_dev_traverse(struct eeh_pe *root,
> diff --git a/arch/powerpc/kernel/eeh_driver.c
> b/arch/powerpc/kernel/eeh_driver.c
> index b948871..326b4e4 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
>  		return;
>  	}
>  
> -	eeh_pe_update_time_stamp(pe);
> -	pe->freeze_count++;
> -	if (pe->freeze_count > eeh_max_freezes)
> -		goto excess_failures;
> +	/* Update freeze counters and see if we have tripped max-freeze limit
> */
> +	if (eeh_pe_update_freeze_counter(pe) < 0)

I would prefer dropping the "< 0" check here

> +		goto perm_error;
>  	pr_warn("EEH: This PCI device has failed %d times in the last
> hour\n",
>  		pe->freeze_count);
>  
> @@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
>  
>  	return;
>  
> -excess_failures:
> -	/*
> -	 * About 90% of all real-life EEH failures in the field
> -	 * are due to poorly seated PCI cards. Only 10% or so are
> -	 * due to actual, failed cards.
> -	 */
> -	pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
> -	       "last hour and has been permanently disabled.\n"
> -	       "Please try reseating or replacing it.\n",
> -		pe->phb->global_number, pe->addr,
> -		pe->freeze_count);
> -	goto perm_error;
> -
>  hard_fail:
>  	pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
>  	       "Please try reseating or replacing it\n",
> diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
> index cc4b206..cf70a8b 100644
> --- a/arch/powerpc/kernel/eeh_pe.c
> +++ b/arch/powerpc/kernel/eeh_pe.c
> @@ -504,30 +504,46 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  }
>  
>  /**
> - * eeh_pe_update_time_stamp - Update PE's frozen time stamp
> + * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
> + * and freeze counter
>   * @pe: EEH PE
>   *
> - * We have time stamp for each PE to trace its time of getting
> - * frozen in last hour. The function should be called to update
> - * the time stamp on first error of the specific PE. On the other
> - * handle, we needn't account for errors happened in last hour.
> + * We have a freeze-counter and time stamp for each PE to trace

drop the hyphen between freeze and counter

> + * number of times the PE was frozen in last one hour. This function

this should probably "in the last hour", I know it was copied from the existing
comment but since you're in here, you may as well fix it

> + * updates the PE's freeze counter and if its > eeh_max_freezes then

this reads awkwardly.  perhaps "This function updates the PE's freeze counter
and returns an error if the number of freezes is greater than eeh_max_freezes." 

> + * returns an error. The function should be called to once every-time
> + * a specific PE freezes.

a hyphen between "every time" is unnecessary

>   */
> -void eeh_pe_update_time_stamp(struct eeh_pe *pe)
> +int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
>  {
>  	struct timeval tstamp;
>  
> -	if (!pe) return;
> +	if (!pe)
> +		return -EINVAL;
>  
> -	if (pe->freeze_count <= 0) {
> -		pe->freeze_count = 0;
> -		do_gettimeofday(&pe->tstamp);
> -	} else {
> -		do_gettimeofday(&tstamp);
> -		if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
> -			pe->tstamp = tstamp;
> -			pe->freeze_count = 0;
> -		}
> -	}
> +	do_gettimeofday(&tstamp);
> +	if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec >
> 3600) {
> +		pe->tstamp = tstamp;
> +		pe->freeze_count = 1;
> +
> +	} else if (pe->freeze_count >= eeh_max_freezes) {
> +		pe->freeze_count++;
> +		/*
> +		 * About 90% of all real-life EEH failures in the field
> +		 * are due to poorly seated PCI cards. Only 10% or so are
> +		 * due to actual, failed cards.
> +		 */
> +		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
> +		       "last hour and has been permanently disabled.\n"
> +		       "Please try reseating or replacing it.\n",
> +		       pe->phb->global_number, pe->addr,
> +		       pe->freeze_count);
> +		return -ENOTRECOVERABLE;
> +
> +	} else
> +		pe->freeze_count++;

Wrap this in brackets too.  It looks messy hanging off the end, even though it's
a single line statement.

> +
> +	return 0;
>  }
>  
>  /**

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter
  2017-02-28  7:02 ` [RFC 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter Vaibhav Jain
@ 2017-02-28 23:54   ` Russell Currey
  0 siblings, 0 replies; 7+ messages in thread
From: Russell Currey @ 2017-02-28 23:54 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, frederic.barrat
  Cc: Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Gregory Kurz, Gavin Shan

On Tue, 2017-02-28 at 12:32 +0530, Vaibhav Jain wrote:
> This patch introduces function eeh_pe_reset_freeze_counter which can
> be used to reset the PE's freeze count variable outside eeh code. This
> is useful for devices that can acquire a different personality after
> a PERST event (e.g FPGA Adapters). Presently an existing freeze
> count for an adapter with personality N will be taken into account
> when the adapter acquired personality N+1.
> 
> By calling eeh_pe_reset_freeze_counter drivers can reset the freeze
> counter for an adapter once it has acquired a new personality and
> ideally wont be plagued by the failures similar to the one before.

Same comment as before about adding () to the end of function names

> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/eeh.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 68806be..19ac6d0 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -266,6 +266,13 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
>  int eeh_add_to_parent_pe(struct eeh_dev *edev);
>  int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
>  int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
> +
> +/* Reset the PE freeze counter */

I would like to see a comment here noting that doing this is in general a bad
idea, and this shouldn't be called for regular PCI devices.

> +static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe)
> +{
> +	pe->freeze_count = 0;
> +}
> +
>  void *eeh_pe_traverse(struct eeh_pe *root,
>  		eeh_traverse_func fn, void *flag);
>  void *eeh_pe_dev_traverse(struct eeh_pe *root,
> @@ -339,6 +346,8 @@ static inline int eeh_check_failure(const volatile void
> __iomem *token)
>  	return 0;
>  }
>  
> +static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe) { }
> +
>  #define eeh_dev_check_failure(x) (0)
>  
>  static inline void eeh_addr_cache_build(void) { }

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image
  2017-02-28  7:22 ` [RFC 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
@ 2017-03-01  0:06   ` Russell Currey
  0 siblings, 0 replies; 7+ messages in thread
From: Russell Currey @ 2017-03-01  0:06 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, frederic.barrat
  Cc: Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Gregory Kurz, Gavin Shan

[resending this since it didn't get delivered to the list]

On Tue, 2017-02-28 at 12:52 +0530, Vaibhav Jain wrote:
> The patch resets the freeze counter on eeh_pe struct for PHB
> associated with the cxl pci adapter. This would enable re-flashing of
> the cxl-adapter beyond the default limit of 5.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
>  drivers/misc/cxl/pci.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
> index 679afc9..3b14688 100644
> --- a/drivers/misc/cxl/pci.c
> +++ b/drivers/misc/cxl/pci.c
> @@ -22,6 +22,7 @@
>  #include <asm/pnv-pci.h>
>  #include <asm/io.h>
>  #include <asm/reg.h>
> +#include <asm/eeh.h>
>  
>  #include "cxl.h"
>  #include <misc/cxl.h>
> @@ -1229,6 +1230,8 @@ static void cxl_pci_remove_afu(struct cxl_afu *afu)
>  int cxl_pci_reset(struct cxl *adapter)
>  {
>  	struct pci_dev *dev = to_pci_dev(adapter->dev.parent);
> +	struct eeh_dev *eehdev = pci_dev_to_eeh_dev(dev);
> +	struct eeh_pe *devpe = eeh_dev_to_pe(eehdev);

EEH code typically uses "edev" and "pe" for these variable names

>  	int rc;
>  
>  	if (adapter->perst_same_image) {
> @@ -1242,6 +1245,18 @@ int cxl_pci_reset(struct cxl *adapter)
>  	/* the adapter is about to be reset, so ignore errors */
>  	cxl_data_cache_flush(adapter);
>  
> +	/* If loading a new image, reset freeze counters for the PHB
> +	 * associated with the adapter.
> +	 */
> +	if (devpe && adapter->perst_loads_image) {
> +		/* Find the pe associated with the device PHB */
> +		while (devpe->parent != NULL && (devpe->type & EEH_PE_PHB) ==
> 0)
> +			devpe = devpe->parent;
> +
> +		dev_info(&dev->dev, "Resetting freeze counters for the
> PHB\n");

Would be good to mention "EEH" here to help with grepping, alternatively a
similar message could be printed in eeh_pe_reset_freeze_counter() displaying the
PHB information.

> +		eeh_pe_reset_freeze_counter(devpe);
> +	}
> +
>  	/* pcie_warm_reset requests a fundamental pci reset which includes a
>  	 * PERST assert/deassert.  PERST triggers a loading of the image
>  	 * if "user" or "factory" is selected in sysfs */

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-03-01  0:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-28  7:02 [RFC 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
2017-02-28  7:02 ` [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count Vaibhav Jain
2017-02-28 23:44   ` Russell Currey
2017-02-28  7:02 ` [RFC 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter Vaibhav Jain
2017-02-28 23:54   ` Russell Currey
2017-02-28  7:22 ` [RFC 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
2017-03-01  0:06   ` Russell Currey

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.