[RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST

From: Vaibhav Jain @ 2017-03-01 11:08 UTC
  To: linuxppc-dev, Russell Currey, Frederic Barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
	Philippe Bergheaud, Greg Kurz, Gavin Shan

v2 changes:
* Moved the definition of eeh_pe_reset_freeze_counter() from eeh.h to eeh_pe.c
  to avoid adding a header dependency on 'pci-bridge.h'. The function is now
  exported as a GPL-only symbol.

* Incorporated changes as suggested by Russell Currey:
- Inserted logging for PHB and PE number inside eeh_pe_reset_freeze_counter()
- Suffixed all the function names used in comments/patch-descriptions with '()'
- Removed an un-needed conditional check of '<0' in eeh_handle_normal_event()
- Rephrased the function comment for eeh_pe_update_freeze_counter() and
  eeh_pe_reset_freeze_counter()
- Brace-wrapped a single line statement at end of eeh_pe_update_freeze_counter()

v1:
Presently, to flash a cxl adapter with a new FPGA image, a warm PCIe reset is
requested on the adapter once the bitstream has been loaded into the card's
flash memory. This issues a PCI fundamental reset to the card slot, signaling
the card controller to reconfigure the FPGA with the new bitstream. However, a
PCI fundamental reset of the slot also results in a fenced PHB, which raises an
EEH event and triggers the core EEH flow.

The core EEH code also maintains a counter named freeze_count for each PE
inside struct eeh_pe. The counter is incremented every time an EEH error is
reported on the PE domain, and if the counter exceeds the threshold limit, the
device is permanently disabled. The threshold limit is set by the variable
eeh_max_freezes, which can be manipulated via debugfs.
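
The counting rule described above can be sketched as a small user-space model.
This is only an illustration, not the kernel code: struct pe_model,
pe_model_record_freeze() and the MODEL_* constants are hypothetical names, and
the threshold of 5 and the one-hour window are assumptions taken from the
default mentioned below and from the "in the last hour" log messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; this is NOT the kernel's struct eeh_pe. */
struct pe_model {
	int freeze_count;   /* errors seen in the current window */
	long window_start;  /* start of the current window, in seconds */
	bool disabled;      /* set once the threshold is exceeded */
};

#define MODEL_MAX_FREEZES 5      /* assumed default threshold */
#define MODEL_WINDOW_SECS 3600L  /* assumed one-hour window */

/*
 * Record one EEH error at time 'now' (seconds).
 * Returns true while the PE is still usable, false once it is disabled.
 */
static bool pe_model_record_freeze(struct pe_model *pe, long now)
{
	assert(pe != NULL);

	if (pe->disabled)
		return false;

	/* Start a fresh window once the previous one has expired. */
	if (now - pe->window_start > MODEL_WINDOW_SECS) {
		pe->window_start = now;
		pe->freeze_count = 0;
	}

	pe->freeze_count++;
	if (pe->freeze_count > MODEL_MAX_FREEZES)
		pe->disabled = true;

	return !pe->disabled;
}
```

In this model five errors inside one window are tolerated and the sixth
disables the PE, which matches the "failed 6 times in the last hour and has
been permanently disabled" message in the test run without the patchset.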

This creates problems for cxl adapters:

* It limits the number of times an FPGA image can be re-flashed, which is by
  default 5 times per hour.

* Since the adapter can potentially acquire a new personality after each reset,
  the freeze_count accumulated by an older FPGA image shouldn't be carried over
  to a newer image.

To fix these problems, the proposed patch-set introduces a new function named
eeh_pe_reset_freeze_counter() that resets the freeze counter of the eeh_pe
struct. The cxl module can then call this function before issuing a PCI
fundamental reset to the card slot to load the new FPGA image.
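
The effect of resetting the counter before each reset can be illustrated with
a simplified user-space model. It is a sketch under stated assumptions, not the
code from the patches: struct reflash_pe and the reflash_* helpers are
hypothetical names, the threshold of 5 is the assumed default, and the hourly
window is left out because the test loop finishes well within an hour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PE; this is NOT the kernel's struct eeh_pe. */
struct reflash_pe {
	int freeze_count;  /* EEH errors counted so far */
	bool disabled;     /* set once the threshold is exceeded */
};

#define REFLASH_MAX_FREEZES 5  /* assumed default threshold */

/* Stand-in for what eeh_pe_reset_freeze_counter() is described as doing. */
static void reflash_reset_freeze_counter(struct reflash_pe *pe)
{
	assert(pe != NULL);
	pe->freeze_count = 0;
}

/* Stand-in for the EEH event raised by the fenced PHB after a reset. */
static void reflash_fence(struct reflash_pe *pe)
{
	assert(pe != NULL);

	if (pe->disabled)
		return;
	if (++pe->freeze_count > REFLASH_MAX_FREEZES)
		pe->disabled = true;
}

/*
 * Model 'flashes' back-to-back re-flashes. When 'reset_first' is true the
 * counter is cleared before each fundamental reset, as the cxl module is
 * proposed to do. Returns true if the adapter is still usable afterwards.
 */
static bool reflash_run(struct reflash_pe *pe, int flashes, bool reset_first)
{
	for (int i = 0; i < flashes; i++) {
		if (reset_first)
			reflash_reset_freeze_counter(pe);
		reflash_fence(pe);
	}
	return !pe->disabled;
}
```

With seven re-flashes, as in the test loop below, the model without the reset
ends up disabled after the sixth event, while the model with the reset stays at
a count of 1 after every event.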

Test Runs
==========
* Without the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
bash: /sys/class/cxl/card0/reset: No such file or directory
# dmesg
...
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
EEH: PHB#22-PE#0 has failed 2 times in the last hour
...
EEH: PHB#22-PE#0 has failed 3 times in the last hour
...
EEH: PHB#22-PE#0 has failed 4 times in the last hour
...
EEH: PHB#22-PE#0 has failed 5 times in the last hour
...
EEH: PHB#22-PE#0 has failed 6 times in the last hour
and has been permanently disabled.

* With the patchset:

# for i in $(seq 0 6); do echo 1 > /sys/class/cxl/card0/reset; sleep 20; done
# dmesg
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour
...
cxl-pci 0022:01:00.0: Resetting freeze counters for the PHB
EEH: Fenced PHB#22 detected, location: N/A
EEH: PHB#22-PE#0 has failed 1 times in the last hour

---
Vaibhav Jain (3):
  powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update
    freeze_count
  powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
  cxl: Reset freeze counters before adapter PERST for flashing new image

 arch/powerpc/include/asm/eeh.h   |  7 ++++-
 arch/powerpc/kernel/eeh_driver.c | 20 ++-----------
 arch/powerpc/kernel/eeh_pe.c     | 64 ++++++++++++++++++++++++++++++----------
 drivers/misc/cxl/pci.c           | 14 +++++++++
 4 files changed, 72 insertions(+), 33 deletions(-)

-- 
2.9.3
