linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected
@ 2019-10-16  1:25 Oliver O'Halloran
  2019-10-16  3:44 ` Sam Bobroff
  2020-01-29  5:17 ` Michael Ellerman
  0 siblings, 2 replies; 3+ messages in thread
From: Oliver O'Halloran @ 2019-10-16  1:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Sam Bobroff, Oliver O'Halloran

Many drivers don't check for errors when they get a 0xFFs response from an
MMIO load. As a result, after an EEH event occurs a driver can get stuck in
a polling loop unless it has some kind of internal timeout logic.

Currently EEH tries to detect and report stuck drivers by dumping a stack
trace after eeh_dev_check_failure() is called EEH_MAX_FAILS times on an
already frozen PE. The value of EEH_MAX_FAILS was chosen so that a dump
would occur every few seconds if the driver was spinning in a loop. This
results in a lot of spurious stack traces in the kernel log.

Fix this by limiting it to printing one stack trace for each PE freeze. If
the driver is truly stuck the kernel's hung task detector is better suited
to reporting the problem anyway.

Cc: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 arch/powerpc/kernel/eeh.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index bc8a551013be..c35069294ecf 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -503,7 +503,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	rc = 1;
 	if (pe->state & EEH_PE_ISOLATED) {
 		pe->check_count++;
-		if (pe->check_count % EEH_MAX_FAILS == 0) {
+		if (pe->check_count == EEH_MAX_FAILS) {
 			dn = pci_device_to_OF_node(dev);
 			if (dn)
 				location = of_get_property(dn, "ibm,loc-code",
-- 
2.21.0
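
The effect of the one-line change can be seen in a small userspace sketch.
This is only an illustration, not kernel code: DEMO_MAX_FAILS is a
hypothetical small stand-in for EEH_MAX_FAILS, and the two helper functions
are invented names that model the counter for a single PE freeze (the sketch
assumes the counter starts from zero for each freeze, which the diff itself
does not show).

```c
#include <stdio.h>

/* Hypothetical threshold for illustration; not the kernel's EEH_MAX_FAILS. */
#define DEMO_MAX_FAILS 5

/*
 * Old condition: a report fires every DEMO_MAX_FAILS-th failed check,
 * so a spinning driver keeps producing stack dumps.
 */
static int dumps_with_modulo(int checks)
{
	int check_count = 0, dumps = 0;

	for (int i = 0; i < checks; i++) {
		check_count++;
		if (check_count % DEMO_MAX_FAILS == 0)
			dumps++;
	}
	return dumps;
}

/*
 * New condition: a report fires only when the counter first reaches
 * DEMO_MAX_FAILS, so at most one dump per freeze in this model.
 */
static int dumps_with_equality(int checks)
{
	int check_count = 0, dumps = 0;

	for (int i = 0; i < checks; i++) {
		check_count++;
		if (check_count == DEMO_MAX_FAILS)
			dumps++;
	}
	return dumps;
}
```

Calling both helpers with the same number of failed checks shows the
difference: the modulo form reports once per DEMO_MAX_FAILS checks, while the
equality form reports at most once no matter how long the loop spins.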



* Re: [PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected
  2019-10-16  1:25 [PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected Oliver O'Halloran
@ 2019-10-16  3:44 ` Sam Bobroff
  2020-01-29  5:17 ` Michael Ellerman
  1 sibling, 0 replies; 3+ messages in thread
From: Sam Bobroff @ 2019-10-16  3:44 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

On Wed, Oct 16, 2019 at 12:25:36PM +1100, Oliver O'Halloran wrote:
> Many drivers don't check for errors when they get a 0xFFs response from an
> MMIO load. As a result, after an EEH event occurs a driver can get stuck in
> a polling loop unless it has some kind of internal timeout logic.
> 
> Currently EEH tries to detect and report stuck drivers by dumping a stack
> trace after eeh_dev_check_failure() is called EEH_MAX_FAILS times on an
> already frozen PE. The value of EEH_MAX_FAILS was chosen so that a dump
> would occur every few seconds if the driver was spinning in a loop. This
> results in a lot of spurious stack traces in the kernel log.
> 
> Fix this by limiting it to printing one stack trace for each PE freeze. If
> the driver is truly stuck the kernel's hung task detector is better suited
> to reporting the probelm anyway.
problem
> 
> Cc: Sam Bobroff <sbobroff@linux.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks good to me (especially because if it's stuck in a loop the stack
trace is going to be pretty much the same every time). I tested it by
recovering a device that uses the mlx5_core driver.

Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Tested-by: Sam Bobroff <sbobroff@linux.ibm.com>
> ---
>  arch/powerpc/kernel/eeh.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index bc8a551013be..c35069294ecf 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -503,7 +503,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>  	rc = 1;
>  	if (pe->state & EEH_PE_ISOLATED) {
>  		pe->check_count++;
> -		if (pe->check_count % EEH_MAX_FAILS == 0) {
> +		if (pe->check_count == EEH_MAX_FAILS) {
>  			dn = pci_device_to_OF_node(dev);
>  			if (dn)
>  				location = of_get_property(dn, "ibm,loc-code",
> -- 
> 2.21.0
> 


* Re: [PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected
  2019-10-16  1:25 [PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected Oliver O'Halloran
  2019-10-16  3:44 ` Sam Bobroff
@ 2020-01-29  5:17 ` Michael Ellerman
  1 sibling, 0 replies; 3+ messages in thread
From: Michael Ellerman @ 2020-01-29  5:17 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: Sam Bobroff, Oliver O'Halloran

On Wed, 2019-10-16 at 01:25:36 UTC, Oliver O'Halloran wrote:
> Many drivers don't check for errors when they get a 0xFFs response from an
> MMIO load. As a result, after an EEH event occurs a driver can get stuck in
> a polling loop unless it has some kind of internal timeout logic.
> 
> Currently EEH tries to detect and report stuck drivers by dumping a stack
> trace after eeh_dev_check_failure() is called EEH_MAX_FAILS times on an
> already frozen PE. The value of EEH_MAX_FAILS was chosen so that a dump
> would occur every few seconds if the driver was spinning in a loop. This
> results in a lot of spurious stack traces in the kernel log.
> 
> Fix this by limiting it to printing one stack trace for each PE freeze. If
> the driver is truly stuck the kernel's hung task detector is better suited
> to reporting the problem anyway.
> 
> Cc: Sam Bobroff <sbobroff@linux.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/4e0942c0302b5ad76b228b1a7b8c09f658a1d58a

cheers
