* [PATCH] Revert "powerpc/eeh: Don't unfreeze PHB PE after reset"
@ 2015-12-03  7:16 Andrew Donnellan
  2015-12-03 20:46 ` Daniel Axtens
  2015-12-08  5:59 ` [PATCH v2] " Andrew Donnellan
  0 siblings, 2 replies; 4+ messages in thread
From: Andrew Donnellan @ 2015-12-03  7:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: gwshan, imunsie

This reverts commit 527d10ef3a315d3cb9dc098dacd61889a6c26439.

The reverted commit breaks cxlflash devices following an EEH reset.
Attempting to load the cxlflash driver after a reset results in a call to
pci_read_vpd() returning -ENODEV, causing driver initialisation to fail.

At this stage, we don't fully understand why this is happening, and we
also haven't tested whether this occurs for other cxl devices. In the
meantime, though, revert the commit, especially as it was intended to be a
non-functional change.

Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

---

This issue was identified by bisection following breakage observed in
4.4-rc1. I'm continuing to investigate the root cause (and testing on cxl
devices other than cxlflash), as the commit in question shouldn't have
caused problems.
---
 arch/powerpc/kernel/eeh_driver.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 80dfe89..8d14feb 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -590,16 +590,10 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus)
 	eeh_ops->configure_bridge(pe);
 	eeh_pe_restore_bars(pe);
 
-	/*
-	 * If it's PHB PE, the frozen state on all available PEs should have
-	 * been cleared by the PHB reset. Otherwise, we unfreeze the PE and its
-	 * child PEs because they might be in frozen state.
-	 */
-	if (!(pe->type & EEH_PE_PHB)) {
-		rc = eeh_clear_pe_frozen_state(pe, false);
-		if (rc)
-			return rc;
-	}
+	/* Clear frozen state */
+	rc = eeh_clear_pe_frozen_state(pe, false);
+	if (rc)
+		return rc;
 
 	/* Give the system 5 seconds to finish running the user-space
 	 * hotplug shutdown scripts, e.g. ifdown for ethernet.  Yes,
-- 
Andrew Donnellan              Software Engineer, OzLabs
andrew.donnellan@au1.ibm.com  Australia Development Lab, Canberra
+61 2 6201 8874 (work)        IBM Australia Limited

* Re: [PATCH] Revert "powerpc/eeh: Don't unfreeze PHB PE after reset"
  2015-12-03  7:16 [PATCH] Revert "powerpc/eeh: Don't unfreeze PHB PE after reset" Andrew Donnellan
@ 2015-12-03 20:46 ` Daniel Axtens
  2015-12-08  5:59 ` [PATCH v2] " Andrew Donnellan
  1 sibling, 0 replies; 4+ messages in thread
From: Daniel Axtens @ 2015-12-03 20:46 UTC (permalink / raw)
  To: Andrew Donnellan, linuxppc-dev; +Cc: gwshan, imunsie

Hi Andrew and Gavin,

To flesh out Andrew's commit message, and add some surrounding detail to
help in debugging:

Before 527d10ef3a315, all PEs were unfrozen when we called
eeh_reset_device(). That patch changed behaviour to skip the PE
associated with or reserved for the PHB. Indeed, this shouldn't have
made a functional change, because no resources required for the device
should be associated with the PHB's PE.

Obviously, however, it does change behaviour.

We found that it not only breaks the cxlflash driver: we can also test
for the breakage by running cat on the relevant vpd file in /sys/.
Before doing a reset, you can cat the file without issue.
After a reset, catting the file fails with -ENODEV.
Curiously, lspci succeeds both before and after. However, lspci doesn't
read the file from start to end; it only reads certain bytes. So there
are some bytes in the file that trigger -ENODEV after a reset.
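
As a rough aid for narrowing down which bytes fail, the following is a
minimal userspace sketch that reads a file sequentially in small chunks
and reports the offset of the first failing read. The function names
(read_whole, probe_file) and the 16-byte chunk size are assumptions made
for illustration; they are not part of any kernel or pciutils API, and
the offset reported is only as precise as the chunk size.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for pread() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: read fd from offset 0 to EOF in chunks of at
 * most `chunk` bytes. Returns the total number of bytes read, or a
 * negative errno value, in which case *fail_off holds the offset of
 * the read that failed.
 */
long read_whole(int fd, size_t chunk, long *fail_off)
{
	char buf[256];
	long off = 0;

	assert(fail_off != NULL);
	if (chunk == 0 || chunk > sizeof(buf))
		chunk = sizeof(buf);

	for (;;) {
		ssize_t n = pread(fd, buf, chunk, (off_t)off);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry same offset */
			*fail_off = off;
			return -(long)errno;
		}
		if (n == 0)
			return off;		/* EOF reached */
		off += n;
	}
}

/* Hypothetical wrapper: open a path, probe it, and print the result. */
int probe_file(const char *path)
{
	long fail_off = 0;
	long ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return -1;
	}
	ret = read_whole(fd, 16, &fail_off);
	close(fd);
	if (ret < 0) {
		printf("read failed at offset %ld: %s\n",
		       fail_off, strerror((int)-ret));
		return -1;
	}
	printf("read %ld bytes without error\n", ret);
	return 0;
}
```

Calling probe_file() on the device's vpd file in /sys/ before and after a
reset should show whether the failure always starts at the same offset.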

We know that it is failing to read bytes in regular config space (not
CAPI's magic MMIO fake config space for AFUs: we're still in regular PCI
land at this point). What we don't know is precisely what those bytes
are, and why those bytes are (seemingly) being associated with the PHB's
PE.

I had some theories, and maybe Andrew can update this list:
    - CAPI code is doing something wrong.
    - There's a bug in PCI resource allocation or the mapping of
      resources to PEs such that something is hitting space assigned for
      PE0.
    - CAPI is 'special' and requires PE#0 to be unfrozen.

This patch fixes the symptom, not the problem; we obviously need to
find the root cause in order to fix it. However, if we are no closer to
figuring it out soon, we should probably take this patch so as not to
release a 4.4 that breaks CAPI.

Congrats to Andrew, btw, for apparently being the only CAPI developer
who is testing CAPI stuff against mainline at the moment.

Regards,
Daniel

> This reverts commit 527d10ef3a315d3cb9dc098dacd61889a6c26439.
>
> The reverted commit breaks cxlflash devices following an EEH reset.
> Attempting to load the cxlflash driver after a reset results in a call to
> pci_read_vpd() returning -ENODEV, causing driver initialisation to fail.
>
> At this stage, we don't fully understand why this is happening, and we
> also haven't tested whether this occurs for other cxl devices. In the
> meantime, though, revert the commit, especially as it was intended to be a
> non-functional change.
>
> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>
> ---
>
> This issue was identified by bisection following breakage observed in
> 4.4-rc1. I'm continuing to investigate the root cause (and testing on cxl
> devices other than cxlflash), as the commit in question shouldn't have
> caused problems.
> ---
>  arch/powerpc/kernel/eeh_driver.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 80dfe89..8d14feb 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -590,16 +590,10 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus)
>  	eeh_ops->configure_bridge(pe);
>  	eeh_pe_restore_bars(pe);
>  
> -	/*
> -	 * If it's PHB PE, the frozen state on all available PEs should have
> -	 * been cleared by the PHB reset. Otherwise, we unfreeze the PE and its
> -	 * child PEs because they might be in frozen state.
> -	 */
> -	if (!(pe->type & EEH_PE_PHB)) {
> -		rc = eeh_clear_pe_frozen_state(pe, false);
> -		if (rc)
> -			return rc;
> -	}
> +	/* Clear frozen state */
> +	rc = eeh_clear_pe_frozen_state(pe, false);
> +	if (rc)
> +		return rc;
>  
>  	/* Give the system 5 seconds to finish running the user-space
>  	 * hotplug shutdown scripts, e.g. ifdown for ethernet.  Yes,
> -- 
> Andrew Donnellan              Software Engineer, OzLabs
> andrew.donnellan@au1.ibm.com  Australia Development Lab, Canberra
> +61 2 6201 8874 (work)        IBM Australia Limited

* [PATCH v2] Revert "powerpc/eeh: Don't unfreeze PHB PE after reset"
  2015-12-03  7:16 [PATCH] Revert "powerpc/eeh: Don't unfreeze PHB PE after reset" Andrew Donnellan
  2015-12-03 20:46 ` Daniel Axtens
@ 2015-12-08  5:59 ` Andrew Donnellan
  2015-12-14  9:46   ` [v2] " Michael Ellerman
  1 sibling, 1 reply; 4+ messages in thread
From: Andrew Donnellan @ 2015-12-08  5:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan, Ian Munsie, Daniel Axtens

This reverts commit 527d10ef3a315d3cb9dc098dacd61889a6c26439.

The reverted commit breaks cxlflash devices following an EEH reset (and
possibly other cxl devices, although this has not been tested).

The reverted commit changed the behaviour of eeh_reset_device() so that PHB
PEs are not unfrozen following the completion of the reset. This should not
be problematic, as no device resources should have been associated with the
PHB PE.

However, when attempting to load the cxlflash driver after a reset, the
driver attempts to read Vital Product Data through a call to
pci_read_vpd() (which is called on the physical cxl device, not on the
virtual AFU device). pci_read_vpd() in turn attempts to read from the cxl
device's config space. This fails, as the PE it's trying to read from is
still frozen. In turn, the driver gets an -ENODEV and fails to initialise.
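
The difference between the two behaviours can be sketched with a small
userspace model. This is not kernel code: struct model_pe, MODEL_PE_PHB
and the model_* functions are hypothetical stand-ins for struct eeh_pe,
EEH_PE_PHB, eeh_clear_pe_frozen_state() and the config-space read path.
The model also assumes that the PHB reset by itself leaves the PE marked
frozen, which matches the observed symptom but is not a confirmed root
cause.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PE_PHB 0x1	/* stand-in for EEH_PE_PHB */

/* Hypothetical, heavily simplified stand-in for struct eeh_pe. */
struct model_pe {
	int type;
	bool frozen;
};

/* Stand-in for eeh_clear_pe_frozen_state(): always succeeds here. */
int model_clear_frozen(struct model_pe *pe)
{
	assert(pe != NULL);
	pe->frozen = false;
	return 0;
}

/* Behaviour with 527d10ef3a31 applied: PHB PEs are skipped. */
int model_reset_skip_phb(struct model_pe *pe)
{
	if (!(pe->type & MODEL_PE_PHB))
		return model_clear_frozen(pe);
	return 0;
}

/* Behaviour after the revert: every PE is unfrozen. */
int model_reset_always_unfreeze(struct model_pe *pe)
{
	return model_clear_frozen(pe);
}

/* Assumption: a config read against a frozen PE fails with -ENODEV. */
int model_config_read(const struct model_pe *pe)
{
	return pe->frozen ? -ENODEV : 0;
}
```

In this model, a frozen PHB PE passed through model_reset_skip_phb()
still fails model_config_read() with -ENODEV, while the same PE passed
through model_reset_always_unfreeze() reads successfully.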

It appears this issue only affects some parts of the VPD area, as "lspci
-vvv", which only reads a subset of the VPD bytes, is not broken by the
original patch.

At this stage, we don't fully understand why we're trying to read a frozen
PE, and we don't know how this affects other cxl devices. It is possible
that there is an underlying bug in the cxl driver or the powerpc CAPI
support code, or alternatively a bug in the PCI resource allocation/mapping
code that is incorrectly mapping resources to PE#0.

As such, this fix is incomplete, but it is necessary to prevent a
serious regression in CAPI support.

In the meantime, revert the commit, especially as it was intended to be a
non-functional change.

Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
Cc: Ian Munsie <imunsie@au1.ibm.com>
Cc: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

---

Changes from V1:
 - Updated commit message to incorporate comments from Daniel Axtens

We would like to get this fix pushed as soon as possible - I'm on leave for
the rest of this week, so I'll volunteer Daniel to answer any remaining
questions, and I'm happy for him to revise this patch further if needed.
---
 arch/powerpc/kernel/eeh_driver.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 80dfe89..8d14feb 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -590,16 +590,10 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus)
 	eeh_ops->configure_bridge(pe);
 	eeh_pe_restore_bars(pe);
 
-	/*
-	 * If it's PHB PE, the frozen state on all available PEs should have
-	 * been cleared by the PHB reset. Otherwise, we unfreeze the PE and its
-	 * child PEs because they might be in frozen state.
-	 */
-	if (!(pe->type & EEH_PE_PHB)) {
-		rc = eeh_clear_pe_frozen_state(pe, false);
-		if (rc)
-			return rc;
-	}
+	/* Clear frozen state */
+	rc = eeh_clear_pe_frozen_state(pe, false);
+	if (rc)
+		return rc;
 
 	/* Give the system 5 seconds to finish running the user-space
 	 * hotplug shutdown scripts, e.g. ifdown for ethernet.  Yes,
-- 
Andrew Donnellan              Software Engineer, OzLabs
andrew.donnellan@au1.ibm.com  Australia Development Lab, Canberra
+61 2 6201 8874 (work)        IBM Australia Limited

* Re: [v2] Revert "powerpc/eeh: Don't unfreeze PHB PE after reset"
  2015-12-08  5:59 ` [PATCH v2] " Andrew Donnellan
@ 2015-12-14  9:46   ` Michael Ellerman
  0 siblings, 0 replies; 4+ messages in thread
From: Michael Ellerman @ 2015-12-14  9:46 UTC (permalink / raw)
  To: Andrew Donnellan, linuxppc-dev; +Cc: Daniel Axtens, Gavin Shan, Ian Munsie

On Tue, 2015-12-08 at 05:59:25 UTC, Andrew Donnellan wrote:
> This reverts commit 527d10ef3a315d3cb9dc098dacd61889a6c26439.
> 
> The reverted commit breaks cxlflash devices following an EEH reset (and
> possibly other cxl devices, although this has not been tested).
...
> 
> In the meantime, revert the commit, especially as it was intended to be a
> non-functional change.
> 
> Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
> Cc: Ian Munsie <imunsie@au1.ibm.com>
> Cc: Daniel Axtens <dja@axtens.net>
> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/dc9c41bd9ece090b54eb8f1b

cheers

