linux-pci.vger.kernel.org archive mirror
* [PATCH] PCI/portdrv: Skip enabling AER on external facing ports
@ 2022-01-05  6:06 Kai-Heng Feng
  2022-01-05 20:12 ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Kai-Heng Feng @ 2022-01-05  6:06 UTC (permalink / raw)
  To: bhelgaas
  Cc: mika.westerberg, koba.ko, Kai-Heng Feng, Lukas Wunner,
	Stuart Hayes, Jan Kiszka, linux-pci, linux-kernel

The Thunderbolt root ports may constantly spew out uncorrected errors
from AER service:
[   30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[   30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   30.100256] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
[   30.100262] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[   30.100267] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 08000052 00000000 00000000
[   30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
[   30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
[   30.100427] pcieport 0000:00:1d.0: AER: device recovery failed

The link may not be reliable on external facing ports, so don't enable
AER on those ports.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/pci/pcie/portdrv_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index bda630889f955..d464d00ade8f2 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
 
 #ifdef CONFIG_PCIEAER
 	if (dev->aer_cap && pci_aer_available() &&
-	    (pcie_ports_native || host->native_aer)) {
+	    (pcie_ports_native || host->native_aer) &&
+	    !dev->external_facing) {
 		services |= PCIE_PORT_SERVICE_AER;
 
 		/*
-- 
2.33.1



* Re: [PATCH] PCI/portdrv: Skip enabling AER on external facing ports
  2022-01-05  6:06 [PATCH] PCI/portdrv: Skip enabling AER on external facing ports Kai-Heng Feng
@ 2022-01-05 20:12 ` Bjorn Helgaas
  2022-01-07  4:09   ` Kai-Heng Feng
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2022-01-05 20:12 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: bhelgaas, mika.westerberg, koba.ko, Lukas Wunner, Stuart Hayes,
	Jan Kiszka, linux-pci, linux-kernel

On Wed, Jan 05, 2022 at 02:06:41PM +0800, Kai-Heng Feng wrote:
> The Thunderbolt root ports may constantly spew out uncorrected errors
> from AER service:
> [   30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> [   30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [   30.100256] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> [   30.100262] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> [   30.100267] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 08000052 00000000 00000000
> [   30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [   30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> [   30.100427] pcieport 0000:00:1d.0: AER: device recovery failed

No timestamps needed here; they don't add to understanding the
problem.

> The link may not be reliable on external facing ports, so don't enable
> AER on those ports.

I'm not sure what you want to accomplish here.  If the errors are
legitimate and the result of some hardware issue like a bad cable, why
should we ignore them?  If they're caused by a software problem, we
should figure that out and fix it.

Does this occur on a specific instance of possibly flaky hardware?

You mention a spew of errors; do you think this is a single error that
we fail to clear correctly?  Or is it really many separate errors?
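
For reference when reading the quoted log, here is a minimal sketch of how
the "error status/mask=00100000/00004000" pair decodes. It assumes the bit
layout of the AER Uncorrectable Error Status register from the PCIe
specification. The table and function names are illustrative only and are
not taken from drivers/pci/pcie/aer.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One named bit of the AER Uncorrectable Error Status register. */
struct aer_uncor_bit {
	int bit;
	const char *name;
};

/* Bit positions as defined by the PCIe spec; names are descriptive. */
static const struct aer_uncor_bit aer_uncor_bits[] = {
	{  4, "Data Link Protocol Error" },
	{  5, "Surprise Down Error" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
	{ 21, "ACS Violation" },
};

#define AER_UNCOR_COUNT (sizeof(aer_uncor_bits) / sizeof(aer_uncor_bits[0]))

/* Errors that are set in the status register and not masked. */
uint32_t aer_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}

/* Lowest set bit of reg, or -1 if reg is zero. */
int aer_uncor_lowest_bit(uint32_t reg)
{
	for (int bit = 0; bit < 32; bit++)
		if (reg & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Name for a bit position, or NULL if the table has no entry for it. */
const char *aer_uncor_name(int bit)
{
	assert(bit >= 0 && bit < 32);
	for (size_t i = 0; i < AER_UNCOR_COUNT; i++)
		if (aer_uncor_bits[i].bit == bit)
			return aer_uncor_bits[i].name;
	return NULL;
}
```

With the values from the log, aer_unmasked(0x00100000, 0x00004000) leaves
bit 20 set, which matches the "[20] UnsupReq" line. The mask value only has
bit 14 (Completion Timeout) set, so the Unsupported Request error is not
masked.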

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/portdrv_core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index bda630889f955..d464d00ade8f2 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
>  
>  #ifdef CONFIG_PCIEAER
>  	if (dev->aer_cap && pci_aer_available() &&
> -	    (pcie_ports_native || host->native_aer)) {
> +	    (pcie_ports_native || host->native_aer) &&
> +	    !dev->external_facing) {
>  		services |= PCIE_PORT_SERVICE_AER;
>  
>  		/*
> -- 
> 2.33.1
> 


* Re: [PATCH] PCI/portdrv: Skip enabling AER on external facing ports
  2022-01-05 20:12 ` Bjorn Helgaas
@ 2022-01-07  4:09   ` Kai-Heng Feng
  2022-01-21 10:55     ` Mika Westerberg
  0 siblings, 1 reply; 7+ messages in thread
From: Kai-Heng Feng @ 2022-01-07  4:09 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, mika.westerberg, koba.ko, Lukas Wunner, Stuart Hayes,
	Jan Kiszka, linux-pci, linux-kernel

On Thu, Jan 6, 2022 at 4:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jan 05, 2022 at 02:06:41PM +0800, Kai-Heng Feng wrote:
> > The Thunderbolt root ports may constantly spew out uncorrected errors
> > from AER service:
> > [   30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > [   30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > [   30.100256] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > [   30.100262] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > [   30.100267] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 08000052 00000000 00000000
> > [   30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> > [   30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> > [   30.100427] pcieport 0000:00:1d.0: AER: device recovery failed
>
> No timestamps needed here; they don't add to understanding the
> problem.

Got it. I will remove them in the next iteration.

>
> > The link may not be reliable on external facing ports, so don't enable
> > AER on those ports.
>
> I'm not sure what you want to accomplish here.  If the errors are
> legitimate and the result of some hardware issue like a bad cable, why
> should we ignore them?  If they're caused by a software problem, we
> should figure that out and fix it.
>
> Does this occur on a specific instance of possibly flaky hardware?

Only on root ports of Thunderbolt devices.

The error occurs as soon as the root port is runtime suspended to D3cold.

Runtime suspending the AER service resolves the issue. I wonder if
it's the right thing to do here?
D3cold should also mean the PCI link is gone, so disabling AER seems to
be a reasonable approach.

Kai-Heng

>
> You mention a spew of errors; do you think this is a single error that
> we fail to clear correctly?  Or is it really many separate errors?
>
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/pci/pcie/portdrv_core.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> > index bda630889f955..d464d00ade8f2 100644
> > --- a/drivers/pci/pcie/portdrv_core.c
> > +++ b/drivers/pci/pcie/portdrv_core.c
> > @@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
> >
> >  #ifdef CONFIG_PCIEAER
> >       if (dev->aer_cap && pci_aer_available() &&
> > -         (pcie_ports_native || host->native_aer)) {
> > +         (pcie_ports_native || host->native_aer) &&
> > +         !dev->external_facing) {
> >               services |= PCIE_PORT_SERVICE_AER;
> >
> >               /*
> > --
> > 2.33.1
> >


* Re: [PATCH] PCI/portdrv: Skip enabling AER on external facing ports
  2022-01-07  4:09   ` Kai-Heng Feng
@ 2022-01-21 10:55     ` Mika Westerberg
  2022-01-21 12:31       ` Kai-Heng Feng
  0 siblings, 1 reply; 7+ messages in thread
From: Mika Westerberg @ 2022-01-21 10:55 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Bjorn Helgaas, bhelgaas, koba.ko, Lukas Wunner, Stuart Hayes,
	Jan Kiszka, linux-pci, linux-kernel

Hi Kai-Heng,

On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> Only from root ports of thunderbolt devices.
> 
> The error occurs as soon as the root port is runtime suspended to D3cold.
> 
> Runtime suspend the AER service can resolve the issue. I wonder if
> it's the right thing to do here?

I think you are right here. It seems that the AER "service driver" is
completely missing PM hooks, probably because it is used more on
server-type systems where power management is not a priority.

> D3cold should also mean the PCI link is gone, disabling AER seems to
> be a reasonable approach.

Indeed - I think AER might trigger here because the link goes down /
enters a low power state while AER is left enabled and the root port
enters D3. Something like the hack below should disable it across low
power transitions:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..64138cf82db8 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1432,6 +1432,22 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
 }
 
+static int aer_suspend(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_disable_rootport(rpc);
+	return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_enable_rootport(rpc);
+	return 0;
+}
+
 static struct pcie_port_service_driver aerdriver = {
 	.name		= "aer",
 	.port_type	= PCIE_ANY_PORT,
@@ -1439,6 +1455,10 @@ static struct pcie_port_service_driver aerdriver = {
 
 	.probe		= aer_probe,
 	.remove		= aer_remove,
+	.suspend	= aer_suspend,
+	.resume		= aer_resume,
+	.runtime_suspend = aer_suspend,
+	.runtime_resume	= aer_resume,
 };
 
 /**


* Re: [PATCH] PCI/portdrv: Skip enabling AER on external facing ports
  2022-01-21 10:55     ` Mika Westerberg
@ 2022-01-21 12:31       ` Kai-Heng Feng
  2022-01-21 12:44         ` Mika Westerberg
  0 siblings, 1 reply; 7+ messages in thread
From: Kai-Heng Feng @ 2022-01-21 12:31 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, bhelgaas, koba.ko, Lukas Wunner, Stuart Hayes,
	Jan Kiszka, linux-pci, linux-kernel

Hi Mika,

On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi Kai-Heng,
>
> On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > Only from root ports of thunderbolt devices.
> >
> > The error occurs as soon as the root port is runtime suspended to D3cold.
> >
> > Runtime suspend the AER service can resolve the issue. I wonder if
> > it's the right thing to do here?
>
> I think you are right here. It seems that AER "service driver" is
> completely missing PM hooks. Probably because it is more used in server
> type of systems where power management is not priority.

Here is my previous attempt to suspend AER:
https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/

>
> > D3cold should also mean the PCI link is gone, disabling AER seems to
> > be a reasonable approach.
>
> Indeed - I think AER might trigger here because the link does "down" /
> low power state if left enabled while the root port enters D3. Something
> like below hack should disable it over low power transitions:

Ubuntu kernel has been carrying the patch for quite some time:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/commit/?id=e82f15f1a26273b004054a81ef45937fb1b632e5

>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..64138cf82db8 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1432,6 +1432,22 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>         return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
>
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +       struct aer_rpc *rpc = get_service_data(dev);
> +
> +       aer_disable_rootport(rpc);
> +       return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +       struct aer_rpc *rpc = get_service_data(dev);
> +
> +       aer_enable_rootport(rpc);
> +       return 0;
> +}
> +
>  static struct pcie_port_service_driver aerdriver = {
>         .name           = "aer",
>         .port_type      = PCIE_ANY_PORT,
> @@ -1439,6 +1455,10 @@ static struct pcie_port_service_driver aerdriver = {
>
>         .probe          = aer_probe,
>         .remove         = aer_remove,
> +       .suspend        = aer_suspend,
> +       .resume         = aer_resume,
> +       .runtime_suspend = aer_suspend,
> +       .runtime_resume = aer_resume,
>  };

This patch is exactly what I tested.

Maybe only suspend/runtime_suspend AER when the target PM state is D3cold?
The PCIe spec doesn't say how to handle AER in Link L2/L3Ready/L3, but I
think it's reasonable to suspend AER when power is lost.

Let me come up with a patch with that idea.
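
A rough sketch of that idea, written as a standalone model rather than
kernel code. Everything here (enum model_pm_state, struct model_port,
model_aer_suspend(), model_aer_resume()) is hypothetical and only
illustrates "disable AER on suspend only if the target state is D3cold,
and re-enable it on resume only if we disabled it":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for device power states; not kernel types. */
enum model_pm_state {
	MODEL_D0,
	MODEL_D1,
	MODEL_D2,
	MODEL_D3HOT,
	MODEL_D3COLD,
};

/* Hypothetical model of a port whose AER reporting can be toggled. */
struct model_port {
	bool aer_enabled;	/* error reporting currently on */
	bool aer_suspended;	/* we turned it off in the suspend hook */
};

/* Disable AER only when the port is about to lose power (D3cold). */
void model_aer_suspend(struct model_port *port, enum model_pm_state target)
{
	assert(port != NULL);

	if (target != MODEL_D3COLD || !port->aer_enabled)
		return;

	port->aer_enabled = false;
	port->aer_suspended = true;
}

/* Re-enable AER only if the suspend hook was the one that disabled it. */
void model_aer_resume(struct model_port *port)
{
	assert(port != NULL);

	if (!port->aer_suspended)
		return;

	port->aer_enabled = true;
	port->aer_suspended = false;
}
```

In this model a transition to D3hot leaves AER untouched, while a
transition to D3cold disables it until the next resume. How the target
state would actually be obtained inside a port service driver callback is
left open here.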

Kai-Heng

>
>  /**


* Re: [PATCH] PCI/portdrv: Skip enabling AER on external facing ports
  2022-01-21 12:31       ` Kai-Heng Feng
@ 2022-01-21 12:44         ` Mika Westerberg
  2022-01-21 14:25           ` Kai-Heng Feng
  0 siblings, 1 reply; 7+ messages in thread
From: Mika Westerberg @ 2022-01-21 12:44 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Bjorn Helgaas, bhelgaas, koba.ko, Lukas Wunner, Stuart Hayes,
	Jan Kiszka, linux-pci, linux-kernel

Hi,

On Fri, Jan 21, 2022 at 08:31:27PM +0800, Kai-Heng Feng wrote:
> Hi Mika,
> 
> On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Hi Kai-Heng,
> >
> > On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > > Only from root ports of thunderbolt devices.
> > >
> > > The error occurs as soon as the root port is runtime suspended to D3cold.
> > >
> > > Runtime suspend the AER service can resolve the issue. I wonder if
> > > it's the right thing to do here?
> >
> > I think you are right here. It seems that AER "service driver" is
> > completely missing PM hooks. Probably because it is more used in server
> > type of systems where power management is not priority.
> 
> Here is my previous attempt to suspend AER:
> https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/

That's great!

I think we should do the same for runtime PM paths too, though. Will you
take care of that as well? :)


* Re: [PATCH] PCI/portdrv: Skip enabling AER on external facing ports
  2022-01-21 12:44         ` Mika Westerberg
@ 2022-01-21 14:25           ` Kai-Heng Feng
  0 siblings, 0 replies; 7+ messages in thread
From: Kai-Heng Feng @ 2022-01-21 14:25 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, bhelgaas, koba.ko, Lukas Wunner, Stuart Hayes,
	Jan Kiszka, linux-pci, linux-kernel

On Fri, Jan 21, 2022 at 8:44 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi,
>
> On Fri, Jan 21, 2022 at 08:31:27PM +0800, Kai-Heng Feng wrote:
> > Hi Mika,
> >
> > On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > Hi Kai-Heng,
> > >
> > > On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > > > Only from root ports of thunderbolt devices.
> > > >
> > > > The error occurs as soon as the root port is runtime suspended to D3cold.
> > > >
> > > > Runtime suspend the AER service can resolve the issue. I wonder if
> > > > it's the right thing to do here?
> > >
> > > I think you are right here. It seems that AER "service driver" is
> > > completely missing PM hooks. Probably because it is more used in server
> > > type of systems where power management is not priority.
> >
> > Here is my previous attempt to suspend AER:
> > https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/
>
> That's great!
>
> I think we should do the same for runtime PM paths too, though. Will you
> take care of that as well? :)

Yes, that's the plan. I hope I can persuade Bjorn this time...

Kai-Heng


Thread overview: 7+ messages
2022-01-05  6:06 [PATCH] PCI/portdrv: Skip enabling AER on external facing ports Kai-Heng Feng
2022-01-05 20:12 ` Bjorn Helgaas
2022-01-07  4:09   ` Kai-Heng Feng
2022-01-21 10:55     ` Mika Westerberg
2022-01-21 12:31       ` Kai-Heng Feng
2022-01-21 12:44         ` Mika Westerberg
2022-01-21 14:25           ` Kai-Heng Feng
