From: Bjorn Helgaas <helgaas@kernel.org>
To: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: sathyanarayanan.kuppuswamy@linux.intel.com,
	linuxppc-dev@lists.ozlabs.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, koba.ko@canonical.com,
	Rajvi Jingar <rajvi.jingar@intel.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	david.e.box@linux.intel.com, bhelgaas@google.com,
	mika.westerberg@linux.intel.com, baolu.lu@linux.intel.com
Subject: Re: [PATCH v4 1/2] PCI/AER: Disable AER service when link is in L2/L3 ready, L2 and L3 state
Date: Fri, 22 Apr 2022 17:24:33 -0500
Message-ID: <20220422222433.GA1464120@bhelgaas>
In-Reply-To: <20220408153159.106741-1-kai.heng.feng@canonical.com>

[+cc Rajvi, David]

On Fri, Apr 08, 2022 at 11:31:58PM +0800, Kai-Heng Feng wrote:
> On Intel Alder Lake platforms, Thunderbolt entering D3cold can cause
> some errors reported by AER:
> [   30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> [   30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [   30.100256] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> [   30.100262] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> [   30.100267] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 08000052 00000000 00000000
> [   30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [   30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> [   30.100427] pcieport 0000:00:1d.0: AER: device recovery failed
> 
> So disable AER service to avoid the noises from turning power rails
> on/off when the device is in low power states (D3hot and D3cold), as
> PCIe Base Spec 5.0, section 5.2 "Link State Power Management" states
> that TLP and DLLP transmission is disabled for a Link in L2/L3 Ready
> (D3hot), L2 (D3cold with aux power) and L3 (D3cold).

Help me walk through what's happening here, because I'm never very
confident about how error reporting works.  I *think* the Unsupported
Request error means some request was in progress and was not
completed.  I don't think a link going down should by itself cause
an Unsupported Request error because there's no *request*.

I have a theory about what happened here.  Decoding the TLP Header
(from PCIe r6.0, sec 2.2.1.1, 2.2.8.10) gives:

  34000000 (0011 0100 ...):
    Fmt               001        4 DW header, no data
    Type           1 0100        Msg, Local - Terminate at Receiver

  08000052 (0800 ... 0101 0010)
    Requester ID     0800        08:00.0
    Message Code     0101 0010   PTM Request
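
For anyone who wants to repeat this decode on other AER logs, here is a
minimal userspace C sketch of the same bit slicing.  It is not kernel
code; struct tlp_msg_hdr and decode_msg_hdr() are names made up for
this illustration, and the bus/device/function split of the Requester
ID assumes a non-ARI function.

```c
#include <stdint.h>

/*
 * Fields of interest from the first two DWs of a Message Request TLP
 * header.  Hypothetical helper type, for illustration only.
 */
struct tlp_msg_hdr {
	uint8_t fmt;		/* DW0 bits 31:29 */
	uint8_t type;		/* DW0 bits 28:24 */
	uint8_t bus;		/* DW1 bits 31:24, Requester ID bus */
	uint8_t dev;		/* DW1 bits 23:19, Requester ID device (non-ARI) */
	uint8_t fn;		/* DW1 bits 18:16, Requester ID function (non-ARI) */
	uint8_t msg_code;	/* DW1 bits 7:0 */
};

/* Slice DW0 and DW1 exactly as AER prints them in "TLP Header:". */
static struct tlp_msg_hdr decode_msg_hdr(uint32_t dw0, uint32_t dw1)
{
	struct tlp_msg_hdr h = {
		.fmt      = (dw0 >> 29) & 0x7,
		.type     = (dw0 >> 24) & 0x1f,
		.bus      = (dw1 >> 24) & 0xff,
		.dev      = (dw1 >> 19) & 0x1f,
		.fn       = (dw1 >> 16) & 0x7,
		.msg_code = dw1 & 0xff,
	};

	return h;
}
```

Calling decode_msg_hdr(0x34000000, 0x08000052) gives Fmt 001b, Type
1 0100b, Requester ID 08:00.0 and Message Code 0x52, matching the
decode above.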

From your lspci in bugzilla, 08:00 has PTM enabled.  So my theory is
that:

  - 08:00.0 sent a PTM Request Message (a Posted Request)
  - 00:1d.0 received the PTM Request Message
  - The link transitioned to DL_Down
  - Per sec 2.9.1, 00:1d.0 discarded the Request and reported an
    Unsupported Request
  - Or, per sec 6.21.3, if 00:1d.0 received a PTM Request when its
    own PTM Enable was clear, it would also be treated as an
    Unsupported Request

So I suspect we should disable PTM on 08:00.0 before putting it in a
low-power state.  If you manually disable PTM on 08:00.0, do these
errors stop happening?

David did something like this [1], but just for Root Ports.  That
looks wrong to me because sec 6.21.3 says we should not have PTM
enabled in an Upstream Port (i.e., in a downstream device like
08:00.0) unless it is already enabled in the Downstream Port (i.e., in
the Root Port 00:1d.0).
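
To make that ordering constraint concrete, here is a toy C model, not
kernel code.  struct ptm_node and both helpers are hypothetical names
invented for this sketch; the only rule they encode is the one from
sec 6.21.3 as read above: PTM may be enabled in a function only if it
is already enabled in the port above it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one function that has a PTM capability. */
struct ptm_node {
	struct ptm_node *upstream;	/* port above this one, NULL at the Root Port */
	bool ptm_enabled;		/* models the PTM Enable bit */
};

/* PTM may be on here only if this is the root or the port above has it on. */
static bool ptm_may_enable(const struct ptm_node *n)
{
	return !n->upstream || n->upstream->ptm_enabled;
}

/* Walk from a device up to the Root Port and check the rule at every hop. */
static bool ptm_path_consistent(const struct ptm_node *n)
{
	for (; n; n = n->upstream)
		if (n->ptm_enabled && !ptm_may_enable(n))
			return false;

	return true;
}
```

In this model, clearing PTM Enable only on the Root Port while 08:00.0
still has it set makes ptm_path_consistent() return false for 08:00.0.
Disabling from the bottom up (08:00.0 first, then 00:1d.0) keeps the
path consistent at every step.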

Nit: can you remove the timestamps from the log?  They add clutter but
no useful information.

[1] https://git.kernel.org/linus/a697f072f5da

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v4:
>  - Explicitly states the spec version.
>  - Wording change. 
> 
> v3:
>  - Remove reference to ACS.
>  - Wording change.
> 
> v2:
>  - Wording change.
> 
>  drivers/pci/pcie/aer.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b270..e4e9d4a3098d7 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1367,6 +1367,22 @@ static int aer_probe(struct pcie_device *dev)
>  	return 0;
>  }
>  
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_disable_rootport(rpc);
> +	return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_enable_rootport(rpc);
> +	return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1433,12 +1449,15 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  }
>  
>  static struct pcie_port_service_driver aerdriver = {
> -	.name		= "aer",
> -	.port_type	= PCIE_ANY_PORT,
> -	.service	= PCIE_PORT_SERVICE_AER,
> -
> -	.probe		= aer_probe,
> -	.remove		= aer_remove,
> +	.name			= "aer",
> +	.port_type		= PCIE_ANY_PORT,
> +	.service		= PCIE_PORT_SERVICE_AER,
> +	.probe			= aer_probe,
> +	.suspend		= aer_suspend,
> +	.resume			= aer_resume,
> +	.runtime_suspend	= aer_suspend,
> +	.runtime_resume		= aer_resume,
> +	.remove			= aer_remove,
>  };
>  
>  /**
> -- 
> 2.34.1
> 
