From: Bjorn Helgaas <helgaas@kernel.org>
To: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: sathyanarayanan.kuppuswamy@linux.intel.com,
	linuxppc-dev@lists.ozlabs.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, koba.ko@canonical.com,
	Rajvi Jingar <rajvi.jingar@intel.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	david.e.box@linux.intel.com, bhelgaas@google.com,
	mika.westerberg@linux.intel.com, baolu.lu@linux.intel.com
Subject: Re: [PATCH v4 1/2] PCI/AER: Disable AER service when link is in L2/L3 ready, L2 and L3 state
Date: Fri, 22 Apr 2022 17:24:33 -0500
Message-ID: <20220422222433.GA1464120@bhelgaas>
In-Reply-To: <20220408153159.106741-1-kai.heng.feng@canonical.com>

[+cc Rajvi, David]

On Fri, Apr 08, 2022 at 11:31:58PM +0800, Kai-Heng Feng wrote:
> On Intel Alder Lake platforms, Thunderbolt entering D3cold can cause
> some errors reported by AER:
> [ 30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> [ 30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [ 30.100256] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000
> [ 30.100262] pcieport 0000:00:1d.0: [20] UnsupReq (First)
> [ 30.100267] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 08000052 00000000 00000000
> [ 30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [ 30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> [ 30.100427] pcieport 0000:00:1d.0: AER: device recovery failed
>
> So disable AER service to avoid the noises from turning power rails
> on/off when the device is in low power states (D3hot and D3cold), as
> PCIe Base Spec 5.0, section 5.2 "Link State Power Management" states
> that TLP and DLLP transmission is disabled for a Link in L2/L3 Ready
> (D3hot), L2 (D3cold with aux power) and L3 (D3cold).

Help me walk through what's happening here, because I'm never very
confident about how error reporting works.

I *think* the Unsupported Request error means some request was in
progress and was not completed. I don't think a link going down
should by itself cause an Unsupported Request error because there's no
*request*.

I have a theory about what happened here. Decoding the TLP Header
(from PCIe r6.0, sec 2.2.1.1, 2.2.8.10) gives:

  34000000 (0011 0100 ...):
    Fmt               001    4 DW header, no data
    Type           1 0100    Msg, Local - Terminate at Receiver

  08000052 (0800 ... 0101 0010)
    Requester ID     0800    08:00.0
    Message Code 0101 0010   PTM Request

From your lspci in bugzilla, 08:00.0 has PTM enabled. So my theory is
that:

  - 08:00.0 sent a PTM Request Message (a Posted Request)

  - 00:1d.0 received the PTM Request Message

  - The link transitioned to DL_Down

  - Per sec 2.9.1, 00:1d.0 discarded the Request and reported an
    Unsupported Request

  - Or, per sec 6.21.3, if 00:1d.0 received a PTM Request when its
    own PTM Enable was clear, it would also be treated as an
    Unsupported Request

So I suspect we should disable PTM on 08:00.0 before putting it in a
low-power state. If you manually disable PTM on 08:00.0, do these
errors stop happening?

David did something like this [1], but just for Root Ports. That
looks wrong to me because sec 6.21.3 says we should not have PTM
enabled in an Upstream Port (i.e., in a downstream device like
08:00.0) unless it is already enabled in the Downstream Port (i.e., in
the Root Port 00:1d.0).

Nit: can you remove the timestamps from the log? They add clutter but
no useful information.

[1] https://git.kernel.org/linus/a697f072f5da

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v4:
>  - Explicitly states the spec version.
>  - Wording change.
>
> v3:
>  - Remove reference to ACS.
>  - Wording change.
>
> v2:
>  - Wording change.
>
>  drivers/pci/pcie/aer.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b270..e4e9d4a3098d7 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1367,6 +1367,22 @@ static int aer_probe(struct pcie_device *dev)
>  	return 0;
>  }
>
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_disable_rootport(rpc);
> +	return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_enable_rootport(rpc);
> +	return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1433,12 +1449,15 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  }
>
>  static struct pcie_port_service_driver aerdriver = {
> -	.name		= "aer",
> -	.port_type	= PCIE_ANY_PORT,
> -	.service	= PCIE_PORT_SERVICE_AER,
> -
> -	.probe		= aer_probe,
> -	.remove		= aer_remove,
> +	.name			= "aer",
> +	.port_type		= PCIE_ANY_PORT,
> +	.service		= PCIE_PORT_SERVICE_AER,
> +	.probe			= aer_probe,
> +	.suspend		= aer_suspend,
> +	.resume			= aer_resume,
> +	.runtime_suspend	= aer_suspend,
> +	.runtime_resume		= aer_resume,
> +	.remove			= aer_remove,
>  };
>
>  /**
> --
> 2.34.1
>