All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stefan Roese <sr@denx.de>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "Pali Rohár" <pali@kernel.org>,
	linux-pci@vger.kernel.org, "Bjorn Helgaas" <bhelgaas@google.com>,
	"Keith Busch" <kbusch@kernel.org>,
	"Bharat Kumar Gogada" <bharatku@xilinx.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>
Subject: Re: PCIe AER generates no interrupts on host (ZynqMP)
Date: Fri, 14 Jan 2022 07:25:29 +0100	[thread overview]
Message-ID: <dd3d7cd9-6232-e793-a26c-a10e5cc13fae@denx.de> (raw)
In-Reply-To: <20220113213230.GA433650@bhelgaas>

On 1/13/22 22:32, Bjorn Helgaas wrote:
> On Thu, Jan 13, 2022 at 08:13:55AM +0100, Stefan Roese wrote:
>> On 1/12/22 18:49, Bjorn Helgaas wrote:
> 
>>> Ah.  I assume you have:
>>>
>>>     00:00.0 Root Port to [bus 01-??]
>>>     01:00.0 Switch Upstream Port to [bus 02-??]
>>>     02:0?.0 Switch Downstream Port to [bus 04-??]
>>
>> This is correct, yes.
>>
>>> pcie_portdrv_probe() claims 00:00.0 and clears CERE NFERE FERE URRE.
>>>
>>> aer_probe() claims 00:00.0 and enables CERE NFERE FERE URRE for all
>>> downstream devices, including 01:00.0.
>>>
>>> pcie_portdrv_probe() claims 01:00.0 and clears CERE NFERE FERE URRE
>>> again.
>>>
>>> aer_probe() declines to claim 01:00.0 because it's not a Root Port, so
>>> CERE NFERE FERE URRE remain cleared.
> 
>> I'm baffled a bit, that this problem of AER reporting being disabled in
>> the DevCtl regs of PCIe ports (all non root ports) was not noticed for
>> this long time. As AER is practically disabled in such setups.
> 
> The more common configuration is a Root Port leading directly to an
> Endpoint.  In that case, there would be no pcie_portdrv_probe() in the
> middle to disable reporting after aer_probe() has enabled it.
> 
> The issue you're seeing happens because of the switch in the middle,
> which is becoming more common recently with Thunderbolt.
> 
> I poked around on my laptop (Dell 5520 running v5.4):
> 
>    00:01.0 Root Port       to [bus 01]          CorrErr-
>    01:00.0 NVIDIA GPU                           CorrErr-
> 
>    00:1c.0 Root Port       to [bus 02]     AER  CorrErr+
>    02:00.0 Intel NIC                       AER  CorrErr-  <-- iwlwifi
> 
>    00:1c.1 Root Port       to [bus 03]     AER  CorrErr+
>    03:00.0 Realtek card reader             AER  CorrErr-  <-- rtsx_pci
> 
>    00:1d.0 Root Port       to [bus 04]     AER  CorrErr+
>    04:00.0 NVMe                            AER  CorrErr+
> 
>    00:1d.6 Root Port       to [bus 06-3e]  AER  CorrErr+
>    06:00.0 Thunderbolt USP to [bus 07-3e]  AER  CorrErr-
>    07:00.0 Thunderbolt DSP to [bus 08]     AER  CorrErr-
>    ...                                          CorrErr-
> 
> Everything in the Thunderbolt hierarchy has reporting disabled,
> probably because of the issue you are pointing out.
> 
> I can't explain the iwlwifi and rtsx_pci cases.  Both devices have AER
> and are directly below a Root Port that also has AER, so I would think
> reporting should be enabled.

This is because AER is enabled for the complete bus via
set_downstream_devices_error_reporting(), which calls
set_device_error_reporting():

static int set_device_error_reporting(struct pci_dev *dev, void *data)
...
	if ((type == PCI_EXP_TYPE_ROOT_PORT) ||
	    (type == PCI_EXP_TYPE_RC_EC) ||
	    (type == PCI_EXP_TYPE_UPSTREAM) ||
	    (type == PCI_EXP_TYPE_DOWNSTREAM)) {
		if (enable)
			pci_enable_pcie_error_reporting(dev);
		else
			pci_disable_pcie_error_reporting(dev);
	}

	if (enable)
		pcie_set_ecrc_checking(dev);

Not sure, why error reporting is not enabled for "normal" PCIe devices
but only pcie_set_ecrc_checking(). This behavior was integrated with the
intial AER support in commit 6c2b374d7485 ("PCI-Express AER
implemetation: AER core and aerdriver") in 2006.

This explains, why you have AER disabled in the DevCtl registers of your
iwlwifi and rtsx_pci devices.

Now you might be asking, why you have AER enabled in the NVMe drive.
This is because the NVMe driver specifically enables AER:

drivers/nvme/host/pci.c:
static int nvme_pci_enable(struct nvme_dev *dev)
{
...
	pci_enable_pcie_error_reporting(pdev);

There are other device drivers, especially ethernet, SCSI etc, which
also specifically call pci_enable_pcie_error_reporting() for their
PCIe devices.

So how to continue with this? Are we "brave enough" to enable AER
for normal PCIe devices as well in set_device_error_reporting()?
This would be a quite big change, as currently all PCIe devices have
AER disabled per default.

And we still have the problem that AER is disabled in all PCIe devices
except for the Root Port. This can be fixed by removing the AER
disabling from get_port_device_capability() as done in the patch I've
sent yesterday to the list.

Another idea that comes to my mind is, to change
pci_enable_pcie_error_reporting() to walk the PCI bus upstream
while enabling the device and enable AER in the DevCtl register in all
upstream PCIe devices (e.g. PCIe switch etc) found on this way up to
the PCIe Root Port. This way, AER will work on PCIe device, where it
is specifically enabled in the device driver. But only there - at
least in cases, where PCIe switches etc are involved.

Did I miss anything?

Thanks,
Stefan

  reply	other threads:[~2022-01-14  6:25 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-04  9:02 PCIe AER generates no interrupts on host (ZynqMP) Stefan Roese
2022-01-07 10:04 ` Pali Rohár
2022-01-07 20:34   ` Bjorn Helgaas
2022-01-07 21:31     ` Pali Rohár
2022-01-08  3:13       ` Bjorn Helgaas
2022-01-10 11:12         ` Pali Rohár
2022-01-10 12:17     ` Stefan Roese
2022-01-10 12:16   ` Stefan Roese
2022-01-11  8:14     ` Stefan Roese
2022-01-12 17:49       ` Bjorn Helgaas
2022-01-13  7:13         ` Stefan Roese
2022-01-13 21:32           ` Bjorn Helgaas
2022-01-14  6:25             ` Stefan Roese [this message]
2022-01-14 18:30               ` Bjorn Helgaas
2022-01-14 18:38                 ` Pali Rohár
2022-01-17  6:24                 ` Stefan Roese

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dd3d7cd9-6232-e793-a26c-a10e5cc13fae@denx.de \
    --to=sr@denx.de \
    --cc=bharatku@xilinx.com \
    --cc=bhelgaas@google.com \
    --cc=helgaas@kernel.org \
    --cc=kbusch@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=pali@kernel.org \
    --cc=rjw@rjwysocki.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.