All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] PCI/portdrv: Enable error reporting on managed ports
@ 2018-09-04 18:33 Jon Derrick
  2018-10-09 17:56 ` Bjorn Helgaas
  0 siblings, 1 reply; 6+ messages in thread
From: Jon Derrick @ 2018-09-04 18:33 UTC (permalink / raw)
  To: linux-pci
  Cc: linux-kernel, Bjorn Helgaas, Keith Busch, Sinan Kaya,
	Oza Pawandeep, Matthew Wilcox, Lukas Winner, Christoph Hellwig,
	Mika Westerberg, Jon Derrick

During probe, the port driver will disable error reporting and assumes
it will be enabled later by the AER driver's pci_walk_bus() sequence.
This may not be the case for host-bridge enabled root ports, who will
enable first error reporting on the bus during the root port probe, and
then disable error reporting on downstream devices during subsequent
probing of the bus.

A hotplugged port device may also fail to enable error reporting as the
AER driver has already run on the root bus.

Check for these conditions and enable error reporting during portdrv
probing.

Example case:
[  343.790573] pcieport 10000:00:00.0: pci_disable_pcie_error_reporting
[  343.809812] pcieport 10000:00:00.0: pci_enable_pcie_error_reporting
[  343.819506] pci 10000:01:00.0: pci_enable_pcie_error_reporting
[  343.828814] pci 10000:02:00.0: pci_enable_pcie_error_reporting
[  343.838089] pci 10000:02:01.0: pci_enable_pcie_error_reporting
[  343.847478] pci 10000:02:02.0: pci_enable_pcie_error_reporting
[  343.856659] pci 10000:02:03.0: pci_enable_pcie_error_reporting
[  343.865794] pci 10000:02:04.0: pci_enable_pcie_error_reporting
[  343.874875] pci 10000:02:05.0: pci_enable_pcie_error_reporting
[  343.883918] pci 10000:02:06.0: pci_enable_pcie_error_reporting
[  343.892922] pci 10000:02:07.0: pci_enable_pcie_error_reporting
[  343.918900] pcieport 10000:01:00.0: pci_disable_pcie_error_reporting
[  343.968426] pcieport 10000:02:00.0: pci_disable_pcie_error_reporting
[  344.028179] pcieport 10000:02:01.0: pci_disable_pcie_error_reporting
[  344.091269] pcieport 10000:02:02.0: pci_disable_pcie_error_reporting
[  344.156473] pcieport 10000:02:03.0: pci_disable_pcie_error_reporting
[  344.238042] pcieport 10000:02:04.0: pci_disable_pcie_error_reporting
[  344.321864] pcieport 10000:02:05.0: pci_disable_pcie_error_reporting
[  344.411601] pcieport 10000:02:06.0: pci_disable_pcie_error_reporting
[  344.505332] pcieport 10000:02:07.0: pci_disable_pcie_error_reporting
[  344.621824] nvme 10000:06:00.0: pci_enable_pcie_error_reporting

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/pcie/portdrv_core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 7c37d81..fdd953a 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -343,6 +343,16 @@ int pcie_port_device_register(struct pci_dev *dev)
 	if (!nr_service)
 		goto error_cleanup_irqs;
 
+#ifdef CONFIG_PCIEAER
+	/*
+	 * Enable error reporting for this port in case AER probing has already
+	 * run on the root bus or this port device is hot-inserted
+	 */
+	if (dev->aer_cap && pci_aer_available() &&
+	    (pcie_ports_native || pci_find_host_bridge(dev->bus)->native_aer))
+		pci_enable_pcie_error_reporting(dev);
+#endif
+
 	return 0;
 
 error_cleanup_irqs:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/portdrv: Enable error reporting on managed ports
  2018-09-04 18:33 [PATCH] PCI/portdrv: Enable error reporting on managed ports Jon Derrick
@ 2018-10-09 17:56 ` Bjorn Helgaas
  2018-10-09 19:51   ` Derrick, Jonathan
  0 siblings, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2018-10-09 17:56 UTC (permalink / raw)
  To: Jon Derrick
  Cc: linux-pci, linux-kernel, Keith Busch, Sinan Kaya, Oza Pawandeep,
	Matthew Wilcox, Lukas Wunner, Christoph Hellwig, Mika Westerberg

On Tue, Sep 04, 2018 at 12:33:09PM -0600, Jon Derrick wrote:
> During probe, the port driver will disable error reporting and assumes
> it will be enabled later by the AER driver's pci_walk_bus() sequence.
> This may not be the case for host-bridge enabled root ports, who will
> enable first error reporting on the bus during the root port probe, and
> then disable error reporting on downstream devices during subsequent
> probing of the bus.

I understand the hotplug case (see below), but help me understand this
"host-bridge enabled root ports" thing.  I'm not sure what that means.

We run pcie_portdrv_probe() for every root port, switch upstream port,
and switch downstream port, and it always disables error reporting for
the port:

  pcie_portdrv_probe          # pci_driver .probe
    pcie_port_device_register
      get_port_device_capability
        services |= PCIE_PORT_SERVICE_AER
        pci_disable_pcie_error_reporting
          # clear DEVCTL Error Reporting Enables

For root ports, we call aer_probe(), and it enables error reporting
for the entire tree below the root port:

  aer_probe                   # pcie_port_service .probe
    aer_enable_rootport
      set_downstream_devices_error_reporting(dev, true)
        pci_walk_bus(dev->subordinate, set_device_error_reporting)
          set_device_error_reporting
            if (Root Port || Upstream Port || Downstream Port)
              pci_enable_pcie_error_reporting
                # set DEVCTL Error Reporting Enables

This is definitely broken for hot-added switches because aer_probe()
is the only place we enable error reporting, and it's only run when we
enumerate a root port, not when we hot-add things below that root
port.

> A hotplugged port device may also fail to enable error reporting as the
> AER driver has already run on the root bus.

> Check for these conditions and enable error reporting during portdrv
> probing.
> 
> Example case:

pcie_portdrv_probe(10000:00:00.0):
> [  343.790573] pcieport 10000:00:00.0: pci_disable_pcie_error_reporting

aer_probe(10000:00:00.0):
> [  343.809812] pcieport 10000:00:00.0: pci_enable_pcie_error_reporting
> [  343.819506] pci 10000:01:00.0: pci_enable_pcie_error_reporting
> [  343.828814] pci 10000:02:00.0: pci_enable_pcie_error_reporting
> [  343.838089] pci 10000:02:01.0: pci_enable_pcie_error_reporting
> [  343.847478] pci 10000:02:02.0: pci_enable_pcie_error_reporting
> [  343.856659] pci 10000:02:03.0: pci_enable_pcie_error_reporting
> [  343.865794] pci 10000:02:04.0: pci_enable_pcie_error_reporting
> [  343.874875] pci 10000:02:05.0: pci_enable_pcie_error_reporting
> [  343.883918] pci 10000:02:06.0: pci_enable_pcie_error_reporting
> [  343.892922] pci 10000:02:07.0: pci_enable_pcie_error_reporting

pcie_portdrv_probe(10000:01:00.0):
> [  343.918900] pcieport 10000:01:00.0: pci_disable_pcie_error_reporting

pcie_portdrv_probe(10000:02:00.0):
> [  343.968426] pcieport 10000:02:00.0: pci_disable_pcie_error_reporting

...
> [  344.028179] pcieport 10000:02:01.0: pci_disable_pcie_error_reporting
> [  344.091269] pcieport 10000:02:02.0: pci_disable_pcie_error_reporting
> [  344.156473] pcieport 10000:02:03.0: pci_disable_pcie_error_reporting
> [  344.238042] pcieport 10000:02:04.0: pci_disable_pcie_error_reporting
> [  344.321864] pcieport 10000:02:05.0: pci_disable_pcie_error_reporting
> [  344.411601] pcieport 10000:02:06.0: pci_disable_pcie_error_reporting
> [  344.505332] pcieport 10000:02:07.0: pci_disable_pcie_error_reporting

> [  344.621824] nvme 10000:06:00.0: pci_enable_pcie_error_reporting
> 
> Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> ---
>  drivers/pci/pcie/portdrv_core.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 7c37d81..fdd953a 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -343,6 +343,16 @@ int pcie_port_device_register(struct pci_dev *dev)
>       if (!nr_service)
>               goto error_cleanup_irqs;
>  
> +#ifdef CONFIG_PCIEAER
> +     /*
> +      * Enable error reporting for this port in case AER probing has already
> +      * run on the root bus or this port device is hot-inserted
> +      */
> +     if (dev->aer_cap && pci_aer_available() &&
> +         (pcie_ports_native || pci_find_host_bridge(dev->bus)->native_aer))
> +             pci_enable_pcie_error_reporting(dev);
> +#endif

I plan to apply this after we clarify the changelog a bit, but I don't
really like this patch because it (and the corresponding code added by
2bd50dd800b5 ("PCI: PCIe: Disable PCIe port services during port
initialization")) seem a little out of place.

The way I think this *should* work is that the PCI core should arrange to
handle AER interrupts when it enumerates the devices that can generate
them (Root Ports and Root Complex Event Collectors), even before it
enumerates the devices below the Root Port.

Then the PCI core could directly enable the AER interrupts on all devices
as it enumerates them.  I would envision both cases being handled somewhere
like pci_aer_init() in pci_init_capabilities().

This would also allow us to get rid of the pci_enable_pcie_error_reporting()
calls that are currently sprinkled around in drivers, because that would be
handled by the core for all devices.

Bjorn

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/portdrv: Enable error reporting on managed ports
  2018-10-09 17:56 ` Bjorn Helgaas
@ 2018-10-09 19:51   ` Derrick, Jonathan
  2018-10-09 23:19     ` Bjorn Helgaas
  0 siblings, 1 reply; 6+ messages in thread
From: Derrick, Jonathan @ 2018-10-09 19:51 UTC (permalink / raw)
  To: helgaas
  Cc: linux-kernel, okaya, willy, hch, lukas, mika.westerberg, poza,
	linux-pci, Busch, Keith

[-- Attachment #1: Type: text/plain, Size: 8033 bytes --]

Hi Bjorn,

On Tue, 2018-10-09 at 12:56 -0500, Bjorn Helgaas wrote:
> On Tue, Sep 04, 2018 at 12:33:09PM -0600, Jon Derrick wrote:
> > During probe, the port driver will disable error reporting and
> > assumes
> > it will be enabled later by the AER driver's pci_walk_bus()
> > sequence.
> > This may not be the case for host-bridge enabled root ports, who
> > will
> > enable first error reporting on the bus during the root port probe,
> > and
> > then disable error reporting on downstream devices during
> > subsequent
> > probing of the bus.
> 
> I understand the hotplug case (see below), but help me understand
> this
> "host-bridge enabled root ports" thing.  I'm not sure what that
> means.
Sorry for the confusion. I meant a device which doesn't expose the root
ports with firmware but has to expose the root ports using
pci_create_root_bus or similar methods. These methods use a host bridge
aperture on the backend.


> 
> We run pcie_portdrv_probe() for every root port, switch upstream
> port,
> and switch downstream port, and it always disables error reporting
> for
> the port:
> 
>   pcie_portdrv_probe          # pci_driver .probe
>     pcie_port_device_register
>       get_port_device_capability
>         services |= PCIE_PORT_SERVICE_AER
>         pci_disable_pcie_error_reporting
>           # clear DEVCTL Error Reporting Enables
> 
> For root ports, we call aer_probe(), and it enables error reporting
> for the entire tree below the root port:
> 
>   aer_probe                   # pcie_port_service .probe
>     aer_enable_rootport
>       set_downstream_devices_error_reporting(dev, true)
>         pci_walk_bus(dev->subordinate, set_device_error_reporting)
>           set_device_error_reporting
>             if (Root Port || Upstream Port || Downstream Port)
>               pci_enable_pcie_error_reporting
>                 # set DEVCTL Error Reporting Enables
> 
> This is definitely broken for hot-added switches because aer_probe()
> is the only place we enable error reporting, and it's only run when
> we
> enumerate a root port, not when we hot-add things below that root
> port.

I don't currently have the hardware to test hotplugging a switch,
although I think it should be possible to test with Thunderbolt. Mika? 
:)

> 
> > A hotplugged port device may also fail to enable error reporting as
> > the
> > AER driver has already run on the root bus.
> > Check for these conditions and enable error reporting during
> > portdrv
> > probing.
> > 
> > Example case:
> 
> pcie_portdrv_probe(10000:00:00.0):
> > [  343.790573] pcieport 10000:00:00.0:
> > pci_disable_pcie_error_reporting
> 
> aer_probe(10000:00:00.0):
> > [  343.809812] pcieport 10000:00:00.0:
> > pci_enable_pcie_error_reporting
> > [  343.819506] pci 10000:01:00.0: pci_enable_pcie_error_reporting
> > [  343.828814] pci 10000:02:00.0: pci_enable_pcie_error_reporting
> > [  343.838089] pci 10000:02:01.0: pci_enable_pcie_error_reporting
> > [  343.847478] pci 10000:02:02.0: pci_enable_pcie_error_reporting
> > [  343.856659] pci 10000:02:03.0: pci_enable_pcie_error_reporting
> > [  343.865794] pci 10000:02:04.0: pci_enable_pcie_error_reporting
> > [  343.874875] pci 10000:02:05.0: pci_enable_pcie_error_reporting
> > [  343.883918] pci 10000:02:06.0: pci_enable_pcie_error_reporting
> > [  343.892922] pci 10000:02:07.0: pci_enable_pcie_error_reporting
> 
> pcie_portdrv_probe(10000:01:00.0):
> > [  343.918900] pcieport 10000:01:00.0:
> > pci_disable_pcie_error_reporting
> 
> pcie_portdrv_probe(10000:02:00.0):
> > [  343.968426] pcieport 10000:02:00.0:
> > pci_disable_pcie_error_reporting
> 
> ...
> > [  344.028179] pcieport 10000:02:01.0:
> > pci_disable_pcie_error_reporting
> > [  344.091269] pcieport 10000:02:02.0:
> > pci_disable_pcie_error_reporting
> > [  344.156473] pcieport 10000:02:03.0:
> > pci_disable_pcie_error_reporting
> > [  344.238042] pcieport 10000:02:04.0:
> > pci_disable_pcie_error_reporting
> > [  344.321864] pcieport 10000:02:05.0:
> > pci_disable_pcie_error_reporting
> > [  344.411601] pcieport 10000:02:06.0:
> > pci_disable_pcie_error_reporting
> > [  344.505332] pcieport 10000:02:07.0:
> > pci_disable_pcie_error_reporting
> > [  344.621824] nvme 10000:06:00.0: pci_enable_pcie_error_reporting

The main issue leading to this sequence of events is that the root port
and tree is enumerated, then the tree has drivers attached in-order.
During driver probe, the root port services are attached which enables
AER on the entire bus. Then the port's service drivers are attached,
which disables AER during initialization.


It looks like I don't see this issue in the non-host-bridge case
because this statement isn't executed:
#ifdef CONFIG_PCIEAER
        if (dev->aer_cap && pci_aer_available() &&
            (pcie_ports_native || host->native_aer)) {
                services |= PCIE_PORT_SERVICE_AER;

                /*
                 * Disable AER on this port in case it's been enabled
by the
                 * BIOS (the AER service driver will enable it when
necessary).
                 */         
                pci_disable_pcie_error_reporting(dev);
        }                                                              
#endif


So I think re-enabling or strictly enabling error reporting is the
correct course of action for hotplug support at the least. And doing it
after port service driver probe is done and IRQs are requested should
eliminate or reduce races with initialization, which is what above code
snippet looks to be doing in the first place along with
pcie_init_services.

> > 
> > Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> > ---
> >  drivers/pci/pcie/portdrv_core.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/pci/pcie/portdrv_core.c
> > b/drivers/pci/pcie/portdrv_core.c
> > index 7c37d81..fdd953a 100644
> > --- a/drivers/pci/pcie/portdrv_core.c
> > +++ b/drivers/pci/pcie/portdrv_core.c
> > @@ -343,6 +343,16 @@ int pcie_port_device_register(struct pci_dev
> > *dev)
> >       if (!nr_service)
> >               goto error_cleanup_irqs;
> >  
> > +#ifdef CONFIG_PCIEAER
> > +     /*
> > +      * Enable error reporting for this port in case AER probing
> > has already
> > +      * run on the root bus or this port device is hot-inserted
> > +      */
> > +     if (dev->aer_cap && pci_aer_available() &&
> > +         (pcie_ports_native || pci_find_host_bridge(dev->bus)-
> > >native_aer))
> > +             pci_enable_pcie_error_reporting(dev);
> > +#endif
> 
> I plan to apply this after we clarify the changelog a bit, but I
> don't
> really like this patch because it (and the corresponding code added
> by
> 2bd50dd800b5 ("PCI: PCIe: Disable PCIe port services during port
> initialization")) seem a little out of place.
> 
> The way I think this *should* work is that the PCI core should
> arrange to
> handle AER interrupts when it enumerates the devices that can
> generate
> them (Root Ports and Root Complex Event Collectors), even before it
> enumerates the devices below the Root Port.
To add to this, I'm not the biggest fan of the current way of having
AER enable notification for ports on the tree, but then having every
other PCI driver enable their own notification. It seems inconsistent
for the port driver to not do it.


> 
> Then the PCI core could directly enable the AER interrupts on all
> devices
> as it enumerates them.  I would envision both cases being handled
> somewhere
> like pci_aer_init() in pci_init_capabilities().
> 
> This would also allow us to get rid of the
> pci_enable_pcie_error_reporting()
> calls that are currently sprinkled around in drivers, because that
> would be
> handled by the core for all devices.
I agree with you here. It seems to make the most sense to handle it
inclusively and exclude devices as necessary.


> 
> Bjorn

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3278 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/portdrv: Enable error reporting on managed ports
  2018-10-09 19:51   ` Derrick, Jonathan
@ 2018-10-09 23:19     ` Bjorn Helgaas
  2018-10-11 11:58       ` Dongdong Liu
  0 siblings, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2018-10-09 23:19 UTC (permalink / raw)
  To: Derrick, Jonathan
  Cc: linux-kernel, okaya, willy, hch, lukas, mika.westerberg, poza,
	linux-pci, Busch, Keith

On Tue, Oct 09, 2018 at 07:51:58PM +0000, Derrick, Jonathan wrote:
> On Tue, 2018-10-09 at 12:56 -0500, Bjorn Helgaas wrote:
> > On Tue, Sep 04, 2018 at 12:33:09PM -0600, Jon Derrick wrote:
> > > During probe, the port driver will disable error reporting and
> > > assumes it will be enabled later by the AER driver's
> > > pci_walk_bus() sequence.  This may not be the case for
> > > host-bridge enabled root ports, who will enable first error
> > > reporting on the bus during the root port probe, and then
> > > disable error reporting on downstream devices during subsequent
> > > probing of the bus.
> > 
> > I understand the hotplug case (see below), but help me understand
> > this "host-bridge enabled root ports" thing.  I'm not sure what
> > that means.
>
> Sorry for the confusion. I meant a device which doesn't expose the
> root ports with firmware but has to expose the root ports using
> pci_create_root_bus or similar methods. These methods use a host
> bridge aperture on the backend.

I guess the contrast you're making is between drivers/acpi/pci_root.c,
which claims ACPI PNP0A03 and PNP0A08 devices, and what I call the
"native" host bridge drivers, which normally claim DT platform devices
and know about the register layout and programming model of those
devices?

In both cases the PCI core has to know about the host bridge apertures
(the address ranges on the primary (upstream) side of the host bridge
that are translated to PCI addresses on the secondary (downstream)
side of the bridge).

I'm still not seeing the connection between the ACPI/native
distinction or the host bridge apertures and this AER enablement
issue.  But I think the problem has to do with the ordering between
enumeration and portdrv/AER driver binding.  See below.

> > We run pcie_portdrv_probe() for every root port, switch upstream
> > port, and switch downstream port, and it always disables error
> > reporting for the port:
> > 
> >   pcie_portdrv_probe          # pci_driver .probe
> >     pcie_port_device_register
> >       get_port_device_capability
> >         services |= PCIE_PORT_SERVICE_AER
> >         pci_disable_pcie_error_reporting
> >           # clear DEVCTL Error Reporting Enables
> > 
> > For root ports, we call aer_probe(), and it enables error reporting
> > for the entire tree below the root port:
> > 
> >   aer_probe                   # pcie_port_service .probe
> >     aer_enable_rootport
> >       set_downstream_devices_error_reporting(dev, true)
> >         pci_walk_bus(dev->subordinate, set_device_error_reporting)
> >           set_device_error_reporting
> >             if (Root Port || Upstream Port || Downstream Port)
> >               pci_enable_pcie_error_reporting
> >                 # set DEVCTL Error Reporting Enables
> > 
> > This is definitely broken for hot-added switches because
> > aer_probe() is the only place we enable error reporting, and it's
> > only run when we enumerate a root port, not when we hot-add things
> > below that root port.
> 
> I don't currently have the hardware to test hotplugging a switch,
> although I think it should be possible to test with Thunderbolt.
> Mika?  :)

It seems clear enough to me that this is broken; I don't think we need
more testing to confirm it.  Your scenario below looks like it's
probably from VMD, given the domain number, and the fact that the
timestamps look like they're after boot suggests that VMD is being
loaded as a module.

I think I can work out the order of events there:

  - register pcie_portdriver   # device_initcall
  - register aerdriver         # device_initcall, after portdrv
  - load VMD module
  - add VMD host bridge to domain 10000
  - 10000:00:00.0: enumerate VMD root port
  - 10000:00:00.0: bind portdrv, disable error reporting
  - 10000:00:00.0: bind aerdriver, enable error reporting for children
  - 10000:01:00.0: enumerate VMD switch upstream port
  - 10000:01:00.0: bind portdrv, disable error reporting
  - 10000:01:00.0: do not bind AER driver (because not a root port)
  - 10000:02:0x.0: enumerate VMD switch downstream ports
  - 10000:02:0x.0: bind portdrv, disable error reporting
  - 10000:02:0x.0: do not bind AER driver (not root ports)
  - 10000:06:00.0: enumerate NVMe endpoint
  - 10000:06:00.0: nvme driver enables error reporting

The end state is that error reporting is enabled only for the root
port and the NVMe device, but not for the switch ports in between.

I think the critical ordering is the portdrv/AER driver registration
vs. the device enumeration.  If we enumerate the devices before
registering portdrv/AER (as is the typical case with ACPI host
bridges), when we register portdrv, we'll bind portdrv to all the
bridges *first*, which disables error reporting for all of them.

After that, we'll register the AER driver, which will be bound to all
the root ports and will enable error reporting for the subtrees below
them.

> > > A hotplugged port device may also fail to enable error reporting
> > > as the AER driver has already run on the root bus.  Check for
> > > these conditions and enable error reporting during portdrv
> > > probing.
> > > 
> > > Example case:
> > 
> > pcie_portdrv_probe(10000:00:00.0):
> > > [  343.790573] pcieport 10000:00:00.0:
> > > pci_disable_pcie_error_reporting
> > 
> > aer_probe(10000:00:00.0):
> > > [  343.809812] pcieport 10000:00:00.0:
> > > pci_enable_pcie_error_reporting
> > > [  343.819506] pci 10000:01:00.0: pci_enable_pcie_error_reporting
> > > [  343.828814] pci 10000:02:00.0: pci_enable_pcie_error_reporting
> > > [  343.838089] pci 10000:02:01.0: pci_enable_pcie_error_reporting
> > > [  343.847478] pci 10000:02:02.0: pci_enable_pcie_error_reporting
> > > [  343.856659] pci 10000:02:03.0: pci_enable_pcie_error_reporting
> > > [  343.865794] pci 10000:02:04.0: pci_enable_pcie_error_reporting
> > > [  343.874875] pci 10000:02:05.0: pci_enable_pcie_error_reporting
> > > [  343.883918] pci 10000:02:06.0: pci_enable_pcie_error_reporting
> > > [  343.892922] pci 10000:02:07.0: pci_enable_pcie_error_reporting
> > 
> > pcie_portdrv_probe(10000:01:00.0):
> > > [  343.918900] pcieport 10000:01:00.0:
> > > pci_disable_pcie_error_reporting
> > 
> > pcie_portdrv_probe(10000:02:00.0):
> > > [  343.968426] pcieport 10000:02:00.0:
> > > pci_disable_pcie_error_reporting
> > 
> > ...
> > > [  344.028179] pcieport 10000:02:01.0:
> > > pci_disable_pcie_error_reporting
> > > [  344.091269] pcieport 10000:02:02.0:
> > > pci_disable_pcie_error_reporting
> > > [  344.156473] pcieport 10000:02:03.0:
> > > pci_disable_pcie_error_reporting
> > > [  344.238042] pcieport 10000:02:04.0:
> > > pci_disable_pcie_error_reporting
> > > [  344.321864] pcieport 10000:02:05.0:
> > > pci_disable_pcie_error_reporting
> > > [  344.411601] pcieport 10000:02:06.0:
> > > pci_disable_pcie_error_reporting
> > > [  344.505332] pcieport 10000:02:07.0:
> > > pci_disable_pcie_error_reporting
> > > [  344.621824] nvme 10000:06:00.0: pci_enable_pcie_error_reporting
> 
> The main issue leading to this sequence of events is that the root
> port and tree is enumerated, then the tree has drivers attached
> in-order.  During driver probe, the root port services are attached
> which enables AER on the entire bus. Then the port's service drivers
> are attached, which disables AER during initialization.
> 
> It looks like I don't see this issue in the non-host-bridge case
> because this statement isn't executed:
>
> #ifdef CONFIG_PCIEAER
>         if (dev->aer_cap && pci_aer_available() &&
>             (pcie_ports_native || host->native_aer)) {
>                 services |= PCIE_PORT_SERVICE_AER;
> 
>                 /*
>                  * Disable AER on this port in case it's been enabled by the
>                  * BIOS (the AER service driver will enable it when necessary).
>                  */         
>                 pci_disable_pcie_error_reporting(dev);
>         }                                                              
> #endif

We should execute the statement above for all root ports, switch
upstream ports, and switch downstream ports, regardless of how they
were discovered.

I think the real issue is that if we register the drivers first, we
bind portdrv/AER to devices as they're enumerated.  But if we
enumerate the devices first, we bind portdrv to all of them first,
following by binding the AER driver to the root ports.  That ordering
difference explains the problem.

> So I think re-enabling or strictly enabling error reporting is the
> correct course of action for hotplug support at the least. And doing
> it after port service driver probe is done and IRQs are requested
> should eliminate or reduce races with initialization, which is what
> above code snippet looks to be doing in the first place along with
> pcie_init_services.
> 
> > > 
> > > Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> > > ---
> > >  drivers/pci/pcie/portdrv_core.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > > 
> > > diff --git a/drivers/pci/pcie/portdrv_core.c
> > > b/drivers/pci/pcie/portdrv_core.c
> > > index 7c37d81..fdd953a 100644
> > > --- a/drivers/pci/pcie/portdrv_core.c
> > > +++ b/drivers/pci/pcie/portdrv_core.c
> > > @@ -343,6 +343,16 @@ int pcie_port_device_register(struct pci_dev *dev)
> > >       if (!nr_service)
> > >               goto error_cleanup_irqs;
> > >  
> > > +#ifdef CONFIG_PCIEAER
> > > +     /*
> > > +      * Enable error reporting for this port in case AER probing has already
> > > +      * run on the root bus or this port device is hot-inserted
> > > +      */
> > > +     if (dev->aer_cap && pci_aer_available() &&
> > > +         (pcie_ports_native || pci_find_host_bridge(dev->bus)->native_aer))
> > > +             pci_enable_pcie_error_reporting(dev);
> > > +#endif
> > 
> > I plan to apply this after we clarify the changelog a bit, but I
> > don't really like this patch because it (and the corresponding
> > code added by 2bd50dd800b5 ("PCI: PCIe: Disable PCIe port services
> > during port initialization")) seem a little out of place.
> >
> > The way I think this *should* work is that the PCI core should
> > arrange to handle AER interrupts when it enumerates the devices
> > that can generate them (Root Ports and Root Complex Event
> > Collectors), even before it enumerates the devices below the Root
> > Port.
>
> To add to this, I'm not the biggest fan of the current way of having
> AER enable notification for ports on the tree, but then having every
> other PCI driver enable their own notification. It seems
> inconsistent for the port driver to not do it.

I agree, this is sort of a mess.  It would be nice if the AER code
could take care of this for all devices relevant to AER.  But the
portdrv structure is such that it can only bind to *ports*, not to
endpoints.  So there's not really a good way for the current AER
service driver to get involved when an endpoint is discovered.

> > Then the PCI core could directly enable the AER interrupts on all
> > devices as it enumerates them.  I would envision both cases being
> > handled somewhere like pci_aer_init() in pci_init_capabilities().
> > 
> > This would also allow us to get rid of the
> > pci_enable_pcie_error_reporting() calls that are currently
> > sprinkled around in drivers, because that would be handled by the
> > core for all devices.
>
> I agree with you here. It seems to make the most sense to handle it
> inclusively and exclude devices as necessary.

Sorry for the wall of text above.  Here's a proposal for an alternate
way to solve this by doing more of it in the AER driver.  Obviously I
can't test this so I don't know if it will solve the problem for you.

commit 15a6711649915ca3e9d1086dc88ff4b616b99aac
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Tue Oct 9 17:25:25 2018 -0500

    PCI/AER: Enable reporting for ports enumerated after AER driver registration
    
    Previously we enabled AER error reporting only for Switch Ports that were
    enumerated prior to registering the AER service driver.  Switch Ports
    enumerated after AER driver registration were left with error reporting
    disabled.
    
    A common order, which works correctly, is that we enumerate devices before
    registering portdrv and the AER driver:
    
      - Enumerate all the devices at boot-time
    
      - Register portdrv and bind it to all Root Ports and Switch Ports, which
        disables error reporting for these Ports
    
      - Register AER service driver and bind it to all Root Ports, which
        enables error reporting for the Root Ports and any Switch Ports below
        them
    
    But if we enumerate devices *after* registering portdrv and the AER driver,
    e.g., if a host bridge driver is loaded as a module, error reporting is not
    enabled correctly:
    
      - Register portdrv and AER driver (this happens at boot-time)
    
      - Enumerate a Root Port
    
      - Bind portdrv to Root Port, disabling its error reporting
    
      - Bind AER service driver to Root Port, enabling error reporting for it
        and its children (none, since we haven't enumerated them yet)
    
      - Enumerate Switch Port below the Root Port
    
      - Bind portdrv to Switch Port, disabling its error reporting
    
      - AER service driver doesn't bind to Switch Ports, so error reporting
        remains disabled
    
    Hot-adding a Switch fails similarly: error reporting is enabled correctly
    for the Root Port, but when the Switch is enumerated, the AER service
    driver doesn't claim it, so there's nothing to enable error reporting for
    the Switch Ports.
    
    Change the AER service driver so it binds to *all* PCIe ports, including
    Switch Upstream and Downstream Ports.  For Switch Ports, enable AER error
    reporting.
    
    Link: https://lore.kernel.org/linux-pci/1536085989-2956-1-git-send-email-jonathan.derrick@intel.com
    Based-on-patch-by: Jon Derrick <jonathan.derrick@intel.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 90b53abf621d..fe6c16461367 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1316,12 +1316,6 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
 	pci_read_config_dword(pdev, aer_pos + PCI_ERR_UNCOR_STATUS, &reg32);
 	pci_write_config_dword(pdev, aer_pos + PCI_ERR_UNCOR_STATUS, reg32);
 
-	/*
-	 * Enable error reporting for the root port device and downstream port
-	 * devices.
-	 */
-	set_downstream_devices_error_reporting(pdev, true);
-
 	/* Enable Root Port's interrupt in response to error messages */
 	pci_read_config_dword(pdev, aer_pos + PCI_ERR_ROOT_COMMAND, &reg32);
 	reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
@@ -1378,10 +1372,17 @@ static void aer_remove(struct pcie_device *dev)
  */
 static int aer_probe(struct pcie_device *dev)
 {
+	struct pci_dev *pdev = dev->port;
+	int type = pci_pcie_type(pdev);
 	int status;
 	struct aer_rpc *rpc;
 	struct device *device = &dev->device;
 
+	if (type == PCI_EXP_TYPE_UPSTREAM || type == PCI_EXP_TYPE_DOWNSTREAM) {
+		pci_enable_pcie_error_reporting(pdev);
+		return 0;
+	}
+
 	rpc = devm_kzalloc(device, sizeof(struct aer_rpc), GFP_KERNEL);
 	if (!rpc) {
 		dev_printk(KERN_DEBUG, device, "alloc AER rpc failed\n");
@@ -1439,7 +1440,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 
 static struct pcie_port_service_driver aerdriver = {
 	.name		= "aer",
-	.port_type	= PCI_EXP_TYPE_ROOT_PORT,
+	.port_type	= PCIE_ANY_PORT,
 	.service	= PCIE_PORT_SERVICE_AER,
 
 	.probe		= aer_probe,

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/portdrv: Enable error reporting on managed ports
  2018-10-09 23:19     ` Bjorn Helgaas
@ 2018-10-11 11:58       ` Dongdong Liu
  2018-10-11 15:16         ` Bjorn Helgaas
  0 siblings, 1 reply; 6+ messages in thread
From: Dongdong Liu @ 2018-10-11 11:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Derrick, Jonathan
  Cc: linux-kernel, okaya, willy, hch, lukas, mika.westerberg, poza,
	linux-pci, Busch, Keith

Hi Bjorn

> commit 15a6711649915ca3e9d1086dc88ff4b616b99aac
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Tue Oct 9 17:25:25 2018 -0500
>
>     PCI/AER: Enable reporting for ports enumerated after AER driver registration
>
>     Previously we enabled AER error reporting only for Switch Ports that were
>     enumerated prior to registering the AER service driver.  Switch Ports
>     enumerated after AER driver registration were left with error reporting
>     disabled.
>
>     A common order, which works correctly, is that we enumerate devices before
>     registering portdrv and the AER driver:
>
>       - Enumerate all the devices at boot-time
>
>       - Register portdrv and bind it to all Root Ports and Switch Ports, which
>         disables error reporting for these Ports
>
>       - Register AER service driver and bind it to all Root Ports, which
>         enables error reporting for the Root Ports and any Switch Ports below
>         them
>
>     But if we enumerate devices *after* registering portdrv and the AER driver,
>     e.g., if a host bridge driver is loaded as a module, error reporting is not
>     enabled correctly:
>
>       - Register portdrv and AER driver (this happens at boot-time)
>
>       - Enumerate a Root Port
>
>       - Bind portdrv to Root Port, disabling its error reporting
>
>       - Bind AER service driver to Root Port, enabling error reporting for it
>         and its children (none, since we haven't enumerated them yet)
>
>       - Enumerate Switch Port below the Root Port
>
>       - Bind portdrv to Switch Port, disabling its error reporting
>
>       - AER service driver doesn't bind to Switch Ports, so error reporting
>         remains disabled
>
>     Hot-adding a Switch fails similarly: error reporting is enabled correctly
>     for the Root Port, but when the Switch is enumerated, the AER service
>     driver doesn't claim it, so there's nothing to enable error reporting for
>     the Switch Ports.
>
>     Change the AER service driver so it binds to *all* PCIe ports, including
>     Switch Upstream and Downstream Ports.  For Switch Ports, enable AER error
>     reporting.
>
>     Link: https://lore.kernel.org/linux-pci/1536085989-2956-1-git-send-email-jonathan.derrick@intel.com
>     Based-on-patch-by: Jon Derrick <jonathan.derrick@intel.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 90b53abf621d..fe6c16461367 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1316,12 +1316,6 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
>  	pci_read_config_dword(pdev, aer_pos + PCI_ERR_UNCOR_STATUS, &reg32);
>  	pci_write_config_dword(pdev, aer_pos + PCI_ERR_UNCOR_STATUS, reg32);
>
> -	/*
> -	 * Enable error reporting for the root port device and downstream port
> -	 * devices.
> -	 */
> -	set_downstream_devices_error_reporting(pdev, true);
> -

Delete the code will also disable error reporting for the root port as
the portdrv to Root Port has disabled its error reporting,
so need to enable enable error reporting for the root port.
+pci_enable_pcie_error_reporting(pdev);

The patch looks good to me except this.

Thanks
Dongdong.

>  	/* Enable Root Port's interrupt in response to error messages */
>  	pci_read_config_dword(pdev, aer_pos + PCI_ERR_ROOT_COMMAND, &reg32);
>  	reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> @@ -1378,10 +1372,17 @@ static void aer_remove(struct pcie_device *dev)
>   */
>  static int aer_probe(struct pcie_device *dev)
>  {
> +	struct pci_dev *pdev = dev->port;
> +	int type = pci_pcie_type(pdev);
>  	int status;
>  	struct aer_rpc *rpc;
>  	struct device *device = &dev->device;
>
> +	if (type == PCI_EXP_TYPE_UPSTREAM || type == PCI_EXP_TYPE_DOWNSTREAM) {
> +		pci_enable_pcie_error_reporting(pdev);
> +		return 0;
> +	}
> +
>  	rpc = devm_kzalloc(device, sizeof(struct aer_rpc), GFP_KERNEL);
>  	if (!rpc) {
>  		dev_printk(KERN_DEBUG, device, "alloc AER rpc failed\n");
> @@ -1439,7 +1440,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>
>  static struct pcie_port_service_driver aerdriver = {
>  	.name		= "aer",
> -	.port_type	= PCI_EXP_TYPE_ROOT_PORT,
> +	.port_type	= PCIE_ANY_PORT,
>  	.service	= PCIE_PORT_SERVICE_AER,
>
>  	.probe		= aer_probe,
>
> .
>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/portdrv: Enable error reporting on managed ports
  2018-10-11 11:58       ` Dongdong Liu
@ 2018-10-11 15:16         ` Bjorn Helgaas
  0 siblings, 0 replies; 6+ messages in thread
From: Bjorn Helgaas @ 2018-10-11 15:16 UTC (permalink / raw)
  To: Dongdong Liu
  Cc: Derrick, Jonathan, linux-kernel, okaya, willy, hch, lukas,
	mika.westerberg, poza, linux-pci, Busch, Keith

On Thu, Oct 11, 2018 at 07:58:47PM +0800, Dongdong Liu wrote:
> Hi Bjorn
> 
> > commit 15a6711649915ca3e9d1086dc88ff4b616b99aac
> > Author: Bjorn Helgaas <bhelgaas@google.com>
> > Date:   Tue Oct 9 17:25:25 2018 -0500
> > 
> >     PCI/AER: Enable reporting for ports enumerated after AER driver registration
> > 
> >     Previously we enabled AER error reporting only for Switch Ports that were
> >     enumerated prior to registering the AER service driver.  Switch Ports
> >     enumerated after AER driver registration were left with error reporting
> >     disabled.
> > 
> >     A common order, which works correctly, is that we enumerate devices before
> >     registering portdrv and the AER driver:
> > 
> >       - Enumerate all the devices at boot-time
> > 
> >       - Register portdrv and bind it to all Root Ports and Switch Ports, which
> >         disables error reporting for these Ports
> > 
> >       - Register AER service driver and bind it to all Root Ports, which
> >         enables error reporting for the Root Ports and any Switch Ports below
> >         them
> > 
> >     But if we enumerate devices *after* registering portdrv and the AER driver,
> >     e.g., if a host bridge driver is loaded as a module, error reporting is not
> >     enabled correctly:
> > 
> >       - Register portdrv and AER driver (this happens at boot-time)
> > 
> >       - Enumerate a Root Port
> > 
> >       - Bind portdrv to Root Port, disabling its error reporting
> > 
> >       - Bind AER service driver to Root Port, enabling error reporting for it
> >         and its children (none, since we haven't enumerated them yet)
> > 
> >       - Enumerate Switch Port below the Root Port
> > 
> >       - Bind portdrv to Switch Port, disabling its error reporting
> > 
> >       - AER service driver doesn't bind to Switch Ports, so error reporting
> >         remains disabled
> > 
> >     Hot-adding a Switch fails similarly: error reporting is enabled correctly
> >     for the Root Port, but when the Switch is enumerated, the AER service
> >     driver doesn't claim it, so there's nothing to enable error reporting for
> >     the Switch Ports.
> > 
> >     Change the AER service driver so it binds to *all* PCIe ports, including
> >     Switch Upstream and Downstream Ports.  For Switch Ports, enable AER error
> >     reporting.
> > 
> >     Link: https://lore.kernel.org/linux-pci/1536085989-2956-1-git-send-email-jonathan.derrick@intel.com
> >     Based-on-patch-by: Jon Derrick <jonathan.derrick@intel.com>
> >     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 90b53abf621d..fe6c16461367 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -1316,12 +1316,6 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
> >  	pci_read_config_dword(pdev, aer_pos + PCI_ERR_UNCOR_STATUS, &reg32);
> >  	pci_write_config_dword(pdev, aer_pos + PCI_ERR_UNCOR_STATUS, reg32);
> > 
> > -	/*
> > -	 * Enable error reporting for the root port device and downstream port
> > -	 * devices.
> > -	 */
> > -	set_downstream_devices_error_reporting(pdev, true);
> > -
> 
> Delete the code will also disable error reporting for the root port as
> the portdrv to Root Port has disabled its error reporting,
> so need to enable enable error reporting for the root port.
> +pci_enable_pcie_error_reporting(pdev);

Oh, you're right, thank you!

I'll post a "v3" to fix this, i.e.,

  v1 - Jon's original post, https://lore.kernel.org/linux-pci/1536085989-2956-1-git-send-email-jonathan.derrick@intel.com
  v2 - My rework, https://lore.kernel.org/linux-pci/20181009231915.GC5906@bhelgaas-glaptop.roam.corp.google.com
  v3 - My rework + enable error reporting for Root Ports

Bjorn

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-10-11 15:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-04 18:33 [PATCH] PCI/portdrv: Enable error reporting on managed ports Jon Derrick
2018-10-09 17:56 ` Bjorn Helgaas
2018-10-09 19:51   ` Derrick, Jonathan
2018-10-09 23:19     ` Bjorn Helgaas
2018-10-11 11:58       ` Dongdong Liu
2018-10-11 15:16         ` Bjorn Helgaas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.