linux-pci.vger.kernel.org archive mirror
* Recovering from AER: Uncorrected (Fatal) error
@ 2020-12-04 12:52 Hinko Kocevar
  2020-12-04 22:38 ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-04 12:52 UTC (permalink / raw)
  To: linux-pci

Hi,

I'm trying to figure out how to recover from an Uncorrected (Fatal) error that is seen by the PCI root port on the CPU's PCIe controller:

Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
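(Editor's aside: the `error status/mask=00004000/00000000` line is the raw AER Uncorrectable Error Status/Mask pair; bit 14 set with a zero mask is exactly the Completion Timeout printed on the next line. A minimal user-space sketch of that decoding, using the bit positions from `include/uapi/linux/pci_regs.h` — this is illustrative, not kernel code:)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A few AER Uncorrectable Error Status bits; values match
 * include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_DLP        0x00000010  /* Data Link Protocol error */
#define PCI_ERR_UNC_COMP_TIME  0x00004000  /* Completion Timeout (bit 14) */
#define PCI_ERR_UNC_COMP_ABORT 0x00008000  /* Completer Abort */

/* Return the name of the first unmasked error bit that is set. */
static const char *first_unc_error(uint32_t status, uint32_t mask)
{
    uint32_t live = status & ~mask;  /* masked bits never signal */
    if (live & PCI_ERR_UNC_DLP)        return "Data Link Protocol";
    if (live & PCI_ERR_UNC_COMP_TIME)  return "Completion Timeout";
    if (live & PCI_ERR_UNC_COMP_ABORT) return "Completer Abort";
    return "none";
}
```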

This is the complete PCI device tree that I'm working with:

$ sudo /usr/local/bin/pcicrawler -t
00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
 ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
 │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
 │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
 │  │     ├─04:00.0 downstream_port, slot 10, power: Off
 │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
 │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
 │  │     ├─04:02.0 downstream_port, slot 9, power: Off
 │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
 │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
 │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
 │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
 │  │     ├─04:09.0 downstream_port, slot 11, power: Off
 │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
 │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
 │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
 │  │     ├─04:10.0 downstream_port, slot 8, power: Off
 │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
 │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
 │  │     └─04:12.0 downstream_port, slot 7, power: Off
 │  ├─02:02.0 downstream_port, slot 2
 │  ├─02:08.0 downstream_port, slot 8
 │  ├─02:09.0 downstream_port, slot 9, power: Off
 │  └─02:0a.0 downstream_port, slot 10
 ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
 ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
 ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
 └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0

00:01.0 is on the CPU board; 03:00.0 and everything below it is not (this is a MicroTCA system). The endpoints here are FPGA-based devices.
After the error, none of the endpoints in the list above can talk to the CPU anymore; register reads return 0xFFFFFFFF. At the same time, config space for those devices looks sane and is accessible.

How can I debug this further? I'm getting the "I/O to channel is blocked" (pci_channel_io_frozen) state reported to the endpoint devices I provide drivers for.
Is there any way of telling whether the PCI switch devices between 00:01.0 ... 06:00.0 have all recovered, i.e. links up and running? Does the Linux kernel expose this information somehow?

For reference, I've experimented with aer-inject, and my tests showed that if the same type of error is injected into any other PCI device in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link recovers successfully and I can continue working with the endpoint devices. In those cases the "I/O channel is in normal state" (pci_channel_io_normal) state was reported; only error injection into 00:01.0 reports pci_channel_io_frozen. The recovery code in the endpoint drivers I maintain just calls pci_cleanup_aer_uncorrect_error_status() in the error handler's resume() callback.
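(Editor's aside: the resume() callback mentioned here is one hook of struct pci_error_handlers. The sketch below shows the shape of such a handler compiled outside a kernel tree, so the stub types, the `my_*` names, and the simplified cleanup function are all stand-ins for the real kernel definitions in <linux/pci.h>:)

```c
#include <assert.h>

/* User-space stubs standing in for kernel types, purely so the shape
 * of an AER error handler can be shown and exercised here. */
struct pci_dev { int unused; };
typedef int pci_channel_state_t;

struct pci_error_handlers {
    int  (*error_detected)(struct pci_dev *pdev, pci_channel_state_t state);
    void (*resume)(struct pci_dev *pdev);
};

static int resume_ran;

/* Hypothetical stand-in for pci_cleanup_aer_uncorrect_error_status(),
 * which on v3.10 clears the device's AER uncorrectable status bits. */
static void cleanup_aer_uncorrect_error_status(struct pci_dev *pdev)
{
    (void)pdev;
}

static void my_resume(struct pci_dev *pdev)
{
    /* All this driver does on recovery: clear the logged error status. */
    cleanup_aer_uncorrect_error_status(pdev);
    resume_ran = 1;
}

static const struct pci_error_handlers my_err_handler = {
    .resume = my_resume,
};
```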

FYI, this is on the 3.10.0-1160.6.1.el7.x86_64.debug CentOS 7 kernel, which I realize is quite old. At the moment I cannot use a newer kernel, but I'd be prepared to take that step if told it would help.

Thank you in advance,

Hinko


* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-04 12:52 Recovering from AER: Uncorrected (Fatal) error Hinko Kocevar
@ 2020-12-04 22:38 ` Bjorn Helgaas
  2020-12-09 10:02   ` Hinko Kocevar
  0 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2020-12-04 22:38 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: linux-pci

On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
> Hi,
> 
> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
> 
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
> 
> This is the complete PCI device tree that I'm working with:
> 
> $ sudo /usr/local/bin/pcicrawler -t
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>  │  │     ├─04:00.0 downstream_port, slot 10, power: Off
>  │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:02.0 downstream_port, slot 9, power: Off
>  │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:09.0 downstream_port, slot 11, power: Off
>  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
>  │  │     ├─04:10.0 downstream_port, slot 8, power: Off
>  │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>  │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>  │  │     └─04:12.0 downstream_port, slot 7, power: Off
>  │  ├─02:02.0 downstream_port, slot 2
>  │  ├─02:08.0 downstream_port, slot 8
>  │  ├─02:09.0 downstream_port, slot 9, power: Off
>  │  └─02:0a.0 downstream_port, slot 10
>  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
> 
> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is
> not on a CPU board (working with a micro TCA system here). I'm
> working with FPGA based devices seen as endpoints here.  After the
> error all the endpoints in the above list are not able to talk to
> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI
> config space looks sane and accessible for those devices.

This could be caused by a reset.  We probably do a secondary bus reset
on the Root Port, which resets all the devices below it.  After the
reset, config space of those downstream devices would be accessible,
but the PCI_COMMAND register may be cleared which means the device
wouldn't respond to MMIO reads.
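(Editor's aside: the secondary bus reset in question is just a toggle of bit 6 in the bridge's Bridge Control register; in current kernels this lives in pci_reset_secondary_bus(). A toy model of the sequence, with a plain variable standing in for the hardware register:)

```c
#include <assert.h>
#include <stdint.h>

#define PCI_BRIDGE_CTL_BUS_RESET 0x40  /* bit 6 of Bridge Control (pci_regs.h) */

/* Toy stand-in for the bridge's Bridge Control register. */
static uint16_t bridge_ctl;

static void secondary_bus_reset(void)
{
    bridge_ctl |= PCI_BRIDGE_CTL_BUS_RESET;   /* hold every child device in reset */
    /* the real code sleeps here while the bus is held in reset, then ... */
    bridge_ctl &= ~PCI_BRIDGE_CTL_BUS_RESET;  /* release: children come back up with
                                                 PCI_COMMAND at its power-on default,
                                                 i.e. MMIO decoding disabled */
}
```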

None of this explains the original problem of the Completion Timeout,
of course.  The error source of 0x8 (00:01.0) is the root port, which
makes sense if it issued a request transaction and never got the
completion.  The root port *should* log the header of the request and
we should print it, but it looks like we didn't.  "lspci -vv" of the
device would show whether it's capable of this logging.
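(Editor's aside: the 0x8 → 00:01.0 mapping is the standard PCIe Requester ID encoding, bus in bits 15:8, device in 7:3, function in 2:0. A quick sketch:)

```c
#include <assert.h>
#include <stdint.h>

/* PCIe Requester ID layout: [15:8] bus, [7:3] device, [2:0] function. */
struct bdf { uint8_t bus, dev, fn; };

static struct bdf decode_requester_id(uint16_t id)
{
    struct bdf b = {
        (uint8_t)(id >> 8),
        (uint8_t)((id >> 3) & 0x1f),
        (uint8_t)(id & 0x7),
    };
    return b;
}
```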

If you're talking to an FPGA, the most likely explanation is a bug in
the FPGA programming where it doesn't respond when it should.

A PCIe analyzer would show exactly what the problem is, but those are
expensive and rare.  But if you're dealing with FPGAs, maybe you have
access to one.

> How can I debug this further? I'm getting the "I/O to channel is
> blocked" (pci_channel_io_frozen) state reported to the the endpoint
> devices I provide driver for.  Is there any way of telling if the
> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ;
> links up and running and similar? Is this information provided by
> the Linux kernel somehow?
> 
> For reference, I've experimented with AER inject and my tests showed
> that if the same type of error is injected in any other PCI device
> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
> recovered successfully, and I can continue working with the endpoint
> devices. In those cases the "I/O channel is in normal state"
> (pci_channel_io_normal) state was reported; only error injection
> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in
> the endpoint drivers I maintain is just calling the
> pci_cleanup_aer_uncorrect_error_status() in error handler resume()
> callback.
> 
> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
> which I believe is quite old. At the moment I can not use newer
> kernel, but would be prepared to take that step if told that it
> would help.

It's really not practical for us to help debug a v3.10-based kernel;
that's over seven years old and AER handling has changed significantly
since then.  Also, CentOS is a distro kernel and includes many changes
on top of the v3.10 upstream kernel.  Those changes might be
theoretically open-source, but as a practical matter, the distro is a
better place to start for support of their kernel.

Bjorn


* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-04 22:38 ` Bjorn Helgaas
@ 2020-12-09 10:02   ` Hinko Kocevar
  2020-12-09 17:40     ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-09 10:02 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci


________________________________________
From: Bjorn Helgaas <helgaas@kernel.org>
Sent: Friday, December 4, 2020 11:38 PM
To: Hinko Kocevar
Cc: linux-pci@vger.kernel.org
Subject: Re: Recovering from AER: Uncorrected (Fatal) error

On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
> Hi,
>
> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
>
> This is the complete PCI device tree that I'm working with:
>
> $ sudo /usr/local/bin/pcicrawler -t
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>  │  │     ├─04:00.0 downstream_port, slot 10, power: Off
>  │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:02.0 downstream_port, slot 9, power: Off
>  │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:09.0 downstream_port, slot 11, power: Off
>  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
>  │  │     ├─04:10.0 downstream_port, slot 8, power: Off
>  │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>  │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>  │  │     └─04:12.0 downstream_port, slot 7, power: Off
>  │  ├─02:02.0 downstream_port, slot 2
>  │  ├─02:08.0 downstream_port, slot 8
>  │  ├─02:09.0 downstream_port, slot 9, power: Off
>  │  └─02:0a.0 downstream_port, slot 10
>  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>
> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is
> not on a CPU board (working with a micro TCA system here). I'm
> working with FPGA based devices seen as endpoints here.  After the
> error all the endpoints in the above list are not able to talk to
> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI
> config space looks sane and accessible for those devices.

This could be caused by a reset.  We probably do a secondary bus reset
on the Root Port, which resets all the devices below it.  After the
reset, config space of those downstream devices would be accessible,
but the PCI_COMMAND register may be cleared which means the device
wouldn't respond to MMIO reads.

After adding some printk()s to the PCIe code that deals with the reset, I can see that pci_bus_error_reset() falls through to pci_bus_reset() because the check for an empty bus slot list returns true. So the secondary bus is indeed being reset.

None of this explains the original problem of the Completion Timeout,
of course.  The error source of 0x8 (00:01.0) is the root port, which
makes sense if it issued a request transaction and never got the
completion.  The root port *should* log the header of the request and
we should print it, but it looks like we didn't.  "lspci -vv" of the
device would show whether it's capable of this logging.

I'm mostly concerned about the fact that the secondary bus is rendered useless after the reset/recovery. AIUI, this should not happen.

After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:

        Capabilities: [1c0 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000001 00000002 00000003
                RootCmd: CERptEn+ NFERptEn+ FERptEn+
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008

The above was provoked with aer-inject of the same error as seen in the initial report, with a bogus 'HEADER_LOG 0 1 2 3'.

Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO               
Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2


Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?
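(Editor's aside: the four HeaderLog dwords are the raw header of the TLP associated with the error, and the first dword's top bits identify the request: fmt in bits 31:29, type in 28:24, per the PCIe TLP header layout. A small decoding sketch; the helper names are made up for illustration:)

```c
#include <assert.h>
#include <stdint.h>

/* TLP header dword 0: fmt = bits [31:29], type = bits [28:24]. */
static unsigned tlp_fmt(uint32_t dw0)  { return (dw0 >> 29) & 0x7; }
static unsigned tlp_type(uint32_t dw0) { return (dw0 >> 24) & 0x1f; }

/* fmt=0, type=0 encodes a 3DW Memory Read request (MRd 32-bit). */
static int is_mem_read_3dw(uint32_t dw0)
{
    return tlp_fmt(dw0) == 0 && tlp_type(dw0) == 0;
}
```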


If you're talking to an FPGA, the most likely explanation is a bug in
the FPGA programming where it doesn't respond when it should.

The thing is that this error does not happen that often, and most of the time the FPGA card is just fine. TBH, I am doing some stress testing of card removal through the MCH (exercising hotplug), and that is when I noticed the initial problem.

A PCIe analyzer would show exactly what the problem is, but those are
expensive and rare.  But if you're dealing with FPGAs, maybe you have
access to one.

Sadly, no, I do not have access to the analyzer.

> How can I debug this further? I'm getting the "I/O to channel is
> blocked" (pci_channel_io_frozen) state reported to the the endpoint
> devices I provide driver for.  Is there any way of telling if the
> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ;
> links up and running and similar? Is this information provided by
> the Linux kernel somehow?
>
> For reference, I've experimented with AER inject and my tests showed
> that if the same type of error is injected in any other PCI device
> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
> recovered successfully, and I can continue working with the endpoint
> devices. In those cases the "I/O channel is in normal state"
> (pci_channel_io_normal) state was reported; only error injection
> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in
> the endpoint drivers I maintain is just calling the
> pci_cleanup_aer_uncorrect_error_status() in error handler resume()
> callback.
>
> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
> which I believe is quite old. At the moment I can not use newer
> kernel, but would be prepared to take that step if told that it
> would help.

It's really not practical for us to help debug a v3.10-based kernel;
that's over seven years old and AER handling has changed significantly
since then.  Also, CentOS is a distro kernel and includes many changes
on top of the v3.10 upstream kernel.  Those changes might be
theoretically open-source, but as a practical matter, the distro is a
better place to start for support of their kernel.

I followed your advice and started using kernel 5.9.12; hope this is recent enough.

With that kernel I see somewhat different behavior than with 3.10. Configuration access to the devices on the secondary bus is not OK; reads return all 0xFFFFFFFF. This is what I see after the error was injected into root port 00:01.0 and a secondary bus reset was performed as part of recovery:

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 128
02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: pcieport

03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: pcieport

04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: pcieport

0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: mrf-pci

The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED. How come the 01:00.0 upstream port's PCI_COMMAND register is cleared? I guess that explains why the rest of the devices in the chain, down to the endpoint at 0e:00.0, are rendered useless.
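(Editor's aside: the two symptoms seen here can be told apart programmatically: a read that comes back all-ones means the transaction did not complete at all (device unreachable, link down), while a device that does answer config cycles but has the memory-enable bit of PCI_COMMAND cleared simply won't claim MMIO until that bit is restored. A user-space sketch of the two checks:)

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2  /* Memory Space Enable, as in pci_regs.h */

/* A config read that fails (device gone, link down) returns all-ones. */
static int read_indicates_no_device(uint32_t val)
{
    return val == 0xffffffff;
}

/* A reachable device with MSE clear will not decode MMIO, so BAR reads
 * also come back all-ones until the bit is set again. */
static int mmio_decode_enabled(uint16_t pci_command)
{
    return (pci_command & PCI_COMMAND_MEMORY) != 0;
}
```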


Then, as part of having fun, I made pci_bus_error_reset() *NOT* do the actual reset of the secondary bus, by just returning 0 before the call to pci_bus_reset(). In this case no bus reset is done, obviously, and MMIO access to my endpoint(s) works just fine after the error is injected into the root port; as if nothing happened.

Is it expected that, after this type of error, the secondary bus ends up in this kind of state? It looks to me like the problem is not with my endpoints but with the PCIe switch on the CPU board (01:00.0 <--> 02:0?.0).

Thanks!
//hinko



Bjorn


* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-09 10:02   ` Hinko Kocevar
@ 2020-12-09 17:40     ` Bjorn Helgaas
  2020-12-09 20:31       ` Hinko Kocevar
  2020-12-09 20:50       ` Hinko Kocevar
  0 siblings, 2 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2020-12-09 17:40 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: linux-pci

[non-standard email quoting below, ">" for both my text and Hinko's,
see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style]

On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote:
> ________________________________________
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Friday, December 4, 2020 11:38 PM
> To: Hinko Kocevar
> Cc: linux-pci@vger.kernel.org
> Subject: Re: Recovering from AER: Uncorrected (Fatal) error
> 
> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
> > Hi,
> >
> > I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
> >
> > Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> > Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> > Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
> > Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
> >
> > This is the complete PCI device tree that I'm working with:
> >
> > $ sudo /usr/local/bin/pcicrawler -t
> > 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
> >  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
> >  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
> >  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
> >  │  │     ├─04:00.0 downstream_port, slot 10, power: Off
> >  │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
> >  │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
> >  │  │     ├─04:02.0 downstream_port, slot 9, power: Off
> >  │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
> >  │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
> >  │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
> >  │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
> >  │  │     ├─04:09.0 downstream_port, slot 11, power: Off
> >  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
> >  │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
> >  │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
> >  │  │     ├─04:10.0 downstream_port, slot 8, power: Off
> >  │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
> >  │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
> >  │  │     └─04:12.0 downstream_port, slot 7, power: Off
> >  │  ├─02:02.0 downstream_port, slot 2
> >  │  ├─02:08.0 downstream_port, slot 8
> >  │  ├─02:09.0 downstream_port, slot 9, power: Off
> >  │  └─02:0a.0 downstream_port, slot 10
> >  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
> >  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
> >  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
> >  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
> >
> > 00:01.0 is on a CPU board, The 03:00.0 and everything below that is
> > not on a CPU board (working with a micro TCA system here). I'm
> > working with FPGA based devices seen as endpoints here.  After the
> > error all the endpoints in the above list are not able to talk to
> > CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI
> > config space looks sane and accessible for those devices.
> 
> This could be caused by a reset.  We probably do a secondary bus reset
> on the Root Port, which resets all the devices below it.  After the
> reset, config space of those downstream devices would be accessible,
> but the PCI_COMMAND register may be cleared which means the device
> wouldn't respond to MMIO reads.
> 
> After adding some printk()'s to the PCIe code that deals with the reset, I can see that the pci_bus_error_reset() calls the pci_bus_reset() due to bus slots list check for empty returns true. So the the secondary bus is being reset.
> 
> None of this explains the original problem of the Completion Timeout,
> of course.  The error source of 0x8 (00:01.0) is the root port, which
> makes sense if it issued a request transaction and never got the
> completion.  The root port *should* log the header of the request and
> we should print it, but it looks like we didn't.  "lspci -vv" of the
> device would show whether it's capable of this logging.
> 
> I'm mostly concerned about the fact that the secondary bus is
> rendered useless after the reset/recovery. AIUI, this should not
> happen.

It should not.  The secondary bus should still work after the reset.

> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:
> 
>         Capabilities: [1c0 v1] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>                 AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>                         MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                 HeaderLog: 00000000 00000001 00000002 00000003
>                 RootCmd: CERptEn+ NFERptEn+ FERptEn+
>                 RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>                          FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>                 ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008
> 
> The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'.
> 
> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO               
> Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
> 
> 
> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?

No, it shouldn't.  I'm pretty sure we only print those values and they
don't influence behavior at all.

> > How can I debug this further? I'm getting the "I/O to channel is
> > blocked" (pci_channel_io_frozen) state reported to the the endpoint
> > devices I provide driver for.  Is there any way of telling if the
> > PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ;
> > links up and running and similar? Is this information provided by
> > the Linux kernel somehow?
> >
> > For reference, I've experimented with AER inject and my tests showed
> > that if the same type of error is injected in any other PCI device
> > in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
> > recovered successfully, and I can continue working with the endpoint
> > devices. In those cases the "I/O channel is in normal state"
> > (pci_channel_io_normal) state was reported; only error injection
> > into 00:01.0 reports pci_channel_io_frozen state. Recovery code in
> > the endpoint drivers I maintain is just calling the
> > pci_cleanup_aer_uncorrect_error_status() in error handler resume()
> > callback.
> >
> > FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
> > which I believe is quite old. At the moment I can not use newer
> > kernel, but would be prepared to take that step if told that it
> > would help.
> 
> It's really not practical for us to help debug a v3.10-based kernel;
> that's over seven years old and AER handling has changed significantly
> since then.  Also, CentOS is a distro kernel and includes many changes
> on top of the v3.10 upstream kernel.  Those changes might be
> theoretically open-source, but as a practical matter, the distro is a
> better place to start for support of their kernel.
> 
> I followed your advice and started using kernel 5.9.12; hope this is recent enough.
> 
> With that kernel I see somewhat different behavior than with 3.10
> earlier. Configuration access to the devices on the secondary bus
> is not OK; reads return all 0xFFFFFFFF. This is what I see after the
> error was injected into the root port 00:01.0 and a secondary bus
> reset was done as part of recovery:
> 
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 0, Cache Line Size: 64 bytes
> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Interrupt: pin A routed to IRQ 128
> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
> 	!!! Unknown header type 7f
> 	Kernel driver in use: pcieport
> 
> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
> 	!!! Unknown header type 7f
> 	Kernel driver in use: pcieport
> 
> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
> 	!!! Unknown header type 7f
> 	Kernel driver in use: pcieport
> 
> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
> 	!!! Unknown header type 7f
> 	Kernel driver in use: mrf-pci
> 
> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED.
> How come the 01:00.0 upstream port's PCI_COMMAND register is
> cleared? I guess that explains why the rest of the devices in the
> chain down to the endpoint at 0e:00.0 are rendered useless.
> 
> Then, as part of having fun, I made the pci_bus_error_reset() *NOT*
> do the actual reset of the secondary by just returning 0 before the
> call to the pci_bus_reset(). In this case no bus reset is done,
> obviously, and MMIO access to my endpoint(s) is working just fine
> after the error was injected into the root port; as if nothing
> happened.
> 
> Is it expected that in cases of seeing this type of error the
> secondary bus would be in this kind of state? It looks to me that
> the problem is not with my endpoints but with the PCIe switch on the
> CPU board (01:00.0 <--> 02:0?.0).

Hmm.  Yeah, I don't see anything in the path that would restore the
01:00.0 config space, which seems like a problem.  Probably its
secondary/subordinate bus registers and window registers get reset to
0 ("lspci -vv" would show that), so config/mem/io space below it is
not accessible.

I didn't have much confidence in our error handling to begin with, and
it's eroding ;)

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-09 17:40     ` Bjorn Helgaas
@ 2020-12-09 20:31       ` Hinko Kocevar
  2020-12-09 20:50       ` Hinko Kocevar
  1 sibling, 0 replies; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-09 20:31 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci

Hi Bjorn!

On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
> [non-standard email quoting below, ">" for both my text and Hinko's,
> see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style]
> 

Apologies. Probably due to the web MUA I used. Hopefully this will be 
better now.

> On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote:
>> ________________________________________
>> From: Bjorn Helgaas <helgaas@kernel.org>
>> Sent: Friday, December 4, 2020 11:38 PM
>> To: Hinko Kocevar
>> Cc: linux-pci@vger.kernel.org
>> Subject: Re: Recovering from AER: Uncorrected (Fatal) error
>>
>> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
>>> Hi,
>>>
>>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>>>
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
>>>
>>> This is the complete PCI device tree that I'm working with:
>>>
>>> $ sudo /usr/local/bin/pcicrawler -t
>>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>>   ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>>   │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>>   │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>>   │  │     ├─04:00.0 downstream_port, slot 10, power: Off
>>>   │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:02.0 downstream_port, slot 9, power: Off
>>>   │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:09.0 downstream_port, slot 11, power: Off
>>>   │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
>>>   │  │     ├─04:10.0 downstream_port, slot 8, power: Off
>>>   │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>>>   │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>>>   │  │     └─04:12.0 downstream_port, slot 7, power: Off
>>>   │  ├─02:02.0 downstream_port, slot 2
>>>   │  ├─02:08.0 downstream_port, slot 8
>>>   │  ├─02:09.0 downstream_port, slot 9, power: Off
>>>   │  └─02:0a.0 downstream_port, slot 10
>>>   ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>
>>> 00:01.0 is on a CPU board; 03:00.0 and everything below it is
>>> not (this is a MicroTCA system). I'm working with FPGA-based
>>> devices seen as endpoints here.  After the error, all the
>>> endpoints in the above list are unable to talk to the CPU;
>>> register reads return 0xFFFFFFFF. At the same time, PCI config
>>> space looks sane and accessible for those devices.
>>
>> This could be caused by a reset.  We probably do a secondary bus reset
>> on the Root Port, which resets all the devices below it.  After the
>> reset, config space of those downstream devices would be accessible,
>> but the PCI_COMMAND register may be cleared which means the device
>> wouldn't respond to MMIO reads.
>>
>> After adding some printk()'s to the PCIe code that deals with the reset, I can see that pci_bus_error_reset() calls pci_bus_reset() because the check for an empty bus slots list returns true. So the secondary bus is being reset.
>>
>> None of this explains the original problem of the Completion Timeout,
>> of course.  The error source of 0x8 (00:01.0) is the root port, which
>> makes sense if it issued a request transaction and never got the
>> completion.  The root port *should* log the header of the request and
>> we should print it, but it looks like we didn't.  "lspci -vv" of the
>> device would show whether it's capable of this logging.
>>
>> I'm mostly concerned about the fact that the secondary bus is
>> rendered useless after the reset/recovery. AIUI, this should not
>> happen.
> 
> It should not.  The secondary bus should still work after the reset.

I was hoping to hear that!
> 
>> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:
>>
>>          Capabilities: [1c0 v1] Advanced Error Reporting
>>                  UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>>                  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>                  AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>>                          MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>                  HeaderLog: 00000000 00000001 00000002 00000003
>>                  RootCmd: CERptEn+ NFERptEn+ FERptEn+
>>                  RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>>                           FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>>                  ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008
>>
>> The above was provoked with aer-inject, using the same error as in the initial report and a bogus 'HEADER_LOG 0 1 2 3'.
>>
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO
>> Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
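For reference, an aer-inject input file reproducing roughly this injection might look like the following. This is a sketch based on the example format shipped with the aer-inject tool; the field names are from memory and worth double-checking against that tool's README:

```text
AER
DOMAIN 0000
BUS 0
DEV 1
FN 0

UNCOR_STATUS COMP_TIME
HEADER_LOG 0 1 2 3
```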
>>
>>
>> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?
> 
> No, it shouldn't.  I'm pretty sure we only print those values and they
> don't influence behavior at all.
> 
>>> How can I debug this further? I'm getting the "I/O to channel is
>>> blocked" (pci_channel_io_frozen) state reported to the endpoint
>>> devices I provide drivers for.  Is there any way of telling whether
>>> the PCI switch devices between 00:01.0 ... 06:00.0 have all
>>> recovered, i.e. links up and running? Is this information provided
>>> by the Linux kernel somehow?
>>>
>>> For reference, I've experimented with AER inject and my tests showed
>>> that if the same type of error is injected into any other PCI device
>>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link
>>> recovers successfully and I can continue working with the endpoint
>>> devices. In those cases the "I/O channel is in normal state"
>>> (pci_channel_io_normal) state was reported; only error injection
>>> into 00:01.0 reports the pci_channel_io_frozen state. Recovery code
>>> in the endpoint drivers I maintain simply calls
>>> pci_cleanup_aer_uncorrect_error_status() in the error handler's
>>> resume() callback.
>>>
>>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
>>> which I believe is quite old. At the moment I can not use newer
>>> kernel, but would be prepared to take that step if told that it
>>> would help.
>>
>> It's really not practical for us to help debug a v3.10-based kernel;
>> that's over seven years old and AER handling has changed significantly
>> since then.  Also, CentOS is a distro kernel and includes many changes
>> on top of the v3.10 upstream kernel.  Those changes might be
>> theoretically open-source, but as a practical matter, the distro is a
>> better place to start for support of their kernel.
>>
>> I followed your advice and started using kernel 5.9.12; hope this is recent enough.
>>
>> With that kernel I see somewhat different behavior than with 3.10
>> earlier. Configuration access to the devices on the secondary bus
>> is not OK; reads return all 0xFFFFFFFF. This is what I see after the
>> error was injected into the root port 00:01.0 and a secondary bus
>> reset was done as part of recovery:
>>
>> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
>> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Latency: 0, Cache Line Size: 64 bytes
>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
>> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Interrupt: pin A routed to IRQ 128
>> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: mrf-pci
>>
>> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED.
>> How come the 01:00.0 upstream port's PCI_COMMAND register is
>> cleared? I guess that explains why the rest of the devices in the
>> chain down to the endpoint at 0e:00.0 are rendered useless.
>>
>> Then, as part of having fun, I made the pci_bus_error_reset() *NOT*
>> do the actual reset of the secondary by just returning 0 before the
>> call to the pci_bus_reset(). In this case no bus reset is done,
>> obviously, and MMIO access to my endpoint(s) is working just fine
>> after the error was injected into the root port; as if nothing
>> happened.
>>
>> Is it expected that in cases of seeing this type of error the
>> secondary bus would be in this kind of state? It looks to me that
>> the problem is not with my endpoints but with the PCIe switch on the
>> CPU board (01:00.0 <--> 02:0?.0).
> 
> Hmm.  Yeah, I don't see anything in the path that would restore the
> 01:00.0 config space, which seems like a problem.  Probably its
> secondary/subordinate bus registers and window registers get reset to
> 0 ("lspci -vv" would show that), so config/mem/io space below it is
> not accessible.
> 
> I didn't have much confidence in our error handling to begin with, and
> it's eroding ;)
> 
> Bjorn
> 


* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-09 17:40     ` Bjorn Helgaas
  2020-12-09 20:31       ` Hinko Kocevar
@ 2020-12-09 20:50       ` Hinko Kocevar
  2020-12-09 21:32         ` Bjorn Helgaas
  1 sibling, 1 reply; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-09 20:50 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci

Hi Bjorn!

On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
> [non-standard email quoting below, ">" for both my text and Hinko's,
> see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style]
> 

Apologies. Probably due to the web MUA I used. Hopefully this will be 
better now.


> On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote:
>> ________________________________________
>> From: Bjorn Helgaas <helgaas@kernel.org>
>> Sent: Friday, December 4, 2020 11:38 PM
>> To: Hinko Kocevar
>> Cc: linux-pci@vger.kernel.org
>> Subject: Re: Recovering from AER: Uncorrected (Fatal) error
>>
>> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
>>> Hi,
>>>
>>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>>>
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
>>>
>>> This is the complete PCI device tree that I'm working with:
>>>
>>> $ sudo /usr/local/bin/pcicrawler -t
>>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>>   ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>>   │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>>   │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>>   │  │     ├─04:00.0 downstream_port, slot 10, power: Off
>>>   │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:02.0 downstream_port, slot 9, power: Off
>>>   │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:09.0 downstream_port, slot 11, power: Off
>>>   │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
>>>   │  │     ├─04:10.0 downstream_port, slot 8, power: Off
>>>   │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>>>   │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>>>   │  │     └─04:12.0 downstream_port, slot 7, power: Off
>>>   │  ├─02:02.0 downstream_port, slot 2
>>>   │  ├─02:08.0 downstream_port, slot 8
>>>   │  ├─02:09.0 downstream_port, slot 9, power: Off
>>>   │  └─02:0a.0 downstream_port, slot 10
>>>   ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>
>>> 00:01.0 is on a CPU board; 03:00.0 and everything below it is
>>> not (this is a MicroTCA system). I'm working with FPGA-based
>>> devices seen as endpoints here.  After the error, all the
>>> endpoints in the above list are unable to talk to the CPU;
>>> register reads return 0xFFFFFFFF. At the same time, PCI config
>>> space looks sane and accessible for those devices.
>>
>> This could be caused by a reset.  We probably do a secondary bus reset
>> on the Root Port, which resets all the devices below it.  After the
>> reset, config space of those downstream devices would be accessible,
>> but the PCI_COMMAND register may be cleared which means the device
>> wouldn't respond to MMIO reads.
>>
>> After adding some printk()'s to the PCIe code that deals with the reset, I can see that pci_bus_error_reset() calls pci_bus_reset() because the check for an empty bus slots list returns true. So the secondary bus is being reset.
>>
>> None of this explains the original problem of the Completion Timeout,
>> of course.  The error source of 0x8 (00:01.0) is the root port, which
>> makes sense if it issued a request transaction and never got the
>> completion.  The root port *should* log the header of the request and
>> we should print it, but it looks like we didn't.  "lspci -vv" of the
>> device would show whether it's capable of this logging.
>>
>> I'm mostly concerned about the fact that the secondary bus is
>> rendered useless after the reset/recovery. AIUI, this should not
>> happen.
> 
> It should not.  The secondary bus should still work after the reset.


I was hoping to hear that!


> 
>> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:
>>
>>          Capabilities: [1c0 v1] Advanced Error Reporting
>>                  UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>>                  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>                  AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>>                          MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>                  HeaderLog: 00000000 00000001 00000002 00000003
>>                  RootCmd: CERptEn+ NFERptEn+ FERptEn+
>>                  RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>>                           FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>>                  ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008
>>
>> The above was provoked with aer-inject, using the same error as in the initial report and a bogus 'HEADER_LOG 0 1 2 3'.
>>
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO
>> Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
>>
>>
>> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?
> 
> No, it shouldn't.  I'm pretty sure we only print those values and they
> don't influence behavior at all.
> 
>>> How can I debug this further? I'm getting the "I/O to channel is
>>> blocked" (pci_channel_io_frozen) state reported to the endpoint
>>> devices I provide drivers for.  Is there any way of telling whether
>>> the PCI switch devices between 00:01.0 ... 06:00.0 have all
>>> recovered, i.e. links up and running? Is this information provided
>>> by the Linux kernel somehow?
>>>
>>> For reference, I've experimented with AER inject and my tests showed
>>> that if the same type of error is injected into any other PCI device
>>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link
>>> recovers successfully and I can continue working with the endpoint
>>> devices. In those cases the "I/O channel is in normal state"
>>> (pci_channel_io_normal) state was reported; only error injection
>>> into 00:01.0 reports the pci_channel_io_frozen state. Recovery code
>>> in the endpoint drivers I maintain simply calls
>>> pci_cleanup_aer_uncorrect_error_status() in the error handler's
>>> resume() callback.
>>>
>>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
>>> which I believe is quite old. At the moment I can not use newer
>>> kernel, but would be prepared to take that step if told that it
>>> would help.
>>
>> It's really not practical for us to help debug a v3.10-based kernel;
>> that's over seven years old and AER handling has changed significantly
>> since then.  Also, CentOS is a distro kernel and includes many changes
>> on top of the v3.10 upstream kernel.  Those changes might be
>> theoretically open-source, but as a practical matter, the distro is a
>> better place to start for support of their kernel.
>>
>> I followed your advice and started using kernel 5.9.12; hope this is recent enough.
>>
>> With that kernel I see somewhat different behavior than with 3.10
>> earlier. Configuration access to the devices on the secondary bus
>> is not OK; reads return all 0xFFFFFFFF. This is what I see after the
>> error was injected into the root port 00:01.0 and a secondary bus
>> reset was done as part of recovery:
>>
>> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
>> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Latency: 0, Cache Line Size: 64 bytes
>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
>> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Interrupt: pin A routed to IRQ 128
>> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: mrf-pci
>>
>> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED.
>> How come the 01:00.0 upstream port's PCI_COMMAND register is
>> cleared? I guess that explains why the rest of the devices in the
>> chain down to the endpoint at 0e:00.0 are rendered useless.
>>
>> Then, as part of having fun, I made the pci_bus_error_reset() *NOT*
>> do the actual reset of the secondary by just returning 0 before the
>> call to the pci_bus_reset(). In this case no bus reset is done,
>> obviously, and MMIO access to my endpoint(s) is working just fine
>> after the error was injected into the root port; as if nothing
>> happened.
>>
>> Is it expected that in cases of seeing this type of error the
>> secondary bus would be in this kind of state? It looks to me that
>> the problem is not with my endpoints but with the PCIe switch on the
>> CPU board (01:00.0 <--> 02:0?.0).
> 
> Hmm.  Yeah, I don't see anything in the path that would restore the
> 01:00.0 config space, which seems like a problem.  Probably its
> secondary/subordinate bus registers and window registers get reset to
> 0 ("lspci -vv" would show that), so config/mem/io space below it is
> not accessible.

I guess the save/restore of the switch registers would belong in the
core PCI(e) code, not in anything switch vendor/part specific.

I still have an interest in getting this to work. With my limited
knowledge of the PCI code base, is there anything I could do to help?
Is there a particular area of the code where you expect this should be
managed; pci.c, pcie/aer.c, or maybe the hotplug/ source? I can try
and tinker with it.
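One way to experiment, sketched against the v5.x core PCI API: after the secondary bus reset in the recovery path, walk the subtree below the bridge and restore each device's saved config space. This is an untested sketch; pci_walk_bus() and pci_restore_state() are existing kernel helpers, but whether and where to hook this in is exactly the open question, and pci_restore_state() only does anything for a device whose state was previously saved with pci_save_state().

```c
/* Untested sketch: restore saved config space (bridge windows, bus
 * numbers, PCI_COMMAND, ...) for every device below "bridge".
 * Assumes pci_save_state() was called on each device beforehand. */
static int restore_one(struct pci_dev *dev, void *unused)
{
	pci_restore_state(dev);
	return 0;
}

static void restore_subtree(struct pci_dev *bridge)
{
	if (bridge->subordinate)
		pci_walk_bus(bridge->subordinate, restore_one, NULL);
}
```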


Here is the 'lspci -vv' output for the 01:00.0 device after the reset:

01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 128
	Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual] [size=256K]
	Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
	I/O behind bridge: 0000f000-00000fff [disabled]
	Memory behind bridge: fff00000-000fffff [disabled]
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [68] Express (v2) Upstream Port, MSI 00
		DevCap: MaxPayload 1024 bytes, PhantFunc 0
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 75.000W
		DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl: ASPM Disabled; Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 8GT/s (ok), Width x8 (ok)
			TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: EgressBlck-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a4] Subsystem: PLX Technology, Inc. Device 8725
	Capabilities: [100 v1] Device Serial Number ca-87-00-10-b5-df-0e-00
	Capabilities: [fb4 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 1f, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [138 v1] Power Budgeting <?>
	Capabilities: [10c v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [148 v1] Virtual Channel
		Caps:	LPEVC=1 RefClk=100ns PATEntryBits=8
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64+ WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=WRR64 TC/VC=ff
			Status:	NegoPending- InProgress-
			Port Arbitration Table <?>
		VC1:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed+ WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable- ID=1 ArbSelect=Fixed TC/VC=00
			Status:	NegoPending+ InProgress-
         Capabilities: [e00 v1] Multicast
                 McastCap: MaxGroups 64, ECRCRegen+
                 McastCtl: NumGroups 1, Enable-
                 McastBAR: IndexPos 0, BaseAddr 0000000000000000
                 McastReceiveVec:      0000000000000000
                 McastBlockAllVec:     0000000000000000
                 McastBlockUntransVec: 0000000000000000
                 McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr 
0000000000000000
         Capabilities: [b00 v1] Latency Tolerance Reporting
                 Max snoop latency: 0ns
                 Max no snoop latency: 0ns
         Capabilities: [b70 v1] Vendor Specific Information: ID=0001 
Rev=0 Len=010 <?>
         Kernel driver in use: pcieport


And here is a diff of the lspci output for 01:00.0 before and after the reset:

  01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
-	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
  	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
-	Latency: 0, Cache Line Size: 64 bytes
  	Interrupt: pin A routed to IRQ 128
-	Region 0: Memory at df200000 (32-bit, non-prefetchable) [size=256K]
-	Bus: primary=01, secondary=02, subordinate=13, sec-latency=0
+	Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual] [size=256K]
+	Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
  	I/O behind bridge: 0000f000-00000fff [disabled]
-	Memory behind bridge: df000000-df1fffff [size=2M]
-	Prefetchable memory behind bridge: 0000000090000000-00000000923fffff [size=36M]
+	Memory behind bridge: fff00000-000fffff [disabled]
+	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
  	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
-	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
+	BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
  		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
  	Capabilities: [40] Power Management version 3
  		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
  		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
-	Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
-		Address: 00000000fee00338  Data: 0000
-		Masking: 000000ff  Pending: 00000000
+	Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+
+		Address: 0000000000000000  Data: 0000
+		Masking: 00000000  Pending: 00000000
  	Capabilities: [68] Express (v2) Upstream Port, MSI 00
  		DevCap:	MaxPayload 1024 bytes, PhantFunc 0
  			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 75.000W
  		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
  			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
-			MaxPayload 256 bytes, MaxReadReq 128 bytes
+			MaxPayload 128 bytes, MaxReadReq 128 bytes
  		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
  		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
  			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
@@ -376,7 +375,7 @@
  			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
  			 FRS-
  			 AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
-		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
+		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
  			 AtomicOpsCtl: EgressBlck-
  		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ Retimer- 2Retimers- DRS-
  		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
@@ -402,7 +401,7 @@
  		LaneErrStat: 0
  	Capabilities: [148 v1] Virtual Channel
  		Caps:	LPEVC=1 RefClk=100ns PATEntryBits=8
-		Arb:	Fixed+ WRR32- WRR64- WRR128-
+		Arb:	Fixed- WRR32- WRR64- WRR128-
  		Ctrl:	ArbSelect=Fixed
  		Status:	InProgress-
  		VC0:	Caps:	PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
@@ -423,20 +422,20 @@
  		McastBlockUntransVec: 0000000000000000
  		McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr 0000000000000000
  	Capabilities: [b00 v1] Latency Tolerance Reporting
-		Max snoop latency: 71680ns
-		Max no snoop latency: 71680ns
+		Max snoop latency: 0ns
+		Max no snoop latency: 0ns
  	Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?>
  	Kernel driver in use: pcieport


Looks like many fields have been zeroed out. Should a bare restoration of the previous register values be enough, or does some other magic (e.g. calling PCI functions, performing initialization tasks, etc.) need to happen to get that port back to sanity?


Thanks!
//hinko


> 
> I didn't have much confidence in our error handling to begin with, and
> it's eroding ;)
> 
> Bjorn
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-09 20:50       ` Hinko Kocevar
@ 2020-12-09 21:32         ` Bjorn Helgaas
  2020-12-09 22:55           ` Hinko Kocevar
  0 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2020-12-09 21:32 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: linux-pci

On Wed, Dec 09, 2020 at 09:50:17PM +0100, Hinko Kocevar wrote:
> On 12/9/20 6:40 PM, Bjorn Helgaas wrote:

> > Hmm.  Yeah, I don't see anything in the path that would restore the
> > 01:00.0 config space, which seems like a problem.  Probably its
> > secondary/subordinate bus registers and window registers get reset to
> > 0 ("lspci -vv" would show that) so config/mem/io space below it is
> > accessible.
> 
> I guess the save/restore of the switch registers would be in the core PCI(e)
> code; and not a switch vendor/part specific.
> 
> I still have interest in getting this to work. With my limited knowledge of
> the PCI code base, is there anything I could do to help? Is there a
> particular code base area where you expect this should be managed; pci.c or
> pcie/aer.c or maybe hotplug/ source? I can try and tinker with it..
> 
> 
> Here is the 'lspci -vv' of the 01:00.0 device at the time after the reset:
> 
> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00
> [Normal decode])
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 128
>         Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual]
> [size=256K]
>         Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
>         I/O behind bridge: 0000f000-00000fff [disabled]
>         Memory behind bridge: fff00000-000fffff [disabled]
>         Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
> [disabled]

Yep, this explains why nothing below the bridge is accessible.
Secondary/subordinate=0 means no config transactions will be
forwarded, and the windows are all disabled, so no MMIO transactions
will be forwarded either.

> Looks like many fields have been zeroed out. Should a bare
> restoration of the previous registers values be enough or does some
> other magic (i.e.  calling PCI functions/performing initialization
> tasks/etc) needs to happen to get that port back to sanity?

Good question.  Documentation/PCI/pci-error-recovery.rst has some
details about the design, including this:

  It is important for the platform to restore the PCI config space to
  the "fresh poweron" state, rather than the "last state". After a
  slot reset, the device driver will almost always use its standard
  device initialization routines, and an unusual config space setup
  may result in hung devices, kernel panics, or silent data
  corruption.

I'm not sure exactly how to interpret that.  It seems like "restoring
to 'fresh poweron' state" is basically just doing a reset, which gives
results like your 01:00.0 device above.  IIUC, we call a driver's
.slot_reset() callback after a reset, and it's supposed to be able to
initialize the device from power-on state.

01:00.0 is a Switch Upstream Port, so portdrv should claim it, and it
looks like pcie_portdrv_slot_reset() should be restoring some state.
But maybe dev->state_saved is false so pci_restore_state() bails out?
Maybe add some printks in that path to see where it's broken?
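For reference, the bail-out I have in mind works roughly like this (a stand-alone user-space model, not the actual drivers/pci/pci.c code; the struct and helper names here are simplified stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct pci_dev; the real pci_restore_state()
 * returns immediately unless a prior pci_save_state() set state_saved. */
struct pci_dev_model {
	bool state_saved;
	unsigned int cfg[16];       /* "live" config space */
	unsigned int saved_cfg[16]; /* snapshot taken by save_state() */
};

static void save_state(struct pci_dev_model *dev)
{
	for (int i = 0; i < 16; i++)
		dev->saved_cfg[i] = dev->cfg[i];
	dev->state_saved = true;
}

/* Returns 1 if state was written back, 0 if it bailed out. */
static int restore_state(struct pci_dev_model *dev)
{
	if (!dev->state_saved)
		return 0;               /* the suspected early return */
	for (int i = 0; i < 16; i++)
		dev->cfg[i] = dev->saved_cfg[i];
	dev->state_saved = false;   /* the real helper clears the flag too */
	return 1;
}
```

Note that because the flag is cleared on restore, a restore that is not followed by a fresh save leaves nothing to restore the next time an error hits.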

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-09 21:32         ` Bjorn Helgaas
@ 2020-12-09 22:55           ` Hinko Kocevar
  2020-12-10 12:56             ` Hinko Kocevar
  2020-12-14 21:23             ` Keith Busch
  0 siblings, 2 replies; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-09 22:55 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci



On 12/9/20 10:32 PM, Bjorn Helgaas wrote:
> On Wed, Dec 09, 2020 at 09:50:17PM +0100, Hinko Kocevar wrote:
>> On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
> 
>>> Hmm.  Yeah, I don't see anything in the path that would restore the
>>> 01:00.0 config space, which seems like a problem.  Probably its
>>> secondary/subordinate bus registers and window registers get reset to
>>> 0 ("lspci -vv" would show that) so config/mem/io space below it is
>>> accessible.
>>
>> I guess the save/restore of the switch registers would be in the core PCI(e)
>> code; and not a switch vendor/part specific.
>>
>> I still have interest in getting this to work. With my limited knowledge of
>> the PCI code base, is there anything I could do to help? Is there a
>> particular code base area where you expect this should be managed; pci.c or
>> pcie/aer.c or maybe hotplug/ source? I can try and tinker with it..
>>
>>
>> Here is the 'lspci -vv' of the 01:00.0 device at the time after the reset:
>>
>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00
>> [Normal decode])
>>          Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR- FastB2B- DisINTx-
>>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>          Interrupt: pin A routed to IRQ 128
>>          Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual]
>> [size=256K]
>>          Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
>>          I/O behind bridge: 0000f000-00000fff [disabled]
>>          Memory behind bridge: fff00000-000fffff [disabled]
>>          Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>> [disabled]
> 
> Yep, this explains why nothing below the bridge is accessible.
> Secondary/subordinate=0 means no config transactions will be
> forwarded, and the windows are all disabled, so no MMIO transactions
> will be forwarded either.
> 
>> Looks like many fields have been zeroed out. Should a bare
>> restoration of the previous registers values be enough or does some
>> other magic (i.e.  calling PCI functions/performing initialization
>> tasks/etc) needs to happen to get that port back to sanity?
> 
> Good question.  Documentation/PCI/pci-error-recovery.rst has some
> details about the design, including this:
> 
>    It is important for the platform to restore the PCI config space to
>    the "fresh poweron" state, rather than the "last state". After a
>    slot reset, the device driver will almost always use its standard
>    device initialization routines, and an unusual config space setup
>    may result in hung devices, kernel panics, or silent data
>    corruption.
> 
> I'm not sure exactly how to interpret that.  It seems like "restoring
> to 'fresh poweron' state" is basically just doing a reset, which gives
> results like your 01:00.0 device above.  IIUC, we call a driver's
> .slot_reset() callback after a reset, and it's supposed to be able to
> initialize the device from power-on state.
> 
> 01:00.0 is a Switch Upstream Port, so portdrv should claim it, and it
> looks like pcie_portdrv_slot_reset() should be restoring some state.
> But maybe dev->state_saved is false so pci_restore_state() bails out?
> Maybe add some printks in that path to see where it's broken?
> 

Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success!

So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER 
and therefore the pcie_portdrv_slot_reset() is not called.

But the pcie_portdrv_err_resume() is called! Adding these two lines to 
pcie_portdrv_err_resume(), before the call to device_for_each_child():

         pci_restore_state(dev);
         pci_save_state(dev);

gets the secondary buses and the endpoint restored to the following state:

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 [Normal decode])
	Physical Slot: 1
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
0e:00.0 Signal processing controller: Xilinx Corporation Device 7011
	Subsystem: Device 1a3e:132c
	Physical Slot: 2
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-


Very nice.

I have not yet verified that the buses are actually functional, as the endpoint is still unhappy; I think its driver is simply not doing any restoration currently. Knowing what it takes for the switch ports to recover makes me think I can apply similar save/restore logic to my endpoint driver(s), too.
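As a rough user-space illustration of why the restore matters for a bridge (this is a simplified toy model, not kernel code; in the kernel the equivalent work is done by pci_save_state()/pci_restore_state() on struct pci_dev):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the bridge registers that were zeroed by the reset. */
struct bridge_cfg {
	uint8_t  primary, secondary, subordinate; /* bus number registers */
	uint32_t mem_base, mem_limit;             /* memory window */
	uint16_t command;                         /* I/O, Mem, BusMaster bits */
};

static struct bridge_cfg saved_cfg;           /* snapshot from "probe" time */

static void save_state(const struct bridge_cfg *dev)
{
	saved_cfg = *dev;
}

static void restore_state(struct bridge_cfg *dev)
{
	*dev = saved_cfg;
}

/* A fatal-error link reset returns the port to power-on defaults:
 * secondary/subordinate = 0 and the windows disabled, so nothing below
 * the bridge is reachable until the saved state is written back. */
static void simulate_fatal_reset(struct bridge_cfg *dev)
{
	memset(dev, 0, sizeof(*dev));
}
```

The save at probe time plus restore in the error path is exactly the pattern the endpoint drivers would need to mirror.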


Cheers,
//hinko

> Bjorn
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-09 22:55           ` Hinko Kocevar
@ 2020-12-10 12:56             ` Hinko Kocevar
  2020-12-14 21:23             ` Keith Busch
  1 sibling, 0 replies; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-10 12:56 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci



On 12/9/20 11:55 PM, Hinko Kocevar wrote:
> 
> 
> On 12/9/20 10:32 PM, Bjorn Helgaas wrote:
>> On Wed, Dec 09, 2020 at 09:50:17PM +0100, Hinko Kocevar wrote:
>>> On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
>>
>>>> Hmm.  Yeah, I don't see anything in the path that would restore the
>>>> 01:00.0 config space, which seems like a problem.  Probably its
>>>> secondary/subordinate bus registers and window registers get reset to
>>>> 0 ("lspci -vv" would show that) so config/mem/io space below it is
>>>> accessible.
>>>
>>> I guess the save/restore of the switch registers would be in the core 
>>> PCI(e)
>>> code; and not a switch vendor/part specific.
>>>
>>> I still have interest in getting this to work. With my limited 
>>> knowledge of
>>> the PCI code base, is there anything I could do to help? Is there a
>>> particular code base area where you expect this should be managed; 
>>> pci.c or
>>> pcie/aer.c or maybe hotplug/ source? I can try and tinker with it..
>>>
>>>
>>> Here is the 'lspci -vv' of the 01:00.0 device at the time after the 
>>> reset:
>>>
>>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) 
>>> (prog-if 00
>>> [Normal decode])
>>>          Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- 
>>> ParErr-
>>> Stepping- SERR- FastB2B- DisINTx-
>>>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>          Interrupt: pin A routed to IRQ 128
>>>          Region 0: Memory at df200000 (32-bit, non-prefetchable) 
>>> [virtual]
>>> [size=256K]
>>>          Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
>>>          I/O behind bridge: 0000f000-00000fff [disabled]
>>>          Memory behind bridge: fff00000-000fffff [disabled]
>>>          Prefetchable memory behind bridge: 
>>> 00000000fff00000-00000000000fffff
>>> [disabled]
>>
>> Yep, this explains why nothing below the bridge is accessible.
>> Secondary/subordinate=0 means no config transactions will be
>> forwarded, and the windows are all disabled, so no MMIO transactions
>> will be forwarded either.
>>
>>> Looks like many fields have been zeroed out. Should a bare
>>> restoration of the previous registers values be enough or does some
>>> other magic (i.e.  calling PCI functions/performing initialization
>>> tasks/etc) needs to happen to get that port back to sanity?
>>
>> Good question.  Documentation/PCI/pci-error-recovery.rst has some
>> details about the design, including this:
>>
>>    It is important for the platform to restore the PCI config space to
>>    the "fresh poweron" state, rather than the "last state". After a
>>    slot reset, the device driver will almost always use its standard
>>    device initialization routines, and an unusual config space setup
>>    may result in hung devices, kernel panics, or silent data
>>    corruption.
>>
>> I'm not sure exactly how to interpret that.  It seems like "restoring
>> to 'fresh poweron' state" is basically just doing a reset, which gives
>> results like your 01:00.0 device above.  IIUC, we call a driver's
>> .slot_reset() callback after a reset, and it's supposed to be able to
>> initialize the device from power-on state.
>>
>> 01:00.0 is a Switch Upstream Port, so portdrv should claim it, and it
>> looks like pcie_portdrv_slot_reset() should be restoring some state.
>> But maybe dev->state_saved is false so pci_restore_state() bails out?
>> Maybe add some printks in that path to see where it's broken?
>>
> 
> Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success!
> 
> So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER 
> and therefore the pcie_portdrv_slot_reset() is not called.
> 
> But the pcie_portdrv_err_resume() is called! Adding these two lines to 
> pcie_portdrv_err_resume(), before the call to device_for_each_child():
> 
>          pci_restore_state(dev);
>          pci_save_state(dev);
> 
> gets the secondary buses and the endpoint restored to the following state:
> 
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen 
> Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 
> 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 
> 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI 
> Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 
> [Normal decode])
>      Physical Slot: 1
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI 
> Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 
> [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011
>      Subsystem: Device 1a3e:132c
>      Physical Slot: 2
>      Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B- DisINTx-
> 
> 
> Very nice.
> 
> I have not verified if the buses are actually functional, yet, as I 
> still see the endpoint not happy; I think its driver is just not doing 
> any restoration currently. Knowing what it takes for switch ports to 
> recover makes me think I can apply similar save/restore logic to my 
> endpoint driver(s), too.

I've added the same PCI save/restore state calls to my drivers' error resume callbacks, as well as a save-state call in probe. The results from injecting the error are as expected: the recovery is successful.

Further, I'm running torture tests as we speak, with 5 AMCs as 
endpoints, supported by two different kernel drivers.

As of this moment, 7 cases of:

[ 3745.957711] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
[ 3746.149747] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ 3746.149749] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
[ 3746.149751] pcieport 0000:00:01.0:    [14] CmpltTO                (First)

were observed (not provoked by injection) and successfully recovered from. Before these changes the PCI bus would essentially remain in a non-operational state; now the bus recovers and the next test iteration succeeds without user intervention.

For the record, no other types of errors are observed on my side so far 
during the tests.

I would like to see the log header in the error reports above, as it is not visible for some reason; that is what I'll look into next. At a glance, it looks like the 'info->tlp_header_valid' check in aer_print_error() is not passing, resulting in no call to __print_tlp_header().
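The gating I mean looks roughly like this (a stand-alone sketch, not the actual drivers/pci/pcie/aer.c source; the struct here is cut down to the two fields that matter):

```c
#include <assert.h>
#include <stdio.h>

/* Cut-down stand-in for struct aer_err_info from drivers/pci/pcie/aer.c */
struct aer_err_info_model {
	unsigned int tlp_header_valid : 1;
	unsigned int tlp[4];                /* logged TLP header dwords */
};

/* Mirrors how aer_print_error() only calls __print_tlp_header() when the
 * error source logged a valid TLP header; returns 1 if printed, 0 if not. */
static int maybe_print_tlp_header(const struct aer_err_info_model *info)
{
	if (!info->tlp_header_valid)
		return 0;
	printf("  TLP Header: %08x %08x %08x %08x\n",
	       info->tlp[0], info->tlp[1], info->tlp[2], info->tlp[3]);
	return 1;
}
```

So the question is why tlp_header_valid ends up unset for these completion timeout errors.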

Thanks!
//hinko

> 
> 
> Cheers,
> //hinko
> 
>> Bjorn
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-09 22:55           ` Hinko Kocevar
  2020-12-10 12:56             ` Hinko Kocevar
@ 2020-12-14 21:23             ` Keith Busch
  2020-12-15 12:56               ` Hinko Kocevar
  1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2020-12-14 21:23 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: Bjorn Helgaas, linux-pci

On Wed, Dec 09, 2020 at 11:55:07PM +0100, Hinko Kocevar wrote:
> Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success!
> 
> So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER and
> therefore the pcie_portdrv_slot_reset() is not called.
> 
> But the pcie_portdrv_err_resume() is called! Adding these two lines to
> pcie_portdrv_err_resume(), before the call to device_for_each_child():
> 
>         pci_restore_state(dev);
>         pci_save_state(dev);

Do you need to do that with a current kernel, or are you still using 3.10? A recent kernel shouldn't need such a fix after the following commit was introduced:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=874b3251113a1e2cbe79c24994dc03fe4fe4b99b

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-14 21:23             ` Keith Busch
@ 2020-12-15 12:56               ` Hinko Kocevar
  2020-12-15 18:56                 ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-15 12:56 UTC (permalink / raw)
  To: Keith Busch; +Cc: Bjorn Helgaas, linux-pci

Hi Keith,

On 12/14/20 10:23 PM, Keith Busch wrote:
> On Wed, Dec 09, 2020 at 11:55:07PM +0100, Hinko Kocevar wrote:
>> Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success!
>>
>> So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER and
>> therefore the pcie_portdrv_slot_reset() is not called.
>>
>> But the pcie_portdrv_err_resume() is called! Adding these two lines to
>> pcie_portdrv_err_resume(), before the call to device_for_each_child():
>>
>>          pci_restore_state(dev);
>>          pci_save_state(dev);
> 
> You need to do that with the current kernel or are you still using a
> 3.10? A more recent kernel shouldn't have needed such a fix after the


This was tested on the 5.9.12 kernel at that time. As of today, I've re-run the tests on Bjorn's git tree, pci/err branch from Sunday (I guess version 5.10.0).

> following commit was introduced:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=874b3251113a1e2cbe79c24994dc03fe4fe4b99b
> 

I noticed the change you are pointing out when trying to propose a patch.

It made me curious on why the pcie_portdrv_slot_reset() is not invoked.

After sprinkling a couple of printk()'s around pcie_do_recovery() and the pcie_portdrv_err_handler callbacks, I can observe that pcie_portdrv_slot_reset() is never called from pcie_do_recovery(), because the status returned by reset_subordinates() (actually aer_root_reset() from pcie/aer.c) is PCI_ERS_RESULT_RECOVERED.

I reckon, in order to invoke the pcie_portdrv_slot_reset(), the 
aer_root_reset() should have returned PCI_ERS_RESULT_NEED_RESET.

As soon as I plug the calls to pci_restore_state() and pci_save_state() 
into the pcie_portdrv_err_resume() the bus and devices are operational 
again.
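In other words, the decision boils down to something like this (a stand-alone sketch of one branch of pcie_do_recovery() from drivers/pci/pcie/err.c; the enum values match include/linux/pci.h, but the helper itself is my simplification):

```c
#include <assert.h>

/* Values mirror enum pci_ers_result in include/linux/pci.h */
enum pci_ers_result_model {
	PCI_ERS_RESULT_NONE = 1,
	PCI_ERS_RESULT_CAN_RECOVER = 2,
	PCI_ERS_RESULT_NEED_RESET = 3,
	PCI_ERS_RESULT_DISCONNECT = 4,
	PCI_ERS_RESULT_RECOVERED = 5,
};

/* pcie_do_recovery() broadcasts the .slot_reset callbacks only when the
 * status at that point is NEED_RESET; when aer_root_reset() reports
 * RECOVERED, the broadcast is skipped and recovery jumps to .resume. */
static int slot_reset_will_be_called(enum pci_ers_result_model reset_status)
{
	return reset_status == PCI_ERS_RESULT_NEED_RESET;
}
```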

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-15 12:56               ` Hinko Kocevar
@ 2020-12-15 18:56                 ` Keith Busch
  0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2020-12-15 18:56 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: Bjorn Helgaas, linux-pci

On Tue, Dec 15, 2020 at 01:56:21PM +0100, Hinko Kocevar wrote:
> I noticed the change you are pointing out when trying to propose a patch.
> 
> It made me curious on why the pcie_portdrv_slot_reset() is not invoked.
> 
> After sprinkling a couple of printk()'s around the pcie_do_recovery() and
> pcie_portdrv_err_handler's I can observe that the pcie_portdrv_slot_reset()
> is never called from pcie_do_recovery() due to status returned by
> reset_subordinates() (actually aer_root_reset() from pcie/aer.c) being
> PCI_ERS_RESULT_RECOVERED.
> 
> I reckon, in order to invoke the pcie_portdrv_slot_reset(), the
> aer_root_reset() should have returned PCI_ERS_RESULT_NEED_RESET.
> 
> As soon as I plug the calls to pci_restore_state() and pci_save_state() into
> the pcie_portdrv_err_resume() the bus and devices are operational again.

It appears some sequences in the AER handling are out of sync with at least a few drivers' expectations. The .slot_reset callback is for when the 'PCI slot has been reset', according to the code comments, so we should be calling it after the reset to notify drivers of downstream devices that the event occurred. I'll mess with it a bit and see if we can get everyone aligned.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-12-15 18:57 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-04 12:52 Recovering from AER: Uncorrected (Fatal) error Hinko Kocevar
2020-12-04 22:38 ` Bjorn Helgaas
2020-12-09 10:02   ` Hinko Kocevar
2020-12-09 17:40     ` Bjorn Helgaas
2020-12-09 20:31       ` Hinko Kocevar
2020-12-09 20:50       ` Hinko Kocevar
2020-12-09 21:32         ` Bjorn Helgaas
2020-12-09 22:55           ` Hinko Kocevar
2020-12-10 12:56             ` Hinko Kocevar
2020-12-14 21:23             ` Keith Busch
2020-12-15 12:56               ` Hinko Kocevar
2020-12-15 18:56                 ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).