* Recovering from AER: Uncorrected (Fatal) error
@ 2020-12-04 12:52 Hinko Kocevar
  2020-12-04 22:38 ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-04 12:52 UTC (permalink / raw)
To: linux-pci

Hi,

I'm trying to figure out how to recover from an Uncorrected (Fatal) error that is seen by the PCI root port on a CPU PCIe controller:

Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First)

This is the complete PCI device tree that I'm working with:

$ sudo /usr/local/bin/pcicrawler -t
00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
│ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
│ │ └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
│ │   ├─04:00.0 downstream_port, slot 10, power: Off
│ │   ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
│ │   │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
│ │   ├─04:02.0 downstream_port, slot 9, power: Off
│ │   ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
│ │   │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
│ │   ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
│ │   │ └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
│ │   ├─04:09.0 downstream_port, slot 11, power: Off
│ │   ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
│ │   │ └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
│ │   ├─04:0b.0 downstream_port, slot 12, power: Off
│ │   ├─04:10.0 downstream_port, slot 8, power: Off
│ │   ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
│ │   │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
│ │   └─04:12.0 downstream_port, slot 7, power: Off
│ ├─02:02.0 downstream_port, slot 2
│ ├─02:08.0 downstream_port, slot 8
│ ├─02:09.0 downstream_port, slot 9, power: Off
│ └─02:0a.0 downstream_port, slot 10
├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
└─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0

00:01.0 is on a CPU board. 03:00.0 and everything below it is not on the CPU board (I'm working with a MicroTCA system here), and the endpoints are FPGA-based devices. After the error, none of the endpoints in the above list can talk to the CPU anymore; register reads return 0xFFFFFFFF. At the same time, PCI config space looks sane and accessible for those devices.

How can I debug this further? I'm getting the "I/O to channel is blocked" (pci_channel_io_frozen) state reported to the endpoint devices I provide a driver for. Is there any way of telling whether the PCI switch devices between 00:01.0 ... 06:00.0 have all recovered, i.e. links up and running and similar? Is this information provided by the Linux kernel somehow?

For reference, I've experimented with AER injection, and my tests showed that if the same type of error is injected into any other PCI device in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is recovered successfully and I can continue working with the endpoint devices. In those cases the "I/O channel is in normal state" (pci_channel_io_normal) state was reported; only error injection into 00:01.0 reports the pci_channel_io_frozen state. Recovery code in the endpoint drivers I maintain just calls pci_cleanup_aer_uncorrect_error_status() in the error handler's resume() callback.

FYI, this is on the 3.10.0-1160.6.1.el7.x86_64.debug CentOS 7 kernel, which I believe is quite old. At the moment I cannot use a newer kernel, but I would be prepared to take that step if told that it would help.

Thank you in advance,
Hinko

^ permalink raw reply	[flat|nested] 12+ messages in thread
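The "links up and running" question above can be answered to a first approximation from userspace with pciutils: each port's "lspci -vv" output carries an "LnkSta:" line with the negotiated speed and width. Below is a minimal sketch, assuming lspci from pciutils is available; the parsing helper is demonstrated on a canned sample line, and the device addresses in the comment are taken from the tree above.

```shell
#!/bin/sh
# Extract negotiated speed and width from an "LnkSta:" line as printed
# by "lspci -vv" (pciutils).
parse_lnksta() {
    printf '%s\n' "$1" | sed -n 's/.*Speed \([^,]*\), Width \(x[0-9]*\).*/\1 \2/p'
}

# On a live system, each port in the 00:01.0 ... 06:00.0 path could be
# checked with something like:
#   for dev in 00:01.0 01:00.0 02:01.0 03:00.0 04:01.0; do
#       lspci -s "$dev" -vv | grep -m1 'LnkSta:'
#   done
# Here the helper is demonstrated on a sample line:
parse_lnksta "LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+"
```

Comparing each port's LnkSta against its LnkCap would additionally show whether a link came back retrained at a degraded speed or width.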
* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-04 12:52 Recovering from AER: Uncorrected (Fatal) error Hinko Kocevar
@ 2020-12-04 22:38 ` Bjorn Helgaas
  2020-12-09 10:02   ` Hinko Kocevar
  0 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2020-12-04 22:38 UTC (permalink / raw)
To: Hinko Kocevar; +Cc: linux-pci

On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
> Hi,
>
> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First)
>
> [pcicrawler device tree from the original report snipped]
>
> 00:01.0 is on a CPU board. 03:00.0 and everything below it is not
> on the CPU board (working with a MicroTCA system here). I'm
> working with FPGA-based devices seen as endpoints here. After the
> error all the endpoints in the above list are not able to talk to
> the CPU anymore; register reads return 0xFFFFFFFF. At the same time
> PCI config space looks sane and accessible for those devices.

This could be caused by a reset.  We probably do a secondary bus reset
on the Root Port, which resets all the devices below it.  After the
reset, config space of those downstream devices would be accessible,
but the PCI_COMMAND register may be cleared, which means the devices
wouldn't respond to MMIO reads.

None of this explains the original problem of the Completion Timeout,
of course.  The error source of 0x8 (00:01.0) is the root port, which
makes sense if it issued a request transaction and never got the
completion.  The root port *should* log the header of the request and
we should print it, but it looks like we didn't.  "lspci -vv" of the
device would show whether it's capable of this logging.

If you're talking to an FPGA, the most likely explanation is a bug in
the FPGA programming where it doesn't respond when it should.  A PCIe
analyzer would show exactly what the problem is, but those are
expensive and rare.  But if you're dealing with FPGAs, maybe you have
access to one.

> How can I debug this further? I'm getting the "I/O to channel is
> blocked" (pci_channel_io_frozen) state reported to the endpoint
> devices I provide a driver for. Is there any way of telling whether
> the PCI switch devices between 00:01.0 ... 06:00.0 have all
> recovered; links up and running and similar? Is this information
> provided by the Linux kernel somehow?
>
> For reference, I've experimented with AER injection, and my tests
> showed that if the same type of error is injected into any other PCI
> device in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the
> link is recovered successfully and I can continue working with the
> endpoint devices. [...] Only error injection into 00:01.0 reports
> the pci_channel_io_frozen state. Recovery code in the endpoint
> drivers I maintain just calls
> pci_cleanup_aer_uncorrect_error_status() in the error handler's
> resume() callback.
>
> FYI, this is on the 3.10.0-1160.6.1.el7.x86_64.debug CentOS 7 kernel
> which I believe is quite old. At the moment I cannot use a newer
> kernel, but would be prepared to take that step if told that it
> would help.

It's really not practical for us to help debug a v3.10-based kernel;
that's over seven years old and AER handling has changed significantly
since then.  Also, CentOS is a distro kernel and includes many changes
on top of the v3.10 upstream kernel.  Those changes might be
theoretically open-source, but as a practical matter, the distro is a
better place to start for support of their kernel.

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread
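The "PCI_COMMAND may be cleared" explanation above is cheap to check from userspace: the COMMAND register sits at config offset 0x04 and bit 1 is Memory Space Enable. A minimal sketch follows; on a live system the value would come from something like `setpci -s 06:00.0 COMMAND` (pciutils), while the decode helper below is demonstrated on canned values (0x0546 is a typical value for a device with a bound driver, not one read from this system).

```shell
#!/bin/sh
# Decode the Memory Space Enable bit (bit 1) of a PCI COMMAND register
# value, given as the 16-bit hex string that setpci prints.
mem_space_enabled() {
    if [ $(( 0x$1 & 0x2 )) -ne 0 ]; then
        echo "MSE set: device decodes MMIO"
    else
        echo "MSE clear: MMIO reads return 0xFFFFFFFF"
    fi
}

mem_space_enabled 0546   # typical value with a driver bound
mem_space_enabled 0000   # what a bus reset leaves behind
```

If the endpoints show MSE clear after recovery while config space still responds, that matches the reset theory rather than a dead link.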
* Re: Recovering from AER: Uncorrected (Fatal) error
  2020-12-04 22:38 ` Bjorn Helgaas
@ 2020-12-09 10:02   ` Hinko Kocevar
  2020-12-09 17:40     ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Hinko Kocevar @ 2020-12-09 10:02 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: linux-pci

________________________________________
From: Bjorn Helgaas <helgaas@kernel.org>
Sent: Friday, December 4, 2020 11:38 PM
To: Hinko Kocevar
Cc: linux-pci@vger.kernel.org
Subject: Re: Recovering from AER: Uncorrected (Fatal) error

On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
> Hi,
>
> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> [remaining log lines and the pcicrawler device tree from the original report snipped]
>
> [...] After the error all the endpoints in the above list are not
> able to talk to the CPU anymore; register reads return 0xFFFFFFFF.
> At the same time PCI config space looks sane and accessible for
> those devices.

This could be caused by a reset.  We probably do a secondary bus reset
on the Root Port, which resets all the devices below it.  After the
reset, config space of those downstream devices would be accessible,
but the PCI_COMMAND register may be cleared which means the device
wouldn't respond to MMIO reads.

After adding some printk()'s to the PCIe code that deals with the reset, I can see that pci_bus_error_reset() calls pci_bus_reset() because the check for an empty bus slot list returns true. So the secondary bus is being reset.

None of this explains the original problem of the Completion Timeout,
of course.  The error source of 0x8 (00:01.0) is the root port, which
makes sense if it issued a request transaction and never got the
completion.  The root port *should* log the header of the request and
we should print it, but it looks like we didn't.  "lspci -vv" of the
device would show whether it's capable of this logging.

I'm mostly concerned about the fact that the secondary bus is rendered useless after the reset/recovery. AIUI, this should not happen.

After upgrading my pciutils I get a bit more output from lspci -vv for the root port. It seems it is capable of logging the errors:

Capabilities: [1c0 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000001 00000002 00000003
        RootCmd: CERptEn+ NFERptEn+ FERptEn+
        RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                 FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
        ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008

The above was provoked with aer-inject of the same error as seen in the initial report, with a bogus 'HEADER_LOG 0 1 2 3'.

Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO
Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2

Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?

If you're talking to an FPGA, the most likely explanation is a bug in
the FPGA programming where it doesn't respond when it should.

The thing is that this error does not happen that often, and most of the time the card using the FPGA is just fine. TBH, I am doing some stress testing of card removal through the MCH (exercising hotplug), and this is when I noticed the initial problem.

A PCIe
analyzer would show exactly what the problem is, but those are
expensive and rare.  But if you're dealing with FPGAs, maybe you have
access to one.

Sadly, no, I do not have access to an analyzer.

> How can I debug this further? [...]
>
> FYI, this is on the 3.10.0-1160.6.1.el7.x86_64.debug CentOS 7 kernel
> which I believe is quite old. [...]

It's really not practical for us to help debug a v3.10-based kernel;
that's over seven years old and AER handling has changed significantly
since then.  [...] The distro is a better place to start for support
of their kernel.

I followed your advice and started using kernel 5.9.12; hope this is recent enough.

With that kernel I see a bit different behavior than with the 3.10 earlier. Configuration access to the devices on the secondary bus is not OK; all 0xFFFFFFFF. This is what I see after the error was injected into the root port 00:01.0 and a secondary bus reset was made as part of recovery:

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 128
02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: pcieport

03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: pcieport

04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: pcieport

0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: mrf-pci

The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED. How come the 01:00.0 upstream port's PCI_COMMAND register is cleared? I guess that explains why the rest of the devices in the chain, down to the endpoint at 0e:00.0, are rendered useless.

Then, as part of having fun, I made pci_bus_error_reset() *NOT* do the actual reset of the secondary bus, by just returning 0 before the call to pci_bus_reset(). In this case no bus reset is done, obviously, and MMIO access to my endpoint(s) works just fine after the error is injected into the root port; as if nothing happened.

Is it expected that, in cases of seeing this type of error, the secondary bus would be in this kind of state? It looks to me that the problem is not with my endpoints but with the PCIe switch on the CPU board (01:00.0 <--> 02:0?.0).

Thanks!
//hinko

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread
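The injection described above is driven by a small text file fed to the aer-inject tool. A sketch of such a file for the 0x4000 (Completion Timeout) status injected into 0000:00:01.0 follows; the exact keywords here are written from memory of the tool's grammar and should be checked against the example files shipped with aer-inject:

```
AER
DOMAIN 0000
BUS 0
DEV 1
FN 0
UNCOR_STATUS COMP_TIME
HEADER_LOG 0 1 2 3
```

The HEADER_LOG words end up in the root port's header log registers, which is why the bogus 0/1/2/3 values show up verbatim in the lspci HeaderLog line above.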
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-09 10:02 ` Hinko Kocevar @ 2020-12-09 17:40 ` Bjorn Helgaas 2020-12-09 20:31 ` Hinko Kocevar 2020-12-09 20:50 ` Hinko Kocevar 0 siblings, 2 replies; 12+ messages in thread From: Bjorn Helgaas @ 2020-12-09 17:40 UTC (permalink / raw) To: Hinko Kocevar; +Cc: linux-pci [non-standard email quoting below, ">" for both my text and Hinko's, see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style] On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote: > ________________________________________ > From: Bjorn Helgaas <helgaas@kernel.org> > Sent: Friday, December 4, 2020 11:38 PM > To: Hinko Kocevar > Cc: linux-pci@vger.kernel.org > Subject: Re: Recovering from AER: Uncorrected (Fatal) error > > On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote: > > Hi, > > > > I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller: > > > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008 > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID) > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 > > Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First) > > > > This is the complete PCI device tree that I'm working with: > > > > $ sudo /usr/local/bin/pcicrawler -t > > 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8 > > ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725 > > │ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4 > > │ │ └─03:00.0 upstream_port, PLX Technology, Inc. 
(10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748) > > │ │ ├─04:00.0 downstream_port, slot 10, power: Off > > │ │ ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:02.0 downstream_port, slot 9, power: Off > > │ │ ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:09.0 downstream_port, slot 11, power: Off > > │ │ ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4 > > │ │ │ └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024 > > │ │ ├─04:0b.0 downstream_port, slot 12, power: Off > > │ │ ├─04:10.0 downstream_port, slot 8, power: Off > > │ │ ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1 > > │ │ │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011 > > │ │ └─04:12.0 downstream_port, slot 7, power: Off > > │ ├─02:02.0 downstream_port, slot 2 > > │ ├─02:08.0 downstream_port, slot 8 > > │ ├─02:09.0 downstream_port, slot 9, power: Off > > │ └─02:0a.0 downstream_port, slot 10 > > ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0 > > > > 00:01.0 is on a CPU board, The 03:00.0 and everything below that is > > not on a CPU board (working with a micro TCA system here). I'm > > working with FPGA based devices seen as endpoints here. 
After the > > error all the endpoints in the above list are not able to talk to > > CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI > > config space looks sane and accessible for those devices. > > This could be caused by a reset. We probably do a secondary bus reset > on the Root Port, which resets all the devices below it. After the > reset, config space of those downstream devices would be accessible, > but the PCI_COMMAND register may be cleared which means the device > wouldn't respond to MMIO reads. > > After adding some printk()'s to the PCIe code that deals with the reset, I can see that the pci_bus_error_reset() calls the pci_bus_reset() due to bus slots list check for empty returns true. So the the secondary bus is being reset. > > None of this explains the original problem of the Completion Timeout, > of course. The error source of 0x8 (00:01.0) is the root port, which > makes sense if it issued a request transaction and never got the > completion. The root port *should* log the header of the request and > we should print it, but it looks like we didn't. "lspci -vv" of the > device would show whether it's capable of this logging. > > I'm mostly concerned about the fact that the secondary bus is > rendered useless after the reset/recovery. AIUI, this should not > happen. It should not. The secondary bus should still work after the reset. > After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. 
It seems it is capable of logging the errors: > > Capabilities: [1c0 v1] Advanced Error Reporting > UESta: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- > CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ > AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- > MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- > HeaderLog: 00000000 00000001 00000002 00000003 > RootCmd: CERptEn+ NFERptEn+ FERptEn+ > RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- > FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0 > ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008 > > The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'. > > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0 > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0 > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID) > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 > Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] CmpltTO > Dec 9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2 > > > Question: would using different (real?) HEADER_LOG values affect the recovery behaviour? No, it shouldn't. I'm pretty sure we only print those values and they don't influence behavior at all. > > How can I debug this further? 
I'm getting the "I/O to channel is > > blocked" (pci_channel_io_frozen) state reported to the the endpoint > > devices I provide driver for. Is there any way of telling if the > > PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ; > > links up and running and similar? Is this information provided by > > the Linux kernel somehow? > > > > For reference, I've experimented with AER inject and my tests showed > > that if the same type of error is injected in any other PCI device > > in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is > > recovered successfully, and I can continue working with the endpoint > > devices. In those cases the "I/O channel is in normal state" > > (pci_channel_io_normal) state was reported; only error injection > > into 00:01.0 reports pci_channel_io_frozen state. Recovery code in > > the endpoint drivers I maintain is just calling the > > pci_cleanup_aer_uncorrect_error_status() in error handler resume() > > callback. > > > > FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel > > which I believe is quite old. At the moment I can not use newer > > kernel, but would be prepared to take that step if told that it > > would help. > > It's really not practical for us to help debug a v3.10-based kernel; > that's over seven years old and AER handling has changed significantly > since then. Also, CentOS is a distro kernel and includes many changes > on top of the v3.10 upstream kernel. Those changes might be > theoretically open-source, but as a practical matter, the distro is a > better place to start for support of their kernel. > > I followed your advice and started using kernel 5.9.12; hope this is recent enough. > > With that kernel I see a bit of different behavior than with the > 3.10 earlier. Configuration access to the devices on the secondary > bus is not OK; all 0xFFFFFFFF. 
This is what I see after the error > was injected into the root port 00:01.0 and secondary bus reset was > made as part of recovery: > > 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Interrupt: pin A routed to IRQ 128 > 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: pcieport > > 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: pcieport > > 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: pcieport > > 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff) > !!! Unknown header type 7f > Kernel driver in use: mrf-pci > > The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED. > How come that the 01:00.0 upstream port PCI_COMMAND register is > cleared? I guess that explains why the rest of the devices in the > chain down to the endpoint at 0e:00.0 are rendered useless. 
> > Then, as part of having fun, I made the pci_bus_error_reset() *NOT* > do the actual reset of the secondary bus by just returning 0 before the > call to the pci_bus_reset(). In this case no bus reset is done, > obviously, and MMIO access to my endpoint(s) is working just fine > after the error was injected into the root port; as if nothing > happened. > > Is it expected that in cases of seeing this type of error the > secondary bus would be in this kind of state? It looks to me that > the problem is not with my endpoints but with the PCIe switch on the > CPU board (01:00.0 <--> 02:0?.0). Hmm. Yeah, I don't see anything in the path that would restore the 01:00.0 config space, which seems like a problem. Probably its secondary/subordinate bus registers and window registers get reset to 0 ("lspci -vv" would show that) so config/mem/io space below it is not accessible. I didn't have much confidence in our error handling to begin with, and it's eroding ;) Bjorn ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-09 17:40 ` Bjorn Helgaas @ 2020-12-09 20:31 ` Hinko Kocevar 2020-12-09 20:50 ` Hinko Kocevar 1 sibling, 0 replies; 12+ messages in thread From: Hinko Kocevar @ 2020-12-09 20:31 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: linux-pci Hi Bjorn! On 12/9/20 6:40 PM, Bjorn Helgaas wrote: > [non-standard email quoting below, ">" for both my text and Hinko's, > see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style] > Apologies. Probably due to the web MUA I used. Hopefully this will be better now. > On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote: >> ________________________________________ >> From: Bjorn Helgaas <helgaas@kernel.org> >> Sent: Friday, December 4, 2020 11:38 PM >> To: Hinko Kocevar >> Cc: linux-pci@vger.kernel.org >> Subject: Re: Recovering from AER: Uncorrected (Fatal) error >> >> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote: >>> Hi, >>> >>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller: >>> >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008 >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID) >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First) >>> >>> This is the complete PCI device tree that I'm working with: >>> >>> $ sudo /usr/local/bin/pcicrawler -t >>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8 >>> ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725 >>> │ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4 >>> │ │ └─03:00.0 upstream_port, PLX Technology, Inc. 
(10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748) >>> │ │ ├─04:00.0 downstream_port, slot 10, power: Off >>> │ │ ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:02.0 downstream_port, slot 9, power: Off >>> │ │ ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:09.0 downstream_port, slot 11, power: Off >>> │ │ ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:0b.0 downstream_port, slot 12, power: Off >>> │ │ ├─04:10.0 downstream_port, slot 8, power: Off >>> │ │ ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1 >>> │ │ │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011 >>> │ │ └─04:12.0 downstream_port, slot 7, power: Off >>> │ ├─02:02.0 downstream_port, slot 2 >>> │ ├─02:08.0 downstream_port, slot 8 >>> │ ├─02:09.0 downstream_port, slot 9, power: Off >>> │ └─02:0a.0 downstream_port, slot 10 >>> ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> >>> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is >>> not on a CPU board (working with a micro TCA system here). I'm >>> working with FPGA based devices seen as endpoints here. 
After the >>> error all the endpoints in the above list are not able to talk to >>> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI >>> config space looks sane and accessible for those devices. >> >> This could be caused by a reset. We probably do a secondary bus reset >> on the Root Port, which resets all the devices below it. After the >> reset, config space of those downstream devices would be accessible, >> but the PCI_COMMAND register may be cleared which means the device >> wouldn't respond to MMIO reads. >> >> After adding some printk()'s to the PCIe code that deals with the reset, I can see that pci_bus_error_reset() calls pci_bus_reset() because the check for an empty bus slot list returns true. So the secondary bus is being reset. >> >> None of this explains the original problem of the Completion Timeout, >> of course. The error source of 0x8 (00:01.0) is the root port, which >> makes sense if it issued a request transaction and never got the >> completion. The root port *should* log the header of the request and >> we should print it, but it looks like we didn't. "lspci -vv" of the >> device would show whether it's capable of this logging. >> >> I'm mostly concerned about the fact that the secondary bus is >> rendered useless after the reset/recovery. AIUI, this should not >> happen. > > It should not. The secondary bus should still work after the reset. I was hoping to hear that! > >> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. 
It seems it is capable of logging the errors: >> >> Capabilities: [1c0 v1] Advanced Error Reporting >> UESta: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- >> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- >> UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- >> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- >> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ >> AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- >> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- >> HeaderLog: 00000000 00000001 00000002 00000003 >> RootCmd: CERptEn+ NFERptEn+ FERptEn+ >> RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- >> FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0 >> ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008 >> >> The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'. >> >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0 >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0 >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID) >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] CmpltTO >> Dec 9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2 >> >> >> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour? > > No, it shouldn't. I'm pretty sure we only print those values and they > don't influence behavior at all. > >>> How can I debug this further? 
I'm getting the "I/O to channel is >>> blocked" (pci_channel_io_frozen) state reported to the the endpoint >>> devices I provide driver for. Is there any way of telling if the >>> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ; >>> links up and running and similar? Is this information provided by >>> the Linux kernel somehow? >>> >>> For reference, I've experimented with AER inject and my tests showed >>> that if the same type of error is injected in any other PCI device >>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is >>> recovered successfully, and I can continue working with the endpoint >>> devices. In those cases the "I/O channel is in normal state" >>> (pci_channel_io_normal) state was reported; only error injection >>> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in >>> the endpoint drivers I maintain is just calling the >>> pci_cleanup_aer_uncorrect_error_status() in error handler resume() >>> callback. >>> >>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel >>> which I believe is quite old. At the moment I can not use newer >>> kernel, but would be prepared to take that step if told that it >>> would help. >> >> It's really not practical for us to help debug a v3.10-based kernel; >> that's over seven years old and AER handling has changed significantly >> since then. Also, CentOS is a distro kernel and includes many changes >> on top of the v3.10 upstream kernel. Those changes might be >> theoretically open-source, but as a practical matter, the distro is a >> better place to start for support of their kernel. >> >> I followed your advice and started using kernel 5.9.12; hope this is recent enough. >> >> With that kernel I see a bit of different behavior than with the >> 3.10 earlier. Configuration access to the devices on the secondary >> bus is not OK; all 0xFFFFFFFF. 
This is what I see after the error >> was injected into the root port 00:01.0 and secondary bus reset was >> made as part of recovery: >> >> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode]) >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- >> Latency: 0, Cache Line Size: 64 bytes >> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) >> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- >> Interrupt: pin A routed to IRQ 128 >> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: pcieport >> >> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: pcieport >> >> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: pcieport >> >> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: mrf-pci >> >> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED. >> How come that the 01:00.0 upstream port PCI_COMMAND register is >> cleared? I guess that explains why the rest of the devices in the >> chain down to the endpoint at 0e:00.0 are rendered useless. 
>> >> Then, as part of having fun, I made the pci_bus_error_reset() *NOT* >> do the actual reset of the secondary bus by just returning 0 before the >> call to the pci_bus_reset(). In this case no bus reset is done, >> obviously, and MMIO access to my endpoint(s) is working just fine >> after the error was injected into the root port; as if nothing >> happened. >> >> Is it expected that in cases of seeing this type of error the >> secondary bus would be in this kind of state? It looks to me that >> the problem is not with my endpoints but with the PCIe switch on the >> CPU board (01:00.0 <--> 02:0?.0). > > Hmm. Yeah, I don't see anything in the path that would restore the > 01:00.0 config space, which seems like a problem. Probably its > secondary/subordinate bus registers and window registers get reset to > 0 ("lspci -vv" would show that) so config/mem/io space below it is > not accessible. > > I didn't have much confidence in our error handling to begin with, and > it's eroding ;) > > Bjorn > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-09 17:40 ` Bjorn Helgaas 2020-12-09 20:31 ` Hinko Kocevar @ 2020-12-09 20:50 ` Hinko Kocevar 2020-12-09 21:32 ` Bjorn Helgaas 1 sibling, 1 reply; 12+ messages in thread From: Hinko Kocevar @ 2020-12-09 20:50 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: linux-pci Hi Bjorn! On 12/9/20 6:40 PM, Bjorn Helgaas wrote: > [non-standard email quoting below, ">" for both my text and Hinko's, > see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style] > Apologies. Probably due to the web MUA I used. Hopefully this will be better now. > On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote: >> ________________________________________ >> From: Bjorn Helgaas <helgaas@kernel.org> >> Sent: Friday, December 4, 2020 11:38 PM >> To: Hinko Kocevar >> Cc: linux-pci@vger.kernel.org >> Subject: Re: Recovering from AER: Uncorrected (Fatal) error >> >> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote: >>> Hi, >>> >>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller: >>> >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008 >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID) >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 >>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First) >>> >>> This is the complete PCI device tree that I'm working with: >>> >>> $ sudo /usr/local/bin/pcicrawler -t >>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8 >>> ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725 >>> │ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4 >>> │ │ └─03:00.0 upstream_port, PLX Technology, Inc. 
(10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748) >>> │ │ ├─04:00.0 downstream_port, slot 10, power: Off >>> │ │ ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:02.0 downstream_port, slot 9, power: Off >>> │ │ ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:09.0 downstream_port, slot 11, power: Off >>> │ │ ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4 >>> │ │ │ └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024 >>> │ │ ├─04:0b.0 downstream_port, slot 12, power: Off >>> │ │ ├─04:10.0 downstream_port, slot 8, power: Off >>> │ │ ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1 >>> │ │ │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011 >>> │ │ └─04:12.0 downstream_port, slot 7, power: Off >>> │ ├─02:02.0 downstream_port, slot 2 >>> │ ├─02:08.0 downstream_port, slot 8 >>> │ ├─02:09.0 downstream_port, slot 9, power: Off >>> │ └─02:0a.0 downstream_port, slot 10 >>> ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0 >>> >>> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is >>> not on a CPU board (working with a micro TCA system here). I'm >>> working with FPGA based devices seen as endpoints here. 
After the >>> error all the endpoints in the above list are not able to talk to >>> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI >>> config space looks sane and accessible for those devices. >> >> This could be caused by a reset. We probably do a secondary bus reset >> on the Root Port, which resets all the devices below it. After the >> reset, config space of those downstream devices would be accessible, >> but the PCI_COMMAND register may be cleared which means the device >> wouldn't respond to MMIO reads. >> >> After adding some printk()'s to the PCIe code that deals with the reset, I can see that pci_bus_error_reset() calls pci_bus_reset() because the check for an empty bus slot list returns true. So the secondary bus is being reset. >> >> None of this explains the original problem of the Completion Timeout, >> of course. The error source of 0x8 (00:01.0) is the root port, which >> makes sense if it issued a request transaction and never got the >> completion. The root port *should* log the header of the request and >> we should print it, but it looks like we didn't. "lspci -vv" of the >> device would show whether it's capable of this logging. >> >> I'm mostly concerned about the fact that the secondary bus is >> rendered useless after the reset/recovery. AIUI, this should not >> happen. > > It should not. The secondary bus should still work after the reset. I was hoping to hear that! > >> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. 
It seems it is capable of logging the errors: >> >> Capabilities: [1c0 v1] Advanced Error Reporting >> UESta: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- >> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- >> UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- >> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- >> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ >> AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- >> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- >> HeaderLog: 00000000 00000001 00000002 00000003 >> RootCmd: CERptEn+ NFERptEn+ FERptEn+ >> RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- >> FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0 >> ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008 >> >> The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'. >> >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0 >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0 >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID) >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 >> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] CmpltTO >> Dec 9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2 >> >> >> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour? > > No, it shouldn't. I'm pretty sure we only print those values and they > don't influence behavior at all. > >>> How can I debug this further? 
I'm getting the "I/O to channel is >>> blocked" (pci_channel_io_frozen) state reported to the the endpoint >>> devices I provide driver for. Is there any way of telling if the >>> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ; >>> links up and running and similar? Is this information provided by >>> the Linux kernel somehow? >>> >>> For reference, I've experimented with AER inject and my tests showed >>> that if the same type of error is injected in any other PCI device >>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is >>> recovered successfully, and I can continue working with the endpoint >>> devices. In those cases the "I/O channel is in normal state" >>> (pci_channel_io_normal) state was reported; only error injection >>> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in >>> the endpoint drivers I maintain is just calling the >>> pci_cleanup_aer_uncorrect_error_status() in error handler resume() >>> callback. >>> >>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel >>> which I believe is quite old. At the moment I can not use newer >>> kernel, but would be prepared to take that step if told that it >>> would help. >> >> It's really not practical for us to help debug a v3.10-based kernel; >> that's over seven years old and AER handling has changed significantly >> since then. Also, CentOS is a distro kernel and includes many changes >> on top of the v3.10 upstream kernel. Those changes might be >> theoretically open-source, but as a practical matter, the distro is a >> better place to start for support of their kernel. >> >> I followed your advice and started using kernel 5.9.12; hope this is recent enough. >> >> With that kernel I see a bit of different behavior than with the >> 3.10 earlier. Configuration access to the devices on the secondary >> bus is not OK; all 0xFFFFFFFF. 
This is what I see after the error >> was injected into the root port 00:01.0 and secondary bus reset was >> made as part of recovery: >> >> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode]) >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- >> Latency: 0, Cache Line Size: 64 bytes >> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) >> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- >> Interrupt: pin A routed to IRQ 128 >> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: pcieport >> >> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: pcieport >> >> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: pcieport >> >> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: mrf-pci >> >> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED. >> How come that the 01:00.0 upstream port PCI_COMMAND register is >> cleared? I guess that explains why the rest of the devices in the >> chain down to the endpoint at 0e:00.0 are rendered useless. 
>> >> Then, as part of having fun, I made the pci_bus_error_reset() *NOT* >> do the actual reset of the secondary bus by just returning 0 before the >> call to the pci_bus_reset(). In this case no bus reset is done, >> obviously, and MMIO access to my endpoint(s) is working just fine >> after the error was injected into the root port; as if nothing >> happened. >> >> Is it expected that in cases of seeing this type of error the >> secondary bus would be in this kind of state? It looks to me that >> the problem is not with my endpoints but with the PCIe switch on the >> CPU board (01:00.0 <--> 02:0?.0). > > Hmm. Yeah, I don't see anything in the path that would restore the > 01:00.0 config space, which seems like a problem. Probably its > secondary/subordinate bus registers and window registers get reset to > 0 ("lspci -vv" would show that) so config/mem/io space below it is > not accessible. I guess the save/restore of the switch registers would be in the core PCI(e) code, not something switch vendor/part specific. I still have interest in getting this to work. With my limited knowledge of the PCI code base, is there anything I could do to help? Is there a particular code base area where you expect this should be managed; pci.c or pcie/aer.c or maybe the hotplug/ source? I can try to tinker with it. Here is the 'lspci -vv' of the 01:00.0 device at the time after the reset: 01:00.0 PCI bridge: PLX Technology, Inc. 
Device 8725 (rev ca) (prog-if 00 [Normal decode]) Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 128 Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual] [size=256K] Bus: primary=00, secondary=00, subordinate=00, sec-latency=0 I/O behind bridge: 0000f000-00000fff [disabled] Memory behind bridge: fff00000-000fffff [disabled] Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+ Address: 0000000000000000 Data: 0000 Masking: 00000000 Pending: 00000000 Capabilities: [68] Express (v2) Upstream Port, MSI 00 DevCap: MaxPayload 1024 bytes, PhantFunc 0 ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 75.000W DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s (ok), Width x8 (ok) TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+ 10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix- EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- 
FRS- AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled, AtomicOpsCtl: EgressBlck- LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ Retimer- 2Retimers- DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a4] Subsystem: PLX Technology, Inc. Device 8725 Capabilities: [100 v1] Device Serial Number ca-87-00-10-b5-df-0e-00 Capabilities: [fb4 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 1f, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [138 v1] Power Budgeting <?> Capabilities: [10c v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [148 v1] Virtual Channel Caps: LPEVC=1 RefClk=100ns PATEntryBits=8 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64+ WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=WRR64 TC/VC=ff Status: NegoPending- InProgress- Port Arbitration Table <?> VC1: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed+ WRR32- WRR64- WRR128- TWRR128- WRR256- 
Ctrl: Enable- ID=1 ArbSelect=Fixed TC/VC=00 Status: NegoPending+ InProgress- Capabilities: [e00 v1] Multicast McastCap: MaxGroups 64, ECRCRegen+ McastCtl: NumGroups 1, Enable- McastBAR: IndexPos 0, BaseAddr 0000000000000000 McastReceiveVec: 0000000000000000 McastBlockAllVec: 0000000000000000 McastBlockUntransVec: 0000000000000000 McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr 0000000000000000 Capabilities: [b00 v1] Latency Tolerance Reporting Max snoop latency: 0ns Max no snoop latency: 0ns Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?> Kernel driver in use: pcieport And here is the diff of the lspci of 01:00.0 between the before and after the reset: 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) - Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ + Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- - Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 128 - Region 0: Memory at df200000 (32-bit, non-prefetchable) [size=256K] - Bus: primary=01, secondary=02, subordinate=13, sec-latency=0 + Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual] [size=256K] + Bus: primary=00, secondary=00, subordinate=00, sec-latency=0 I/O behind bridge: 0000f000-00000fff [disabled] - Memory behind bridge: df000000-df1fffff [size=2M] - Prefetchable memory behind bridge: 0000000090000000-00000000923fffff [size=36M] + Memory behind bridge: fff00000-000fffff [disabled] + Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- - BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B- + BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B- 
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- - Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+ - Address: 00000000fee00338 Data: 0000 - Masking: 000000ff Pending: 00000000 + Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+ + Address: 0000000000000000 Data: 0000 + Masking: 00000000 Pending: 00000000 Capabilities: [68] Express (v2) Upstream Port, MSI 00 DevCap: MaxPayload 1024 bytes, PhantFunc 0 ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 75.000W DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ - MaxPayload 256 bytes, MaxReadReq 128 bytes + MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ @@ -376,7 +375,7 @@ EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS- - DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, + DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled, AtomicOpsCtl: EgressBlck- LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ Retimer- 2Retimers- DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- @@ -402,7 +401,7 @@ LaneErrStat: 0 Capabilities: [148 v1] Virtual Channel Caps: LPEVC=1 RefClk=100ns PATEntryBits=8 - Arb: Fixed+ WRR32- WRR64- WRR128- + Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans- @@ -423,20 +422,20 @@ McastBlockUntransVec: 0000000000000000 McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr 0000000000000000 Capabilities: [b00 v1] Latency Tolerance Reporting - Max snoop latency: 71680ns - Max 
no snoop latency: 71680ns + Max snoop latency: 0ns + Max no snoop latency: 0ns Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?> Kernel driver in use: pcieport Looks like many fields have been zeroed out. Should a bare restoration of the previous register values be enough, or does some other magic (i.e. calling PCI functions/performing initialization tasks/etc.) need to happen to get that port back to sanity? Thanks! //hinko > > I didn't have much confidence in our error handling to begin with, and > it's eroding ;) > > Bjorn > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-09 20:50 ` Hinko Kocevar @ 2020-12-09 21:32 ` Bjorn Helgaas 2020-12-09 22:55 ` Hinko Kocevar 0 siblings, 1 reply; 12+ messages in thread From: Bjorn Helgaas @ 2020-12-09 21:32 UTC (permalink / raw) To: Hinko Kocevar; +Cc: linux-pci On Wed, Dec 09, 2020 at 09:50:17PM +0100, Hinko Kocevar wrote: > On 12/9/20 6:40 PM, Bjorn Helgaas wrote: > > Hmm. Yeah, I don't see anything in the path that would restore the > > 01:00.0 config space, which seems like a problem. Probably its > > secondary/subordinate bus registers and window registers get reset to > > 0 ("lspci -vv" would show that) so config/mem/io space below it is > > inaccessible. > > I guess the save/restore of the switch registers would be in the core PCI(e) > code; and not a switch vendor/part specific. > > I still have interest in getting this to work. With my limited knowledge of > the PCI code base, is there anything I could do to help? Is there a > particular code base area where you expect this should be managed; pci.c or > pcie/aer.c or maybe hotplug/ source? I can try and tinker with it.. > > > Here is the 'lspci -vv' of the 01:00.0 device at the time after the reset: > > 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 > [Normal decode]) > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx- > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Interrupt: pin A routed to IRQ 128 > Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual] > [size=256K] > Bus: primary=00, secondary=00, subordinate=00, sec-latency=0 > I/O behind bridge: 0000f000-00000fff [disabled] > Memory behind bridge: fff00000-000fffff [disabled] > Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff > [disabled] Yep, this explains why nothing below the bridge is accessible.
Secondary/subordinate=0 means no config transactions will be forwarded, and the windows are all disabled, so no MMIO transactions will be forwarded either. > Looks like many fields have been zeroed out. Should a bare > restoration of the previous registers values be enough or does some > other magic (i.e. calling PCI functions/performing initialization > tasks/etc) needs to happen to get that port back to sanity? Good question. Documentation/PCI/pci-error-recovery.rst has some details about the design, including this: It is important for the platform to restore the PCI config space to the "fresh poweron" state, rather than the "last state". After a slot reset, the device driver will almost always use its standard device initialization routines, and an unusual config space setup may result in hung devices, kernel panics, or silent data corruption. I'm not sure exactly how to interpret that. It seems like "restoring to 'fresh poweron' state" is basically just doing a reset, which gives results like your 01:00.0 device above. IIUC, we call a driver's .slot_reset() callback after a reset, and it's supposed to be able to initialize the device from power-on state. 01:00.0 is a Switch Upstream Port, so portdrv should claim it, and it looks like pcie_portdrv_slot_reset() should be restoring some state. But maybe dev->state_saved is false so pci_restore_state() bails out? Maybe add some printks in that path to see where it's broken? Bjorn ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-09 21:32 ` Bjorn Helgaas @ 2020-12-09 22:55 ` Hinko Kocevar 2020-12-10 12:56 ` Hinko Kocevar 2020-12-14 21:23 ` Keith Busch 0 siblings, 2 replies; 12+ messages in thread From: Hinko Kocevar @ 2020-12-09 22:55 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: linux-pci On 12/9/20 10:32 PM, Bjorn Helgaas wrote: > On Wed, Dec 09, 2020 at 09:50:17PM +0100, Hinko Kocevar wrote: >> On 12/9/20 6:40 PM, Bjorn Helgaas wrote: > >>> Hmm. Yeah, I don't see anything in the path that would restore the >>> 01:00.0 config space, which seems like a problem. Probably its >>> secondary/subordinate bus registers and window registers get reset to >>> 0 ("lspci -vv" would show that) so config/mem/io space below it is >>> accessible. >> >> I guess the save/restore of the switch registers would be in the core PCI(e) >> code; and not a switch vendor/part specific. >> >> I still have interest in getting this to work. With my limited knowledge of >> the PCI code base, is there anything I could do to help? Is there a >> particular code base area where you expect this should be managed; pci.c or >> pcie/aer.c or maybe hotplug/ source? I can try and tinker with it.. >> >> >> Here is the 'lspci -vv' of the 01:00.0 device at the time after the reset: >> >> 01:00.0 PCI bridge: PLX Technology, Inc. 
Device 8725 (rev ca) (prog-if 00 >> [Normal decode]) >> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx- >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> <TAbort- <MAbort- >SERR- <PERR- INTx- >> Interrupt: pin A routed to IRQ 128 >> Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual] >> [size=256K] >> Bus: primary=00, secondary=00, subordinate=00, sec-latency=0 >> I/O behind bridge: 0000f000-00000fff [disabled] >> Memory behind bridge: fff00000-000fffff [disabled] >> Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff >> [disabled] > > Yep, this explains why nothing below the bridge is accessible. > Secondary/subordinate=0 means no config transactions will be > forwarded, and the windows are all disabled, so no MMIO transactions > will be forwarded either. > >> Looks like many fields have been zeroed out. Should a bare >> restoration of the previous registers values be enough or does some >> other magic (i.e. calling PCI functions/performing initialization >> tasks/etc) needs to happen to get that port back to sanity? > > Good question. Documentation/PCI/pci-error-recovery.rst has some > details about the design, including this: > > It is important for the platform to restore the PCI config space to > the "fresh poweron" state, rather than the "last state". After a > slot reset, the device driver will almost always use its standard > device initialization routines, and an unusual config space setup > may result in hung devices, kernel panics, or silent data > corruption. > > I'm not sure exactly how to interpret that. It seems like "restoring > to 'fresh poweron' state" is basically just doing a reset, which gives > results like your 01:00.0 device above. IIUC, we call a driver's > .slot_reset() callback after a reset, and it's supposed to be able to > initialize the device from power-on state. 
> > 01:00.0 is a Switch Upstream Port, so portdrv should claim it, and it > looks like pcie_portdrv_slot_reset() should be restoring some state. > But maybe dev->state_saved is false so pci_restore_state() bails out? > Maybe add some printks in that path to see where it's broken? > Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success! So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER and therefore the pcie_portdrv_slot_reset() is not called. But the pcie_portdrv_err_resume() is called! Adding these two lines to pcie_portdrv_err_resume(), before the call to device_for_each_child(): pci_restore_state(dev); pci_save_state(dev); gets the secondary buses and the endpoint restored to the following state: 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes 03:00.0 PCI bridge: PLX Technology, Inc. 
PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 [Normal decode]) Physical Slot: 1 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 Subsystem: Device 1a3e:132c Physical Slot: 2 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Very nice. I have not verified if the buses are actually functional, yet, as I still see the endpoint not happy; I think its driver is just not doing any restoration currently. Knowing what it takes for switch ports to recover makes me think I can apply similar save/restore logic to my endpoint driver(s), too. Cheers, //hinko > Bjorn > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-09 22:55 ` Hinko Kocevar @ 2020-12-10 12:56 ` Hinko Kocevar 2020-12-14 21:23 ` Keith Busch 1 sibling, 0 replies; 12+ messages in thread From: Hinko Kocevar @ 2020-12-10 12:56 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: linux-pci On 12/9/20 11:55 PM, Hinko Kocevar wrote: > > > On 12/9/20 10:32 PM, Bjorn Helgaas wrote: >> On Wed, Dec 09, 2020 at 09:50:17PM +0100, Hinko Kocevar wrote: >>> On 12/9/20 6:40 PM, Bjorn Helgaas wrote: >> >>>> Hmm. Yeah, I don't see anything in the path that would restore the >>>> 01:00.0 config space, which seems like a problem. Probably its >>>> secondary/subordinate bus registers and window registers get reset to >>>> 0 ("lspci -vv" would show that) so config/mem/io space below it is >>>> accessible. >>> >>> I guess the save/restore of the switch registers would be in the core >>> PCI(e) >>> code; and not a switch vendor/part specific. >>> >>> I still have interest in getting this to work. With my limited >>> knowledge of >>> the PCI code base, is there anything I could do to help? Is there a >>> particular code base area where you expect this should be managed; >>> pci.c or >>> pcie/aer.c or maybe hotplug/ source? I can try and tinker with it.. >>> >>> >>> Here is the 'lspci -vv' of the 01:00.0 device at the time after the >>> reset: >>> >>> 01:00.0 PCI bridge: PLX Technology, Inc. 
Device 8725 (rev ca) >>> (prog-if 00 >>> [Normal decode]) >>> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- >>> ParErr- >>> Stepping- SERR- FastB2B- DisINTx- >>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >>> <TAbort- <MAbort- >SERR- <PERR- INTx- >>> Interrupt: pin A routed to IRQ 128 >>> Region 0: Memory at df200000 (32-bit, non-prefetchable) >>> [virtual] >>> [size=256K] >>> Bus: primary=00, secondary=00, subordinate=00, sec-latency=0 >>> I/O behind bridge: 0000f000-00000fff [disabled] >>> Memory behind bridge: fff00000-000fffff [disabled] >>> Prefetchable memory behind bridge: >>> 00000000fff00000-00000000000fffff >>> [disabled] >> >> Yep, this explains why nothing below the bridge is accessible. >> Secondary/subordinate=0 means no config transactions will be >> forwarded, and the windows are all disabled, so no MMIO transactions >> will be forwarded either. >> >>> Looks like many fields have been zeroed out. Should a bare >>> restoration of the previous registers values be enough or does some >>> other magic (i.e. calling PCI functions/performing initialization >>> tasks/etc) needs to happen to get that port back to sanity? >> >> Good question. Documentation/PCI/pci-error-recovery.rst has some >> details about the design, including this: >> >> It is important for the platform to restore the PCI config space to >> the "fresh poweron" state, rather than the "last state". After a >> slot reset, the device driver will almost always use its standard >> device initialization routines, and an unusual config space setup >> may result in hung devices, kernel panics, or silent data >> corruption. >> >> I'm not sure exactly how to interpret that. It seems like "restoring >> to 'fresh poweron' state" is basically just doing a reset, which gives >> results like your 01:00.0 device above. IIUC, we call a driver's >> .slot_reset() callback after a reset, and it's supposed to be able to >> initialize the device from power-on state. 
>> >> 01:00.0 is a Switch Upstream Port, so portdrv should claim it, and it >> looks like pcie_portdrv_slot_reset() should be restoring some state. >> But maybe dev->state_saved is false so pci_restore_state() bails out? >> Maybe add some printks in that path to see where it's broken? >> > > Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success! > > So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER > and therefore the pcie_portdrv_slot_reset() is not called. > > But the pcie_portdrv_err_resume() is called! Adding these two lines to > pcie_portdrv_err_resume(), before the call to device_for_each_child(): > > pci_restore_state(dev); > pci_save_state(dev); > > gets the secondary buses and the endpoint restored to the following state: > > 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen > Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if > 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if > 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > 03:00.0 PCI bridge: PLX Technology, Inc. 
PEX 8748 48-Lane, 12-Port PCI > Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 > [Normal decode]) > Physical Slot: 1 > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI > Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca) (prog-if 00 > [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 > Subsystem: Device 1a3e:132c > Physical Slot: 2 > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx- > > > Very nice. > > I have not verified if the buses are actually functional, yet, as I > still see the endpoint not happy; I think its driver is just not doing > any restoration currently. Knowing what it takes for switch ports to > recover makes me think I can apply similar save/restore logic to my > endpoint driver(s), too. I've added the same PCI save/restore state calls to my drivers' error resume callback, as well as save state to the probe. The results from injecting the error are as expected; the recovery is successful. Further, I'm running torture tests as we speak, with 5 AMCs as endpoints, supported by two different kernel drivers. 
As of this moment, 7 cases of: [ 3745.957711] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0 [ 3746.149747] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID) [ 3746.149749] pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000 [ 3746.149751] pcieport 0000:00:01.0: [14] CmpltTO (First) were observed (not provoked by injection) and successfully recovered from. Before these changes the PCI bus would essentially remain in a non-operational state; now the bus recovers and the next test iteration succeeds without user intervention. For the record, no other types of errors have been observed on my side so far during the tests. I would like to see the log header in the error reports above, as it is not visible for some reason, and that is what I'll look into next. At a glance, it looks like the 'info->tlp_header_valid' check in aer_print_error() is not passing, resulting in no call to __print_tlp_header(). Thanks! //hinko > > > Cheers, > //hinko > >> Bjorn
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-09 22:55 ` Hinko Kocevar 2020-12-10 12:56 ` Hinko Kocevar @ 2020-12-14 21:23 ` Keith Busch 2020-12-15 12:56 ` Hinko Kocevar 1 sibling, 1 reply; 12+ messages in thread From: Keith Busch @ 2020-12-14 21:23 UTC (permalink / raw) To: Hinko Kocevar; +Cc: Bjorn Helgaas, linux-pci On Wed, Dec 09, 2020 at 11:55:07PM +0100, Hinko Kocevar wrote: > Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success! > > So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER and > therefore the pcie_portdrv_slot_reset() is not called. > > But the pcie_portdrv_err_resume() is called! Adding these two lines to > pcie_portdrv_err_resume(), before the call to device_for_each_child(): > > pci_restore_state(dev); > pci_save_state(dev); Do you need to do that with a current kernel, or are you still using 3.10? A more recent kernel shouldn't have needed such a fix after the following commit was introduced: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=874b3251113a1e2cbe79c24994dc03fe4fe4b99b
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-14 21:23 ` Keith Busch @ 2020-12-15 12:56 ` Hinko Kocevar 2020-12-15 18:56 ` Keith Busch 0 siblings, 1 reply; 12+ messages in thread From: Hinko Kocevar @ 2020-12-15 12:56 UTC (permalink / raw) To: Keith Busch; +Cc: Bjorn Helgaas, linux-pci Hi Keith, On 12/14/20 10:23 PM, Keith Busch wrote: > On Wed, Dec 09, 2020 at 11:55:07PM +0100, Hinko Kocevar wrote: >> Adding a bunch of printk()'s to portdrv_pci.c led to (partial) success! >> >> So, the pcie_portdrv_error_detected() returns PCI_ERS_RESULT_CAN_RECOVER and >> therefore the pcie_portdrv_slot_reset() is not called. >> >> But the pcie_portdrv_err_resume() is called! Adding these two lines to >> pcie_portdrv_err_resume(), before the call to device_for_each_child(): >> >> pci_restore_state(dev); >> pci_save_state(dev); > > You need to do that with the current kernel or are you still using a > 3.10? A more recent kernel shouldn't have needed such a fix after the This was tested on the 5.9.12 kernel at that time. As of today, I've re-run the tests on Bjorn's git tree, pci/err branch from Sunday (I guess 5.10.0 version). > following commit was introduced: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=874b3251113a1e2cbe79c24994dc03fe4fe4b99b > I noticed the change you are pointing out when trying to propose a patch. It made me curious about why pcie_portdrv_slot_reset() is not invoked. After sprinkling a couple of printk()'s around pcie_do_recovery() and the pcie_portdrv_err_handler's I observed that pcie_portdrv_slot_reset() is never called from pcie_do_recovery() due to the status returned by reset_subordinates() (actually aer_root_reset() from pcie/aer.c) being PCI_ERS_RESULT_RECOVERED. I reckon that, in order to invoke pcie_portdrv_slot_reset(), aer_root_reset() should have returned PCI_ERS_RESULT_NEED_RESET.
As soon as I plug the calls to pci_restore_state() and pci_save_state() into pcie_portdrv_err_resume(), the bus and devices are operational again. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Recovering from AER: Uncorrected (Fatal) error 2020-12-15 12:56 ` Hinko Kocevar @ 2020-12-15 18:56 ` Keith Busch 0 siblings, 0 replies; 12+ messages in thread From: Keith Busch @ 2020-12-15 18:56 UTC (permalink / raw) To: Hinko Kocevar; +Cc: Bjorn Helgaas, linux-pci On Tue, Dec 15, 2020 at 01:56:21PM +0100, Hinko Kocevar wrote: > I noticed the change you are pointing out when trying to propose a patch. > > It made me curious on why the pcie_portdrv_slot_reset() is not invoked. > > After sprinkling a couple of printk()'s around the pcie_do_recovery() and > pcie_portdrv_err_handler's I can observe that the pcie_portdrv_slot_reset() > is never called from pcie_do_recovery() due to status returned by > reset_subordinates() (actually aer_root_reset() from pcie/aer.c) being > PCI_ERS_RESULT_RECOVERED. > > I reckon, in order to invoke the pcie_portdrv_slot_reset(), the > aer_root_reset() should have returned PCI_ERS_RESULT_NEED_RESET. > > As soon as I plug the calls to pci_restore_state() and pci_save_state() into > the pcie_portdrv_err_resume() the bus and devices are operational again. It appears some sequences in the AER handling are out of sync with at least a few drivers' expectations. The .slot_reset callback is for when the 'PCI slot has been reset', according to the code comments, so we should be calling it after the reset to notify drivers of downstream devices that the event occurred. I'll mess with it a bit and see if we can get everyone aligned. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-12-15 18:57 UTC | newest] Thread overview: 12+ messages -- 2020-12-04 12:52 Recovering from AER: Uncorrected (Fatal) error Hinko Kocevar 2020-12-04 22:38 ` Bjorn Helgaas 2020-12-09 10:02 ` Hinko Kocevar 2020-12-09 17:40 ` Bjorn Helgaas 2020-12-09 20:31 ` Hinko Kocevar 2020-12-09 20:50 ` Hinko Kocevar 2020-12-09 21:32 ` Bjorn Helgaas 2020-12-09 22:55 ` Hinko Kocevar 2020-12-10 12:56 ` Hinko Kocevar 2020-12-14 21:23 ` Keith Busch 2020-12-15 12:56 ` Hinko Kocevar 2020-12-15 18:56 ` Keith Busch