Re: Recovering from AER: Uncorrected (Fatal) error

From: Hinko Kocevar <hinko.kocevar@ess.eu>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: Recovering from AER: Uncorrected (Fatal) error
Date: Wed, 9 Dec 2020 21:31:21 +0100	[thread overview]
Message-ID: <9d234658-2ce5-593f-aff7-d524dc89a382@ess.eu> (raw)
In-Reply-To: <20201209174009.GA2532473@bjorn-Precision-5520>

Hi Bjorn!

On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
> [non-standard email quoting below, ">" for both my text and Hinko's,
> see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style]
> 

Apologies. Probably due to the web MUA I used. Hopefully this will be 
better now.

> On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote:
>> ________________________________________
>> From: Bjorn Helgaas <helgaas@kernel.org>
>> Sent: Friday, December 4, 2020 11:38 PM
>> To: Hinko Kocevar
>> Cc: linux-pci@vger.kernel.org
>> Subject: Re: Recovering from AER: Uncorrected (Fatal) error
>>
>> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
>>> Hi,
>>>
>>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>>>
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
>>>
>>> This is the complete PCI device tree that I'm working with:
>>>
>>> $ sudo /usr/local/bin/pcicrawler -t
>>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>>   ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>>   │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>>   │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>>   │  │     ├─04:00.0 downstream_port, slot 10, power: Off
>>>   │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:02.0 downstream_port, slot 9, power: Off
>>>   │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:09.0 downstream_port, slot 11, power: Off
>>>   │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
>>>   │  │     ├─04:10.0 downstream_port, slot 8, power: Off
>>>   │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>>>   │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>>>   │  │     └─04:12.0 downstream_port, slot 7, power: Off
>>>   │  ├─02:02.0 downstream_port, slot 2
>>>   │  ├─02:08.0 downstream_port, slot 8
>>>   │  ├─02:09.0 downstream_port, slot 9, power: Off
>>>   │  └─02:0a.0 downstream_port, slot 10
>>>   ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>
>>> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is
>>> not on a CPU board (working with a micro TCA system here). I'm
>>> working with FPGA based devices seen as endpoints here.  After the
>>> error all the endpoints in the above list are not able to talk to
>>> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI
>>> config space looks sane and accessible for those devices.
>>
>> This could be caused by a reset.  We probably do a secondary bus reset
>> on the Root Port, which resets all the devices below it.  After the
>> reset, config space of those downstream devices would be accessible,
>> but the PCI_COMMAND register may be cleared which means the device
>> wouldn't respond to MMIO reads.
>>
>> After adding some printk()'s to the PCIe code that deals with the reset, I can see that the pci_bus_error_reset() calls the pci_bus_reset() due to bus slots list check for empty returns true. So the the secondary bus is being reset.
>>
>> None of this explains the original problem of the Completion Timeout,
>> of course.  The error source of 0x8 (00:01.0) is the root port, which
>> makes sense if it issued a request transaction and never got the
>> completion.  The root port *should* log the header of the request and
>> we should print it, but it looks like we didn't.  "lspci -vv" of the
>> device would show whether it's capable of this logging.
>>
>> I'm mostly concerned about the fact that the secondary bus is
>> rendered useless after the reset/recovery. AIUI, this should not
>> happen.
> 
> It should not.  The secondary bus should still work after the reset.

I was hoping to hear that!
> 
>> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:
>>
>>          Capabilities: [1c0 v1] Advanced Error Reporting
>>                  UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>>                  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>                  AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>>                          MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>                  HeaderLog: 00000000 00000001 00000002 00000003
>>                  RootCmd: CERptEn+ NFERptEn+ FERptEn+
>>                  RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>>                           FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>>                  ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008
>>
>> The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'.
>>
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO
>> Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
>>
>>
>> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?
> 
> No, it shouldn't.  I'm pretty sure we only print those values and they
> don't influence behavior at all.
> 
>>> How can I debug this further? I'm getting the "I/O to channel is
>>> blocked" (pci_channel_io_frozen) state reported to the the endpoint
>>> devices I provide driver for.  Is there any way of telling if the
>>> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ;
>>> links up and running and similar? Is this information provided by
>>> the Linux kernel somehow?
>>>
>>> For reference, I've experimented with AER inject and my tests showed
>>> that if the same type of error is injected in any other PCI device
>>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
>>> recovered successfully, and I can continue working with the endpoint
>>> devices. In those cases the "I/O channel is in normal state"
>>> (pci_channel_io_normal) state was reported; only error injection
>>> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in
>>> the endpoint drivers I maintain is just calling the
>>> pci_cleanup_aer_uncorrect_error_status() in error handler resume()
>>> callback.
>>>
>>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
>>> which I believe is quite old. At the moment I can not use newer
>>> kernel, but would be prepared to take that step if told that it
>>> would help.
>>
>> It's really not practical for us to help debug a v3.10-based kernel;
>> that's over seven years old and AER handling has changed significantly
>> since then.  Also, CentOS is a distro kernel and includes many changes
>> on top of the v3.10 upstream kernel.  Those changes might be
>> theoretically open-source, but as a practical matter, the distro is a
>> better place to start for support of their kernel.
>>
>> I followed your advice and started using kernel 5.9.12; hope this is recent enough.
>>
>> With that kernel I see a bit of different behavior than with the
>> 3.10 earlier. Configuration access to the devices on the secondary
>> bus is not OK; all 0xFFFFFFFF. This is what I see after the error
>> was injected into the root port 00:01.0 and secondary bus reset was
>> made as part of recovery:
>>
>> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
>> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Latency: 0, Cache Line Size: 64 bytes
>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
>> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Interrupt: pin A routed to IRQ 128
>> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: mrf-pci
>>
>> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED.
>> How come that the 01:00.0 upstream port PCI_COMMAND register is
>> cleared? I guess that explains why the rest of the devices in the
>> chain down to the endpoint at 0e:00.0 are rendered useless.
>>
>> Then, as part of having fun, I made the pci_bus_error_reset() *NOT*
>> do the actual reset of the secondary by just returning 0 before the
>> call to the pci_bus_reset(). In this case no bus reset is done,
>> obviously, and MMIO access to my endpoint(s) is working just fine
>> after the error was injected into the root port; as if nothing
>> happened.
>>
>> Is it expected that in cases of seeing this type of error the
>> secondary bus would be in this kind of state? It looks to me that
>> the problem is not with my endpoints but with the PCIe switch on the
>> CPU board (01:00.0 <--> 02:0?.0).
> 
> Hmm.  Yeah, I don't see anything in the path that would restore the
> 01:00.0 config space, which seems like a problem.  Probably its
> secondary/subordinate bus registers and window registers get reset to
> 0 ("lspci -vv" would show that) so config/mem/io space below it is
> accessible.
> 
> I didn't have much confidence in our error handling to begin with, and
> it's eroding ;)
> 
> Bjorn
>