All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hinko Kocevar <hinko.kocevar@ess.eu>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: Recovering from AER: Uncorrected (Fatal) error
Date: Wed, 9 Dec 2020 21:31:21 +0100	[thread overview]
Message-ID: <9d234658-2ce5-593f-aff7-d524dc89a382@ess.eu> (raw)
In-Reply-To: <20201209174009.GA2532473@bjorn-Precision-5520>

Hi Bjorn!

On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
> [non-standard email quoting below, ">" for both my text and Hinko's,
> see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style]
> 

Apologies. Probably due to the web MUA I used. Hopefully this will be 
better now.

> On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote:
>> ________________________________________
>> From: Bjorn Helgaas <helgaas@kernel.org>
>> Sent: Friday, December 4, 2020 11:38 PM
>> To: Hinko Kocevar
>> Cc: linux-pci@vger.kernel.org
>> Subject: Re: Recovering from AER: Uncorrected (Fatal) error
>>
>> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
>>> Hi,
>>>
>>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>>>
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
>>>
>>> This is the complete PCI device tree that I'm working with:
>>>
>>> $ sudo /usr/local/bin/pcicrawler -t
>>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>>   ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>>   │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>>   │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>>   │  │     ├─04:00.0 downstream_port, slot 10, power: Off
>>>   │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:02.0 downstream_port, slot 9, power: Off
>>>   │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:09.0 downstream_port, slot 11, power: Off
>>>   │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
>>>   │  │     ├─04:10.0 downstream_port, slot 8, power: Off
>>>   │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>>>   │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>>>   │  │     └─04:12.0 downstream_port, slot 7, power: Off
>>>   │  ├─02:02.0 downstream_port, slot 2
>>>   │  ├─02:08.0 downstream_port, slot 8
>>>   │  ├─02:09.0 downstream_port, slot 9, power: Off
>>>   │  └─02:0a.0 downstream_port, slot 10
>>>   ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>
>>> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is
>>> not on a CPU board (working with a micro TCA system here). I'm
>>> working with FPGA based devices seen as endpoints here.  After the
>>> error all the endpoints in the above list are not able to talk to
>>> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI
>>> config space looks sane and accessible for those devices.
>>
>> This could be caused by a reset.  We probably do a secondary bus reset
>> on the Root Port, which resets all the devices below it.  After the
>> reset, config space of those downstream devices would be accessible,
>> but the PCI_COMMAND register may be cleared which means the device
>> wouldn't respond to MMIO reads.
>>
>> After adding some printk()'s to the PCIe code that deals with the reset, I can see that the pci_bus_error_reset() calls the pci_bus_reset() due to bus slots list check for empty returns true. So the the secondary bus is being reset.
>>
>> None of this explains the original problem of the Completion Timeout,
>> of course.  The error source of 0x8 (00:01.0) is the root port, which
>> makes sense if it issued a request transaction and never got the
>> completion.  The root port *should* log the header of the request and
>> we should print it, but it looks like we didn't.  "lspci -vv" of the
>> device would show whether it's capable of this logging.
>>
>> I'm mostly concerned about the fact that the secondary bus is
>> rendered useless after the reset/recovery. AIUI, this should not
>> happen.
> 
> It should not.  The secondary bus should still work after the reset.

I was hoping to hear that!
> 
>> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:
>>
>>          Capabilities: [1c0 v1] Advanced Error Reporting
>>                  UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>>                  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>                  AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>>                          MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>                  HeaderLog: 00000000 00000001 00000002 00000003
>>                  RootCmd: CERptEn+ NFERptEn+ FERptEn+
>>                  RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>>                           FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>>                  ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008
>>
>> The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'.
>>
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO
>> Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
>>
>>
>> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?
> 
> No, it shouldn't.  I'm pretty sure we only print those values and they
> don't influence behavior at all.
> 
>>> How can I debug this further? I'm getting the "I/O to channel is
>>> blocked" (pci_channel_io_frozen) state reported to the the endpoint
>>> devices I provide driver for.  Is there any way of telling if the
>>> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ;
>>> links up and running and similar? Is this information provided by
>>> the Linux kernel somehow?
>>>
>>> For reference, I've experimented with AER inject and my tests showed
>>> that if the same type of error is injected in any other PCI device
>>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
>>> recovered successfully, and I can continue working with the endpoint
>>> devices. In those cases the "I/O channel is in normal state"
>>> (pci_channel_io_normal) state was reported; only error injection
>>> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in
>>> the endpoint drivers I maintain is just calling the
>>> pci_cleanup_aer_uncorrect_error_status() in error handler resume()
>>> callback.
>>>
>>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
>>> which I believe is quite old. At the moment I can not use newer
>>> kernel, but would be prepared to take that step if told that it
>>> would help.
>>
>> It's really not practical for us to help debug a v3.10-based kernel;
>> that's over seven years old and AER handling has changed significantly
>> since then.  Also, CentOS is a distro kernel and includes many changes
>> on top of the v3.10 upstream kernel.  Those changes might be
>> theoretically open-source, but as a practical matter, the distro is a
>> better place to start for support of their kernel.
>>
>> I followed your advice and started using kernel 5.9.12; hope this is recent enough.
>>
>> With that kernel I see a bit of different behavior than with the
>> 3.10 earlier. Configuration access to the devices on the secondary
>> bus is not OK; all 0xFFFFFFFF. This is what I see after the error
>> was injected into the root port 00:01.0 and secondary bus reset was
>> made as part of recovery:
>>
>> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
>> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Latency: 0, Cache Line Size: 64 bytes
>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
>> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Interrupt: pin A routed to IRQ 128
>> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: mrf-pci
>>
>> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED.
>> How come that the 01:00.0 upstream port PCI_COMMAND register is
>> cleared? I guess that explains why the rest of the devices in the
>> chain down to the endpoint at 0e:00.0 are rendered useless.
>>
>> Then, as part of having fun, I made the pci_bus_error_reset() *NOT*
>> do the actual reset of the secondary by just returning 0 before the
>> call to the pci_bus_reset(). In this case no bus reset is done,
>> obviously, and MMIO access to my endpoint(s) is working just fine
>> after the error was injected into the root port; as if nothing
>> happened.
>>
>> Is it expected that in cases of seeing this type of error the
>> secondary bus would be in this kind of state? It looks to me that
>> the problem is not with my endpoints but with the PCIe switch on the
>> CPU board (01:00.0 <--> 02:0?.0).
> 
> Hmm.  Yeah, I don't see anything in the path that would restore the
> 01:00.0 config space, which seems like a problem.  Probably its
> secondary/subordinate bus registers and window registers get reset to
> 0 ("lspci -vv" would show that) so config/mem/io space below it is
> accessible.
> 
> I didn't have much confidence in our error handling to begin with, and
> it's eroding ;)
> 
> Bjorn
> 

  reply	other threads:[~2020-12-09 20:32 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-04 12:52 Recovering from AER: Uncorrected (Fatal) error Hinko Kocevar
2020-12-04 22:38 ` Bjorn Helgaas
2020-12-09 10:02   ` Hinko Kocevar
2020-12-09 17:40     ` Bjorn Helgaas
2020-12-09 20:31       ` Hinko Kocevar [this message]
2020-12-09 20:50       ` Hinko Kocevar
2020-12-09 21:32         ` Bjorn Helgaas
2020-12-09 22:55           ` Hinko Kocevar
2020-12-10 12:56             ` Hinko Kocevar
2020-12-14 21:23             ` Keith Busch
2020-12-15 12:56               ` Hinko Kocevar
2020-12-15 18:56                 ` Keith Busch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9d234658-2ce5-593f-aff7-d524dc89a382@ess.eu \
    --to=hinko.kocevar@ess.eu \
    --cc=helgaas@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.