From: Hinko Kocevar
To: Bjorn Helgaas
CC: "linux-pci@vger.kernel.org"
Subject: Re: Recovering from AER: Uncorrected (Fatal) error
Date: Wed, 9 Dec 2020 10:02:10 +0000
References: <1a9f75f828c04130b16b7e0a3ae7f1e0@ess.eu>, <20201204223834.GA1970318@bjorn-Precision-5520>
In-Reply-To: <20201204223834.GA1970318@bjorn-Precision-5520>

________________________________________
From: Bjorn Helgaas
Sent: Friday, December 4, 2020 11:38 PM
To: Hinko Kocevar
Cc: linux-pci@vger.kernel.org
Subject: Re: Recovering from AER: Uncorrected (Fatal) error

On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
> Hi,
>
> I'm trying to figure out how to recover from an Uncorrected (Fatal)
> error that is seen by the PCI root on a CPU PCIe controller:
>
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout (First)
>
> This is the complete PCI device tree that I'm working with:
>
> $ sudo /usr/local/bin/pcicrawler -t
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>  │ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>  │ │ └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>  │ │   ├─04:00.0 downstream_port, slot 10, power: Off
>  │ │   ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>  │ │   │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │ │   ├─04:02.0 downstream_port, slot 9, power: Off
>  │ │   ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>  │ │   │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │ │   ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>  │ │   │ └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │ │   ├─04:09.0 downstream_port, slot 11, power: Off
>  │ │   ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>  │ │   │ └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>  │ │   ├─04:0b.0 downstream_port, slot 12, power: Off
>  │ │   ├─04:10.0 downstream_port, slot 8, power: Off
>  │ │   ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>  │ │   │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>  │ │   └─04:12.0 downstream_port, slot 7, power: Off
>  │ ├─02:02.0 downstream_port, slot 2
>  │ ├─02:08.0 downstream_port, slot 8
>  │ ├─02:09.0 downstream_port, slot 9, power: Off
>  │ └─02:0a.0 downstream_port, slot 10
>  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>
> 00:01.0 is on the CPU board. The 03:00.0 and everything below it is
> not on the CPU board (I'm working with a MicroTCA system here), and
> the endpoints here are FPGA-based devices. After the error, none of
> the endpoints in the above list can talk to the CPU anymore; register
> reads return 0xFFFFFFFF. At the same time, PCI config space looks
> sane and accessible for those devices.

This could be caused by a reset.  We probably do a secondary bus reset
on the Root Port, which resets all the devices below it.  After the
reset, config space of those downstream devices would be accessible,
but the PCI_COMMAND register may be cleared, which means the devices
wouldn't respond to MMIO reads.

After adding some printk()'s to the PCIe code that deals with the reset,
I can see that pci_bus_error_reset() calls pci_bus_reset() because the
check for an empty bus slots list returns true. So the secondary bus is
indeed being reset.
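In case it is useful, the decision point I instrumented looks roughly
like this in the 5.9-era source (paraphrased and trimmed from
drivers/pci/pci.c, so treat it as a sketch of the logic rather than the
exact code; upstream also holds pci_slot_mutex around the slot walk):

/*
 * Sketch of the slot-vs-bus reset decision, paraphrased from
 * drivers/pci/pci.c (5.9-era); simplified, details vary by version.
 */
static int pci_bus_error_reset(struct pci_dev *bridge)
{
        struct pci_bus *bus = bridge->subordinate;
        struct pci_slot *slot;

        if (!bus)
                return -ENOTTY;

        /*
         * No struct pci_slot registered below this bridge (my case):
         * go straight to a secondary bus reset of everything below it.
         */
        if (list_empty(&bus->slots))
                return pci_bus_reset(bus, 0);

        /*
         * Otherwise prefer per-slot resets; fall back to a bus reset
         * if any slot cannot be reset on its own.
         */
        list_for_each_entry(slot, &bus->slots, list)
                if (pci_probe_reset_slot(slot))
                        return pci_bus_reset(bus, 0);

        list_for_each_entry(slot, &bus->slots, list)
                if (pci_slot_reset(slot, 0))
                        return pci_bus_reset(bus, 0);

        return 0;
}

With no struct pci_slot registered under 00:01.0, the whole hierarchy
behind the Root Port gets the secondary bus reset, which matches what I
see.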
None of this explains the original problem of the Completion Timeout,
of course.  The error source of 0x8 (00:01.0) is the root port, which
makes sense if it issued a request transaction and never got the
completion.  The root port *should* log the header of the request and
we should print it, but it looks like we didn't.  "lspci -vv" of the
device would show whether it's capable of this logging.

I'm mostly concerned about the fact that the secondary bus is rendered
useless after the reset/recovery. AIUI, this should not happen.

After upgrading my pciutils I get a bit more output from lspci -vv for
the Root Port. It seems it is capable of logging the errors:

        Capabilities: [1c0 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000001 00000002 00000003
                RootCmd: CERptEn+ NFERptEn+ FERptEn+
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008

The above was provoked with aer-inject, injecting the same error as in
the initial report, with a bogus 'HEADER_LOG 0 1 2 3':

Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO
Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2

Question: would using different (real?) HEADER_LOG values affect the
recovery behaviour?
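For completeness, the injection file I fed to aer-inject looked more or
less like the following (keyword spellings are from my memory of the
tool's README, so double-check them against the examples shipped with
aer-inject; the header log words are the bogus 0 1 2 3):

AER
DOMAIN 0000
BUS 0
DEV 1
FN 0
UNCOR_STATUS COMP_TIME
HEADER_LOG 0 1 2 3

It was pushed into the kernel with the aer-inject userspace tool on top
of the aer_inject module.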
If you're talking to an FPGA, the most likely explanation is a bug in
the FPGA programming where it doesn't respond when it should.

The thing is that this error does not happen that often, and most of
the time the FPGA card is just fine. TBH, I am doing some stress
testing of card removal through the MCH (exercising hotplug), and that
is when I noticed the initial problem.

A PCIe analyzer would show exactly what the problem is, but those are
expensive and rare.  But if you're dealing with FPGAs, maybe you have
access to one.

Sadly, no, I do not have access to an analyzer.

> How can I debug this further? I'm getting the "I/O to channel is
> blocked" (pci_channel_io_frozen) state reported to the endpoint
> devices I provide a driver for. Is there any way of telling if the
> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered;
> links up and running and similar? Is this information provided by
> the Linux kernel somehow?
>
> For reference, I've experimented with AER inject and my tests showed
> that if the same type of error is injected into any other PCI device
> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
> recovered successfully and I can continue working with the endpoint
> devices. In those cases the "I/O channel is in normal state"
> (pci_channel_io_normal) state was reported; only error injection
> into 00:01.0 reports the pci_channel_io_frozen state. Recovery code
> in the endpoint drivers I maintain just calls
> pci_cleanup_aer_uncorrect_error_status() in the error handler's
> resume() callback.
>
> FYI, this is on the 3.10.0-1160.6.1.el7.x86_64.debug CentOS 7 kernel,
> which I believe is quite old. At the moment I cannot use a newer
> kernel, but I would be prepared to take that step if told that it
> would help.

It's really not practical for us to help debug a v3.10-based kernel;
that's over seven years old and AER handling has changed significantly
since then.  Also, CentOS is a distro kernel and includes many changes
on top of the v3.10 upstream kernel.  Those changes might be
theoretically open-source, but as a practical matter, the distro is a
better place to start for support of their kernel.

I followed your advice and started using kernel 5.9.12; I hope this is
recent enough.

With that kernel I see a bit different behavior than with the 3.10
earlier. Configuration access to the devices on the secondary bus is
not OK; reads return all 0xFFFFFFFF. This is what I see after the error
was injected into the root port 00:01.0 and a secondary bus reset was
made as part of recovery:

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- 02:0?.0).

Thanks!
//hinko
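P.S. In case the driver side is relevant: the endpoint drivers wire up
the standard AER hooks. A stripped-down, 5.x-style sketch of that
wiring is below (names are made up and this is illustrative rather than
the actual driver; the real 3.10-era code additionally calls
pci_cleanup_aer_uncorrect_error_status() from resume()):

#include <linux/pci.h>

/* Hypothetical endpoint driver error hooks; illustrative only. */
static pci_ers_result_t mydrv_error_detected(struct pci_dev *pdev,
                                             pci_channel_state_t state)
{
        /* pci_channel_io_frozen: MMIO is blocked until the bus/link is reset */
        if (state == pci_channel_io_frozen)
                return PCI_ERS_RESULT_NEED_RESET;
        return PCI_ERS_RESULT_CAN_RECOVER;
}

static pci_ers_result_t mydrv_slot_reset(struct pci_dev *pdev)
{
        /* assumes pci_save_state() was called in probe() */
        pci_restore_state(pdev);
        pci_set_master(pdev);
        return PCI_ERS_RESULT_RECOVERED;
}

static void mydrv_resume(struct pci_dev *pdev)
{
        /* restart device I/O here */
}

static const struct pci_error_handlers mydrv_err_handlers = {
        .error_detected = mydrv_error_detected,
        .slot_reset     = mydrv_slot_reset,
        .resume         = mydrv_resume,
};

The struct is attached through the driver's .err_handler field in
struct pci_driver.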