From: Hinko Kocevar
To: "linux-pci@vger.kernel.org"
Subject: Recovering from AER: Uncorrected (Fatal) error
Date: Fri, 4 Dec 2020 12:52:18 +0000
Message-ID: <1a9f75f828c04130b16b7e0a3ae7f1e0@ess.eu>

Hi,

I'm trying to figure out how to recover from an Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:

Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First)

This is the complete PCI device tree that I'm working with:

$ sudo /usr/local/bin/pcicrawler -t
00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
│  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
│  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
│  │     ├─04:00.0 downstream_port, slot 10, power: Off
│  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
│  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
│  │     ├─04:02.0 downstream_port, slot 9, power: Off
│  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
│  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
│  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
│  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
│  │     ├─04:09.0 downstream_port, slot 11, power: Off
│  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
│  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
│  │     ├─04:0b.0 downstream_port, slot 12, power: Off
│  │     ├─04:10.0 downstream_port, slot 8, power: Off
│  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
│  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
│  │     └─04:12.0 downstream_port, slot 7, power: Off
│  ├─02:02.0 downstream_port, slot 2
│  ├─02:08.0 downstream_port, slot 8
│  ├─02:09.0 downstream_port, slot 9, power: Off
│  └─02:0a.0 downstream_port, slot 10
├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
└─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0

00:01.0 is on a CPU board; 03:00.0 and everything below it is not on the CPU board (I'm working with a MicroTCA system here). The FPGA-based devices I'm working with are seen as endpoints here.

After the error, none of the endpoints in the above list is able to talk to the CPU anymore; register reads return 0xFFFFFFFF. At the same time, the PCI config space looks sane and is accessible for those devices.

How can I debug this further? I'm getting the "I/O to channel is blocked" (pci_channel_io_frozen) state reported to the endpoint devices I provide drivers for.

Is there any way of telling whether the PCI switch devices between 00:01.0 ... 06:00.0 have all recovered, i.e. whether the links are up and running and similar? Is this information provided by the Linux kernel somehow?

For reference, I've experimented with AER inject, and my tests showed that if the same type of error is injected into any other PCI device in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is recovered successfully and I can continue working with the endpoint devices. In those cases the "I/O channel is in normal state" (pci_channel_io_normal) state was reported; only error injection into 00:01.0 reports the pci_channel_io_frozen state. The recovery code in the endpoint drivers I maintain just calls pci_cleanup_aer_uncorrect_error_status() in the error handler resume() callback.

FYI, this is on the 3.10.0-1160.6.1.el7.x86_64.debug CentOS 7 kernel, which I believe is quite old. At the moment I cannot use a newer kernel, but I would be prepared to take that step if told that it would help.

Thank you in advance,
Hinko