From: Hinko Kocevar <hinko.kocevar@ess.eu>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: Recovering from AER: Uncorrected (Fatal) error
Date: Wed, 9 Dec 2020 21:50:17 +0100 [thread overview]
Message-ID: <d1ad4c01-ce73-dcc0-99b1-5fe45a984800@ess.eu> (raw)
In-Reply-To: <20201209174009.GA2532473@bjorn-Precision-5520>
Hi Bjorn!
On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
> [non-standard email quoting below, ">" for both my text and Hinko's,
> see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style]
>
Apologies. Probably due to the web MUA I used. Hopefully this will be
better now.
> On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote:
>> ________________________________________
>> From: Bjorn Helgaas <helgaas@kernel.org>
>> Sent: Friday, December 4, 2020 11:38 PM
>> To: Hinko Kocevar
>> Cc: linux-pci@vger.kernel.org
>> Subject: Re: Recovering from AER: Uncorrected (Fatal) error
>>
>> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
>>> Hi,
>>>
>>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>>>
>>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
>>> Dec 1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout (First)
>>>
>>> This is the complete PCI device tree that I'm working with:
>>>
>>> $ sudo /usr/local/bin/pcicrawler -t
>>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>> ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>> │ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>> │ │ └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>> │ │ ├─04:00.0 downstream_port, slot 10, power: Off
>>> │ │ ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>>> │ │ │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>> │ │ ├─04:02.0 downstream_port, slot 9, power: Off
>>> │ │ ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>>> │ │ │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>> │ │ ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>>> │ │ │ └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>> │ │ ├─04:09.0 downstream_port, slot 11, power: Off
>>> │ │ ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>> │ │ │ └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>> │ │ ├─04:0b.0 downstream_port, slot 12, power: Off
>>> │ │ ├─04:10.0 downstream_port, slot 8, power: Off
>>> │ │ ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>>> │ │ │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>>> │ │ └─04:12.0 downstream_port, slot 7, power: Off
>>> │ ├─02:02.0 downstream_port, slot 2
>>> │ ├─02:08.0 downstream_port, slot 8
>>> │ ├─02:09.0 downstream_port, slot 9, power: Off
>>> │ └─02:0a.0 downstream_port, slot 10
>>> ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>> ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>> ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>> └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>
>>> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is
>>> not on a CPU board (working with a micro TCA system here). I'm
>>> working with FPGA based devices seen as endpoints here. After the
>>> error all the endpoints in the above list are not able to talk to
>>> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI
>>> config space looks sane and accessible for those devices.
>>
>> This could be caused by a reset. We probably do a secondary bus reset
>> on the Root Port, which resets all the devices below it. After the
>> reset, config space of those downstream devices would be accessible,
>> but the PCI_COMMAND register may be cleared which means the device
>> wouldn't respond to MMIO reads.
>>
>> After adding some printk()'s to the PCIe code that deals with the reset, I can see that the pci_bus_error_reset() calls the pci_bus_reset() due to bus slots list check for empty returns true. So the the secondary bus is being reset.
>>
>> None of this explains the original problem of the Completion Timeout,
>> of course. The error source of 0x8 (00:01.0) is the root port, which
>> makes sense if it issued a request transaction and never got the
>> completion. The root port *should* log the header of the request and
>> we should print it, but it looks like we didn't. "lspci -vv" of the
>> device would show whether it's capable of this logging.
>>
>> I'm mostly concerned about the fact that the secondary bus is
>> rendered useless after the reset/recovery. AIUI, this should not
>> happen.
>
> It should not. The secondary bus should still work after the reset.
I was hoping to hear that!
>
>> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:
>>
>> Capabilities: [1c0 v1] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>> AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>> HeaderLog: 00000000 00000001 00000002 00000003
>> RootCmd: CERptEn+ NFERptEn+ FERptEn+
>> RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>> FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>> ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008
>>
>> The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'.
>>
>> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
>> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
>> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
>> Dec 9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] CmpltTO
>> Dec 9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
>>
>>
>> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?
>
> No, it shouldn't. I'm pretty sure we only print those values and they
> don't influence behavior at all.
>
>>> How can I debug this further? I'm getting the "I/O to channel is
>>> blocked" (pci_channel_io_frozen) state reported to the the endpoint
>>> devices I provide driver for. Is there any way of telling if the
>>> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ;
>>> links up and running and similar? Is this information provided by
>>> the Linux kernel somehow?
>>>
>>> For reference, I've experimented with AER inject and my tests showed
>>> that if the same type of error is injected in any other PCI device
>>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
>>> recovered successfully, and I can continue working with the endpoint
>>> devices. In those cases the "I/O channel is in normal state"
>>> (pci_channel_io_normal) state was reported; only error injection
>>> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in
>>> the endpoint drivers I maintain is just calling the
>>> pci_cleanup_aer_uncorrect_error_status() in error handler resume()
>>> callback.
>>>
>>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
>>> which I believe is quite old. At the moment I can not use newer
>>> kernel, but would be prepared to take that step if told that it
>>> would help.
>>
>> It's really not practical for us to help debug a v3.10-based kernel;
>> that's over seven years old and AER handling has changed significantly
>> since then. Also, CentOS is a distro kernel and includes many changes
>> on top of the v3.10 upstream kernel. Those changes might be
>> theoretically open-source, but as a practical matter, the distro is a
>> better place to start for support of their kernel.
>>
>> I followed your advice and started using kernel 5.9.12; hope this is recent enough.
>>
>> With that kernel I see a bit of different behavior than with the
>> 3.10 earlier. Configuration access to the devices on the secondary
>> bus is not OK; all 0xFFFFFFFF. This is what I see after the error
>> was injected into the root port 00:01.0 and secondary bus reset was
>> made as part of recovery:
>>
>> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> Latency: 0, Cache Line Size: 64 bytes
>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
>> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> Interrupt: pin A routed to IRQ 128
>> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
>> !!! Unknown header type 7f
>> Kernel driver in use: pcieport
>>
>> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> !!! Unknown header type 7f
>> Kernel driver in use: pcieport
>>
>> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> !!! Unknown header type 7f
>> Kernel driver in use: pcieport
>>
>> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
>> !!! Unknown header type 7f
>> Kernel driver in use: mrf-pci
>>
>> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED.
>> How come that the 01:00.0 upstream port PCI_COMMAND register is
>> cleared? I guess that explains why the rest of the devices in the
>> chain down to the endpoint at 0e:00.0 are rendered useless.
>>
>> Then, as part of having fun, I made the pci_bus_error_reset() *NOT*
>> do the actual reset of the secondary by just returning 0 before the
>> call to the pci_bus_reset(). In this case no bus reset is done,
>> obviously, and MMIO access to my endpoint(s) is working just fine
>> after the error was injected into the root port; as if nothing
>> happened.
>>
>> Is it expected that in cases of seeing this type of error the
>> secondary bus would be in this kind of state? It looks to me that
>> the problem is not with my endpoints but with the PCIe switch on the
>> CPU board (01:00.0 <--> 02:0?.0).
>
> Hmm. Yeah, I don't see anything in the path that would restore the
> 01:00.0 config space, which seems like a problem. Probably its
> secondary/subordinate bus registers and window registers get reset to
> 0 ("lspci -vv" would show that) so config/mem/io space below it is
> accessible.
I guess the save/restore of the switch registers would be in the core
PCI(e) code; and not a switch vendor/part specific.
I still have interest in getting this to work. With my limited knowledge
of the PCI code base, is there anything I could do to help? Is there a
particular code base area where you expect this should be managed; pci.c
or pcie/aer.c or maybe hotplug/ source? I can try and tinker with it..
Here is the 'lspci -vv' of the 01:00.0 device at the time after the reset:
01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if
00 [Normal decode])
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 128
Region 0: Memory at df200000 (32-bit, non-prefetchable)
[virtual] [size=256K]
Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
I/O behind bridge: 0000f000-00000fff [disabled]
Memory behind bridge: fff00000-000fffff [disabled]
Prefetchable memory behind bridge:
00000000fff00000-00000000000fffff [disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset-
FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [68] Express (v2) Upstream Port, MSI 00
DevCap: MaxPayload 1024 bytes, PhantFunc 0
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+
SlotPowerLimit 75.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit
Latency L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-
NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via message,
ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported,
EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LTR- OBFF Disabled,
AtomicOpsCtl: EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+
Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+
LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a4] Subsystem: PLX Technology, Inc. Device 8725
Capabilities: [100 v1] Device Serial Number ca-87-00-10-b5-df-0e-00
Capabilities: [fb4 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr+
AERCap: First Error Pointer: 1f, ECRCGenCap+ ECRCGenEn-
ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [138 v1] Power Budgeting <?>
Capabilities: [10c v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [148 v1] Virtual Channel
Caps: LPEVC=1 RefClk=100ns PATEntryBits=8
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64+ WRR128- TWRR128-
WRR256-
Ctrl: Enable+ ID=0 ArbSelect=WRR64 TC/VC=ff
Status: NegoPending- InProgress-
Port Arbitration Table <?>
VC1: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed+ WRR32- WRR64- WRR128- TWRR128-
WRR256-
Ctrl: Enable- ID=1 ArbSelect=Fixed TC/VC=00
Status: NegoPending+ InProgress-
Capabilities: [e00 v1] Multicast
McastCap: MaxGroups 64, ECRCRegen+
McastCtl: NumGroups 1, Enable-
McastBAR: IndexPos 0, BaseAddr 0000000000000000
McastReceiveVec: 0000000000000000
McastBlockAllVec: 0000000000000000
McastBlockUntransVec: 0000000000000000
McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr
0000000000000000
Capabilities: [b00 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [b70 v1] Vendor Specific Information: ID=0001
Rev=0 Len=010 <?>
Kernel driver in use: pcieport
And here is the diff of the lspci of 01:00.0 between the before and
after the reset:
01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if
00 [Normal decode])
- Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
+ Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
- Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 128
- Region 0: Memory at df200000 (32-bit, non-prefetchable) [size=256K]
- Bus: primary=01, secondary=02, subordinate=13, sec-latency=0
+ Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual]
[size=256K]
+ Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
I/O behind bridge: 0000f000-00000fff [disabled]
- Memory behind bridge: df000000-df1fffff [size=2M]
- Prefetchable memory behind bridge: 0000000090000000-00000000923fffff
[size=36M]
+ Memory behind bridge: fff00000-000fffff [disabled]
+ Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
[disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
- BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
+ BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
- Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
- Address: 00000000fee00338 Data: 0000
- Masking: 000000ff Pending: 00000000
+ Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+
+ Address: 0000000000000000 Data: 0000
+ Masking: 00000000 Pending: 00000000
Capabilities: [68] Express (v2) Upstream Port, MSI 00
DevCap: MaxPayload 1024 bytes, PhantFunc 0
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 75.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
- MaxPayload 256 bytes, MaxReadReq 128 bytes
+ MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
@@ -376,7 +375,7 @@
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF
Disabled,
+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF
Disabled,
AtomicOpsCtl: EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ Retimer-
2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
@@ -402,7 +401,7 @@
LaneErrStat: 0
Capabilities: [148 v1] Virtual Channel
Caps: LPEVC=1 RefClk=100ns PATEntryBits=8
- Arb: Fixed+ WRR32- WRR64- WRR128-
+ Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
@@ -423,20 +422,20 @@
McastBlockUntransVec: 0000000000000000
McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr 0000000000000000
Capabilities: [b00 v1] Latency Tolerance Reporting
- Max snoop latency: 71680ns
- Max no snoop latency: 71680ns
+ Max snoop latency: 0ns
+ Max no snoop latency: 0ns
Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0
Len=010 <?>
Kernel driver in use: pcieport
Looks like many fields have been zeroed out. Should a bare restoration
of the previous registers values be enough or does some other magic
(i.e. calling PCI functions/performing initialization tasks/etc) needs
to happen to get that port back to sanity?
Thanks!
//hinko
>
> I didn't have much confidence in our error handling to begin with, and
> it's eroding ;)
>
> Bjorn
>
next prev parent reply other threads:[~2020-12-09 20:51 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-04 12:52 Recovering from AER: Uncorrected (Fatal) error Hinko Kocevar
2020-12-04 22:38 ` Bjorn Helgaas
2020-12-09 10:02 ` Hinko Kocevar
2020-12-09 17:40 ` Bjorn Helgaas
2020-12-09 20:31 ` Hinko Kocevar
2020-12-09 20:50 ` Hinko Kocevar [this message]
2020-12-09 21:32 ` Bjorn Helgaas
2020-12-09 22:55 ` Hinko Kocevar
2020-12-10 12:56 ` Hinko Kocevar
2020-12-14 21:23 ` Keith Busch
2020-12-15 12:56 ` Hinko Kocevar
2020-12-15 18:56 ` Keith Busch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d1ad4c01-ce73-dcc0-99b1-5fe45a984800@ess.eu \
--to=hinko.kocevar@ess.eu \
--cc=helgaas@kernel.org \
--cc=linux-pci@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).