linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hinko Kocevar <hinko.kocevar@ess.eu>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: Recovering from AER: Uncorrected (Fatal) error
Date: Wed, 9 Dec 2020 21:50:17 +0100	[thread overview]
Message-ID: <d1ad4c01-ce73-dcc0-99b1-5fe45a984800@ess.eu> (raw)
In-Reply-To: <20201209174009.GA2532473@bjorn-Precision-5520>

Hi Bjorn!

On 12/9/20 6:40 PM, Bjorn Helgaas wrote:
> [non-standard email quoting below, ">" for both my text and Hinko's,
> see https://en.wikipedia.org/wiki/Posting_style#Interleaved_style]
> 

Apologies. Probably due to the web MUA I used. Hopefully this will be 
better now.


> On Wed, Dec 09, 2020 at 10:02:10AM +0000, Hinko Kocevar wrote:
>> ________________________________________
>> From: Bjorn Helgaas <helgaas@kernel.org>
>> Sent: Friday, December 4, 2020 11:38 PM
>> To: Hinko Kocevar
>> Cc: linux-pci@vger.kernel.org
>> Subject: Re: Recovering from AER: Uncorrected (Fatal) error
>>
>> On Fri, Dec 04, 2020 at 12:52:18PM +0000, Hinko Kocevar wrote:
>>> Hi,
>>>
>>> I'm trying to figure out how to recover from Uncorrected (Fatal) error that is seen by the PCI root on a CPU PCIe controller:
>>>
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>>> Dec  1 02:16:37 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout     (First)
>>>
>>> This is the complete PCI device tree that I'm working with:
>>>
>>> $ sudo /usr/local/bin/pcicrawler -t
>>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>>   ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>>   │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>>   │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>>   │  │     ├─04:00.0 downstream_port, slot 10, power: Off
>>>   │  │     ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:02.0 downstream_port, slot 9, power: Off
>>>   │  │     ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─09:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:09.0 downstream_port, slot 11, power: Off
>>>   │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>>   │  │     │  └─0b:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>>   │  │     ├─04:0b.0 downstream_port, slot 12, power: Off
>>>   │  │     ├─04:10.0 downstream_port, slot 8, power: Off
>>>   │  │     ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>>>   │  │     │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>>>   │  │     └─04:12.0 downstream_port, slot 7, power: Off
>>>   │  ├─02:02.0 downstream_port, slot 2
>>>   │  ├─02:08.0 downstream_port, slot 8
>>>   │  ├─02:09.0 downstream_port, slot 9, power: Off
>>>   │  └─02:0a.0 downstream_port, slot 10
>>>   ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>   └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>>
>>> 00:01.0 is on a CPU board, The 03:00.0 and everything below that is
>>> not on a CPU board (working with a micro TCA system here). I'm
>>> working with FPGA based devices seen as endpoints here.  After the
>>> error all the endpoints in the above list are not able to talk to
>>> CPU anymore; register reads return 0xFFFFFFFF. At the same time PCI
>>> config space looks sane and accessible for those devices.
>>
>> This could be caused by a reset.  We probably do a secondary bus reset
>> on the Root Port, which resets all the devices below it.  After the
>> reset, config space of those downstream devices would be accessible,
>> but the PCI_COMMAND register may be cleared which means the device
>> wouldn't respond to MMIO reads.
>>
>> After adding some printk()'s to the PCIe code that deals with the reset, I can see that the pci_bus_error_reset() calls the pci_bus_reset() due to bus slots list check for empty returns true. So the the secondary bus is being reset.
>>
>> None of this explains the original problem of the Completion Timeout,
>> of course.  The error source of 0x8 (00:01.0) is the root port, which
>> makes sense if it issued a request transaction and never got the
>> completion.  The root port *should* log the header of the request and
>> we should print it, but it looks like we didn't.  "lspci -vv" of the
>> device would show whether it's capable of this logging.
>>
>> I'm mostly concerned about the fact that the secondary bus is
>> rendered useless after the reset/recovery. AIUI, this should not
>> happen.
> 
> It should not.  The secondary bus should still work after the reset.


I was hoping to hear that!


> 
>> After upgrading my pciutils I get a bit more output from lspci -vv for the Root port. It seems it is capable of logging the errors:
>>
>>          Capabilities: [1c0 v1] Advanced Error Reporting
>>                  UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                  UESvrt: DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>>                  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>                  AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>>                          MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>                  HeaderLog: 00000000 00000001 00000002 00000003
>>                  RootCmd: CERptEn+ NFERptEn+ FERptEn+
>>                  RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>>                           FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>>                  ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0008
>>
>> The above was provoked with the aer-inject of the same error as seen in initial report, with bogus 'HEADER_LOG 0 1 2 3'.
>>
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>> Dec  9 10:25:19 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] CmpltTO
>> Dec  9 10:25:19 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
>>
>>
>> Question: would using different (real?) HEADER_LOG values affect the recovery behaviour?
> 
> No, it shouldn't.  I'm pretty sure we only print those values and they
> don't influence behavior at all.
> 
>>> How can I debug this further? I'm getting the "I/O to channel is
>>> blocked" (pci_channel_io_frozen) state reported to the the endpoint
>>> devices I provide driver for.  Is there any way of telling if the
>>> PCI switch devices between 00:01.0 ... 06:00.0 have all recovered ;
>>> links up and running and similar? Is this information provided by
>>> the Linux kernel somehow?
>>>
>>> For reference, I've experimented with AER inject and my tests showed
>>> that if the same type of error is injected in any other PCI device
>>> in the path 01:00.0 ... 06:00.0, IOW not into 00:01.0, the link is
>>> recovered successfully, and I can continue working with the endpoint
>>> devices. In those cases the "I/O channel is in normal state"
>>> (pci_channel_io_normal) state was reported; only error injection
>>> into 00:01.0 reports pci_channel_io_frozen state. Recovery code in
>>> the endpoint drivers I maintain is just calling the
>>> pci_cleanup_aer_uncorrect_error_status() in error handler resume()
>>> callback.
>>>
>>> FYI, this is on 3.10.0-1160.6.1.el7.x86_64.debug CentOS7 kernel
>>> which I believe is quite old. At the moment I can not use newer
>>> kernel, but would be prepared to take that step if told that it
>>> would help.
>>
>> It's really not practical for us to help debug a v3.10-based kernel;
>> that's over seven years old and AER handling has changed significantly
>> since then.  Also, CentOS is a distro kernel and includes many changes
>> on top of the v3.10 upstream kernel.  Those changes might be
>> theoretically open-source, but as a practical matter, the distro is a
>> better place to start for support of their kernel.
>>
>> I followed your advice and started using kernel 5.9.12; hope this is recent enough.
>>
>> With that kernel I see a bit of different behavior than with the
>> 3.10 earlier. Configuration access to the devices on the secondary
>> bus is not OK; all 0xFFFFFFFF. This is what I see after the error
>> was injected into the root port 00:01.0 and secondary bus reset was
>> made as part of recovery:
>>
>> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
>> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Latency: 0, Cache Line Size: 64 bytes
>> 01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode])
>> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Interrupt: pin A routed to IRQ 128
>> 02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: pcieport
>>
>> 0e:00.0 Signal processing controller: Xilinx Corporation Device 7011 (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
>> 	Kernel driver in use: mrf-pci
>>
>> The return code from aer_root_reset() is PCI_ERS_RESULT_RECOVERED.
>> How come that the 01:00.0 upstream port PCI_COMMAND register is
>> cleared? I guess that explains why the rest of the devices in the
>> chain down to the endpoint at 0e:00.0 are rendered useless.
>>
>> Then, as part of having fun, I made the pci_bus_error_reset() *NOT*
>> do the actual reset of the secondary by just returning 0 before the
>> call to the pci_bus_reset(). In this case no bus reset is done,
>> obviously, and MMIO access to my endpoint(s) is working just fine
>> after the error was injected into the root port; as if nothing
>> happened.
>>
>> Is it expected that in cases of seeing this type of error the
>> secondary bus would be in this kind of state? It looks to me that
>> the problem is not with my endpoints but with the PCIe switch on the
>> CPU board (01:00.0 <--> 02:0?.0).
> 
> Hmm.  Yeah, I don't see anything in the path that would restore the
> 01:00.0 config space, which seems like a problem.  Probably its
> secondary/subordinate bus registers and window registers get reset to
> 0 ("lspci -vv" would show that) so config/mem/io space below it is
> accessible.

I guess the save/restore of the switch registers would be in the core 
PCI(e) code; and not a switch vendor/part specific.

I still have interest in getting this to work. With my limited knowledge 
of the PCI code base, is there anything I could do to help? Is there a 
particular code base area where you expect this should be managed; pci.c 
or pcie/aer.c or maybe hotplug/ source? I can try and tinker with it..


Here is the 'lspci -vv' of the 01:00.0 device at the time after the reset:

01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 
00 [Normal decode])
         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- 
ParErr- Stepping- SERR- FastB2B- DisINTx-
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- >SERR- <PERR- INTx-
         Interrupt: pin A routed to IRQ 128
         Region 0: Memory at df200000 (32-bit, non-prefetchable) 
[virtual] [size=256K]
         Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
         I/O behind bridge: 0000f000-00000fff [disabled]
         Memory behind bridge: fff00000-000fffff [disabled]
         Prefetchable memory behind bridge: 
00000000fff00000-00000000000fffff [disabled]
         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- <SERR- <PERR-
         BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- 
FastB2B-
                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
         Capabilities: [40] Power Management version 3
                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+
                 Address: 0000000000000000  Data: 0000
                 Masking: 00000000  Pending: 00000000
         Capabilities: [68] Express (v2) Upstream Port, MSI 00
                 DevCap: MaxPayload 1024 bytes, PhantFunc 0
                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ 
SlotPowerLimit 75.000W
                 DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                         MaxPayload 128 bytes, MaxReadReq 128 bytes
                 DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ 
AuxPwr- TransPend-
                 LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit 
Latency L1 <4us
                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                 LnkCtl: ASPM Disabled; Disabled- CommClk-
                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                 LnkSta: Speed 8GT/s (ok), Width x8 (ok)
                         TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                 DevCap2: Completion Timeout: Not Supported, TimeoutDis- 
NROPrPrP- LTR+
                          10BitTagComp- 10BitTagReq- OBFF Via message, 
ExtFmt- EETLPPrefix-
                          EmergencyPowerReduction Not Supported, 
EmergencyPowerReductionInit-
                          FRS-
                          AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- 
LTR- OBFF Disabled,
                          AtomicOpsCtl: EgressBlck-
                 LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ 
Retimer- 2Retimers- DRS-
                 LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- 
SpeedDis-
                          Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
                          Compliance De-emphasis: -6dB
                 LnkSta2: Current De-emphasis Level: -6dB, 
EqualizationComplete+ EqualizationPhase1+
                          EqualizationPhase2+ EqualizationPhase3+ 
LinkEqualizationRequest-
                          Retimer- 2Retimers- CrosslinkRes: unsupported
         Capabilities: [a4] Subsystem: PLX Technology, Inc. Device 8725
         Capabilities: [100 v1] Device Serial Number ca-87-00-10-b5-df-0e-00
         Capabilities: [fb4 v1] Advanced Error Reporting
                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
AdvNonFatalErr+
                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
AdvNonFatalErr+
                 AERCap: First Error Pointer: 1f, ECRCGenCap+ ECRCGenEn- 
ECRCChkCap+ ECRCChkEn-
                         MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                 HeaderLog: 00000000 00000000 00000000 00000000
         Capabilities: [138 v1] Power Budgeting <?>
         Capabilities: [10c v1] Secondary PCI Express
                 LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                 LaneErrStat: 0
         Capabilities: [148 v1] Virtual Channel
                 Caps:   LPEVC=1 RefClk=100ns PATEntryBits=8
                 Arb:    Fixed- WRR32- WRR64- WRR128-
                 Ctrl:   ArbSelect=Fixed
                 Status: InProgress-
                 VC0:    Caps:   PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
                         Arb:    Fixed- WRR32- WRR64+ WRR128- TWRR128- 
WRR256-
                         Ctrl:   Enable+ ID=0 ArbSelect=WRR64 TC/VC=ff
                         Status: NegoPending- InProgress-
                         Port Arbitration Table <?>
                 VC1:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                         Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128- 
WRR256-
                         Ctrl:   Enable- ID=1 ArbSelect=Fixed TC/VC=00
                         Status: NegoPending+ InProgress-
         Capabilities: [e00 v1] Multicast
                 McastCap: MaxGroups 64, ECRCRegen+
                 McastCtl: NumGroups 1, Enable-
                 McastBAR: IndexPos 0, BaseAddr 0000000000000000
                 McastReceiveVec:      0000000000000000
                 McastBlockAllVec:     0000000000000000
                 McastBlockUntransVec: 0000000000000000
                 McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr 
0000000000000000
         Capabilities: [b00 v1] Latency Tolerance Reporting
                 Max snoop latency: 0ns
                 Max no snoop latency: 0ns
         Capabilities: [b70 v1] Vendor Specific Information: ID=0001 
Rev=0 Len=010 <?>
         Kernel driver in use: pcieport


And here is the diff of the lspci of 01:00.0 between the before and 
after the reset:

  01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 
00 [Normal decode])
-	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
+	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
  	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- >SERR- <PERR- INTx-
-	Latency: 0, Cache Line Size: 64 bytes
  	Interrupt: pin A routed to IRQ 128
-	Region 0: Memory at df200000 (32-bit, non-prefetchable) [size=256K]
-	Bus: primary=01, secondary=02, subordinate=13, sec-latency=0
+	Region 0: Memory at df200000 (32-bit, non-prefetchable) [virtual] 
[size=256K]
+	Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
  	I/O behind bridge: 0000f000-00000fff [disabled]
-	Memory behind bridge: df000000-df1fffff [size=2M]
-	Prefetchable memory behind bridge: 0000000090000000-00000000923fffff 
[size=36M]
+	Memory behind bridge: fff00000-000fffff [disabled]
+	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff 
[disabled]
  	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- <SERR- <PERR-
-	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
+	BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
  		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
  	Capabilities: [40] Power Management version 3
  		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
  		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
-	Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
-		Address: 00000000fee00338  Data: 0000
-		Masking: 000000ff  Pending: 00000000
+	Capabilities: [48] MSI: Enable- Count=1/8 Maskable+ 64bit+
+		Address: 0000000000000000  Data: 0000
+		Masking: 00000000  Pending: 00000000
  	Capabilities: [68] Express (v2) Upstream Port, MSI 00
  		DevCap:	MaxPayload 1024 bytes, PhantFunc 0
  			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 75.000W
  		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
  			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
-			MaxPayload 256 bytes, MaxReadReq 128 bytes
+			MaxPayload 128 bytes, MaxReadReq 128 bytes
  		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
  		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
  			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
@@ -376,7 +375,7 @@
  			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
  			 FRS-
  			 AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
-		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF 
Disabled,
+		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF 
Disabled,
  			 AtomicOpsCtl: EgressBlck-
  		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ Retimer- 
2Retimers- DRS-
  		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
@@ -402,7 +401,7 @@
  		LaneErrStat: 0
  	Capabilities: [148 v1] Virtual Channel
  		Caps:	LPEVC=1 RefClk=100ns PATEntryBits=8
-		Arb:	Fixed+ WRR32- WRR64- WRR128-
+		Arb:	Fixed- WRR32- WRR64- WRR128-
  		Ctrl:	ArbSelect=Fixed
  		Status:	InProgress-
  		VC0:	Caps:	PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
@@ -423,20 +422,20 @@
  		McastBlockUntransVec: 0000000000000000
  		McastOverlayBAR: OverlaySize 0 (disabled), BaseAddr 0000000000000000
  	Capabilities: [b00 v1] Latency Tolerance Reporting
-		Max snoop latency: 71680ns
-		Max no snoop latency: 71680ns
+		Max snoop latency: 0ns
+		Max no snoop latency: 0ns
  	Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 
Len=010 <?>
  	Kernel driver in use: pcieport


Looks like many fields have been zeroed out. Should a bare restoration 
of the previous registers values be enough or does some other magic 
(i.e. calling PCI functions/performing initialization tasks/etc) needs 
to happen to get that port back to sanity?


Thanks!
//hinko


> 
> I didn't have much confidence in our error handling to begin with, and
> it's eroding ;)
> 
> Bjorn
> 

  parent reply	other threads:[~2020-12-09 20:51 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-04 12:52 Recovering from AER: Uncorrected (Fatal) error Hinko Kocevar
2020-12-04 22:38 ` Bjorn Helgaas
2020-12-09 10:02   ` Hinko Kocevar
2020-12-09 17:40     ` Bjorn Helgaas
2020-12-09 20:31       ` Hinko Kocevar
2020-12-09 20:50       ` Hinko Kocevar [this message]
2020-12-09 21:32         ` Bjorn Helgaas
2020-12-09 22:55           ` Hinko Kocevar
2020-12-10 12:56             ` Hinko Kocevar
2020-12-14 21:23             ` Keith Busch
2020-12-15 12:56               ` Hinko Kocevar
2020-12-15 18:56                 ` Keith Busch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d1ad4c01-ce73-dcc0-99b1-5fe45a984800@ess.eu \
    --to=hinko.kocevar@ess.eu \
    --cc=helgaas@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).