linux-pci.vger.kernel.org archive mirror
* Dmesg filled with "AER: Corrected error received"
@ 2015-12-18 10:30 David Henningsson
  2015-12-22 21:57 ` Bjorn Helgaas
  2015-12-29 15:58 ` Bjorn Helgaas
  0 siblings, 2 replies; 6+ messages in thread
From: David Henningsson @ 2015-12-18 10:30 UTC (permalink / raw)
  To: linux-pci, bhelgaas

[-- Attachment #1: Type: text/plain, Size: 1283 bytes --]

Hi Linux PCI maintainers,

My dmesg gets filled with a few lines repeated over and over again:

pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
pcieport 0000:00:1c.0: can't find device of ID00e0
pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00000001/00002000
pcieport 0000:00:1c.0:    [ 0] Receiver Error

This happens 10-30 times per second (!), so dmesg fills up quickly. The 
bug is present in both vanilla and Ubuntu kernels.
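
For readers decoding the log lines above: a minimal sketch in plain C, assuming the standard PCIe layout of a requester ID (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and the standard bit positions of the AER correctable error status and mask registers. The names `decode_requester_id`, `unmasked_errors` and `struct bdf` are made up for this illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Bits of the AER Correctable Error Status/Mask registers. */
#define COR_RCVR     0x00000001u /* bit 0: Receiver Error */
#define COR_ADV_NFAT 0x00002000u /* bit 13: Advisory Non-Fatal Error */

/* Hypothetical container for a decoded bus/device/function triple. */
struct bdf {
    unsigned bus;
    unsigned dev;
    unsigned fn;
};

/* Split a 16-bit requester ID into bus, device and function numbers. */
static struct bdf decode_requester_id(uint16_t id)
{
    struct bdf r;

    r.bus = (id >> 8) & 0xffu;
    r.dev = (id >> 3) & 0x1fu;
    r.fn  = id & 0x7u;
    return r;
}

/* Errors that are logged in the status register and not masked off. */
static uint32_t unmasked_errors(uint32_t status, uint32_t mask)
{
    return status & ~mask;
}
```

Under these assumptions, `decode_requester_id(0x00e0)` gives bus 0x00, device 0x1c, function 0, which is the root port 00:1c.0 itself, and `unmasked_errors(0x00000001, 0x00002000)` leaves only the Receiver Error bit, matching the "[ 0] Receiver Error" line.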

I'm happy to try to help by providing more info as requested, and I'm 
also able to build kernels to test patches (although that might take 
some time, especially during the upcoming holidays).

Computer: Dell Inspiron 13-7359
Kernel version: Linux version 4.4.0-040400rc5-generic (kernel@gloin) 
(gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) ) #201512140221 SMP 
Mon Dec 14 02:23:36 UTC 2015
CPU: Skylake i3-6100U
Lspci: attached
Dmesg: I'm attaching an extract from kern.log which shows the most 
recent boot and a few seconds thereafter.
Downstream bug report: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

Best Regards,
   David

[-- Attachment #2: lspci.txt --]
[-- Type: text/plain, Size: 15567 bytes --]

00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 08)
	Subsystem: Dell Device 06fd
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
	Latency: 0
	Capabilities: [e0] Vendor Specific Information: Len=10 <?>

00:02.0 VGA compatible controller: Intel Corporation Sky Lake Integrated Graphics (rev 07) (prog-if 00 [VGA controller])
	DeviceName:  Onboard IGD
	Subsystem: Dell Device 06fd
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 277
	Region 0: Memory at de000000 (64-bit, non-prefetchable) [size=16M]
	Region 2: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at f000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [40] Vendor Specific Information: Len=0c <?>
	Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
	Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00018  Data: 0000
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] #1b
	Capabilities: [200 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable-, Smallest Translation Unit: 00
	Capabilities: [300 v1] #13
	Kernel driver in use: i915

00:04.0 Signal processing controller: Intel Corporation Device 1903 (rev 08)
	Subsystem: Dell Device 06fd
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at df120000 (64-bit, non-prefetchable) [size=32K]
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Capabilities: [d0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [e0] Vendor Specific Information: Len=0c <?>
	Kernel driver in use: proc_thermal

00:14.0 USB controller: Intel Corporation Device 9d2f (rev 21) (prog-if 30 [XHCI])
	Subsystem: Dell Device 06fd
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 123
	Region 0: Memory at df110000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [70] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
		Address: 00000000fee00258  Data: 0000
	Kernel driver in use: xhci_hcd

00:15.0 Signal processing controller: Intel Corporation Device 9d60 (rev 21)
	Subsystem: Dell Device 06fd
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at df137000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] Vendor Specific Information: Len=14 <?>
	Kernel driver in use: intel-lpss

00:15.1 Signal processing controller: Intel Corporation Device 9d61 (rev 21)
	Subsystem: Dell Device 06fd
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 17
	Region 0: Memory at df136000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] Vendor Specific Information: Len=14 <?>
	Kernel driver in use: intel-lpss

00:16.0 Communication controller: Intel Corporation Device 9d3a (rev 21)
	Subsystem: Dell Device 06fd
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 290
	Region 0: Memory at df135000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [8c] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee002b8  Data: 0000
	Kernel driver in use: mei_me

00:17.0 SATA controller: Intel Corporation Device 9d03 (rev 21) (prog-if 01 [AHCI 1.0])
	Subsystem: Dell Device 06fd
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 276
	Region 0: Memory at df130000 (32-bit, non-prefetchable) [size=8K]
	Region 1: Memory at df134000 (32-bit, non-prefetchable) [size=256]
	Region 2: I/O ports at f090 [size=8]
	Region 3: I/O ports at f080 [size=4]
	Region 4: I/O ports at f060 [size=32]
	Region 5: Memory at df133000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00278  Data: 0000
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Kernel driver in use: ahci

00:1c.0 PCI bridge: Intel Corporation Device 9d14 (rev f1) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 122
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000e000-0000efff
	Memory behind bridge: df000000-df0fffff
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #5, Speed 8GT/s, Width x1, ASPM L1, Exit Latency L0s <1us, L1 <16us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #8, PowerLimit 10.000W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet+ LinkState+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00238  Data: 0000
	Capabilities: [90] Subsystem: Dell Device 06fd
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [140 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [200 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=40us PortTPowerOnTime=10us
	Capabilities: [220 v1] #19
	Kernel driver in use: pcieport

00:1f.0 ISA bridge: Intel Corporation Device 9d48 (rev 21)
	Subsystem: Dell Device 06fd
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0

00:1f.2 Memory controller: Intel Corporation Device 9d21 (rev 21)
	Subsystem: Dell Device 06fd
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Region 0: Memory at df12c000 (32-bit, non-prefetchable) [disabled] [size=16K]

00:1f.3 Audio device: Intel Corporation Device 9d70 (rev 21)
	Subsystem: Dell Device 06fd
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32
	Interrupt: pin A routed to IRQ 291
	Region 0: Memory at df128000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at df100000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee002f8  Data: 0000
	Kernel driver in use: snd_hda_intel

00:1f.4 SMBus: Intel Corporation Device 9d23 (rev 21)
	Subsystem: Dell Device 06fd
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at df132000 (64-bit, non-prefetchable) [size=256]
	Region 4: I/O ports at f040 [size=32]

01:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8723BE PCIe Wireless Network Adapter
	Subsystem: Realtek Semiconductor Co., Ltd. Device 8739
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: I/O ports at e000 [size=256]
	Region 2: Memory at df000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Via message/WAKE#
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR+, OBFF Disabled
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [140 v1] Device Serial Number 00-23-b7-fe-ff-4c-e0-00
	Capabilities: [150 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Capabilities: [158 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=150us PortTPowerOnTime=150us
	Kernel driver in use: rtl8723be


[-- Attachment #3: dmesg.txt.tar.gz --]
[-- Type: application/gzip, Size: 19747 bytes --]


* Re: Dmesg filled with "AER: Corrected error received"
  2015-12-18 10:30 Dmesg filled with "AER: Corrected error received" David Henningsson
@ 2015-12-22 21:57 ` Bjorn Helgaas
  2015-12-23  8:06   ` David Henningsson
  2015-12-29 15:58 ` Bjorn Helgaas
  1 sibling, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2015-12-22 21:57 UTC (permalink / raw)
  To: David Henningsson; +Cc: linux-pci, bhelgaas

Hi David,

On Fri, Dec 18, 2015 at 11:30:33AM +0100, David Henningsson wrote:
> Hi Linux PCI maintainers,
> 
> My dmesg gets filled with a few lines repeated over and over again:
> 
> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> pcieport 0000:00:1c.0: can't find device of ID00e0
> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
> pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00000001/00002000
> pcieport 0000:00:1c.0:    [ 0] Receiver Error
> 
> This happens 10-30 times per second (!), so dmesg fills up quickly.
> The bug is present in both vanilla and Ubuntu kernels.
> 
> I'm happy to try to help by providing more info as requested, and
> I'm also able to build kernels to test patches (although that might
> take some time, especially during the upcoming holidays).
> 
> Computer: Dell Inspiron 13-7359
> Kernel version: Linux version 4.4.0-040400rc5-generic (kernel@gloin)
> (gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) ) #201512140221
> SMP Mon Dec 14 02:23:36 UTC 2015
> CPU: Skylake i3-6100U
> Lspci: attached
> Dmesg: I'm attaching an extract from kern.log which shows the most
> recent boot and a few seconds thereafter.
> Downstream bug report:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

Thanks a lot for the report and logs.  Do you happen to know whether this
is a regression, and if so, when it first appeared?

Bjorn


* Re: Dmesg filled with "AER: Corrected error received"
  2015-12-22 21:57 ` Bjorn Helgaas
@ 2015-12-23  8:06   ` David Henningsson
  0 siblings, 0 replies; 6+ messages in thread
From: David Henningsson @ 2015-12-23  8:06 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, bhelgaas



On 2015-12-22 22:57, Bjorn Helgaas wrote:
> Hi David,
>
> On Fri, Dec 18, 2015 at 11:30:33AM +0100, David Henningsson wrote:
>> Hi Linux PCI maintainers,
>>
>> My dmesg gets filled with a few lines repeated over and over again:
>>
>> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
>> pcieport 0000:00:1c.0: can't find device of ID00e0
>> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
>> pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
>> pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00000001/00002000
>> pcieport 0000:00:1c.0:    [ 0] Receiver Error
>>
>> This happens 10-30 times per second (!), so dmesg fills up quickly.
>> The bug is present in both vanilla and Ubuntu kernels.
>>
>> I'm happy to try to help by providing more info as requested, and
>> I'm also able to build kernels to test patches (although that might
>> take some time, especially during the upcoming holidays).
>>
>> Computer: Dell Inspiron 13-7359
>> Kernel version: Linux version 4.4.0-040400rc5-generic (kernel@gloin)
>> (gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) ) #201512140221
>> SMP Mon Dec 14 02:23:36 UTC 2015
>> CPU: Skylake i3-6100U
>> Lspci: attached
>> Dmesg: I'm attaching an extract from kern.log which shows the most
>> recent boot and a few seconds thereafter.
>> Downstream bug report:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173
>
> Thanks a lot for the report and logs.  Do you happen to know whether this
> is a regression, and if so, when it first appeared?

Hi Bjorn and thanks for looking at it,

I'm not sure how far back I can go and still have Skylake working (this 
is a new laptop), but I just tried booting a 4.0.9 vanilla kernel and 
the problem is present there as well.

If anything, it seems to be at an even higher rate in 4.0.9, because in 
4.0.9, I get reports about "printk messages dropped", which I don't get 
in 4.4.0-rc5.

-- 
David Henningsson, Canonical Ltd.
https://launchpad.net/~diwic


* Re: Dmesg filled with "AER: Corrected error received"
  2015-12-18 10:30 Dmesg filled with "AER: Corrected error received" David Henningsson
  2015-12-22 21:57 ` Bjorn Helgaas
@ 2015-12-29 15:58 ` Bjorn Helgaas
  2015-12-30 12:52   ` David Henningsson
  2016-01-15 23:21   ` Bjorn Helgaas
  1 sibling, 2 replies; 6+ messages in thread
From: Bjorn Helgaas @ 2015-12-29 15:58 UTC (permalink / raw)
  To: David Henningsson; +Cc: linux-pci, bhelgaas

On Fri, Dec 18, 2015 at 11:30:33AM +0100, David Henningsson wrote:
> Hi Linux PCI maintainers,
> 
> My dmesg gets filled with a few lines repeated over and over again:
> 
> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> pcieport 0000:00:1c.0: can't find device of ID00e0
> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
> pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00000001/00002000
> pcieport 0000:00:1c.0:    [ 0] Receiver Error
> 
> This happens 10-30 times per second (!), so dmesg fills up quickly.
> The bug is present in both vanilla and Ubuntu kernels.

This is a pretty obvious bug in our AER code.  We normally clear
correctable errors by writing the PCI_ERR_COR_STATUS register in
handle_error_source().  The execution path looks like this:

  aer_isr_one_error
    aer_print_port_info
    if (find_source_device())
      aer_process_err_devices
        handle_error_source
          pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)

In this case, find_source_device() printed "can't find device of
ID00e0" [sic] and returned false, so we don't call
aer_process_err_devices().  The error is never cleared, so
we discover it again and again.
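
The control flow described above can be sketched as a toy model in plain C. This is only an illustration under stated assumptions, not the kernel code: `struct toy_port` and the `toy_*` functions are invented stand-ins for the root port, `handle_error_source()` and `aer_isr_one_error()`, and a single field models the write-1-to-clear behaviour of PCI_ERR_COR_STATUS.

```c
#include <stdbool.h>
#include <stdint.h>

#define COR_RCVR 0x00000001u /* Receiver Error bit in the correctable status */

/* Toy stand-in for a root port; not the kernel's data structures. */
struct toy_port {
    uint32_t cor_status;   /* models PCI_ERR_COR_STATUS */
    bool     source_found; /* models the result of find_source_device() */
};

/* Models handle_error_source(): writing the set bits back clears them. */
static void toy_handle_error_source(struct toy_port *p)
{
    p->cor_status = 0;
}

/* Models aer_isr_one_error(): clearing happens only if the source is found. */
static void toy_isr_one_error(struct toy_port *p)
{
    if (p->source_found)
        toy_handle_error_source(p);
    /* Otherwise "can't find device" is printed and the status stays set. */
}

/* Count how often the same error is reported, giving up after 'cap' rounds. */
static int toy_count_reports(struct toy_port *p, int cap)
{
    int reports = 0;

    while (p->cor_status != 0 && reports < cap) {
        reports++;
        toy_isr_one_error(p);
    }
    return reports;
}
```

In this model, a port with `source_found` set to true reports the error once and ends with a clear status, while a port with `source_found` set to false keeps reporting until the cap is reached, which mirrors the repeated messages in the log.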

I'll work on fixing this.  Incidentally, there's another report
with similar symptoms here:

  https://bugzilla.kernel.org/show_bug.cgi?id=109691

Bjorn


* Re: Dmesg filled with "AER: Corrected error received"
  2015-12-29 15:58 ` Bjorn Helgaas
@ 2015-12-30 12:52   ` David Henningsson
  2016-01-15 23:21   ` Bjorn Helgaas
  1 sibling, 0 replies; 6+ messages in thread
From: David Henningsson @ 2015-12-30 12:52 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, bhelgaas

Hi,

Indeed, booting with pci=noaer (as suggested in the other bug report) works
around this issue as well. I'll use that for the time being.

Thanks for working on it!

// David

On 2015-12-29 16:58, Bjorn Helgaas wrote:
> On Fri, Dec 18, 2015 at 11:30:33AM +0100, David Henningsson wrote:
>> Hi Linux PCI maintainers,
>>
>> My dmesg gets filled with a few lines repeated over and over again:
>>
>> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
>> pcieport 0000:00:1c.0: can't find device of ID00e0
>> pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
>> pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
>> pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00000001/00002000
>> pcieport 0000:00:1c.0:    [ 0] Receiver Error
>>
>> This happens 10-30 times per second (!), so dmesg fills up quickly.
>> The bug is present in both vanilla and Ubuntu kernels.
>
> This is a pretty obvious bug in our AER code.  We normally clear
> correctable errors by writing the PCI_ERR_COR_STATUS register in
> handle_error_source().  The execution path looks like this:
>
>    aer_isr_one_error
>      aer_print_port_info
>      if (find_source_device())
>        aer_process_err_devices
>          handle_error_source
>            pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)
>
> In this case, find_source_device() printed "can't find device of
> ID00e0" [sic] and returned false, so we don't call
> aer_process_err_devices().  The error is never cleared, so
> we discover it again and again.
>
> I'll work on fixing this.  Incidentally, there's another report
> with similar symptoms here:
>
>    https://bugzilla.kernel.org/show_bug.cgi?id=109691
>
> Bjorn
>

-- 
David Henningsson, Canonical Ltd.
https://launchpad.net/~diwic


* Re: Dmesg filled with "AER: Corrected error received"
  2015-12-29 15:58 ` Bjorn Helgaas
  2015-12-30 12:52   ` David Henningsson
@ 2016-01-15 23:21   ` Bjorn Helgaas
  1 sibling, 0 replies; 6+ messages in thread
From: Bjorn Helgaas @ 2016-01-15 23:21 UTC (permalink / raw)
  To: David Henningsson; +Cc: linux-pci, bhelgaas

On Tue, Dec 29, 2015 at 09:58:22AM -0600, Bjorn Helgaas wrote:
> On Fri, Dec 18, 2015 at 11:30:33AM +0100, David Henningsson wrote:
> > Hi Linux PCI maintainers,
> > 
> > My dmesg gets filled with a few lines repeated over and over again:
> > 
> > pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> > pcieport 0000:00:1c.0: can't find device of ID00e0
> > pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> > pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
> > pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00000001/00002000
> > pcieport 0000:00:1c.0:    [ 0] Receiver Error
> > 
> > This happens 10-30 times per second (!), so dmesg fills up quickly.
> > The bug is present in both vanilla and Ubuntu kernels.
> 
> This is a pretty obvious bug in our AER code.  We normally clear
> correctable errors by writing the PCI_ERR_COR_STATUS register in
> handle_error_source().  The execution path looks like this:
> 
>   aer_isr_one_error
>     aer_print_port_info
>     if (find_source_device())
>       aer_process_err_devices
>         handle_error_source
>           pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)
> 
> In this case, find_source_device() printed "can't find device of
> ID00e0" [sic] and returned false, so we don't call
> aer_process_err_devices().  The error is never cleared, so
> we discover it again and again.
> 
> I'll work on fixing this.  Incidentally, there's another report
> with similar symptoms here:
> 
>   https://bugzilla.kernel.org/show_bug.cgi?id=109691

I've thought about this problem a bit, but realistically I don't have
time to do the fix I'd like to do, which would involve reading the AER
status registers in the ISR and also *clearing* the error indication,
also in the ISR.  I think the current design, where we read bits of
the status in various places, and clear it in yet other locations, is
error-prone.
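
The idea of reading and clearing the status in one place might look roughly like the following toy model in plain C. It is a sketch under stated assumptions and not a proposed patch: `struct toy_rp`, `toy_read_and_clear()` and `toy_isr_v2()` are hypothetical names, and one field models the write-1-to-clear correctable status register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a root port; not the kernel's data structures. */
struct toy_rp {
    uint32_t cor_status;   /* models PCI_ERR_COR_STATUS */
    bool     source_found; /* models the result of find_source_device() */
};

/* Take a snapshot of the status and clear exactly the bits that were read. */
static uint32_t toy_read_and_clear(struct toy_rp *p)
{
    uint32_t snapshot = p->cor_status;

    p->cor_status &= ~snapshot; /* models writing the snapshot back */
    return snapshot;
}

/*
 * Models an ISR that reads and clears in one step. Later reporting would
 * work from the returned snapshot, so a failed source lookup can no longer
 * leave the error set.
 */
static uint32_t toy_isr_v2(struct toy_rp *p)
{
    return toy_read_and_clear(p);
}
```

With this structure the status ends up clear after the first call whether or not `source_found` is true, and a second call returns 0 because nothing is pending.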

Anybody else who is interested should feel free to take a crack at it.

Bjorn


end of thread, other threads:[~2016-01-15 23:21 UTC | newest]
