From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1750842AbbG1V3v (ORCPT ); Tue, 28 Jul 2015 17:29:51 -0400
Received: from mail-io0-f178.google.com ([209.85.223.178]:36304 "EHLO
        mail-io0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750723AbbG1V3t (ORCPT ); Tue, 28 Jul 2015 17:29:49 -0400
Date: Tue, 28 Jul 2015 16:29:44 -0500
From: Bjorn Helgaas
To: Duc Dang
Cc: Tanmay Inamdar, "linux-pci@vger.kernel.org", linux-arm, "linux-kernel@vger.kernel.org"
Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
Message-ID: <20150728212944.GA12958@google.com>
References: <20150724224258.GA23990@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas wrote:
> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang wrote:
> >> Hi Bjorn,
> >>
> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
> >>>
> >>> I regularly see faults like this on an APM X-Gene:
> >>>
> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >>> 32 KB ICACHE, 32 KB DCACHE
> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >>> ...
> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >>> Internal error: : 96000010 [#1] SMP
> >>> Modules linked in:
> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >>> Hardware name: APM X-Gene Mustang board (DT)
> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >>> PC is at pci_generic_config_read32+0x4c/0xb8
> >>> LR is at pci_generic_config_read32+0x40/0xb8
> >>> pc : [] lr : [] pstate: 600001c5
> >>> ...
> >>> Call trace:
> >>> [] pci_generic_config_read32+0x4c/0xb8
> >>> [] pci_user_read_config_byte+0x60/0xc4
> >>> [] pci_read_config+0x15c/0x238
> >>> [] sysfs_kf_bin_read+0x68/0xa0
> >>> [] kernfs_fop_read+0x9c/0x1ac
> >>> [] __vfs_read+0x44/0x128
> >>> [] vfs_read+0x84/0x144
> >>> [] SyS_read+0x50/0xb0
> >>
> >> The log shows kernel gets an exception when trying to access Mellanox
> >> card configuration space. This is usually due to suboptimal PCIe
> >> SerDes parameters are using in your board, which will cause bad link
> >> quality.
> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >
> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still
> > seeing this issue regularly, approx once/hour.
>
> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> version to use. Are you running any PCIe traffic test when the error
> happens?

Nope, the machine was either idle or running a reboot test; no PCIe stress
test or anything.

> And it will be useful if you can share your "lspci -vvv" output when
> the board is running, we can check to see if there is any error status
> reported.

Here's some lspci output and info about the firmware I'm running.
Obviously this lspci output was collected before a crash. I have also seen
lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.

U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
      32 KB ICACHE, 32 KB DCACHE
      SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
Boot from SPI-NOR
Slimpro FW:
      Ver: 2.4 (build 01.15.12.00 2015/05/20)
      PMD: 970 mV
      SOC: 950 mV
Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
I2C: ready
DRAM: ECC 32 GiB @ 1600MHz
SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
MMC: X-Gene SD/SDIO/eMMC: 0
PCIE0: (RC) X8 GEN-3 link up
      00:00.0 - 10e8:e004 - Bridge device
      01:00.0 - 15b3:1007 - Network controller

# lspci -vvv
00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
                        ExtTag- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
                LnkCap: Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
                        ClockPM- Surprise+ LLActRep+ BwNot+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Off, PwrInd Off, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
                        Changed: MRL- PresDet- LinkState+
                RootCtl:
                        ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
                LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [180 v1] #19
        Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010
        Kernel driver in use: pcieport

01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-