PCIe AER testing

* PCIe AER testing
@ 2017-11-02 17:47 David Laight
From: David Laight @ 2017-11-02 17:47 UTC (permalink / raw)
  To: linux-pci

I'm (still) trying to provoke aerdrv into reporting a PCIe error.
I've found a system where a 4.13 kernel (Ubuntu 17.10) reports
'_OSC: OS now controls [... AER ...]' and aerdrv (and PCIe PME)
use interrupts 125 and 126.

One of those host ports is connected to an igb ethernet chip,
the other to one of our cards.

I can generate PCIe read or write cycles that are outside the
BARs - and the card duly logs them in its own AER registers.
But I'm not seeing any messages from aerdrv, nor is the interrupt
count going up.
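
For reference, this is roughly how the interrupt counts can be compared
before and after provoking an error. It is only a sketch: it assumes the
usual /proc/interrupts layout (a header row of CPU names, then one row per
interrupt with one count per CPU), and the names irq_totals and
read_irq_totals are made up for illustration.

```python
def irq_totals(text):
    """Return {irq_label: count summed over all CPUs} from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue                       # not an interrupt row
        count = 0
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break                      # reached the description text
            count += int(field)
        totals[label.strip()] = count
    return totals


def read_irq_totals(path="/proc/interrupts"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return irq_totals(f.read())
```

Calling read_irq_totals() once before and once after generating the bad
cycles, and comparing the entries for "125" and "126", shows whether
either interrupt fired at all.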

CPU is an i7-7700 (according to /proc/cpuinfo).

The root port's (Intel Sunrise Point-H) AER registers are:
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
                RootCmd: CERptEn+ NFERptEn+ FERptEn+
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000

Our card's AER registers are:
        Capabilities: [800 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 14, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 40000001 0000000f df202000 00000000
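
To make the status/mask comparison less error-prone than reading the
dump by eye, here is a small sketch that parses the lspci flag lines
above and lists the bits that are set and not masked. The helper names
(parse_flags, unmasked) are hypothetical; the input is copied from the
card's dump.

```python
def parse_flags(line):
    """'UESta:  DLP- SDES+' -> ('UESta', {'DLP': False, 'SDES': True})"""
    name, _, rest = line.strip().partition(":")
    flags = {}
    for tok in rest.split():
        if tok[-1] in "+-":
            flags[tok[:-1]] = tok.endswith("+")
    return name, flags


def unmasked(status, mask):
    """Names of bits that are set in status and clear in mask."""
    return [n for n, on in status.items() if on and not mask.get(n, False)]


CARD = """
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
"""

regs = dict(parse_flags(line) for line in CARD.strip().splitlines())

for name in unmasked(regs["UESta"], regs["UEMsk"]):
    severity = "fatal" if regs["UESvrt"][name] else "non-fatal"
    print("uncorrectable, unmasked:", name, "(" + severity + ")")
print("correctable, unmasked:", unmasked(regs["CESta"], regs["CEMsk"]))
```

For the card this gives UnsupReq as an unmasked, non-fatal uncorrectable
error, and no unmasked correctable errors (NonFatalErr is set but masked).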

The HeaderLog is that of a write beyond the end of BAR2.
If I unmask NonFatalErr the HeaderLog contains that of invalid reads.
(Writing to the status bits clears them.)
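
For anyone wanting to check the decode, this is a rough sketch of how the
First Error Pointer and HeaderLog above can be read. It assumes lspci
prints the pointer in hex, that the first logged dword carries header
byte 0 in its top byte, and it only handles memory read/write requests.
The function name decode_mem_header is made up for illustration.

```python
# Uncorrectable Error Status bit positions, using lspci's names.
UE_BITS = {4: "DLP", 5: "SDES", 12: "TLP", 13: "FCP", 14: "CmpltTO",
           15: "CmpltAbrt", 16: "UnxCmplt", 17: "RxOF", 18: "MalfTLP",
           19: "ECRC", 20: "UnsupReq", 21: "ACSViol"}


def decode_mem_header(dwords):
    """Decode a logged memory request header given as four 32-bit ints."""
    dw0, dw1, dw2, dw3 = dwords
    fmt = (dw0 >> 29) & 0x7             # Fmt[2:0]
    typ = (dw0 >> 24) & 0x1f            # Type[4:0]
    if typ != 0 or fmt > 3:
        raise ValueError("not a memory read/write request")
    four_dw = bool(fmt & 1)             # 64-bit address form
    has_data = bool(fmt & 2)            # with data means a write
    addr = ((dw2 << 32) | dw3) if four_dw else dw2
    return {
        "kind": "MWr" if has_data else "MRd",
        "length_dw": (dw0 & 0x3ff) or 1024,   # 0 encodes 1024 dwords
        "requester_id": dw1 >> 16,
        "tag": (dw1 >> 8) & 0xff,
        "last_be": (dw1 >> 4) & 0xf,
        "first_be": dw1 & 0xf,
        "address": addr & ~0x3,               # low two bits are not address
    }


first_error = UE_BITS[int("14", 16)]
hdr = decode_mem_header([int(x, 16) for x in
                         "40000001 0000000f df202000 00000000".split()])
print(first_error)
print(hdr["kind"], hdr["length_dw"], hex(hdr["address"]), hex(hdr["first_be"]))
```

With the values above, the pointer (0x14, bit 20) selects UnsupReq, and
the header decodes as a one-dword memory write to 0xdf202000 with all
four byte enables set, which matches the write beyond the end of BAR2.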

I think the endpoint should be sending some kind of TLP to the root that
would update the status registers and then interrupt the host.
Unfortunately I can only trace TLPs that match the BARs.

I can try taking down the PCIe link - I believe the root port should log
that itself?

Any ideas?
Is it worth me trying to get the igb to generate the same errors?

	David
