linux-kernel.vger.kernel.org archive mirror
* AMD PCI Bridge: Hardware error from APEI
From: Hans-Peter Jansen @ 2020-06-27 18:23 UTC (permalink / raw)
  To: linux-kernel

Dear hackers of the order of the penguins,

we're facing a disturbing issue here after swapping the motherboard of a 
mission-critical system:

Jun 27 20:05:29 server kernel: {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jun 27 20:05:29 server kernel: {10}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 27 20:05:29 server kernel: {10}[Hardware Error]: event severity: corrected
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:  Error 0, type: corrected
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:  fru_text: PcieError
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   section_type: PCIe error
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   port_type: 4, root port
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   version: 0.2
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   command: 0x0407, status: 0x0010
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   device_id: 0000:60:03.1
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   slot: 19
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   secondary_bus: 0x62
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   class_code: 060400
Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
Jun 27 20:05:29 server kernel: pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
Jun 27 20:05:29 server kernel: pcieport 0000:60:03.1: AER:    [12] Timeout               
Jun 27 20:05:29 server kernel: pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID


Jun 27 20:05:51 server kernel: {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jun 27 20:05:51 server kernel: {11}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 27 20:05:51 server kernel: {11}[Hardware Error]: event severity: corrected
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:  Error 0, type: corrected
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:  fru_text: PcieError
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   section_type: PCIe error
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   port_type: 4, root port
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   version: 0.2
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   command: 0x0407, status: 0x0010
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   device_id: 0000:60:03.1
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   slot: 19
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   secondary_bus: 0x62
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   class_code: 060400
Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
Jun 27 20:05:51 server kernel: pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
Jun 27 20:05:51 server kernel: pcieport 0000:60:03.1: AER:    [12] Timeout               
Jun 27 20:05:51 server kernel: pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
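
For reference, the aer_status and aer_mask words in the records above can be decoded by hand. The following is a minimal sketch, assuming the standard bit layout of the PCIe AER Correctable Error Status and Mask registers; the names follow the PCIe specification rather than the kernel's short strings, and `decode_correctable` is a hypothetical helper, not part of any tool.

```python
# Bit positions of the PCIe AER Correctable Error Status/Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(value):
    """Return the names of all bits set in a correctable status/mask word."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(CORRECTABLE_BITS.get(bit, f"reserved bit {bit}"))
    return names


# Values copied from the log lines above.
print("aer_status:", decode_correctable(0x00001000))
print("aer_mask:  ", decode_correctable(0x00006000))
```

Under that assumption, the status word has only bit 12 set, which matches the "[12] Timeout" line, while the mask covers bits 13 and 14 and therefore does not suppress the reported error.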


60:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 44, NUMA node 3
        Bus: primary=60, secondary=62, subordinate=62, sec-latency=0
        I/O behind bridge: None
        Memory behind bridge: e3600000-e36fffff [size=1M]
        Prefetchable memory behind bridge: None
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Root Port (Slot+), MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [270] #19
        Capabilities: [2a0] Access Control Services
        Capabilities: [370] L1 PM Substates
        Capabilities: [380] Downstream Port Containment
        Capabilities: [3c4] #23
        Kernel driver in use: pcieport

It's probably related to a satellite receiver card, since the errors only 
appeared after plugging in this card:

62:00.0 Multimedia controller: Digital Devices GmbH Max
        Subsystem: Digital Devices GmbH Max S8 4/8
        Flags: bus master, fast devsel, latency 0, IRQ 161, NUMA node 3
        Memory at e3600000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 3
        Capabilities: [70] MSI: Enable- Count=1/2 Maskable- 64bit+
        Capabilities: [90] Express Endpoint, MSI 00
        Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
        Kernel driver in use: ddbridge
        Kernel modules: ddbridge

Specs:
ASUS KNPA-U16 with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI 
(officially supported RAM modules)

openSUSE 15.1, Kernel 5.7.5

Cheers,
Pete




* Re: AMD PCI Bridge: Hardware error from APEI
From: Hans-Peter Jansen @ 2020-07-07  6:56 UTC (permalink / raw)
  To: linux-kernel

On Saturday, 27 June 2020, 20:23:35 CEST, you wrote:
> Dear hackers of the order of the penguins,
> 
> we're facing a disturbing issue here after swapping the motherboard of a
> mission-critical system:
> 
> Specs:
> ASUS KNPA-U16 with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI
> (officially supported RAM modules)
>
> openSUSE 15.1, Kernel 5.7.5

I'm not sure how to proceed with this one.

After 9½ days of uptime, it has accumulated about 34,000 incidents:

2020-07-07T08:31:37.275535+02:00 tyrex kernel: [822744.211956] {34202}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2020-07-07T08:31:37.275557+02:00 tyrex kernel: [822744.211969] {34202}[Hardware Error]: It has been corrected by h/w and requires no further action
2020-07-07T08:31:37.275559+02:00 tyrex kernel: [822744.211977] {34202}[Hardware Error]: event severity: corrected
2020-07-07T08:31:37.275567+02:00 tyrex kernel: [822744.211983] {34202}[Hardware Error]:  Error 0, type: corrected
2020-07-07T08:31:37.275569+02:00 tyrex kernel: [822744.211989] {34202}[Hardware Error]:  fru_text: PcieError
2020-07-07T08:31:37.275570+02:00 tyrex kernel: [822744.211995] {34202}[Hardware Error]:   section_type: PCIe error
2020-07-07T08:31:37.275571+02:00 tyrex kernel: [822744.212001] {34202}[Hardware Error]:   port_type: 4, root port
2020-07-07T08:31:37.275573+02:00 tyrex kernel: [822744.212006] {34202}[Hardware Error]:   version: 0.2
2020-07-07T08:31:37.275574+02:00 tyrex kernel: [822744.212012] {34202}[Hardware Error]:   command: 0x0407, status: 0x0010
2020-07-07T08:31:37.275615+02:00 tyrex kernel: [822744.212019] {34202}[Hardware Error]:   device_id: 0000:60:03.1
2020-07-07T08:31:37.275631+02:00 tyrex kernel: [822744.212024] {34202}[Hardware Error]:   slot: 19
2020-07-07T08:31:37.275634+02:00 tyrex kernel: [822744.212029] {34202}[Hardware Error]:   secondary_bus: 0x62
2020-07-07T08:31:37.275636+02:00 tyrex kernel: [822744.212035] {34202}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
2020-07-07T08:31:37.275637+02:00 tyrex kernel: [822744.212041] {34202}[Hardware Error]:   class_code: 060400
2020-07-07T08:31:37.275639+02:00 tyrex kernel: [822744.212046] {34202}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
2020-07-07T08:31:37.275661+02:00 tyrex kernel: [822744.212148] pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-07-07T08:31:37.275664+02:00 tyrex kernel: [822744.212158] pcieport 0000:60:03.1: AER:    [12] Timeout               
2020-07-07T08:31:37.275667+02:00 tyrex kernel: [822744.212167] pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
2020-07-07T08:31:43.167444+02:00 tyrex kernel: [822750.099749] pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-07-07T08:31:43.167474+02:00 tyrex kernel: [822750.099765] pcieport 0000:60:03.1: AER:    [12] Timeout               
2020-07-07T08:31:43.167476+02:00 tyrex kernel: [822750.099773] pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
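
The figures in the first line of this record are enough to estimate how often the error recurs. Here is a minimal sketch, assuming that the number in braces is a running count of hardware error records printed since boot and that the bracketed value is uptime in seconds; `error_rate` and `PATTERN` are hypothetical names introduced for illustration.

```python
import re

# First line of the record above, copied verbatim.
LINE = (
    "2020-07-07T08:31:37.275535+02:00 tyrex kernel: [822744.211956] "
    "{34202}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4"
)

# Capture the uptime in seconds and the record sequence number in braces.
PATTERN = re.compile(r"\[\s*(\d+\.\d+)\]\s+\{(\d+)\}\[Hardware Error\]")


def error_rate(line):
    """Return (uptime_days, seconds_per_record, records_per_hour) for one line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a hardware error record line")
    uptime_s = float(match.group(1))
    records = int(match.group(2))
    return uptime_s / 86400, uptime_s / records, records * 3600 / uptime_s


days, interval, per_hour = error_rate(LINE)
print(f"{days:.1f} days up, one record every {interval:.1f} s, {per_hour:.0f} per hour")
```

Under those assumptions this works out to roughly one record every 24 seconds. Since the kernel may rate-limit what it prints, the real number of corrected errors could be higher.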

Needless to say, this is no permanent solution.

Any ideas, anybody?

In fact, with 25 years of experience building and operating Linux-based 
systems and servers, I wouldn't have expected this much hassle setting up 
commodity hardware in 2020. It almost feels as if diverging forces are 
dominating hardware, BIOS, firmware, and kernel developers more than ever 
before.

Certainly, one could argue that the decision for a low end AMD EPYC based 
system was a courageous one.

Pete




* Re: AMD PCI Bridge: Hardware error from APEI
From: Hans-Peter Jansen @ 2020-07-11 16:32 UTC (permalink / raw)
  To: linux-kernel

On Tuesday, 7 July 2020, 08:56:41 CEST, you wrote:
> On Saturday, 27 June 2020, 20:23:35 CEST, you wrote:
> > Dear hackers of the order of the penguins,
> > 
> > we're facing a disturbing issue here after swapping the motherboard of a
> > mission-critical system:
> > 
> > Specs:
> > ASUS KNPA-U16 with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI
> > (officially supported RAM modules)
> > 
> > openSUSE 15.1, Kernel 5.7.5
> 
> I'm not sure how to proceed with this one.
> 
> After 9½ days of uptime, it has accumulated about 34,000 incidents:
>
> [...]
> 
> Needless to say, this is no permanent solution.
> 
> Any ideas, anybody?

After moving the Digital Devices Max S8 4/8 to a different PCIe slot, the 
error has moved with it:

2020-07-11T18:25:34.380002+02:00 tyrex kernel: [  889.223783] {20}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2020-07-11T18:25:34.380025+02:00 tyrex kernel: [  889.223787] {20}[Hardware Error]: It has been corrected by h/w and requires no further action
2020-07-11T18:25:34.380028+02:00 tyrex kernel: [  889.223789] {20}[Hardware Error]: event severity: corrected
2020-07-11T18:25:34.380031+02:00 tyrex kernel: [  889.223791] {20}[Hardware Error]:  Error 0, type: corrected
2020-07-11T18:25:34.380032+02:00 tyrex kernel: [  889.223793] {20}[Hardware Error]:  fru_text: PcieError
2020-07-11T18:25:34.380034+02:00 tyrex kernel: [  889.223795] {20}[Hardware Error]:   section_type: PCIe error
2020-07-11T18:25:34.380577+02:00 tyrex kernel: [  889.223796] {20}[Hardware Error]:   port_type: 4, root port
2020-07-11T18:25:34.380586+02:00 tyrex kernel: [  889.223798] {20}[Hardware Error]:   version: 0.2
2020-07-11T18:25:34.380588+02:00 tyrex kernel: [  889.223800] {20}[Hardware Error]:   command: 0x0407, status: 0x0010
2020-07-11T18:25:34.380590+02:00 tyrex kernel: [  889.223802] {20}[Hardware Error]:   device_id: 0000:40:03.1
2020-07-11T18:25:34.380591+02:00 tyrex kernel: [  889.223803] {20}[Hardware Error]:   slot: 16
2020-07-11T18:25:34.380593+02:00 tyrex kernel: [  889.223804] {20}[Hardware Error]:   secondary_bus: 0x41
2020-07-11T18:25:34.380595+02:00 tyrex kernel: [  889.223806] {20}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
2020-07-11T18:25:34.380597+02:00 tyrex kernel: [  889.223808] {20}[Hardware Error]:   class_code: 060400
2020-07-11T18:25:34.380599+02:00 tyrex kernel: [  889.223810] {20}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
2020-07-11T18:25:34.380601+02:00 tyrex kernel: [  889.223908] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-07-11T18:25:34.380603+02:00 tyrex kernel: [  889.223912] pcieport 0000:40:03.1: AER:    [12] Timeout               
2020-07-11T18:25:34.380605+02:00 tyrex kernel: [  889.223915] pcieport 0000:40:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
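
The command, status, secondary_status, and control words in the record above are ordinary PCI bridge configuration registers, so they can be expanded into named bits. This is a minimal sketch, assuming the standard type 1 (bridge) header layout; the tables are deliberately partial and `set_bits` is a hypothetical helper.

```python
# Bit names per the PCI/PCIe type 1 (bridge) configuration header.
# Only a subset of the defined bits is listed here.
COMMAND = {0: "I/O Space", 1: "Memory Space", 2: "Bus Master",
           6: "Parity Error Response", 8: "SERR# Enable", 10: "Interrupt Disable"}
STATUS = {3: "Interrupt Status", 4: "Capabilities List"}
SECONDARY_STATUS = {8: "Master Data Parity Error", 11: "Signaled Target Abort",
                    12: "Received Target Abort", 13: "Received Master Abort",
                    14: "Received System Error", 15: "Detected Parity Error"}
BRIDGE_CONTROL = {0: "Parity Error Response", 1: "SERR# Enable", 2: "ISA Enable",
                  3: "VGA Enable", 4: "VGA 16-bit Decode", 5: "Master Abort Mode",
                  6: "Secondary Bus Reset"}


def set_bits(value, names):
    """List the names of all bits set in a 16-bit register value."""
    return [names.get(bit, f"bit {bit}") for bit in range(16) if (value >> bit) & 1]


# Values copied from the record above.
for label, value, names in (
    ("command", 0x0407, COMMAND),
    ("status", 0x0010, STATUS),
    ("secondary_status", 0x2000, SECONDARY_STATUS),
    ("control", 0x0012, BRIDGE_CONTROL),
):
    print(f"{label}: {', '.join(set_bits(value, names))}")
```

If that reading is right, the only status-like flag set is Received Master Abort on the secondary side. That bit is sticky and may simply be left over from bus enumeration, which is an assumption on my part and not something the log confirms.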

It looks like the system is creating such devices on demand:

40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 2
        Bus: primary=40, secondary=41, subordinate=41, sec-latency=0
        I/O behind bridge: None
        Memory behind bridge: e5d00000-e5dfffff [size=1M]
        Prefetchable memory behind bridge: None
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Root Port (Slot+), MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [270] #19
        Capabilities: [2a0] Access Control Services
        Capabilities: [370] L1 PM Substates
        Capabilities: [380] Downstream Port Containment
        Capabilities: [3c4] #23
        Kernel driver in use: pcieport

in order to handle:

41:00.0 Multimedia controller: Digital Devices GmbH Max
        Subsystem: Digital Devices GmbH Max S8 4/8
        Flags: bus master, fast devsel, latency 0, IRQ 181, NUMA node 2
        Memory at e5d00000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 3
        Capabilities: [70] MSI: Enable- Count=1/2 Maskable- 64bit+
        Capabilities: [90] Express Endpoint, MSI 00
        Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
        Kernel driver in use: ddbridge
        Kernel modules: ddbridge
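
The link between each error record and the card can be made explicit: the record names the root port and its secondary bus, and the card is device 00.0 on that bus. A minimal sketch, assuming a single endpoint directly behind the port (as in both lspci listings); `parse_bdf` and `downstream_address` are hypothetical helpers.

```python
def parse_bdf(text):
    """Split 'DDDD:BB:DD.F' into (domain, bus, device, function) integers."""
    domain, bus, devfn = text.split(":")
    device, function = devfn.split(".")
    return int(domain, 16), int(bus, 16), int(device, 16), int(function, 16)


def downstream_address(port_bdf, secondary_bus, device=0, function=0):
    """Address of a device on the bus directly behind a root port."""
    domain, _, _, _ = parse_bdf(port_bdf)
    return f"{domain:04x}:{secondary_bus:02x}:{device:02x}.{function:x}"


# (device_id, secondary_bus) pairs taken from the error records
# before and after the slot change.
for port, secondary in (("0000:60:03.1", 0x62), ("0000:40:03.1", 0x41)):
    print(port, "->", downstream_address(port, secondary))
```

Both results match the address of the Max S8 card in the corresponding lspci output, which is consistent with the error following the card rather than staying with one slot.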

Hrmpf.

Pete





* Re: AMD PCI Bridge: Hardware error from APEI
From: Hans-Peter Jansen @ 2020-07-15  8:11 UTC (permalink / raw)
  To: linux-kernel

On Saturday, 11 July 2020, 18:32:21 CEST, you wrote:
> On Tuesday, 7 July 2020, 08:56:41 CEST, you wrote:
> > On Saturday, 27 June 2020, 20:23:35 CEST, you wrote:
> > > Dear hackers of the order of the penguins,
> > > 
> > > we're facing a disturbing issue here after swapping the motherboard of a
> > > mission-critical system:
> > > 
> > > Specs:
> > > ASUS KNPA-U16 with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI
> > > (officially supported RAM modules)
> > > 
> > > openSUSE 15.1, Kernel 5.7.5
> > 
> > I'm not sure how to proceed with this one.
> > 
> > After 9½ days of uptime, it has accumulated about 34,000 incidents:
> > 
> > [...]
> > 
> > Needless to say, this is no permanent solution.
> > 
> > Any ideas, anybody?
> 
> After moving the Digital Devices Max S8 4/8 to a different PCIe slot, the
> error has moved with it:
> 
> [...]

Here's the initialization sequence of these devices:

Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: [1022:1453] type 01 class 0x060400
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PME# supported from D0 D3hot D3cold
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: [dd01:0007] type 00 class 0x048000
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: reg 0x10: [mem 0xe5d00000-0xe5d0ffff 64bit]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: Adding to iommu group 41
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: Adding to iommu group 47
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: PME: Signaling with IRQ 39
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: AER: enabled with IRQ 39
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+

The last line looks somewhat suspicious, but it is hard to decipher:

DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+

I'm pretty sure this is related, but its deeper meaning escapes me.

It would be nice if some enlightened person could shed some light
into this abyss.
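
As a starting point for deciphering that line, it can be split into its individual fields. This is a minimal sketch, assuming the fields map onto the Downstream Port Containment capability register as described in the PCIe specification; the descriptions in `FLAG_MEANING` are my own wording and `parse_dpc_capabilities` is a hypothetical helper.

```python
import re

# The DPC line from the boot log above, copied verbatim.
DPC_LINE = (
    "DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ "
    "SwTrigger+ RP PIO Log 6, DL_ActiveErr+"
)

# Flag names as printed in the log, with a plain-language reading of each.
FLAG_MEANING = {
    "RPExt": "Root Port extensions for DPC",
    "PoisonedTLP": "Poisoned TLP egress blocking",
    "SwTrigger": "software triggering of DPC",
    "DL_ActiveErr": "DL_Active ERR_COR signaling",
}


def parse_dpc_capabilities(line):
    """Turn the capability line into a dict of numbers and supported flags."""
    caps = {}
    msg = re.search(r"Int Msg #(\d+)", line)
    log = re.search(r"RP PIO Log (\d+)", line)
    if msg:
        caps["interrupt message number"] = int(msg.group(1))
    if log:
        caps["RP PIO log size (DWORDs)"] = int(log.group(1))
    for name, meaning in FLAG_MEANING.items():
        flag = re.search(re.escape(name) + r"([+-])", line)
        if flag:
            # '+' means the port advertises support, '-' means it does not.
            caps[meaning] = flag.group(1) == "+"
    return caps


for key, value in parse_dpc_capabilities(DPC_LINE).items():
    print(f"{key}: {value}")
```

Read this way, the line appears to list what the port supports at probe time rather than report an error, so it may be unrelated to the corrected errors. That interpretation is an assumption and would need confirmation from someone who knows the DPC code.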

Pete




* Re: AMD PCI Bridge: Hardware error from APEI
From: Hans-Peter Jansen @ 2021-05-15 17:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: tom.l.nguyen, yanmin.zhang, andrew.patterson

Dear maintainers,

since my bug report has received *no* attention for a couple of months, I 
decided to pull some people into the loop.

This is about:
https://bugzilla.kernel.org/show_bug.cgi?id=209331

This issue has been present from 5.3.18 (the first version tested) up to 
5.12.2. 5.12.4 is in preparation, but I'm pretty confident that I can 
reproduce it there as well.

This might be some bad hardware interaction, but given that every attempt to 
*disable* these messages has failed, I would call this a general issue.

See: https://bugzilla.kernel.org/show_bug.cgi?id=209331#c2

I would like to know how to get to the bottom of it and, if all attempts 
fail, how to disable this behavior.

Sure, this is a production system, but I'm able and prepared to conduct the 
necessary tests and actions, of course.

Thanks,
Pete




