linux-pci.vger.kernel.org archive mirror
* Kernel oops while using AER inject
@ 2020-12-03 16:00 Hinko Kocevar
  2020-12-14 18:37 ` Guilherme Piccoli
  2020-12-14 18:51 ` Keith Busch
  0 siblings, 2 replies; 5+ messages in thread
From: Hinko Kocevar @ 2020-12-03 16:00 UTC (permalink / raw)
  To: linux-pci

Hi all,

I'm trying to use the AER inject module and the aer-inject tool to trigger AER handling on a MicroTCA-based system. The reason is that I see this particular error appear every now and then, reported by 00:01.0, which happens to be the root PCIe port on the CPU card. The system also contains an MCH with a PEX 8748 PCIe switch that provides PCIe connectivity to the regular AMC cards inside the crate. There are 5 AMCs in the crate: four use the sis8300drv kernel driver and one uses the mrf kernel driver (the messages they print while the recovery is happening are shown below).

lspci output:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics P630 (rev 04)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.1 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #2 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1c.3 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #4 (rev f1)
00:1f.0 ISA bridge: Intel Corporation CM238 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
01:00.1 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
01:00.2 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
01:00.3 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
01:00.4 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:02.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:0a.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:01.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:02.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:03.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:08.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:09.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:0a.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:0b.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:10.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:12.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
06:00.0 Signal processing controller: Research Centre Juelich Device 0024
08:00.0 Signal processing controller: Research Centre Juelich Device 0024
09:00.0 Signal processing controller: Research Centre Juelich Device 0024
0b:00.0 Signal processing controller: Research Centre Juelich Device 0024
0e:00.0 Signal processing controller: Xilinx Corporation Device 7011
15:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
16:00.0 Ethernet controller: Intel Corporation I210 Gigabit Backplane Connection (rev 03)
17:00.0 Ethernet controller: Intel Corporation I210 Gigabit Backplane Connection (rev 03)
18:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)

Note that I had to manually remove the following devices to make the recovery report success:

echo 1 > /sys/bus/pci/devices/0000\:02\:02.0/remove 
echo 1 > /sys/bus/pci/devices/0000\:02\:08.0/remove 
echo 1 > /sys/bus/pci/devices/0000\:02\:09.0/remove 
echo 1 > /sys/bus/pci/devices/0000\:02\:0a.0/remove 
echo 1 > /sys/bus/pci/devices/0000\:01\:00.1/remove 
echo 1 > /sys/bus/pci/devices/0000\:01\:00.2/remove 
echo 1 > /sys/bus/pci/devices/0000\:01\:00.3/remove 
echo 1 > /sys/bus/pci/devices/0000\:01\:00.4/remove 
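
The removals above could also be scripted. The following is a minimal sketch, assuming the usual sysfs layout under /sys/bus/pci/devices; the names `remove_devices` and `sysfs_root` are illustrative and not part of any existing tool. On a real system it would have to run as root.

```python
from pathlib import Path

# Devices removed by hand in the commands above (domain:bus:device.function).
DEVICES = [
    "0000:02:02.0", "0000:02:08.0", "0000:02:09.0", "0000:02:0a.0",
    "0000:01:00.1", "0000:01:00.2", "0000:01:00.3", "0000:01:00.4",
]

def remove_devices(devices, sysfs_root="/sys/bus/pci/devices"):
    """Write "1" to each device's sysfs 'remove' attribute.

    Returns the devices that were skipped because their sysfs directory
    does not exist (for example, because they were already removed).
    """
    skipped = []
    for bdf in devices:
        attr = Path(sysfs_root) / bdf / "remove"
        if not attr.parent.is_dir():
            skipped.append(bdf)
            continue
        attr.write_text("1\n")
    return skipped
```

Calling `remove_devices(DEVICES)` would perform the same eight writes as the echo commands; the `sysfs_root` parameter exists only so the function can be tried against a scratch directory first.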

After that, the PCI device tree looks like this:

00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
 └─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
    └─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
       └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
          ├─04:00.0 downstream_port, slot 10, power: Off
          ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
          │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
          ├─04:02.0 downstream_port, slot 9, power: Off
          ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
          │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
          ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 2.5GT/s, width x4
          │  └─09:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
          ├─04:09.0 downstream_port, slot 11, power: Off
          ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 2.5GT/s, width x4
          │  └─0b:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
          ├─04:0b.0 downstream_port, slot 12, power: Off
          ├─04:10.0 downstream_port, slot 8, power: Off
          ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
          │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
          └─04:12.0 downstream_port, slot 7, power: Off
00:01.1 root_port, slot 2, device present
00:1c.0 root_port, slot 4, device present, speed 2.5GT/s, width x1
 └─15:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
00:1c.1 root_port, slot 5, device present, speed 2.5GT/s, width x1
 └─16:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
00:1c.2 root_port, slot 6, device present, speed 2.5GT/s, width x1
 └─17:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
00:1c.3 root_port, "J6B1", slot 7, device present, speed 2.5GT/s, width x1
 └─18:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)

Finally, here is the result of the AER recovery that takes place after injecting a fatal uncorrectable error into 00:01.0.

Dec  3 15:25:30 bd-cpu18 kernel: aer 0000:00:01.0:pcie002: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout    
Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_error_detected] called .. state=2
Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_error_detected] called .. state=2
Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_error_detected] called .. state=2
Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_error_detected] called .. state=2
Dec  3 15:25:30 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_result_none] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_result_none] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_result_none] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_result_none] called..
Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_result_none] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] UC errors cleared!
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] UC errors cleared!
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] UC errors cleared!
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] called..
Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] UC errors cleared!
Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_resume] called..
Dec  3 15:25:31 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Device recovery successful
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Successfully activated service 'org.freedesktop.problems'
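
For reference, the status word and requester ID printed in the log above can be decoded by hand. The sketch below is illustrative only: the bit names follow the AER Uncorrectable Error Status register layout from the PCIe specification, and the function names are made up for this example.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}

def decode_uncor_status(status):
    """Return (bit, name) pairs for every bit set in the status word."""
    return [(bit, UNCOR_BITS.get(bit, "unknown"))
            for bit in range(32) if status & (1 << bit)]

def decode_requester_id(req_id):
    """Split a 16-bit requester ID into a bus:device.function string."""
    bus = (req_id >> 8) & 0xFF
    dev = (req_id >> 3) & 0x1F
    fn = req_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn:x}"

print(decode_uncor_status(0x00004000))  # [(14, 'Completion Timeout')]
print(decode_requester_id(0x0008))      # 00:01.0
```

This matches what the kernel printed: status 00004000 is bit 14 (Completion Timeout), and id=0008 is the root port 00:01.0 itself.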

At this point, reads from the PCI space of the AMCs return 0xFFFFFFFF.
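
An all-ones value is what a PCI read typically yields when the device does not respond. A minimal sketch of how this could be checked from user space follows, assuming the condition is visible in configuration space through the sysfs `config` file (the message above does not say whether config or memory space was read); all function names are hypothetical.

```python
import struct
from pathlib import Path

def vendor_device_id(config_bytes):
    """Return (vendor_id, device_id) from the first 4 bytes of config space."""
    return struct.unpack_from("<HH", config_bytes, 0)

def looks_unreachable(config_bytes):
    """True if the first config dword reads back as all ones (0xFFFFFFFF)."""
    return config_bytes[:4] == b"\xff\xff\xff\xff"

def check_device(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the first dword of a device's config space through sysfs."""
    data = (Path(sysfs_root) / bdf / "config").read_bytes()[:4]
    return looks_unreachable(data)
```

For a healthy sis8300 card, `vendor_device_id` would be expected to return `(0x1796, 0x0024)`, the IDs shown in the device tree above; `check_device("0000:06:00.0")` returning `True` would correspond to the 0xFFFFFFFF reads.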

Below are the messages captured after issuing 'echo 1 > /sys/bus/pci/rescan'. After that, the CPU card rebooted by itself.

Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 1024)
Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 1024)
Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.3: Max Payload Size set to 256 (was 128, max 1024)
Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.4: Max Payload Size set to 256 (was 128, max 1024)
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0b.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:10.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:11.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:12.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec  3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Dec  3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
Dec  3 15:25:59 bd-cpu18 kernel: PGD 80000001767c4067 PUD 431939067 PMD 0 
Dec  3 15:25:59 bd-cpu18 kernel: Oops: 0000 [#1] SMP 
Dec  3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
Dec  3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
Dec  3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.6.1.el7.x86_64.debug #1
Dec  3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
Dec  3 15:25:59 bd-cpu18 kernel: task: ffff9b22b5f88000 ti: ffff9b22b5bc4000 task.ti: ffff9b22b5bc4000
Dec  3 15:25:59 bd-cpu18 kernel: RIP: 0010:[<ffffffffc0b182f7>]  [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
Dec  3 15:25:59 bd-cpu18 kernel: RSP: 0018:ffff9b22b5bc7a30  EFLAGS: 00010046
Dec  3 15:25:59 bd-cpu18 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
Dec  3 15:25:59 bd-cpu18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b25718ab000
Dec  3 15:25:59 bd-cpu18 kernel: RBP: ffff9b22b5bc7a68 R08: ffff9b22b5bc7a84 R09: ffffffffc0b1a0b0
Dec  3 15:25:59 bd-cpu18 kernel: R10: 0000000000000001 R11: 69c1fefdd26da26d R12: 0000000000000000
Dec  3 15:25:59 bd-cpu18 kernel: R13: ffffffffc0b1a050 R14: ffff9b22b5bc7a84 R15: ffff9b25718ab000
Dec  3 15:25:59 bd-cpu18 kernel: FS:  00007f3d00e8b740(0000) GS:ffff9b259ce00000(0000) knlGS:0000000000000000
Dec  3 15:25:59 bd-cpu18 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000 CR3: 0000000175796000 CR4: 00000000003607e0
Dec  3 15:25:59 bd-cpu18 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  3 15:25:59 bd-cpu18 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec  3 15:25:59 bd-cpu18 kernel: Call Trace:
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
Dec  3 15:25:59 bd-cpu18 kernel: Code: 81 f9 b0 a0 b1 c0 74 5c 4d 3b 79 10 75 ee 49 8b 41 18 4d 8b af b8 00 00 00 89 de 49 89 87 b8 00 00 00 4d 89 f0 44 89 e2 4c 89 ff <48> 8b 00 e8 41 f3 10 e2 48 8b 75 d0 89 c3 4d 89 af b8 00 00 00 
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.711+0000: 2132: info : libvirt version: 4.5.0, package: 23.el7_7.6 (CentOS BuildSystem <http://bugs.centos.org>, 2020-03-17-23:39:10, x86-01.bsys.centos.org)
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.711+0000: 2132: info : hostname: bd-cpu18.cslab.esss.lu.se
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.711+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.1 not found: could not access /sys/bus/pci/devices/0000:01:00.1/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.713+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.2 not found: could not access /sys/bus/pci/devices/0000:01:00.2/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.714+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.3 not found: could not access /sys/bus/pci/devices/0000:01:00.3/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.714+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.4 not found: could not access /sys/bus/pci/devices/0000:01:00.4/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.715+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:02.0 not found: could not access /sys/bus/pci/devices/0000:02:02.0/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.715+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:09.0 not found: could not access /sys/bus/pci/devices/0000:02:09.0/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.715+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:08.0 not found: could not access /sys/bus/pci/devices/0000:02:08.0/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.716+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:0a.0 not found: could not access /sys/bus/pci/devices/0000:02:0a.0/config: No such file or directory
Dec  3 15:25:59 bd-cpu18 kernel: RIP  [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
Dec  3 15:25:59 bd-cpu18 kernel: RSP <ffff9b22b5bc7a30>
Dec  3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000
Dec  3 15:25:59 bd-cpu18 kernel: ---[ end trace 452aceb2952fc499 ]---
Dec  3 15:25:59 bd-cpu18 kernel: BUG: sleeping function called from invalid context at kernel/rwsem.c:21
Dec  3 15:25:59 bd-cpu18 kernel: in_atomic(): 1, irqs_disabled(): 1, pid: 8150, name: bash
Dec  3 15:25:59 bd-cpu18 kernel: INFO: lockdep is turned off.
Dec  3 15:25:59 bd-cpu18 kernel: irq event stamp: 87980
Dec  3 15:25:59 bd-cpu18 kernel: hardirqs last  enabled at (87979): [<ffffffffa308bcd6>] _raw_spin_unlock_irqrestore+0x36/0x70
Dec  3 15:25:59 bd-cpu18 kernel: hardirqs last disabled at (87980): [<ffffffffa308c55b>] _raw_spin_lock_irqsave+0x2b/0xa0
Dec  3 15:25:59 bd-cpu18 kernel: softirqs last  enabled at (87300): [<ffffffffa28ba134>] __do_softirq+0x1a4/0x470
Dec  3 15:25:59 bd-cpu18 kernel: softirqs last disabled at (87281): [<ffffffffa309c2bc>] call_softirq+0x1c/0x30
Dec  3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G      D    OE  ------------   3.10.0-1160.6.1.el7.x86_64.debug #1
Dec  3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
Dec  3 15:25:59 bd-cpu18 kernel: Call Trace:
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3080c37>] dump_stack+0x19/0x1b
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28f500b>] __might_sleep+0x17b/0x240
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3087aba>] down_read+0x2a/0xb0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28ccad3>] exit_signals+0x33/0x150
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b5f6c>] do_exit+0xcc/0xb80
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2390>] ? kmsg_dump+0x130/0x1d0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2294>] ? kmsg_dump+0x34/0x1d0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa308f910>] oops_end+0xa0/0xe0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28836a4>] no_context+0x144/0x310
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa293eac3>] ? __bfs+0x103/0x280
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa288398a>] __bad_area_nosemaphore+0x11a/0x230
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2883ab4>] bad_area_nosemaphore+0x14/0x20
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092ce8>] __do_page_fault+0x358/0x5a0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2944229>] ? __lock_acquire+0xa49/0x1630
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092f65>] do_page_fault+0x35/0x90
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa308e818>] page_fault+0x28/0x30
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffc0b182f7>] ? aer_inj_read_config+0x87/0x160 [aer_inject]
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
Dec  3 15:25:59 bd-cpu18 kernel: BUG: scheduling while atomic: bash/8150/0x10000003
Dec  3 15:25:59 bd-cpu18 kernel: INFO: lockdep is turned off.
Dec  3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
Dec  3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
Dec  3 15:25:59 bd-cpu18 kernel: irq event stamp: 87980
Dec  3 15:25:59 bd-cpu18 kernel: hardirqs last  enabled at (87979): [<ffffffffa308bcd6>] _raw_spin_unlock_irqrestore+0x36/0x70
Dec  3 15:25:59 bd-cpu18 kernel: hardirqs last disabled at (87980): [<ffffffffa308c55b>] _raw_spin_lock_irqsave+0x2b/0xa0
Dec  3 15:25:59 bd-cpu18 kernel: softirqs last  enabled at (87300): [<ffffffffa28ba134>] __do_softirq+0x1a4/0x470
Dec  3 15:25:59 bd-cpu18 kernel: softirqs last disabled at (87281): [<ffffffffa309c2bc>] call_softirq+0x1c/0x30
Dec  3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G      D    OE  ------------   3.10.0-1160.6.1.el7.x86_64.debug #1
Dec  3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
Dec  3 15:25:59 bd-cpu18 kernel: Call Trace:
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3080c37>] dump_stack+0x19/0x1b
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3078933>] __schedule_bug+0x7d/0x8c
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa30887ec>] __schedule+0x7cc/0x810
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28f98e6>] __cond_resched+0x26/0x30
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3088b7a>] _cond_resched+0x3a/0x50
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3087abf>] down_read+0x2f/0xb0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28ccad3>] exit_signals+0x33/0x150
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b5f6c>] do_exit+0xcc/0xb80
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2390>] ? kmsg_dump+0x130/0x1d0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2294>] ? kmsg_dump+0x34/0x1d0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa308f910>] oops_end+0xa0/0xe0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa28836a4>] no_context+0x144/0x310
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa293eac3>] ? __bfs+0x103/0x280
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa288398a>] __bad_area_nosemaphore+0x11a/0x230
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2883ab4>] bad_area_nosemaphore+0x14/0x20
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092ce8>] __do_page_fault+0x358/0x5a0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2944229>] ? __lock_acquire+0xa49/0x1630
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092f65>] do_page_fault+0x35/0x90
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa308e818>] page_fault+0x28/0x30
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffc0b182f7>] ? aer_inj_read_config+0x87/0x160 [aer_inject]
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
Dec  3 15:25:59 bd-cpu18 kernel: note: bash[8150] exited with preempt_count 2
Dec  3 15:26:00 bd-cpu18 sh: abrt-dump-oops: Found oopses: 2
Dec  3 15:26:00 bd-cpu18 sh: abrt-dump-oops: Creating problem directories
Dec  3 15:26:00 bd-cpu18 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
Dec  3 15:26:01 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:01 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:02 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:02 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:02 bd-cpu18 abrt-server: Looking for kernel package
Dec  3 15:26:02 bd-cpu18 abrt-dump-oops: Reported 2 kernel oopses to Abrt
Dec  3 15:26:02 bd-cpu18 abrt-server: Kernel package kernel-debug-3.10.0-1160.6.1.el7.x86_64 found
Dec  3 15:26:04 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:04 bd-cpu18 abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N] 
Dec  3 15:26:04 bd-cpu18 abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N] 
Dec  3 15:26:04 bd-cpu18 abrt-server: Sending an email...
Dec  3 15:26:04 bd-cpu18 abrt-server: Sending a notification email to: root@localhost
Dec  3 15:26:04 bd-cpu18 abrt-server: Email was sent to: root@localhost
Dec  3 15:26:05 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:05 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:05 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:06 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:06 bd-cpu18 abrt-server: Looking for kernel package
Dec  3 15:26:06 bd-cpu18 abrt-server: Kernel package kernel-debug-3.10.0-1160.6.1.el7.x86_64 found
Dec  3 15:26:07 bd-cpu18 rtkit-daemon[1338]: The canary thread is apparently starving. Taking action.
Dec  3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Demoting known real-time threads.
Dec  3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Successfully demoted thread 2594 of process 2591 (/usr/bin/pulseaudio).
Dec  3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Successfully demoted thread 2591 of process 2591 (/usr/bin/pulseaudio).
Dec  3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Demoted 2 threads.
Dec  3 15:26:08 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec  3 15:26:08 bd-cpu18 abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N] 
Dec  3 15:26:08 bd-cpu18 abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N] 
Dec  3 15:26:08 bd-cpu18 abrt-server: Sending an email...
Dec  3 15:26:08 bd-cpu18 abrt-server: Sending a notification email to: root@localhost


I guess this is the root cause of the oops:

Dec  3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Dec  3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]

Looking at the source code of the running kernel, this is the offending aer_inj_read_config():


static int aer_inj_read_config(struct pci_bus *bus, unsigned int devfn,
			       int where, int size, u32 *val)
{
	u32 *sim;
	struct aer_error *err;
	unsigned long flags;
	struct pci_ops *ops;
	struct pci_ops *my_ops;
	int domain;
	int rv;

	spin_lock_irqsave(&inject_lock, flags);
	if (size != sizeof(u32))
		goto out;
	domain = pci_domain_nr(bus);
	if (domain < 0)
		goto out;
	err = __find_aer_error(domain, bus->number, devfn);
	if (!err)
		goto out;

	sim = find_pci_config_dword(err, where, NULL);
	if (sim) {
		*val = *sim;
		spin_unlock_irqrestore(&inject_lock, flags);
		return 0;
	}
out:
	ops = __find_pci_bus_ops(bus);
	/*
	 * pci_lock must already be held, so we can directly
	 * manipulate bus->ops.  Many config access functions,
	 * including pci_generic_config_read() require the original
	 * bus->ops be installed to function, so temporarily put them
	 * back.
	 */
	my_ops = bus->ops;
	bus->ops = ops;
	rv = ops->read(bus, devfn, where, size, val);
	bus->ops = my_ops;
	spin_unlock_irqrestore(&inject_lock, flags);
	return rv;
}


Here is 'objdump -D aer_inject.ko':

0000000000000270 <aer_inj_read_config>:
 270:   e8 00 00 00 00          callq  275 <aer_inj_read_config+0x5>
 275:   55                      push   %rbp
 276:   48 89 e5                mov    %rsp,%rbp
 279:   41 57                   push   %r15
 27b:   49 89 ff                mov    %rdi,%r15
 27e:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
 285:   41 56                   push   %r14
 287:   4d 89 c6                mov    %r8,%r14
 28a:   41 55                   push   %r13
 28c:   41 54                   push   %r12
 28e:   41 89 d4                mov    %edx,%r12d
 291:   53                      push   %rbx
 292:   89 f3                   mov    %esi,%ebx
 294:   48 83 ec 10             sub    $0x10,%rsp
 298:   89 4d cc                mov    %ecx,-0x34(%rbp)
 29b:   e8 00 00 00 00          callq  2a0 <aer_inj_read_config+0x30>
 2a0:   8b 4d cc                mov    -0x34(%rbp),%ecx
 2a3:   48 89 45 d0             mov    %rax,-0x30(%rbp)
 2a7:   83 f9 04                cmp    $0x4,%ecx
 2aa:   0f 84 88 00 00 00       je     338 <aer_inj_read_config+0xc8>
 2b0:   4c 8b 0d 00 00 00 00    mov    0x0(%rip),%r9        # 2b7 <aer_inj_read_config+0x47>
 2b7:   49 81 f9 00 00 00 00    cmp    $0x0,%r9
 2be:   75 14                   jne    2d4 <aer_inj_read_config+0x64>
 2c0:   eb 6e                   jmp    330 <aer_inj_read_config+0xc0>
 2c2:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
 2c8:   4d 8b 09                mov    (%r9),%r9
 2cb:   49 81 f9 00 00 00 00    cmp    $0x0,%r9
 2d2:   74 5c                   je     330 <aer_inj_read_config+0xc0>
 2d4:   4d 3b 79 10             cmp    0x10(%r9),%r15
 2d8:   75 ee                   jne    2c8 <aer_inj_read_config+0x58>
 2da:   49 8b 41 18             mov    0x18(%r9),%rax
 2de:   4d 8b af b8 00 00 00    mov    0xb8(%r15),%r13
 2e5:   89 de                   mov    %ebx,%esi
 2e7:   49 89 87 b8 00 00 00    mov    %rax,0xb8(%r15)
 2ee:   4d 89 f0                mov    %r14,%r8
 2f1:   44 89 e2                mov    %r12d,%edx
 2f4:   4c 89 ff                mov    %r15,%rdi
 2f7:   48 8b 00                mov    (%rax),%rax
 2fa:   e8 00 00 00 00          callq  2ff <aer_inj_read_config+0x8f>
 2ff:   48 8b 75 d0             mov    -0x30(%rbp),%rsi
 303:   89 c3                   mov    %eax,%ebx
 305:   4d 89 af b8 00 00 00    mov    %r13,0xb8(%r15)
 30c:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
 313:   e8 00 00 00 00          callq  318 <aer_inj_read_config+0xa8>
 318:   89 d8                   mov    %ebx,%eax
 31a:   48 83 c4 10             add    $0x10,%rsp
 31e:   5b                      pop    %rbx
 31f:   41 5c                   pop    %r12
 321:   41 5d                   pop    %r13
 323:   41 5e                   pop    %r14
 325:   41 5f                   pop    %r15
 327:   5d                      pop    %rbp
 328:   c3                      retq   
 329:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 330:   31 c0                   xor    %eax,%eax
 332:   eb aa                   jmp    2de <aer_inj_read_config+0x6e>
 334:   0f 1f 40 00             nopl   0x0(%rax)
 338:   49 8b 87 c8 00 00 00    mov    0xc8(%r15),%rax
 33f:   8b 00                   mov    (%rax),%eax
 341:   85 c0                   test   %eax,%eax
 343:   0f 88 67 ff ff ff       js     2b0 <aer_inj_read_config+0x40>
 349:   48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi        # 350 <aer_inj_read_config+0xe0>
 350:   41 0f b6 97 d8 00 00    movzbl 0xd8(%r15),%edx
 357:   00 
 358:   48 81 ff 00 00 00 00    cmp    $0x0,%rdi
 35f:   0f 84 4b ff ff ff       je     2b0 <aer_inj_read_config+0x40>
 365:   3b 47 10                cmp    0x10(%rdi),%eax
 368:   74 1b                   je     385 <aer_inj_read_config+0x115>
 36a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
 370:   48 8b 3f                mov    (%rdi),%rdi
 373:   48 81 ff 00 00 00 00    cmp    $0x0,%rdi
 37a:   0f 84 30 ff ff ff       je     2b0 <aer_inj_read_config+0x40>
 380:   3b 47 10                cmp    0x10(%rdi),%eax
 383:   75 eb                   jne    370 <aer_inj_read_config+0x100>
 385:   3b 57 14                cmp    0x14(%rdi),%edx
 388:   75 e6                   jne    370 <aer_inj_read_config+0x100>
 38a:   3b 5f 18                cmp    0x18(%rdi),%ebx
 38d:   75 e1                   jne    370 <aer_inj_read_config+0x100>
 38f:   48 85 ff                test   %rdi,%rdi
 392:   0f 84 18 ff ff ff       je     2b0 <aer_inj_read_config+0x40>
 398:   31 d2                   xor    %edx,%edx
 39a:   44 89 e6                mov    %r12d,%esi
 39d:   89 4d cc                mov    %ecx,-0x34(%rbp)
 3a0:   e8 5b fc ff ff          callq  0 <find_pci_config_dword>
 3a5:   48 85 c0                test   %rax,%rax
 3a8:   8b 4d cc                mov    -0x34(%rbp),%ecx
 3ab:   0f 84 ff fe ff ff       je     2b0 <aer_inj_read_config+0x40>
 3b1:   8b 00                   mov    (%rax),%eax
 3b3:   48 8b 75 d0             mov    -0x30(%rbp),%rsi
 3b7:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
 3be:   41 89 06                mov    %eax,(%r14)
 3c1:   e8 00 00 00 00          callq  3c6 <aer_inj_read_config+0x156>
 3c6:   31 c0                   xor    %eax,%eax
 3c8:   e9 4d ff ff ff          jmpq   31a <aer_inj_read_config+0xaa>
 3cd:   0f 1f 00                nopl   (%rax)
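
For reference, the faulting IP from the oops (aer_inj_read_config+0x87) can be matched against the listing above by adding the offset to the function's start address, which is 0x270 in this objdump output. A minimal sketch of that arithmetic follows; symbol_offset_to_address() is a hypothetical helper name used only for illustration:

```c
#include <stdint.h>

/*
 * Hypothetical helper: translate the "symbol+offset" form printed in
 * an oops into the address shown in an objdump listing of the same
 * object file.
 */
static uint64_t symbol_offset_to_address(uint64_t symbol_start,
					 uint64_t offset)
{
	return symbol_start + offset;
}
```

With symbol_start = 0x270 and offset = 0x87 this gives 0x2f7, the 'mov (%rax),%rax' instruction right before the indirect call. That would be consistent with 'ops' being NULL when ops->read is fetched, assuming 'read' sits at offset 0 of struct pci_ops in this kernel build.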




Thank you!



* Re: Kernel oops while using AER inject
  2020-12-03 16:00 Kernel oops while using AER inject Hinko Kocevar
@ 2020-12-14 18:37 ` Guilherme Piccoli
  2020-12-15  9:25   ` Hinko Kocevar
  2020-12-14 18:51 ` Keith Busch
  1 sibling, 1 reply; 5+ messages in thread
From: Guilherme Piccoli @ 2020-12-14 18:37 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: linux-pci

I see you're running a modified 3.10 kernel (Red Hat / CentOS?). I
suggest trying the upstream kernel, if possible. If the error
persists in the mainline kernel, you are likely to get more support
here!

Cheers,


Guilherme


* Re: Kernel oops while using AER inject
  2020-12-03 16:00 Kernel oops while using AER inject Hinko Kocevar
  2020-12-14 18:37 ` Guilherme Piccoli
@ 2020-12-14 18:51 ` Keith Busch
  2020-12-15  9:23   ` Hinko Kocevar
  1 sibling, 1 reply; 5+ messages in thread
From: Keith Busch @ 2020-12-14 18:51 UTC (permalink / raw)
  To: Hinko Kocevar; +Cc: linux-pci

On Thu, Dec 03, 2020 at 04:00:04PM +0000, Hinko Kocevar wrote:
> Note that I had to manually remove following devices to make the recovery report success:
> 
> echo 1 > /sys/bus/pci/devices/0000\:02\:02.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:02\:08.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:02\:09.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:02\:0a.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.1/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.2/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.3/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.4/remove 
> 
> After that, the PCI device tree looks like this:
> 
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  └─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>     └─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>        └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>           ├─04:00.0 downstream_port, slot 10, power: Off
>           ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>           │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>           ├─04:02.0 downstream_port, slot 9, power: Off
>           ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>           │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>           ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 2.5GT/s, width x4
>           │  └─09:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>           ├─04:09.0 downstream_port, slot 11, power: Off
>           ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 2.5GT/s, width x4
>           │  └─0b:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>           ├─04:0b.0 downstream_port, slot 12, power: Off
>           ├─04:10.0 downstream_port, slot 8, power: Off
>           ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>           │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>           └─04:12.0 downstream_port, slot 7, power: Off
> 00:01.1 root_port, slot 2, device present
> 00:1c.0 root_port, slot 4, device present, speed 2.5GT/s, width x1
>  └─15:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
> 00:1c.1 root_port, slot 5, device present, speed 2.5GT/s, width x1
>  └─16:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
> 00:1c.2 root_port, slot 6, device present, speed 2.5GT/s, width x1
>  └─17:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
> 00:1c.3 root_port, "J6B1", slot 7, device present, speed 2.5GT/s, width x1
>  └─18:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
> 
> Finally, here is the result of the AER recovery taking place upon injecting the fatal uncorrectable error into the 00:01.0 slot.
> 
> Dec  3 15:25:30 bd-cpu18 kernel: aer 0000:00:01.0:pcie002: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout    
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Device recovery successful
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
> Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Successfully activated service 'org.freedesktop.problems'
> 
> At this point the PCI space reads for the AMCs returns 0xFFFFFFFF.
> 
> Below are messages captured after issuing 'echo 1 > /sys/bus/pci/rescan'. After that the CPU card rebooted by itself.
> 
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.3: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.4: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0b.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:10.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:11.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:12.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
> Dec  3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
> Dec  3 15:25:59 bd-cpu18 kernel: PGD 80000001767c4067 PUD 431939067 PMD 0 
> Dec  3 15:25:59 bd-cpu18 kernel: Oops: 0000 [#1] SMP 
> Dec  3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
> Dec  3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
> Dec  3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.6.1.el7.x86_64.debug #1
> Dec  3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
> Dec  3 15:25:59 bd-cpu18 kernel: task: ffff9b22b5f88000 ti: ffff9b22b5bc4000 task.ti: ffff9b22b5bc4000
> Dec  3 15:25:59 bd-cpu18 kernel: RIP: 0010:[<ffffffffc0b182f7>]  [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
> Dec  3 15:25:59 bd-cpu18 kernel: RSP: 0018:ffff9b22b5bc7a30  EFLAGS: 00010046
> Dec  3 15:25:59 bd-cpu18 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> Dec  3 15:25:59 bd-cpu18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b25718ab000
> Dec  3 15:25:59 bd-cpu18 kernel: RBP: ffff9b22b5bc7a68 R08: ffff9b22b5bc7a84 R09: ffffffffc0b1a0b0
> Dec  3 15:25:59 bd-cpu18 kernel: R10: 0000000000000001 R11: 69c1fefdd26da26d R12: 0000000000000000
> Dec  3 15:25:59 bd-cpu18 kernel: R13: ffffffffc0b1a050 R14: ffff9b22b5bc7a84 R15: ffff9b25718ab000
> Dec  3 15:25:59 bd-cpu18 kernel: FS:  00007f3d00e8b740(0000) GS:ffff9b259ce00000(0000) knlGS:0000000000000000
> Dec  3 15:25:59 bd-cpu18 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Dec  3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000 CR3: 0000000175796000 CR4: 00000000003607e0
> Dec  3 15:25:59 bd-cpu18 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Dec  3 15:25:59 bd-cpu18 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Dec  3 15:25:59 bd-cpu18 kernel: Call Trace:
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
> Dec  3 15:25:59 bd-cpu18 kernel: Code: 81 f9 b0 a0 b1 c0 74 5c 4d 3b 79 10 75 ee 49 8b 41 18 4d 8b af b8 00 00 00 89 de 49 89 87 b8 00 00 00 4d 89 f0 44 89 e2 4c 89 ff <48> 8b 00 e8 41 f3 10 e2 48 8b 75 d0 89 c3 4d 89 af b8 00 00 00 

I believe this is a known issue with aer_inject when you re-enumerate
devices below a bridge with injected errors. I reported it here:

  https://lore.kernel.org/linux-pci/20180918235848.26694-3-keith.busch@intel.com/

Essentially, the re-enumerated devices inherit the injected bus_ops, but
aer_inject doesn't know about those devices.

The solution I proposed, however, had some CPU architectural and kernel
config limitations that made it less appealing.
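
To illustrate the failure mode described above, here is a minimal userspace sketch. It is not the kernel code and not the fix from the linked patch: the struct definitions are simplified stand-ins, and hook_bus(), find_bus_ops(), rescan_child() and ERR_NO_OPS are hypothetical names. It only shows how a bus that copies the injected ops pointer, without having its own saved entry, makes the lookup return NULL, and how a defensive check would avoid dereferencing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel types, not the real definitions. */
struct pci_bus;

struct pci_ops {
	int (*read)(struct pci_bus *bus, unsigned int devfn,
		    int where, int size, uint32_t *val);
};

struct pci_bus {
	int number;
	struct pci_ops *ops;
};

#define ERR_NO_OPS (-1)		/* hypothetical error code */
#define MAX_HOOKED 8

/* "Real" config read; the returned ID is only an illustrative value. */
static int real_read(struct pci_bus *bus, unsigned int devfn,
		     int where, int size, uint32_t *val)
{
	(void)bus; (void)devfn; (void)where; (void)size;
	*val = 0x19018086;
	return 0;
}

static struct pci_ops real_ops = { .read = real_read };

static int inj_read(struct pci_bus *bus, unsigned int devfn,
		    int where, int size, uint32_t *val);
static struct pci_ops inj_ops = { .read = inj_read };

/* Saved (bus, original ops) pairs, one per bus the injector hooked. */
static struct { struct pci_bus *bus; struct pci_ops *ops; } hooked[MAX_HOOKED];
static size_t n_hooked;

/* Remember the original ops of a bus and install the injecting ops. */
static void hook_bus(struct pci_bus *bus)
{
	assert(n_hooked < MAX_HOOKED);
	hooked[n_hooked].bus = bus;
	hooked[n_hooked].ops = bus->ops;
	n_hooked++;
	bus->ops = &inj_ops;
}

/* Returns the saved original ops, or NULL for a bus never hooked. */
static struct pci_ops *find_bus_ops(struct pci_bus *bus)
{
	for (size_t i = 0; i < n_hooked; i++)
		if (hooked[i].bus == bus)
			return hooked[i].ops;
	return NULL;
}

static int inj_read(struct pci_bus *bus, unsigned int devfn,
		    int where, int size, uint32_t *val)
{
	struct pci_ops *ops = find_bus_ops(bus);
	struct pci_ops *my_ops;
	int rv;

	/* Defensive check: without it, ops->read faults on unknown buses. */
	if (!ops || !ops->read) {
		*val = 0xffffffff;
		return ERR_NO_OPS;
	}
	/* Temporarily restore the original ops around the real read. */
	my_ops = bus->ops;
	bus->ops = ops;
	rv = ops->read(bus, devfn, where, size, val);
	bus->ops = my_ops;
	return rv;
}

/* A bus created during rescan copies its parent's current ops pointer. */
static void rescan_child(const struct pci_bus *parent, struct pci_bus *child)
{
	child->number = parent->number + 1;
	child->ops = parent->ops;
}
```

In this sketch, hooking a root bus and then calling rescan_child() leaves the child with the injecting ops but no saved entry, so a read on the child returns ERR_NO_OPS instead of crashing. Whether failing the read like this is acceptable in the real module is a separate design question.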


* Re: Kernel oops while using AER inject
  2020-12-14 18:51 ` Keith Busch
@ 2020-12-15  9:23   ` Hinko Kocevar
  0 siblings, 0 replies; 5+ messages in thread
From: Hinko Kocevar @ 2020-12-15  9:23 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci



On 12/14/20 7:51 PM, Keith Busch wrote:
> On Thu, Dec 03, 2020 at 04:00:04PM +0000, Hinko Kocevar wrote:
>> Note that I had to manually remove following devices to make the recovery report success:
>>
>> echo 1 > /sys/bus/pci/devices/0000\:02\:02.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:02\:08.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:02\:09.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:02\:0a.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.1/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.2/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.3/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.4/remove
>>
>> After that, the PCI device tree looks like this:
>>
>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>   └─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>      └─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>         └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>            ├─04:00.0 downstream_port, slot 10, power: Off
>>            ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>>            │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>            ├─04:02.0 downstream_port, slot 9, power: Off
>>            ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>>            │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>>            ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 2.5GT/s, width x4
>>            │  └─09:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>>            ├─04:09.0 downstream_port, slot 11, power: Off
>>            ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 2.5GT/s, width x4
>>            │  └─0b:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>>            ├─04:0b.0 downstream_port, slot 12, power: Off
>>            ├─04:10.0 downstream_port, slot 8, power: Off
>>            ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>>            │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>>            └─04:12.0 downstream_port, slot 7, power: Off
>> 00:01.1 root_port, slot 2, device present
>> 00:1c.0 root_port, slot 4, device present, speed 2.5GT/s, width x1
>>   └─15:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
>> 00:1c.1 root_port, slot 5, device present, speed 2.5GT/s, width x1
>>   └─16:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
>> 00:1c.2 root_port, slot 6, device present, speed 2.5GT/s, width x1
>>   └─17:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
>> 00:1c.3 root_port, "J6B1", slot 7, device present, speed 2.5GT/s, width x1
>>   └─18:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
>>
>> Finally, here is the result of the AER recovery taking place upon injecting the fatal uncorrectable error into the 00:01.0 slot.
>>
>> Dec  3 15:25:30 bd-cpu18 kernel: aer 0000:00:01.0:pcie002: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
>> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout
>> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_error_detected] called .. state=2
>> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_error_detected] called .. state=2
>> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_error_detected] called .. state=2
>> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_error_detected] called .. state=2
>> Dec  3 15:25:30 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_result_none] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_result_none] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_result_none] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_result_none] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_result_none] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] UC errors cleared!
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] UC errors cleared!
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] UC errors cleared!
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] UC errors cleared!
>> Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_resume] called..
>> Dec  3 15:25:31 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Device recovery successful
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
>> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
>> Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
>> Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Successfully activated service 'org.freedesktop.problems'
>>
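For reference, two values in the log above can be decoded by hand: the source id `0008` and the injected status mask `00004000`. The sketch below is illustrative Python and not part of the original report. It assumes the standard PCIe requester ID layout (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and uses a deliberately partial table of AER Uncorrectable Error Status bit names. All function and table names are made up for this example.

```python
# Illustrative decode of values from the AER log; all names here are hypothetical.

# Partial table of PCIe AER Uncorrectable Error Status bits (not exhaustive).
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    12: "Poisoned TLP",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    18: "Malformed TLP",
    20: "Unsupported Request",
}


def decode_requester_id(rid):
    """Split a 16-bit requester ID into (bus, device, function)."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7


def format_bdf(rid):
    """Render a requester ID in the bus:device.function form used by lspci."""
    bus, dev, fn = decode_requester_id(rid)
    return f"{bus:02x}:{dev:02x}.{fn}"


def decode_uncor_status(status):
    """Return the names of all bits set in an uncorrectable status word."""
    return [
        UNCOR_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    print(format_bdf(0x0008))               # id=0008 from the log
    print(decode_uncor_status(0x00004000))  # injected uncorrectable mask
```

With these assumptions, id `0008` maps back to the root port 00:01.0, and mask `00004000` has only bit 14 set, which matches the `[14] Completion Timeout` line in the log.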
>> At this point, PCI space reads for the AMCs return 0xFFFFFFFF.
>>
>> Below are messages captured after issuing 'echo 1 > /sys/bus/pci/rescan'. After that the CPU card rebooted by itself.
>>
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 1024)
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 1024)
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.3: Max Payload Size set to 256 (was 128, max 1024)
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.4: Max Payload Size set to 256 (was 128, max 1024)
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0b.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:10.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:11.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:12.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec  3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
>> Dec  3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
>> Dec  3 15:25:59 bd-cpu18 kernel: PGD 80000001767c4067 PUD 431939067 PMD 0
>> Dec  3 15:25:59 bd-cpu18 kernel: Oops: 0000 [#1] SMP
>> Dec  3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
>> Dec  3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
>> Dec  3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.6.1.el7.x86_64.debug #1
>> Dec  3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
>> Dec  3 15:25:59 bd-cpu18 kernel: task: ffff9b22b5f88000 ti: ffff9b22b5bc4000 task.ti: ffff9b22b5bc4000
>> Dec  3 15:25:59 bd-cpu18 kernel: RIP: 0010:[<ffffffffc0b182f7>]  [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
>> Dec  3 15:25:59 bd-cpu18 kernel: RSP: 0018:ffff9b22b5bc7a30  EFLAGS: 00010046
>> Dec  3 15:25:59 bd-cpu18 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
>> Dec  3 15:25:59 bd-cpu18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b25718ab000
>> Dec  3 15:25:59 bd-cpu18 kernel: RBP: ffff9b22b5bc7a68 R08: ffff9b22b5bc7a84 R09: ffffffffc0b1a0b0
>> Dec  3 15:25:59 bd-cpu18 kernel: R10: 0000000000000001 R11: 69c1fefdd26da26d R12: 0000000000000000
>> Dec  3 15:25:59 bd-cpu18 kernel: R13: ffffffffc0b1a050 R14: ffff9b22b5bc7a84 R15: ffff9b25718ab000
>> Dec  3 15:25:59 bd-cpu18 kernel: FS:  00007f3d00e8b740(0000) GS:ffff9b259ce00000(0000) knlGS:0000000000000000
>> Dec  3 15:25:59 bd-cpu18 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Dec  3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000 CR3: 0000000175796000 CR4: 00000000003607e0
>> Dec  3 15:25:59 bd-cpu18 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Dec  3 15:25:59 bd-cpu18 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Dec  3 15:25:59 bd-cpu18 kernel: Call Trace:
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
>> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
>> Dec  3 15:25:59 bd-cpu18 kernel: Code: 81 f9 b0 a0 b1 c0 74 5c 4d 3b 79 10 75 ee 49 8b 41 18 4d 8b af b8 00 00 00 89 de 49 89 87 b8 00 00 00 4d 89 f0 44 89 e2 4c 89 ff <48> 8b 00 e8 41 f3 10 e2 48 8b 75 d0 89 c3 4d 89 af b8 00 00 00
> 
> I believe this is a known issue with aer_inject when you re-enumerate
> devices below a bridge with injected errors. I reported it here:
> 
>    https://lore.kernel.org/linux-pci/20180918235848.26694-3-keith.busch@intel.com/
> 
> Essentially, the re-enumerated devices inherit the injected bus_ops, but
> aer_inject doesn't know about those devices.
> 
> The solution I proposed, however, had some CPU architectural and kernel
> config limitations that made it less appealing.
> 
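The failure mode described in the quoted text can be illustrated with a toy model. This is a sketch in Python, not the aer_inject source: the `Bus` and `Injector` classes and the `hardware_read` function are hypothetical stand-ins. It assumes that the injector swaps a bus's config accessors for its own wrapper and remembers the originals per bus, and that a newly enumerated child bus copies its parent's accessors.

```python
# Toy model of the inherited-bus_ops problem; not kernel code.
# Bus, Injector and hardware_read are hypothetical names for illustration.


def hardware_read(bus, where):
    """Stand-in for the real config-space accessor."""
    return 0xFFFFFFFF


class Bus:
    def __init__(self, number, ops=None, parent=None):
        self.number = number
        # A child bus created during (re-)enumeration copies its parent's ops.
        self.ops = parent.ops if parent is not None else ops


class Injector:
    def __init__(self):
        # Original accessors, keyed by the bus numbers the injector hooked.
        self.saved_ops = {}

    def hook(self, bus):
        """Remember the bus's original ops and install the wrapper."""
        self.saved_ops[bus.number] = bus.ops
        bus.ops = self.read_config

    def read_config(self, bus, where):
        """Wrapper: look up the original ops and forward the read."""
        original = self.saved_ops.get(bus.number)  # None for an unknown bus
        # No check for None here, mirroring the unchecked lookup in the model.
        return original(bus, where)


if __name__ == "__main__":
    injector = Injector()
    root = Bus(0, ops=hardware_read)
    injector.hook(root)
    print(hex(root.ops(root, 0)))      # hooked bus: forwarded normally

    child = Bus(1, parent=root)        # created by a rescan, inherits wrapper
    try:
        child.ops(child, 0)
    except TypeError as exc:
        print("lookup failed for unknown bus:", exc)
```

In this model the child bus ends up calling the injector's wrapper even though the injector never recorded it, so the lookup yields nothing and the forwarded call fails. In the model this surfaces as a `TypeError`; the oops above shows a NULL pointer dereference inside `aer_inj_read_config` during the rescan.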

I see that the changes introduced there touch the exact point of
failure that I'm seeing.

I've since stopped using the 3.10 kernel and moved to 5.9.12 for my
work. More recently I've also been looking at using Bjorn's git tree,
pci/err branch.

I will report back if I still see the breakage with these recent kernels.

Thanks!
//hinko



* Re: Kernel oops while using AER inject
  2020-12-14 18:37 ` Guilherme Piccoli
@ 2020-12-15  9:25   ` Hinko Kocevar
  0 siblings, 0 replies; 5+ messages in thread
From: Hinko Kocevar @ 2020-12-15  9:25 UTC (permalink / raw)
  To: Guilherme Piccoli; +Cc: linux-pci



On 12/14/20 7:37 PM, Guilherme Piccoli wrote:
> I see you're running a modified 3.10 kernel (Red Hat / CentOS?). I
> suggest you try the upstream kernel, if possible. If the error
> persists in the mainline kernel, it's likely you can get more support
> here!
> 

I'm going to use more recent kernels from the 5.9.x series, as well as
Bjorn's git tree, when reporting issues from now on.

Cheers,
//hinko

> Cheers,
> 
> 
> Guilherme
> 


Thread overview: 5+ messages
2020-12-03 16:00 Kernel oops while using AER inject Hinko Kocevar
2020-12-14 18:37 ` Guilherme Piccoli
2020-12-15  9:25   ` Hinko Kocevar
2020-12-14 18:51 ` Keith Busch
2020-12-15  9:23   ` Hinko Kocevar
