* Kernel oops while using AER inject
@ 2020-12-03 16:00 Hinko Kocevar
2020-12-14 18:37 ` Guilherme Piccoli
2020-12-14 18:51 ` Keith Busch
0 siblings, 2 replies; 5+ messages in thread
From: Hinko Kocevar @ 2020-12-03 16:00 UTC (permalink / raw)
To: linux-pci
Hi all,
I'm trying to use AER inject module and the aer-inject tool to trigger AER handling on a micro TCA based system. The reason is that I'm seeing this particular error appearing every now and then and is reported by the 00:01.0 PCI slot; happens to be root PCIe port on the CPU card. The system also contains a MCH with a PEX 8748 PCIe switch that provides PCIe connectivity to the regular AMC cards inside the crate. There are 5 AMCs connected in the crate , 4x use sis8300drv and one mrf kernel drivers (messages of those as the recovery is happening are seen below).
lspci output:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics P630 (rev 04)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.1 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #2 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1c.3 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #4 (rev f1)
00:1f.0 ISA bridge: Intel Corporation CM238 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
01:00.1 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
01:00.2 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
01:00.3 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
01:00.4 System peripheral: PLX Technology, Inc. Device 87d0 (rev ca)
02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:02.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
02:0a.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
03:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:00.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:01.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:02.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:03.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:08.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:09.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:0a.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:0b.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:10.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:11.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
04:12.0 PCI bridge: PLX Technology, Inc. PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (rev ca)
06:00.0 Signal processing controller: Research Centre Juelich Device 0024
08:00.0 Signal processing controller: Research Centre Juelich Device 0024
09:00.0 Signal processing controller: Research Centre Juelich Device 0024
0b:00.0 Signal processing controller: Research Centre Juelich Device 0024
0e:00.0 Signal processing controller: Xilinx Corporation Device 7011
15:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
16:00.0 Ethernet controller: Intel Corporation I210 Gigabit Backplane Connection (rev 03)
17:00.0 Ethernet controller: Intel Corporation I210 Gigabit Backplane Connection (rev 03)
18:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
Note that I had to manually remove following devices to make the recovery report success:
echo 1 > /sys/bus/pci/devices/0000\:02\:02.0/remove
echo 1 > /sys/bus/pci/devices/0000\:02\:08.0/remove
echo 1 > /sys/bus/pci/devices/0000\:02\:09.0/remove
echo 1 > /sys/bus/pci/devices/0000\:02\:0a.0/remove
echo 1 > /sys/bus/pci/devices/0000\:01\:00.1/remove
echo 1 > /sys/bus/pci/devices/0000\:01\:00.2/remove
echo 1 > /sys/bus/pci/devices/0000\:01\:00.3/remove
echo 1 > /sys/bus/pci/devices/0000\:01\:00.4/remove
After that, the PCI device tree looks like this:
00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
└─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
└─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
└─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
├─04:00.0 downstream_port, slot 10, power: Off
├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
│ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
├─04:02.0 downstream_port, slot 9, power: Off
├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
│ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 2.5GT/s, width x4
│ └─09:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
├─04:09.0 downstream_port, slot 11, power: Off
├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 2.5GT/s, width x4
│ └─0b:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
├─04:0b.0 downstream_port, slot 12, power: Off
├─04:10.0 downstream_port, slot 8, power: Off
├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
│ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
└─04:12.0 downstream_port, slot 7, power: Off
00:01.1 root_port, slot 2, device present
00:1c.0 root_port, slot 4, device present, speed 2.5GT/s, width x1
└─15:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
00:1c.1 root_port, slot 5, device present, speed 2.5GT/s, width x1
└─16:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
00:1c.2 root_port, slot 6, device present, speed 2.5GT/s, width x1
└─17:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
00:1c.3 root_port, "J6B1", slot 7, device present, speed 2.5GT/s, width x1
└─18:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
Finally, here is the result of the AER recovery taking place upon injecting the fatal uncorrectable error into the 00:01.0 slot.
Dec 3 15:25:30 bd-cpu18 kernel: aer 0000:00:01.0:pcie002: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout
Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_error_detected] called .. state=2
Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_error_detected] called .. state=2
Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_error_detected] called .. state=2
Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_error_detected] called .. state=2
Dec 3 15:25:30 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_result_none] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_result_none] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_result_none] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_result_none] called..
Dec 3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_result_none] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] UC errors cleared!
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] UC errors cleared!
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] UC errors cleared!
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] called..
Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] UC errors cleared!
Dec 3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_resume] called..
Dec 3 15:25:31 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Device recovery successful
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
Dec 3 15:25:47 bd-cpu18 dbus[1266]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Dec 3 15:25:47 bd-cpu18 dbus[1266]: [system] Successfully activated service 'org.freedesktop.problems'
At this point the PCI space reads for the AMCs returns 0xFFFFFFFF.
Below are messages captured after issuing 'echo 1 > /sys/bus/pci/rescan'. After that the CPU card rebooted by itself.
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 1024)
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 1024)
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.3: Max Payload Size set to 256 (was 128, max 1024)
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.4: Max Payload Size set to 256 (was 128, max 1024)
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0b.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:10.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:11.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:12.0: bridge configuration invalid ([bus 00-00]), reconfiguring
Dec 3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
Dec 3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
Dec 3 15:25:59 bd-cpu18 kernel: PGD 80000001767c4067 PUD 431939067 PMD 0
Dec 3 15:25:59 bd-cpu18 kernel: Oops: 0000 [#1] SMP
Dec 3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
Dec 3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
Dec 3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.6.1.el7.x86_64.debug #1
Dec 3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
Dec 3 15:25:59 bd-cpu18 kernel: task: ffff9b22b5f88000 ti: ffff9b22b5bc4000 task.ti: ffff9b22b5bc4000
Dec 3 15:25:59 bd-cpu18 kernel: RIP: 0010:[<ffffffffc0b182f7>] [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
Dec 3 15:25:59 bd-cpu18 kernel: RSP: 0018:ffff9b22b5bc7a30 EFLAGS: 00010046
Dec 3 15:25:59 bd-cpu18 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
Dec 3 15:25:59 bd-cpu18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b25718ab000
Dec 3 15:25:59 bd-cpu18 kernel: RBP: ffff9b22b5bc7a68 R08: ffff9b22b5bc7a84 R09: ffffffffc0b1a0b0
Dec 3 15:25:59 bd-cpu18 kernel: R10: 0000000000000001 R11: 69c1fefdd26da26d R12: 0000000000000000
Dec 3 15:25:59 bd-cpu18 kernel: R13: ffffffffc0b1a050 R14: ffff9b22b5bc7a84 R15: ffff9b25718ab000
Dec 3 15:25:59 bd-cpu18 kernel: FS: 00007f3d00e8b740(0000) GS:ffff9b259ce00000(0000) knlGS:0000000000000000
Dec 3 15:25:59 bd-cpu18 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000 CR3: 0000000175796000 CR4: 00000000003607e0
Dec 3 15:25:59 bd-cpu18 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 3 15:25:59 bd-cpu18 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 3 15:25:59 bd-cpu18 kernel: Call Trace:
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
Dec 3 15:25:59 bd-cpu18 kernel: Code: 81 f9 b0 a0 b1 c0 74 5c 4d 3b 79 10 75 ee 49 8b 41 18 4d 8b af b8 00 00 00 89 de 49 89 87 b8 00 00 00 4d 89 f0 44 89 e2 4c 89 ff <48> 8b 00 e8 41 f3 10 e2 48 8b 75 d0 89 c3 4d 89 af b8 00 00 00
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.711+0000: 2132: info : libvirt version: 4.5.0, package: 23.el7_7.6 (CentOS BuildSystem <http://bugs.centos.org>, 2020-03-17-23:39:10, x86-01.bsys.centos.org)
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.711+0000: 2132: info : hostname: bd-cpu18.cslab.esss.lu.se
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.711+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.1 not found: could not access /sys/bus/pci/devices/0000:01:00.1/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.713+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.2 not found: could not access /sys/bus/pci/devices/0000:01:00.2/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.714+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.3 not found: could not access /sys/bus/pci/devices/0000:01:00.3/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.714+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:01:00.4 not found: could not access /sys/bus/pci/devices/0000:01:00.4/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.715+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:02.0 not found: could not access /sys/bus/pci/devices/0000:02:02.0/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.715+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:09.0 not found: could not access /sys/bus/pci/devices/0000:02:09.0/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.715+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:08.0 not found: could not access /sys/bus/pci/devices/0000:02:08.0/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 libvirtd: 2020-12-03 14:25:59.716+0000: 2132: error : virPCIDeviceNew:1784 : Device 0000:02:0a.0 not found: could not access /sys/bus/pci/devices/0000:02:0a.0/config: No such file or directory
Dec 3 15:25:59 bd-cpu18 kernel: RIP [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
Dec 3 15:25:59 bd-cpu18 kernel: RSP <ffff9b22b5bc7a30>
Dec 3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000
Dec 3 15:25:59 bd-cpu18 kernel: ---[ end trace 452aceb2952fc499 ]---
Dec 3 15:25:59 bd-cpu18 kernel: BUG: sleeping function called from invalid context at kernel/rwsem.c:21
Dec 3 15:25:59 bd-cpu18 kernel: in_atomic(): 1, irqs_disabled(): 1, pid: 8150, name: bash
Dec 3 15:25:59 bd-cpu18 kernel: INFO: lockdep is turned off.
Dec 3 15:25:59 bd-cpu18 kernel: irq event stamp: 87980
Dec 3 15:25:59 bd-cpu18 kernel: hardirqs last enabled at (87979): [<ffffffffa308bcd6>] _raw_spin_unlock_irqrestore+0x36/0x70
Dec 3 15:25:59 bd-cpu18 kernel: hardirqs last disabled at (87980): [<ffffffffa308c55b>] _raw_spin_lock_irqsave+0x2b/0xa0
Dec 3 15:25:59 bd-cpu18 kernel: softirqs last enabled at (87300): [<ffffffffa28ba134>] __do_softirq+0x1a4/0x470
Dec 3 15:25:59 bd-cpu18 kernel: softirqs last disabled at (87281): [<ffffffffa309c2bc>] call_softirq+0x1c/0x30
Dec 3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G D OE ------------ 3.10.0-1160.6.1.el7.x86_64.debug #1
Dec 3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
Dec 3 15:25:59 bd-cpu18 kernel: Call Trace:
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3080c37>] dump_stack+0x19/0x1b
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28f500b>] __might_sleep+0x17b/0x240
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3087aba>] down_read+0x2a/0xb0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28ccad3>] exit_signals+0x33/0x150
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b5f6c>] do_exit+0xcc/0xb80
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2390>] ? kmsg_dump+0x130/0x1d0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2294>] ? kmsg_dump+0x34/0x1d0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa308f910>] oops_end+0xa0/0xe0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28836a4>] no_context+0x144/0x310
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa293eac3>] ? __bfs+0x103/0x280
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa288398a>] __bad_area_nosemaphore+0x11a/0x230
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2883ab4>] bad_area_nosemaphore+0x14/0x20
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092ce8>] __do_page_fault+0x358/0x5a0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2944229>] ? __lock_acquire+0xa49/0x1630
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092f65>] do_page_fault+0x35/0x90
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa308e818>] page_fault+0x28/0x30
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffc0b182f7>] ? aer_inj_read_config+0x87/0x160 [aer_inject]
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
Dec 3 15:25:59 bd-cpu18 kernel: BUG: scheduling while atomic: bash/8150/0x10000003
Dec 3 15:25:59 bd-cpu18 kernel: INFO: lockdep is turned off.
Dec 3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
Dec 3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
Dec 3 15:25:59 bd-cpu18 kernel: irq event stamp: 87980
Dec 3 15:25:59 bd-cpu18 kernel: hardirqs last enabled at (87979): [<ffffffffa308bcd6>] _raw_spin_unlock_irqrestore+0x36/0x70
Dec 3 15:25:59 bd-cpu18 kernel: hardirqs last disabled at (87980): [<ffffffffa308c55b>] _raw_spin_lock_irqsave+0x2b/0xa0
Dec 3 15:25:59 bd-cpu18 kernel: softirqs last enabled at (87300): [<ffffffffa28ba134>] __do_softirq+0x1a4/0x470
Dec 3 15:25:59 bd-cpu18 kernel: softirqs last disabled at (87281): [<ffffffffa309c2bc>] call_softirq+0x1c/0x30
Dec 3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G D OE ------------ 3.10.0-1160.6.1.el7.x86_64.debug #1
Dec 3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
Dec 3 15:25:59 bd-cpu18 kernel: Call Trace:
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3080c37>] dump_stack+0x19/0x1b
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3078933>] __schedule_bug+0x7d/0x8c
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa30887ec>] __schedule+0x7cc/0x810
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28f98e6>] __cond_resched+0x26/0x30
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3088b7a>] _cond_resched+0x3a/0x50
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3087abf>] down_read+0x2f/0xb0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28ccad3>] exit_signals+0x33/0x150
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b5f6c>] do_exit+0xcc/0xb80
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2390>] ? kmsg_dump+0x130/0x1d0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28b2294>] ? kmsg_dump+0x34/0x1d0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa308f910>] oops_end+0xa0/0xe0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa28836a4>] no_context+0x144/0x310
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa293eac3>] ? __bfs+0x103/0x280
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa288398a>] __bad_area_nosemaphore+0x11a/0x230
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2883ab4>] bad_area_nosemaphore+0x14/0x20
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092ce8>] __do_page_fault+0x358/0x5a0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2944229>] ? __lock_acquire+0xa49/0x1630
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3092f65>] do_page_fault+0x35/0x90
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa308e818>] page_fault+0x28/0x30
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffc0b182f7>] ? aer_inj_read_config+0x87/0x160 [aer_inject]
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
Dec 3 15:25:59 bd-cpu18 kernel: note: bash[8150] exited with preempt_count 2
Dec 3 15:26:00 bd-cpu18 sh: abrt-dump-oops: Found oopses: 2
Dec 3 15:26:00 bd-cpu18 sh: abrt-dump-oops: Creating problem directories
Dec 3 15:26:00 bd-cpu18 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
Dec 3 15:26:01 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:01 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:02 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:02 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:02 bd-cpu18 abrt-server: Looking for kernel package
Dec 3 15:26:02 bd-cpu18 abrt-dump-oops: Reported 2 kernel oopses to Abrt
Dec 3 15:26:02 bd-cpu18 abrt-server: Kernel package kernel-debug-3.10.0-1160.6.1.el7.x86_64 found
Dec 3 15:26:04 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:04 bd-cpu18 abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N]
Dec 3 15:26:04 bd-cpu18 abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N]
Dec 3 15:26:04 bd-cpu18 abrt-server: Sending an email...
Dec 3 15:26:04 bd-cpu18 abrt-server: Sending a notification email to: root@localhost
Dec 3 15:26:04 bd-cpu18 abrt-server: Email was sent to: root@localhost
Dec 3 15:26:05 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:05 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:05 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:06 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:06 bd-cpu18 abrt-server: Looking for kernel package
Dec 3 15:26:06 bd-cpu18 abrt-server: Kernel package kernel-debug-3.10.0-1160.6.1.el7.x86_64 found
Dec 3 15:26:07 bd-cpu18 rtkit-daemon[1338]: The canary thread is apparently starving. Taking action.
Dec 3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Demoting known real-time threads.
Dec 3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Successfully demoted thread 2594 of process 2591 (/usr/bin/pulseaudio).
Dec 3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Successfully demoted thread 2591 of process 2591 (/usr/bin/pulseaudio).
Dec 3 15:26:07 bd-cpu18 rtkit-daemon[1338]: Demoted 2 threads.
Dec 3 15:26:08 bd-cpu18 abrt-server: '/var/spool/abrt/oops-2020-12-03-14:42:51-4266-0' is not a problem directory
Dec 3 15:26:08 bd-cpu18 abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N]
Dec 3 15:26:08 bd-cpu18 abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N]
Dec 3 15:26:08 bd-cpu18 abrt-server: Sending an email...
Dec 3 15:26:08 bd-cpu18 abrt-server: Sending a notification email to: root@localhost
I guess this is the root cause of oops:
Dec 3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
Dec 3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
Looking at the actual source code of the running kernel this is the source of the offending aer_inj_read_config():
static int aer_inj_read_config(struct pci_bus *bus, unsigned int devfn,
int where, int size, u32 *val)
{
u32 *sim;
struct aer_error *err;
unsigned long flags;
struct pci_ops *ops;
struct pci_ops *my_ops;
int domain;
int rv;
spin_lock_irqsave(&inject_lock, flags);
if (size != sizeof(u32))
goto out;
domain = pci_domain_nr(bus);
if (domain < 0)
goto out;
err = __find_aer_error(domain, bus->number, devfn);
if (!err)
goto out;
sim = find_pci_config_dword(err, where, NULL);
if (sim) {
*val = *sim;
spin_unlock_irqrestore(&inject_lock, flags);
return 0;
}
out:
ops = __find_pci_bus_ops(bus);
/*
* pci_lock must already be held, so we can directly
* manipulate bus->ops. Many config access functions,
* including pci_generic_config_read() require the original
* bus->ops be installed to function, so temporarily put them
* back.
*/
my_ops = bus->ops;
bus->ops = ops;
rv = ops->read(bus, devfn, where, size, val);
bus->ops = my_ops;
spin_unlock_irqrestore(&inject_lock, flags);
return rv;
}
Here is 'objdump -D aer_inject.ko':
0000000000000270 <aer_inj_read_config>:
270: e8 00 00 00 00 callq 275 <aer_inj_read_config+0x5>
275: 55 push %rbp
276: 48 89 e5 mov %rsp,%rbp
279: 41 57 push %r15
27b: 49 89 ff mov %rdi,%r15
27e: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
285: 41 56 push %r14
287: 4d 89 c6 mov %r8,%r14
28a: 41 55 push %r13
28c: 41 54 push %r12
28e: 41 89 d4 mov %edx,%r12d
291: 53 push %rbx
292: 89 f3 mov %esi,%ebx
294: 48 83 ec 10 sub $0x10,%rsp
298: 89 4d cc mov %ecx,-0x34(%rbp)
29b: e8 00 00 00 00 callq 2a0 <aer_inj_read_config+0x30>
2a0: 8b 4d cc mov -0x34(%rbp),%ecx
2a3: 48 89 45 d0 mov %rax,-0x30(%rbp)
2a7: 83 f9 04 cmp $0x4,%ecx
2aa: 0f 84 88 00 00 00 je 338 <aer_inj_read_config+0xc8>
2b0: 4c 8b 0d 00 00 00 00 mov 0x0(%rip),%r9 # 2b7 <aer_inj_read_config+0x47>
2b7: 49 81 f9 00 00 00 00 cmp $0x0,%r9
2be: 75 14 jne 2d4 <aer_inj_read_config+0x64>
2c0: eb 6e jmp 330 <aer_inj_read_config+0xc0>
2c2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
2c8: 4d 8b 09 mov (%r9),%r9
2cb: 49 81 f9 00 00 00 00 cmp $0x0,%r9
2d2: 74 5c je 330 <aer_inj_read_config+0xc0>
2d4: 4d 3b 79 10 cmp 0x10(%r9),%r15
2d8: 75 ee jne 2c8 <aer_inj_read_config+0x58>
2da: 49 8b 41 18 mov 0x18(%r9),%rax
2de: 4d 8b af b8 00 00 00 mov 0xb8(%r15),%r13
2e5: 89 de mov %ebx,%esi
2e7: 49 89 87 b8 00 00 00 mov %rax,0xb8(%r15)
2ee: 4d 89 f0 mov %r14,%r8
2f1: 44 89 e2 mov %r12d,%edx
2f4: 4c 89 ff mov %r15,%rdi
2f7: 48 8b 00 mov (%rax),%rax
2fa: e8 00 00 00 00 callq 2ff <aer_inj_read_config+0x8f>
2ff: 48 8b 75 d0 mov -0x30(%rbp),%rsi
303: 89 c3 mov %eax,%ebx
305: 4d 89 af b8 00 00 00 mov %r13,0xb8(%r15)
30c: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
313: e8 00 00 00 00 callq 318 <aer_inj_read_config+0xa8>
318: 89 d8 mov %ebx,%eax
31a: 48 83 c4 10 add $0x10,%rsp
31e: 5b pop %rbx
31f: 41 5c pop %r12
321: 41 5d pop %r13
323: 41 5e pop %r14
325: 41 5f pop %r15
327: 5d pop %rbp
328: c3 retq
329: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
330: 31 c0 xor %eax,%eax
332: eb aa jmp 2de <aer_inj_read_config+0x6e>
334: 0f 1f 40 00 nopl 0x0(%rax)
338: 49 8b 87 c8 00 00 00 mov 0xc8(%r15),%rax
33f: 8b 00 mov (%rax),%eax
341: 85 c0 test %eax,%eax
343: 0f 88 67 ff ff ff js 2b0 <aer_inj_read_config+0x40>
349: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 350 <aer_inj_read_config+0xe0>
350: 41 0f b6 97 d8 00 00 movzbl 0xd8(%r15),%edx
357: 00
358: 48 81 ff 00 00 00 00 cmp $0x0,%rdi
35f: 0f 84 4b ff ff ff je 2b0 <aer_inj_read_config+0x40>
365: 3b 47 10 cmp 0x10(%rdi),%eax
368: 74 1b je 385 <aer_inj_read_config+0x115>
36a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
370: 48 8b 3f mov (%rdi),%rdi
373: 48 81 ff 00 00 00 00 cmp $0x0,%rdi
37a: 0f 84 30 ff ff ff je 2b0 <aer_inj_read_config+0x40>
380: 3b 47 10 cmp 0x10(%rdi),%eax
383: 75 eb jne 370 <aer_inj_read_config+0x100>
385: 3b 57 14 cmp 0x14(%rdi),%edx
388: 75 e6 jne 370 <aer_inj_read_config+0x100>
38a: 3b 5f 18 cmp 0x18(%rdi),%ebx
38d: 75 e1 jne 370 <aer_inj_read_config+0x100>
38f: 48 85 ff test %rdi,%rdi
392: 0f 84 18 ff ff ff je 2b0 <aer_inj_read_config+0x40>
398: 31 d2 xor %edx,%edx
39a: 44 89 e6 mov %r12d,%esi
39d: 89 4d cc mov %ecx,-0x34(%rbp)
3a0: e8 5b fc ff ff callq 0 <find_pci_config_dword>
3a5: 48 85 c0 test %rax,%rax
3a8: 8b 4d cc mov -0x34(%rbp),%ecx
3ab: 0f 84 ff fe ff ff je 2b0 <aer_inj_read_config+0x40>
3b1: 8b 00 mov (%rax),%eax
3b3: 48 8b 75 d0 mov -0x30(%rbp),%rsi
3b7: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
3be: 41 89 06 mov %eax,(%r14)
3c1: e8 00 00 00 00 callq 3c6 <aer_inj_read_config+0x156>
3c6: 31 c0 xor %eax,%eax
3c8: e9 4d ff ff ff jmpq 31a <aer_inj_read_config+0xaa>
3cd: 0f 1f 00 nopl (%rax)
Thank you!
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Kernel oops while using AER inject
2020-12-03 16:00 Kernel oops while using AER inject Hinko Kocevar
@ 2020-12-14 18:37 ` Guilherme Piccoli
2020-12-15 9:25 ` Hinko Kocevar
2020-12-14 18:51 ` Keith Busch
1 sibling, 1 reply; 5+ messages in thread
From: Guilherme Piccoli @ 2020-12-14 18:37 UTC (permalink / raw)
To: Hinko Kocevar; +Cc: linux-pci
I see you're running a 3.10 modified kernel (Red Hat / CentOS ?) -
suggest you to try the upstream kernel, if possible. If the error
persists in the mainline kernel, it's likely you can get more support
here!
Cheers,
Guilherme
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Kernel oops while using AER inject
2020-12-03 16:00 Kernel oops while using AER inject Hinko Kocevar
2020-12-14 18:37 ` Guilherme Piccoli
@ 2020-12-14 18:51 ` Keith Busch
2020-12-15 9:23 ` Hinko Kocevar
1 sibling, 1 reply; 5+ messages in thread
From: Keith Busch @ 2020-12-14 18:51 UTC (permalink / raw)
To: Hinko Kocevar; +Cc: linux-pci
On Thu, Dec 03, 2020 at 04:00:04PM +0000, Hinko Kocevar wrote:
> Note that I had to manually remove following devices to make the recovery report success:
>
> echo 1 > /sys/bus/pci/devices/0000\:02\:02.0/remove
> echo 1 > /sys/bus/pci/devices/0000\:02\:08.0/remove
> echo 1 > /sys/bus/pci/devices/0000\:02\:09.0/remove
> echo 1 > /sys/bus/pci/devices/0000\:02\:0a.0/remove
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.1/remove
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.2/remove
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.3/remove
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.4/remove
>
> After that, the PCI device tree looks like this:
>
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
> └─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
> └─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
> └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
> ├─04:00.0 downstream_port, slot 10, power: Off
> ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
> │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
> ├─04:02.0 downstream_port, slot 9, power: Off
> ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
> │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
> ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 2.5GT/s, width x4
> │ └─09:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
> ├─04:09.0 downstream_port, slot 11, power: Off
> ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 2.5GT/s, width x4
> │ └─0b:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
> ├─04:0b.0 downstream_port, slot 12, power: Off
> ├─04:10.0 downstream_port, slot 8, power: Off
> ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
> │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
> └─04:12.0 downstream_port, slot 7, power: Off
> 00:01.1 root_port, slot 2, device present
> 00:1c.0 root_port, slot 4, device present, speed 2.5GT/s, width x1
> └─15:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
> 00:1c.1 root_port, slot 5, device present, speed 2.5GT/s, width x1
> └─16:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
> 00:1c.2 root_port, slot 6, device present, speed 2.5GT/s, width x1
> └─17:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
> 00:1c.3 root_port, "J6B1", slot 7, device present, speed 2.5GT/s, width x1
> └─18:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
>
> Finally, here is the result of the AER recovery taking place upon injecting the fatal uncorrectable error into the 00:01.0 slot.
>
> Dec 3 15:25:30 bd-cpu18 kernel: aer 0000:00:01.0:pcie002: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout
> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_error_detected] called .. state=2
> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_error_detected] called .. state=2
> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_error_detected] called .. state=2
> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_error_detected] called .. state=2
> Dec 3 15:25:30 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_result_none] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_result_none] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_result_none] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_result_none] called..
> Dec 3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_result_none] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] UC errors cleared!
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] UC errors cleared!
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] UC errors cleared!
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] called..
> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] UC errors cleared!
> Dec 3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_resume] called..
> Dec 3 15:25:31 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Device recovery successful
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
> Dec 3 15:25:47 bd-cpu18 dbus[1266]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
> Dec 3 15:25:47 bd-cpu18 dbus[1266]: [system] Successfully activated service 'org.freedesktop.problems'
>
> At this point the PCI space reads for the AMCs returns 0xFFFFFFFF.
>
> Below are messages captured after issuing 'echo 1 > /sys/bus/pci/rescan'. After that the CPU card rebooted by itself.
>
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 1024)
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 1024)
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.3: Max Payload Size set to 256 (was 128, max 1024)
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.4: Max Payload Size set to 256 (was 128, max 1024)
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0b.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:10.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:11.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:12.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec 3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
> Dec 3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
> Dec 3 15:25:59 bd-cpu18 kernel: PGD 80000001767c4067 PUD 431939067 PMD 0
> Dec 3 15:25:59 bd-cpu18 kernel: Oops: 0000 [#1] SMP
> Dec 3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
> Dec 3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
> Dec 3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.6.1.el7.x86_64.debug #1
> Dec 3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
> Dec 3 15:25:59 bd-cpu18 kernel: task: ffff9b22b5f88000 ti: ffff9b22b5bc4000 task.ti: ffff9b22b5bc4000
> Dec 3 15:25:59 bd-cpu18 kernel: RIP: 0010:[<ffffffffc0b182f7>] [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
> Dec 3 15:25:59 bd-cpu18 kernel: RSP: 0018:ffff9b22b5bc7a30 EFLAGS: 00010046
> Dec 3 15:25:59 bd-cpu18 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> Dec 3 15:25:59 bd-cpu18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b25718ab000
> Dec 3 15:25:59 bd-cpu18 kernel: RBP: ffff9b22b5bc7a68 R08: ffff9b22b5bc7a84 R09: ffffffffc0b1a0b0
> Dec 3 15:25:59 bd-cpu18 kernel: R10: 0000000000000001 R11: 69c1fefdd26da26d R12: 0000000000000000
> Dec 3 15:25:59 bd-cpu18 kernel: R13: ffffffffc0b1a050 R14: ffff9b22b5bc7a84 R15: ffff9b25718ab000
> Dec 3 15:25:59 bd-cpu18 kernel: FS: 00007f3d00e8b740(0000) GS:ffff9b259ce00000(0000) knlGS:0000000000000000
> Dec 3 15:25:59 bd-cpu18 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Dec 3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000 CR3: 0000000175796000 CR4: 00000000003607e0
> Dec 3 15:25:59 bd-cpu18 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Dec 3 15:25:59 bd-cpu18 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Dec 3 15:25:59 bd-cpu18 kernel: Call Trace:
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
> Dec 3 15:25:59 bd-cpu18 kernel: Code: 81 f9 b0 a0 b1 c0 74 5c 4d 3b 79 10 75 ee 49 8b 41 18 4d 8b af b8 00 00 00 89 de 49 89 87 b8 00 00 00 4d 89 f0 44 89 e2 4c 89 ff <48> 8b 00 e8 41 f3 10 e2 48 8b 75 d0 89 c3 4d 89 af b8 00 00 00
I believe this is a known issue with aer_inject when you re-enumerate
devices below a bridge with injected errors. I reported it here:
https://lore.kernel.org/linux-pci/20180918235848.26694-3-keith.busch@intel.com/
Essentially, the re-enumerated devices inherit the injected bus_ops, but
aer_inject doesn't know about those devices.
The solution I proposed, however, had some CPU architectural and kernel
config limitations that made it less appealing.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Kernel oops while using AER inject
2020-12-14 18:51 ` Keith Busch
@ 2020-12-15 9:23 ` Hinko Kocevar
0 siblings, 0 replies; 5+ messages in thread
From: Hinko Kocevar @ 2020-12-15 9:23 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-pci
On 12/14/20 7:51 PM, Keith Busch wrote:
> On Thu, Dec 03, 2020 at 04:00:04PM +0000, Hinko Kocevar wrote:
>> Note that I had to manually remove following devices to make the recovery report success:
>>
>> echo 1 > /sys/bus/pci/devices/0000\:02\:02.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:02\:08.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:02\:09.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:02\:0a.0/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.1/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.2/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.3/remove
>> echo 1 > /sys/bus/pci/devices/0000\:01\:00.4/remove
>>
>> After that, the PCI device tree looks like this:
>>
>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>> └─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>> └─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>> └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>> ├─04:00.0 downstream_port, slot 10, power: Off
>> ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>> │ └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>> ├─04:02.0 downstream_port, slot 9, power: Off
>> ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>> │ └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>> ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 2.5GT/s, width x4
>> │ └─09:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>> ├─04:09.0 downstream_port, slot 11, power: Off
>> ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 2.5GT/s, width x4
>> │ └─0b:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>> ├─04:0b.0 downstream_port, slot 12, power: Off
>> ├─04:10.0 downstream_port, slot 8, power: Off
>> ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>> │ └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>> └─04:12.0 downstream_port, slot 7, power: Off
>> 00:01.1 root_port, slot 2, device present
>> 00:1c.0 root_port, slot 4, device present, speed 2.5GT/s, width x1
>> └─15:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
>> 00:1c.1 root_port, slot 5, device present, speed 2.5GT/s, width x1
>> └─16:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
>> 00:1c.2 root_port, slot 6, device present, speed 2.5GT/s, width x1
>> └─17:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
>> 00:1c.3 root_port, "J6B1", slot 7, device present, speed 2.5GT/s, width x1
>> └─18:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
>>
>> Finally, here is the result of the AER recovery taking place upon injecting the fatal uncorrectable error into the 00:01.0 slot.
>>
>> Dec 3 15:25:30 bd-cpu18 kernel: aer 0000:00:01.0:pcie002: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
>> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
>> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
>> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: device [8086:1901] error status/mask=00004000/00000000
>> Dec 3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: [14] Completion Timeout
>> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_error_detected] called .. state=2
>> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_error_detected] called .. state=2
>> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_error_detected] called .. state=2
>> Dec 3 15:25:30 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_error_detected] called .. state=2
>> Dec 3 15:25:30 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_result_none] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_result_none] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_result_none] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_result_none] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_result_none] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] UC errors cleared!
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] UC errors cleared!
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] UC errors cleared!
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] UC errors cleared!
>> Dec 3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_resume] called..
>> Dec 3 15:25:31 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Device recovery successful
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
>> Dec 3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
>> Dec 3 15:25:47 bd-cpu18 dbus[1266]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
>> Dec 3 15:25:47 bd-cpu18 dbus[1266]: [system] Successfully activated service 'org.freedesktop.problems'
>>
>> At this point the PCI space reads for the AMCs returns 0xFFFFFFFF.
>>
>> Below are messages captured after issuing 'echo 1 > /sys/bus/pci/rescan'. After that the CPU card rebooted by itself.
>>
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 1024)
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 1024)
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.3: Max Payload Size set to 256 (was 128, max 1024)
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.4: Max Payload Size set to 256 (was 128, max 1024)
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pci 0000:02:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0b.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:10.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:11.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:12.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> Dec 3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
>> Dec 3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
>> Dec 3 15:25:59 bd-cpu18 kernel: PGD 80000001767c4067 PUD 431939067 PMD 0
>> Dec 3 15:25:59 bd-cpu18 kernel: Oops: 0000 [#1] SMP
>> Dec 3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
>> Dec 3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
>> Dec 3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.6.1.el7.x86_64.debug #1
>> Dec 3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
>> Dec 3 15:25:59 bd-cpu18 kernel: task: ffff9b22b5f88000 ti: ffff9b22b5bc4000 task.ti: ffff9b22b5bc4000
>> Dec 3 15:25:59 bd-cpu18 kernel: RIP: 0010:[<ffffffffc0b182f7>] [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
>> Dec 3 15:25:59 bd-cpu18 kernel: RSP: 0018:ffff9b22b5bc7a30 EFLAGS: 00010046
>> Dec 3 15:25:59 bd-cpu18 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
>> Dec 3 15:25:59 bd-cpu18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b25718ab000
>> Dec 3 15:25:59 bd-cpu18 kernel: RBP: ffff9b22b5bc7a68 R08: ffff9b22b5bc7a84 R09: ffffffffc0b1a0b0
>> Dec 3 15:25:59 bd-cpu18 kernel: R10: 0000000000000001 R11: 69c1fefdd26da26d R12: 0000000000000000
>> Dec 3 15:25:59 bd-cpu18 kernel: R13: ffffffffc0b1a050 R14: ffff9b22b5bc7a84 R15: ffff9b25718ab000
>> Dec 3 15:25:59 bd-cpu18 kernel: FS: 00007f3d00e8b740(0000) GS:ffff9b259ce00000(0000) knlGS:0000000000000000
>> Dec 3 15:25:59 bd-cpu18 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Dec 3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000 CR3: 0000000175796000 CR4: 00000000003607e0
>> Dec 3 15:25:59 bd-cpu18 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Dec 3 15:25:59 bd-cpu18 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Dec 3 15:25:59 bd-cpu18 kernel: Call Trace:
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
>> Dec 3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
>> Dec 3 15:25:59 bd-cpu18 kernel: Code: 81 f9 b0 a0 b1 c0 74 5c 4d 3b 79 10 75 ee 49 8b 41 18 4d 8b af b8 00 00 00 89 de 49 89 87 b8 00 00 00 4d 89 f0 44 89 e2 4c 89 ff <48> 8b 00 e8 41 f3 10 e2 48 8b 75 d0 89 c3 4d 89 af b8 00 00 00
>
> I believe this is a known issue with aer_inject when you re-enumerate
> devices below a bridge with injected errors. I reported it here:
>
> https://lore.kernel.org/linux-pci/20180918235848.26694-3-keith.busch@intel.com/
>
> Essentially, the re-enumerated devices inherit the injected bus_ops, but
> aer_inject doesn't know about those devices.
>
> The solution I proposed, however, had some CPU architectural and kernel
> config limitations that made it less appealing.
>
I see that the changes that were introduced seem to touch the exact
point of failure as I'm seeing.
I've stopped using the 3.10 kernel since, and went for 5.9.12 for my
work. More recently I've been also looking at using the Bjorn's git
tree, pci/err branch.
Will report if I still see the breakage in these recent kernels.
Thanks!
//hinko
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Kernel oops while using AER inject
2020-12-14 18:37 ` Guilherme Piccoli
@ 2020-12-15 9:25 ` Hinko Kocevar
0 siblings, 0 replies; 5+ messages in thread
From: Hinko Kocevar @ 2020-12-15 9:25 UTC (permalink / raw)
To: Guilherme Piccoli; +Cc: linux-pci
On 12/14/20 7:37 PM, Guilherme Piccoli wrote:
> I see you're running a 3.10 modified kernel (Red Hat / CentOS ?) -
> suggest you to try the upstream kernel, if possible. If the error
> persists in the mainline kernel, it's likely you can get more support
> here!
>
I'm going to use with more recent kernels from 5.9.x series as well as
Bjorn's git when reporting issues from now on.
Cheers,
//hinko
> Cheers,
>
>
> Guilherme
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-12-15 9:27 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-03 16:00 Kernel oops while using AER inject Hinko Kocevar
2020-12-14 18:37 ` Guilherme Piccoli
2020-12-15 9:25 ` Hinko Kocevar
2020-12-14 18:51 ` Keith Busch
2020-12-15 9:23 ` Hinko Kocevar
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).