Linux-PCI Archive on lore.kernel.org
* [BUG] rk3399-rockpro64 pcie synchronous external abort
@ 2019-11-04 18:55 Peter Geis
  2019-11-09  1:08 ` Peter Geis
  2019-11-12  0:13 ` Bjorn Helgaas
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Geis @ 2019-11-04 18:55 UTC (permalink / raw)
  To: shawn.lin, Heiko Stuebner; +Cc: linux-pci, linux-rockchip

Good Morning,

I'm attempting to debug an issue with the rockpro64 pcie port.
The port appears to have trouble with a variety of cards; even cards
of the same make, such as Intel i340-based NICs, randomly work or do
not work.
I'm currently experiencing it with a GTX 645 GPU.

This seems to be a long-running issue, referenced at both [0] and [1].
There was an attempt to rectify it by adding a delay between link
training and probing [2], but that doesn't seem to be the issue here.
It appears that probing further into the card, e.g. devfn > 1,
triggers the bug.
I've added a print statement that logs the devfn, address, and size of
each config access, which you can see below.

I've also tried limiting the port to a single lane, with no difference.

If anyone could point me in the right direction as to where to
continue debugging, I'd greatly appreciate it.

[0] https://github.com/ayufan-rock64/linux-build/issues/254
[1] https://github.com/rockchip-linux/kernel/issues/116
[2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b

[  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
for per-lane PHY
[  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
[  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
[  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
[  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
[  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
[  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
[  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
[  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
[  198.550656] rockchip-pcie f8000000.pcie:   MEM
0xfa000000..0xfbdfffff -> 0xfa000000
[  198.550676] rockchip-pcie f8000000.pcie:    IO
0xfbe00000..0xfbefffff -> 0xfbe00000
[  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
[  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
[  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
[  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
(bus address [0xfbe00000-0xfbefffff])
[  198.552965] pci_bus 0000:00: scanning bus
[  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
[  198.555508] pci 0000:00:00.0: supports D1
[  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[  198.556023] pci 0000:00:00.0: PME# disabled
[  198.561245] pci_bus 0000:00: fixups for bus
[  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
[  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
[  198.567008] pci_bus 0000:01: scanning bus
[  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
[  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
[  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
64bit pref]
[  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
64bit pref]
[  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
[  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
[  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
[  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
[  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
with 8 GT/s x16 link)
[  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
decodes=io+mem,owns=none,locks=none
[  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
[  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
[  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
<snip>
[  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
[  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
[  198.750881] Internal error: synchronous external abort: 96000210
[#1] PREEMPT SMP
[  198.751581] Modules linked in: drm_panel_orientation_quirks
pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
crc_ccitt nf_defrag_ipv6
[  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
[  198.753791] Hardware name: Pine64 RockPro64 (DT)
[  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
[  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
[  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
[  198.755773] sp : ffff8000132af530
[  198.756079] x29: ffff8000132af530 x28: 0000000000000000
[  198.756565] x27: 0000000000000001 x26: 0000000000000000
[  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
[  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
[  198.758018] x21: 0000000000002000 x20: 0000000000000004
[  198.758502] x19: 0000000000102000 x18: 0000000000000001
[  198.758987] x17: 0000000000000000 x16: 0000000000000000
[  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
[  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
[  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
[  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
[  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
[  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
[  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
[  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
[  198.763348] Call trace:
[  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
[  198.764057]  pci_bus_read_config_dword+0x88/0xd0
[  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
[  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
[  198.765413]  pci_scan_single_device+0x84/0xf8
[  198.765812]  pci_scan_slot+0x7c/0x120
[  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
[  198.766579]  pci_scan_bridge_extend+0x350/0x588
[  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
[  198.767430]  pci_scan_child_bus+0x24/0x30
[  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
[  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
[  198.768770]  platform_drv_probe+0x58/0xa8
[  198.769139]  really_probe+0xe0/0x318
[  198.769468]  driver_probe_device+0x5c/0xf0
[  198.769844]  device_driver_attach+0x74/0x80
[  198.770227]  __driver_attach+0x64/0xe8
[  198.770572]  bus_for_each_dev+0x84/0xd8
[  198.770924]  driver_attach+0x30/0x40
[  198.771253]  bus_add_driver+0x188/0x1e8
[  198.771605]  driver_register+0x64/0x110
[  198.771956]  __platform_driver_register+0x54/0x60
[  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
[  198.772998]  do_one_initcall+0x94/0x390
[  198.773353]  do_init_module+0x88/0x268
[  198.773697]  load_module+0x1e18/0x2198
[  198.774043]  __do_sys_finit_module+0xd0/0xe8
[  198.774435]  __arm64_sys_finit_module+0x28/0x38
[  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
[  198.775297]  el0_svc_handler+0x34/0xa0
[  198.775645]  el0_svc+0x14/0x40
[  198.775928]  el0_sync_handler+0x118/0x290
[  198.776295]  el0_sync+0x164/0x180
[  198.776609] Code: bad PC value
[  198.776897] ---[ end trace 88fc77651b5e2909 ]---

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-04 18:55 [BUG] rk3399-rockpro64 pcie synchronous external abort Peter Geis
@ 2019-11-09  1:08 ` Peter Geis
  2019-11-09 16:37   ` Doug Anderson
  2019-11-12  0:13 ` Bjorn Helgaas
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-09  1:08 UTC (permalink / raw)
  To: shawn.lin, Heiko Stuebner; +Cc: linux-pci, linux-rockchip

Good Evening,

I'm not sure, but I believe the pcie address space built into the
rk3399 is not large enough to accommodate the pcie addresses the card
requires.
I've been trying to figure out whether it's possible to use system
RAM instead, but so far I haven't been successful.
Also, the RAM layout of the rk3399 is odd given the TRM; if anyone
has any insights into that, I'd be grateful.

On Mon, Nov 4, 2019 at 1:55 PM Peter Geis <pgwipeout@gmail.com> wrote:
>
> Good Morning,
>
> I'm attempting to debug an issue with the rockpro64 pcie port.
> The port appears to have trouble with a variety of cards; even cards
> of the same make, such as Intel i340-based NICs, randomly work or do
> not work.
> I'm currently experiencing it with a GTX 645 GPU.
>
> This seems to be a long-running issue, referenced at both [0] and [1].
> There was an attempt to rectify it by adding a delay between link
> training and probing [2], but that doesn't seem to be the issue here.
> It appears that probing further into the card, e.g. devfn > 1,
> triggers the bug.
> I've added a print statement that logs the devfn, address, and size of
> each config access, which you can see below.
>
> I've also tried limiting the port to a single lane, with no difference.
>
> If anyone could point me in the right direction as to where to
> continue debugging, I'd greatly appreciate it.
>
> [0] https://github.com/ayufan-rock64/linux-build/issues/254
> [1] https://github.com/rockchip-linux/kernel/issues/116
> [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
>
> [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> for per-lane PHY
> [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> 0xfa000000..0xfbdfffff -> 0xfa000000
> [  198.550676] rockchip-pcie f8000000.pcie:    IO
> 0xfbe00000..0xfbefffff -> 0xfbe00000
> [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> (bus address [0xfbe00000-0xfbefffff])
> [  198.552965] pci_bus 0000:00: scanning bus
> [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> [  198.555508] pci 0000:00:00.0: supports D1
> [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> [  198.556023] pci 0000:00:00.0: PME# disabled
> [  198.561245] pci_bus 0000:00: fixups for bus
> [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> 00-00]), reconfiguring
> [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> [  198.567008] pci_bus 0000:01: scanning bus
> [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> 64bit pref]
> [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> 64bit pref]
> [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> with 8 GT/s x16 link)
> [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> decodes=io+mem,owns=none,locks=none
> [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> <snip>
> [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> [  198.750881] Internal error: synchronous external abort: 96000210
> [#1] PREEMPT SMP
> [  198.751581] Modules linked in: drm_panel_orientation_quirks
> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> crc_ccitt nf_defrag_ipv6
> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> [  198.755773] sp : ffff8000132af530
> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> [  198.763348] Call trace:
> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> [  198.765413]  pci_scan_single_device+0x84/0xf8
> [  198.765812]  pci_scan_slot+0x7c/0x120
> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> [  198.767430]  pci_scan_child_bus+0x24/0x30
> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> [  198.768770]  platform_drv_probe+0x58/0xa8
> [  198.769139]  really_probe+0xe0/0x318
> [  198.769468]  driver_probe_device+0x5c/0xf0
> [  198.769844]  device_driver_attach+0x74/0x80
> [  198.770227]  __driver_attach+0x64/0xe8
> [  198.770572]  bus_for_each_dev+0x84/0xd8
> [  198.770924]  driver_attach+0x30/0x40
> [  198.771253]  bus_add_driver+0x188/0x1e8
> [  198.771605]  driver_register+0x64/0x110
> [  198.771956]  __platform_driver_register+0x54/0x60
> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> [  198.772998]  do_one_initcall+0x94/0x390
> [  198.773353]  do_init_module+0x88/0x268
> [  198.773697]  load_module+0x1e18/0x2198
> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> [  198.775297]  el0_svc_handler+0x34/0xa0
> [  198.775645]  el0_svc+0x14/0x40
> [  198.775928]  el0_sync_handler+0x118/0x290
> [  198.776295]  el0_sync+0x164/0x180
> [  198.776609] Code: bad PC value
> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---


* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-09  1:08 ` Peter Geis
@ 2019-11-09 16:37   ` Doug Anderson
  2019-11-10 15:43     ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Doug Anderson @ 2019-11-09 16:37 UTC (permalink / raw)
  To: Peter Geis
  Cc: Shawn Lin, Heiko Stuebner, linux-pci, open list:ARM/Rockchip SoC...

Hi,

On Fri, Nov 8, 2019 at 5:09 PM Peter Geis <pgwipeout@gmail.com> wrote:
>
> Good Evening,
>
> > I'm not sure, but I believe the pcie address space built into the
> > rk3399 is not large enough to accommodate the pcie addresses the card
> > requires.
> > I've been trying to figure out whether it's possible to use system
> > RAM instead, but so far I haven't been successful.
> > Also, the RAM layout of the rk3399 is odd given the TRM; if anyone
> > has any insights into that, I'd be grateful.
>
> On Mon, Nov 4, 2019 at 1:55 PM Peter Geis <pgwipeout@gmail.com> wrote:
> >
> > Good Morning,
> >
> > I'm attempting to debug an issue with the rockpro64 pcie port.
> > The port appears to have trouble with a variety of cards; even cards
> > of the same make, such as Intel i340-based NICs, randomly work or do
> > not work.
> > I'm currently experiencing it with a GTX 645 GPU.
> >
> > This seems to be a long-running issue, referenced at both [0] and [1].
> > There was an attempt to rectify it by adding a delay between link
> > training and probing [2], but that doesn't seem to be the issue here.
> > It appears that probing further into the card, e.g. devfn > 1,
> > triggers the bug.
> > I've added a print statement that logs the devfn, address, and size of
> > each config access, which you can see below.
> >
> > I've also tried limiting the port to a single lane, with no difference.
> >
> > If anyone could point me in the right direction as to where to
> > continue debugging, I'd greatly appreciate it.

I don't have tons of knowledge here, but your thread happened to fly
by my Inbox and it reminded me of issues we faced during the bringup
of rk3399-gru-kevin where our WiFi driver would kill the whole system
in random / hard to debug ways.

If I remember, the problem was that the rk3399 PCIe behaved very badly
when you tried to access the bus while it was in certain power-save
modes.  I think traditional x86-based controllers just return 0xff in
that state, but the rk3399 gives some sort of asynchronous bus error.

IIRC our problem was fixed with:

https://crrev.com/c/402092 - FROMLIST: mwifiex: fix corner case power save issue

...that's about all the knowledge I have around this area, but
possibly it could point you in the right direction?

-Doug


* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-09 16:37   ` Doug Anderson
@ 2019-11-10 15:43     ` Peter Geis
  2019-11-12  0:03       ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-10 15:43 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Shawn Lin, Heiko Stuebner, linux-pci, open list:ARM/Rockchip SoC...

On Sat, Nov 9, 2019 at 11:38 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Fri, Nov 8, 2019 at 5:09 PM Peter Geis <pgwipeout@gmail.com> wrote:
> >
> > Good Evening,
> >
> > > I'm not sure, but I believe the pcie address space built into the
> > > rk3399 is not large enough to accommodate the pcie addresses the card
> > > requires.
> > > I've been trying to figure out whether it's possible to use system
> > > RAM instead, but so far I haven't been successful.
> > > Also, the RAM layout of the rk3399 is odd given the TRM; if anyone
> > > has any insights into that, I'd be grateful.
> >
> > On Mon, Nov 4, 2019 at 1:55 PM Peter Geis <pgwipeout@gmail.com> wrote:
> > >
> > > Good Morning,
> > >
> > > I'm attempting to debug an issue with the rockpro64 pcie port.
> > > The port appears to have trouble with a variety of cards; even cards
> > > of the same make, such as Intel i340-based NICs, randomly work or do
> > > not work.
> > > I'm currently experiencing it with a GTX 645 GPU.
> > >
> > > This seems to be a long-running issue, referenced at both [0] and [1].
> > > There was an attempt to rectify it by adding a delay between link
> > > training and probing [2], but that doesn't seem to be the issue here.
> > > It appears that probing further into the card, e.g. devfn > 1,
> > > triggers the bug.
> > > I've added a print statement that logs the devfn, address, and size of
> > > each config access, which you can see below.
> > >
> > > I've also tried limiting the port to a single lane, with no difference.
> > >
> > > If anyone could point me in the right direction as to where to
> > > continue debugging, I'd greatly appreciate it.
>
> I don't have tons of knowledge here, but your thread happened to fly
> by my Inbox and it reminded me of issues we faced during the bringup
> of rk3399-gru-kevin where our WiFi driver would kill the whole system
> in random / hard to debug ways.
>
> If I remember, the problem was that the rk3399 PCIe behaved very badly
> when you tried to access the bus while it was in certain power-save
> modes.  I think traditional x86-based controllers just return 0xff in
> that state, but the rk3399 gives some sort of asynchronous bus error.
>
> IIRC our problem was fixed with:
>
> https://crrev.com/c/402092 - FROMLIST: mwifiex: fix corner case power save issue
>
> ...that's about all the knowledge I have around this area, but
> possibly it could point you in the right direction?
>
> -Doug

Thanks!

Unfortunately this happens before the pcie controller even finishes
probing, so nothing has had a chance to enter a power-save state.
I plugged in an i350 two-port NIC and examined the assigned address
spaces, attached below.
Judging by the usage, I think this controller has enough address space
for another two-port NIC, and that's about it.
I'm now fairly sure the rk3399 controller just doesn't have the
address space to map larger devices.
I suspect the IOMMU would allow us to use system memory as pcie
address space and overcome this limitation, but I don't know how to
do that.

The address space for the nic is below:
f8000000-f8ffffff : axi-base
fa000000-fbdfffff : MEM
  fa000000-fa4fffff : PCI Bus 0000:01
    fa000000-fa07ffff : 0000:01:00.0
      fa000000-fa07ffff : igb
    fa080000-fa0fffff : 0000:01:00.0
    fa100000-fa17ffff : 0000:01:00.1
      fa100000-fa17ffff : igb
    fa180000-fa1fffff : 0000:01:00.1
    fa200000-fa27ffff : 0000:01:00.0
    fa280000-fa2fffff : 0000:01:00.0
    fa300000-fa37ffff : 0000:01:00.1
    fa380000-fa3fffff : 0000:01:00.1
    fa400000-fa403fff : 0000:01:00.0
      fa400000-fa403fff : igb
    fa404000-fa407fff : 0000:01:00.1
      fa404000-fa407fff : igb
fd000000-fdffffff : f8000000.pcie

$ dmesg | grep pci | grep -v devfn
[   64.408414] pci 0000:01:00.1: reg 0x1c: [mem 0x00000000-0x00003fff]
[   64.408508] pci 0000:01:00.1: reg 0x30: [mem 0x00000000-0x0007ffff pref]
[   64.408546] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 512)
[   64.409002] pci 0000:01:00.1: PME# supported from D0 D3hot D3cold
[   64.409016] pci 0000:01:00.1: PME# disabled
[   64.409179] pci 0000:01:00.1: reg 0x184: [mem 0x00000000-0x0000ffff
64bit pref]
[   64.409181] pci 0000:01:00.1: VF(n) BAR0 space: [mem
0x00000000-0x0007ffff 64bit pref] (contains BAR0 for 8 VFs)
[   64.409268] pci 0000:01:00.1: reg 0x190: [mem 0x00000000-0x0000ffff
64bit pref]
[   64.409271] pci 0000:01:00.1: VF(n) BAR3 space: [mem
0x00000000-0x0007ffff 64bit pref] (contains BAR3 for 8 VFs)


* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-10 15:43     ` Peter Geis
@ 2019-11-12  0:03       ` Bjorn Helgaas
  2019-11-12  0:21         ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2019-11-12  0:03 UTC (permalink / raw)
  To: Peter Geis
  Cc: Doug Anderson, Shawn Lin, Heiko Stuebner, linux-pci,
	open list:ARM/Rockchip SoC...

On Sun, Nov 10, 2019 at 10:43:48AM -0500, Peter Geis wrote:

> I plugged in an i350 two-port NIC and examined the assigned address
> spaces, attached below.
> Judging by the usage, I think this controller has enough address space
> for another two-port NIC, and that's about it.
> I'm now fairly sure the rk3399 controller just doesn't have the
> address space to map larger devices.
> I suspect the IOMMU would allow us to use system memory as pcie
> address space and overcome this limitation, but I don't know how to
> do that.

I don't think you're out of MMIO space, at least in this instance.  It
looks like you have 30MB available and the two-port NIC on bus 01 only
takes 5MB.

The IOMMU is used for DMA (e.g., reads/writes initiated by the NIC),
while the MMIO space is used for CPU programmed I/O (reads/writes done
by the driver running on the CPU).

> The address space for the nic is below:
> f8000000-f8ffffff : axi-base
> fa000000-fbdfffff : MEM

30MB.

>   fa000000-fa4fffff : PCI Bus 0000:01

5MB.

>     fa000000-fa07ffff : 0000:01:00.0
>       fa000000-fa07ffff : igb
>     fa080000-fa0fffff : 0000:01:00.0
>     fa100000-fa17ffff : 0000:01:00.1
>       fa100000-fa17ffff : igb
>     fa180000-fa1fffff : 0000:01:00.1
>     fa200000-fa27ffff : 0000:01:00.0
>     fa280000-fa2fffff : 0000:01:00.0
>     fa300000-fa37ffff : 0000:01:00.1
>     fa380000-fa3fffff : 0000:01:00.1
>     fa400000-fa403fff : 0000:01:00.0
>       fa400000-fa403fff : igb
>     fa404000-fa407fff : 0000:01:00.1
>       fa404000-fa407fff : igb
> fd000000-fdffffff : f8000000.pcie


* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-04 18:55 [BUG] rk3399-rockpro64 pcie synchronous external abort Peter Geis
  2019-11-09  1:08 ` Peter Geis
@ 2019-11-12  0:13 ` Bjorn Helgaas
  2019-11-12  0:30   ` Peter Geis
  1 sibling, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2019-11-12  0:13 UTC (permalink / raw)
  To: Peter Geis; +Cc: shawn.lin, Heiko Stuebner, linux-pci, linux-rockchip

On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> Good Morning,
> 
> I'm attempting to debug an issue with the rockpro64 pcie port.
> The port appears to have trouble with a variety of cards; even cards
> of the same make, such as Intel i340-based NICs, randomly work or do
> not work.
> I'm currently experiencing it with a GTX 645 GPU.
> 
> This seems to be a long-running issue, referenced at both [0] and [1].
> There was an attempt to rectify it by adding a delay between link
> training and probing [2], but that doesn't seem to be the issue here.
> It appears that probing further into the card, e.g. devfn > 1,
> triggers the bug.
> I've added a print statement that logs the devfn, address, and size of
> each config access, which you can see below.
> 
> I've also tried limiting the port to a single lane, with no difference.
> 
> If anyone could point me in the right direction as to where to
> continue debugging, I'd greatly appreciate it.
> 
> [0] https://github.com/ayufan-rock64/linux-build/issues/254
> [1] https://github.com/rockchip-linux/kernel/issues/116
> [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
> 
> [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> for per-lane PHY
> [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> 0xfa000000..0xfbdfffff -> 0xfa000000
> [  198.550676] rockchip-pcie f8000000.pcie:    IO
> 0xfbe00000..0xfbefffff -> 0xfbe00000
> [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> (bus address [0xfbe00000-0xfbefffff])
> [  198.552965] pci_bus 0000:00: scanning bus
> [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> [  198.555508] pci 0000:00:00.0: supports D1
> [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> [  198.556023] pci 0000:00:00.0: PME# disabled
> [  198.561245] pci_bus 0000:00: fixups for bus
> [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> 00-00]), reconfiguring
> [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> [  198.567008] pci_bus 0000:01: scanning bus
> [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> 64bit pref]
> [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> 64bit pref]
> [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> with 8 GT/s x16 link)
> [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> decodes=io+mem,owns=none,locks=none
> [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> <snip>
> [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> [  198.750881] Internal error: synchronous external abort: 96000210
> [#1] PREEMPT SMP

Is there really supposed to be a device at 01:00.2?

Maybe this is just the PCIe Unsupported Request error that we expect
to get when trying to read config space of a device that doesn't
exist.

On "most" platforms, we just get ~0 data back when that happens, but
I'm not sure that's always the case on arm64.  I think it depends on
how the PCIe host bridge is designed, and there might be some CPU
configuration, too.

> [  198.751581] Modules linked in: drm_panel_orientation_quirks
> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> crc_ccitt nf_defrag_ipv6
> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> [  198.755773] sp : ffff8000132af530
> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> [  198.763348] Call trace:
> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> [  198.765413]  pci_scan_single_device+0x84/0xf8
> [  198.765812]  pci_scan_slot+0x7c/0x120
> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> [  198.767430]  pci_scan_child_bus+0x24/0x30
> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> [  198.768770]  platform_drv_probe+0x58/0xa8
> [  198.769139]  really_probe+0xe0/0x318
> [  198.769468]  driver_probe_device+0x5c/0xf0
> [  198.769844]  device_driver_attach+0x74/0x80
> [  198.770227]  __driver_attach+0x64/0xe8
> [  198.770572]  bus_for_each_dev+0x84/0xd8
> [  198.770924]  driver_attach+0x30/0x40
> [  198.771253]  bus_add_driver+0x188/0x1e8
> [  198.771605]  driver_register+0x64/0x110
> [  198.771956]  __platform_driver_register+0x54/0x60
> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> [  198.772998]  do_one_initcall+0x94/0x390
> [  198.773353]  do_init_module+0x88/0x268
> [  198.773697]  load_module+0x1e18/0x2198
> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> [  198.775297]  el0_svc_handler+0x34/0xa0
> [  198.775645]  el0_svc+0x14/0x40
> [  198.775928]  el0_sync_handler+0x118/0x290
> [  198.776295]  el0_sync+0x164/0x180
> [  198.776609] Code: bad PC value
> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-12  0:03       ` Bjorn Helgaas
@ 2019-11-12  0:21         ` Peter Geis
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Geis @ 2019-11-12  0:21 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Doug Anderson, Shawn Lin, Heiko Stuebner, linux-pci,
	open list:ARM/Rockchip SoC...

On Mon, Nov 11, 2019 at 7:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sun, Nov 10, 2019 at 10:43:48AM -0500, Peter Geis wrote:
>
> > I plugged in an i350 two port nic and examined the assigned address spaces.
> > I've attached it below.
> > Judging by the usage, I think this controller has enough address space
> > for another two port NIC, and that's about it.
> > I'm pretty sure now that the rk3399 controller just doesn't have the
> > address space to map larger devices.
> > I'm pretty sure the IOMMU would allow us to address system memory as
> > pcie address space and overcome this limitation, but I don't know how
> > to do that.
>
> I don't think you're out of MMIO space, at least in this instance.  It
> looks like you have 32MB available and the two-port NIC on bus 01 only
> takes 5MB.
>
> The IOMMU is used for DMA (e.g., reads/writes initiated by the NIC),
> while the MMIO space is used for CPU programmed I/O (reads/writes done
> by the driver running on the CPU).
>
> > The address space for the nic is below:
> > f8000000-f8ffffff : axi-base
> > fa000000-fbdfffff : MEM
>
> 32MB.
>
> >   fa000000-fa4fffff : PCI Bus 0000:01
>
> 5MB.
Just a note: this is not the device that triggered the bug.
This is an i350 NIC; I posted it as a comparison to the card in my original email.
>
> >     fa000000-fa07ffff : 0000:01:00.0
> >       fa000000-fa07ffff : igb
> >     fa080000-fa0fffff : 0000:01:00.0
> >     fa100000-fa17ffff : 0000:01:00.1
> >       fa100000-fa17ffff : igb
> >     fa180000-fa1fffff : 0000:01:00.1
> >     fa200000-fa27ffff : 0000:01:00.0
> >     fa280000-fa2fffff : 0000:01:00.0
> >     fa300000-fa37ffff : 0000:01:00.1
> >     fa380000-fa3fffff : 0000:01:00.1
> >     fa400000-fa403fff : 0000:01:00.0
> >       fa400000-fa403fff : igb
> >     fa404000-fa407fff : 0000:01:00.1
> >       fa404000-fa407fff : igb
> > fd000000-fdffffff : f8000000.pcie

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-12  0:13 ` Bjorn Helgaas
@ 2019-11-12  0:30   ` Peter Geis
  2019-11-12  2:29     ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-12  0:30 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Shawn Lin, Heiko Stuebner, linux-pci, open list:ARM/Rockchip SoC...

On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> > Good Morning,
> >
> > I'm attempting to debug an issue with the rockpro64 pcie port.
> > It would appear that the port does not like various cards, including
> > cards of the same make that randomly work or do not work, such as
> > Intel i340 based NICs.
> > I'm experiencing it with a GTX645 gpu.
> >
> > This seems to be a long running issue, referenced both at [0], and [1].
> > There was an attempt to rectify it, by adding a delay between training
> > and probing [2], but that doesn't seem to be the issue here.
> > It appears that when we probe further into the card, such as devfn >
> > 1, we trigger the bug.
> > I've added a print statement that prints the devfn, address, and size
> > information, which you can see below.
> >
> > I've attempted setting the available number of lanes to 1 as well, to
> > no difference.
> >
> > If anyone could point me in the right direction as to where to
> > continue debugging, I'd greatly appreciate it.
> >
> > [0] https://github.com/ayufan-rock64/linux-build/issues/254
> > [1] https://github.com/rockchip-linux/kernel/issues/116
> > [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
> >
> > [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> > for per-lane PHY
> > [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> > [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> > [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> > [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> > [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> > [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> > [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> > [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> > [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> > 0xfa000000..0xfbdfffff -> 0xfa000000
> > [  198.550676] rockchip-pcie f8000000.pcie:    IO
> > 0xfbe00000..0xfbefffff -> 0xfbe00000
> > [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> > [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> > [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> > [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > (bus address [0xfbe00000-0xfbefffff])
> > [  198.552965] pci_bus 0000:00: scanning bus
> > [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> > [  198.555508] pci 0000:00:00.0: supports D1
> > [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > [  198.556023] pci 0000:00:00.0: PME# disabled
> > [  198.561245] pci_bus 0000:00: fixups for bus
> > [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> > [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> > 00-00]), reconfiguring
> > [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> > [  198.567008] pci_bus 0000:01: scanning bus
> > [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> > [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> > [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> > 64bit pref]
> > [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> > 64bit pref]
> > [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> > [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> > [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> > [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> > [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> > with 8 GT/s x16 link)
> > [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> > decodes=io+mem,owns=none,locks=none
> > [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> > [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> > [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> > <snip>
> > [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> > [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> > [  198.750881] Internal error: synchronous external abort: 96000210
> > [#1] PREEMPT SMP
>
> Is there really supposed to be a device at 01:00.2?
>
> Maybe this is just the PCIe Unsupported Request error that we expect
> to get when trying to read config space of a device that doesn't
> exist.
>
> On "most" platforms, we just get ~0 data back when that happens, but
> I'm not sure that's always the case on arm64.  I think it depends on
> how the PCIe host bridge is designed, and there might be some CPU
> configuration, too.

Yes, this is a GTX645 video card.
Nvidia cards usually expose two to three functions:
the GPU proper, the audio device for the HDMI output, and the I2C controller.

I do think that this driver is missing sanity checks on the
addressing. By comparison, the BRCM driver for the RPi4 doesn't try to
enumerate a video card, because it checks whether the MMIO space is
large enough to fit the BAR before assigning the addresses; see [3].
In that thread he was also able to increase the address space provided
to the BRCM driver and fix the issue, but I don't see how we could do
that on the rk3399.

pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
128 MB, which already exceeds our address space.
I think the driver is just overflowing the address space.

[3] https://twitter.com/domipheus/status/1167586160077627393
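
The arithmetic behind those numbers, taken from the resource ranges in
the log above (a `[mem START-END]` line spans END - START + 1 bytes):

```c
#include <assert.h>
#include <stdint.h>

/* A resource printed as [mem START-END] spans END - START + 1 bytes. */
static uint64_t res_size(uint64_t start, uint64_t end)
{
    return end - start + 1;
}
```

res_size(0x00000000, 0x07ffffff) is 0x08000000 (128 MB) for the reg
0x14 BAR, while the whole MEM window res_size(0xfa000000, 0xfbdfffff)
is only 0x01e00000 (30 MB), so that single BAR can never fit.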
>
> > [  198.751581] Modules linked in: drm_panel_orientation_quirks
> > pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> > crc_ccitt nf_defrag_ipv6
> > [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> > 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> > [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> > [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> > [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> > [  198.755773] sp : ffff8000132af530
> > [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> > [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> > [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> > [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> > [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> > [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> > [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> > [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> > [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> > [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> > [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> > [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> > [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> > [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> > [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> > [  198.763348] Call trace:
> > [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> > [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> > [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> > [  198.765413]  pci_scan_single_device+0x84/0xf8
> > [  198.765812]  pci_scan_slot+0x7c/0x120
> > [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> > [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> > [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> > [  198.767430]  pci_scan_child_bus+0x24/0x30
> > [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> > [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> > [  198.768770]  platform_drv_probe+0x58/0xa8
> > [  198.769139]  really_probe+0xe0/0x318
> > [  198.769468]  driver_probe_device+0x5c/0xf0
> > [  198.769844]  device_driver_attach+0x74/0x80
> > [  198.770227]  __driver_attach+0x64/0xe8
> > [  198.770572]  bus_for_each_dev+0x84/0xd8
> > [  198.770924]  driver_attach+0x30/0x40
> > [  198.771253]  bus_add_driver+0x188/0x1e8
> > [  198.771605]  driver_register+0x64/0x110
> > [  198.771956]  __platform_driver_register+0x54/0x60
> > [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> > [  198.772998]  do_one_initcall+0x94/0x390
> > [  198.773353]  do_init_module+0x88/0x268
> > [  198.773697]  load_module+0x1e18/0x2198
> > [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> > [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> > [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> > [  198.775297]  el0_svc_handler+0x34/0xa0
> > [  198.775645]  el0_svc+0x14/0x40
> > [  198.775928]  el0_sync_handler+0x118/0x290
> > [  198.776295]  el0_sync+0x164/0x180
> > [  198.776609] Code: bad PC value
> > [  198.776897] ---[ end trace 88fc77651b5e2909 ]---

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-12  0:30   ` Peter Geis
@ 2019-11-12  2:29     ` Bjorn Helgaas
  2019-11-12 15:55       ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2019-11-12  2:29 UTC (permalink / raw)
  To: Peter Geis
  Cc: Shawn Lin, Heiko Stuebner, linux-pci, open list:ARM/Rockchip SoC...

On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
> On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> > > Good Morning,
> > >
> > > I'm attempting to debug an issue with the rockpro64 pcie port.
> > > It would appear that the port does not like various cards, including
> > > cards of the same make that randomly work or do not work, such as
> > > Intel i340 based NICs.
> > > I'm experiencing it with a GTX645 gpu.
> > >
> > > This seems to be a long running issue, referenced both at [0], and [1].
> > > There was an attempt to rectify it, by adding a delay between training
> > > and probing [2], but that doesn't seem to be the issue here.
> > > It appears that when we probe further into the card, such as devfn >
> > > 1, we trigger the bug.
> > > I've added a print statement that prints the devfn, address, and size
> > > information, which you can see below.
> > >
> > > I've attempted setting the available number of lanes to 1 as well, to
> > > no difference.
> > >
> > > If anyone could point me in the right direction as to where to
> > > continue debugging, I'd greatly appreciate it.
> > >
> > > [0] https://github.com/ayufan-rock64/linux-build/issues/254
> > > [1] https://github.com/rockchip-linux/kernel/issues/116
> > > [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
> > >
> > > [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> > > for per-lane PHY
> > > [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> > > [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> > > [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> > > [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> > > [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> > > [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> > > [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> > > [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> > > [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> > > 0xfa000000..0xfbdfffff -> 0xfa000000
> > > [  198.550676] rockchip-pcie f8000000.pcie:    IO
> > > 0xfbe00000..0xfbefffff -> 0xfbe00000
> > > [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> > > [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> > > [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> > > [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > > (bus address [0xfbe00000-0xfbefffff])
> > > [  198.552965] pci_bus 0000:00: scanning bus
> > > [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> > > [  198.555508] pci 0000:00:00.0: supports D1
> > > [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > > [  198.556023] pci 0000:00:00.0: PME# disabled
> > > [  198.561245] pci_bus 0000:00: fixups for bus
> > > [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> > > [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> > > 00-00]), reconfiguring
> > > [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> > > [  198.567008] pci_bus 0000:01: scanning bus
> > > [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> > > [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> > > [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> > > 64bit pref]
> > > [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> > > 64bit pref]
> > > [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> > > [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> > > [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> > > [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> > > [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > > limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> > > with 8 GT/s x16 link)
> > > [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> > > decodes=io+mem,owns=none,locks=none
> > > [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> > > [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> > > [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> > > <snip>
> > > [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> > > [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> > > [  198.750881] Internal error: synchronous external abort: 96000210
> > > [#1] PREEMPT SMP
> >
> > Is there really supposed to be a device at 01:00.2?
> >
> > Maybe this is just the PCIe Unsupported Request error that we expect
> > to get when trying to read config space of a device that doesn't
> > exist.
> >
> > On "most" platforms, we just get ~0 data back when that happens, but
> > I'm not sure that's always the case on arm64.  I think it depends on
> > how the PCIe host bridge is designed, and there might be some CPU
> > configuration, too.
> 
> Yes, this is a GTX645 video card.
> Nvidia cards usually expose two to three functions:
> the GPU proper, the audio device for the HDMI output, and the I2C controller.
> 
> I do think that this driver is missing sanity checks on the
> addressing. By comparison, the BRCM driver for the RPi4 doesn't try to
> enumerate a video card, because it checks whether the MMIO space is
> large enough to fit the BAR before assigning the addresses; see [3].
> In that thread he was also able to increase the address space provided
> to the BRCM driver and fix the issue, but I don't see how we could do
> that on the rk3399.
> 
> pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
> 128 MB, which already exceeds our address space.
> I think the driver is just overflowing the address space.

If we don't have enough space to assign all the device BARs, I think a
driver will still be able to claim the device, but when the driver
calls pci_enable_device(), it should fail.  Lack of address space
should not cause a PCIe error.

But in this case, none of that matters because we're still enumerating
devices in pci_scan_root_bus_bridge().  We haven't gotten to the point
of trying to bind drivers to devices, so the driver isn't involved at
all yet.

The backtrace says we're trying to read the Vendor ID of a device, and
your debug output suggests we're trying to enumerate 01:00.2.  If you
put that card in another system, you could find out how many functions
it has.

Or if you swapped this with other cards where you know the number of
functions, you could see if the external abort always happens when
probing for the first unimplemented function.

If the Root Port (00:00.0) supports AER, you could also dump out the
status registers from the AER capability and figure out whether it
logged a PCIe error.  This would be sort of like what
aer_process_err_devices() does.  A bit of a hassle to do this by hand
in the exception path, but could be enlightening, just as a debug
tool.
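
The by-hand walk amounts to following the PCIe extended capability list
from config offset 0x100 until the AER capability ID (0x0001) turns up.
A minimal sketch, where fake_read32 is a stand-in for the bridge's real
config accessor:

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXT_CAP_ID_ERR 0x0001   /* AER capability ID */

/* Walk the extended capability list starting at offset 0x100.  Each
 * header dword is: [15:0] cap ID, [19:16] version, [31:20] next ptr. */
static int find_ext_cap(uint32_t (*read32)(int off), uint16_t cap_id)
{
    int off = 0x100;
    while (off) {
        uint32_t hdr = read32(off);
        if ((hdr & 0xffff) == cap_id)
            return off;                  /* AER registers start here */
        off = (hdr >> 20) & 0xffc;       /* next pointer, dword aligned */
    }
    return 0;                            /* capability not present */
}

/* Fake two-entry config space for illustration: some other capability
 * at 0x100, AER at 0x140. */
static uint32_t fake_read32(int off)
{
    switch (off) {
    case 0x100: return (0x140u << 20) | 0x0002;  /* next = 0x140 */
    case 0x140: return PCI_EXT_CAP_ID_ERR;       /* next = 0 */
    default:    return 0xffffffff;
    }
}
```

Once the offset is known, the Uncorrectable and Correctable Error
Status registers sit at fixed offsets 0x04 and 0x10 from it, which is
what you'd dump from the exception path.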

> [3] https://twitter.com/domipheus/status/1167586160077627393
> >
> > > [  198.751581] Modules linked in: drm_panel_orientation_quirks
> > > pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> > > crc_ccitt nf_defrag_ipv6
> > > [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> > > 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> > > [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> > > [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> > > [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> > > [  198.755773] sp : ffff8000132af530
> > > [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> > > [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> > > [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> > > [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> > > [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> > > [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> > > [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> > > [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> > > [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> > > [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> > > [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> > > [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> > > [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> > > [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> > > [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> > > [  198.763348] Call trace:
> > > [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> > > [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> > > [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> > > [  198.765413]  pci_scan_single_device+0x84/0xf8
> > > [  198.765812]  pci_scan_slot+0x7c/0x120
> > > [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> > > [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> > > [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> > > [  198.767430]  pci_scan_child_bus+0x24/0x30
> > > [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> > > [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> > > [  198.768770]  platform_drv_probe+0x58/0xa8
> > > [  198.769139]  really_probe+0xe0/0x318
> > > [  198.769468]  driver_probe_device+0x5c/0xf0
> > > [  198.769844]  device_driver_attach+0x74/0x80
> > > [  198.770227]  __driver_attach+0x64/0xe8
> > > [  198.770572]  bus_for_each_dev+0x84/0xd8
> > > [  198.770924]  driver_attach+0x30/0x40
> > > [  198.771253]  bus_add_driver+0x188/0x1e8
> > > [  198.771605]  driver_register+0x64/0x110
> > > [  198.771956]  __platform_driver_register+0x54/0x60
> > > [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> > > [  198.772998]  do_one_initcall+0x94/0x390
> > > [  198.773353]  do_init_module+0x88/0x268
> > > [  198.773697]  load_module+0x1e18/0x2198
> > > [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> > > [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> > > [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> > > [  198.775297]  el0_svc_handler+0x34/0xa0
> > > [  198.775645]  el0_svc+0x14/0x40
> > > [  198.775928]  el0_sync_handler+0x118/0x290
> > > [  198.776295]  el0_sync+0x164/0x180
> > > [  198.776609] Code: bad PC value
> > > [  198.776897] ---[ end trace 88fc77651b5e2909 ]---

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-12  2:29     ` Bjorn Helgaas
@ 2019-11-12 15:55       ` Peter Geis
  2019-11-12 19:15         ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-12 15:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Shawn Lin, Heiko Stuebner, linux-pci, open list:ARM/Rockchip SoC...

On Mon, Nov 11, 2019 at 9:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
> > On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> > > > Good Morning,
> > > >
> > > > I'm attempting to debug an issue with the rockpro64 pcie port.
> > > > It would appear that the port does not like various cards, including
> > > > cards of the same make that randomly work or do not work, such as
> > > > Intel i340 based NICs.
> > > > I'm experiencing it with a GTX645 gpu.
> > > >
> > > > This seems to be a long running issue, referenced both at [0], and [1].
> > > > There was an attempt to rectify it, by adding a delay between training
> > > > and probing [2], but that doesn't seem to be the issue here.
> > > > It appears that when we probe further into the card, such as devfn >
> > > > 1, we trigger the bug.
> > > > I've added a print statement that prints the devfn, address, and size
> > > > information, which you can see below.
> > > >
> > > > I've attempted setting the available number of lanes to 1 as well, to
> > > > no difference.
> > > >
> > > > If anyone could point me in the right direction as to where to
> > > > continue debugging, I'd greatly appreciate it.
> > > >
> > > > [0] https://github.com/ayufan-rock64/linux-build/issues/254
> > > > [1] https://github.com/rockchip-linux/kernel/issues/116
> > > > [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
> > > >
> > > > [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> > > > for per-lane PHY
> > > > [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> > > > [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> > > > [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> > > > [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> > > > [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> > > > [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> > > > [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> > > > [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> > > > [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> > > > 0xfa000000..0xfbdfffff -> 0xfa000000
> > > > [  198.550676] rockchip-pcie f8000000.pcie:    IO
> > > > 0xfbe00000..0xfbefffff -> 0xfbe00000
> > > > [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> > > > [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> > > > [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> > > > [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > > > (bus address [0xfbe00000-0xfbefffff])
> > > > [  198.552965] pci_bus 0000:00: scanning bus
> > > > [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> > > > [  198.555508] pci 0000:00:00.0: supports D1
> > > > [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > > > [  198.556023] pci 0000:00:00.0: PME# disabled
> > > > [  198.561245] pci_bus 0000:00: fixups for bus
> > > > [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> > > > [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> > > > 00-00]), reconfiguring
> > > > [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> > > > [  198.567008] pci_bus 0000:01: scanning bus
> > > > [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> > > > [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> > > > [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> > > > 64bit pref]
> > > > [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> > > > 64bit pref]
> > > > [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> > > > [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> > > > [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> > > > [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> > > > [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > > > limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> > > > with 8 GT/s x16 link)
> > > > [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> > > > decodes=io+mem,owns=none,locks=none
> > > > [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> > > > [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> > > > [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> > > > <snip>
> > > > [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> > > > [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> > > > [  198.750881] Internal error: synchronous external abort: 96000210
> > > > [#1] PREEMPT SMP
> > >
> > > Is there really supposed to be a device at 01:00.2?
> > >
> > > Maybe this is just the PCIe Unsupported Request error that we expect
> > > to get when trying to read config space of a device that doesn't
> > > exist.
> > >
> > > On "most" platforms, we just get ~0 data back when that happens, but
> > > I'm not sure that's always the case on arm64.  I think it depends on
> > > how the PCIe host bridge is designed, and there might be some CPU
> > > configuration, too.
> >
> > Yes, this is a GTX645 video card.
> > Nvidia cards usually expose two to three functions:
> > the GPU proper, the audio device for the HDMI output, and the I2C controller.
> >
> > I do think that this driver is missing sanity checks on the
> > addressing. By comparison, the BRCM driver for the RPi4 doesn't try to
> > enumerate a video card, because it checks whether the MMIO space is
> > large enough to fit the BAR before assigning the addresses; see [3].
> > In that thread he was also able to increase the address space provided
> > to the BRCM driver and fix the issue, but I don't see how we could do
> > that on the rk3399.
> >
> > pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
> > 128 MB, which already exceeds our address space.
> > I think the driver is just overflowing the address space.
>
> If we don't have enough space to assign all the device BARs, I think a
> driver will still be able to claim the device, but when the driver
> calls pci_enable_device(), it should fail.  Lack of address space
> should not cause a PCIe error.
>
> But in this case, none of that matters because we're still enumerating
> devices in pci_scan_root_bus_bridge().  We haven't gotten to the point
> of trying to bind drivers to devices, so the driver isn't involved at
> all yet.
For clarification, the driver I'm referring to is the rk3399-pcie host driver.
>
> The backtrace says we're trying to read the Vendor ID of a device, and
> your debug output suggests we're trying to enumerate 01:00.2.  If you
> put that card in another system, you could find out how many functions
> it has.
>
> Or if you swapped this with other cards where you know the number of
> functions, you could see if the external abort always happens when
> probing for the first unimplemented function.
This card definitely has more than one function.
Before my original message I hacked in some code to make the driver
return 0xff when devfn > 1, and the scan passed, but as soon as
nouveau attempted to access the device, the entire kernel exploded.

Another reason I believe the address assignments are overflowing and
corrupting other address assignments is that, after the external abort,
the entire PCIe controller is inaccessible.
$ lspci
pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.1/config
lspci: Unable to read the standard configuration space header of
device 0000:01:00.1
pcilib: Cannot open /sys/bus/pci/devices/0000:00:00.0/config
lspci: Unable to read the standard configuration space header of
device 0000:00:00.0
pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.0/config
lspci: Unable to read the standard configuration space header of
device 0000:01:00.0

Attempting to rescan the bus, or any other PCI operation, results in a
hung kernel task.
>
> If the Root Port (00:00.0) supports AER, you could also dump out the
> status registers from the AER capability and figure out whether it
> logged a PCIe error.  This would be sort of like what
> aer_process_err_devices() does.  A bit of a hassle to do this by hand
> in the exception path, but could be enlightening, just as a debug
> tool.

Is there a way to handle external synchronous aborts in a device driver?
If so, I'll definitely look into plugging in the aer status functions.

>
> > [3] https://twitter.com/domipheus/status/1167586160077627393
> > >
> > > > [  198.751581] Modules linked in: drm_panel_orientation_quirks
> > > > pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> > > > crc_ccitt nf_defrag_ipv6
> > > > [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> > > > 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> > > > [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> > > > [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> > > > [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > > [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> > > > [  198.755773] sp : ffff8000132af530
> > > > [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> > > > [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> > > > [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> > > > [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> > > > [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> > > > [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> > > > [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> > > > [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> > > > [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> > > > [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> > > > [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> > > > [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> > > > [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> > > > [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> > > > [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> > > > [  198.763348] Call trace:
> > > > [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > > [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> > > > [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> > > > [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> > > > [  198.765413]  pci_scan_single_device+0x84/0xf8
> > > > [  198.765812]  pci_scan_slot+0x7c/0x120
> > > > [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> > > > [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> > > > [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> > > > [  198.767430]  pci_scan_child_bus+0x24/0x30
> > > > [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> > > > [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> > > > [  198.768770]  platform_drv_probe+0x58/0xa8
> > > > [  198.769139]  really_probe+0xe0/0x318
> > > > [  198.769468]  driver_probe_device+0x5c/0xf0
> > > > [  198.769844]  device_driver_attach+0x74/0x80
> > > > [  198.770227]  __driver_attach+0x64/0xe8
> > > > [  198.770572]  bus_for_each_dev+0x84/0xd8
> > > > [  198.770924]  driver_attach+0x30/0x40
> > > > [  198.771253]  bus_add_driver+0x188/0x1e8
> > > > [  198.771605]  driver_register+0x64/0x110
> > > > [  198.771956]  __platform_driver_register+0x54/0x60
> > > > [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> > > > [  198.772998]  do_one_initcall+0x94/0x390
> > > > [  198.773353]  do_init_module+0x88/0x268
> > > > [  198.773697]  load_module+0x1e18/0x2198
> > > > [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> > > > [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> > > > [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> > > > [  198.775297]  el0_svc_handler+0x34/0xa0
> > > > [  198.775645]  el0_svc+0x14/0x40
> > > > [  198.775928]  el0_sync_handler+0x118/0x290
> > > > [  198.776295]  el0_sync+0x164/0x180
> > > > [  198.776609] Code: bad PC value
> > > > [  198.776897] ---[ end trace 88fc77651b5e2909 ]---

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-12 15:55       ` Peter Geis
@ 2019-11-12 19:15         ` Robin Murphy
  2019-11-12 19:41           ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2019-11-12 19:15 UTC (permalink / raw)
  To: Peter Geis, Bjorn Helgaas
  Cc: linux-pci, Shawn Lin, Heiko Stuebner, open list:ARM/Rockchip SoC...

On 12/11/2019 3:55 pm, Peter Geis wrote:
> On Mon, Nov 11, 2019 at 9:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>
>> On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
>>> On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>> On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
>>>>> Good Morning,
>>>>>
>>>>> I'm attempting to debug an issue with the rockpro64 pcie port.
>>>>> It would appear that the port does not like various cards, including
>>>>> cards of the same make that randomly work or do not work, such as
>>>>> Intel i340 based NICs.
>>>>> I'm experiencing it with a GTX645 gpu.
>>>>>
>>>>> This seems to be a long running issue, referenced both at [0], and [1].
>>>>> There was an attempt to rectify it, by adding a delay between training
>>>>> and probing [2], but that doesn't seem to be the issue here.
>>>>> It appears that when we probe further into the card, such as devfn >
>>>>> 1, we trigger the bug.
>>>>> I've added a print statement that prints the devfn, address, and size
>>>>> information, which you can see below.
>>>>>
>>>>> I've attempted setting the available number of lanes to 1 as well, to
>>>>> no difference.
>>>>>
>>>>> If anyone could point me in the right direction as to where to
>>>>> continue debugging, I'd greatly appreciate it.
>>>>>
>>>>> [0] https://github.com/ayufan-rock64/linux-build/issues/254
>>>>> [1] https://github.com/rockchip-linux/kernel/issues/116
>>>>> [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
>>>>>
>>>>> [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
>>>>> for per-lane PHY
>>>>> [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
>>>>> [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
>>>>> [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
>>>>> [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
>>>>> [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
>>>>> [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
>>>>> [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
>>>>> [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
>>>>> [  198.550656] rockchip-pcie f8000000.pcie:   MEM
>>>>> 0xfa000000..0xfbdfffff -> 0xfa000000
>>>>> [  198.550676] rockchip-pcie f8000000.pcie:    IO
>>>>> 0xfbe00000..0xfbefffff -> 0xfbe00000
>>>>> [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
>>>>> [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
>>>>> [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
>>>>> [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
>>>>> (bus address [0xfbe00000-0xfbefffff])
>>>>> [  198.552965] pci_bus 0000:00: scanning bus
>>>>> [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
>>>>> [  198.555508] pci 0000:00:00.0: supports D1
>>>>> [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
>>>>> [  198.556023] pci 0000:00:00.0: PME# disabled
>>>>> [  198.561245] pci_bus 0000:00: fixups for bus
>>>>> [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
>>>>> [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
>>>>> 00-00]), reconfiguring
>>>>> [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
>>>>> [  198.567008] pci_bus 0000:01: scanning bus
>>>>> [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
>>>>> [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
>>>>> [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
>>>>> 64bit pref]
>>>>> [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
>>>>> 64bit pref]
>>>>> [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
>>>>> [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
>>>>> [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
>>>>> [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
>>>>> [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
>>>>> limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
>>>>> with 8 GT/s x16 link)
>>>>> [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
>>>>> decodes=io+mem,owns=none,locks=none
>>>>> [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
>>>>> [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
>>>>> [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
>>>>> <snip>
>>>>> [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
>>>>> [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
>>>>> [  198.750881] Internal error: synchronous external abort: 96000210
>>>>> [#1] PREEMPT SMP
>>>>
>>>> Is there really supposed to be a device at 01:00.2?
>>>>
>>>> Maybe this is just the PCIe Unsupported Request error that we expect
>>>> to get when trying to read config space of a device that doesn't
>>>> exist.
>>>>
>>>> On "most" platforms, we just get ~0 data back when that happens, but
>>>> I'm not sure that's always the case on arm64.  I think it depends on
>>>> how the PCIe host bridge is designed, and there might be some CPU
>>>> configuration, too.
>>>
>>> Yes, this is a GTX645 video card.
>>> Nvidia cards usually have two to three devices:
>>> the GPU proper, the audio device for the HDMI output, and the i2c controller.
>>>
>>> I do think that this driver is missing sanity checking on the
>>> addressing, since the BRCM driver for the RPI4 doesn't try to
>>> enumerate a video card, since it checks if the MMIO space is large
>>> enough to fit the BAR before assigning the addresses. See [3]. Also in
>>> that thread he was able to increase the address space provided to the
>>> BRCM driver and fix the issue, but I don't see how we could do that on
>>> the rk3399.
>>>
>>> pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
>>> 128 MB, which already exceeds our address space.
>>> I think the driver is just overflowing the address space.
>>
>> If we don't have enough space to assign all the device BARs, I think a
>> driver will still be able to claim the device, but when the driver
>> calls pci_enable_device(), it should fail.  Lack of address space
>> should not cause a PCIe error.
>>
>> But in this case, none of that matters because we're still enumerating
>> devices in pci_scan_root_bus_bridge().  We haven't gotten to the point
>> of trying to bind drivers to devices, so the driver isn't involved at
>> all yet.
> For clarification, the driver I'm referring to is the rk3399-pcie host driver.
>>
>> The backtrace says we're trying to read the Vendor ID of a device, and
>> your debug output suggests we're trying to enumerate 01:00.2.  If you
>> put that card in another system, you could find out how many functions
>> it has.
>>
>> Or if you swapped this with other cards where you know the number of
>> functions, you could see if the external abort always happens when
>> probing for the first unimplemented function.
> This card definitely has more than one function.

FWIW, random lspci logs I grovelled up off the internet show cards with 
the same IDs only implementing functions 0 and 1, which does suggest 
that maybe function 2 really doesn't exist but the card handles 
unsupported config requests in a way that this particular bridge/root 
port doesn't quite deal with properly.

> Before my original message I hacked in some code to make the driver
> return 0xff when devfn > 1, and the scan passed, but as soon as
> nouveau attempted to access the device, the entire kernel exploded.
> 
> Another reason I believe the address assignments are overflowing and
> corrupting other address assignments is that, after the external abort,
> the entire PCIe controller is inaccessible.
> $ lspci
> pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.1/config
> lspci: Unable to read the standard configuration space header of
> device 0000:01:00.1
> pcilib: Cannot open /sys/bus/pci/devices/0000:00:00.0/config
> lspci: Unable to read the standard configuration space header of
> device 0000:00:00.0
> pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.0/config
> lspci: Unable to read the standard configuration space header of
> device 0000:01:00.0
> 
> Attempting to rescan the bus, or any other PCI operation, results in a
> hung kernel task.

You crashed deep in the kernel part-way through the driver's probe 
routine; don't expect any partly-initialised bits of that driver to 
actually work. I bet you also can't load/unload modules and do various 
other things either, because the thread will almost certainly have been 
holding a bunch of locks, too. It's not worth trying to reason about the 
state of anything after that kind of unexpected crash ;)

>> If the Root Port (00:00.0) supports AER, you could also dump out the
>> status registers from the AER capability and figure out whether it
>> logged a PCIe error.  This would be sort of like what
>> aer_process_err_devices() does.  A bit of a hassle to do this by hand
>> in the exception path, but could be enlightening, just as a debug
>> tool.
> 
> Is there a way to handle external synchronous aborts in a device driver?

Not easily on arm64 - the closest thing we have is some arch-internal 
machinery for undefined instruction hooks. You'd have to hack something 
gruesome into do_sea()...

For starters it might be worth enabling all the debug prints in 
rockchip-pcie to see if there's any error IRQ corresponding to the 
aborted access.
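On a kernel with CONFIG_DYNAMIC_DEBUG, one way to turn those prints on (a sketch, assuming the driver file is pcie-rockchip-host.c as in v5.4, and a typical debugfs mount point):

```shell
# Enable all pr_debug()/dev_dbg() call sites in the rockchip PCIe host
# driver (requires CONFIG_DYNAMIC_DEBUG and a mounted debugfs).
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
echo 'file pcie-rockchip* +p' > /sys/kernel/debug/dynamic_debug/control

# Alternatively, enable it from the kernel command line at boot, which
# also catches messages printed before userspace comes up:
#   dyndbg="file pcie-rockchip* +p"
```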

Robin.

> If so, I'll definitely look into plugging in the aer status functions.
> 
>>
>>> [3] https://twitter.com/domipheus/status/1167586160077627393
>>>>
>>>>> [  198.751581] Modules linked in: drm_panel_orientation_quirks
>>>>> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
>>>>> crc_ccitt nf_defrag_ipv6
>>>>> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
>>>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
>>>>> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
>>>>> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
>>>>> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
>>>>> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
>>>>> [  198.755773] sp : ffff8000132af530
>>>>> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
>>>>> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
>>>>> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
>>>>> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
>>>>> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
>>>>> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
>>>>> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
>>>>> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
>>>>> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
>>>>> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
>>>>> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
>>>>> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
>>>>> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
>>>>> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
>>>>> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
>>>>> [  198.763348] Call trace:
>>>>> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
>>>>> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
>>>>> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
>>>>> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
>>>>> [  198.765413]  pci_scan_single_device+0x84/0xf8
>>>>> [  198.765812]  pci_scan_slot+0x7c/0x120
>>>>> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
>>>>> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
>>>>> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
>>>>> [  198.767430]  pci_scan_child_bus+0x24/0x30
>>>>> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
>>>>> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
>>>>> [  198.768770]  platform_drv_probe+0x58/0xa8
>>>>> [  198.769139]  really_probe+0xe0/0x318
>>>>> [  198.769468]  driver_probe_device+0x5c/0xf0
>>>>> [  198.769844]  device_driver_attach+0x74/0x80
>>>>> [  198.770227]  __driver_attach+0x64/0xe8
>>>>> [  198.770572]  bus_for_each_dev+0x84/0xd8
>>>>> [  198.770924]  driver_attach+0x30/0x40
>>>>> [  198.771253]  bus_add_driver+0x188/0x1e8
>>>>> [  198.771605]  driver_register+0x64/0x110
>>>>> [  198.771956]  __platform_driver_register+0x54/0x60
>>>>> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
>>>>> [  198.772998]  do_one_initcall+0x94/0x390
>>>>> [  198.773353]  do_init_module+0x88/0x268
>>>>> [  198.773697]  load_module+0x1e18/0x2198
>>>>> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
>>>>> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
>>>>> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
>>>>> [  198.775297]  el0_svc_handler+0x34/0xa0
>>>>> [  198.775645]  el0_svc+0x14/0x40
>>>>> [  198.775928]  el0_sync_handler+0x118/0x290
>>>>> [  198.776295]  el0_sync+0x164/0x180
>>>>> [  198.776609] Code: bad PC value
>>>>> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---
> 
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rockchip
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-12 19:15         ` Robin Murphy
@ 2019-11-12 19:41           ` Peter Geis
  2019-11-13  1:06             ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-12 19:41 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Bjorn Helgaas, linux-pci, Shawn Lin, Heiko Stuebner,
	open list:ARM/Rockchip SoC...

On Tue, Nov 12, 2019 at 2:15 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 12/11/2019 3:55 pm, Peter Geis wrote:
> > On Mon, Nov 11, 2019 at 9:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>
> >> On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
> >>> On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>>> On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> >>>>> Good Morning,
> >>>>>
> >>>>> I'm attempting to debug an issue with the rockpro64 pcie port.
> >>>>> It would appear that the port does not like various cards, including
> >>>>> cards of the same make that randomly work or do not work, such as
> >>>>> Intel i340 based NICs.
> >>>>> I'm experiencing it with a GTX645 gpu.
> >>>>>
> >>>>> This seems to be a long running issue, referenced both at [0], and [1].
> >>>>> There was an attempt to rectify it, by adding a delay between training
> >>>>> and probing [2], but that doesn't seem to be the issue here.
> >>>>> It appears that when we probe further into the card, such as devfn >
> >>>>> 1, we trigger the bug.
> >>>>> I've added a print statement that prints the devfn, address, and size
> >>>>> information, which you can see below.
> >>>>>
> >>>>> I've attempted setting the available number of lanes to 1 as well, to
> >>>>> no difference.
> >>>>>
> >>>>> If anyone could point me in the right direction as to where to
> >>>>> continue debugging, I'd greatly appreciate it.
> >>>>>
> >>>>> [0] https://github.com/ayufan-rock64/linux-build/issues/254
> >>>>> [1] https://github.com/rockchip-linux/kernel/issues/116
> >>>>> [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
> >>>>>
> >>>>> [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> >>>>> for per-lane PHY
> >>>>> [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> >>>>> [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> >>>>> [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> >>>>> [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> >>>>> [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> >>>>> [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> >>>>> [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> >>>>> [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> >>>>> [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> >>>>> 0xfa000000..0xfbdfffff -> 0xfa000000
> >>>>> [  198.550676] rockchip-pcie f8000000.pcie:    IO
> >>>>> 0xfbe00000..0xfbefffff -> 0xfbe00000
> >>>>> [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> >>>>> [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> >>>>> [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> >>>>> [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> >>>>> (bus address [0xfbe00000-0xfbefffff])
> >>>>> [  198.552965] pci_bus 0000:00: scanning bus
> >>>>> [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> >>>>> [  198.555508] pci 0000:00:00.0: supports D1
> >>>>> [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> >>>>> [  198.556023] pci 0000:00:00.0: PME# disabled
> >>>>> [  198.561245] pci_bus 0000:00: fixups for bus
> >>>>> [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> >>>>> [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> >>>>> 00-00]), reconfiguring
> >>>>> [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> >>>>> [  198.567008] pci_bus 0000:01: scanning bus
> >>>>> [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> >>>>> [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> >>>>> [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> >>>>> 64bit pref]
> >>>>> [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> >>>>> 64bit pref]
> >>>>> [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> >>>>> [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> >>>>> [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> >>>>> [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> >>>>> [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> >>>>> limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> >>>>> with 8 GT/s x16 link)
> >>>>> [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> >>>>> decodes=io+mem,owns=none,locks=none
> >>>>> [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> >>>>> [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> >>>>> [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> >>>>> <snip>
> >>>>> [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> >>>>> [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> >>>>> [  198.750881] Internal error: synchronous external abort: 96000210
> >>>>> [#1] PREEMPT SMP
> >>>>
> >>>> Is there really supposed to be a device at 01:00.2?
> >>>>
> >>>> Maybe this is just the PCIe Unsupported Request error that we expect
> >>>> to get when trying to read config space of a device that doesn't
> >>>> exist.
> >>>>
> >>>> On "most" platforms, we just get ~0 data back when that happens, but
> >>>> I'm not sure that's always the case on arm64.  I think it depends on
> >>>> how the PCIe host bridge is designed, and there might be some CPU
> >>>> configuration, too.
> >>>
> >>> Yes, this is a GTX645 video card.
> >>> Nvidia cards usually have two to three devices:
> >>> the GPU proper, the audio device for the HDMI output, and the i2c controller.
> >>>
> >>> I do think that this driver is missing sanity checking on the
> >>> addressing, since the BRCM driver for the RPI4 doesn't try to
> >>> enumerate a video card, since it checks if the MMIO space is large
> >>> enough to fit the BAR before assigning the addresses. See [3]. Also in
> >>> that thread he was able to increase the address space provided to the
> >>> BRCM driver and fix the issue, but I don't see how we could do that on
> >>> the rk3399.
> >>>
> >>> pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
> >>> 128 MB, which already exceeds our address space.
> >>> I think the driver is just overflowing the address space.
> >>
> >> If we don't have enough space to assign all the device BARs, I think a
> >> driver will still be able to claim the device, but when the driver
> >> calls pci_enable_device(), it should fail.  Lack of address space
> >> should not cause a PCIe error.
> >>
> >> But in this case, none of that matters because we're still enumerating
> >> devices in pci_scan_root_bus_bridge().  We haven't gotten to the point
> >> of trying to bind drivers to devices, so the driver isn't involved at
> >> all yet.
> > For clarification, the driver I'm referring to is the rk3399-pcie host driver.
> >>
> >> The backtrace says we're trying to read the Vendor ID of a device, and
> >> your debug output suggests we're trying to enumerate 01:00.2.  If you
> >> put that card in another system, you could find out how many functions
> >> it has.
> >>
> >> Or if you swapped this with other cards where you know the number of
> >> functions, you could see if the external abort always happens when
> >> probing for the first unimplemented function.
> > This card definitely has more than one function.
>
> FWIW, random lspci logs I grovelled up off the internet show cards with
> the same IDs only implementing functions 0 and 1, which does suggest
> that maybe function 2 really doesn't exist but the card handles
> unsupported config requests in a way that this particular bridge/root
> port doesn't quite deal with properly.

Thanks! Is there a sane way we could make the kernel handle this in
place of the controller?
(It's apparently based on the DesignWare IP, but it doesn't use their
driver for some reason.)

>
> > Before my original message I hacked in some code to make the driver
> > return 0xff when devfn > 1, and the scan passed, but as soon as
> > nouveau attempted to access the device, the entire kernel exploded.
> >
> > Another reason I believe the address assignments are overflowing and
> > corrupting other address assignments is that, after the external abort,
> > the entire PCIe controller is inaccessible.
> > $ lspci
> > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.1/config
> > lspci: Unable to read the standard configuration space header of
> > device 0000:01:00.1
> > pcilib: Cannot open /sys/bus/pci/devices/0000:00:00.0/config
> > lspci: Unable to read the standard configuration space header of
> > device 0000:00:00.0
> > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.0/config
> > lspci: Unable to read the standard configuration space header of
> > device 0000:01:00.0
> >
> > Attempting to rescan the bus, or any other PCI operation, results in a
> > hung kernel task.
>
> You crashed deep in the kernel part-way through the driver's probe
> routine; don't expect any partly-initialised bits of that driver to
> actually work. I bet you also can't load/unload modules and do various
> other things either, because the thread will almost certainly have been
> holding a bunch of locks, too. It's not worth trying to reason about the
> state of anything after that kind of unexpected crash ;)

That matches what I observed, but I think it is actually more
destructive than that.
Attempting an access after the fact and then performing a soft reset
results in a spectacular crash when it tries to bring the big cluster
online.

[    0.203160] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
[    5.355606] CPU4: failed to come online
[    5.363258] CPU4: failed in unknown state : 0x0
Unhandled Exception in EL3.
x30 =           0xUhandled Exception in L3.
x30 =           UNhandled ExcEption in EL3.
x30 =   0xUnhandled Excepti2n in EL_.
x30 =           0xUnhandled Exception in EL3.
x30 =           lxUnhadled Exception in EL3.
x30 =           0xUnhandled Excepion in EL3.
x30 =           0xUnhandled Exception in EL3.
x30 =   0xUnhandled Eception in EL3.
x30 =           0xUnhandled Exception in EL3.
x30 =           0xUnhandled Exception in EL3.x30 =
0xUnhandled Exception in EL3.
x30 =           0xUnhandled Excepton i2 EL3.
x30 =           0xUnhandUnhandled Exceptionein EL3.lx30 =
 0xUnhandled Exception in EL3.
x30 =           0xUnhandled Unhanded Exception in EL3.
x30 =           0xUnhandled Exceptin in EL3Unhandled Exception inEL3.
x0 =            0xUnhandled Exception in EL3.
x30 =           0xUnhandled Exception in EL3.
x30 =           0xUnhandled Exception in EL.
x30 =           0xUnhandled Exception in EL3.
x30 =           0xUnhandled Exception in EL3.
x30 =           0xUnhandled xception in EL3
x30 =           0xUnhcndled Exception in EL3.
x30=            0xUnhandled Excextion in EL3.
x30 =   0xUnhandled Exception in EL3.
x30 =           0x[   10.484411] CPU5: failed to come online
[   10.484807] CPU5: failed in unknown state : 0x0
[   10.485374] smp: Brought up 1 node, 4 CPUs

It goes on until it tries to do its first DMA setup, then just hangs.
>
> >> If the Root Port (00:00.0) supports AER, you could also dump out the
> >> status registers from the AER capability and figure out whether it
> >> logged a PCIe error.  This would be sort of like what
> >> aer_process_err_devices() does.  A bit of a hassle to do this by hand
> >> in the exception path, but could be enlightening, just as a debug
> >> tool.
> >
> > Is there a way to handle external synchronous aborts in a device driver?
>
> Not easily on arm64 - the closest thing we have is some arch-internal
> machinery for undefined instruction hooks. You'd have to hack something
> gruesome into do_sea()...
>
> For starters it might be worth enabling all the debug prints in
> rockchip-pcie to see if there's any error IRQ corresponding to the
> aborted access.

Debug prints are already enabled; I had to hack in the print to show
which addresses it was crashing on.

Your explanation was the last piece for me to understand how this all
works, thanks!
>
> Robin.
>
> > If so, I'll definitely look into plugging in the aer status functions.
> >
> >>
> >>> [3] https://twitter.com/domipheus/status/1167586160077627393
> >>>>
> >>>>> [  198.751581] Modules linked in: drm_panel_orientation_quirks
> >>>>> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> >>>>> crc_ccitt nf_defrag_ipv6
> >>>>> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> >>>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> >>>>> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> >>>>> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> >>>>> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> >>>>> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> >>>>> [  198.755773] sp : ffff8000132af530
> >>>>> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> >>>>> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> >>>>> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> >>>>> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> >>>>> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> >>>>> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> >>>>> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> >>>>> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> >>>>> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> >>>>> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> >>>>> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> >>>>> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> >>>>> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> >>>>> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> >>>>> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> >>>>> [  198.763348] Call trace:
> >>>>> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> >>>>> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> >>>>> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> >>>>> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> >>>>> [  198.765413]  pci_scan_single_device+0x84/0xf8
> >>>>> [  198.765812]  pci_scan_slot+0x7c/0x120
> >>>>> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> >>>>> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> >>>>> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> >>>>> [  198.767430]  pci_scan_child_bus+0x24/0x30
> >>>>> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> >>>>> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> >>>>> [  198.768770]  platform_drv_probe+0x58/0xa8
> >>>>> [  198.769139]  really_probe+0xe0/0x318
> >>>>> [  198.769468]  driver_probe_device+0x5c/0xf0
> >>>>> [  198.769844]  device_driver_attach+0x74/0x80
> >>>>> [  198.770227]  __driver_attach+0x64/0xe8
> >>>>> [  198.770572]  bus_for_each_dev+0x84/0xd8
> >>>>> [  198.770924]  driver_attach+0x30/0x40
> >>>>> [  198.771253]  bus_add_driver+0x188/0x1e8
> >>>>> [  198.771605]  driver_register+0x64/0x110
> >>>>> [  198.771956]  __platform_driver_register+0x54/0x60
> >>>>> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> >>>>> [  198.772998]  do_one_initcall+0x94/0x390
> >>>>> [  198.773353]  do_init_module+0x88/0x268
> >>>>> [  198.773697]  load_module+0x1e18/0x2198
> >>>>> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> >>>>> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> >>>>> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> >>>>> [  198.775297]  el0_svc_handler+0x34/0xa0
> >>>>> [  198.775645]  el0_svc+0x14/0x40
> >>>>> [  198.775928]  el0_sync_handler+0x118/0x290
> >>>>> [  198.776295]  el0_sync+0x164/0x180
> >>>>> [  198.776609] Code: bad PC value
> >>>>> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---
> >
> > _______________________________________________
> > Linux-rockchip mailing list
> > Linux-rockchip@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-rockchip
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-12 19:41           ` Peter Geis
@ 2019-11-13  1:06             ` Peter Geis
  2019-11-13  1:19               ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-13  1:06 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Bjorn Helgaas, linux-pci, Shawn Lin, Heiko Stuebner,
	open list:ARM/Rockchip SoC...

On Tue, Nov 12, 2019 at 2:41 PM Peter Geis <pgwipeout@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 2:15 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 12/11/2019 3:55 pm, Peter Geis wrote:
> > > On Mon, Nov 11, 2019 at 9:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >>
> > >> On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
> > >>> On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >>>> On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> > >>>>> Good Morning,
> > >>>>>
> > >>>>> I'm attempting to debug an issue with the rockpro64 pcie port.
> > >>>>> It would appear that the port does not like various cards, including
> > >>>>> cards of the same make that randomly work or do not work, such as
> > >>>>> Intel i340 based NICs.
> > >>>>> I'm experiencing it with a GTX645 gpu.
> > >>>>>
> > >>>>> This seems to be a long running issue, referenced both at [0], and [1].
> > >>>>> There was an attempt to rectify it, by adding a delay between training
> > >>>>> and probing [2], but that doesn't seem to be the issue here.
> > >>>>> It appears that when we probe further into the card, such as devfn >
> > >>>>> 1, we trigger the bug.
> > >>>>> I've added a print statement that prints the devfn, address, and size
> > >>>>> information, which you can see below.
> > >>>>>
> > >>>>> I've attempted setting the available number of lanes to 1 as well, to
> > >>>>> no difference.
> > >>>>>
> > >>>>> If anyone could point me in the right direction as to where to
> > >>>>> continue debugging, I'd greatly appreciate it.
> > >>>>>
> > >>>>> [0] https://github.com/ayufan-rock64/linux-build/issues/254
> > >>>>> [1] https://github.com/rockchip-linux/kernel/issues/116
> > >>>>> [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
> > >>>>>
> > >>>>> [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> > >>>>> for per-lane PHY
> > >>>>> [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> > >>>>> [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> > >>>>> [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> > >>>>> [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> > >>>>> [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> > >>>>> [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> > >>>>> [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> > >>>>> [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> > >>>>> [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> > >>>>> 0xfa000000..0xfbdfffff -> 0xfa000000
> > >>>>> [  198.550676] rockchip-pcie f8000000.pcie:    IO
> > >>>>> 0xfbe00000..0xfbefffff -> 0xfbe00000
> > >>>>> [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> > >>>>> [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> > >>>>> [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> > >>>>> [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > >>>>> (bus address [0xfbe00000-0xfbefffff])
> > >>>>> [  198.552965] pci_bus 0000:00: scanning bus
> > >>>>> [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> > >>>>> [  198.555508] pci 0000:00:00.0: supports D1
> > >>>>> [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > >>>>> [  198.556023] pci 0000:00:00.0: PME# disabled
> > >>>>> [  198.561245] pci_bus 0000:00: fixups for bus
> > >>>>> [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> > >>>>> [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> > >>>>> 00-00]), reconfiguring
> > >>>>> [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> > >>>>> [  198.567008] pci_bus 0000:01: scanning bus
> > >>>>> [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> > >>>>> [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> > >>>>> [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> > >>>>> 64bit pref]
> > >>>>> [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> > >>>>> 64bit pref]
> > >>>>> [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> > >>>>> [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> > >>>>> [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> > >>>>> [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> > >>>>> [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > >>>>> limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> > >>>>> with 8 GT/s x16 link)
> > >>>>> [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> > >>>>> decodes=io+mem,owns=none,locks=none
> > >>>>> [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> > >>>>> [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> > >>>>> [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> > >>>>> <snip>
> > >>>>> [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> > >>>>> [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> > >>>>> [  198.750881] Internal error: synchronous external abort: 96000210
> > >>>>> [#1] PREEMPT SMP
> > >>>>
> > >>>> Is there really supposed to be a device at 01:00.2?
> > >>>>
> > >>>> Maybe this is just the PCIe Unsupported Request error that we expect
> > >>>> to get when trying to read config space of a device that doesn't
> > >>>> exist.
> > >>>>
> > >>>> On "most" platforms, we just get ~0 data back when that happens, but
> > >>>> I'm not sure that's always the case on arm64.  I think it depends on
> > >>>> how the PCIe host bridge is designed, and there might be some CPU
> > >>>> configuration, too.
> > >>>
> > >>> Yes, this is a GTX645 video card.
> > >>> Nvidia cards usually have two to three devices,
> > >>> The GPU proper, the audio device for the hdmi output, and the i2c controller.
> > >>>
> > >>> I do think that this driver is missing sanity checking on the
> > >>> addressing, since the BRCM driver for the RPI4 doesn't try to
> > >>> enumerate a video card, since it checks if the MMIO space is large
> > >>> enough to fit the BAR before assigning the addresses. See [3]. Also in
> > >>> that thread he was able to increase the address space provided to the
> > >>> BRCM driver and fix the issue, but I don't see how we could do that on
> > >>> the rk3399.
> > >>>
> > >>> pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
> > >>> 128 MB, which already exceeds our address space.
> > >>> I think the driver is just overflowing the address space.
> > >>
> > >> If we don't have enough space to assign all the device BARs, I think a
> > >> driver will still be able to claim the device, but when the driver
> > >> calls pci_enable_device(), it should fail.  Lack of address space
> > >> should not cause a PCIe error.
> > >>
> > >> But in this case, none of that matters because we're still enumerating
> > >> devices in pci_scan_root_bus_bridge().  We haven't gotten to the point
> > >> of trying to bind drivers to devices, so the driver isn't involved at
> > >> all yet.
> > > For clarification, the driver I'm referring to is the rk3399-pcie host driver.
> > >>
> > >> The backtrace says we're trying to read the Vendor ID of a device, and
> > >> your debug output suggests we're trying to enumerate 01:00.2.  If you
> > >> put that card in another system, you could find out how many functions
> > >> it has.
> > >>
> > >> Or if you swapped this with other cards where you know the number of
> > >> functions, you could see if the external abort always happens when
> > >> probing for the first unimplemented function.
> > > This card definitely has more than one function.
> >
> > FWIW, random lspci logs I grovelled up off the internet show cards with
> > the same IDs only implementing functions 0 and 1, which does suggest
> > that maybe function 2 really doesn't exist but the card handles
> > unsupported config requests in a way that this particular bridge/root
> > port doesn't quite deal with properly.
>
> Thanks! Is there a sane way we could make the kernel handle this in
> place of the controller?
> (It's apparently based on the designware ip, but it doesn't use their
> driver for some reason)
>
> >
> > > Before my original message I hacked in some code to make the driver
> > > return 0xff when devfn > 1, and the scan passed, but as soon as
> > > nouveau attempted to access the device, the entire kernel exploded.

This is what happens when probing is prevented from reading above
devfn 1 and nouveau gets a shot:
[   76.658128] SError Interrupt on CPU4, code 0xbf000000 -- SError
[   76.658129] CPU: 4 PID: 1752 Comm: systemd-udevd Not tainted
5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #16
[   76.658130] Hardware name: Pine64 RockPro64 (DT)
[   76.658131] pstate: 60400005 (nZCv daif +PAN -UAO)
[   76.658131] pc : nvkm_device_ctor+0x2cc/0x2e18 [nouveau]
[   76.658132] lr : nvkm_device_ctor+0x2c0/0x2e18 [nouveau]
[   76.658133] sp : ffff800012daf740
[   76.658134] x29: ffff800012daf740 x28: ffff0000c2db9800
[   76.658136] x27: 0000000000000000 x26: ffff0000c7f2e0b0
[   76.658138] x25: 0000000000000000 x24: 0000000000000000
[   76.658139] x23: 0000000000000000 x22: ffff800009e1b1e0
[   76.658141] x21: ffff0000c335b940 x20: ffff800009ede660
[   76.658142] x19: 0000000000000000 x18: 0000000000000001
[   76.658144] x17: 0000000000000000 x16: 0000000000000000
[   76.658145] x15: ffff0000c335bdc0 x14: ffff8000140a0000
[   76.658147] x13: 0000000000110000 x12: 0040000000000001
[   76.658148] x11: ffff0000c12b8180 x10: 0000000000000085
[   76.658150] x9 : ffff800013c00000 x8 : 0000000000110000
[   76.658152] x7 : 0000000000000000 x6 : ffff0000f7ff0008
[   76.658153] x5 : ffff800013d0ffff x4 : ffff800013c00000
[   76.658154] x3 : ffff800013c00004 x2 : 0140000000000000
[   76.658156] x1 : 0000000000000000 x0 : ffff800013c00000
[   76.658158] Kernel panic - not syncing: Asynchronous SError Interrupt
[   76.658159] CPU: 4 PID: 1752 Comm: systemd-udevd Not tainted
5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #16
[   76.658160] Hardware name: Pine64 RockPro64 (DT)
[   76.658161] Call trace:
[   76.658161]  dump_backtrace+0x0/0x148
[   76.658162]  show_stack+0x24/0x30
[   76.658163]  dump_stack+0xa8/0xf0
[   76.658163]  panic+0x160/0x36c
[   76.658164]  nmi_panic+0x94/0x98
[   76.658164]  arm64_serror_panic+0x84/0x90
[   76.658165]  do_serror+0x88/0x140
[   76.658166]  el1_error+0xc4/0x170
[   76.658166]  nvkm_device_ctor+0x2cc/0x2e18 [nouveau]
[   76.658167]  nvkm_device_pci_new+0xf8/0x2c8 [nouveau]
[   76.658168]  nouveau_drm_probe+0x60/0x288 [nouveau]
[   76.658168]  local_pci_probe+0x44/0x98
[   76.658169]  pci_device_probe+0x118/0x1b0
[   76.658170]  really_probe+0xe0/0x320
[   76.658171]  driver_probe_device+0x5c/0xf0
[   76.658171]  device_driver_attach+0x74/0x80
[   76.658172]  __driver_attach+0x64/0xe0
[   76.658173]  bus_for_each_dev+0x80/0xd0
[   76.658173]  driver_attach+0x30/0x40
[   76.658174]  bus_add_driver+0x14c/0x1f0
[   76.658175]  driver_register+0x64/0x110
[   76.658175]  __pci_register_driver+0x58/0x68
[   76.658176]  nouveau_drm_init+0x17c/0x10000 [nouveau]
[   76.658177]  do_one_initcall+0x50/0x298
[   76.658178]  do_init_module+0x88/0x270
[   76.658178]  load_module+0x2068/0x2600
[   76.658179]  __do_sys_finit_module+0xd0/0xe8
[   76.658180]  __arm64_sys_finit_module+0x28/0x38
[   76.658180]  el0_svc_common.constprop.0+0x7c/0x1f0
[   76.658181]  el0_svc_handler+0x34/0xa0
[   76.658182]  el0_svc+0x14/0x40
[   76.658182]  el0_sync_handler+0x118/0x280
[   76.658183]  el0_sync+0x164/0x180
[   76.658248] SMP: stopping secondary CPUs
[   76.658249] Kernel Offset: disabled
[   76.658250] CPU features: 0x10002,2000600c
[   76.658250] Memory Limit: none


> > >
> > > Another reason I believe the address assignments are overflowing and
> > > corrupting other address assignments is after the external abort, the
> > > entire PCIE controller is inaccessible.
> > > $ lspci
> > > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.1/config
> > > lspci: Unable to read the standard configuration space header of
> > > device 0000:01:00.1
> > > pcilib: Cannot open /sys/bus/pci/devices/0000:00:00.0/config
> > > lspci: Unable to read the standard configuration space header of
> > > device 0000:00:00.0
> > > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.0/config
> > > lspci: Unable to read the standard configuration space header of
> > > device 0000:01:00.0
> > >
> > > Attempting to rescan the bus or any other pci function results in a
> > > hung kernel task.
> >
> > You crashed deep in the kernel part-way through the driver's probe
> > routine; don't expect any partly-initialised bits of that driver to
> > actually work. I bet you also can't load/unload modules and do various
> > other things either, because the thread will almost certainly have been
> > holding a bunch of locks, too. It's not worth trying to reason about the
> > state of anything after that kind of unexpected crash ;)
>
> That is what I observed, but I think it is actually more destructive than that.
> Attempting another access after the fact, then performing a soft
> reset, results in a spectacular crash when it tries to bring the big
> cluster online.
>
> [    0.203160] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
> [    5.355606] CPU4: failed to come online
> [    5.363258] CPU4: failed in unknown state : 0x0
> Unhandled Exception in EL3.
> x30 =           0xUhandled Exception in L3.
> x30 =           UNhandled ExcEption in EL3.
> x30 =   0xUnhandled Excepti2n in EL_.
> x30 =           0xUnhandled Exception in EL3.
> x30 =           lxUnhadled Exception in EL3.
> x30 =           0xUnhandled Excepion in EL3.
> x30 =           0xUnhandled Exception in EL3.
> x30 =   0xUnhandled Eception in EL3.
> x30 =           0xUnhandled Exception in EL3.
> x30 =           0xUnhandled Exception in EL3.x30 =
> 0xUnhandled Exception in EL3.
> x30 =           0xUnhandled Excepton i2 EL3.
> x30 =           0xUnhandUnhandled Exceptionein EL3.lx30 =
>  0xUnhandled Exception in EL3.
> x30 =           0xUnhandled Unhanded Exception in EL3.
> x30 =           0xUnhandled Exceptin in EL3Unhandled Exception inEL3.
> x0 =            0xUnhandled Exception in EL3.
> x30 =           0xUnhandled Exception in EL3.
> x30 =           0xUnhandled Exception in EL.
> x30 =           0xUnhandled Exception in EL3.
> x30 =           0xUnhandled Exception in EL3.
> x30 =           0xUnhandled xception in EL3
> x30 =           0xUnhcndled Exception in EL3.
> x30=            0xUnhandled Excextion in EL3.
> x30 =   0xUnhandled Exception in EL3.
> x30 =           0x[   10.484411] CPU5: failed to come online
> [   10.484807] CPU5: failed in unknown state : 0x0
> [   10.485374] smp: Brought up 1 node, 4 CPUs
>
> It goes on until it tries to do its first DMA setup, then just hangs.
> >
> > >> If the Root Port (00:00.0) supports AER, you could also dump out the
> > >> status registers from the AER capability and figure out whether it
> > >> logged a PCIe error.  This would be sort of like what
> > >> aer_process_err_devices() does.  A bit of a hassle to do this by hand
> > >> in the exception path, but could be enlightening, just as a debug
> > >> tool.
> > >
> > > Is there a way to handle external synchronous aborts in a device driver?
> >
> > Not easily on arm64 - the closest thing we have is some arch-internal
> > machinery for undefined instruction hooks. You'd have to hack something
> > gruesome into do_sea()...
> >
> > For starters it might be worth enabling all the debug prints in
> > rockchip-pcie to see if there's any error IRQ corresponding to the
> > aborted access.
>
> Debug prints are already enabled; I had to hack in the print to show
> which addresses it was crashing on.
>
> Your explanation was the last piece for me to understand how this all
> works, thanks!
> >
> > Robin.
> >
> > > If so, I'll definitely look into plugging in the aer status functions.
> > >
> > >>
> > >>> [3] https://twitter.com/domipheus/status/1167586160077627393
> > >>>>
> > >>>>> [  198.751581] Modules linked in: drm_panel_orientation_quirks
> > >>>>> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> > >>>>> crc_ccitt nf_defrag_ipv6
> > >>>>> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> > >>>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> > >>>>> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> > >>>>> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> > >>>>> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > >>>>> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> > >>>>> [  198.755773] sp : ffff8000132af530
> > >>>>> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> > >>>>> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> > >>>>> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> > >>>>> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> > >>>>> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> > >>>>> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> > >>>>> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> > >>>>> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> > >>>>> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> > >>>>> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> > >>>>> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> > >>>>> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> > >>>>> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> > >>>>> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> > >>>>> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> > >>>>> [  198.763348] Call trace:
> > >>>>> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > >>>>> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> > >>>>> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> > >>>>> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> > >>>>> [  198.765413]  pci_scan_single_device+0x84/0xf8
> > >>>>> [  198.765812]  pci_scan_slot+0x7c/0x120
> > >>>>> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> > >>>>> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> > >>>>> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> > >>>>> [  198.767430]  pci_scan_child_bus+0x24/0x30
> > >>>>> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> > >>>>> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> > >>>>> [  198.768770]  platform_drv_probe+0x58/0xa8
> > >>>>> [  198.769139]  really_probe+0xe0/0x318
> > >>>>> [  198.769468]  driver_probe_device+0x5c/0xf0
> > >>>>> [  198.769844]  device_driver_attach+0x74/0x80
> > >>>>> [  198.770227]  __driver_attach+0x64/0xe8
> > >>>>> [  198.770572]  bus_for_each_dev+0x84/0xd8
> > >>>>> [  198.770924]  driver_attach+0x30/0x40
> > >>>>> [  198.771253]  bus_add_driver+0x188/0x1e8
> > >>>>> [  198.771605]  driver_register+0x64/0x110
> > >>>>> [  198.771956]  __platform_driver_register+0x54/0x60
> > >>>>> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> > >>>>> [  198.772998]  do_one_initcall+0x94/0x390
> > >>>>> [  198.773353]  do_init_module+0x88/0x268
> > >>>>> [  198.773697]  load_module+0x1e18/0x2198
> > >>>>> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> > >>>>> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> > >>>>> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> > >>>>> [  198.775297]  el0_svc_handler+0x34/0xa0
> > >>>>> [  198.775645]  el0_svc+0x14/0x40
> > >>>>> [  198.775928]  el0_sync_handler+0x118/0x290
> > >>>>> [  198.776295]  el0_sync+0x164/0x180
> > >>>>> [  198.776609] Code: bad PC value
> > >>>>> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---
> > >
> > > _______________________________________________
> > > Linux-rockchip mailing list
> > > Linux-rockchip@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/linux-rockchip
> > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-13  1:06             ` Peter Geis
@ 2019-11-13  1:19               ` Peter Geis
  2019-11-21  0:36                 ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-13  1:19 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Bjorn Helgaas, linux-pci, Shawn Lin, Heiko Stuebner,
	open list:ARM/Rockchip SoC...

On Tue, Nov 12, 2019 at 8:06 PM Peter Geis <pgwipeout@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 2:41 PM Peter Geis <pgwipeout@gmail.com> wrote:
> >
> > On Tue, Nov 12, 2019 at 2:15 PM Robin Murphy <robin.murphy@arm.com> wrote:
> > >
> > > On 12/11/2019 3:55 pm, Peter Geis wrote:
> > > > On Mon, Nov 11, 2019 at 9:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >>
> > > >> On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
> > > >>> On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >>>> On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> > > >>>>> Good Morning,
> > > >>>>>
> > > >>>>> I'm attempting to debug an issue with the rockpro64 pcie port.
> > > >>>>> It would appear that the port does not like various cards, including
> > > >>>>> cards of the same make that randomly work or do not work, such as
> > > >>>>> Intel i340 based NICs.
> > > >>>>> I'm experiencing it with a GTX645 gpu.
> > > >>>>>
> > > >>>>> This seems to be a long running issue, referenced both at [0], and [1].
> > > >>>>> There was an attempt to rectify it, by adding a delay between training
> > > >>>>> and probing [2], but that doesn't seem to be the issue here.
> > > >>>>> It appears that when we probe further into the card, such as devfn >
> > > >>>>> 1, we trigger the bug.
> > > >>>>> I've added a print statement that prints the devfn, address, and size
> > > >>>>> information, which you can see below.
> > > >>>>>
> > > >>>>> I've attempted setting the available number of lanes to 1 as well, to
> > > >>>>> no difference.
> > > >>>>>
> > > >>>>> If anyone could point me in the right direction as to where to
> > > >>>>> continue debugging, I'd greatly appreciate it.
> > > >>>>>
> > > >>>>> [0] https://github.com/ayufan-rock64/linux-build/issues/254
> > > >>>>> [1] https://github.com/rockchip-linux/kernel/issues/116
> > > >>>>> [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
> > > >>>>>
> > > >>>>> [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
> > > >>>>> for per-lane PHY
> > > >>>>> [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
> > > >>>>> [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> > > >>>>> [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
> > > >>>>> [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
> > > >>>>> [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
> > > >>>>> [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
> > > >>>>> [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> > > >>>>> [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
> > > >>>>> [  198.550656] rockchip-pcie f8000000.pcie:   MEM
> > > >>>>> 0xfa000000..0xfbdfffff -> 0xfa000000
> > > >>>>> [  198.550676] rockchip-pcie f8000000.pcie:    IO
> > > >>>>> 0xfbe00000..0xfbefffff -> 0xfbe00000
> > > >>>>> [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> > > >>>>> [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
> > > >>>>> [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> > > >>>>> [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > > >>>>> (bus address [0xfbe00000-0xfbefffff])
> > > >>>>> [  198.552965] pci_bus 0000:00: scanning bus
> > > >>>>> [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> > > >>>>> [  198.555508] pci 0000:00:00.0: supports D1
> > > >>>>> [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > > >>>>> [  198.556023] pci 0000:00:00.0: PME# disabled
> > > >>>>> [  198.561245] pci_bus 0000:00: fixups for bus
> > > >>>>> [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> > > >>>>> [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
> > > >>>>> 00-00]), reconfiguring
> > > >>>>> [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> > > >>>>> [  198.567008] pci_bus 0000:01: scanning bus
> > > >>>>> [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> > > >>>>> [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> > > >>>>> [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> > > >>>>> 64bit pref]
> > > >>>>> [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> > > >>>>> 64bit pref]
> > > >>>>> [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> > > >>>>> [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> > > >>>>> [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> > > >>>>> [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> > > >>>>> [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > > >>>>> limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
> > > >>>>> with 8 GT/s x16 link)
> > > >>>>> [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
> > > >>>>> decodes=io+mem,owns=none,locks=none
> > > >>>>> [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> > > >>>>> [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> > > >>>>> [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> > > >>>>> <snip>
> > > >>>>> [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
> > > >>>>> [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
> > > >>>>> [  198.750881] Internal error: synchronous external abort: 96000210
> > > >>>>> [#1] PREEMPT SMP
> > > >>>>
> > > >>>> Is there really supposed to be a device at 01:00.2?
> > > >>>>
> > > >>>> Maybe this is just the PCIe Unsupported Request error that we expect
> > > >>>> to get when trying to read config space of a device that doesn't
> > > >>>> exist.
> > > >>>>
> > > >>>> On "most" platforms, we just get ~0 data back when that happens, but
> > > >>>> I'm not sure that's always the case on arm64.  I think it depends on
> > > >>>> how the PCIe host bridge is designed, and there might be some CPU
> > > >>>> configuration, too.
> > > >>>
> > > >>> Yes, this is a GTX645 video card.
> > > >>> Nvidia cards usually have two to three devices:
> > > >>> the GPU proper, the audio device for the HDMI output, and the I2C controller.
> > > >>>
> > > >>> I do think this driver is missing sanity checking on the
> > > >>> addressing: the BRCM driver for the RPi4 refuses to
> > > >>> enumerate a video card because it checks whether the MMIO space is
> > > >>> large enough to fit the BAR before assigning the addresses. See [3].
> > > >>> Also in that thread he was able to increase the address space
> > > >>> provided to the BRCM driver and fix the issue, but I don't see how we
> > > >>> could do that on the rk3399.
> > > >>>
> > > >>> pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
> > > >>> 128 MB, which already exceeds our address space.
> > > >>> I think the driver is just overflowing the address space.
> > > >>
> > > >> If we don't have enough space to assign all the device BARs, I think a
> > > >> driver will still be able to claim the device, but when the driver
> > > >> calls pci_enable_device(), it should fail.  Lack of address space
> > > >> should not cause a PCIe error.
> > > >>
> > > >> But in this case, none of that matters because we're still enumerating
> > > >> devices in pci_scan_root_bus_bridge().  We haven't gotten to the point
> > > >> of trying to bind drivers to devices, so the driver isn't involved at
> > > >> all yet.
> > > > For clarification, the driver I'm referring to is the rk3399-pcie host driver.
> > > >>
> > > >> The backtrace says we're trying to read the Vendor ID of a device, and
> > > >> your debug output suggests we're trying to enumerate 01:00.2.  If you
> > > >> put that card in another system, you could find out how many functions
> > > >> it has.
> > > >>
> > > >> Or if you swapped this with other cards where you know the number of
> > > >> functions, you could see if the external abort always happens when
> > > >> probing for the first unimplemented function.
> > > > This card definitely has more than one function.
> > >
> > > FWIW, random lspci logs I grovelled up off the internet show cards with
> > > the same IDs only implementing functions 0 and 1, which does suggest
> > > that maybe function 2 really doesn't exist but the card handles
> > > unsupported config requests in a way that this particular bridge/root
> > > port doesn't quite deal with properly.
> >
> > Thanks! Is there a sane way we could make the kernel handle this in
> > place of the controller?
> > (It's apparently based on the DesignWare IP, but it doesn't use their
> > driver for some reason.)
> >
> > >
> > > > Before my original message I hacked in some code to make the driver
> > > > return 0xff when devfn > 1, and the scan passed, but as soon as
> > > > nouveau attempted to access the device, the entire kernel exploded.
>
> This is what happens when probing is prevented from reading above
> devfn 1, and nouveau gets a shot.
> [   76.658128] SError Interrupt on CPU4, code 0xbf000000 -- SError
> [   76.658129] CPU: 4 PID: 1752 Comm: systemd-udevd Not tainted
> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #16
> [   76.658130] Hardware name: Pine64 RockPro64 (DT)
> [   76.658131] pstate: 60400005 (nZCv daif +PAN -UAO)
> [   76.658131] pc : nvkm_device_ctor+0x2cc/0x2e18 [nouveau]
> [   76.658132] lr : nvkm_device_ctor+0x2c0/0x2e18 [nouveau]
> [   76.658133] sp : ffff800012daf740
> [   76.658134] x29: ffff800012daf740 x28: ffff0000c2db9800
> [   76.658136] x27: 0000000000000000 x26: ffff0000c7f2e0b0
> [   76.658138] x25: 0000000000000000 x24: 0000000000000000
> [   76.658139] x23: 0000000000000000 x22: ffff800009e1b1e0
> [   76.658141] x21: ffff0000c335b940 x20: ffff800009ede660
> [   76.658142] x19: 0000000000000000 x18: 0000000000000001
> [   76.658144] x17: 0000000000000000 x16: 0000000000000000
> [   76.658145] x15: ffff0000c335bdc0 x14: ffff8000140a0000
> [   76.658147] x13: 0000000000110000 x12: 0040000000000001
> [   76.658148] x11: ffff0000c12b8180 x10: 0000000000000085
> [   76.658150] x9 : ffff800013c00000 x8 : 0000000000110000
> [   76.658152] x7 : 0000000000000000 x6 : ffff0000f7ff0008
> [   76.658153] x5 : ffff800013d0ffff x4 : ffff800013c00000
> [   76.658154] x3 : ffff800013c00004 x2 : 0140000000000000
> [   76.658156] x1 : 0000000000000000 x0 : ffff800013c00000
> [   76.658158] Kernel panic - not syncing: Asynchronous SError Interrupt
> [   76.658159] CPU: 4 PID: 1752 Comm: systemd-udevd Not tainted
> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #16
> [   76.658160] Hardware name: Pine64 RockPro64 (DT)
> [   76.658161] Call trace:
> [   76.658161]  dump_backtrace+0x0/0x148
> [   76.658162]  show_stack+0x24/0x30
> [   76.658163]  dump_stack+0xa8/0xf0
> [   76.658163]  panic+0x160/0x36c
> [   76.658164]  nmi_panic+0x94/0x98
> [   76.658164]  arm64_serror_panic+0x84/0x90
> [   76.658165]  do_serror+0x88/0x140
> [   76.658166]  el1_error+0xc4/0x170
> [   76.658166]  nvkm_device_ctor+0x2cc/0x2e18 [nouveau]
> [   76.658167]  nvkm_device_pci_new+0xf8/0x2c8 [nouveau]
> [   76.658168]  nouveau_drm_probe+0x60/0x288 [nouveau]
> [   76.658168]  local_pci_probe+0x44/0x98
> [   76.658169]  pci_device_probe+0x118/0x1b0
> [   76.658170]  really_probe+0xe0/0x320
> [   76.658171]  driver_probe_device+0x5c/0xf0
> [   76.658171]  device_driver_attach+0x74/0x80
> [   76.658172]  __driver_attach+0x64/0xe0
> [   76.658173]  bus_for_each_dev+0x80/0xd0
> [   76.658173]  driver_attach+0x30/0x40
> [   76.658174]  bus_add_driver+0x14c/0x1f0
> [   76.658175]  driver_register+0x64/0x110
> [   76.658175]  __pci_register_driver+0x58/0x68
> [   76.658176]  nouveau_drm_init+0x17c/0x10000 [nouveau]
> [   76.658177]  do_one_initcall+0x50/0x298
> [   76.658178]  do_init_module+0x88/0x270
> [   76.658178]  load_module+0x2068/0x2600
> [   76.658179]  __do_sys_finit_module+0xd0/0xe8
> [   76.658180]  __arm64_sys_finit_module+0x28/0x38
> [   76.658180]  el0_svc_common.constprop.0+0x7c/0x1f0
> [   76.658181]  el0_svc_handler+0x34/0xa0
> [   76.658182]  el0_svc+0x14/0x40
> [   76.658182]  el0_sync_handler+0x118/0x280
> [   76.658183]  el0_sync+0x164/0x180
> [   76.658248] SMP: stopping secondary CPUs
> [   76.658249] Kernel Offset: disabled
> [   76.658250] CPU features: 0x10002,2000600c
> [   76.658250] Memory Limit: none

Last log, I promise.
This is what happens when it is locked out of devfn > 1, and nouveau
is not available.

[    4.199087] vcc3v3_pcie: supplied by vcc12v_dcin
[   55.617783] ehci-pci: EHCI PCI platform driver
[   55.868935] ohci-pci: OHCI PCI platform driver
[   65.426085] rockchip-pcie f8000000.pcie: missing legacy phy; search
for per-lane PHY
[   65.426645] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
[   65.426662] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
[   65.507012] rockchip-pcie f8000000.pcie: current link width is x4
[   65.507078] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
[   65.507088] rockchip-pcie f8000000.pcie: Parsing ranges property...
[   65.507103] rockchip-pcie f8000000.pcie:   MEM
0xfa000000..0xfbdfffff -> 0xfa000000
[   65.507112] rockchip-pcie f8000000.pcie:    IO
0xfbe00000..0xfbffffff -> 0xfbe00000
[   65.507368] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
[   65.507376] pci_bus 0000:00: root bus resource [bus 00-1f]
[   65.507382] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
[   65.507390] pci_bus 0000:00: root bus resource [io
0x0000-0x1fffff] (bus address [0xfbe00000-0xfbffffff])
[   65.507394] pci_bus 0000:00: scanning bus
[   65.507423] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
[   65.507543] pci 0000:00:00.0: supports D1
[   65.507547] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[   65.507555] pci 0000:00:00.0: PME# disabled
[   65.511436] pci_bus 0000:00: fixups for bus
[   65.511452] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
[   65.511457] pci 0000:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[   65.511471] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
[   65.511633] pci_bus 0000:01: scanning bus
[   65.511739] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
[   65.512205] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
[   65.512274] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
64bit pref]
[   65.512345] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
64bit pref]
[   65.512390] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
[   65.512394] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
[   65.512435] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
[   65.512485] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
[   65.513319] pci 0000:01:00.0: 8.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x4 link at 0000:00:00.0 (capable of 126.016 Gb/s
with 8 GT/s x16 link)
[   65.513568] pci 0000:01:00.0: vgaarb: VGA device added:
decodes=io+mem,owns=none,locks=none
[   65.513744] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
[   65.513900] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
[   65.514179] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
[   65.515129] rockchip-pcie f8000000.pcie: phy link changes
[   65.527011] pci_bus 0000:01: fixups for bus
[   65.527021] pci_bus 0000:01: bus scan returning with max=01
[   65.527030] pci_bus 0000:01: busn_res: [bus 01-1f] end is updated to 01
[   65.527042] pci_bus 0000:00: bus scan returning with max=01
[   65.527076] pci 0000:00:00.0: BAR 14: no space for [mem size 0x0c000000]
[   65.527081] pci 0000:00:00.0: BAR 14: failed to assign [mem size 0x0c000000]
[   65.527091] pci 0000:01:00.0: BAR 1: no space for [mem size
0x08000000 64bit pref]
[   65.527095] pci 0000:01:00.0: BAR 1: failed to assign [mem size
0x08000000 64bit pref]
[   65.527100] pci 0000:01:00.0: BAR 3: no space for [mem size
0x02000000 64bit pref]
[   65.527104] pci 0000:01:00.0: BAR 3: failed to assign [mem size
0x02000000 64bit pref]
[   65.527109] pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000]
[   65.527119] pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x01000000]
[   65.527123] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00080000 pref]
[   65.527129] pci 0000:01:00.0: BAR 6: failed to assign [mem size
0x00080000 pref]
[   65.527134] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000]
[   65.527138] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000]
[   65.527142] pci 0000:01:00.0: BAR 5: no space for [io  size 0x0080]
[   65.527146] pci 0000:01:00.0: BAR 5: failed to assign [io  size 0x0080]
[   65.527152] pci 0000:00:00.0: PCI bridge to [bus 01]
[   65.529410] pcieport 0000:00:00.0: assign IRQ: got 238
[   65.529454] pcieport 0000:00:00.0: enabling bus mastering
[   65.529807] pcieport 0000:00:00.0: PME: Signaling with IRQ 239
[   65.530265] pcieport 0000:00:00.0: AER: enabled with IRQ 239
[   65.530685] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0


>
>
> > > >
> > > > Another reason I believe the address assignments are overflowing and
> > > > corrupting other assignments: after the external abort, the
> > > > entire PCIe controller is inaccessible.
> > > > $ lspci
> > > > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.1/config
> > > > lspci: Unable to read the standard configuration space header of
> > > > device 0000:01:00.1
> > > > pcilib: Cannot open /sys/bus/pci/devices/0000:00:00.0/config
> > > > lspci: Unable to read the standard configuration space header of
> > > > device 0000:00:00.0
> > > > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.0/config
> > > > lspci: Unable to read the standard configuration space header of
> > > > device 0000:01:00.0
> > > >
> > > > Attempting to rescan the bus or any other pci function results in a
> > > > hung kernel task.
> > >
> > > You crashed deep in the kernel part-way through the driver's probe
> > > routine; don't expect any partly-initialised bits of that driver to
> > > actually work. I bet you also can't load/unload modules and do various
> > > other things either, because the thread will almost certainly have been
> > > holding a bunch of locks, too. It's not worth trying to reason about the
> > > state of anything after that kind of unexpected crash ;)
> >
> > That matches what I observed, but I think it's actually more destructive than that.
> > Attempting to do an access after the fact, then performing a soft
> > reset results in a spectacular crash when it tries to bring the BIG
> > cluster online.
> >
> > [    0.203160] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
> > [    5.355606] CPU4: failed to come online
> > [    5.363258] CPU4: failed in unknown state : 0x0
> > Unhandled Exception in EL3.
> > x30 =           0xUhandled Exception in L3.
> > x30 =           UNhandled ExcEption in EL3.
> > x30 =   0xUnhandled Excepti2n in EL_.
> > x30 =           0xUnhandled Exception in EL3.
> > x30 =           lxUnhadled Exception in EL3.
> > x30 =           0xUnhandled Excepion in EL3.
> > x30 =           0xUnhandled Exception in EL3.
> > x30 =   0xUnhandled Eception in EL3.
> > x30 =           0xUnhandled Exception in EL3.
> > x30 =           0xUnhandled Exception in EL3.x30 =
> > 0xUnhandled Exception in EL3.
> > x30 =           0xUnhandled Excepton i2 EL3.
> > x30 =           0xUnhandUnhandled Exceptionein EL3.lx30 =
> >  0xUnhandled Exception in EL3.
> > x30 =           0xUnhandled Unhanded Exception in EL3.
> > x30 =           0xUnhandled Exceptin in EL3Unhandled Exception inEL3.
> > x0 =            0xUnhandled Exception in EL3.
> > x30 =           0xUnhandled Exception in EL3.
> > x30 =           0xUnhandled Exception in EL.
> > x30 =           0xUnhandled Exception in EL3.
> > x30 =           0xUnhandled Exception in EL3.
> > x30 =           0xUnhandled xception in EL3
> > x30 =           0xUnhcndled Exception in EL3.
> > x30=            0xUnhandled Excextion in EL3.
> > x30 =   0xUnhandled Exception in EL3.
> > x30 =           0x[   10.484411] CPU5: failed to come online
> > [   10.484807] CPU5: failed in unknown state : 0x0
> > [   10.485374] smp: Brought up 1 node, 4 CPUs
> >
> > It goes on until it tries to do its first DMA setup, then just hangs.
> > >
> > > >> If the Root Port (00:00.0) supports AER, you could also dump out the
> > > >> status registers from the AER capability and figure out whether it
> > > >> logged a PCIe error.  This would be sort of like what
> > > >> aer_process_err_devices() does.  A bit of a hassle to do this by hand
> > > >> in the exception path, but could be enlightening, just as a debug
> > > >> tool.
> > > >
> > > > Is there a way to handle external synchronous aborts in a device driver?
> > >
> > > Not easily on arm64 - the closest thing we have is some arch-internal
> > > machinery for undefined instruction hooks. You'd have to hack something
> > > gruesome into do_sea()...
> > >
> > > For starters it might be worth enabling all the debug prints in
> > > rockchip-pcie to see if there's any error IRQ corresponding to the
> > > aborted access.
> >
> > Debug prints are already enabled, I had to hack in the print to show
> > what addresses it was crashing on.
> >
> > Your explanation was the last piece for me to understand how this all
> > works, thanks!
> > >
> > > Robin.
> > >
> > > > If so, I'll definitely look into plugging in the aer status functions.
> > > >
> > > >>
> > > >>> [3] https://twitter.com/domipheus/status/1167586160077627393
> > > >>>>
> > > >>>>> [  198.751581] Modules linked in: drm_panel_orientation_quirks
> > > >>>>> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> > > >>>>> crc_ccitt nf_defrag_ipv6
> > > >>>>> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> > > >>>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> > > >>>>> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> > > >>>>> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> > > >>>>> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > >>>>> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> > > >>>>> [  198.755773] sp : ffff8000132af530
> > > >>>>> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> > > >>>>> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> > > >>>>> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> > > >>>>> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> > > >>>>> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> > > >>>>> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> > > >>>>> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> > > >>>>> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> > > >>>>> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> > > >>>>> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> > > >>>>> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> > > >>>>> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> > > >>>>> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> > > >>>>> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> > > >>>>> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> > > >>>>> [  198.763348] Call trace:
> > > >>>>> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > >>>>> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> > > >>>>> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> > > >>>>> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> > > >>>>> [  198.765413]  pci_scan_single_device+0x84/0xf8
> > > >>>>> [  198.765812]  pci_scan_slot+0x7c/0x120
> > > >>>>> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> > > >>>>> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> > > >>>>> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> > > >>>>> [  198.767430]  pci_scan_child_bus+0x24/0x30
> > > >>>>> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> > > >>>>> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> > > >>>>> [  198.768770]  platform_drv_probe+0x58/0xa8
> > > >>>>> [  198.769139]  really_probe+0xe0/0x318
> > > >>>>> [  198.769468]  driver_probe_device+0x5c/0xf0
> > > >>>>> [  198.769844]  device_driver_attach+0x74/0x80
> > > >>>>> [  198.770227]  __driver_attach+0x64/0xe8
> > > >>>>> [  198.770572]  bus_for_each_dev+0x84/0xd8
> > > >>>>> [  198.770924]  driver_attach+0x30/0x40
> > > >>>>> [  198.771253]  bus_add_driver+0x188/0x1e8
> > > >>>>> [  198.771605]  driver_register+0x64/0x110
> > > >>>>> [  198.771956]  __platform_driver_register+0x54/0x60
> > > >>>>> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> > > >>>>> [  198.772998]  do_one_initcall+0x94/0x390
> > > >>>>> [  198.773353]  do_init_module+0x88/0x268
> > > >>>>> [  198.773697]  load_module+0x1e18/0x2198
> > > >>>>> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> > > >>>>> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> > > >>>>> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> > > >>>>> [  198.775297]  el0_svc_handler+0x34/0xa0
> > > >>>>> [  198.775645]  el0_svc+0x14/0x40
> > > >>>>> [  198.775928]  el0_sync_handler+0x118/0x290
> > > >>>>> [  198.776295]  el0_sync+0x164/0x180
> > > >>>>> [  198.776609] Code: bad PC value
> > > >>>>> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---
> > > >
> > > > _______________________________________________
> > > > Linux-rockchip mailing list
> > > > Linux-rockchip@lists.infradead.org
> > > > http://lists.infradead.org/mailman/listinfo/linux-rockchip
> > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-13  1:19               ` Peter Geis
@ 2019-11-21  0:36                 ` Peter Geis
  2019-11-21  2:03                   ` Shawn Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Geis @ 2019-11-21  0:36 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Bjorn Helgaas, linux-pci, Shawn Lin, Heiko Stuebner,
	open list:ARM/Rockchip SoC...

On Tue, Nov 12, 2019 at 8:19 PM Peter Geis <pgwipeout@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 8:06 PM Peter Geis <pgwipeout@gmail.com> wrote:
> >
> > On Tue, Nov 12, 2019 at 2:41 PM Peter Geis <pgwipeout@gmail.com> wrote:
> > >
> > > On Tue, Nov 12, 2019 at 2:15 PM Robin Murphy <robin.murphy@arm.com> wrote:
> > > >
> > > > On 12/11/2019 3:55 pm, Peter Geis wrote:
> > > > > On Mon, Nov 11, 2019 at 9:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >>
> > > > >> On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
> > > > >>> On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >>>> On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
> > > > >>>>> <snip>
> [   65.426662] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
> [   65.507012] rockchip-pcie f8000000.pcie: current link width is x4
> [   65.507078] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
> [   65.507088] rockchip-pcie f8000000.pcie: Parsing ranges property...
> [   65.507103] rockchip-pcie f8000000.pcie:   MEM
> 0xfa000000..0xfbdfffff -> 0xfa000000
> [   65.507112] rockchip-pcie f8000000.pcie:    IO
> 0xfbe00000..0xfbffffff -> 0xfbe00000
> [   65.507368] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
> [   65.507376] pci_bus 0000:00: root bus resource [bus 00-1f]
> [   65.507382] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
> [   65.507390] pci_bus 0000:00: root bus resource [io
> 0x0000-0x1fffff] (bus address [0xfbe00000-0xfbffffff])
> [   65.507394] pci_bus 0000:00: scanning bus
> [   65.507423] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
> [   65.507543] pci 0000:00:00.0: supports D1
> [   65.507547] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> [   65.507555] pci 0000:00:00.0: PME# disabled
> [   65.511436] pci_bus 0000:00: fixups for bus
> [   65.511452] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> [   65.511457] pci 0000:00:00.0: bridge configuration invalid ([bus
> 00-00]), reconfiguring
> [   65.511471] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> [   65.511633] pci_bus 0000:01: scanning bus
> [   65.511739] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
> [   65.512205] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> [   65.512274] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> 64bit pref]
> [   65.512345] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> 64bit pref]
> [   65.512390] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
> [   65.512394] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
> [   65.512435] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
> [   65.512485] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
> [   65.513319] pci 0000:01:00.0: 8.000 Gb/s available PCIe bandwidth,
> limited by 2.5 GT/s x4 link at 0000:00:00.0 (capable of 126.016 Gb/s
> with 8 GT/s x16 link)
> [   65.513568] pci 0000:01:00.0: vgaarb: VGA device added:
> decodes=io+mem,owns=none,locks=none
> [   65.513744] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
> [   65.513900] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> [   65.514179] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
> [   65.515129] rockchip-pcie f8000000.pcie: phy link changes
> [   65.527011] pci_bus 0000:01: fixups for bus
> [   65.527021] pci_bus 0000:01: bus scan returning with max=01
> [   65.527030] pci_bus 0000:01: busn_res: [bus 01-1f] end is updated to 01
> [   65.527042] pci_bus 0000:00: bus scan returning with max=01
> [   65.527076] pci 0000:00:00.0: BAR 14: no space for [mem size 0x0c000000]
> [   65.527081] pci 0000:00:00.0: BAR 14: failed to assign [mem size 0x0c000000]
> [   65.527091] pci 0000:01:00.0: BAR 1: no space for [mem size
> 0x08000000 64bit pref]
> [   65.527095] pci 0000:01:00.0: BAR 1: failed to assign [mem size
> 0x08000000 64bit pref]
> [   65.527100] pci 0000:01:00.0: BAR 3: no space for [mem size
> 0x02000000 64bit pref]
> [   65.527104] pci 0000:01:00.0: BAR 3: failed to assign [mem size
> 0x02000000 64bit pref]
> [   65.527109] pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000]
> [   65.527119] pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x01000000]
> [   65.527123] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00080000 pref]
> [   65.527129] pci 0000:01:00.0: BAR 6: failed to assign [mem size
> 0x00080000 pref]
> [   65.527134] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000]
> [   65.527138] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000]
> [   65.527142] pci 0000:01:00.0: BAR 5: no space for [io  size 0x0080]
> [   65.527146] pci 0000:01:00.0: BAR 5: failed to assign [io  size 0x0080]
> [   65.527152] pci 0000:00:00.0: PCI bridge to [bus 01]
> [   65.529410] pcieport 0000:00:00.0: assign IRQ: got 238
> [   65.529454] pcieport 0000:00:00.0: enabling bus mastering
> [   65.529807] pcieport 0000:00:00.0: PME: Signaling with IRQ 239
> [   65.530265] pcieport 0000:00:00.0: AER: enabled with IRQ 239
> [   65.530685] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0

An update on this issue: I purchased a PCIe switch and tried testing
with a few PCIe devices I have on hand.
Under no circumstances can I get it to read the switch; it aborts
immediately upon reading the first register.
Of note, it keeps reporting the following:
[   64.510788] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
[   64.510793] pci 0000:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[   64.510807] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
<snip>
[   64.527368] pci_bus 0000:01: fixups for bus
[   64.527399] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 0
[   64.527404] pci 0000:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[   64.527452] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 1
[   64.527850] pci_bus 0000:02: scanning bus
And aborts here.

Thoughts?

>
>
> >
> >
> > > > >
> > > > > Another reason I believe the address assignments are overflowing and
> > > > > corrupting other address assignments is that, after the external abort,
> > > > > the entire PCIe controller is inaccessible.
> > > > > $ lspci
> > > > > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.1/config
> > > > > lspci: Unable to read the standard configuration space header of
> > > > > device 0000:01:00.1
> > > > > pcilib: Cannot open /sys/bus/pci/devices/0000:00:00.0/config
> > > > > lspci: Unable to read the standard configuration space header of
> > > > > device 0000:00:00.0
> > > > > pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.0/config
> > > > > lspci: Unable to read the standard configuration space header of
> > > > > device 0000:01:00.0
> > > > >
> > > > > Attempting to rescan the bus or any other pci function results in a
> > > > > hung kernel task.
> > > >
> > > > You crashed deep in the kernel part-way through the driver's probe
> > > > routine; don't expect any partly-initialised bits of that driver to
> > > > actually work. I bet you also can't load/unload modules and do various
> > > > other things either, because the thread will almost certainly have been
> > > > holding a bunch of locks, too. It's not worth trying to reason about the
> > > > state of anything after that kind of unexpected crash ;)
> > >
> > > That is observed, but I think it is actually more destructive than that.
> > > Attempting to do an access after the fact, then performing a soft
> > > reset results in a spectacular crash when it tries to bring the BIG
> > > cluster online.
> > >
> > > [    0.203160] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
> > > [    5.355606] CPU4: failed to come online
> > > [    5.363258] CPU4: failed in unknown state : 0x0
> > > Unhandled Exception in EL3.
> > > x30 =           0xUhandled Exception in L3.
> > > x30 =           UNhandled ExcEption in EL3.
> > > x30 =   0xUnhandled Excepti2n in EL_.
> > > x30 =           0xUnhandled Exception in EL3.
> > > x30 =           lxUnhadled Exception in EL3.
> > > x30 =           0xUnhandled Excepion in EL3.
> > > x30 =           0xUnhandled Exception in EL3.
> > > x30 =   0xUnhandled Eception in EL3.
> > > x30 =           0xUnhandled Exception in EL3.
> > > x30 =           0xUnhandled Exception in EL3.x30 =
> > > 0xUnhandled Exception in EL3.
> > > x30 =           0xUnhandled Excepton i2 EL3.
> > > x30 =           0xUnhandUnhandled Exceptionein EL3.lx30 =
> > >  0xUnhandled Exception in EL3.
> > > x30 =           0xUnhandled Unhanded Exception in EL3.
> > > x30 =           0xUnhandled Exceptin in EL3Unhandled Exception inEL3.
> > > x0 =            0xUnhandled Exception in EL3.
> > > x30 =           0xUnhandled Exception in EL3.
> > > x30 =           0xUnhandled Exception in EL.
> > > x30 =           0xUnhandled Exception in EL3.
> > > x30 =           0xUnhandled Exception in EL3.
> > > x30 =           0xUnhandled xception in EL3
> > > x30 =           0xUnhcndled Exception in EL3.
> > > x30=            0xUnhandled Excextion in EL3.
> > > x30 =   0xUnhandled Exception in EL3.
> > > x30 =           0x[   10.484411] CPU5: failed to come online
> > > [   10.484807] CPU5: failed in unknown state : 0x0
> > > [   10.485374] smp: Brought up 1 node, 4 CPUs
> > >
> > > It goes on until it tries to do its first DMA setup, then just hangs.
> > > >
> > > > >> If the Root Port (00:00.0) supports AER, you could also dump out the
> > > > >> status registers from the AER capability and figure out whether it
> > > > >> logged a PCIe error.  This would be sort of like what
> > > > >> aer_process_err_devices() does.  A bit of a hassle to do this by hand
> > > > >> in the exception path, but could be enlightening, just as a debug
> > > > >> tool.
> > > > >
> > > > > Is there a way to handle external synchronous aborts in a device driver?
> > > >
> > > > Not easily on arm64 - the closest thing we have is some arch-internal
> > > > machinery for undefined instruction hooks. You'd have to hack something
> > > > gruesome into do_sea()...
> > > >
> > > > For starters it might be worth enabling all the debug prints in
> > > > rockchip-pcie to see if there's any error IRQ corresponding to the
> > > > aborted access.
> > >
> > > Debug prints are already enabled, I had to hack in the print to show
> > > what addresses it was crashing on.
> > >
> > > Your explanation was the last piece for me to understand how this all
> > > works, thanks!
> > > >
> > > > Robin.
> > > >
> > > > > If so, I'll definitely look into plugging in the aer status functions.
> > > > >
> > > > >>
> > > > >>> [3] https://twitter.com/domipheus/status/1167586160077627393
> > > > >>>>
> > > > >>>>> [  198.751581] Modules linked in: drm_panel_orientation_quirks
> > > > >>>>> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
> > > > >>>>> crc_ccitt nf_defrag_ipv6
> > > > >>>>> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
> > > > >>>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
> > > > >>>>> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
> > > > >>>>> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
> > > > >>>>> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > > >>>>> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
> > > > >>>>> [  198.755773] sp : ffff8000132af530
> > > > >>>>> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
> > > > >>>>> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
> > > > >>>>> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
> > > > >>>>> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
> > > > >>>>> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
> > > > >>>>> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
> > > > >>>>> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
> > > > >>>>> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
> > > > >>>>> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
> > > > >>>>> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
> > > > >>>>> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
> > > > >>>>> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
> > > > >>>>> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
> > > > >>>>> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
> > > > >>>>> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
> > > > >>>>> [  198.763348] Call trace:
> > > > >>>>> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
> > > > >>>>> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
> > > > >>>>> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
> > > > >>>>> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
> > > > >>>>> [  198.765413]  pci_scan_single_device+0x84/0xf8
> > > > >>>>> [  198.765812]  pci_scan_slot+0x7c/0x120
> > > > >>>>> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
> > > > >>>>> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
> > > > >>>>> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
> > > > >>>>> [  198.767430]  pci_scan_child_bus+0x24/0x30
> > > > >>>>> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
> > > > >>>>> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
> > > > >>>>> [  198.768770]  platform_drv_probe+0x58/0xa8
> > > > >>>>> [  198.769139]  really_probe+0xe0/0x318
> > > > >>>>> [  198.769468]  driver_probe_device+0x5c/0xf0
> > > > >>>>> [  198.769844]  device_driver_attach+0x74/0x80
> > > > >>>>> [  198.770227]  __driver_attach+0x64/0xe8
> > > > >>>>> [  198.770572]  bus_for_each_dev+0x84/0xd8
> > > > >>>>> [  198.770924]  driver_attach+0x30/0x40
> > > > >>>>> [  198.771253]  bus_add_driver+0x188/0x1e8
> > > > >>>>> [  198.771605]  driver_register+0x64/0x110
> > > > >>>>> [  198.771956]  __platform_driver_register+0x54/0x60
> > > > >>>>> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
> > > > >>>>> [  198.772998]  do_one_initcall+0x94/0x390
> > > > >>>>> [  198.773353]  do_init_module+0x88/0x268
> > > > >>>>> [  198.773697]  load_module+0x1e18/0x2198
> > > > >>>>> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
> > > > >>>>> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
> > > > >>>>> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
> > > > >>>>> [  198.775297]  el0_svc_handler+0x34/0xa0
> > > > >>>>> [  198.775645]  el0_svc+0x14/0x40
> > > > >>>>> [  198.775928]  el0_sync_handler+0x118/0x290
> > > > >>>>> [  198.776295]  el0_sync+0x164/0x180
> > > > >>>>> [  198.776609] Code: bad PC value
> > > > >>>>> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---
> > > > >
> > > > > _______________________________________________
> > > > > Linux-rockchip mailing list
> > > > > Linux-rockchip@lists.infradead.org
> > > > > http://lists.infradead.org/mailman/listinfo/linux-rockchip
> > > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-21  0:36                 ` Peter Geis
@ 2019-11-21  2:03                   ` Shawn Lin
  2019-11-22  1:02                     ` Peter Geis
  0 siblings, 1 reply; 17+ messages in thread
From: Shawn Lin @ 2019-11-21  2:03 UTC (permalink / raw)
  To: Peter Geis, Robin Murphy
  Cc: shawn.lin, Bjorn Helgaas, linux-pci, Heiko Stuebner,
	open list:ARM/Rockchip SoC...

[-- Attachment #1: Type: text/plain, Size: 27337 bytes --]


On 2019/11/21 8:36, Peter Geis wrote:
> On Tue, Nov 12, 2019 at 8:19 PM Peter Geis <pgwipeout@gmail.com> wrote:
>>
>> On Tue, Nov 12, 2019 at 8:06 PM Peter Geis <pgwipeout@gmail.com> wrote:
>>>
>>> On Tue, Nov 12, 2019 at 2:41 PM Peter Geis <pgwipeout@gmail.com> wrote:
>>>>
>>>> On Tue, Nov 12, 2019 at 2:15 PM Robin Murphy <robin.murphy@arm.com> wrote:
>>>>>
>>>>> On 12/11/2019 3:55 pm, Peter Geis wrote:
>>>>>> On Mon, Nov 11, 2019 at 9:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>>>>
>>>>>>> On Mon, Nov 11, 2019 at 07:30:15PM -0500, Peter Geis wrote:
>>>>>>>> On Mon, Nov 11, 2019 at 7:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>>>>>> On Mon, Nov 04, 2019 at 01:55:40PM -0500, Peter Geis wrote:
>>>>>>>>>> Good Morning,
>>>>>>>>>>
>>>>>>>>>> I'm attempting to debug an issue with the rockpro64 pcie port.
>>>>>>>>>> It would appear that the port does not like various cards, including
>>>>>>>>>> cards of the same make that randomly work or do not work, such as
>>>>>>>>>> Intel i340 based NICs.
>>>>>>>>>> I'm experiencing it with a GTX645 gpu.
>>>>>>>>>>
>>>>>>>>>> This seems to be a long running issue, referenced both at [0], and [1].
>>>>>>>>>> There was an attempt to rectify it, by adding a delay between training
>>>>>>>>>> and probing [2], but that doesn't seem to be the issue here.
>>>>>>>>>> It appears that when we probe further into the card, such as devfn >
>>>>>>>>>> 1, we trigger the bug.
>>>>>>>>>> I've added a print statement that prints the devfn, address, and size
>>>>>>>>>> information, which you can see below.
>>>>>>>>>>
>>>>>>>>>> I've attempted setting the available number of lanes to 1 as well, to
>>>>>>>>>> no difference.
>>>>>>>>>>
>>>>>>>>>> If anyone could point me in the right direction as to where to
>>>>>>>>>> continue debugging, I'd greatly appreciate it.
>>>>>>>>>>
>>>>>>>>>> [0] https://github.com/ayufan-rock64/linux-build/issues/254
>>>>>>>>>> [1] https://github.com/rockchip-linux/kernel/issues/116
>>>>>>>>>> [2] https://github.com/ayufan-rock64/linux-kernel/commit/3cde5c624c9c39aa03251a55c2d26a48b5bdca5b
>>>>>>>>>>
>>>>>>>>>> [  198.491458] rockchip-pcie f8000000.pcie: missing legacy phy; search
>>>>>>>>>> for per-lane PHY
>>>>>>>>>> [  198.492986] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
>>>>>>>>>> [  198.493060] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
>>>>>>>>>> [  198.550444] rockchip-pcie f8000000.pcie: current link width is x1
>>>>>>>>>> [  198.550458] rockchip-pcie f8000000.pcie: idling lane 1
>>>>>>>>>> [  198.550479] rockchip-pcie f8000000.pcie: idling lane 2
>>>>>>>>>> [  198.550490] rockchip-pcie f8000000.pcie: idling lane 3
>>>>>>>>>> [  198.550608] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
>>>>>>>>>> [  198.550625] rockchip-pcie f8000000.pcie: Parsing ranges property...
>>>>>>>>>> [  198.550656] rockchip-pcie f8000000.pcie:   MEM
>>>>>>>>>> 0xfa000000..0xfbdfffff -> 0xfa000000
>>>>>>>>>> [  198.550676] rockchip-pcie f8000000.pcie:    IO
>>>>>>>>>> 0xfbe00000..0xfbefffff -> 0xfbe00000
>>>>>>>>>> [  198.552908] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
>>>>>>>>>> [  198.552933] pci_bus 0000:00: root bus resource [bus 00-1f]
>>>>>>>>>> [  198.552943] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
>>>>>>>>>> [  198.552954] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
>>>>>>>>>> (bus address [0xfbe00000-0xfbefffff])
>>>>>>>>>> [  198.552965] pci_bus 0000:00: scanning bus
>>>>>>>>>> [  198.554198] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
>>>>>>>>>> [  198.555508] pci 0000:00:00.0: supports D1
>>>>>>>>>> [  198.555516] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
>>>>>>>>>> [  198.556023] pci 0000:00:00.0: PME# disabled
>>>>>>>>>> [  198.561245] pci_bus 0000:00: fixups for bus
>>>>>>>>>> [  198.561269] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
>>>>>>>>>> [  198.561277] pci 0000:00:00.0: bridge configuration invalid ([bus
>>>>>>>>>> 00-00]), reconfiguring
>>>>>>>>>> [  198.566429] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
>>>>>>>>>> [  198.567008] pci_bus 0000:01: scanning bus
>>>>>>>>>> [  198.567171] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
>>>>>>>>>> [  198.567420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
>>>>>>>>>> [  198.567515] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
>>>>>>>>>> 64bit pref]
>>>>>>>>>> [  198.567608] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
>>>>>>>>>> 64bit pref]
>>>>>>>>>> [  198.567665] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
>>>>>>>>>> [  198.567673] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
>>>>>>>>>> [  198.567730] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
>>>>>>>>>> [  198.567815] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
>>>>>>>>>> [  198.569051] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
>>>>>>>>>> limited by 2.5 GT/s x1 link at 0000:00:00.0 (capable of 126.016 Gb/s
>>>>>>>>>> with 8 GT/s x16 link)
>>>>>>>>>> [  198.570225] pci 0000:01:00.0: vgaarb: VGA device added:
>>>>>>>>>> decodes=io+mem,owns=none,locks=none
>>>>>>>>>> [  198.570481] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
>>>>>>>>>> [  198.570663] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
>>>>>>>>>> [  198.571039] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
>>>>>>>>>> <snip>
>>>>>>>>>> [  198.749857] pci_bus 0000:01: read pcie, devfn 1, at 100, size 2
>>>>>>>>>> [  198.750252] pci_bus 0000:01: read pcie, devfn 2, at 0, size 4
>>>>>>>>>> [  198.750881] Internal error: synchronous external abort: 96000210
>>>>>>>>>> [#1] PREEMPT SMP
>>>>>>>>>
>>>>>>>>> Is there really supposed to be a device at 01:00.2?
>>>>>>>>>
>>>>>>>>> Maybe this is just the PCIe Unsupported Request error that we expect
>>>>>>>>> to get when trying to read config space of a device that doesn't
>>>>>>>>> exist.
>>>>>>>>>
>>>>>>>>> On "most" platforms, we just get ~0 data back when that happens, but
>>>>>>>>> I'm not sure that's always the case on arm64.  I think it depends on
>>>>>>>>> how the PCIe host bridge is designed, and there might be some CPU
>>>>>>>>> configuration, too.
>>>>>>>>
>>>>>>>> Yes, this is a GTX645 video card.
>>>>>>>> Nvidia cards usually have two to three devices,
>>>>>>>> The GPU proper, the audio device for the hdmi output, and the i2c controller.
>>>>>>>>
>>>>>>>> I do think that this driver is missing sanity checking on the
>>>>>>>> addressing, since the BRCM driver for the RPI4 doesn't try to
>>>>>>>> enumerate a video card, since it checks if the MMIO space is large
>>>>>>>> enough to fit the BAR before assigning the addresses. See [3]. Also in
>>>>>>>> that thread he was able to increase the address space provided to the
>>>>>>>> BRCM driver and fix the issue, but I don't see how we could do that on
>>>>>>>> the rk3399.
>>>>>>>>
>>>>>>>> pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref] is
>>>>>>>> 128 MB, which already exceeds our address space.
>>>>>>>> I think the driver is just overflowing the address space.
>>>>>>>
>>>>>>> If we don't have enough space to assign all the device BARs, I think a
>>>>>>> driver will still be able to claim the device, but when the driver
>>>>>>> calls pci_enable_device(), it should fail.  Lack of address space
>>>>>>> should not cause a PCIe error.
>>>>>>>
>>>>>>> But in this case, none of that matters because we're still enumerating
>>>>>>> devices in pci_scan_root_bus_bridge().  We haven't gotten to the point
>>>>>>> of trying to bind drivers to devices, so the driver isn't involved at
>>>>>>> all yet.
>>>>>> For clarification, the driver I'm referring to is the rk3399-pcie host driver.
>>>>>>>
>>>>>>> The backtrace says we're trying to read the Vendor ID of a device, and
>>>>>>> your debug output suggests we're trying to enumerate 01:00.2.  If you
>>>>>>> put that card in another system, you could find out how many functions
>>>>>>> it has.
>>>>>>>
>>>>>>> Or if you swapped this with other cards where you know the number of
>>>>>>> functions, you could see if the external abort always happens when
>>>>>>> probing for the first unimplemented function.
>>>>>> This card definitely has more than one function.
>>>>>
>>>>> FWIW, random lspci logs I grovelled up off the internet show cards with
>>>>> the same IDs only implementing functions 0 and 1, which does suggest
>>>>> that maybe function 2 really doesn't exist but the card handles
>>>>> unsupported config requests in a way that this particular bridge/root
>>>>> port doesn't quite deal with properly.
>>>>
>>>> Thanks! Is there a sane way we could make the kernel handle this in
>>>> place of the controller?
>>>> (It's apparently based on the designware ip, but it doesn't use their
>>>> driver for some reason)
>>>>
>>>>>
>>>>>> Before my original message I hacked in some code to make the driver
>>>>>> return 0xff when devfn > 1, and the scan passed, but as soon as
>>>>>> nouveau attempted to access the device, the entire kernel exploded.
>>>
>>> This is what happens when probing is prevented from reading above
>>> devfn 1, and nouveau gets a shot.
>>> [   76.658128] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>> [   76.658129] CPU: 4 PID: 1752 Comm: systemd-udevd Not tainted
>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #16
>>> [   76.658130] Hardware name: Pine64 RockPro64 (DT)
>>> [   76.658131] pstate: 60400005 (nZCv daif +PAN -UAO)
>>> [   76.658131] pc : nvkm_device_ctor+0x2cc/0x2e18 [nouveau]
>>> [   76.658132] lr : nvkm_device_ctor+0x2c0/0x2e18 [nouveau]
>>> [   76.658133] sp : ffff800012daf740
>>> [   76.658134] x29: ffff800012daf740 x28: ffff0000c2db9800
>>> [   76.658136] x27: 0000000000000000 x26: ffff0000c7f2e0b0
>>> [   76.658138] x25: 0000000000000000 x24: 0000000000000000
>>> [   76.658139] x23: 0000000000000000 x22: ffff800009e1b1e0
>>> [   76.658141] x21: ffff0000c335b940 x20: ffff800009ede660
>>> [   76.658142] x19: 0000000000000000 x18: 0000000000000001
>>> [   76.658144] x17: 0000000000000000 x16: 0000000000000000
>>> [   76.658145] x15: ffff0000c335bdc0 x14: ffff8000140a0000
>>> [   76.658147] x13: 0000000000110000 x12: 0040000000000001
>>> [   76.658148] x11: ffff0000c12b8180 x10: 0000000000000085
>>> [   76.658150] x9 : ffff800013c00000 x8 : 0000000000110000
>>> [   76.658152] x7 : 0000000000000000 x6 : ffff0000f7ff0008
>>> [   76.658153] x5 : ffff800013d0ffff x4 : ffff800013c00000
>>> [   76.658154] x3 : ffff800013c00004 x2 : 0140000000000000
>>> [   76.658156] x1 : 0000000000000000 x0 : ffff800013c00000
>>> [   76.658158] Kernel panic - not syncing: Asynchronous SError Interrupt
>>> [   76.658159] CPU: 4 PID: 1752 Comm: systemd-udevd Not tainted
>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #16
>>> [   76.658160] Hardware name: Pine64 RockPro64 (DT)
>>> [   76.658161] Call trace:
>>> [   76.658161]  dump_backtrace+0x0/0x148
>>> [   76.658162]  show_stack+0x24/0x30
>>> [   76.658163]  dump_stack+0xa8/0xf0
>>> [   76.658163]  panic+0x160/0x36c
>>> [   76.658164]  nmi_panic+0x94/0x98
>>> [   76.658164]  arm64_serror_panic+0x84/0x90
>>> [   76.658165]  do_serror+0x88/0x140
>>> [   76.658166]  el1_error+0xc4/0x170
>>> [   76.658166]  nvkm_device_ctor+0x2cc/0x2e18 [nouveau]
>>> [   76.658167]  nvkm_device_pci_new+0xf8/0x2c8 [nouveau]
>>> [   76.658168]  nouveau_drm_probe+0x60/0x288 [nouveau]
>>> [   76.658168]  local_pci_probe+0x44/0x98
>>> [   76.658169]  pci_device_probe+0x118/0x1b0
>>> [   76.658170]  really_probe+0xe0/0x320
>>> [   76.658171]  driver_probe_device+0x5c/0xf0
>>> [   76.658171]  device_driver_attach+0x74/0x80
>>> [   76.658172]  __driver_attach+0x64/0xe0
>>> [   76.658173]  bus_for_each_dev+0x80/0xd0
>>> [   76.658173]  driver_attach+0x30/0x40
>>> [   76.658174]  bus_add_driver+0x14c/0x1f0
>>> [   76.658175]  driver_register+0x64/0x110
>>> [   76.658175]  __pci_register_driver+0x58/0x68
>>> [   76.658176]  nouveau_drm_init+0x17c/0x10000 [nouveau]
>>> [   76.658177]  do_one_initcall+0x50/0x298
>>> [   76.658178]  do_init_module+0x88/0x270
>>> [   76.658178]  load_module+0x2068/0x2600
>>> [   76.658179]  __do_sys_finit_module+0xd0/0xe8
>>> [   76.658180]  __arm64_sys_finit_module+0x28/0x38
>>> [   76.658180]  el0_svc_common.constprop.0+0x7c/0x1f0
>>> [   76.658181]  el0_svc_handler+0x34/0xa0
>>> [   76.658182]  el0_svc+0x14/0x40
>>> [   76.658182]  el0_sync_handler+0x118/0x280
>>> [   76.658183]  el0_sync+0x164/0x180
>>> [   76.658248] SMP: stopping secondary CPUs
>>> [   76.658249] Kernel Offset: disabled
>>> [   76.658250] CPU features: 0x10002,2000600c
>>> [   76.658250] Memory Limit: none
>>
>> Last log, I promise.
>> This is what happens when it is locked out of devfn > 1, and nouveau
>> is not available.
>>
>> [    4.199087] vcc3v3_pcie: supplied by vcc12v_dcin
>> [   55.617783] ehci-pci: EHCI PCI platform driver
>> [   55.868935] ohci-pci: OHCI PCI platform driver
>> [   65.426085] rockchip-pcie f8000000.pcie: missing legacy phy; search
>> for per-lane PHY
>> [   65.426645] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
>> [   65.426662] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
>> [   65.507012] rockchip-pcie f8000000.pcie: current link width is x4
>> [   65.507078] rockchip-pcie f8000000.pcie: host bridge /pcie@f8000000 ranges:
>> [   65.507088] rockchip-pcie f8000000.pcie: Parsing ranges property...
>> [   65.507103] rockchip-pcie f8000000.pcie:   MEM
>> 0xfa000000..0xfbdfffff -> 0xfa000000
>> [   65.507112] rockchip-pcie f8000000.pcie:    IO
>> 0xfbe00000..0xfbffffff -> 0xfbe00000
>> [   65.507368] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
>> [   65.507376] pci_bus 0000:00: root bus resource [bus 00-1f]
>> [   65.507382] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
>> [   65.507390] pci_bus 0000:00: root bus resource [io
>> 0x0000-0x1fffff] (bus address [0xfbe00000-0xfbffffff])
>> [   65.507394] pci_bus 0000:00: scanning bus
>> [   65.507423] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
>> [   65.507543] pci 0000:00:00.0: supports D1
>> [   65.507547] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
>> [   65.507555] pci 0000:00:00.0: PME# disabled
>> [   65.511436] pci_bus 0000:00: fixups for bus
>> [   65.511452] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
>> [   65.511457] pci 0000:00:00.0: bridge configuration invalid ([bus
>> 00-00]), reconfiguring
>> [   65.511471] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
>> [   65.511633] pci_bus 0000:01: scanning bus
>> [   65.511739] pci 0000:01:00.0: [10de:11c4] type 00 class 0x030000
>> [   65.512205] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
>> [   65.512274] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
>> 64bit pref]
>> [   65.512345] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
>> 64bit pref]
>> [   65.512390] pci 0000:01:00.0: reg 0x24: initial BAR value 0x00000000 invalid
>> [   65.512394] pci 0000:01:00.0: reg 0x24: [io  size 0x0080]
>> [   65.512435] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
>> [   65.512485] pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
>> [   65.513319] pci 0000:01:00.0: 8.000 Gb/s available PCIe bandwidth,
>> limited by 2.5 GT/s x4 link at 0000:00:00.0 (capable of 126.016 Gb/s
>> with 8 GT/s x16 link)
>> [   65.513568] pci 0000:01:00.0: vgaarb: VGA device added:
>> decodes=io+mem,owns=none,locks=none
>> [   65.513744] pci 0000:01:00.1: [10de:0e0b] type 00 class 0x040300
>> [   65.513900] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
>> [   65.514179] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
>> [   65.515129] rockchip-pcie f8000000.pcie: phy link changes
>> [   65.527011] pci_bus 0000:01: fixups for bus
>> [   65.527021] pci_bus 0000:01: bus scan returning with max=01
>> [   65.527030] pci_bus 0000:01: busn_res: [bus 01-1f] end is updated to 01
>> [   65.527042] pci_bus 0000:00: bus scan returning with max=01
>> [   65.527076] pci 0000:00:00.0: BAR 14: no space for [mem size 0x0c000000]
>> [   65.527081] pci 0000:00:00.0: BAR 14: failed to assign [mem size 0x0c000000]
>> [   65.527091] pci 0000:01:00.0: BAR 1: no space for [mem size
>> 0x08000000 64bit pref]
>> [   65.527095] pci 0000:01:00.0: BAR 1: failed to assign [mem size
>> 0x08000000 64bit pref]
>> [   65.527100] pci 0000:01:00.0: BAR 3: no space for [mem size
>> 0x02000000 64bit pref]
>> [   65.527104] pci 0000:01:00.0: BAR 3: failed to assign [mem size
>> 0x02000000 64bit pref]
>> [   65.527109] pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000]
>> [   65.527119] pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x01000000]
>> [   65.527123] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00080000 pref]
>> [   65.527129] pci 0000:01:00.0: BAR 6: failed to assign [mem size
>> 0x00080000 pref]
>> [   65.527134] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000]
>> [   65.527138] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000]
>> [   65.527142] pci 0000:01:00.0: BAR 5: no space for [io  size 0x0080]
>> [   65.527146] pci 0000:01:00.0: BAR 5: failed to assign [io  size 0x0080]
>> [   65.527152] pci 0000:00:00.0: PCI bridge to [bus 01]
>> [   65.529410] pcieport 0000:00:00.0: assign IRQ: got 238
>> [   65.529454] pcieport 0000:00:00.0: enabling bus mastering
>> [   65.529807] pcieport 0000:00:00.0: PME: Signaling with IRQ 239
>> [   65.530265] pcieport 0000:00:00.0: AER: enabled with IRQ 239
>> [   65.530685] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
> 
> An update on this issue: I purchased a PCIe switch and tried testing
> with a few PCIe devices I have around.
> Under no circumstances can I get it to read the switch; it aborts
> immediately upon reading the first register.
> Of note, it keeps reporting the following:
> [   64.510788] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
> [   64.510793] pci 0000:00:00.0: bridge configuration invalid ([bus
> 00-00]), reconfiguring
> [   64.510807] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
> <snip>
> [   64.527368] pci_bus 0000:01: fixups for bus
> [   64.527399] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 0
> [   64.527404] pci 0000:01:00.0: bridge configuration invalid ([bus
> 00-00]), reconfiguring
> [   64.527452] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 1
> [   64.527850] pci_bus 0000:02: scanning bus
> And aborts here.
> 
> Thoughts?

Sorry, I don't have much time to follow up with upstream development.
I've attached a patch based on linux-next for debugging, but I don't
know if it will work as expected since I can't test it right now. At
least it shows what the problems are.

> 
>>
>>
>>>
>>>
>>>>>>
>>>>>> Another reason I believe the address assignments are overflowing and
>>>>>> corrupting other assignments is that, after the external abort, the
>>>>>> entire PCIe controller is inaccessible.
>>>>>> $ lspci
>>>>>> pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.1/config
>>>>>> lspci: Unable to read the standard configuration space header of
>>>>>> device 0000:01:00.1
>>>>>> pcilib: Cannot open /sys/bus/pci/devices/0000:00:00.0/config
>>>>>> lspci: Unable to read the standard configuration space header of
>>>>>> device 0000:00:00.0
>>>>>> pcilib: Cannot open /sys/bus/pci/devices/0000:01:00.0/config
>>>>>> lspci: Unable to read the standard configuration space header of
>>>>>> device 0000:01:00.0
>>>>>>
>>>>>> Attempting to rescan the bus or any other pci function results in a
>>>>>> hung kernel task.
>>>>>
>>>>> You crashed deep in the kernel part-way through the driver's probe
>>>>> routine; don't expect any partly-initialised bits of that driver to
>>>>> actually work. I bet you also can't load/unload modules and do various
>>>>> other things either, because the thread will almost certainly have been
>>>>> holding a bunch of locks, too. It's not worth trying to reason about the
>>>>> state of anything after that kind of unexpected crash ;)
>>>>
>>>> So I have observed, but I think it is actually more destructive than that.
>>>> Attempting an access after the fact and then performing a soft
>>>> reset results in a spectacular crash when it tries to bring the BIG
>>>> cluster online.
>>>>
>>>> [    0.203160] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
>>>> [    5.355606] CPU4: failed to come online
>>>> [    5.363258] CPU4: failed in unknown state : 0x0
>>>> Unhandled Exception in EL3.
>>>> x30 =           0xUhandled Exception in L3.
>>>> x30 =           UNhandled ExcEption in EL3.
>>>> x30 =   0xUnhandled Excepti2n in EL_.
>>>> x30 =           0xUnhandled Exception in EL3.
>>>> x30 =           lxUnhadled Exception in EL3.
>>>> x30 =           0xUnhandled Excepion in EL3.
>>>> x30 =           0xUnhandled Exception in EL3.
>>>> x30 =   0xUnhandled Eception in EL3.
>>>> x30 =           0xUnhandled Exception in EL3.
>>>> x30 =           0xUnhandled Exception in EL3.x30 =
>>>> 0xUnhandled Exception in EL3.
>>>> x30 =           0xUnhandled Excepton i2 EL3.
>>>> x30 =           0xUnhandUnhandled Exceptionein EL3.lx30 =
>>>>   0xUnhandled Exception in EL3.
>>>> x30 =           0xUnhandled Unhanded Exception in EL3.
>>>> x30 =           0xUnhandled Exceptin in EL3Unhandled Exception inEL3.
>>>> x0 =            0xUnhandled Exception in EL3.
>>>> x30 =           0xUnhandled Exception in EL3.
>>>> x30 =           0xUnhandled Exception in EL.
>>>> x30 =           0xUnhandled Exception in EL3.
>>>> x30 =           0xUnhandled Exception in EL3.
>>>> x30 =           0xUnhandled xception in EL3
>>>> x30 =           0xUnhcndled Exception in EL3.
>>>> x30=            0xUnhandled Excextion in EL3.
>>>> x30 =   0xUnhandled Exception in EL3.
>>>> x30 =           0x[   10.484411] CPU5: failed to come online
>>>> [   10.484807] CPU5: failed in unknown state : 0x0
>>>> [   10.485374] smp: Brought up 1 node, 4 CPUs
>>>>
>>>> It goes on until it tries to do its first DMA setup, then it just hangs.
>>>>>
>>>>>>> If the Root Port (00:00.0) supports AER, you could also dump out the
>>>>>>> status registers from the AER capability and figure out whether it
>>>>>>> logged a PCIe error.  This would be sort of like what
>>>>>>> aer_process_err_devices() does.  A bit of a hassle to do this by hand
>>>>>>> in the exception path, but could be enlightening, just as a debug
>>>>>>> tool.
>>>>>>
>>>>>> Is there a way to handle external synchronous aborts in a device driver?
>>>>>
>>>>> Not easily on arm64 - the closest thing we have is some arch-internal
>>>>> machinery for undefined instruction hooks. You'd have to hack something
>>>>> gruesome into do_sea()...
>>>>>
>>>>> For starters it might be worth enabling all the debug prints in
>>>>> rockchip-pcie to see if there's any error IRQ corresponding to the
>>>>> aborted access.
>>>>
>>>> Debug prints are already enabled; I had to hack in the print to show
>>>> which addresses it was crashing on.
>>>>
>>>> Your explanation was the last piece for me to understand how this all
>>>> works, thanks!
>>>>>
>>>>> Robin.
>>>>>
>>>>>> If so, I'll definitely look into plugging in the aer status functions.
>>>>>>
>>>>>>>
>>>>>>>> [3] https://twitter.com/domipheus/status/1167586160077627393
>>>>>>>>>
>>>>>>>>>> [  198.751581] Modules linked in: drm_panel_orientation_quirks
>>>>>>>>>> pcie_rockchip_host(+) cpufreq_dt sch_fq_codel ip_tables x_tables ipv6
>>>>>>>>>> crc_ccitt nf_defrag_ipv6
>>>>>>>>>> [  198.752861] CPU: 1 PID: 1686 Comm: systemd-udevd Not tainted
>>>>>>>>>> 5.4.0-rc5-next-20191031-00001-gddbfb17ac1c4-dirty #5
>>>>>>>>>> [  198.753791] Hardware name: Pine64 RockPro64 (DT)
>>>>>>>>>> [  198.754215] pstate: 60400085 (nZCv daIf +PAN -UAO)
>>>>>>>>>> [  198.754672] pc : __raw_readl+0x0/0x8 [pcie_rockchip_host]
>>>>>>>>>> [  198.755172] lr : rockchip_pcie_rd_conf+0x140/0x1dc [pcie_rockchip_host]
>>>>>>>>>> [  198.755773] sp : ffff8000132af530
>>>>>>>>>> [  198.756079] x29: ffff8000132af530 x28: 0000000000000000
>>>>>>>>>> [  198.756565] x27: 0000000000000001 x26: 0000000000000000
>>>>>>>>>> [  198.757049] x25: ffff0000c20ac000 x24: 0000000000002000
>>>>>>>>>> [  198.757534] x23: ffff0000c20ae5c0 x22: ffff8000132af5d4
>>>>>>>>>> [  198.758018] x21: 0000000000002000 x20: 0000000000000004
>>>>>>>>>> [  198.758502] x19: 0000000000102000 x18: 0000000000000001
>>>>>>>>>> [  198.758987] x17: 0000000000000000 x16: 0000000000000000
>>>>>>>>>> [  198.759472] x15: ffffffffffffffff x14: ffff80001159bcc8
>>>>>>>>>> [  198.759957] x13: 0000000000000000 x12: ffff800011b2c000
>>>>>>>>>> [  198.760441] x11: ffff8000115bf000 x10: ffff800011310018
>>>>>>>>>> [  198.760926] x9 : 00000000fffb9fff x8 : 0000000000000001
>>>>>>>>>> [  198.761410] x7 : 0000000000000000 x6 : ffff0000f7492548
>>>>>>>>>> [  198.761894] x5 : 0000000000000001 x4 : ffff0000f7492548
>>>>>>>>>> [  198.762379] x3 : 0000000000000000 x2 : 0000000000c00008
>>>>>>>>>> [  198.762863] x1 : ffff80001dc00008 x0 : ffff80001a102000
>>>>>>>>>> [  198.763348] Call trace:
>>>>>>>>>> [  198.763583]  __raw_readl+0x0/0x8 [pcie_rockchip_host]
>>>>>>>>>> [  198.764057]  pci_bus_read_config_dword+0x88/0xd0
>>>>>>>>>> [  198.764484]  pci_bus_generic_read_dev_vendor_id+0x40/0x1b8
>>>>>>>>>> [  198.764982]  pci_bus_read_dev_vendor_id+0x58/0x88
>>>>>>>>>> [  198.765413]  pci_scan_single_device+0x84/0xf8
>>>>>>>>>> [  198.765812]  pci_scan_slot+0x7c/0x120
>>>>>>>>>> [  198.766149]  pci_scan_child_bus_extend+0x68/0x2dc
>>>>>>>>>> [  198.766579]  pci_scan_bridge_extend+0x350/0x588
>>>>>>>>>> [  198.766992]  pci_scan_child_bus_extend+0x21c/0x2dc
>>>>>>>>>> [  198.767430]  pci_scan_child_bus+0x24/0x30
>>>>>>>>>> [  198.767797]  pci_scan_root_bus_bridge+0xc4/0xd0
>>>>>>>>>> [  198.768215]  rockchip_pcie_probe+0x610/0x74c [pcie_rockchip_host]
>>>>>>>>>> [  198.768770]  platform_drv_probe+0x58/0xa8
>>>>>>>>>> [  198.769139]  really_probe+0xe0/0x318
>>>>>>>>>> [  198.769468]  driver_probe_device+0x5c/0xf0
>>>>>>>>>> [  198.769844]  device_driver_attach+0x74/0x80
>>>>>>>>>> [  198.770227]  __driver_attach+0x64/0xe8
>>>>>>>>>> [  198.770572]  bus_for_each_dev+0x84/0xd8
>>>>>>>>>> [  198.770924]  driver_attach+0x30/0x40
>>>>>>>>>> [  198.771253]  bus_add_driver+0x188/0x1e8
>>>>>>>>>> [  198.771605]  driver_register+0x64/0x110
>>>>>>>>>> [  198.771956]  __platform_driver_register+0x54/0x60
>>>>>>>>>> [  198.772388]  rockchip_pcie_driver_init+0x28/0x10000 [pcie_rockchip_host]
>>>>>>>>>> [  198.772998]  do_one_initcall+0x94/0x390
>>>>>>>>>> [  198.773353]  do_init_module+0x88/0x268
>>>>>>>>>> [  198.773697]  load_module+0x1e18/0x2198
>>>>>>>>>> [  198.774043]  __do_sys_finit_module+0xd0/0xe8
>>>>>>>>>> [  198.774435]  __arm64_sys_finit_module+0x28/0x38
>>>>>>>>>> [  198.774858]  el0_svc_common.constprop.3+0xa4/0x1d8
>>>>>>>>>> [  198.775297]  el0_svc_handler+0x34/0xa0
>>>>>>>>>> [  198.775645]  el0_svc+0x14/0x40
>>>>>>>>>> [  198.775928]  el0_sync_handler+0x118/0x290
>>>>>>>>>> [  198.776295]  el0_sync+0x164/0x180
>>>>>>>>>> [  198.776609] Code: bad PC value
>>>>>>>>>> [  198.776897] ---[ end trace 88fc77651b5e2909 ]---
>>>>>>
> 
> 
> 

[-- Attachment #2: 0001-WFT-PCI-rockchip-play-game-with-unsupported-request-.patch --]
[-- Type: text/plain, Size: 9960 bytes --]

From da9db487615c3687a0823c54d8d0790cbd4f79a2 Mon Sep 17 00:00:00 2001
From: Shawn Lin <shawn.lin@rock-chips.com>
Date: Thu, 21 Nov 2019 09:45:12 +0800
Subject: [PATCH] WFT: PCI: rockchip: play game with unsupported request from
 EP

A native defect prevents this RC from handling any response from
an EP in which the UR field is set. Take a big hammer: take over
the SError handler and work around it.

Maybe ATF is the right place to cover this kind of defect in the
future; on ARM32, imx6q_pcie_abort_handler handles almost the
same defect.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
---
 arch/arm64/include/asm/system_misc.h        |  5 +-
 arch/arm64/kernel/entry.S                   | 86 ++++++++++++++++++++---------
 arch/arm64/mm/fault.c                       | 44 ++++++++++++++-
 drivers/pci/controller/pcie-rockchip-host.c | 15 +++++
 drivers/pci/controller/pcie-rockchip.h      |  4 ++
 drivers/pci/probe.c                         |  7 ++-
 6 files changed, 132 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index 1ab63cf..b45e4e1 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -28,7 +28,10 @@ void arm64_notify_die(const char *str, struct pt_regs *regs,
 void hook_debug_fault_code(int nr, int (*fn)(unsigned long, unsigned int,
 					     struct pt_regs *),
 			   int sig, int code, const char *name);
-
+void hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int,
+			struct pt_regs *), int sig, int code, const char *name);
+void *hook_serror_handler(int (*fn)(unsigned long, unsigned int,
+		        struct pt_regs *));
 struct mm_struct;
 extern void __show_regs(struct pt_regs *);
 
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 583f71a..321f9ff 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -471,7 +471,7 @@ ENTRY(vectors)
 	kernel_ventry	1, sync_invalid			// Synchronous EL1t
 	kernel_ventry	1, irq_invalid			// IRQ EL1t
 	kernel_ventry	1, fiq_invalid			// FIQ EL1t
-	kernel_ventry	1, error_invalid		// Error EL1t
+	kernel_ventry   1, error                // Error EL1h
 
 	kernel_ventry	1, sync				// Synchronous EL1h
 	kernel_ventry	1, irq				// IRQ EL1h
@@ -562,6 +562,36 @@ el1_sync_invalid:
 	inv_entry 1, BAD_SYNC
 ENDPROC(el1_sync_invalid)
 
+       .align  6
+el1_error:
+       kernel_entry 1
+       mrs     x1, esr_el1                     // read the syndrome register
+       lsr     x24, x1, #ESR_ELx_EC_SHIFT      // exception class
+       cmp     x24, #ESR_ELx_EC_SERROR         // SError exception in EL1
+       b.ne    el1_error_inv
+el1_serr:
+       mrs     x0, far_el1
+       enable_dbg
+       // re-enable interrupts if they were enabled in the aborted context
+       tbnz    x23, #7, 1f                     // PSR_I_BIT
+       //enable_irq
+       msr     daifclr, #2
+1:
+       mov     x2, sp                          // struct pt_regs
+       bl      do_serr_abort
+
+       // disable interrupts before pulling preserved data off the stack
+       //disable_irq
+       msr     daifset, #2
+       kernel_exit 1
+el1_error_inv:
+       enable_dbg
+       mov     x0, sp
+       mov     x2, x1
+       mov     x1, #BAD_ERROR
+       b       bad_mode
+ENDPROC(el1_error)
+
 el1_irq_invalid:
 	inv_entry 1, BAD_IRQ
 ENDPROC(el1_irq_invalid)
@@ -667,6 +697,7 @@ el0_irq_compat:
 	kernel_entry 0, 32
 	b	el0_irq_naked
 
+       .align  6
 el0_error_compat:
 	kernel_entry 0, 32
 	b	el0_error_naked
@@ -697,30 +728,6 @@ el0_irq_naked:
 	b	ret_to_user
 ENDPROC(el0_irq)
 
-el1_error:
-	kernel_entry 1
-	mrs	x1, esr_el1
-	gic_prio_kentry_setup tmp=x2
-	enable_dbg
-	mov	x0, sp
-	bl	do_serror
-	kernel_exit 1
-ENDPROC(el1_error)
-
-el0_error:
-	kernel_entry 0
-el0_error_naked:
-	mrs	x25, esr_el1
-	gic_prio_kentry_setup tmp=x2
-	ct_user_exit_irqoff
-	enable_dbg
-	mov	x0, sp
-	mov	x1, x25
-	bl	do_serror
-	enable_da_f
-	b	ret_to_user
-ENDPROC(el0_error)
-
 /*
  * Ok, we need to do extra processing, enter the slow path.
  */
@@ -865,6 +872,35 @@ __entry_tramp_data_start:
 #endif /* CONFIG_RANDOMIZE_BASE */
 #endif /* CONFIG_UNMAP_KERNEL_AT_EL0 */
 
+       .align  6
+el0_error:
+       kernel_entry 0
+el0_error_naked:
+       mrs     x25, esr_el1                    // read the syndrome register
+       lsr     x24, x25, #ESR_ELx_EC_SHIFT     // exception class
+       cmp     x24, #ESR_ELx_EC_SERROR         // SError exception in EL0
+       b.ne    el0_error_inv
+el0_serr:
+       mrs     x26, far_el1
+       // enable interrupts before calling the main handler
+       msr     daifclr, #(8 | 2)
+       //ct_user_exit
+#ifdef CONFIG_CONTEXT_TRACKING
+               bl      context_tracking_user_exit
+#endif
+       bic     x0, x26, #(0xff << 56)
+       mov     x1, x25
+       mov     x2, sp
+       bl      do_serr_abort
+       b       ret_to_user
+el0_error_inv:
+       enable_dbg
+       mov     x0, sp
+       mov     x1, #BAD_ERROR
+       mov     x2, x25
+       b       bad_mode
+ENDPROC(el0_error)
+
 /*
  * Register switch for AArch64. The callee-saved registers need to be saved
  * and restored. On entry:
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 077b02a..5adb7ef 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -48,7 +48,7 @@ struct fault_info {
 	const char *name;
 };
 
-static const struct fault_info fault_info[];
+static struct fault_info fault_info[] __ro_after_init;
 static struct fault_info debug_fault_info[];
 
 static inline const struct fault_info *esr_to_fault_info(unsigned int esr)
@@ -660,7 +660,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	return 0;
 }
 
-static const struct fault_info fault_info[] = {
+static struct fault_info fault_info[] __ro_after_init = {
 	{ do_bad,		SIGKILL, SI_KERNEL,	"ttbr address size fault"	},
 	{ do_bad,		SIGKILL, SI_KERNEL,	"level 1 address size fault"	},
 	{ do_bad,		SIGKILL, SI_KERNEL,	"level 2 address size fault"	},
@@ -727,6 +727,46 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 63"			},
 };
 
+void __init hook_fault_code(int nr,
+                       int (*fn)(unsigned long, unsigned int, struct pt_regs *),
+                       int sig, int code, const char *name)
+{
+       BUG_ON(nr < 0 || nr >= ARRAY_SIZE(fault_info));
+
+       fault_info[nr].fn       = fn;
+       fault_info[nr].sig      = sig;
+       fault_info[nr].code     = code;
+       fault_info[nr].name     = name;
+}
+
+static int (*serror_handler)(unsigned long, unsigned int,
+                            struct pt_regs *) __ro_after_init;
+
+void *__init hook_serror_handler(int (*fn)(unsigned long, unsigned int,
+                                struct pt_regs *))
+{
+       void *ret = serror_handler;
+
+       serror_handler = fn;
+       return ret;
+}
+
+asmlinkage void  do_serr_abort(unsigned long addr, unsigned int esr,
+                                        struct pt_regs *regs)
+{
+	const struct fault_info *inf;
+
+	inf = esr_to_fault_info(esr);
+       if (serror_handler)
+               if (!serror_handler(addr, esr, regs))
+                       return;
+
+       pr_alert("Unhandled SError: (0x%08x) at 0x%016lx\n", esr, addr);
+       __show_regs(regs);
+
+       arm64_notify_die(inf->name, regs, inf->sig, inf->code, (void __user *)addr, esr);
+}
+
 void do_mem_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 {
 	const struct fault_info *inf = esr_to_fault_info(esr);
diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
index ee83f84..5c5578b 100644
--- a/drivers/pci/controller/pcie-rockchip-host.c
+++ b/drivers/pci/controller/pcie-rockchip-host.c
@@ -958,6 +958,17 @@ static int __maybe_unused rockchip_pcie_resume_noirq(struct device *dev)
 	return err;
 }
 
+#ifdef CONFIG_ARM64
+static int (*serror_chain)(unsigned long addr, unsigned int esr,
+                               struct pt_regs *regs);
+static int do_rockchip_pcie_serror(unsigned long addr, unsigned int esr,
+                               struct pt_regs *regs)
+{
+       printk("%s:hey!\n", __func__);
+       return 0;
+}
+#endif
+
 static int rockchip_pcie_probe(struct platform_device *pdev)
 {
 	struct rockchip_pcie *rockchip;
@@ -1028,6 +1039,10 @@ static int rockchip_pcie_probe(struct platform_device *pdev)
 	bridge->map_irq = of_irq_parse_and_map_pci;
 	bridge->swizzle_irq = pci_common_swizzle;
 
+       #ifdef CONFIG_ARM64
+       serror_chain = hook_serror_handler(do_rockchip_pcie_serror);
+       #endif
+
 	err = pci_scan_root_bus_bridge(bridge);
 	if (err < 0)
 		goto err_remove_irq_domain;
diff --git a/drivers/pci/controller/pcie-rockchip.h b/drivers/pci/controller/pcie-rockchip.h
index d90dfb3..aa0829c 100644
--- a/drivers/pci/controller/pcie-rockchip.h
+++ b/drivers/pci/controller/pcie-rockchip.h
@@ -13,6 +13,10 @@
 
 #include <linux/kernel.h>
 #include <linux/pci.h>
+#ifdef CONFIG_ARM64
+#include <asm/signal.h>
+#include <asm/system_misc.h>
+#endif
 
 /*
  * The upper 16 bits of PCIE_CLIENT_CONFIG are a write mask for the lower 16
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 512cb43..0d04bc5 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2254,7 +2254,12 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
 	    bridge->device == 0x80b5)
 		return pci_idt_bus_quirk(bus, devfn, l, timeout);
 #endif
-
+	/*
+	 * Some broken switches need RC write something to its RO vender ID,
+	 * otherwise it kick back 0x0 for any CFG read. I see how windows +
+	 * BIOS enumerate it from analyzer.... Nee proper quirks help here?
+	 * */
+	pci_bus_write_config_dword(bus, devfn, PCI_VENDOR_ID, 0x00240000);
 	return pci_bus_generic_read_dev_vendor_id(bus, devfn, l, timeout);
 }
 EXPORT_SYMBOL(pci_bus_read_dev_vendor_id);
-- 
1.9.1



* Re: [BUG] rk3399-rockpro64 pcie synchronous external abort
  2019-11-21  2:03                   ` Shawn Lin
@ 2019-11-22  1:02                     ` Peter Geis
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Geis @ 2019-11-22  1:02 UTC (permalink / raw)
  To: Shawn Lin
  Cc: Robin Murphy, Bjorn Helgaas, linux-pci, Heiko Stuebner,
	open list:ARM/Rockchip SoC...

On Wed, Nov 20, 2019 at 9:05 PM Shawn Lin <shawn.lin@rock-chips.com> wrote:
>
<snip>
>
> Sorry, I don't have much time to follow up with upstream development.
> I've attached a patch based on linux-next for debugging, but I don't
> know if it will work as expected since I can't test it right now. At
> least it shows what the problems are.

Well, it is certainly hideous, and it does not compile cleanly (the
built-in build complains and the module build is broken), but it
finished scanning all the buses.
I'll do some more serious testing and get back to you with more results.
Thanks!



Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-04 18:55 [BUG] rk3399-rockpro64 pcie synchronous external abort Peter Geis
2019-11-09  1:08 ` Peter Geis
2019-11-09 16:37   ` Doug Anderson
2019-11-10 15:43     ` Peter Geis
2019-11-12  0:03       ` Bjorn Helgaas
2019-11-12  0:21         ` Peter Geis
2019-11-12  0:13 ` Bjorn Helgaas
2019-11-12  0:30   ` Peter Geis
2019-11-12  2:29     ` Bjorn Helgaas
2019-11-12 15:55       ` Peter Geis
2019-11-12 19:15         ` Robin Murphy
2019-11-12 19:41           ` Peter Geis
2019-11-13  1:06             ` Peter Geis
2019-11-13  1:19               ` Peter Geis
2019-11-21  0:36                 ` Peter Geis
2019-11-21  2:03                   ` Shawn Lin
2019-11-22  1:02                     ` Peter Geis
