All of lore.kernel.org
 help / color / mirror / Atom feed
* radeon ring 0 test failed on arm64
@ 2021-05-25  2:34 Peter Geis
  2021-05-25 12:46 ` Alex Deucher
  2021-05-25 14:08 ` Christian König
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Geis @ 2021-05-25  2:34 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig; +Cc: amd-gfx

Good Evening,

I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
This device has 1GB available at <0x3 0x00000000> for the PCIe
controller, which makes a dGPU theoretically possible.
While attempting to light off a HD7570 card I manage to get a modeset
console, but ring0 test fails and disables acceleration.

Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
Any insight you can provide would be much appreciated.

Very Respectfully,
Peter Geis

lspci -v
00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 3566
(rev 01) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 96
        Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
        I/O behind bridge: 00001000-00001fff [size=4K]
        Memory behind bridge: 00900000-009fffff [size=1M]
        Prefetchable memory behind bridge:
0000000010000000-000000001fffffff [size=256M]
        Expansion ROM at 300a00000 [virtual] [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable+ Count=1/32 Maskable- 64bit+
        Capabilities: [70] Express Root Port (Slot-), MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Secondary PCI Express
        Capabilities: [160] L1 PM Substates
        Capabilities: [170] Vendor Specific Information: ID=0002 Rev=4
Len=100 <?>
        Kernel driver in use: pcieport

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
[AMD/ATI] Turks PRO [Radeon HD 7570] (prog-if 00 [VGA controller])
        Subsystem: Dell Turks PRO [Radeon HD 7570]
        Flags: bus master, fast devsel, latency 0, IRQ 95
        Memory at 310000000 (64-bit, prefetchable) [size=256M]
        Memory at 300900000 (64-bit, non-prefetchable) [size=128K]
        I/O ports at 1000 [size=256]
        Expansion ROM at 300920000 [disabled] [size=128K]
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Kernel driver in use: radeon

01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Turks
HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
        Subsystem: Dell Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
        Flags: bus master, fast devsel, latency 0, IRQ 98
        Memory at 300940000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Kernel driver in use: snd_hda_intel

[    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
vpcie3v3-supply from device tree
[    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
/pcie@fe260000 ranges:
[    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
[    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
0x0300800000..0x03008fffff -> 0x0000800000
[    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
0x0300900000..0x033fffffff -> 0x0000900000
[    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
[    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
[    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
[    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
[    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
8 outbound, 8 inbound
[    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
[    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
[    6.653142] pci_bus 0000:00: root bus resource [bus 00]
[    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
(bus address [0x800000-0x8fffff])
[    6.654781] pci_bus 0000:00: root bus resource [mem
0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
[    6.655782] pci_bus 0000:00: scanning bus
[    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
can't handle them)
[    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
[    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
[    6.659923] pci 0000:00:00.0: supports D1 D2
[    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[    6.661053] pci 0000:00:00.0: PME# disabled
[    6.672578] pci_bus 0000:00: fixups for bus
[    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
[    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
under [bus 00] (conflicts with (null) [bus 00])
[    6.675993] pci_bus 0000:01: scanning bus
[    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
[    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
64bit pref]
[    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
[    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
[    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
[    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    6.682170] pci 0000:01:00.0: supports D1 D2
[    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
Gb/s with 2.5 GT/s PCIe x16 link)
[    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
decodes=io+mem,owns=none,locks=none
[    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
[    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    6.691099] pci 0000:01:00.1: supports D1 D2
[    6.702495] pci_bus 0000:01: fixups for bus
[    6.702935] pci_bus 0000:01: bus scan returning with max=01
[    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
[    6.704171] pci_bus 0000:00: bus scan returning with max=ff
[    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
0x310000000-0x31fffffff 64bit pref]
[    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
[    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
0x300a00000-0x300a0ffff pref]
[    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
[    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
0x310000000-0x31fffffff 64bit pref]
[    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
0x300900000-0x30091ffff 64bit]
[    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
0x300920000-0x30093ffff pref]
[    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
0x300940000-0x300943fff 64bit]
[    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
[    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
[    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
[    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
[    6.713278] pci 0000:00:00.0:   bridge window [mem
0x310000000-0x31fffffff 64bit pref]
[    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
[    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
[    6.751738] pcieport 0000:00:00.0: saving config space at offset
0x0 (reading 0x35661d87)
[    6.752495] pcieport 0000:00:00.0: saving config space at offset
0x4 (reading 0x100507)
[    6.753224] pcieport 0000:00:00.0: saving config space at offset
0x8 (reading 0x6040001)
[    6.754217] pcieport 0000:00:00.0: saving config space at offset
0xc (reading 0x10000)
[    6.754942] pcieport 0000:00:00.0: saving config space at offset
0x10 (reading 0x0)
[    6.755640] pcieport 0000:00:00.0: saving config space at offset
0x14 (reading 0x0)
[    6.756337] pcieport 0000:00:00.0: saving config space at offset
0x18 (reading 0xff0100)
[    6.757073] pcieport 0000:00:00.0: saving config space at offset
0x1c (reading 0x20001010)
[    6.757878] pcieport 0000:00:00.0: saving config space at offset
0x20 (reading 0x900090)
[    6.758614] pcieport 0000:00:00.0: saving config space at offset
0x24 (reading 0x1ff11001)
[    6.759361] pcieport 0000:00:00.0: saving config space at offset
0x28 (reading 0x0)
[    6.760057] pcieport 0000:00:00.0: saving config space at offset
0x2c (reading 0x0)
[    6.760752] pcieport 0000:00:00.0: saving config space at offset
0x30 (reading 0x0)
[    6.761501] pcieport 0000:00:00.0: saving config space at offset
0x34 (reading 0x40)
[    6.762206] pcieport 0000:00:00.0: saving config space at offset
0x38 (reading 0x0)
[    6.762902] pcieport 0000:00:00.0: saving config space at offset
0x3c (reading 0x2015f)
[    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
[    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
[    6.766911] [drm:drm_minor_register]
[    6.770051] [drm:drm_minor_register] new minor registered 128
[    6.770606] [drm:drm_minor_register]
[    6.771958] [drm:drm_minor_register] new minor registered 0
[    6.772640] [drm] initializing kernel modesetting (TURKS
0x1002:0x675D 0x1028:0x2B20 0x00).
[    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
[    7.029814] ATOM BIOS: TURKS
[    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
00000000 0kb
[    7.030901] [drm] GPU not posted. posting now...
[    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
0x000000003FFFFFFF (1024M used)
[    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
0x000000007FFFFFFF
[    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
[    7.039533] [drm] RAM width 128bits DDR
[    7.040975] [drm] radeon: 1024M of VRAM memory ready
[    7.041543] [drm] radeon: 1024M of GTT memory ready.
[    7.042289] [drm:ni_init_microcode]
[    7.042639] [drm] Loading TURKS Microcode
[    7.043047] [drm] Internal thermal controller with fan control
[    7.059713] [drm] radeon: dpm initialized
[    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
radeon.pcie_gen2=0
[    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
[    7.169257] radeon 0000:01:00.0: WB enabled
[    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
addr 0x0000000040000c00
[    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
addr 0x0000000040000c0c
[    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
addr 0x0000000000072118
[    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
[    7.184105] radeon 0000:01:00.0: radeon: using MSI.
[    7.184571] [drm:drm_irq_install] irq=97
[    7.185619] [drm] radeon: irq initialized.
[    7.186795] radeon 0000:01:00.0: enabling bus mastering
[    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
start: rptr 0, wptr 96
[    7.188118] [drm:evergreen_irq_process] IH: D1 flip
[    7.188563] [drm:evergreen_irq_process] IH: D2 flip
[    7.189006] [drm:evergreen_irq_process] IH: D3 flip
[    7.189450] [drm:evergreen_irq_process] IH: D4 flip
[    7.189894] [drm:evergreen_irq_process] IH: D5 flip
[    7.190337] [drm:evergreen_irq_process] IH: D6 flip
[    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
start: rptr 96, wptr 96
[    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
(scratch(0x8504)=0xCAFEDEAD)
[    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
[    7.533961] [drm:drm_irq_uninstall] irq=97
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25  2:34 radeon ring 0 test failed on arm64 Peter Geis
@ 2021-05-25 12:46 ` Alex Deucher
  2021-05-25 12:55   ` Peter Geis
  2021-05-25 14:08 ` Christian König
  1 sibling, 1 reply; 45+ messages in thread
From: Alex Deucher @ 2021-05-25 12:46 UTC (permalink / raw)
  To: Peter Geis; +Cc: Deucher, Alexander, Christian Koenig, amd-gfx list

On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> wrote:
>
> Good Evening,
>
> I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
> This device has 1GB available at <0x3 0x00000000> for the PCIe
> controller, which makes a dGPU theoretically possible.
> While attempting to light off a HD7570 card I manage to get a modeset
> console, but ring0 test fails and disables acceleration.
>
> Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
> Any insight you can provide would be much appreciated.

Does your platform support PCIe cache coherency with the CPU?  I.e.,
does the CPU allow cache snoops from PCIe devices?  That is required
for the driver to operate.

Alex


>
> Very Respectfully,
> Peter Geis
>
> lspci -v
> 00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 3566
> (rev 01) (prog-if 00 [Normal decode])
>         Flags: bus master, fast devsel, latency 0, IRQ 96
>         Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
>         I/O behind bridge: 00001000-00001fff [size=4K]
>         Memory behind bridge: 00900000-009fffff [size=1M]
>         Prefetchable memory behind bridge:
> 0000000010000000-000000001fffffff [size=256M]
>         Expansion ROM at 300a00000 [virtual] [disabled] [size=64K]
>         Capabilities: [40] Power Management version 3
>         Capabilities: [50] MSI: Enable+ Count=1/32 Maskable- 64bit+
>         Capabilities: [70] Express Root Port (Slot-), MSI 00
>         Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [148] Secondary PCI Express
>         Capabilities: [160] L1 PM Substates
>         Capabilities: [170] Vendor Specific Information: ID=0002 Rev=4
> Len=100 <?>
>         Kernel driver in use: pcieport
>
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Turks PRO [Radeon HD 7570] (prog-if 00 [VGA controller])
>         Subsystem: Dell Turks PRO [Radeon HD 7570]
>         Flags: bus master, fast devsel, latency 0, IRQ 95
>         Memory at 310000000 (64-bit, prefetchable) [size=256M]
>         Memory at 300900000 (64-bit, non-prefetchable) [size=128K]
>         I/O ports at 1000 [size=256]
>         Expansion ROM at 300920000 [disabled] [size=128K]
>         Capabilities: [50] Power Management version 3
>         Capabilities: [58] Express Legacy Endpoint, MSI 00
>         Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
>         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
>         Capabilities: [150] Advanced Error Reporting
>         Kernel driver in use: radeon
>
> 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Turks
> HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
>         Subsystem: Dell Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
>         Flags: bus master, fast devsel, latency 0, IRQ 98
>         Memory at 300940000 (64-bit, non-prefetchable) [size=16K]
>         Capabilities: [50] Power Management version 3
>         Capabilities: [58] Express Legacy Endpoint, MSI 00
>         Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
>         Capabilities: [150] Advanced Error Reporting
>         Kernel driver in use: snd_hda_intel
>
> [    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
> vpcie3v3-supply from device tree
> [    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
> /pcie@fe260000 ranges:
> [    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
> [    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
> 0x0300800000..0x03008fffff -> 0x0000800000
> [    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
> 0x0300900000..0x033fffffff -> 0x0000900000
> [    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
> [    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
> [    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
> [    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
> [    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
> 8 outbound, 8 inbound
> [    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
> [    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
> [    6.653142] pci_bus 0000:00: root bus resource [bus 00]
> [    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> (bus address [0x800000-0x8fffff])
> [    6.654781] pci_bus 0000:00: root bus resource [mem
> 0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
> [    6.655782] pci_bus 0000:00: scanning bus
> [    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
> can't handle them)
> [    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
> [    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
> [    6.659923] pci 0000:00:00.0: supports D1 D2
> [    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> [    6.661053] pci 0000:00:00.0: PME# disabled
> [    6.672578] pci_bus 0000:00: fixups for bus
> [    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
> [    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
> under [bus 00] (conflicts with (null) [bus 00])
> [    6.675993] pci_bus 0000:01: scanning bus
> [    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
> [    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
> 64bit pref]
> [    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
> [    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
> [    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
> [    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
> [    6.682170] pci 0000:01:00.0: supports D1 D2
> [    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
> Gb/s with 2.5 GT/s PCIe x16 link)
> [    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
> decodes=io+mem,owns=none,locks=none
> [    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
> [    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> [    6.691099] pci 0000:01:00.1: supports D1 D2
> [    6.702495] pci_bus 0000:01: fixups for bus
> [    6.702935] pci_bus 0000:01: bus scan returning with max=01
> [    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
> [    6.704171] pci_bus 0000:00: bus scan returning with max=ff
> [    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
> 0x310000000-0x31fffffff 64bit pref]
> [    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
> [    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
> 0x300a00000-0x300a0ffff pref]
> [    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
> [    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
> 0x310000000-0x31fffffff 64bit pref]
> [    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
> 0x300900000-0x30091ffff 64bit]
> [    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
> 0x300920000-0x30093ffff pref]
> [    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
> 0x300940000-0x300943fff 64bit]
> [    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
> [    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> [    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
> [    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
> [    6.713278] pci 0000:00:00.0:   bridge window [mem
> 0x310000000-0x31fffffff 64bit pref]
> [    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
> [    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
> [    6.751738] pcieport 0000:00:00.0: saving config space at offset
> 0x0 (reading 0x35661d87)
> [    6.752495] pcieport 0000:00:00.0: saving config space at offset
> 0x4 (reading 0x100507)
> [    6.753224] pcieport 0000:00:00.0: saving config space at offset
> 0x8 (reading 0x6040001)
> [    6.754217] pcieport 0000:00:00.0: saving config space at offset
> 0xc (reading 0x10000)
> [    6.754942] pcieport 0000:00:00.0: saving config space at offset
> 0x10 (reading 0x0)
> [    6.755640] pcieport 0000:00:00.0: saving config space at offset
> 0x14 (reading 0x0)
> [    6.756337] pcieport 0000:00:00.0: saving config space at offset
> 0x18 (reading 0xff0100)
> [    6.757073] pcieport 0000:00:00.0: saving config space at offset
> 0x1c (reading 0x20001010)
> [    6.757878] pcieport 0000:00:00.0: saving config space at offset
> 0x20 (reading 0x900090)
> [    6.758614] pcieport 0000:00:00.0: saving config space at offset
> 0x24 (reading 0x1ff11001)
> [    6.759361] pcieport 0000:00:00.0: saving config space at offset
> 0x28 (reading 0x0)
> [    6.760057] pcieport 0000:00:00.0: saving config space at offset
> 0x2c (reading 0x0)
> [    6.760752] pcieport 0000:00:00.0: saving config space at offset
> 0x30 (reading 0x0)
> [    6.761501] pcieport 0000:00:00.0: saving config space at offset
> 0x34 (reading 0x40)
> [    6.762206] pcieport 0000:00:00.0: saving config space at offset
> 0x38 (reading 0x0)
> [    6.762902] pcieport 0000:00:00.0: saving config space at offset
> 0x3c (reading 0x2015f)
> [    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
> [    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
> [    6.766911] [drm:drm_minor_register]
> [    6.770051] [drm:drm_minor_register] new minor registered 128
> [    6.770606] [drm:drm_minor_register]
> [    6.771958] [drm:drm_minor_register] new minor registered 0
> [    6.772640] [drm] initializing kernel modesetting (TURKS
> 0x1002:0x675D 0x1028:0x2B20 0x00).
> [    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
> [    7.029814] ATOM BIOS: TURKS
> [    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
> 00000000 0kb
> [    7.030901] [drm] GPU not posted. posting now...
> [    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> 0x000000003FFFFFFF (1024M used)
> [    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> 0x000000007FFFFFFF
> [    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
> [    7.039533] [drm] RAM width 128bits DDR
> [    7.040975] [drm] radeon: 1024M of VRAM memory ready
> [    7.041543] [drm] radeon: 1024M of GTT memory ready.
> [    7.042289] [drm:ni_init_microcode]
> [    7.042639] [drm] Loading TURKS Microcode
> [    7.043047] [drm] Internal thermal controller with fan control
> [    7.059713] [drm] radeon: dpm initialized
> [    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
> [    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
> radeon.pcie_gen2=0
> [    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
> [    7.169257] radeon 0000:01:00.0: WB enabled
> [    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
> addr 0x0000000040000c00
> [    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
> addr 0x0000000040000c0c
> [    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
> addr 0x0000000000072118
> [    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> [    7.184105] radeon 0000:01:00.0: radeon: using MSI.
> [    7.184571] [drm:drm_irq_install] irq=97
> [    7.185619] [drm] radeon: irq initialized.
> [    7.186795] radeon 0000:01:00.0: enabling bus mastering
> [    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
> start: rptr 0, wptr 96
> [    7.188118] [drm:evergreen_irq_process] IH: D1 flip
> [    7.188563] [drm:evergreen_irq_process] IH: D2 flip
> [    7.189006] [drm:evergreen_irq_process] IH: D3 flip
> [    7.189450] [drm:evergreen_irq_process] IH: D4 flip
> [    7.189894] [drm:evergreen_irq_process] IH: D5 flip
> [    7.190337] [drm:evergreen_irq_process] IH: D6 flip
> [    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
> start: rptr 96, wptr 96
> [    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
> (scratch(0x8504)=0xCAFEDEAD)
> [    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
> [    7.533961] [drm:drm_irq_uninstall] irq=97
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25 12:46 ` Alex Deucher
@ 2021-05-25 12:55   ` Peter Geis
  2021-05-25 13:05     ` Alex Deucher
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Geis @ 2021-05-25 12:55 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Christian Koenig, amd-gfx list

On Tue, May 25, 2021 at 8:47 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> wrote:
> >
> > Good Evening,
> >
> > I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
> > This device has 1GB available at <0x3 0x00000000> for the PCIe
> > controller, which makes a dGPU theoretically possible.
> > While attempting to light off a HD7570 card I manage to get a modeset
> > console, but ring0 test fails and disables acceleration.
> >
> > Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
> > Any insight you can provide would be much appreciated.
>
> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> does the CPU allow cache snoops from PCIe devices?  That is required
> for the driver to operate.

Ah, most likely not.
This issue has come up already as the GIC isn't permitted to snoop on
the CPUs, so I doubt the PCIe controller can either.

Is there no way to work around this or is it dead in the water?

>
> Alex
>
>
> >
> > Very Respectfully,
> > Peter Geis
> >
> > lspci -v
> > 00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 3566
> > (rev 01) (prog-if 00 [Normal decode])
> >         Flags: bus master, fast devsel, latency 0, IRQ 96
> >         Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
> >         I/O behind bridge: 00001000-00001fff [size=4K]
> >         Memory behind bridge: 00900000-009fffff [size=1M]
> >         Prefetchable memory behind bridge:
> > 0000000010000000-000000001fffffff [size=256M]
> >         Expansion ROM at 300a00000 [virtual] [disabled] [size=64K]
> >         Capabilities: [40] Power Management version 3
> >         Capabilities: [50] MSI: Enable+ Count=1/32 Maskable- 64bit+
> >         Capabilities: [70] Express Root Port (Slot-), MSI 00
> >         Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
> >         Capabilities: [100] Advanced Error Reporting
> >         Capabilities: [148] Secondary PCI Express
> >         Capabilities: [160] L1 PM Substates
> >         Capabilities: [170] Vendor Specific Information: ID=0002 Rev=4
> > Len=100 <?>
> >         Kernel driver in use: pcieport
> >
> > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > [AMD/ATI] Turks PRO [Radeon HD 7570] (prog-if 00 [VGA controller])
> >         Subsystem: Dell Turks PRO [Radeon HD 7570]
> >         Flags: bus master, fast devsel, latency 0, IRQ 95
> >         Memory at 310000000 (64-bit, prefetchable) [size=256M]
> >         Memory at 300900000 (64-bit, non-prefetchable) [size=128K]
> >         I/O ports at 1000 [size=256]
> >         Expansion ROM at 300920000 [disabled] [size=128K]
> >         Capabilities: [50] Power Management version 3
> >         Capabilities: [58] Express Legacy Endpoint, MSI 00
> >         Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> >         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > Len=010 <?>
> >         Capabilities: [150] Advanced Error Reporting
> >         Kernel driver in use: radeon
> >
> > 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Turks
> > HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> >         Subsystem: Dell Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> >         Flags: bus master, fast devsel, latency 0, IRQ 98
> >         Memory at 300940000 (64-bit, non-prefetchable) [size=16K]
> >         Capabilities: [50] Power Management version 3
> >         Capabilities: [58] Express Legacy Endpoint, MSI 00
> >         Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > Len=010 <?>
> >         Capabilities: [150] Advanced Error Reporting
> >         Kernel driver in use: snd_hda_intel
> >
> > [    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
> > vpcie3v3-supply from device tree
> > [    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
> > /pcie@fe260000 ranges:
> > [    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
> > [    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
> > 0x0300800000..0x03008fffff -> 0x0000800000
> > [    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
> > 0x0300900000..0x033fffffff -> 0x0000900000
> > [    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
> > [    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
> > [    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
> > [    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
> > [    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
> > 8 outbound, 8 inbound
> > [    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
> > [    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
> > [    6.653142] pci_bus 0000:00: root bus resource [bus 00]
> > [    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > (bus address [0x800000-0x8fffff])
> > [    6.654781] pci_bus 0000:00: root bus resource [mem
> > 0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
> > [    6.655782] pci_bus 0000:00: scanning bus
> > [    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
> > can't handle them)
> > [    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
> > [    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
> > [    6.659923] pci 0000:00:00.0: supports D1 D2
> > [    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > [    6.661053] pci 0000:00:00.0: PME# disabled
> > [    6.672578] pci_bus 0000:00: fixups for bus
> > [    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
> > [    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
> > under [bus 00] (conflicts with (null) [bus 00])
> > [    6.675993] pci_bus 0000:01: scanning bus
> > [    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
> > [    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
> > 64bit pref]
> > [    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
> > [    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
> > [    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
> > [    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
> > [    6.682170] pci 0000:01:00.0: supports D1 D2
> > [    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
> > Gb/s with 2.5 GT/s PCIe x16 link)
> > [    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
> > decodes=io+mem,owns=none,locks=none
> > [    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
> > [    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> > [    6.691099] pci 0000:01:00.1: supports D1 D2
> > [    6.702495] pci_bus 0000:01: fixups for bus
> > [    6.702935] pci_bus 0000:01: bus scan returning with max=01
> > [    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
> > [    6.704171] pci_bus 0000:00: bus scan returning with max=ff
> > [    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
> > 0x310000000-0x31fffffff 64bit pref]
> > [    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
> > [    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
> > 0x300a00000-0x300a0ffff pref]
> > [    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
> > [    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
> > 0x310000000-0x31fffffff 64bit pref]
> > [    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
> > 0x300900000-0x30091ffff 64bit]
> > [    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
> > 0x300920000-0x30093ffff pref]
> > [    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
> > 0x300940000-0x300943fff 64bit]
> > [    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
> > [    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> > [    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
> > [    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
> > [    6.713278] pci 0000:00:00.0:   bridge window [mem
> > 0x310000000-0x31fffffff 64bit pref]
> > [    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
> > [    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
> > [    6.751738] pcieport 0000:00:00.0: saving config space at offset
> > 0x0 (reading 0x35661d87)
> > [    6.752495] pcieport 0000:00:00.0: saving config space at offset
> > 0x4 (reading 0x100507)
> > [    6.753224] pcieport 0000:00:00.0: saving config space at offset
> > 0x8 (reading 0x6040001)
> > [    6.754217] pcieport 0000:00:00.0: saving config space at offset
> > 0xc (reading 0x10000)
> > [    6.754942] pcieport 0000:00:00.0: saving config space at offset
> > 0x10 (reading 0x0)
> > [    6.755640] pcieport 0000:00:00.0: saving config space at offset
> > 0x14 (reading 0x0)
> > [    6.756337] pcieport 0000:00:00.0: saving config space at offset
> > 0x18 (reading 0xff0100)
> > [    6.757073] pcieport 0000:00:00.0: saving config space at offset
> > 0x1c (reading 0x20001010)
> > [    6.757878] pcieport 0000:00:00.0: saving config space at offset
> > 0x20 (reading 0x900090)
> > [    6.758614] pcieport 0000:00:00.0: saving config space at offset
> > 0x24 (reading 0x1ff11001)
> > [    6.759361] pcieport 0000:00:00.0: saving config space at offset
> > 0x28 (reading 0x0)
> > [    6.760057] pcieport 0000:00:00.0: saving config space at offset
> > 0x2c (reading 0x0)
> > [    6.760752] pcieport 0000:00:00.0: saving config space at offset
> > 0x30 (reading 0x0)
> > [    6.761501] pcieport 0000:00:00.0: saving config space at offset
> > 0x34 (reading 0x40)
> > [    6.762206] pcieport 0000:00:00.0: saving config space at offset
> > 0x38 (reading 0x0)
> > [    6.762902] pcieport 0000:00:00.0: saving config space at offset
> > 0x3c (reading 0x2015f)
> > [    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
> > [    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
> > [    6.766911] [drm:drm_minor_register]
> > [    6.770051] [drm:drm_minor_register] new minor registered 128
> > [    6.770606] [drm:drm_minor_register]
> > [    6.771958] [drm:drm_minor_register] new minor registered 0
> > [    6.772640] [drm] initializing kernel modesetting (TURKS
> > 0x1002:0x675D 0x1028:0x2B20 0x00).
> > [    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
> > [    7.029814] ATOM BIOS: TURKS
> > [    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
> > 00000000 0kb
> > [    7.030901] [drm] GPU not posted. posting now...
> > [    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> > 0x000000003FFFFFFF (1024M used)
> > [    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> > 0x000000007FFFFFFF
> > [    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
> > [    7.039533] [drm] RAM width 128bits DDR
> > [    7.040975] [drm] radeon: 1024M of VRAM memory ready
> > [    7.041543] [drm] radeon: 1024M of GTT memory ready.
> > [    7.042289] [drm:ni_init_microcode]
> > [    7.042639] [drm] Loading TURKS Microcode
> > [    7.043047] [drm] Internal thermal controller with fan control
> > [    7.059713] [drm] radeon: dpm initialized
> > [    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
> > [    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
> > radeon.pcie_gen2=0
> > [    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
> > [    7.169257] radeon 0000:01:00.0: WB enabled
> > [    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
> > addr 0x0000000040000c00
> > [    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
> > addr 0x0000000040000c0c
> > [    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
> > addr 0x0000000000072118
> > [    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> > [    7.184105] radeon 0000:01:00.0: radeon: using MSI.
> > [    7.184571] [drm:drm_irq_install] irq=97
> > [    7.185619] [drm] radeon: irq initialized.
> > [    7.186795] radeon 0000:01:00.0: enabling bus mastering
> > [    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
> > start: rptr 0, wptr 96
> > [    7.188118] [drm:evergreen_irq_process] IH: D1 flip
> > [    7.188563] [drm:evergreen_irq_process] IH: D2 flip
> > [    7.189006] [drm:evergreen_irq_process] IH: D3 flip
> > [    7.189450] [drm:evergreen_irq_process] IH: D4 flip
> > [    7.189894] [drm:evergreen_irq_process] IH: D5 flip
> > [    7.190337] [drm:evergreen_irq_process] IH: D6 flip
> > [    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
> > start: rptr 96, wptr 96
> > [    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
> > (scratch(0x8504)=0xCAFEDEAD)
> > [    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
> > [    7.533961] [drm:drm_irq_uninstall] irq=97
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25 12:55   ` Peter Geis
@ 2021-05-25 13:05     ` Alex Deucher
  2021-05-25 13:18       ` Peter Geis
  2021-05-25 20:09       ` Robin Murphy
  0 siblings, 2 replies; 45+ messages in thread
From: Alex Deucher @ 2021-05-25 13:05 UTC (permalink / raw)
  To: Peter Geis; +Cc: Deucher, Alexander, Christian Koenig, amd-gfx list

On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 8:47 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> wrote:
> > >
> > > Good Evening,
> > >
> > > I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
> > > This device has 1GB available at <0x3 0x00000000> for the PCIe
> > > controller, which makes a dGPU theoretically possible.
> > > While attempting to light off a HD7570 card I manage to get a modeset
> > > console, but ring0 test fails and disables acceleration.
> > >
> > > Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
> > > Any insight you can provide would be much appreciated.
> >
> > Does your platform support PCIe cache coherency with the CPU?  I.e.,
> > does the CPU allow cache snoops from PCIe devices?  That is required
> > for the driver to operate.
>
> Ah, most likely not.
> This issue has come up already as the GIC isn't permitted to snoop on
> the CPUs, so I doubt the PCIe controller can either.
>
> Is there no way to work around this or is it dead in the water?

It's required by the pcie spec.  You could potentially work around it
if you can allocate uncached memory for DMA, but I don't think that is
possible currently.  Ideally we'd figure out some way to detect if a
particular platform supports cache snooping or not as well.

Alex


>
> >
> > Alex
> >
> >
> > >
> > > Very Respectfully,
> > > Peter Geis
> > >
> > > lspci -v
> > > 00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 3566
> > > (rev 01) (prog-if 00 [Normal decode])
> > >         Flags: bus master, fast devsel, latency 0, IRQ 96
> > >         Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
> > >         I/O behind bridge: 00001000-00001fff [size=4K]
> > >         Memory behind bridge: 00900000-009fffff [size=1M]
> > >         Prefetchable memory behind bridge:
> > > 0000000010000000-000000001fffffff [size=256M]
> > >         Expansion ROM at 300a00000 [virtual] [disabled] [size=64K]
> > >         Capabilities: [40] Power Management version 3
> > >         Capabilities: [50] MSI: Enable+ Count=1/32 Maskable- 64bit+
> > >         Capabilities: [70] Express Root Port (Slot-), MSI 00
> > >         Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
> > >         Capabilities: [100] Advanced Error Reporting
> > >         Capabilities: [148] Secondary PCI Express
> > >         Capabilities: [160] L1 PM Substates
> > >         Capabilities: [170] Vendor Specific Information: ID=0002 Rev=4
> > > Len=100 <?>
> > >         Kernel driver in use: pcieport
> > >
> > > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > > [AMD/ATI] Turks PRO [Radeon HD 7570] (prog-if 00 [VGA controller])
> > >         Subsystem: Dell Turks PRO [Radeon HD 7570]
> > >         Flags: bus master, fast devsel, latency 0, IRQ 95
> > >         Memory at 310000000 (64-bit, prefetchable) [size=256M]
> > >         Memory at 300900000 (64-bit, non-prefetchable) [size=128K]
> > >         I/O ports at 1000 [size=256]
> > >         Expansion ROM at 300920000 [disabled] [size=128K]
> > >         Capabilities: [50] Power Management version 3
> > >         Capabilities: [58] Express Legacy Endpoint, MSI 00
> > >         Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > >         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > > Len=010 <?>
> > >         Capabilities: [150] Advanced Error Reporting
> > >         Kernel driver in use: radeon
> > >
> > > 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Turks
> > > HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> > >         Subsystem: Dell Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> > >         Flags: bus master, fast devsel, latency 0, IRQ 98
> > >         Memory at 300940000 (64-bit, non-prefetchable) [size=16K]
> > >         Capabilities: [50] Power Management version 3
> > >         Capabilities: [58] Express Legacy Endpoint, MSI 00
> > >         Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > >         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > > Len=010 <?>
> > >         Capabilities: [150] Advanced Error Reporting
> > >         Kernel driver in use: snd_hda_intel
> > >
> > > [    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
> > > vpcie3v3-supply from device tree
> > > [    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
> > > /pcie@fe260000 ranges:
> > > [    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
> > > [    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
> > > 0x0300800000..0x03008fffff -> 0x0000800000
> > > [    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
> > > 0x0300900000..0x033fffffff -> 0x0000900000
> > > [    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
> > > [    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
> > > [    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
> > > [    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
> > > [    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
> > > 8 outbound, 8 inbound
> > > [    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
> > > [    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
> > > [    6.653142] pci_bus 0000:00: root bus resource [bus 00]
> > > [    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > > (bus address [0x800000-0x8fffff])
> > > [    6.654781] pci_bus 0000:00: root bus resource [mem
> > > 0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
> > > [    6.655782] pci_bus 0000:00: scanning bus
> > > [    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
> > > can't handle them)
> > > [    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
> > > [    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
> > > [    6.659923] pci 0000:00:00.0: supports D1 D2
> > > [    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > > [    6.661053] pci 0000:00:00.0: PME# disabled
> > > [    6.672578] pci_bus 0000:00: fixups for bus
> > > [    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
> > > [    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
> > > under [bus 00] (conflicts with (null) [bus 00])
> > > [    6.675993] pci_bus 0000:01: scanning bus
> > > [    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
> > > [    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
> > > 64bit pref]
> > > [    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
> > > [    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
> > > [    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
> > > [    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
> > > [    6.682170] pci 0000:01:00.0: supports D1 D2
> > > [    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > > limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
> > > Gb/s with 2.5 GT/s PCIe x16 link)
> > > [    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
> > > decodes=io+mem,owns=none,locks=none
> > > [    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
> > > [    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> > > [    6.691099] pci 0000:01:00.1: supports D1 D2
> > > [    6.702495] pci_bus 0000:01: fixups for bus
> > > [    6.702935] pci_bus 0000:01: bus scan returning with max=01
> > > [    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
> > > [    6.704171] pci_bus 0000:00: bus scan returning with max=ff
> > > [    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
> > > 0x310000000-0x31fffffff 64bit pref]
> > > [    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
> > > [    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
> > > 0x300a00000-0x300a0ffff pref]
> > > [    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
> > > [    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
> > > 0x310000000-0x31fffffff 64bit pref]
> > > [    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
> > > 0x300900000-0x30091ffff 64bit]
> > > [    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
> > > 0x300920000-0x30093ffff pref]
> > > [    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
> > > 0x300940000-0x300943fff 64bit]
> > > [    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
> > > [    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> > > [    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
> > > [    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
> > > [    6.713278] pci 0000:00:00.0:   bridge window [mem
> > > 0x310000000-0x31fffffff 64bit pref]
> > > [    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
> > > [    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
> > > [    6.751738] pcieport 0000:00:00.0: saving config space at offset
> > > 0x0 (reading 0x35661d87)
> > > [    6.752495] pcieport 0000:00:00.0: saving config space at offset
> > > 0x4 (reading 0x100507)
> > > [    6.753224] pcieport 0000:00:00.0: saving config space at offset
> > > 0x8 (reading 0x6040001)
> > > [    6.754217] pcieport 0000:00:00.0: saving config space at offset
> > > 0xc (reading 0x10000)
> > > [    6.754942] pcieport 0000:00:00.0: saving config space at offset
> > > 0x10 (reading 0x0)
> > > [    6.755640] pcieport 0000:00:00.0: saving config space at offset
> > > 0x14 (reading 0x0)
> > > [    6.756337] pcieport 0000:00:00.0: saving config space at offset
> > > 0x18 (reading 0xff0100)
> > > [    6.757073] pcieport 0000:00:00.0: saving config space at offset
> > > 0x1c (reading 0x20001010)
> > > [    6.757878] pcieport 0000:00:00.0: saving config space at offset
> > > 0x20 (reading 0x900090)
> > > [    6.758614] pcieport 0000:00:00.0: saving config space at offset
> > > 0x24 (reading 0x1ff11001)
> > > [    6.759361] pcieport 0000:00:00.0: saving config space at offset
> > > 0x28 (reading 0x0)
> > > [    6.760057] pcieport 0000:00:00.0: saving config space at offset
> > > 0x2c (reading 0x0)
> > > [    6.760752] pcieport 0000:00:00.0: saving config space at offset
> > > 0x30 (reading 0x0)
> > > [    6.761501] pcieport 0000:00:00.0: saving config space at offset
> > > 0x34 (reading 0x40)
> > > [    6.762206] pcieport 0000:00:00.0: saving config space at offset
> > > 0x38 (reading 0x0)
> > > [    6.762902] pcieport 0000:00:00.0: saving config space at offset
> > > 0x3c (reading 0x2015f)
> > > [    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
> > > [    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
> > > [    6.766911] [drm:drm_minor_register]
> > > [    6.770051] [drm:drm_minor_register] new minor registered 128
> > > [    6.770606] [drm:drm_minor_register]
> > > [    6.771958] [drm:drm_minor_register] new minor registered 0
> > > [    6.772640] [drm] initializing kernel modesetting (TURKS
> > > 0x1002:0x675D 0x1028:0x2B20 0x00).
> > > [    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
> > > [    7.029814] ATOM BIOS: TURKS
> > > [    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
> > > 00000000 0kb
> > > [    7.030901] [drm] GPU not posted. posting now...
> > > [    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> > > 0x000000003FFFFFFF (1024M used)
> > > [    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> > > 0x000000007FFFFFFF
> > > [    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
> > > [    7.039533] [drm] RAM width 128bits DDR
> > > [    7.040975] [drm] radeon: 1024M of VRAM memory ready
> > > [    7.041543] [drm] radeon: 1024M of GTT memory ready.
> > > [    7.042289] [drm:ni_init_microcode]
> > > [    7.042639] [drm] Loading TURKS Microcode
> > > [    7.043047] [drm] Internal thermal controller with fan control
> > > [    7.059713] [drm] radeon: dpm initialized
> > > [    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
> > > [    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
> > > radeon.pcie_gen2=0
> > > [    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
> > > [    7.169257] radeon 0000:01:00.0: WB enabled
> > > [    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
> > > addr 0x0000000040000c00
> > > [    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
> > > addr 0x0000000040000c0c
> > > [    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
> > > addr 0x0000000000072118
> > > [    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> > > [    7.184105] radeon 0000:01:00.0: radeon: using MSI.
> > > [    7.184571] [drm:drm_irq_install] irq=97
> > > [    7.185619] [drm] radeon: irq initialized.
> > > [    7.186795] radeon 0000:01:00.0: enabling bus mastering
> > > [    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
> > > start: rptr 0, wptr 96
> > > [    7.188118] [drm:evergreen_irq_process] IH: D1 flip
> > > [    7.188563] [drm:evergreen_irq_process] IH: D2 flip
> > > [    7.189006] [drm:evergreen_irq_process] IH: D3 flip
> > > [    7.189450] [drm:evergreen_irq_process] IH: D4 flip
> > > [    7.189894] [drm:evergreen_irq_process] IH: D5 flip
> > > [    7.190337] [drm:evergreen_irq_process] IH: D6 flip
> > > [    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
> > > start: rptr 96, wptr 96
> > > [    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
> > > (scratch(0x8504)=0xCAFEDEAD)
> > > [    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
> > > [    7.533961] [drm:drm_irq_uninstall] irq=97
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25 13:05     ` Alex Deucher
@ 2021-05-25 13:18       ` Peter Geis
  2021-05-25 20:09       ` Robin Murphy
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Geis @ 2021-05-25 13:18 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Christian Koenig, amd-gfx list

On Tue, May 25, 2021 at 9:05 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com> wrote:
> >
> > On Tue, May 25, 2021 at 8:47 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> wrote:
> > > >
> > > > Good Evening,
> > > >
> > > > I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
> > > > This device has 1GB available at <0x3 0x00000000> for the PCIe
> > > > controller, which makes a dGPU theoretically possible.
> > > > While attempting to light off a HD7570 card I manage to get a modeset
> > > > console, but ring0 test fails and disables acceleration.
> > > >
> > > > Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
> > > > Any insight you can provide would be much appreciated.
> > >
> > > Does your platform support PCIe cache coherency with the CPU?  I.e.,
> > > does the CPU allow cache snoops from PCIe devices?  That is required
> > > for the driver to operate.
> >
> > Ah, most likely not.
> > This issue has come up already as the GIC isn't permitted to snoop on
> > the CPUs, so I doubt the PCIe controller can either.
> >
> > Is there no way to work around this or is it dead in the water?
>
> It's required by the pcie spec.  You could potentially work around it
> if you can allocate uncached memory for DMA, but I don't think that is
> possible currently.  Ideally we'd figure out some way to detect if a
> particular platform supports cache snooping or not as well.
>
> Alex

Okay. Considering the RPi crew have similar issues trying to get dGPUs
working on the RPi4, I believe this is unfortunately a common issue on
arm64 consumer devices.
A common test for cache snooping would benefit other subsystems too,
since the GIC-V3 on the rk356x series requires an ugly hack to disable
cache snooping.

Thank you for your time,
Peter

>
>
> >
> > >
> > > Alex
> > >
> > >
> > > >
> > > > Very Respectfully,
> > > > Peter Geis
> > > >
> > > > lspci -v
> > > > 00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 3566
> > > > (rev 01) (prog-if 00 [Normal decode])
> > > >         Flags: bus master, fast devsel, latency 0, IRQ 96
> > > >         Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
> > > >         I/O behind bridge: 00001000-00001fff [size=4K]
> > > >         Memory behind bridge: 00900000-009fffff [size=1M]
> > > >         Prefetchable memory behind bridge:
> > > > 0000000010000000-000000001fffffff [size=256M]
> > > >         Expansion ROM at 300a00000 [virtual] [disabled] [size=64K]
> > > >         Capabilities: [40] Power Management version 3
> > > >         Capabilities: [50] MSI: Enable+ Count=1/32 Maskable- 64bit+
> > > >         Capabilities: [70] Express Root Port (Slot-), MSI 00
> > > >         Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
> > > >         Capabilities: [100] Advanced Error Reporting
> > > >         Capabilities: [148] Secondary PCI Express
> > > >         Capabilities: [160] L1 PM Substates
> > > >         Capabilities: [170] Vendor Specific Information: ID=0002 Rev=4
> > > > Len=100 <?>
> > > >         Kernel driver in use: pcieport
> > > >
> > > > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > > > [AMD/ATI] Turks PRO [Radeon HD 7570] (prog-if 00 [VGA controller])
> > > >         Subsystem: Dell Turks PRO [Radeon HD 7570]
> > > >         Flags: bus master, fast devsel, latency 0, IRQ 95
> > > >         Memory at 310000000 (64-bit, prefetchable) [size=256M]
> > > >         Memory at 300900000 (64-bit, non-prefetchable) [size=128K]
> > > >         I/O ports at 1000 [size=256]
> > > >         Expansion ROM at 300920000 [disabled] [size=128K]
> > > >         Capabilities: [50] Power Management version 3
> > > >         Capabilities: [58] Express Legacy Endpoint, MSI 00
> > > >         Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > >         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > > > Len=010 <?>
> > > >         Capabilities: [150] Advanced Error Reporting
> > > >         Kernel driver in use: radeon
> > > >
> > > > 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Turks
> > > > HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> > > >         Subsystem: Dell Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> > > >         Flags: bus master, fast devsel, latency 0, IRQ 98
> > > >         Memory at 300940000 (64-bit, non-prefetchable) [size=16K]
> > > >         Capabilities: [50] Power Management version 3
> > > >         Capabilities: [58] Express Legacy Endpoint, MSI 00
> > > >         Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > > >         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > > > Len=010 <?>
> > > >         Capabilities: [150] Advanced Error Reporting
> > > >         Kernel driver in use: snd_hda_intel
> > > >
> > > > [    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
> > > > vpcie3v3-supply from device tree
> > > > [    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
> > > > /pcie@fe260000 ranges:
> > > > [    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
> > > > [    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
> > > > 0x0300800000..0x03008fffff -> 0x0000800000
> > > > [    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
> > > > 0x0300900000..0x033fffffff -> 0x0000900000
> > > > [    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
> > > > [    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
> > > > [    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
> > > > [    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
> > > > [    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
> > > > 8 outbound, 8 inbound
> > > > [    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
> > > > [    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
> > > > [    6.653142] pci_bus 0000:00: root bus resource [bus 00]
> > > > [    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > > > (bus address [0x800000-0x8fffff])
> > > > [    6.654781] pci_bus 0000:00: root bus resource [mem
> > > > 0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
> > > > [    6.655782] pci_bus 0000:00: scanning bus
> > > > [    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
> > > > can't handle them)
> > > > [    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
> > > > [    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
> > > > [    6.659923] pci 0000:00:00.0: supports D1 D2
> > > > [    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > > > [    6.661053] pci 0000:00:00.0: PME# disabled
> > > > [    6.672578] pci_bus 0000:00: fixups for bus
> > > > [    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
> > > > [    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
> > > > under [bus 00] (conflicts with (null) [bus 00])
> > > > [    6.675993] pci_bus 0000:01: scanning bus
> > > > [    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
> > > > [    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
> > > > 64bit pref]
> > > > [    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
> > > > [    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
> > > > [    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
> > > > [    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
> > > > [    6.682170] pci 0000:01:00.0: supports D1 D2
> > > > [    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > > > limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
> > > > Gb/s with 2.5 GT/s PCIe x16 link)
> > > > [    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
> > > > decodes=io+mem,owns=none,locks=none
> > > > [    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
> > > > [    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> > > > [    6.691099] pci 0000:01:00.1: supports D1 D2
> > > > [    6.702495] pci_bus 0000:01: fixups for bus
> > > > [    6.702935] pci_bus 0000:01: bus scan returning with max=01
> > > > [    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
> > > > [    6.704171] pci_bus 0000:00: bus scan returning with max=ff
> > > > [    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
> > > > 0x310000000-0x31fffffff 64bit pref]
> > > > [    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
> > > > [    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
> > > > 0x300a00000-0x300a0ffff pref]
> > > > [    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
> > > > [    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
> > > > 0x310000000-0x31fffffff 64bit pref]
> > > > [    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
> > > > 0x300900000-0x30091ffff 64bit]
> > > > [    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
> > > > 0x300920000-0x30093ffff pref]
> > > > [    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
> > > > 0x300940000-0x300943fff 64bit]
> > > > [    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
> > > > [    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> > > > [    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
> > > > [    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
> > > > [    6.713278] pci 0000:00:00.0:   bridge window [mem
> > > > 0x310000000-0x31fffffff 64bit pref]
> > > > [    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
> > > > [    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
> > > > [    6.751738] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x0 (reading 0x35661d87)
> > > > [    6.752495] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x4 (reading 0x100507)
> > > > [    6.753224] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x8 (reading 0x6040001)
> > > > [    6.754217] pcieport 0000:00:00.0: saving config space at offset
> > > > 0xc (reading 0x10000)
> > > > [    6.754942] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x10 (reading 0x0)
> > > > [    6.755640] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x14 (reading 0x0)
> > > > [    6.756337] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x18 (reading 0xff0100)
> > > > [    6.757073] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x1c (reading 0x20001010)
> > > > [    6.757878] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x20 (reading 0x900090)
> > > > [    6.758614] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x24 (reading 0x1ff11001)
> > > > [    6.759361] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x28 (reading 0x0)
> > > > [    6.760057] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x2c (reading 0x0)
> > > > [    6.760752] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x30 (reading 0x0)
> > > > [    6.761501] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x34 (reading 0x40)
> > > > [    6.762206] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x38 (reading 0x0)
> > > > [    6.762902] pcieport 0000:00:00.0: saving config space at offset
> > > > 0x3c (reading 0x2015f)
> > > > [    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
> > > > [    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
> > > > [    6.766911] [drm:drm_minor_register]
> > > > [    6.770051] [drm:drm_minor_register] new minor registered 128
> > > > [    6.770606] [drm:drm_minor_register]
> > > > [    6.771958] [drm:drm_minor_register] new minor registered 0
> > > > [    6.772640] [drm] initializing kernel modesetting (TURKS
> > > > 0x1002:0x675D 0x1028:0x2B20 0x00).
> > > > [    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
> > > > [    7.029814] ATOM BIOS: TURKS
> > > > [    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
> > > > 00000000 0kb
> > > > [    7.030901] [drm] GPU not posted. posting now...
> > > > [    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> > > > 0x000000003FFFFFFF (1024M used)
> > > > [    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> > > > 0x000000007FFFFFFF
> > > > [    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
> > > > [    7.039533] [drm] RAM width 128bits DDR
> > > > [    7.040975] [drm] radeon: 1024M of VRAM memory ready
> > > > [    7.041543] [drm] radeon: 1024M of GTT memory ready.
> > > > [    7.042289] [drm:ni_init_microcode]
> > > > [    7.042639] [drm] Loading TURKS Microcode
> > > > [    7.043047] [drm] Internal thermal controller with fan control
> > > > [    7.059713] [drm] radeon: dpm initialized
> > > > [    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
> > > > [    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
> > > > radeon.pcie_gen2=0
> > > > [    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
> > > > [    7.169257] radeon 0000:01:00.0: WB enabled
> > > > [    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
> > > > addr 0x0000000040000c00
> > > > [    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
> > > > addr 0x0000000040000c0c
> > > > [    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
> > > > addr 0x0000000000072118
> > > > [    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> > > > [    7.184105] radeon 0000:01:00.0: radeon: using MSI.
> > > > [    7.184571] [drm:drm_irq_install] irq=97
> > > > [    7.185619] [drm] radeon: irq initialized.
> > > > [    7.186795] radeon 0000:01:00.0: enabling bus mastering
> > > > [    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
> > > > start: rptr 0, wptr 96
> > > > [    7.188118] [drm:evergreen_irq_process] IH: D1 flip
> > > > [    7.188563] [drm:evergreen_irq_process] IH: D2 flip
> > > > [    7.189006] [drm:evergreen_irq_process] IH: D3 flip
> > > > [    7.189450] [drm:evergreen_irq_process] IH: D4 flip
> > > > [    7.189894] [drm:evergreen_irq_process] IH: D5 flip
> > > > [    7.190337] [drm:evergreen_irq_process] IH: D6 flip
> > > > [    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
> > > > start: rptr 96, wptr 96
> > > > [    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
> > > > (scratch(0x8504)=0xCAFEDEAD)
> > > > [    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
> > > > [    7.533961] [drm:drm_irq_uninstall] irq=97
> > > > _______________________________________________
> > > > amd-gfx mailing list
> > > > amd-gfx@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25  2:34 radeon ring 0 test failed on arm64 Peter Geis
  2021-05-25 12:46 ` Alex Deucher
@ 2021-05-25 14:08 ` Christian König
  2021-05-25 14:19   ` Peter Geis
  1 sibling, 1 reply; 45+ messages in thread
From: Christian König @ 2021-05-25 14:08 UTC (permalink / raw)
  To: Peter Geis, alexander.deucher; +Cc: amd-gfx

Hi Peter,

some comment additionally what Alex said.

Am 25.05.21 um 04:34 schrieb Peter Geis:
> Good Evening,
>
> I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
> This device has 1GB available at <0x3 0x00000000> for the PCIe
> controller, which makes a dGPU theoretically possible.
> While attempting to light off a HD7570 card I manage to get a modeset
> console, but ring0 test fails and disables acceleration.
>
> Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
> Any insight you can provide would be much appreciated.
>
> Very Respectfully,
> Peter Geis
>
> lspci -v
> 00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 3566
> (rev 01) (prog-if 00 [Normal decode])
>          Flags: bus master, fast devsel, latency 0, IRQ 96
>          Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
>          I/O behind bridge: 00001000-00001fff [size=4K]
>          Memory behind bridge: 00900000-009fffff [size=1M]
>          Prefetchable memory behind bridge:
> 0000000010000000-000000001fffffff [size=256M]
>          Expansion ROM at 300a00000 [virtual] [disabled] [size=64K]
>          Capabilities: [40] Power Management version 3
>          Capabilities: [50] MSI: Enable+ Count=1/32 Maskable- 64bit+
>          Capabilities: [70] Express Root Port (Slot-), MSI 00
>          Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
>          Capabilities: [100] Advanced Error Reporting
>          Capabilities: [148] Secondary PCI Express
>          Capabilities: [160] L1 PM Substates
>          Capabilities: [170] Vendor Specific Information: ID=0002 Rev=4
> Len=100 <?>
>          Kernel driver in use: pcieport
>
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Turks PRO [Radeon HD 7570] (prog-if 00 [VGA controller])
>          Subsystem: Dell Turks PRO [Radeon HD 7570]
>          Flags: bus master, fast devsel, latency 0, IRQ 95
>          Memory at 310000000 (64-bit, prefetchable) [size=256M]

>          Memory at 300900000 (64-bit, non-prefetchable) [size=128K]

This here...

>          I/O ports at 1000 [size=256]
>          Expansion ROM at 300920000 [disabled] [size=128K]
>          Capabilities: [50] Power Management version 3
>          Capabilities: [58] Express Legacy Endpoint, MSI 00
>          Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
>          Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
>          Capabilities: [150] Advanced Error Reporting
>          Kernel driver in use: radeon
>
> 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Turks
> HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
>          Subsystem: Dell Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
>          Flags: bus master, fast devsel, latency 0, IRQ 98

>          Memory at 300940000 (64-bit, non-prefetchable) [size=16K]

And that look rather fishy to me. The non-prefetchable memory on AMD 
GPUs is 32bit, bit 64bit.

Looks like something is wrong with the detection code here.

Christian.

>          Capabilities: [50] Power Management version 3
>          Capabilities: [58] Express Legacy Endpoint, MSI 00
>          Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>          Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
>          Capabilities: [150] Advanced Error Reporting
>          Kernel driver in use: snd_hda_intel
>
> [    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
> vpcie3v3-supply from device tree
> [    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
> /pcie@fe260000 ranges:
> [    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
> [    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
> 0x0300800000..0x03008fffff -> 0x0000800000
> [    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
> 0x0300900000..0x033fffffff -> 0x0000900000
> [    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
> [    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
> [    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
> [    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
> [    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
> 8 outbound, 8 inbound
> [    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
> [    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
> [    6.653142] pci_bus 0000:00: root bus resource [bus 00]
> [    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> (bus address [0x800000-0x8fffff])
> [    6.654781] pci_bus 0000:00: root bus resource [mem
> 0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
> [    6.655782] pci_bus 0000:00: scanning bus
> [    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
> can't handle them)
> [    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
> [    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
> [    6.659923] pci 0000:00:00.0: supports D1 D2
> [    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> [    6.661053] pci 0000:00:00.0: PME# disabled
> [    6.672578] pci_bus 0000:00: fixups for bus
> [    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
> [    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
> under [bus 00] (conflicts with (null) [bus 00])
> [    6.675993] pci_bus 0000:01: scanning bus
> [    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
> [    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
> 64bit pref]
> [    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
> [    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
> [    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
> [    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
> [    6.682170] pci 0000:01:00.0: supports D1 D2
> [    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
> Gb/s with 2.5 GT/s PCIe x16 link)
> [    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
> decodes=io+mem,owns=none,locks=none
> [    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
> [    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> [    6.691099] pci 0000:01:00.1: supports D1 D2
> [    6.702495] pci_bus 0000:01: fixups for bus
> [    6.702935] pci_bus 0000:01: bus scan returning with max=01
> [    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
> [    6.704171] pci_bus 0000:00: bus scan returning with max=ff
> [    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
> 0x310000000-0x31fffffff 64bit pref]
> [    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
> [    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
> 0x300a00000-0x300a0ffff pref]
> [    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
> [    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
> 0x310000000-0x31fffffff 64bit pref]
> [    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
> 0x300900000-0x30091ffff 64bit]
> [    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
> 0x300920000-0x30093ffff pref]
> [    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
> 0x300940000-0x300943fff 64bit]
> [    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
> [    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> [    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
> [    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
> [    6.713278] pci 0000:00:00.0:   bridge window [mem
> 0x310000000-0x31fffffff 64bit pref]
> [    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
> [    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
> [    6.751738] pcieport 0000:00:00.0: saving config space at offset
> 0x0 (reading 0x35661d87)
> [    6.752495] pcieport 0000:00:00.0: saving config space at offset
> 0x4 (reading 0x100507)
> [    6.753224] pcieport 0000:00:00.0: saving config space at offset
> 0x8 (reading 0x6040001)
> [    6.754217] pcieport 0000:00:00.0: saving config space at offset
> 0xc (reading 0x10000)
> [    6.754942] pcieport 0000:00:00.0: saving config space at offset
> 0x10 (reading 0x0)
> [    6.755640] pcieport 0000:00:00.0: saving config space at offset
> 0x14 (reading 0x0)
> [    6.756337] pcieport 0000:00:00.0: saving config space at offset
> 0x18 (reading 0xff0100)
> [    6.757073] pcieport 0000:00:00.0: saving config space at offset
> 0x1c (reading 0x20001010)
> [    6.757878] pcieport 0000:00:00.0: saving config space at offset
> 0x20 (reading 0x900090)
> [    6.758614] pcieport 0000:00:00.0: saving config space at offset
> 0x24 (reading 0x1ff11001)
> [    6.759361] pcieport 0000:00:00.0: saving config space at offset
> 0x28 (reading 0x0)
> [    6.760057] pcieport 0000:00:00.0: saving config space at offset
> 0x2c (reading 0x0)
> [    6.760752] pcieport 0000:00:00.0: saving config space at offset
> 0x30 (reading 0x0)
> [    6.761501] pcieport 0000:00:00.0: saving config space at offset
> 0x34 (reading 0x40)
> [    6.762206] pcieport 0000:00:00.0: saving config space at offset
> 0x38 (reading 0x0)
> [    6.762902] pcieport 0000:00:00.0: saving config space at offset
> 0x3c (reading 0x2015f)
> [    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
> [    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
> [    6.766911] [drm:drm_minor_register]
> [    6.770051] [drm:drm_minor_register] new minor registered 128
> [    6.770606] [drm:drm_minor_register]
> [    6.771958] [drm:drm_minor_register] new minor registered 0
> [    6.772640] [drm] initializing kernel modesetting (TURKS
> 0x1002:0x675D 0x1028:0x2B20 0x00).
> [    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
> [    7.029814] ATOM BIOS: TURKS
> [    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
> 00000000 0kb
> [    7.030901] [drm] GPU not posted. posting now...
> [    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> 0x000000003FFFFFFF (1024M used)
> [    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> 0x000000007FFFFFFF
> [    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
> [    7.039533] [drm] RAM width 128bits DDR
> [    7.040975] [drm] radeon: 1024M of VRAM memory ready
> [    7.041543] [drm] radeon: 1024M of GTT memory ready.
> [    7.042289] [drm:ni_init_microcode]
> [    7.042639] [drm] Loading TURKS Microcode
> [    7.043047] [drm] Internal thermal controller with fan control
> [    7.059713] [drm] radeon: dpm initialized
> [    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
> [    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
> radeon.pcie_gen2=0
> [    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
> [    7.169257] radeon 0000:01:00.0: WB enabled
> [    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
> addr 0x0000000040000c00
> [    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
> addr 0x0000000040000c0c
> [    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
> addr 0x0000000000072118
> [    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> [    7.184105] radeon 0000:01:00.0: radeon: using MSI.
> [    7.184571] [drm:drm_irq_install] irq=97
> [    7.185619] [drm] radeon: irq initialized.
> [    7.186795] radeon 0000:01:00.0: enabling bus mastering
> [    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
> start: rptr 0, wptr 96
> [    7.188118] [drm:evergreen_irq_process] IH: D1 flip
> [    7.188563] [drm:evergreen_irq_process] IH: D2 flip
> [    7.189006] [drm:evergreen_irq_process] IH: D3 flip
> [    7.189450] [drm:evergreen_irq_process] IH: D4 flip
> [    7.189894] [drm:evergreen_irq_process] IH: D5 flip
> [    7.190337] [drm:evergreen_irq_process] IH: D6 flip
> [    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
> start: rptr 96, wptr 96
> [    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
> (scratch(0x8504)=0xCAFEDEAD)
> [    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
> [    7.533961] [drm:drm_irq_uninstall] irq=97

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25 14:08 ` Christian König
@ 2021-05-25 14:19   ` Peter Geis
  2021-05-25 15:09     ` Christian König
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Geis @ 2021-05-25 14:19 UTC (permalink / raw)
  To: Christian König; +Cc: alexander.deucher, amd-gfx

On Tue, May 25, 2021 at 10:08 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Peter,
>
> some comment additionally what Alex said.
>
> Am 25.05.21 um 04:34 schrieb Peter Geis:
> > Good Evening,
> >
> > I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
> > This device has 1GB available at <0x3 0x00000000> for the PCIe
> > controller, which makes a dGPU theoretically possible.
> > While attempting to light off a HD7570 card I manage to get a modeset
> > console, but ring0 test fails and disables acceleration.
> >
> > Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
> > Any insight you can provide would be much appreciated.
> >
> > Very Respectfully,
> > Peter Geis
> >
> > lspci -v
> > 00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 3566
> > (rev 01) (prog-if 00 [Normal decode])
> >          Flags: bus master, fast devsel, latency 0, IRQ 96
> >          Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
> >          I/O behind bridge: 00001000-00001fff [size=4K]
> >          Memory behind bridge: 00900000-009fffff [size=1M]
> >          Prefetchable memory behind bridge:
> > 0000000010000000-000000001fffffff [size=256M]
> >          Expansion ROM at 300a00000 [virtual] [disabled] [size=64K]
> >          Capabilities: [40] Power Management version 3
> >          Capabilities: [50] MSI: Enable+ Count=1/32 Maskable- 64bit+
> >          Capabilities: [70] Express Root Port (Slot-), MSI 00
> >          Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
> >          Capabilities: [100] Advanced Error Reporting
> >          Capabilities: [148] Secondary PCI Express
> >          Capabilities: [160] L1 PM Substates
> >          Capabilities: [170] Vendor Specific Information: ID=0002 Rev=4
> > Len=100 <?>
> >          Kernel driver in use: pcieport
> >
> > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > [AMD/ATI] Turks PRO [Radeon HD 7570] (prog-if 00 [VGA controller])
> >          Subsystem: Dell Turks PRO [Radeon HD 7570]
> >          Flags: bus master, fast devsel, latency 0, IRQ 95
> >          Memory at 310000000 (64-bit, prefetchable) [size=256M]
>
> >          Memory at 300900000 (64-bit, non-prefetchable) [size=128K]
>
> This here...
>
> >          I/O ports at 1000 [size=256]
> >          Expansion ROM at 300920000 [disabled] [size=128K]
> >          Capabilities: [50] Power Management version 3
> >          Capabilities: [58] Express Legacy Endpoint, MSI 00
> >          Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> >          Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > Len=010 <?>
> >          Capabilities: [150] Advanced Error Reporting
> >          Kernel driver in use: radeon
> >
> > 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Turks
> > HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> >          Subsystem: Dell Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]
> >          Flags: bus master, fast devsel, latency 0, IRQ 98
>
> >          Memory at 300940000 (64-bit, non-prefetchable) [size=16K]
>
> And that look rather fishy to me. The non-prefetchable memory on AMD
> GPUs is 32bit, bit 64bit.
>
> Looks like something is wrong with the detection code here.
>
> Christian.

Yes, you are correct. There's something weird with the allocation
detection code and flags.
It's currently being discussed on [1].
Perhaps some crosstalk would be beneficial.

I did notice that even if I flag the memory in the device-tree ranges
as 32bit when the final allocation occurs it's flagged as 64bit.
But it changes the behavior, because if it's flagged as 64bit in the
device tree the allocation fails for most of the AMD BARs.

[1]https://lore.kernel.org/lkml/7a1e2ebc-f7d8-8431-d844-41a9c36a8911@arm.com/

>
> >          Capabilities: [50] Power Management version 3
> >          Capabilities: [58] Express Legacy Endpoint, MSI 00
> >          Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >          Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
> > Len=010 <?>
> >          Capabilities: [150] Advanced Error Reporting
> >          Kernel driver in use: snd_hda_intel
> >
> > [    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
> > vpcie3v3-supply from device tree
> > [    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
> > /pcie@fe260000 ranges:
> > [    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
> > [    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
> > 0x0300800000..0x03008fffff -> 0x0000800000
> > [    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
> > 0x0300900000..0x033fffffff -> 0x0000900000
> > [    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
> > [    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
> > [    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
> > [    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
> > [    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
> > 8 outbound, 8 inbound
> > [    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
> > [    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
> > [    6.653142] pci_bus 0000:00: root bus resource [bus 00]
> > [    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
> > (bus address [0x800000-0x8fffff])
> > [    6.654781] pci_bus 0000:00: root bus resource [mem
> > 0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
> > [    6.655782] pci_bus 0000:00: scanning bus
> > [    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
> > can't handle them)
> > [    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
> > [    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
> > [    6.659923] pci 0000:00:00.0: supports D1 D2
> > [    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> > [    6.661053] pci 0000:00:00.0: PME# disabled
> > [    6.672578] pci_bus 0000:00: fixups for bus
> > [    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
> > [    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
> > under [bus 00] (conflicts with (null) [bus 00])
> > [    6.675993] pci_bus 0000:01: scanning bus
> > [    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
> > [    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
> > 64bit pref]
> > [    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
> > [    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
> > [    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
> > [    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
> > [    6.682170] pci 0000:01:00.0: supports D1 D2
> > [    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
> > limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
> > Gb/s with 2.5 GT/s PCIe x16 link)
> > [    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
> > decodes=io+mem,owns=none,locks=none
> > [    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
> > [    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
> > [    6.691099] pci 0000:01:00.1: supports D1 D2
> > [    6.702495] pci_bus 0000:01: fixups for bus
> > [    6.702935] pci_bus 0000:01: bus scan returning with max=01
> > [    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
> > [    6.704171] pci_bus 0000:00: bus scan returning with max=ff
> > [    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
> > 0x310000000-0x31fffffff 64bit pref]
> > [    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
> > [    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
> > 0x300a00000-0x300a0ffff pref]
> > [    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
> > [    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
> > 0x310000000-0x31fffffff 64bit pref]
> > [    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
> > 0x300900000-0x30091ffff 64bit]
> > [    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
> > 0x300920000-0x30093ffff pref]
> > [    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
> > 0x300940000-0x300943fff 64bit]
> > [    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
> > [    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> > [    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
> > [    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
> > [    6.713278] pci 0000:00:00.0:   bridge window [mem
> > 0x310000000-0x31fffffff 64bit pref]
> > [    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
> > [    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
> > [    6.751738] pcieport 0000:00:00.0: saving config space at offset
> > 0x0 (reading 0x35661d87)
> > [    6.752495] pcieport 0000:00:00.0: saving config space at offset
> > 0x4 (reading 0x100507)
> > [    6.753224] pcieport 0000:00:00.0: saving config space at offset
> > 0x8 (reading 0x6040001)
> > [    6.754217] pcieport 0000:00:00.0: saving config space at offset
> > 0xc (reading 0x10000)
> > [    6.754942] pcieport 0000:00:00.0: saving config space at offset
> > 0x10 (reading 0x0)
> > [    6.755640] pcieport 0000:00:00.0: saving config space at offset
> > 0x14 (reading 0x0)
> > [    6.756337] pcieport 0000:00:00.0: saving config space at offset
> > 0x18 (reading 0xff0100)
> > [    6.757073] pcieport 0000:00:00.0: saving config space at offset
> > 0x1c (reading 0x20001010)
> > [    6.757878] pcieport 0000:00:00.0: saving config space at offset
> > 0x20 (reading 0x900090)
> > [    6.758614] pcieport 0000:00:00.0: saving config space at offset
> > 0x24 (reading 0x1ff11001)
> > [    6.759361] pcieport 0000:00:00.0: saving config space at offset
> > 0x28 (reading 0x0)
> > [    6.760057] pcieport 0000:00:00.0: saving config space at offset
> > 0x2c (reading 0x0)
> > [    6.760752] pcieport 0000:00:00.0: saving config space at offset
> > 0x30 (reading 0x0)
> > [    6.761501] pcieport 0000:00:00.0: saving config space at offset
> > 0x34 (reading 0x40)
> > [    6.762206] pcieport 0000:00:00.0: saving config space at offset
> > 0x38 (reading 0x0)
> > [    6.762902] pcieport 0000:00:00.0: saving config space at offset
> > 0x3c (reading 0x2015f)
> > [    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
> > [    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
> > [    6.766911] [drm:drm_minor_register]
> > [    6.770051] [drm:drm_minor_register] new minor registered 128
> > [    6.770606] [drm:drm_minor_register]
> > [    6.771958] [drm:drm_minor_register] new minor registered 0
> > [    6.772640] [drm] initializing kernel modesetting (TURKS
> > 0x1002:0x675D 0x1028:0x2B20 0x00).
> > [    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
> > [    7.029814] ATOM BIOS: TURKS
> > [    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
> > 00000000 0kb
> > [    7.030901] [drm] GPU not posted. posting now...
> > [    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
> > 0x000000003FFFFFFF (1024M used)
> > [    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
> > 0x000000007FFFFFFF
> > [    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
> > [    7.039533] [drm] RAM width 128bits DDR
> > [    7.040975] [drm] radeon: 1024M of VRAM memory ready
> > [    7.041543] [drm] radeon: 1024M of GTT memory ready.
> > [    7.042289] [drm:ni_init_microcode]
> > [    7.042639] [drm] Loading TURKS Microcode
> > [    7.043047] [drm] Internal thermal controller with fan control
> > [    7.059713] [drm] radeon: dpm initialized
> > [    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
> > [    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
> > radeon.pcie_gen2=0
> > [    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
> > [    7.169257] radeon 0000:01:00.0: WB enabled
> > [    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
> > addr 0x0000000040000c00
> > [    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
> > addr 0x0000000040000c0c
> > [    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
> > addr 0x0000000000072118
> > [    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> > [    7.184105] radeon 0000:01:00.0: radeon: using MSI.
> > [    7.184571] [drm:drm_irq_install] irq=97
> > [    7.185619] [drm] radeon: irq initialized.
> > [    7.186795] radeon 0000:01:00.0: enabling bus mastering
> > [    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
> > start: rptr 0, wptr 96
> > [    7.188118] [drm:evergreen_irq_process] IH: D1 flip
> > [    7.188563] [drm:evergreen_irq_process] IH: D2 flip
> > [    7.189006] [drm:evergreen_irq_process] IH: D3 flip
> > [    7.189450] [drm:evergreen_irq_process] IH: D4 flip
> > [    7.189894] [drm:evergreen_irq_process] IH: D5 flip
> > [    7.190337] [drm:evergreen_irq_process] IH: D6 flip
> > [    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
> > start: rptr 96, wptr 96
> > [    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
> > (scratch(0x8504)=0xCAFEDEAD)
> > [    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
> > [    7.533961] [drm:drm_irq_uninstall] irq=97
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25 14:19   ` Peter Geis
@ 2021-05-25 15:09     ` Christian König
  0 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2021-05-25 15:09 UTC (permalink / raw)
  To: Peter Geis, Christian König; +Cc: alexander.deucher, amd-gfx

Am 25.05.21 um 16:19 schrieb Peter Geis:
> On Tue, May 25, 2021 at 10:08 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Hi Peter,
>>
>> some comment additionally what Alex said.
>>
>> Am 25.05.21 um 04:34 schrieb Peter Geis:
>> [SNIP]
>>>           Memory at 300900000 (64-bit, non-prefetchable) [size=128K]
>> This here...
>>
>>>       [SNIP]
>>>           Memory at 300940000 (64-bit, non-prefetchable) [size=16K]
>> And that look rather fishy to me. The non-prefetchable memory on AMD
>> GPUs is 32bit, bit 64bit.
>>
>> Looks like something is wrong with the detection code here.
>>
>> Christian.
> Yes, you are correct. There's something weird with the allocation
> detection code and flags.
> It's currently being discussed on [1].
> Perhaps some crosstalk would be beneficial.
>
> I did notice that even if I flag the memory in the device-tree ranges
> as 32bit when the final allocation occurs it's flagged as 64bit.
> But it changes the behavior, because if it's flagged as 64bit in the
> device tree the allocation fails for most of the AMD BARs.

Well that the allocation fails is the least of your problems.

When you program a 32bit BAR as 64bit you overwrite the register behind 
the BAR address with the upper 32bits of the 64bit address value.

So even if the allocation fits into 32bits, the extra register write 
will certainly put your device into a banana state.

Feel free to loop me in on those discussions.

Regards,
Christian.


> [1]https://lore.kernel.org/lkml/7a1e2ebc-f7d8-8431-d844-41a9c36a8911@arm.com/
>
>>>           Capabilities: [50] Power Management version 3
>>>           Capabilities: [58] Express Legacy Endpoint, MSI 00
>>>           Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>>           Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1
>>> Len=010 <?>
>>>           Capabilities: [150] Advanced Error Reporting
>>>           Kernel driver in use: snd_hda_intel
>>>
>>> [    6.431312] rockchip-dw-pcie 3c0000000.pcie: Looking up
>>> vpcie3v3-supply from device tree
>>> [    6.434619] rockchip-dw-pcie 3c0000000.pcie: host bridge
>>> /pcie@fe260000 ranges:
>>> [    6.435350] rockchip-dw-pcie 3c0000000.pcie: Parsing ranges property...
>>> [    6.436018] rockchip-dw-pcie 3c0000000.pcie:       IO
>>> 0x0300800000..0x03008fffff -> 0x0000800000
>>> [    6.436978] rockchip-dw-pcie 3c0000000.pcie:      MEM
>>> 0x0300900000..0x033fffffff -> 0x0000900000
>>> [    6.438065] rockchip-dw-pcie 3c0000000.pcie: got 49 for legacy interrupt
>>> [    6.439386] rockchip-dw-pcie 3c0000000.pcie: found 5 interrupts
>>> [    6.439934] rockchip-dw-pcie 3c0000000.pcie: invalid resource
>>> [    6.440473] rockchip-dw-pcie 3c0000000.pcie: iATU unroll: enabled
>>> [    6.441029] rockchip-dw-pcie 3c0000000.pcie: Detected iATU regions:
>>> 8 outbound, 8 inbound
>>> [    6.650165] rockchip-dw-pcie 3c0000000.pcie: Link up
>>> [    6.652438] rockchip-dw-pcie 3c0000000.pcie: PCI host bridge to bus 0000:00
>>> [    6.653142] pci_bus 0000:00: root bus resource [bus 00]
>>> [    6.653899] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
>>> (bus address [0x800000-0x8fffff])
>>> [    6.654781] pci_bus 0000:00: root bus resource [mem
>>> 0x300900000-0x33fffffff] (bus address [0x00900000-0x3fffffff])
>>> [    6.655782] pci_bus 0000:00: scanning bus
>>> [    6.656689] pci 0000:00:00.0: disabling Extended Tags (this device
>>> can't handle them)
>>> [    6.657605] pci 0000:00:00.0: [1d87:3566] type 01 class 0x060400
>>> [    6.658418] pci 0000:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
>>> [    6.659923] pci 0000:00:00.0: supports D1 D2
>>> [    6.660360] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
>>> [    6.661053] pci 0000:00:00.0: PME# disabled
>>> [    6.672578] pci_bus 0000:00: fixups for bus
>>> [    6.673063] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 0
>>> [    6.675021] pci_bus 0000:01: busn_res: can not insert [bus 01-ff]
>>> under [bus 00] (conflicts with (null) [bus 00])
>>> [    6.675993] pci_bus 0000:01: scanning bus
>>> [    6.676705] pci 0000:01:00.0: [1002:675d] type 00 class 0x030000
>>> [    6.677672] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff
>>> 64bit pref]
>>> [    6.678493] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0001ffff 64bit]
>>> [    6.679217] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
>>> [    6.679894] pci 0000:01:00.0: reg 0x20: [io  size 0x0100]
>>> [    6.680565] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
>>> [    6.682170] pci 0000:01:00.0: supports D1 D2
>>> [    6.682897] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth,
>>> limited by 2.5 GT/s PCIe x1 link at 0000:00:00.0 (capable of 32.000
>>> Gb/s with 2.5 GT/s PCIe x16 link)
>>> [    6.686670] pci 0000:01:00.0: vgaarb: VGA device added:
>>> decodes=io+mem,owns=none,locks=none
>>> [    6.688367] pci 0000:01:00.1: [1002:aa90] type 00 class 0x040300
>>> [    6.689168] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
>>> [    6.691099] pci 0000:01:00.1: supports D1 D2
>>> [    6.702495] pci_bus 0000:01: fixups for bus
>>> [    6.702935] pci_bus 0000:01: bus scan returning with max=01
>>> [    6.703500] pci 0000:00:00.0: scanning [bus 01-ff] behind bridge, pass 1
>>> [    6.704171] pci_bus 0000:00: bus scan returning with max=ff
>>> [    6.704768] pci 0000:00:00.0: BAR 15: assigned [mem
>>> 0x310000000-0x31fffffff 64bit pref]
>>> [    6.705664] pci 0000:00:00.0: BAR 14: assigned [mem 0x300900000-0x3009fffff]
>>> [    6.706337] pci 0000:00:00.0: BAR 6: assigned [mem
>>> 0x300a00000-0x300a0ffff pref]
>>> [    6.707035] pci 0000:00:00.0: BAR 13: assigned [io  0x1000-0x1fff]
>>> [    6.707687] pci 0000:01:00.0: BAR 0: assigned [mem
>>> 0x310000000-0x31fffffff 64bit pref]
>>> [    6.708522] pci 0000:01:00.0: BAR 2: assigned [mem
>>> 0x300900000-0x30091ffff 64bit]
>>> [    6.709411] pci 0000:01:00.0: BAR 6: assigned [mem
>>> 0x300920000-0x30093ffff pref]
>>> [    6.710116] pci 0000:01:00.1: BAR 0: assigned [mem
>>> 0x300940000-0x300943fff 64bit]
>>> [    6.710897] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x10ff]
>>> [    6.711516] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
>>> [    6.712022] pci 0000:00:00.0:   bridge window [io  0x1000-0x1fff]
>>> [    6.712617] pci 0000:00:00.0:   bridge window [mem 0x300900000-0x3009fffff]
>>> [    6.713278] pci 0000:00:00.0:   bridge window [mem
>>> 0x310000000-0x31fffffff 64bit pref]
>>> [    6.716165] pcieport 0000:00:00.0: assign IRQ: got 95
>>> [    6.749839] pcieport 0000:00:00.0: PME: Signaling with IRQ 96
>>> [    6.751738] pcieport 0000:00:00.0: saving config space at offset
>>> 0x0 (reading 0x35661d87)
>>> [    6.752495] pcieport 0000:00:00.0: saving config space at offset
>>> 0x4 (reading 0x100507)
>>> [    6.753224] pcieport 0000:00:00.0: saving config space at offset
>>> 0x8 (reading 0x6040001)
>>> [    6.754217] pcieport 0000:00:00.0: saving config space at offset
>>> 0xc (reading 0x10000)
>>> [    6.754942] pcieport 0000:00:00.0: saving config space at offset
>>> 0x10 (reading 0x0)
>>> [    6.755640] pcieport 0000:00:00.0: saving config space at offset
>>> 0x14 (reading 0x0)
>>> [    6.756337] pcieport 0000:00:00.0: saving config space at offset
>>> 0x18 (reading 0xff0100)
>>> [    6.757073] pcieport 0000:00:00.0: saving config space at offset
>>> 0x1c (reading 0x20001010)
>>> [    6.757878] pcieport 0000:00:00.0: saving config space at offset
>>> 0x20 (reading 0x900090)
>>> [    6.758614] pcieport 0000:00:00.0: saving config space at offset
>>> 0x24 (reading 0x1ff11001)
>>> [    6.759361] pcieport 0000:00:00.0: saving config space at offset
>>> 0x28 (reading 0x0)
>>> [    6.760057] pcieport 0000:00:00.0: saving config space at offset
>>> 0x2c (reading 0x0)
>>> [    6.760752] pcieport 0000:00:00.0: saving config space at offset
>>> 0x30 (reading 0x0)
>>> [    6.761501] pcieport 0000:00:00.0: saving config space at offset
>>> 0x34 (reading 0x40)
>>> [    6.762206] pcieport 0000:00:00.0: saving config space at offset
>>> 0x38 (reading 0x0)
>>> [    6.762902] pcieport 0000:00:00.0: saving config space at offset
>>> 0x3c (reading 0x2015f)
>>> [    6.764350] radeon 0000:01:00.0: assign IRQ: got 95
>>> [    6.766212] radeon 0000:01:00.0: enabling device (0000 -> 0003)
>>> [    6.766911] [drm:drm_minor_register]
>>> [    6.770051] [drm:drm_minor_register] new minor registered 128
>>> [    6.770606] [drm:drm_minor_register]
>>> [    6.771958] [drm:drm_minor_register] new minor registered 0
>>> [    6.772640] [drm] initializing kernel modesetting (TURKS
>>> 0x1002:0x675D 0x1028:0x2B20 0x00).
>>> [    7.029251] [drm:radeon_get_bios] ATOMBIOS detected
>>> [    7.029814] ATOM BIOS: TURKS
>>> [    7.030100] [drm:atom_allocate_fb_scratch] atom firmware requested
>>> 00000000 0kb
>>> [    7.030901] [drm] GPU not posted. posting now...
>>> [    7.037575] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 -
>>> 0x000000003FFFFFFF (1024M used)
>>> [    7.038388] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 -
>>> 0x000000007FFFFFFF
>>> [    7.039082] [drm] Detected VRAM RAM=1024M, BAR=256M
>>> [    7.039533] [drm] RAM width 128bits DDR
>>> [    7.040975] [drm] radeon: 1024M of VRAM memory ready
>>> [    7.041543] [drm] radeon: 1024M of GTT memory ready.
>>> [    7.042289] [drm:ni_init_microcode]
>>> [    7.042639] [drm] Loading TURKS Microcode
>>> [    7.043047] [drm] Internal thermal controller with fan control
>>> [    7.059713] [drm] radeon: dpm initialized
>>> [    7.060375] [drm] GART: num cpu pages 262144, num gpu pages 262144
>>> [    7.069457] [drm] enabling PCIE gen 2 link speeds, disable with
>>> radeon.pcie_gen2=0
>>> [    7.167901] [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
>>> [    7.169257] radeon 0000:01:00.0: WB enabled
>>> [    7.169770] radeon 0000:01:00.0: fence driver on ring 0 use gpu
>>> addr 0x0000000040000c00
>>> [    7.170496] radeon 0000:01:00.0: fence driver on ring 3 use gpu
>>> addr 0x0000000040000c0c
>>> [    7.177636] radeon 0000:01:00.0: fence driver on ring 5 use gpu
>>> addr 0x0000000000072118
>>> [    7.182365] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
>>> [    7.184105] radeon 0000:01:00.0: radeon: using MSI.
>>> [    7.184571] [drm:drm_irq_install] irq=97
>>> [    7.185619] [drm] radeon: irq initialized.
>>> [    7.186795] radeon 0000:01:00.0: enabling bus mastering
>>> [    7.187346] [drm:evergreen_irq_process] evergreen_irq_process
>>> start: rptr 0, wptr 96
>>> [    7.188118] [drm:evergreen_irq_process] IH: D1 flip
>>> [    7.188563] [drm:evergreen_irq_process] IH: D2 flip
>>> [    7.189006] [drm:evergreen_irq_process] IH: D3 flip
>>> [    7.189450] [drm:evergreen_irq_process] IH: D4 flip
>>> [    7.189894] [drm:evergreen_irq_process] IH: D5 flip
>>> [    7.190337] [drm:evergreen_irq_process] IH: D6 flip
>>> [    7.190811] [drm:evergreen_irq_process] evergreen_irq_process
>>> start: rptr 96, wptr 96
>>> [    7.530753] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed
>>> (scratch(0x8504)=0xCAFEDEAD)
>>> [    7.531564] radeon 0000:01:00.0: disabling GPU acceleration
>>> [    7.533961] [drm:drm_irq_uninstall] irq=97
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25 13:05     ` Alex Deucher
  2021-05-25 13:18       ` Peter Geis
@ 2021-05-25 20:09       ` Robin Murphy
  2021-05-26  9:42         ` Christian König
  1 sibling, 1 reply; 45+ messages in thread
From: Robin Murphy @ 2021-05-25 20:09 UTC (permalink / raw)
  To: Alex Deucher, Peter Geis
  Cc: Deucher, Alexander, Christian Koenig, amd-gfx list

On 2021-05-25 14:05, Alex Deucher wrote:
> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com> wrote:
>>
>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>>>
>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> wrote:
>>>>
>>>> Good Evening,
>>>>
>>>> I am stress testing the pcie controller on the rk3566-quartz64 prototype SBC.
>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>> controller, which makes a dGPU theoretically possible.
>>>> While attempting to light off a HD7570 card I manage to get a modeset
>>>> console, but ring0 test fails and disables acceleration.
>>>>
>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux kernel.
>>>> Any insight you can provide would be much appreciated.
>>>
>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>> for the driver to operate.
>>
>> Ah, most likely not.
>> This issue has come up already as the GIC isn't permitted to snoop on
>> the CPUs, so I doubt the PCIe controller can either.
>>
>> Is there no way to work around this or is it dead in the water?
> 
> It's required by the pcie spec.  You could potentially work around it
> if you can allocate uncached memory for DMA, but I don't think that is
> possible currently.  Ideally we'd figure out some way to detect if a
> particular platform supports cache snooping or not as well.

There's device_get_dma_attr(), although I don't think it will work 
currently for PCI devices without an OF or ACPI node - we could perhaps 
do with a PCI-specific wrapper which can walk up and defer to the host 
bridge's firmware description as necessary.

The common DMA ops *do* correctly keep track of per-device coherency 
internally, but drivers aren't supposed to be poking at that information 
directly.

Robin.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-25 20:09       ` Robin Murphy
@ 2021-05-26  9:42         ` Christian König
  2021-05-26 10:59           ` Robin Murphy
  0 siblings, 1 reply; 45+ messages in thread
From: Christian König @ 2021-05-26  9:42 UTC (permalink / raw)
  To: Robin Murphy, Alex Deucher, Peter Geis
  Cc: Deucher, Alexander, Christian Koenig, amd-gfx list

Hi Robin,

Am 25.05.21 um 22:09 schrieb Robin Murphy:
> On 2021-05-25 14:05, Alex Deucher wrote:
>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com> wrote:
>>>
>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher <alexdeucher@gmail.com> 
>>> wrote:
>>>>
>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> 
>>>> wrote:
>>>>>
>>>>> Good Evening,
>>>>>
>>>>> I am stress testing the pcie controller on the rk3566-quartz64 
>>>>> prototype SBC.
>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>> controller, which makes a dGPU theoretically possible.
>>>>> While attempting to light off a HD7570 card I manage to get a modeset
>>>>> console, but ring0 test fails and disables acceleration.
>>>>>
>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux 
>>>>> kernel.
>>>>> Any insight you can provide would be much appreciated.
>>>>
>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>> for the driver to operate.
>>>
>>> Ah, most likely not.
>>> This issue has come up already as the GIC isn't permitted to snoop on
>>> the CPUs, so I doubt the PCIe controller can either.
>>>
>>> Is there no way to work around this or is it dead in the water?
>>
>> It's required by the pcie spec.  You could potentially work around it
>> if you can allocate uncached memory for DMA, but I don't think that is
>> possible currently.  Ideally we'd figure out some way to detect if a
>> particular platform supports cache snooping or not as well.
>
> There's device_get_dma_attr(), although I don't think it will work 
> currently for PCI devices without an OF or ACPI node - we could 
> perhaps do with a PCI-specific wrapper which can walk up and defer to 
> the host bridge's firmware description as necessary.
>
> The common DMA ops *do* correctly keep track of per-device coherency 
> internally, but drivers aren't supposed to be poking at that 
> information directly.

That sounds like you underestimate the problem. ARM has unfortunately 
made the coherency for PCI an optional IP.

So we are talking about a hardware limitation which potentially can't be 
fixed without replacing the hardware.

Christian.

>
> Robin.
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-26  9:42         ` Christian König
@ 2021-05-26 10:59           ` Robin Murphy
  2021-05-26 11:21             ` Christian König
  0 siblings, 1 reply; 45+ messages in thread
From: Robin Murphy @ 2021-05-26 10:59 UTC (permalink / raw)
  To: Christian König, Alex Deucher, Peter Geis
  Cc: Deucher, Alexander, Christian Koenig, amd-gfx list

On 2021-05-26 10:42, Christian König wrote:
> Hi Robin,
> 
> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>> On 2021-05-25 14:05, Alex Deucher wrote:
>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com> wrote:
>>>>
>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher <alexdeucher@gmail.com> 
>>>> wrote:
>>>>>
>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> 
>>>>> wrote:
>>>>>>
>>>>>> Good Evening,
>>>>>>
>>>>>> I am stress testing the pcie controller on the rk3566-quartz64 
>>>>>> prototype SBC.
>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>> While attempting to light off a HD7570 card I manage to get a modeset
>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>
>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux 
>>>>>> kernel.
>>>>>> Any insight you can provide would be much appreciated.
>>>>>
>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>> for the driver to operate.
>>>>
>>>> Ah, most likely not.
>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>
>>>> Is there no way to work around this or is it dead in the water?
>>>
>>> It's required by the pcie spec.  You could potentially work around it
>>> if you can allocate uncached memory for DMA, but I don't think that is
>>> possible currently.  Ideally we'd figure out some way to detect if a
>>> particular platform supports cache snooping or not as well.
>>
>> There's device_get_dma_attr(), although I don't think it will work 
>> currently for PCI devices without an OF or ACPI node - we could 
>> perhaps do with a PCI-specific wrapper which can walk up and defer to 
>> the host bridge's firmware description as necessary.
>>
>> The common DMA ops *do* correctly keep track of per-device coherency 
>> internally, but drivers aren't supposed to be poking at that 
>> information directly.
> 
> That sounds like you underestimate the problem. ARM has unfortunately 
> made the coherency for PCI an optional IP.

Sorry to be that guy, but I'm involved a lot internally with our system 
IP and interconnect, and I probably understand the situation better than 
99% of the community ;)

For the record, the SBSA specification (the closet thing we have to a 
"system architecture") does require that PCIe is integrated in an 
I/O-coherent manner, but we don't have any control over what people do 
in embedded applications (note that we don't make PCIe IP at all, and 
there is plenty of 3rd-party interconnect IP).

> So we are talking about a hardware limitation which potentially can't be 
> fixed without replacing the hardware.

You expressed interest in "some way to detect if a particular platform 
supports cache snooping or not", by which I assumed you meant a software 
method for the amdgpu/radeon drivers to call, rather than, say, a 
website that driver maintainers can look up SoC names on. I'm saying 
that that API already exists (just may need a bit more work). Note that 
it is emphatically not a platform-level thing since coherency can and 
does vary per device within a system.

I wasn't suggesting that Linux could somehow make coherency magically 
work when the signals don't physically exist in the interconnect - I was 
assuming you'd merely want to do something like throw a big warning and 
taint the kernel to help triage bug reports. Some drivers like 
ahci_qoriq and panfrost simply need to know so they can program their 
device to emit the appropriate memory attributes either way, and rely on 
the DMA API to hide the rest of the difference, but if you want to treat 
non-coherent use as unsupported because it would require too invasive 
changes that's fine by me.

Robin.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-26 10:59           ` Robin Murphy
@ 2021-05-26 11:21             ` Christian König
  2022-03-17  0:14                 ` Peter Geis
  0 siblings, 1 reply; 45+ messages in thread
From: Christian König @ 2021-05-26 11:21 UTC (permalink / raw)
  To: Robin Murphy, Christian König, Alex Deucher, Peter Geis
  Cc: Deucher, Alexander, amd-gfx list

Hi Robin,

Am 26.05.21 um 12:59 schrieb Robin Murphy:
> On 2021-05-26 10:42, Christian König wrote:
>> Hi Robin,
>>
>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com> 
>>>> wrote:
>>>>>
>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher 
>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>> Good Evening,
>>>>>>>
>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64 
>>>>>>> prototype SBC.
>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>> While attempting to light off a HD7570 card I manage to get a 
>>>>>>> modeset
>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>
>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux 
>>>>>>> kernel.
>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>
>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>> for the driver to operate.
>>>>>
>>>>> Ah, most likely not.
>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>
>>>>> Is there no way to work around this or is it dead in the water?
>>>>
>>>> It's required by the pcie spec.  You could potentially work around it
>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>> particular platform supports cache snooping or not as well.
>>>
>>> There's device_get_dma_attr(), although I don't think it will work 
>>> currently for PCI devices without an OF or ACPI node - we could 
>>> perhaps do with a PCI-specific wrapper which can walk up and defer 
>>> to the host bridge's firmware description as necessary.
>>>
>>> The common DMA ops *do* correctly keep track of per-device coherency 
>>> internally, but drivers aren't supposed to be poking at that 
>>> information directly.
>>
>> That sounds like you underestimate the problem. ARM has unfortunately 
>> made the coherency for PCI an optional IP.
>
> Sorry to be that guy, but I'm involved a lot internally with our 
> system IP and interconnect, and I probably understand the situation 
> better than 99% of the community ;)

I need to apologize, didn't realized who was answering :)

It just sounded to me that you wanted to suggest to the end user that 
this is fixable in software and I really wanted to avoid even more 
customers coming around asking how to do this.

> For the record, the SBSA specification (the closet thing we have to a 
> "system architecture") does require that PCIe is integrated in an 
> I/O-coherent manner, but we don't have any control over what people do 
> in embedded applications (note that we don't make PCIe IP at all, and 
> there is plenty of 3rd-party interconnect IP).

So basically it is not the fault of the ARM IP-core, but people are just 
stitching together PCIe interconnect IP with a core where it is not 
supposed to be used with.

Do I get that correctly? That's an interesting puzzle piece in the picture.

>> So we are talking about a hardware limitation which potentially can't 
>> be fixed without replacing the hardware.
>
> You expressed interest in "some way to detect if a particular platform 
> supports cache snooping or not", by which I assumed you meant a 
> software method for the amdgpu/radeon drivers to call, rather than, 
> say, a website that driver maintainers can look up SoC names on. I'm 
> saying that that API already exists (just may need a bit more work). 
> Note that it is emphatically not a platform-level thing since 
> coherency can and does vary per device within a system.

Well, I think this is not something an individual driver should mess 
with. What the driver should do is just express that it needs coherent 
access to all of system memory and if that is not possible fail to load 
with a warning why it is not possible.

>
> I wasn't suggesting that Linux could somehow make coherency magically 
> work when the signals don't physically exist in the interconnect - I 
> was assuming you'd merely want to do something like throw a big 
> warning and taint the kernel to help triage bug reports. Some drivers 
> like ahci_qoriq and panfrost simply need to know so they can program 
> their device to emit the appropriate memory attributes either way, and 
> rely on the DMA API to hide the rest of the difference, but if you 
> want to treat non-coherent use as unsupported because it would require 
> too invasive changes that's fine by me.

Yes exactly that please. I mean not sure how panfrost is doing it, but 
at least the Vulkan userspace API specification requires devices to have 
coherent access to system memory.

So even if I would want to do this it is simply not possible because the 
application doesn't tell the driver which memory is accessed by the 
device and which by the CPU.

Christian.

>
> Robin.

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2021-05-26 11:21             ` Christian König
@ 2022-03-17  0:14                 ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17  0:14 UTC (permalink / raw)
  To: Kever Yang, Robin Murphy, Shawn Lin
  Cc: Christian König, Christian König, Alex Deucher,
	Deucher, Alexander, amd-gfx list, open list:ARM/Rockchip SoC...

Good Evening,

I apologize for raising this email chain from the dead, but there have
been some developments that have introduced even more questions.
I've looped the Rockchip mailing list into this too, as this affects
rk356x, and likely the upcoming rk3588 if [1] is to be believed.

TLDR for those not familiar: It seems the rk356x series (and possibly
the rk3588) were built without any outer coherent cache.
This means (unless Rockchip wants to clarify here) devices such as the
ITS and PCIe cannot utilize cache snooping.
This is based on the results of the email chain [2].

The new circumstances are as follows:
The RPi CM4 Adventure Team as I've taken to calling them has been
attempting to get a dGPU working with the very broken Broadcom
controller in the RPi CM4.
Recently they acquired a SoQuartz rk3566 module which is pin
compatible with the CM4, and have taken to trying it out as well.

This is how I got involved.
It seems they found a trivial way to force the Radeon R600 driver to
use Non-Cached memory for everything.
This single line change, combined with using memset_io instead of
memset, allows the ring tests to pass and the card probes successfully
(minus the DMA limitations of the rk356x due to the 32 bit
interconnect).
I discovered using this method that we start having unaligned io
memory access faults (bus errors) when running glmark2-drm (running
glmark2 directly was impossible, as both X and Wayland crashed too
early).
I traced this to using what I thought at the time was an unsafe memcpy
in the mesa stack.
Rewriting this function to force aligned writes solved the problem and
allows glmark2-drm to run to completion.
With some extensive debugging, I found about half a dozen memcpy
functions in mesa that if forced to be aligned would allow Wayland to
start, but with hilarious display corruption (see [3]. [4]).
The CM4 team is convinced this is an issue with memcpy in glibc, but
I'm not convinced it's that simple.

On my two hour drive in to work this morning, I got to thinking.
If this was an memcpy fault, this would be universally broken on arm64
which is obviously not the case.
So I started thinking, what is different here than with systems known to work:
1. No IOMMU for the PCIe controller.
2. The Outer Cache Issue.

Robin:
My questions for you, since you're the smartest person I know about
arm64 memory management:
Could cache snooping permit unaligned accesses to IO to be safe?
Or
Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
Or
Am I insane here?

Rockchip:
Please update on the status for the Outer Cache errata for ITS services.
Please provide an answer to the errata of the PCIe controller, in
regard to cache snooping and buffering, for both the rk356x and the
upcoming rk3588.

[1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
[2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
[3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
[4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png

Thank you everyone for your time.

Very Respectfully,
Peter Geis

On Wed, May 26, 2021 at 7:21 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Robin,
>
> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> > On 2021-05-26 10:42, Christian König wrote:
> >> Hi Robin,
> >>
> >> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>
> >>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Good Evening,
> >>>>>>>
> >>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>> prototype SBC.
> >>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>> modeset
> >>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>
> >>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>> kernel.
> >>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>
> >>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>> for the driver to operate.
> >>>>>
> >>>>> Ah, most likely not.
> >>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>
> >>>>> Is there no way to work around this or is it dead in the water?
> >>>>
> >>>> It's required by the pcie spec.  You could potentially work around it
> >>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>> particular platform supports cache snooping or not as well.
> >>>
> >>> There's device_get_dma_attr(), although I don't think it will work
> >>> currently for PCI devices without an OF or ACPI node - we could
> >>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>> to the host bridge's firmware description as necessary.
> >>>
> >>> The common DMA ops *do* correctly keep track of per-device coherency
> >>> internally, but drivers aren't supposed to be poking at that
> >>> information directly.
> >>
> >> That sounds like you underestimate the problem. ARM has unfortunately
> >> made the coherency for PCI an optional IP.
> >
> > Sorry to be that guy, but I'm involved a lot internally with our
> > system IP and interconnect, and I probably understand the situation
> > better than 99% of the community ;)
>
> I need to apologize, didn't realized who was answering :)
>
> It just sounded to me that you wanted to suggest to the end user that
> this is fixable in software and I really wanted to avoid even more
> customers coming around asking how to do this.
>
> > For the record, the SBSA specification (the closet thing we have to a
> > "system architecture") does require that PCIe is integrated in an
> > I/O-coherent manner, but we don't have any control over what people do
> > in embedded applications (note that we don't make PCIe IP at all, and
> > there is plenty of 3rd-party interconnect IP).
>
> So basically it is not the fault of the ARM IP-core, but people are just
> stitching together PCIe interconnect IP with a core where it is not
> supposed to be used with.
>
> Do I get that correctly? That's an interesting puzzle piece in the picture.
>
> >> So we are talking about a hardware limitation which potentially can't
> >> be fixed without replacing the hardware.
> >
> > You expressed interest in "some way to detect if a particular platform
> > supports cache snooping or not", by which I assumed you meant a
> > software method for the amdgpu/radeon drivers to call, rather than,
> > say, a website that driver maintainers can look up SoC names on. I'm
> > saying that that API already exists (just may need a bit more work).
> > Note that it is emphatically not a platform-level thing since
> > coherency can and does vary per device within a system.
>
> Well, I think this is not something an individual driver should mess
> with. What the driver should do is just express that it needs coherent
> access to all of system memory and if that is not possible fail to load
> with a warning why it is not possible.
>
> >
> > I wasn't suggesting that Linux could somehow make coherency magically
> > work when the signals don't physically exist in the interconnect - I
> > was assuming you'd merely want to do something like throw a big
> > warning and taint the kernel to help triage bug reports. Some drivers
> > like ahci_qoriq and panfrost simply need to know so they can program
> > their device to emit the appropriate memory attributes either way, and
> > rely on the DMA API to hide the rest of the difference, but if you
> > want to treat non-coherent use as unsupported because it would require
> > too invasive changes that's fine by me.
>
> Yes exactly that please. I mean not sure how panfrost is doing it, but
> at least the Vulkan userspace API specification requires devices to have
> coherent access to system memory.
>
> So even if I would want to do this it is simply not possible because the
> application doesn't tell the driver which memory is accessed by the
> device and which by the CPU.
>
> Christian.
>
> >
> > Robin.
>

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17  0:14                 ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17  0:14 UTC (permalink / raw)
  To: Kever Yang, Robin Murphy, Shawn Lin
  Cc: open list:ARM/Rockchip SoC...,
	Christian König, amd-gfx list, Deucher, Alexander,
	Alex Deucher, Christian König

Good Evening,

I apologize for raising this email chain from the dead, but there have
been some developments that have introduced even more questions.
I've looped the Rockchip mailing list into this too, as this affects
rk356x, and likely the upcoming rk3588 if [1] is to be believed.

TLDR for those not familiar: It seems the rk356x series (and possibly
the rk3588) were built without any outer coherent cache.
This means (unless Rockchip wants to clarify here) devices such as the
ITS and PCIe cannot utilize cache snooping.
This is based on the results of the email chain [2].

The new circumstances are as follows:
The RPi CM4 Adventure Team as I've taken to calling them has been
attempting to get a dGPU working with the very broken Broadcom
controller in the RPi CM4.
Recently they acquired a SoQuartz rk3566 module which is pin
compatible with the CM4, and have taken to trying it out as well.

This is how I got involved.
It seems they found a trivial way to force the Radeon R600 driver to
use Non-Cached memory for everything.
This single line change, combined with using memset_io instead of
memset, allows the ring tests to pass and the card probes successfully
(minus the DMA limitations of the rk356x due to the 32 bit
interconnect).
I discovered using this method that we start having unaligned io
memory access faults (bus errors) when running glmark2-drm (running
glmark2 directly was impossible, as both X and Wayland crashed too
early).
I traced this to using what I thought at the time was an unsafe memcpy
in the mesa stack.
Rewriting this function to force aligned writes solved the problem and
allows glmark2-drm to run to completion.
With some extensive debugging, I found about half a dozen memcpy
functions in mesa that if forced to be aligned would allow Wayland to
start, but with hilarious display corruption (see [3]. [4]).
The CM4 team is convinced this is an issue with memcpy in glibc, but
I'm not convinced it's that simple.

On my two hour drive in to work this morning, I got to thinking.
If this was an memcpy fault, this would be universally broken on arm64
which is obviously not the case.
So I started thinking, what is different here than with systems known to work:
1. No IOMMU for the PCIe controller.
2. The Outer Cache Issue.

Robin:
My questions for you, since you're the smartest person I know about
arm64 memory management:
Could cache snooping permit unaligned accesses to IO to be safe?
Or
Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
Or
Am I insane here?

Rockchip:
Please update on the status for the Outer Cache errata for ITS services.
Please provide an answer to the errata of the PCIe controller, in
regard to cache snooping and buffering, for both the rk356x and the
upcoming rk3588.

[1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
[2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
[3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
[4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png

Thank you everyone for your time.

Very Respectfully,
Peter Geis

On Wed, May 26, 2021 at 7:21 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Robin,
>
> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> > On 2021-05-26 10:42, Christian König wrote:
> >> Hi Robin,
> >>
> >> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>
> >>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Good Evening,
> >>>>>>>
> >>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>> prototype SBC.
> >>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>> modeset
> >>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>
> >>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>> kernel.
> >>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>
> >>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>> for the driver to operate.
> >>>>>
> >>>>> Ah, most likely not.
> >>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>
> >>>>> Is there no way to work around this or is it dead in the water?
> >>>>
> >>>> It's required by the pcie spec.  You could potentially work around it
> >>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>> particular platform supports cache snooping or not as well.
> >>>
> >>> There's device_get_dma_attr(), although I don't think it will work
> >>> currently for PCI devices without an OF or ACPI node - we could
> >>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>> to the host bridge's firmware description as necessary.
> >>>
> >>> The common DMA ops *do* correctly keep track of per-device coherency
> >>> internally, but drivers aren't supposed to be poking at that
> >>> information directly.
> >>
> >> That sounds like you underestimate the problem. ARM has unfortunately
> >> made the coherency for PCI an optional IP.
> >
> > Sorry to be that guy, but I'm involved a lot internally with our
> > system IP and interconnect, and I probably understand the situation
> > better than 99% of the community ;)
>
> I need to apologize, didn't realized who was answering :)
>
> It just sounded to me that you wanted to suggest to the end user that
> this is fixable in software and I really wanted to avoid even more
> customers coming around asking how to do this.
>
> > For the record, the SBSA specification (the closet thing we have to a
> > "system architecture") does require that PCIe is integrated in an
> > I/O-coherent manner, but we don't have any control over what people do
> > in embedded applications (note that we don't make PCIe IP at all, and
> > there is plenty of 3rd-party interconnect IP).
>
> So basically it is not the fault of the ARM IP-core, but people are just
> stitching together PCIe interconnect IP with a core where it is not
> supposed to be used with.
>
> Do I get that correctly? That's an interesting puzzle piece in the picture.
>
> >> So we are talking about a hardware limitation which potentially can't
> >> be fixed without replacing the hardware.
> >
> > You expressed interest in "some way to detect if a particular platform
> > supports cache snooping or not", by which I assumed you meant a
> > software method for the amdgpu/radeon drivers to call, rather than,
> > say, a website that driver maintainers can look up SoC names on. I'm
> > saying that that API already exists (just may need a bit more work).
> > Note that it is emphatically not a platform-level thing since
> > coherency can and does vary per device within a system.
>
> Well, I think this is not something an individual driver should mess
> with. What the driver should do is just express that it needs coherent
> access to all of system memory and if that is not possible fail to load
> with a warning why it is not possible.
>
> >
> > I wasn't suggesting that Linux could somehow make coherency magically
> > work when the signals don't physically exist in the interconnect - I
> > was assuming you'd merely want to do something like throw a big
> > warning and taint the kernel to help triage bug reports. Some drivers
> > like ahci_qoriq and panfrost simply need to know so they can program
> > their device to emit the appropriate memory attributes either way, and
> > rely on the DMA API to hide the rest of the difference, but if you
> > want to treat non-coherent use as unsupported because it would require
> > too invasive changes that's fine by me.
>
> Yes exactly that please. I mean not sure how panfrost is doing it, but
> at least the Vulkan userspace API specification requires devices to have
> coherent access to system memory.
>
> So even if I would want to do this it is simply not possible because the
> application doesn't tell the driver which memory is accessed by the
> device and which by the CPU.
>
> Christian.
>
> >
> > Robin.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17  0:14                 ` Peter Geis
@ 2022-03-17  3:07                   ` Kever Yang
  -1 siblings, 0 replies; 45+ messages in thread
From: Kever Yang @ 2022-03-17  3:07 UTC (permalink / raw)
  To: Peter Geis, Robin Murphy, Shawn Lin
  Cc: Christian König, Christian König, Alex Deucher,
	Deucher, Alexander, amd-gfx list, open list:ARM/Rockchip SoC...,
	Tao Huang

Hi Peter,

On 2022/3/17 08:14, Peter Geis wrote:
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
>
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.

Our SoC design team has double check with ARM GIC/ITS IP team for many 
times, and the GITS_CBASER
of GIC600 IP does not support hardware bind or config to a fix value, so 
they insist this is an IP
limitation instead of a SoC bug, software should take  care of it :(
I will check again if we can provide errata for this issue.
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.


Sorry, what is this?


Thanks,
- Kever
>
> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Hi Robin,
>>
>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>> On 2021-05-26 10:42, Christian König wrote:
>>>> Hi Robin,
>>>>
>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>>>>>> wrote:
>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Good Evening,
>>>>>>>>>
>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>> prototype SBC.
>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>> modeset
>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>
>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>> kernel.
>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>>>> for the driver to operate.
>>>>>>> Ah, most likely not.
>>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>
>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>> It's required by the pcie spec.  You could potentially work around it
>>>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>>>> particular platform supports cache snooping or not as well.
>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>> to the host bridge's firmware description as necessary.
>>>>>
>>>>> The common DMA ops *do* correctly keep track of per-device coherency
>>>>> internally, but drivers aren't supposed to be poking at that
>>>>> information directly.
>>>> That sounds like you underestimate the problem. ARM has unfortunately
>>>> made the coherency for PCI an optional IP.
>>> Sorry to be that guy, but I'm involved a lot internally with our
>>> system IP and interconnect, and I probably understand the situation
>>> better than 99% of the community ;)
>> I need to apologize, didn't realized who was answering :)
>>
>> It just sounded to me that you wanted to suggest to the end user that
>> this is fixable in software and I really wanted to avoid even more
>> customers coming around asking how to do this.
>>
>>> For the record, the SBSA specification (the closet thing we have to a
>>> "system architecture") does require that PCIe is integrated in an
>>> I/O-coherent manner, but we don't have any control over what people do
>>> in embedded applications (note that we don't make PCIe IP at all, and
>>> there is plenty of 3rd-party interconnect IP).
>> So basically it is not the fault of the ARM IP-core, but people are just
>> stitching together PCIe interconnect IP with a core where it is not
>> supposed to be used with.
>>
>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>
>>>> So we are talking about a hardware limitation which potentially can't
>>>> be fixed without replacing the hardware.
>>> You expressed interest in "some way to detect if a particular platform
>>> supports cache snooping or not", by which I assumed you meant a
>>> software method for the amdgpu/radeon drivers to call, rather than,
>>> say, a website that driver maintainers can look up SoC names on. I'm
>>> saying that that API already exists (just may need a bit more work).
>>> Note that it is emphatically not a platform-level thing since
>>> coherency can and does vary per device within a system.
>> Well, I think this is not something an individual driver should mess
>> with. What the driver should do is just express that it needs coherent
>> access to all of system memory and if that is not possible fail to load
>> with a warning why it is not possible.
>>
>>> I wasn't suggesting that Linux could somehow make coherency magically
>>> work when the signals don't physically exist in the interconnect - I
>>> was assuming you'd merely want to do something like throw a big
>>> warning and taint the kernel to help triage bug reports. Some drivers
>>> like ahci_qoriq and panfrost simply need to know so they can program
>>> their device to emit the appropriate memory attributes either way, and
>>> rely on the DMA API to hide the rest of the difference, but if you
>>> want to treat non-coherent use as unsupported because it would require
>>> too invasive changes that's fine by me.
>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>> at least the Vulkan userspace API specification requires devices to have
>> coherent access to system memory.
>>
>> So even if I would want to do this it is simply not possible because the
>> application doesn't tell the driver which memory is accessed by the
>> device and which by the CPU.
>>
>> Christian.
>>
>>> Robin.

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17  3:07                   ` Kever Yang
  0 siblings, 0 replies; 45+ messages in thread
From: Kever Yang @ 2022-03-17  3:07 UTC (permalink / raw)
  To: Peter Geis, Robin Murphy, Shawn Lin
  Cc: Tao Huang, open list:ARM/Rockchip SoC...,
	Christian König, amd-gfx list, Deucher, Alexander,
	Alex Deucher, Christian König

Hi Peter,

On 2022/3/17 08:14, Peter Geis wrote:
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
>
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.

Our SoC design team has double check with ARM GIC/ITS IP team for many 
times, and the GITS_CBASER
of GIC600 IP does not support hardware bind or config to a fix value, so 
they insist this is an IP
limitation instead of a SoC bug, software should take  care of it :(
I will check again if we can provide errata for this issue.
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.


Sorry, what is this?


Thanks,
- Kever
>
> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Hi Robin,
>>
>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>> On 2021-05-26 10:42, Christian König wrote:
>>>> Hi Robin,
>>>>
>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>>>>>> wrote:
>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Good Evening,
>>>>>>>>>
>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>> prototype SBC.
>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>> modeset
>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>
>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>> kernel.
>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>>>> for the driver to operate.
>>>>>>> Ah, most likely not.
>>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>
>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>> It's required by the pcie spec.  You could potentially work around it
>>>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>>>> particular platform supports cache snooping or not as well.
>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>> to the host bridge's firmware description as necessary.
>>>>>
>>>>> The common DMA ops *do* correctly keep track of per-device coherency
>>>>> internally, but drivers aren't supposed to be poking at that
>>>>> information directly.
>>>> That sounds like you underestimate the problem. ARM has unfortunately
>>>> made the coherency for PCI an optional IP.
>>> Sorry to be that guy, but I'm involved a lot internally with our
>>> system IP and interconnect, and I probably understand the situation
>>> better than 99% of the community ;)
>> I need to apologize, didn't realized who was answering :)
>>
>> It just sounded to me that you wanted to suggest to the end user that
>> this is fixable in software and I really wanted to avoid even more
>> customers coming around asking how to do this.
>>
>>> For the record, the SBSA specification (the closet thing we have to a
>>> "system architecture") does require that PCIe is integrated in an
>>> I/O-coherent manner, but we don't have any control over what people do
>>> in embedded applications (note that we don't make PCIe IP at all, and
>>> there is plenty of 3rd-party interconnect IP).
>> So basically it is not the fault of the ARM IP-core, but people are just
>> stitching together PCIe interconnect IP with a core where it is not
>> supposed to be used with.
>>
>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>
>>>> So we are talking about a hardware limitation which potentially can't
>>>> be fixed without replacing the hardware.
>>> You expressed interest in "some way to detect if a particular platform
>>> supports cache snooping or not", by which I assumed you meant a
>>> software method for the amdgpu/radeon drivers to call, rather than,
>>> say, a website that driver maintainers can look up SoC names on. I'm
>>> saying that that API already exists (just may need a bit more work).
>>> Note that it is emphatically not a platform-level thing since
>>> coherency can and does vary per device within a system.
>> Well, I think this is not something an individual driver should mess
>> with. What the driver should do is just express that it needs coherent
>> access to all of system memory and if that is not possible fail to load
>> with a warning why it is not possible.
>>
>>> I wasn't suggesting that Linux could somehow make coherency magically
>>> work when the signals don't physically exist in the interconnect - I
>>> was assuming you'd merely want to do something like throw a big
>>> warning and taint the kernel to help triage bug reports. Some drivers
>>> like ahci_qoriq and panfrost simply need to know so they can program
>>> their device to emit the appropriate memory attributes either way, and
>>> rely on the DMA API to hide the rest of the difference, but if you
>>> want to treat non-coherent use as unsupported because it would require
>>> too invasive changes that's fine by me.
>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>> at least the Vulkan userspace API specification requires devices to have
>> coherent access to system memory.
>>
>> So even if I would want to do this it is simply not possible because the
>> application doesn't tell the driver which memory is accessed by the
>> device and which by the CPU.
>>
>> Christian.
>>
>>> Robin.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17  0:14                 ` Peter Geis
@ 2022-03-17  9:14                   ` Christian König
  -1 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2022-03-17  9:14 UTC (permalink / raw)
  To: Peter Geis, Kever Yang, Robin Murphy, Shawn Lin
  Cc: Alex Deucher, Christian König, Deucher, Alexander,
	amd-gfx list, open list:ARM/Rockchip SoC...

Hi Peter,

Am 17.03.22 um 01:14 schrieb Peter Geis:
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.

well, as far as I know that is a clear violation of the PCIe specification.

Coherent access to system memory is simply a must have.

> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.

Yeah, you basically just force it into AGP mode :)

There is just absolutely no guarantee that this works reliable.

> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.

Yes exactly that.

Both OpenGL and Vulkan allow the application to mmap() device memory and 
do any memory access they want with that.

This means that changing memcpy is just a futile effort, it's still 
possible for the application to make an unaligned memory access and that 
is perfectly valid.

> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.

Oh, very good point. I would be interested in that as answer as well.

Regards,
Christian.

>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.
>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ZL3jA2VrnynWbUdFG6naaqrZqcnKRq338n%2Bj50DRa74%3D&amp;reserved=0
> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=QZy%2Bt%2Fus5f3yxwrHmXpzerXngPpKp3i9ZsF1UJ%2BHvlU%3D&amp;reserved=0
> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=c29bc87hxyIvnsBK3Fo7FbF7RwJcFr%2FjgBrLIiBb%2FyY%3D&amp;reserved=0
> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fwygTk%2BDzdla67rdAYb44vlivlby9lFwtcgjLfJEH4A%3D&amp;reserved=0
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Hi Robin,
>>
>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>> On 2021-05-26 10:42, Christian König wrote:
>>>> Hi Robin,
>>>>
>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>>>>>> wrote:
>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Good Evening,
>>>>>>>>>
>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>> prototype SBC.
>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>> modeset
>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>
>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>> kernel.
>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>>>> for the driver to operate.
>>>>>>> Ah, most likely not.
>>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>
>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>> It's required by the pcie spec.  You could potentially work around it
>>>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>>>> particular platform supports cache snooping or not as well.
>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>> to the host bridge's firmware description as necessary.
>>>>>
>>>>> The common DMA ops *do* correctly keep track of per-device coherency
>>>>> internally, but drivers aren't supposed to be poking at that
>>>>> information directly.
>>>> That sounds like you underestimate the problem. ARM has unfortunately
>>>> made the coherency for PCI an optional IP.
>>> Sorry to be that guy, but I'm involved a lot internally with our
>>> system IP and interconnect, and I probably understand the situation
>>> better than 99% of the community ;)
>> I need to apologize, didn't realized who was answering :)
>>
>> It just sounded to me that you wanted to suggest to the end user that
>> this is fixable in software and I really wanted to avoid even more
>> customers coming around asking how to do this.
>>
>>> For the record, the SBSA specification (the closet thing we have to a
>>> "system architecture") does require that PCIe is integrated in an
>>> I/O-coherent manner, but we don't have any control over what people do
>>> in embedded applications (note that we don't make PCIe IP at all, and
>>> there is plenty of 3rd-party interconnect IP).
>> So basically it is not the fault of the ARM IP-core, but people are just
>> stitching together PCIe interconnect IP with a core where it is not
>> supposed to be used with.
>>
>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>
>>>> So we are talking about a hardware limitation which potentially can't
>>>> be fixed without replacing the hardware.
>>> You expressed interest in "some way to detect if a particular platform
>>> supports cache snooping or not", by which I assumed you meant a
>>> software method for the amdgpu/radeon drivers to call, rather than,
>>> say, a website that driver maintainers can look up SoC names on. I'm
>>> saying that that API already exists (just may need a bit more work).
>>> Note that it is emphatically not a platform-level thing since
>>> coherency can and does vary per device within a system.
>> Well, I think this is not something an individual driver should mess
>> with. What the driver should do is just express that it needs coherent
>> access to all of system memory and if that is not possible fail to load
>> with a warning why it is not possible.
>>
>>> I wasn't suggesting that Linux could somehow make coherency magically
>>> work when the signals don't physically exist in the interconnect - I
>>> was assuming you'd merely want to do something like throw a big
>>> warning and taint the kernel to help triage bug reports. Some drivers
>>> like ahci_qoriq and panfrost simply need to know so they can program
>>> their device to emit the appropriate memory attributes either way, and
>>> rely on the DMA API to hide the rest of the difference, but if you
>>> want to treat non-coherent use as unsupported because it would require
>>> too invasive changes that's fine by me.
>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>> at least the Vulkan userspace API specification requires devices to have
>> coherent access to system memory.
>>
>> So even if I would want to do this it is simply not possible because the
>> application doesn't tell the driver which memory is accessed by the
>> device and which by the CPU.
>>
>> Christian.
>>
>>> Robin.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17  9:14                   ` Christian König
  0 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2022-03-17  9:14 UTC (permalink / raw)
  To: Peter Geis, Kever Yang, Robin Murphy, Shawn Lin
  Cc: Christian König, Alex Deucher, Deucher, Alexander,
	amd-gfx list, open list:ARM/Rockchip SoC...

Hi Peter,

Am 17.03.22 um 01:14 schrieb Peter Geis:
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.

well, as far as I know that is a clear violation of the PCIe specification.

Coherent access to system memory is simply a must have.

> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.

Yeah, you basically just force it into AGP mode :)

There is just absolutely no guarantee that this works reliable.

> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.

Yes exactly that.

Both OpenGL and Vulkan allow the application to mmap() device memory and 
do any memory access they want with that.

This means that changing memcpy is just a futile effort, it's still 
possible for the application to make an unaligned memory access and that 
is perfectly valid.

> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.

Oh, very good point. I would be interested in that as answer as well.

Regards,
Christian.

>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.
>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ZL3jA2VrnynWbUdFG6naaqrZqcnKRq338n%2Bj50DRa74%3D&amp;reserved=0
> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=QZy%2Bt%2Fus5f3yxwrHmXpzerXngPpKp3i9ZsF1UJ%2BHvlU%3D&amp;reserved=0
> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=c29bc87hxyIvnsBK3Fo7FbF7RwJcFr%2FjgBrLIiBb%2FyY%3D&amp;reserved=0
> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fwygTk%2BDzdla67rdAYb44vlivlby9lFwtcgjLfJEH4A%3D&amp;reserved=0
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Hi Robin,
>>
>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>> On 2021-05-26 10:42, Christian König wrote:
>>>> Hi Robin,
>>>>
>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>>>>>> wrote:
>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Good Evening,
>>>>>>>>>
>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>> prototype SBC.
>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>> modeset
>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>
>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>> kernel.
>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>>>> for the driver to operate.
>>>>>>> Ah, most likely not.
>>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>
>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>> It's required by the pcie spec.  You could potentially work around it
>>>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>>>> particular platform supports cache snooping or not as well.
>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>> to the host bridge's firmware description as necessary.
>>>>>
>>>>> The common DMA ops *do* correctly keep track of per-device coherency
>>>>> internally, but drivers aren't supposed to be poking at that
>>>>> information directly.
>>>> That sounds like you underestimate the problem. ARM has unfortunately
>>>> made the coherency for PCI an optional IP.
>>> Sorry to be that guy, but I'm involved a lot internally with our
>>> system IP and interconnect, and I probably understand the situation
>>> better than 99% of the community ;)
>> I need to apologize, didn't realized who was answering :)
>>
>> It just sounded to me that you wanted to suggest to the end user that
>> this is fixable in software and I really wanted to avoid even more
>> customers coming around asking how to do this.
>>
>>> For the record, the SBSA specification (the closet thing we have to a
>>> "system architecture") does require that PCIe is integrated in an
>>> I/O-coherent manner, but we don't have any control over what people do
>>> in embedded applications (note that we don't make PCIe IP at all, and
>>> there is plenty of 3rd-party interconnect IP).
>> So basically it is not the fault of the ARM IP-core, but people are just
>> stitching together PCIe interconnect IP with a core where it is not
>> supposed to be used with.
>>
>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>
>>>> So we are talking about a hardware limitation which potentially can't
>>>> be fixed without replacing the hardware.
>>> You expressed interest in "some way to detect if a particular platform
>>> supports cache snooping or not", by which I assumed you meant a
>>> software method for the amdgpu/radeon drivers to call, rather than,
>>> say, a website that driver maintainers can look up SoC names on. I'm
>>> saying that that API already exists (just may need a bit more work).
>>> Note that it is emphatically not a platform-level thing since
>>> coherency can and does vary per device within a system.
>> Well, I think this is not something an individual driver should mess
>> with. What the driver should do is just express that it needs coherent
>> access to all of system memory and if that is not possible fail to load
>> with a warning why it is not possible.
>>
>>> I wasn't suggesting that Linux could somehow make coherency magically
>>> work when the signals don't physically exist in the interconnect - I
>>> was assuming you'd merely want to do something like throw a big
>>> warning and taint the kernel to help triage bug reports. Some drivers
>>> like ahci_qoriq and panfrost simply need to know so they can program
>>> their device to emit the appropriate memory attributes either way, and
>>> rely on the DMA API to hide the rest of the difference, but if you
>>> want to treat non-coherent use as unsupported because it would require
>>> too invasive changes that's fine by me.
>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>> at least the Vulkan userspace API specification requires devices to have
>> coherent access to system memory.
>>
>> So even if I would want to do this it is simply not possible because the
>> application doesn't tell the driver which memory is accessed by the
>> device and which by the CPU.
>>
>> Christian.
>>
>>> Robin.


_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17  0:14                 ` Peter Geis
@ 2022-03-17 10:37                   ` Robin Murphy
  -1 siblings, 0 replies; 45+ messages in thread
From: Robin Murphy @ 2022-03-17 10:37 UTC (permalink / raw)
  To: Peter Geis, Kever Yang, Shawn Lin
  Cc: Christian König, Christian König, Alex Deucher,
	Deucher, Alexander, amd-gfx list, open list:ARM/Rockchip SoC...

On 2022-03-17 00:14, Peter Geis wrote:
> Good Evening,
> 
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> 
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
> 
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
> 
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
> 
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
> 
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?

No.

> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?

No.

> Or
> Am I insane here?

No. (probably)

CPU access to PCIe has nothing to do with PCIe's access to memory. From 
what you've described, my guess is that a GPU BAR gets put in a 
non-prefetchable window, such that it ends up mapped as Device memory 
(whereas if it were prefetchable it would be Normal Non-Cacheable).

Robin.

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 10:37                   ` Robin Murphy
  0 siblings, 0 replies; 45+ messages in thread
From: Robin Murphy @ 2022-03-17 10:37 UTC (permalink / raw)
  To: Peter Geis, Kever Yang, Shawn Lin
  Cc: open list:ARM/Rockchip SoC...,
	Christian König, amd-gfx list, Deucher, Alexander,
	Alex Deucher, Christian König

On 2022-03-17 00:14, Peter Geis wrote:
> Good Evening,
> 
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> 
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
> 
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
> 
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
> 
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
> 
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?

No.

> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?

No.

> Or
> Am I insane here?

No. (probably)

CPU access to PCIe has nothing to do with PCIe's access to memory. From 
what you've described, my guess is that a GPU BAR gets put in a 
non-prefetchable window, such that it ends up mapped as Device memory 
(whereas if it were prefetchable it would be Normal Non-Cacheable).

Robin.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17  3:07                   ` Kever Yang
@ 2022-03-17 12:19                     ` Peter Geis
  -1 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 12:19 UTC (permalink / raw)
  To: Kever Yang
  Cc: Robin Murphy, Shawn Lin, Christian König,
	Christian König, Alex Deucher, Deucher, Alexander,
	amd-gfx list, open list:ARM/Rockchip SoC...,
	Tao Huang

On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>
> Hi Peter,
>
> On 2022/3/17 08:14, Peter Geis wrote:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
> >
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> > Or
> > Am I insane here?
> >
> > Rockchip:
> > Please update on the status for the Outer Cache errata for ITS services.
>
> Our SoC design team has double check with ARM GIC/ITS IP team for many
> times, and the GITS_CBASER
> of GIC600 IP does not support hardware bind or config to a fix value, so
> they insist this is an IP
> limitation instead of a SoC bug, software should take  care of it :(
> I will check again if we can provide errata for this issue.

Thanks. This is necessary as the mbi-alias provides an imperfect
implementation of the ITS and causes certain PCIe cards (eg x520 Intel
10G NIC) to misbehave.

> > Please provide an answer to the errata of the PCIe controller, in
> > regard to cache snooping and buffering, for both the rk356x and the
> > upcoming rk3588.
>
>
> Sorry, what is this?

Part of the ITS bug is it expects to be cache coherent with the CPU
cluster by design.
Due to the rk356x being implemented without an outer accessible cache,
the ITS and other devices that require cache coherency (PCIe for
example) crash in fun ways.
This means that rk356x cannot implement a specification compliant ITS or PCIe.
From the rk3588 source dump it appears it was produced without an
outer accessible cache, which means if true it also will be unable to
use any PCIe cards that implement cache coherency as part of their
design.

>
>
> Thanks,
> - Kever
> >
> > [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> > [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
> > [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> > [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
> >
> > Thank you everyone for your time.
> >
> > Very Respectfully,
> > Peter Geis
> >
> > On Wed, May 26, 2021 at 7:21 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>> On 2021-05-26 10:42, Christian König wrote:
> >>>> Hi Robin,
> >>>>
> >>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>>>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Good Evening,
> >>>>>>>>>
> >>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>>>> prototype SBC.
> >>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>>>> modeset
> >>>>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>>>
> >>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>>>> kernel.
> >>>>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>>>> for the driver to operate.
> >>>>>>> Ah, most likely not.
> >>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>>>
> >>>>>>> Is there no way to work around this or is it dead in the water?
> >>>>>> It's required by the pcie spec.  You could potentially work around it
> >>>>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>>>> particular platform supports cache snooping or not as well.
> >>>>> There's device_get_dma_attr(), although I don't think it will work
> >>>>> currently for PCI devices without an OF or ACPI node - we could
> >>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>>>> to the host bridge's firmware description as necessary.
> >>>>>
> >>>>> The common DMA ops *do* correctly keep track of per-device coherency
> >>>>> internally, but drivers aren't supposed to be poking at that
> >>>>> information directly.
> >>>> That sounds like you underestimate the problem. ARM has unfortunately
> >>>> made the coherency for PCI an optional IP.
> >>> Sorry to be that guy, but I'm involved a lot internally with our
> >>> system IP and interconnect, and I probably understand the situation
> >>> better than 99% of the community ;)
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >>> For the record, the SBSA specification (the closet thing we have to a
> >>> "system architecture") does require that PCIe is integrated in an
> >>> I/O-coherent manner, but we don't have any control over what people do
> >>> in embedded applications (note that we don't make PCIe IP at all, and
> >>> there is plenty of 3rd-party interconnect IP).
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >>>> So we are talking about a hardware limitation which potentially can't
> >>>> be fixed without replacing the hardware.
> >>> You expressed interest in "some way to detect if a particular platform
> >>> supports cache snooping or not", by which I assumed you meant a
> >>> software method for the amdgpu/radeon drivers to call, rather than,
> >>> say, a website that driver maintainers can look up SoC names on. I'm
> >>> saying that that API already exists (just may need a bit more work).
> >>> Note that it is emphatically not a platform-level thing since
> >>> coherency can and does vary per device within a system.
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >>> I wasn't suggesting that Linux could somehow make coherency magically
> >>> work when the signals don't physically exist in the interconnect - I
> >>> was assuming you'd merely want to do something like throw a big
> >>> warning and taint the kernel to help triage bug reports. Some drivers
> >>> like ahci_qoriq and panfrost simply need to know so they can program
> >>> their device to emit the appropriate memory attributes either way, and
> >>> rely on the DMA API to hide the rest of the difference, but if you
> >>> want to treat non-coherent use as unsupported because it would require
> >>> too invasive changes that's fine by me.
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >>> Robin.
>
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rockchip

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 12:19                     ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 12:19 UTC (permalink / raw)
  To: Kever Yang
  Cc: Tao Huang, open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, amd-gfx list, Deucher,
	Alexander, Alex Deucher, Robin Murphy, Christian König

On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>
> Hi Peter,
>
> On 2022/3/17 08:14, Peter Geis wrote:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
> >
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> > Or
> > Am I insane here?
> >
> > Rockchip:
> > Please update on the status for the Outer Cache errata for ITS services.
>
> Our SoC design team has double check with ARM GIC/ITS IP team for many
> times, and the GITS_CBASER
> of GIC600 IP does not support hardware bind or config to a fix value, so
> they insist this is an IP
> limitation instead of a SoC bug, software should take  care of it :(
> I will check again if we can provide errata for this issue.

Thanks. This is necessary as the mbi-alias provides an imperfect
implementation of the ITS and causes certain PCIe cards (eg x520 Intel
10G NIC) to misbehave.

> > Please provide an answer to the errata of the PCIe controller, in
> > regard to cache snooping and buffering, for both the rk356x and the
> > upcoming rk3588.
>
>
> Sorry, what is this?

Part of the ITS bug is it expects to be cache coherent with the CPU
cluster by design.
Due to the rk356x being implemented without an outer accessible cache,
the ITS and other devices that require cache coherency (PCIe for
example) crash in fun ways.
This means that rk356x cannot implement a specification compliant ITS or PCIe.
From the rk3588 source dump it appears it was produced without an
outer accessible cache, which means if true it also will be unable to
use any PCIe cards that implement cache coherency as part of their
design.

>
>
> Thanks,
> - Kever
> >
> > [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> > [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
> > [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> > [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
> >
> > Thank you everyone for your time.
> >
> > Very Respectfully,
> > Peter Geis
> >
> > On Wed, May 26, 2021 at 7:21 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>> On 2021-05-26 10:42, Christian König wrote:
> >>>> Hi Robin,
> >>>>
> >>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>>>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Good Evening,
> >>>>>>>>>
> >>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>>>> prototype SBC.
> >>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>>>> modeset
> >>>>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>>>
> >>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>>>> kernel.
> >>>>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>>>> for the driver to operate.
> >>>>>>> Ah, most likely not.
> >>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>>>
> >>>>>>> Is there no way to work around this or is it dead in the water?
> >>>>>> It's required by the pcie spec.  You could potentially work around it
> >>>>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>>>> particular platform supports cache snooping or not as well.
> >>>>> There's device_get_dma_attr(), although I don't think it will work
> >>>>> currently for PCI devices without an OF or ACPI node - we could
> >>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>>>> to the host bridge's firmware description as necessary.
> >>>>>
> >>>>> The common DMA ops *do* correctly keep track of per-device coherency
> >>>>> internally, but drivers aren't supposed to be poking at that
> >>>>> information directly.
> >>>> That sounds like you underestimate the problem. ARM has unfortunately
> >>>> made the coherency for PCI an optional IP.
> >>> Sorry to be that guy, but I'm involved a lot internally with our
> >>> system IP and interconnect, and I probably understand the situation
> >>> better than 99% of the community ;)
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >>> For the record, the SBSA specification (the closet thing we have to a
> >>> "system architecture") does require that PCIe is integrated in an
> >>> I/O-coherent manner, but we don't have any control over what people do
> >>> in embedded applications (note that we don't make PCIe IP at all, and
> >>> there is plenty of 3rd-party interconnect IP).
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >>>> So we are talking about a hardware limitation which potentially can't
> >>>> be fixed without replacing the hardware.
> >>> You expressed interest in "some way to detect if a particular platform
> >>> supports cache snooping or not", by which I assumed you meant a
> >>> software method for the amdgpu/radeon drivers to call, rather than,
> >>> say, a website that driver maintainers can look up SoC names on. I'm
> >>> saying that that API already exists (just may need a bit more work).
> >>> Note that it is emphatically not a platform-level thing since
> >>> coherency can and does vary per device within a system.
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >>> I wasn't suggesting that Linux could somehow make coherency magically
> >>> work when the signals don't physically exist in the interconnect - I
> >>> was assuming you'd merely want to do something like throw a big
> >>> warning and taint the kernel to help triage bug reports. Some drivers
> >>> like ahci_qoriq and panfrost simply need to know so they can program
> >>> their device to emit the appropriate memory attributes either way, and
> >>> rely on the DMA API to hide the rest of the difference, but if you
> >>> want to treat non-coherent use as unsupported because it would require
> >>> too invasive changes that's fine by me.
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >>> Robin.
>
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17  9:14                   ` Christian König
@ 2022-03-17 12:21                     ` Peter Geis
  -1 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 12:21 UTC (permalink / raw)
  To: Christian König
  Cc: Kever Yang, Robin Murphy, Shawn Lin, Christian König,
	Alex Deucher, Deucher, Alexander, amd-gfx list,
	open list:ARM/Rockchip SoC...

On Thu, Mar 17, 2022 at 5:15 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Peter,
>
> Am 17.03.22 um 01:14 schrieb Peter Geis:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
>
> well, as far as I know that is a clear violation of the PCIe specification.
>
> Coherent access to system memory is simply a must have.

From what I've read of the Arm documentation on the AXI bus, this is
supposed to be implemented by design as well.

>
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
>
> Yeah, you basically just force it into AGP mode :)
>
> There is just absolutely no guarantee that this works reliable.

Ah, that makes sense.

>
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
>
> Yes exactly that.
>
> Both OpenGL and Vulkan allow the application to mmap() device memory and
> do any memory access they want with that.
>
> This means that changing memcpy is just a futile effort, it's still
> possible for the application to make an unaligned memory access and that
> is perfectly valid.

I was afraid of that and it reflects what I see with X11's behavior.

>
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
>
> Oh, very good point. I would be interested in that as answer as well.
>
> Regards,
> Christian.
>
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> > Or
> > Am I insane here?
> >
> > Rockchip:
> > Please update on the status for the Outer Cache errata for ITS services.
> > Please provide an answer to the errata of the PCIe controller, in
> > regard to cache snooping and buffering, for both the rk356x and the
> > upcoming rk3588.
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ZL3jA2VrnynWbUdFG6naaqrZqcnKRq338n%2Bj50DRa74%3D&amp;reserved=0
> > [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=QZy%2Bt%2Fus5f3yxwrHmXpzerXngPpKp3i9ZsF1UJ%2BHvlU%3D&amp;reserved=0
> > [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=c29bc87hxyIvnsBK3Fo7FbF7RwJcFr%2FjgBrLIiBb%2FyY%3D&amp;reserved=0
> > [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fwygTk%2BDzdla67rdAYb44vlivlby9lFwtcgjLfJEH4A%3D&amp;reserved=0
> >
> > Thank you everyone for your time.
> >
> > Very Respectfully,
> > Peter Geis
> >
> > On Wed, May 26, 2021 at 7:21 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>> On 2021-05-26 10:42, Christian König wrote:
> >>>> Hi Robin,
> >>>>
> >>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>>>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Good Evening,
> >>>>>>>>>
> >>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>>>> prototype SBC.
> >>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>>>> modeset
> >>>>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>>>
> >>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>>>> kernel.
> >>>>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>>>> for the driver to operate.
> >>>>>>> Ah, most likely not.
> >>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>>>
> >>>>>>> Is there no way to work around this or is it dead in the water?
> >>>>>> It's required by the pcie spec.  You could potentially work around it
> >>>>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>>>> particular platform supports cache snooping or not as well.
> >>>>> There's device_get_dma_attr(), although I don't think it will work
> >>>>> currently for PCI devices without an OF or ACPI node - we could
> >>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>>>> to the host bridge's firmware description as necessary.
> >>>>>
> >>>>> The common DMA ops *do* correctly keep track of per-device coherency
> >>>>> internally, but drivers aren't supposed to be poking at that
> >>>>> information directly.
> >>>> That sounds like you underestimate the problem. ARM has unfortunately
> >>>> made the coherency for PCI an optional IP.
> >>> Sorry to be that guy, but I'm involved a lot internally with our
> >>> system IP and interconnect, and I probably understand the situation
> >>> better than 99% of the community ;)
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >>> For the record, the SBSA specification (the closet thing we have to a
> >>> "system architecture") does require that PCIe is integrated in an
> >>> I/O-coherent manner, but we don't have any control over what people do
> >>> in embedded applications (note that we don't make PCIe IP at all, and
> >>> there is plenty of 3rd-party interconnect IP).
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >>>> So we are talking about a hardware limitation which potentially can't
> >>>> be fixed without replacing the hardware.
> >>> You expressed interest in "some way to detect if a particular platform
> >>> supports cache snooping or not", by which I assumed you meant a
> >>> software method for the amdgpu/radeon drivers to call, rather than,
> >>> say, a website that driver maintainers can look up SoC names on. I'm
> >>> saying that that API already exists (just may need a bit more work).
> >>> Note that it is emphatically not a platform-level thing since
> >>> coherency can and does vary per device within a system.
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >>> I wasn't suggesting that Linux could somehow make coherency magically
> >>> work when the signals don't physically exist in the interconnect - I
> >>> was assuming you'd merely want to do something like throw a big
> >>> warning and taint the kernel to help triage bug reports. Some drivers
> >>> like ahci_qoriq and panfrost simply need to know so they can program
> >>> their device to emit the appropriate memory attributes either way, and
> >>> rely on the DMA API to hide the rest of the difference, but if you
> >>> want to treat non-coherent use as unsupported because it would require
> >>> too invasive changes that's fine by me.
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >>> Robin.
>

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 12:21                     ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 12:21 UTC (permalink / raw)
  To: Christian König
  Cc: open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Alex Deucher, Robin Murphy

On Thu, Mar 17, 2022 at 5:15 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Peter,
>
> Am 17.03.22 um 01:14 schrieb Peter Geis:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
>
> well, as far as I know that is a clear violation of the PCIe specification.
>
> Coherent access to system memory is simply a must have.

From what I've read of the Arm documentation on the AXI bus, this is
supposed to be implemented by design as well.

>
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
>
> Yeah, you basically just force it into AGP mode :)
>
> There is just absolutely no guarantee that this works reliable.

Ah, that makes sense.

>
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
>
> Yes exactly that.
>
> Both OpenGL and Vulkan allow the application to mmap() device memory and
> do any memory access they want with that.
>
> This means that changing memcpy is just a futile effort, it's still
> possible for the application to make an unaligned memory access and that
> is perfectly valid.

I was afraid of that and it reflects what I see with X11's behavior.

>
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
>
> Oh, very good point. I would be interested in that as answer as well.
>
> Regards,
> Christian.
>
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> > Or
> > Am I insane here?
> >
> > Rockchip:
> > Please update on the status for the Outer Cache errata for ITS services.
> > Please provide an answer to the errata of the PCIe controller, in
> > regard to cache snooping and buffering, for both the rk356x and the
> > upcoming rk3588.
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ZL3jA2VrnynWbUdFG6naaqrZqcnKRq338n%2Bj50DRa74%3D&amp;reserved=0
> > [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=QZy%2Bt%2Fus5f3yxwrHmXpzerXngPpKp3i9ZsF1UJ%2BHvlU%3D&amp;reserved=0
> > [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=c29bc87hxyIvnsBK3Fo7FbF7RwJcFr%2FjgBrLIiBb%2FyY%3D&amp;reserved=0
> > [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fwygTk%2BDzdla67rdAYb44vlivlby9lFwtcgjLfJEH4A%3D&amp;reserved=0
> >
> > Thank you everyone for your time.
> >
> > Very Respectfully,
> > Peter Geis
> >
> > On Wed, May 26, 2021 at 7:21 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>> On 2021-05-26 10:42, Christian König wrote:
> >>>> Hi Robin,
> >>>>
> >>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>>>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Good Evening,
> >>>>>>>>>
> >>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>>>> prototype SBC.
> >>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>>>> modeset
> >>>>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>>>
> >>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>>>> kernel.
> >>>>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>>>> for the driver to operate.
> >>>>>>> Ah, most likely not.
> >>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>>>
> >>>>>>> Is there no way to work around this or is it dead in the water?
> >>>>>> It's required by the pcie spec.  You could potentially work around it
> >>>>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>>>> particular platform supports cache snooping or not as well.
> >>>>> There's device_get_dma_attr(), although I don't think it will work
> >>>>> currently for PCI devices without an OF or ACPI node - we could
> >>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>>>> to the host bridge's firmware description as necessary.
> >>>>>
> >>>>> The common DMA ops *do* correctly keep track of per-device coherency
> >>>>> internally, but drivers aren't supposed to be poking at that
> >>>>> information directly.
> >>>> That sounds like you underestimate the problem. ARM has unfortunately
> >>>> made the coherency for PCI an optional IP.
> >>> Sorry to be that guy, but I'm involved a lot internally with our
> >>> system IP and interconnect, and I probably understand the situation
> >>> better than 99% of the community ;)
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >>> For the record, the SBSA specification (the closet thing we have to a
> >>> "system architecture") does require that PCIe is integrated in an
> >>> I/O-coherent manner, but we don't have any control over what people do
> >>> in embedded applications (note that we don't make PCIe IP at all, and
> >>> there is plenty of 3rd-party interconnect IP).
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >>>> So we are talking about a hardware limitation which potentially can't
> >>>> be fixed without replacing the hardware.
> >>> You expressed interest in "some way to detect if a particular platform
> >>> supports cache snooping or not", by which I assumed you meant a
> >>> software method for the amdgpu/radeon drivers to call, rather than,
> >>> say, a website that driver maintainers can look up SoC names on. I'm
> >>> saying that that API already exists (just may need a bit more work).
> >>> Note that it is emphatically not a platform-level thing since
> >>> coherency can and does vary per device within a system.
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >>> I wasn't suggesting that Linux could somehow make coherency magically
> >>> work when the signals don't physically exist in the interconnect - I
> >>> was assuming you'd merely want to do something like throw a big
> >>> warning and taint the kernel to help triage bug reports. Some drivers
> >>> like ahci_qoriq and panfrost simply need to know so they can program
> >>> their device to emit the appropriate memory attributes either way, and
> >>> rely on the DMA API to hide the rest of the difference, but if you
> >>> want to treat non-coherent use as unsupported because it would require
> >>> too invasive changes that's fine by me.
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >>> Robin.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17 10:37                   ` Robin Murphy
@ 2022-03-17 12:26                     ` Peter Geis
  -1 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 12:26 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Kever Yang, Shawn Lin, Christian König,
	Christian König, Alex Deucher, Deucher, Alexander,
	amd-gfx list, open list:ARM/Rockchip SoC...

On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2022-03-17 00:14, Peter Geis wrote:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
> >
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
>
> No.
>
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>
> No.
>
> > Or
> > Am I insane here?
>
> No. (probably)
>
> CPU access to PCIe has nothing to do with PCIe's access to memory. From
> what you've described, my guess is that a GPU BAR gets put in a
> non-prefetchable window, such that it ends up mapped as Device memory
> (whereas if it were prefetchable it would be Normal Non-Cacheable).

Okay, this is perfect and I think you just put me on the right track
for identifying the exact issue. Thanks!

I've sliced up the non-prefetchable window and given it a prefetchable window.
The 256MB BAR now resides in that window.
However I'm still getting bus errors, so it seems the prefetch isn't
actually happening.
The difference is now the GPU realizes that an error has happened and
initiates recovery, vice before where it seemed to be clueless.
If I understand everything correctly, that's because before the bus
error was raised by the CPU due to the memory flag, vice now where
it's actually the bus raising the alarm.

My next question, is this something the driver should set and isn't,
or is it just because of the broken cache coherency?

>
> Robin.

Thanks again!
Peter

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 12:26                     ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 12:26 UTC (permalink / raw)
  To: Robin Murphy
  Cc: open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Alex Deucher, Christian König

On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2022-03-17 00:14, Peter Geis wrote:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
> >
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
>
> No.
>
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>
> No.
>
> > Or
> > Am I insane here?
>
> No. (probably)
>
> CPU access to PCIe has nothing to do with PCIe's access to memory. From
> what you've described, my guess is that a GPU BAR gets put in a
> non-prefetchable window, such that it ends up mapped as Device memory
> (whereas if it were prefetchable it would be Normal Non-Cacheable).

Okay, this is perfect and I think you just put me on the right track
for identifying the exact issue. Thanks!

I've sliced up the non-prefetchable window and given it a prefetchable window.
The 256MB BAR now resides in that window.
However I'm still getting bus errors, so it seems the prefetch isn't
actually happening.
The difference is now the GPU realizes that an error has happened and
initiates recovery, vice before where it seemed to be clueless.
If I understand everything correctly, that's because before the bus
error was raised by the CPU due to the memory flag, vice now where
it's actually the bus raising the alarm.

My next question, is this something the driver should set and isn't,
or is it just because of the broken cache coherency?

>
> Robin.

Thanks again!
Peter

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17 12:26                     ` Peter Geis
@ 2022-03-17 12:51                       ` Christian König
  -1 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2022-03-17 12:51 UTC (permalink / raw)
  To: Peter Geis, Robin Murphy
  Cc: Kever Yang, Shawn Lin, Christian König, Alex Deucher,
	Deucher, Alexander, amd-gfx list, open list:ARM/Rockchip SoC...

Am 17.03.22 um 13:26 schrieb Peter Geis:
> On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
>> On 2022-03-17 00:14, Peter Geis wrote:
>>> Good Evening,
>>>
>>> I apologize for raising this email chain from the dead, but there have
>>> been some developments that have introduced even more questions.
>>> I've looped the Rockchip mailing list into this too, as this affects
>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>
>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>> the rk3588) were built without any outer coherent cache.
>>> This means (unless Rockchip wants to clarify here) devices such as the
>>> ITS and PCIe cannot utilize cache snooping.
>>> This is based on the results of the email chain [2].
>>>
>>> The new circumstances are as follows:
>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>> attempting to get a dGPU working with the very broken Broadcom
>>> controller in the RPi CM4.
>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>> compatible with the CM4, and have taken to trying it out as well.
>>>
>>> This is how I got involved.
>>> It seems they found a trivial way to force the Radeon R600 driver to
>>> use Non-Cached memory for everything.
>>> This single line change, combined with using memset_io instead of
>>> memset, allows the ring tests to pass and the card probes successfully
>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>> interconnect).
>>> I discovered using this method that we start having unaligned io
>>> memory access faults (bus errors) when running glmark2-drm (running
>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>> early).
>>> I traced this to using what I thought at the time was an unsafe memcpy
>>> in the mesa stack.
>>> Rewriting this function to force aligned writes solved the problem and
>>> allows glmark2-drm to run to completion.
>>> With some extensive debugging, I found about half a dozen memcpy
>>> functions in mesa that if forced to be aligned would allow Wayland to
>>> start, but with hilarious display corruption (see [3]. [4]).
>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>> I'm not convinced it's that simple.
>>>
>>> On my two hour drive in to work this morning, I got to thinking.
>>> If this was an memcpy fault, this would be universally broken on arm64
>>> which is obviously not the case.
>>> So I started thinking, what is different here than with systems known to work:
>>> 1. No IOMMU for the PCIe controller.
>>> 2. The Outer Cache Issue.
>>>
>>> Robin:
>>> My questions for you, since you're the smartest person I know about
>>> arm64 memory management:
>>> Could cache snooping permit unaligned accesses to IO to be safe?
>> No.
>>
>>> Or
>>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>> No.
>>
>>> Or
>>> Am I insane here?
>> No. (probably)
>>
>> CPU access to PCIe has nothing to do with PCIe's access to memory. From
>> what you've described, my guess is that a GPU BAR gets put in a
>> non-prefetchable window, such that it ends up mapped as Device memory
>> (whereas if it were prefetchable it would be Normal Non-Cacheable).
> Okay, this is perfect and I think you just put me on the right track
> for identifying the exact issue. Thanks!
>
> I've sliced up the non-prefetchable window and given it a prefetchable window.
> The 256MB BAR now resides in that window.
> However I'm still getting bus errors, so it seems the prefetch isn't
> actually happening.
> The difference is now the GPU realizes that an error has happened and
> initiates recovery, vice before where it seemed to be clueless.
> If I understand everything correctly, that's because before the bus
> error was raised by the CPU due to the memory flag, vice now where
> it's actually the bus raising the alarm.

Mhm, that's really interesting.

The BIF (bus interface) should be able to handle all power of twos 
between 8bits and 128bits on the hardware generation IIRC (but could 
also be 64bits or 256bits, need to check the hw docs as well).

So once the request ended up at the GPU it should be able to handle it. 
Maybe a mis-configured bridge in between?

> My next question, is this something the driver should set and isn't,
> or is it just because of the broken cache coherency?

As Robin noted as well we have two different issues here:

1. Cache coherency of system memory.
2. Unaligned accesses on IO memory.

The later can actually be avoided if we absolutely have to. E.g. for 
bringup with test the ASICs alone without any DRAM attached. That is so 
called ZFB (zero frame buffer) mode for the driver.

I don't think we ever made the necessary patches for that public, but in 
theory it is possible.

Only the first item is just not solvable cleanly as far as I understand it.

Regards,
Christian.

>
>> Robin.
> Thanks again!
> Peter


_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 12:51                       ` Christian König
  0 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2022-03-17 12:51 UTC (permalink / raw)
  To: Peter Geis, Robin Murphy
  Cc: open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Alex Deucher

Am 17.03.22 um 13:26 schrieb Peter Geis:
> On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
>> On 2022-03-17 00:14, Peter Geis wrote:
>>> Good Evening,
>>>
>>> I apologize for raising this email chain from the dead, but there have
>>> been some developments that have introduced even more questions.
>>> I've looped the Rockchip mailing list into this too, as this affects
>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>
>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>> the rk3588) were built without any outer coherent cache.
>>> This means (unless Rockchip wants to clarify here) devices such as the
>>> ITS and PCIe cannot utilize cache snooping.
>>> This is based on the results of the email chain [2].
>>>
>>> The new circumstances are as follows:
>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>> attempting to get a dGPU working with the very broken Broadcom
>>> controller in the RPi CM4.
>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>> compatible with the CM4, and have taken to trying it out as well.
>>>
>>> This is how I got involved.
>>> It seems they found a trivial way to force the Radeon R600 driver to
>>> use Non-Cached memory for everything.
>>> This single line change, combined with using memset_io instead of
>>> memset, allows the ring tests to pass and the card probes successfully
>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>> interconnect).
>>> I discovered using this method that we start having unaligned io
>>> memory access faults (bus errors) when running glmark2-drm (running
>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>> early).
>>> I traced this to using what I thought at the time was an unsafe memcpy
>>> in the mesa stack.
>>> Rewriting this function to force aligned writes solved the problem and
>>> allows glmark2-drm to run to completion.
>>> With some extensive debugging, I found about half a dozen memcpy
>>> functions in mesa that if forced to be aligned would allow Wayland to
>>> start, but with hilarious display corruption (see [3]. [4]).
>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>> I'm not convinced it's that simple.
>>>
>>> On my two hour drive in to work this morning, I got to thinking.
>>> If this was an memcpy fault, this would be universally broken on arm64
>>> which is obviously not the case.
>>> So I started thinking, what is different here than with systems known to work:
>>> 1. No IOMMU for the PCIe controller.
>>> 2. The Outer Cache Issue.
>>>
>>> Robin:
>>> My questions for you, since you're the smartest person I know about
>>> arm64 memory management:
>>> Could cache snooping permit unaligned accesses to IO to be safe?
>> No.
>>
>>> Or
>>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>> No.
>>
>>> Or
>>> Am I insane here?
>> No. (probably)
>>
>> CPU access to PCIe has nothing to do with PCIe's access to memory. From
>> what you've described, my guess is that a GPU BAR gets put in a
>> non-prefetchable window, such that it ends up mapped as Device memory
>> (whereas if it were prefetchable it would be Normal Non-Cacheable).
> Okay, this is perfect and I think you just put me on the right track
> for identifying the exact issue. Thanks!
>
> I've sliced up the non-prefetchable window and given it a prefetchable window.
> The 256MB BAR now resides in that window.
> However I'm still getting bus errors, so it seems the prefetch isn't
> actually happening.
> The difference is now the GPU realizes that an error has happened and
> initiates recovery, vice before where it seemed to be clueless.
> If I understand everything correctly, that's because before the bus
> error was raised by the CPU due to the memory flag, vice now where
> it's actually the bus raising the alarm.

Mhm, that's really interesting.

The BIF (bus interface) should be able to handle all power of twos 
between 8bits and 128bits on the hardware generation IIRC (but could 
also be 64bits or 256bits, need to check the hw docs as well).

So once the request ended up at the GPU it should be able to handle it. 
Maybe a mis-configured bridge in between?

> My next question, is this something the driver should set and isn't,
> or is it just because of the broken cache coherency?

As Robin noted as well we have two different issues here:

1. Cache coherency of system memory.
2. Unaligned accesses on IO memory.

The later can actually be avoided if we absolutely have to. E.g. for 
bringup with test the ASICs alone without any DRAM attached. That is so 
called ZFB (zero frame buffer) mode for the driver.

I don't think we ever made the necessary patches for that public, but in 
theory it is possible.

Only the first item is just not solvable cleanly as far as I understand it.

Regards,
Christian.

>
>> Robin.
> Thanks again!
> Peter


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17 12:26                     ` Peter Geis
@ 2022-03-17 13:17                       ` Robin Murphy
  -1 siblings, 0 replies; 45+ messages in thread
From: Robin Murphy @ 2022-03-17 13:17 UTC (permalink / raw)
  To: Peter Geis
  Cc: Kever Yang, Shawn Lin, Christian König,
	Christian König, Alex Deucher, Deucher, Alexander,
	amd-gfx list, open list:ARM/Rockchip SoC...

On 2022-03-17 12:26, Peter Geis wrote:
> On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2022-03-17 00:14, Peter Geis wrote:
>>> Good Evening,
>>>
>>> I apologize for raising this email chain from the dead, but there have
>>> been some developments that have introduced even more questions.
>>> I've looped the Rockchip mailing list into this too, as this affects
>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>
>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>> the rk3588) were built without any outer coherent cache.
>>> This means (unless Rockchip wants to clarify here) devices such as the
>>> ITS and PCIe cannot utilize cache snooping.
>>> This is based on the results of the email chain [2].
>>>
>>> The new circumstances are as follows:
>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>> attempting to get a dGPU working with the very broken Broadcom
>>> controller in the RPi CM4.
>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>> compatible with the CM4, and have taken to trying it out as well.
>>>
>>> This is how I got involved.
>>> It seems they found a trivial way to force the Radeon R600 driver to
>>> use Non-Cached memory for everything.
>>> This single line change, combined with using memset_io instead of
>>> memset, allows the ring tests to pass and the card probes successfully
>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>> interconnect).
>>> I discovered using this method that we start having unaligned io
>>> memory access faults (bus errors) when running glmark2-drm (running
>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>> early).
>>> I traced this to using what I thought at the time was an unsafe memcpy
>>> in the mesa stack.
>>> Rewriting this function to force aligned writes solved the problem and
>>> allows glmark2-drm to run to completion.
>>> With some extensive debugging, I found about half a dozen memcpy
>>> functions in mesa that if forced to be aligned would allow Wayland to
>>> start, but with hilarious display corruption (see [3]. [4]).
>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>> I'm not convinced it's that simple.
>>>
>>> On my two hour drive in to work this morning, I got to thinking.
>>> If this was an memcpy fault, this would be universally broken on arm64
>>> which is obviously not the case.
>>> So I started thinking, what is different here than with systems known to work:
>>> 1. No IOMMU for the PCIe controller.
>>> 2. The Outer Cache Issue.
>>>
>>> Robin:
>>> My questions for you, since you're the smartest person I know about
>>> arm64 memory management:
>>> Could cache snooping permit unaligned accesses to IO to be safe?
>>
>> No.
>>
>>> Or
>>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>>
>> No.
>>
>>> Or
>>> Am I insane here?
>>
>> No. (probably)
>>
>> CPU access to PCIe has nothing to do with PCIe's access to memory. From
>> what you've described, my guess is that a GPU BAR gets put in a
>> non-prefetchable window, such that it ends up mapped as Device memory
>> (whereas if it were prefetchable it would be Normal Non-Cacheable).
> 
> Okay, this is perfect and I think you just put me on the right track
> for identifying the exact issue. Thanks!
> 
> I've sliced up the non-prefetchable window and given it a prefetchable window.
> The 256MB BAR now resides in that window.
> However I'm still getting bus errors, so it seems the prefetch isn't
> actually happening.

Note that "prefetchable" really just means "no side-effects on reads", 
i.e. we can map it with a Normal memory type that technically *allows* 
the CPU to make speculative accesses because they will not be harmful, 
but that's not to say the CPU will do so. Just that if it did, you 
wouldn't notice anyway.

It's entirely possible that the PCIe IP itself doesn't like unaligned 
accesses, so changing the memory type just moves you from an alignment 
fault to an external abort.

> The difference is now the GPU realizes that an error has happened and
> initiates recovery, vice before where it seemed to be clueless.
> If I understand everything correctly, that's because before the bus
> error was raised by the CPU due to the memory flag, vice now where
> it's actually the bus raising the alarm.
> 
> My next question, is this something the driver should set and isn't,
> or is it just because of the broken cache coherency?

The general rule for userspace mmap()ing PCIe-attached memory and 
handing it off to glibc or anyone else who might assume it's regular 
system RAM is "don't do that". If it's not access size or alignment that 
falls over, it could be atomic operations, MTE tags, or any other 
new-fangled memory innovation. For the ultimate dream of just plugging 
in a card full of RAM, you either need to look back to ISA or forward to 
CXL ;)

Robin.

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 13:17                       ` Robin Murphy
  0 siblings, 0 replies; 45+ messages in thread
From: Robin Murphy @ 2022-03-17 13:17 UTC (permalink / raw)
  To: Peter Geis
  Cc: open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Alex Deucher, Christian König

On 2022-03-17 12:26, Peter Geis wrote:
> On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2022-03-17 00:14, Peter Geis wrote:
>>> Good Evening,
>>>
>>> I apologize for raising this email chain from the dead, but there have
>>> been some developments that have introduced even more questions.
>>> I've looped the Rockchip mailing list into this too, as this affects
>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>
>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>> the rk3588) were built without any outer coherent cache.
>>> This means (unless Rockchip wants to clarify here) devices such as the
>>> ITS and PCIe cannot utilize cache snooping.
>>> This is based on the results of the email chain [2].
>>>
>>> The new circumstances are as follows:
>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>> attempting to get a dGPU working with the very broken Broadcom
>>> controller in the RPi CM4.
>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>> compatible with the CM4, and have taken to trying it out as well.
>>>
>>> This is how I got involved.
>>> It seems they found a trivial way to force the Radeon R600 driver to
>>> use Non-Cached memory for everything.
>>> This single line change, combined with using memset_io instead of
>>> memset, allows the ring tests to pass and the card probes successfully
>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>> interconnect).
>>> I discovered using this method that we start having unaligned io
>>> memory access faults (bus errors) when running glmark2-drm (running
>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>> early).
>>> I traced this to using what I thought at the time was an unsafe memcpy
>>> in the mesa stack.
>>> Rewriting this function to force aligned writes solved the problem and
>>> allows glmark2-drm to run to completion.
>>> With some extensive debugging, I found about half a dozen memcpy
>>> functions in mesa that if forced to be aligned would allow Wayland to
>>> start, but with hilarious display corruption (see [3]. [4]).
>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>> I'm not convinced it's that simple.
>>>
>>> On my two hour drive in to work this morning, I got to thinking.
>>> If this was an memcpy fault, this would be universally broken on arm64
>>> which is obviously not the case.
>>> So I started thinking, what is different here than with systems known to work:
>>> 1. No IOMMU for the PCIe controller.
>>> 2. The Outer Cache Issue.
>>>
>>> Robin:
>>> My questions for you, since you're the smartest person I know about
>>> arm64 memory management:
>>> Could cache snooping permit unaligned accesses to IO to be safe?
>>
>> No.
>>
>>> Or
>>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>>
>> No.
>>
>>> Or
>>> Am I insane here?
>>
>> No. (probably)
>>
>> CPU access to PCIe has nothing to do with PCIe's access to memory. From
>> what you've described, my guess is that a GPU BAR gets put in a
>> non-prefetchable window, such that it ends up mapped as Device memory
>> (whereas if it were prefetchable it would be Normal Non-Cacheable).
> 
> Okay, this is perfect and I think you just put me on the right track
> for identifying the exact issue. Thanks!
> 
> I've sliced up the non-prefetchable window and given it a prefetchable window.
> The 256MB BAR now resides in that window.
> However I'm still getting bus errors, so it seems the prefetch isn't
> actually happening.

Note that "prefetchable" really just means "no side-effects on reads", 
i.e. we can map it with a Normal memory type that technically *allows* 
the CPU to make speculative accesses because they will not be harmful, 
but that's not to say the CPU will do so. Just that if it did, you 
wouldn't notice anyway.

It's entirely possible that the PCIe IP itself doesn't like unaligned 
accesses, so changing the memory type just moves you from an alignment 
fault to an external abort.

> The difference is now the GPU realizes that an error has happened and
> initiates recovery, vice before where it seemed to be clueless.
> If I understand everything correctly, that's because before the bus
> error was raised by the CPU due to the memory flag, vice now where
> it's actually the bus raising the alarm.
> 
> My next question, is this something the driver should set and isn't,
> or is it just because of the broken cache coherency?

The general rule for userspace mmap()ing PCIe-attached memory and 
handing it off to glibc or anyone else who might assume it's regular 
system RAM is "don't do that". If it's not access size or alignment that 
falls over, it could be atomic operations, MTE tags, or any other 
new-fangled memory innovation. For the ultimate dream of just plugging 
in a card full of RAM, you either need to look back to ISA or forward to 
CXL ;)

Robin.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17 13:17                       ` Robin Murphy
@ 2022-03-17 14:21                         ` Peter Geis
  -1 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 14:21 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Kever Yang, Shawn Lin, Christian König,
	Christian König, Alex Deucher, Deucher, Alexander,
	amd-gfx list, open list:ARM/Rockchip SoC...,
	Jingoo Han, Gustavo Pimentel, Simon Xue

On Thu, Mar 17, 2022 at 9:17 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2022-03-17 12:26, Peter Geis wrote:
> > On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 2022-03-17 00:14, Peter Geis wrote:
> >>> Good Evening,

I've added the Designware driver maintainers, since the Rockchip host
driver uses the dwc driver.

> >>>
> >>> I apologize for raising this email chain from the dead, but there have
> >>> been some developments that have introduced even more questions.
> >>> I've looped the Rockchip mailing list into this too, as this affects
> >>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >>>
> >>> TLDR for those not familiar: It seems the rk356x series (and possibly
> >>> the rk3588) were built without any outer coherent cache.
> >>> This means (unless Rockchip wants to clarify here) devices such as the
> >>> ITS and PCIe cannot utilize cache snooping.
> >>> This is based on the results of the email chain [2].
> >>>
> >>> The new circumstances are as follows:
> >>> The RPi CM4 Adventure Team as I've taken to calling them has been
> >>> attempting to get a dGPU working with the very broken Broadcom
> >>> controller in the RPi CM4.
> >>> Recently they acquired a SoQuartz rk3566 module which is pin
> >>> compatible with the CM4, and have taken to trying it out as well.
> >>>
> >>> This is how I got involved.
> >>> It seems they found a trivial way to force the Radeon R600 driver to
> >>> use Non-Cached memory for everything.
> >>> This single line change, combined with using memset_io instead of
> >>> memset, allows the ring tests to pass and the card probes successfully
> >>> (minus the DMA limitations of the rk356x due to the 32 bit
> >>> interconnect).
> >>> I discovered using this method that we start having unaligned io
> >>> memory access faults (bus errors) when running glmark2-drm (running
> >>> glmark2 directly was impossible, as both X and Wayland crashed too
> >>> early).
> >>> I traced this to using what I thought at the time was an unsafe memcpy
> >>> in the mesa stack.
> >>> Rewriting this function to force aligned writes solved the problem and
> >>> allows glmark2-drm to run to completion.
> >>> With some extensive debugging, I found about half a dozen memcpy
> >>> functions in mesa that if forced to be aligned would allow Wayland to
> >>> start, but with hilarious display corruption (see [3]. [4]).
> >>> The CM4 team is convinced this is an issue with memcpy in glibc, but
> >>> I'm not convinced it's that simple.
> >>>
> >>> On my two hour drive in to work this morning, I got to thinking.
> >>> If this was an memcpy fault, this would be universally broken on arm64
> >>> which is obviously not the case.
> >>> So I started thinking, what is different here than with systems known to work:
> >>> 1. No IOMMU for the PCIe controller.
> >>> 2. The Outer Cache Issue.
> >>>
> >>> Robin:
> >>> My questions for you, since you're the smartest person I know about
> >>> arm64 memory management:
> >>> Could cache snooping permit unaligned accesses to IO to be safe?
> >>
> >> No.
> >>
> >>> Or
> >>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> >>
> >> No.
> >>
> >>> Or
> >>> Am I insane here?
> >>
> >> No. (probably)
> >>
> >> CPU access to PCIe has nothing to do with PCIe's access to memory. From
> >> what you've described, my guess is that a GPU BAR gets put in a
> >> non-prefetchable window, such that it ends up mapped as Device memory
> >> (whereas if it were prefetchable it would be Normal Non-Cacheable).
> >
> > Okay, this is perfect and I think you just put me on the right track
> > for identifying the exact issue. Thanks!
> >
> > I've sliced up the non-prefetchable window and given it a prefetchable window.
> > The 256MB BAR now resides in that window.
> > However I'm still getting bus errors, so it seems the prefetch isn't
> > actually happening.
>
> Note that "prefetchable" really just means "no side-effects on reads",
> i.e. we can map it with a Normal memory type that technically *allows*
> the CPU to make speculative accesses because they will not be harmful,
> but that's not to say the CPU will do so. Just that if it did, you
> wouldn't notice anyway.
>
> It's entirely possible that the PCIe IP itself doesn't like unaligned
> accesses, so changing the memory type just moves you from an alignment
> fault to an external abort.

Okay, I've tried setting up PL_COHERENCY_CONTROL_3_OFF, where AxCACHE
can be forced from auto to predefined for reads and writes.
As I understand it, the cache bit should permit characteristic
mismatch to be accepted and prefetch to be enabled, when combined with
the read/write bits.
It doesn't seem to make a difference however.
I got the idea to look for this from the Armada8K and Tegra drivers.

It would be nice to know if dGPUs work at all on *any* DWC based PCIe
controllers.
We could use those as a starting point to find out what's broken here.

>
> > The difference is now the GPU realizes that an error has happened and
> > initiates recovery, vice before where it seemed to be clueless.
> > If I understand everything correctly, that's because before the bus
> > error was raised by the CPU due to the memory flag, vice now where
> > it's actually the bus raising the alarm.
> >
> > My next question, is this something the driver should set and isn't,
> > or is it just because of the broken cache coherency?
>
> The general rule for userspace mmap()ing PCIe-attached memory and
> handing it off to glibc or anyone else who might assume it's regular
> system RAM is "don't do that". If it's not access size or alignment that
> falls over, it could be atomic operations, MTE tags, or any other
> new-fangled memory innovation. For the ultimate dream of just plugging
> in a card full of RAM, you either need to look back to ISA or forward to
> CXL ;)

So either go back to the really old way of doing things, find and fix
the underlying problem, or wait for the IP to catch up?

>
> Robin.

Thanks!
Peter

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 14:21                         ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-17 14:21 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Simon Xue, open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Jingoo Han, Alex Deucher, Gustavo Pimentel,
	Christian König

On Thu, Mar 17, 2022 at 9:17 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2022-03-17 12:26, Peter Geis wrote:
> > On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 2022-03-17 00:14, Peter Geis wrote:
> >>> Good Evening,

I've added the Designware driver maintainers, since the Rockchip host
driver uses the dwc driver.

> >>>
> >>> I apologize for raising this email chain from the dead, but there have
> >>> been some developments that have introduced even more questions.
> >>> I've looped the Rockchip mailing list into this too, as this affects
> >>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >>>
> >>> TLDR for those not familiar: It seems the rk356x series (and possibly
> >>> the rk3588) were built without any outer coherent cache.
> >>> This means (unless Rockchip wants to clarify here) devices such as the
> >>> ITS and PCIe cannot utilize cache snooping.
> >>> This is based on the results of the email chain [2].
> >>>
> >>> The new circumstances are as follows:
> >>> The RPi CM4 Adventure Team as I've taken to calling them has been
> >>> attempting to get a dGPU working with the very broken Broadcom
> >>> controller in the RPi CM4.
> >>> Recently they acquired a SoQuartz rk3566 module which is pin
> >>> compatible with the CM4, and have taken to trying it out as well.
> >>>
> >>> This is how I got involved.
> >>> It seems they found a trivial way to force the Radeon R600 driver to
> >>> use Non-Cached memory for everything.
> >>> This single line change, combined with using memset_io instead of
> >>> memset, allows the ring tests to pass and the card probes successfully
> >>> (minus the DMA limitations of the rk356x due to the 32 bit
> >>> interconnect).
> >>> I discovered using this method that we start having unaligned io
> >>> memory access faults (bus errors) when running glmark2-drm (running
> >>> glmark2 directly was impossible, as both X and Wayland crashed too
> >>> early).
> >>> I traced this to using what I thought at the time was an unsafe memcpy
> >>> in the mesa stack.
> >>> Rewriting this function to force aligned writes solved the problem and
> >>> allows glmark2-drm to run to completion.
> >>> With some extensive debugging, I found about half a dozen memcpy
> >>> functions in mesa that if forced to be aligned would allow Wayland to
> >>> start, but with hilarious display corruption (see [3]. [4]).
> >>> The CM4 team is convinced this is an issue with memcpy in glibc, but
> >>> I'm not convinced it's that simple.
> >>>
> >>> On my two hour drive in to work this morning, I got to thinking.
> >>> If this was an memcpy fault, this would be universally broken on arm64
> >>> which is obviously not the case.
> >>> So I started thinking, what is different here than with systems known to work:
> >>> 1. No IOMMU for the PCIe controller.
> >>> 2. The Outer Cache Issue.
> >>>
> >>> Robin:
> >>> My questions for you, since you're the smartest person I know about
> >>> arm64 memory management:
> >>> Could cache snooping permit unaligned accesses to IO to be safe?
> >>
> >> No.
> >>
> >>> Or
> >>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> >>
> >> No.
> >>
> >>> Or
> >>> Am I insane here?
> >>
> >> No. (probably)
> >>
> >> CPU access to PCIe has nothing to do with PCIe's access to memory. From
> >> what you've described, my guess is that a GPU BAR gets put in a
> >> non-prefetchable window, such that it ends up mapped as Device memory
> >> (whereas if it were prefetchable it would be Normal Non-Cacheable).
> >
> > Okay, this is perfect and I think you just put me on the right track
> > for identifying the exact issue. Thanks!
> >
> > I've sliced up the non-prefetchable window and given it a prefetchable window.
> > The 256MB BAR now resides in that window.
> > However I'm still getting bus errors, so it seems the prefetch isn't
> > actually happening.
>
> Note that "prefetchable" really just means "no side-effects on reads",
> i.e. we can map it with a Normal memory type that technically *allows*
> the CPU to make speculative accesses because they will not be harmful,
> but that's not to say the CPU will do so. Just that if it did, you
> wouldn't notice anyway.
>
> It's entirely possible that the PCIe IP itself doesn't like unaligned
> accesses, so changing the memory type just moves you from an alignment
> fault to an external abort.

Okay, I've tried setting up PL_COHERENCY_CONTROL_3_OFF, where AxCACHE
can be forced from auto to predefined for reads and writes.
As I understand it, the cache bit should permit characteristic
mismatch to be accepted and prefetch to be enabled, when combined with
the read/write bits.
It doesn't seem to make a difference however.
I got the idea to look for this from the Armada8K and Tegra drivers.

It would be nice to know if dGPUs work at all on *any* DWC based PCIe
controllers.
We could use those as a starting point to find out what's broken here.

>
> > The difference is now the GPU realizes that an error has happened and
> > initiates recovery, vice before where it seemed to be clueless.
> > If I understand everything correctly, that's because before the bus
> > error was raised by the CPU due to the memory flag, vice now where
> > it's actually the bus raising the alarm.
> >
> > My next question, is this something the driver should set and isn't,
> > or is it just because of the broken cache coherency?
>
> The general rule for userspace mmap()ing PCIe-attached memory and
> handing it off to glibc or anyone else who might assume it's regular
> system RAM is "don't do that". If it's not access size or alignment that
> falls over, it could be atomic operations, MTE tags, or any other
> new-fangled memory innovation. For the ultimate dream of just plugging
> in a card full of RAM, you either need to look back to ISA or forward to
> CXL ;)

So either go back to the really old way of doing things, find and fix
the underlying problem, or wait for the IP to catch up?

>
> Robin.

Thanks!
Peter

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17  9:14                   ` Christian König
@ 2022-03-17 20:27                     ` Alex Deucher
  -1 siblings, 0 replies; 45+ messages in thread
From: Alex Deucher @ 2022-03-17 20:27 UTC (permalink / raw)
  To: Christian König
  Cc: Peter Geis, Kever Yang, Robin Murphy, Shawn Lin,
	Christian König, Deucher, Alexander, amd-gfx list,
	open list:ARM/Rockchip SoC...

On Thu, Mar 17, 2022 at 5:15 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Peter,
>
> Am 17.03.22 um 01:14 schrieb Peter Geis:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
>
> well, as far as I know that is a clear violation of the PCIe specification.
>
> Coherent access to system memory is simply a must have.
>
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
>
> Yeah, you basically just force it into AGP mode :)
>
> There is just absolutely no guarantee that this works reliable.

It might not be too bad if we use the internal GART rather than AGP.
The challenge will be allocating uncached system memory.  I think that
tended to be the problem on most non-x86 platforms.

Alex


>
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
>
> Yes exactly that.
>
> Both OpenGL and Vulkan allow the application to mmap() device memory and
> do any memory access they want with that.
>
> This means that changing memcpy is just a futile effort, it's still
> possible for the application to make an unaligned memory access and that
> is perfectly valid.
>
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
>
> Oh, very good point. I would be interested in that as answer as well.
>
> Regards,
> Christian.
>
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> > Or
> > Am I insane here?
> >
> > Rockchip:
> > Please update on the status for the Outer Cache errata for ITS services.
> > Please provide an answer to the errata of the PCIe controller, in
> > regard to cache snooping and buffering, for both the rk356x and the
> > upcoming rk3588.
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ZL3jA2VrnynWbUdFG6naaqrZqcnKRq338n%2Bj50DRa74%3D&amp;reserved=0
> > [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=QZy%2Bt%2Fus5f3yxwrHmXpzerXngPpKp3i9ZsF1UJ%2BHvlU%3D&amp;reserved=0
> > [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=c29bc87hxyIvnsBK3Fo7FbF7RwJcFr%2FjgBrLIiBb%2FyY%3D&amp;reserved=0
> > [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fwygTk%2BDzdla67rdAYb44vlivlby9lFwtcgjLfJEH4A%3D&amp;reserved=0
> >
> > Thank you everyone for your time.
> >
> > Very Respectfully,
> > Peter Geis
> >
> > On Wed, May 26, 2021 at 7:21 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>> On 2021-05-26 10:42, Christian König wrote:
> >>>> Hi Robin,
> >>>>
> >>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>>>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Good Evening,
> >>>>>>>>>
> >>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>>>> prototype SBC.
> >>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>>>> modeset
> >>>>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>>>
> >>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>>>> kernel.
> >>>>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>>>> for the driver to operate.
> >>>>>>> Ah, most likely not.
> >>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>>>
> >>>>>>> Is there no way to work around this or is it dead in the water?
> >>>>>> It's required by the pcie spec.  You could potentially work around it
> >>>>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>>>> particular platform supports cache snooping or not as well.
> >>>>> There's device_get_dma_attr(), although I don't think it will work
> >>>>> currently for PCI devices without an OF or ACPI node - we could
> >>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>>>> to the host bridge's firmware description as necessary.
> >>>>>
> >>>>> The common DMA ops *do* correctly keep track of per-device coherency
> >>>>> internally, but drivers aren't supposed to be poking at that
> >>>>> information directly.
> >>>> That sounds like you underestimate the problem. ARM has unfortunately
> >>>> made the coherency for PCI an optional IP.
> >>> Sorry to be that guy, but I'm involved a lot internally with our
> >>> system IP and interconnect, and I probably understand the situation
> >>> better than 99% of the community ;)
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >>> For the record, the SBSA specification (the closet thing we have to a
> >>> "system architecture") does require that PCIe is integrated in an
> >>> I/O-coherent manner, but we don't have any control over what people do
> >>> in embedded applications (note that we don't make PCIe IP at all, and
> >>> there is plenty of 3rd-party interconnect IP).
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >>>> So we are talking about a hardware limitation which potentially can't
> >>>> be fixed without replacing the hardware.
> >>> You expressed interest in "some way to detect if a particular platform
> >>> supports cache snooping or not", by which I assumed you meant a
> >>> software method for the amdgpu/radeon drivers to call, rather than,
> >>> say, a website that driver maintainers can look up SoC names on. I'm
> >>> saying that that API already exists (just may need a bit more work).
> >>> Note that it is emphatically not a platform-level thing since
> >>> coherency can and does vary per device within a system.
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >>> I wasn't suggesting that Linux could somehow make coherency magically
> >>> work when the signals don't physically exist in the interconnect - I
> >>> was assuming you'd merely want to do something like throw a big
> >>> warning and taint the kernel to help triage bug reports. Some drivers
> >>> like ahci_qoriq and panfrost simply need to know so they can program
> >>> their device to emit the appropriate memory attributes either way, and
> >>> rely on the DMA API to hide the rest of the difference, but if you
> >>> want to treat non-coherent use as unsupported because it would require
> >>> too invasive changes that's fine by me.
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >>> Robin.
>

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-17 20:27                     ` Alex Deucher
  0 siblings, 0 replies; 45+ messages in thread
From: Alex Deucher @ 2022-03-17 20:27 UTC (permalink / raw)
  To: Christian König
  Cc: Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	open list:ARM/Rockchip SoC...,
	Peter Geis, Deucher, Alexander, Robin Murphy

On Thu, Mar 17, 2022 at 5:15 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Peter,
>
> Am 17.03.22 um 01:14 schrieb Peter Geis:
> > Good Evening,
> >
> > I apologize for raising this email chain from the dead, but there have
> > been some developments that have introduced even more questions.
> > I've looped the Rockchip mailing list into this too, as this affects
> > rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >
> > TLDR for those not familiar: It seems the rk356x series (and possibly
> > the rk3588) were built without any outer coherent cache.
> > This means (unless Rockchip wants to clarify here) devices such as the
> > ITS and PCIe cannot utilize cache snooping.
>
> well, as far as I know that is a clear violation of the PCIe specification.
>
> Coherent access to system memory is simply a must have.
>
> > This is based on the results of the email chain [2].
> >
> > The new circumstances are as follows:
> > The RPi CM4 Adventure Team as I've taken to calling them has been
> > attempting to get a dGPU working with the very broken Broadcom
> > controller in the RPi CM4.
> > Recently they acquired a SoQuartz rk3566 module which is pin
> > compatible with the CM4, and have taken to trying it out as well.
> >
> > This is how I got involved.
> > It seems they found a trivial way to force the Radeon R600 driver to
> > use Non-Cached memory for everything.
>
> Yeah, you basically just force it into AGP mode :)
>
> There is just absolutely no guarantee that this works reliable.

It might not be too bad if we use the internal GART rather than AGP.
The challenge will be allocating uncached system memory.  I think that
tended to be the problem on most non-x86 platforms.

Alex


>
> > This single line change, combined with using memset_io instead of
> > memset, allows the ring tests to pass and the card probes successfully
> > (minus the DMA limitations of the rk356x due to the 32 bit
> > interconnect).
> > I discovered using this method that we start having unaligned io
> > memory access faults (bus errors) when running glmark2-drm (running
> > glmark2 directly was impossible, as both X and Wayland crashed too
> > early).
> > I traced this to using what I thought at the time was an unsafe memcpy
> > in the mesa stack.
> > Rewriting this function to force aligned writes solved the problem and
> > allows glmark2-drm to run to completion.
> > With some extensive debugging, I found about half a dozen memcpy
> > functions in mesa that if forced to be aligned would allow Wayland to
> > start, but with hilarious display corruption (see [3]. [4]).
> > The CM4 team is convinced this is an issue with memcpy in glibc, but
> > I'm not convinced it's that simple.
>
> Yes exactly that.
>
> Both OpenGL and Vulkan allow the application to mmap() device memory and
> do any memory access they want with that.
>
> This means that changing memcpy is just a futile effort, it's still
> possible for the application to make an unaligned memory access and that
> is perfectly valid.
>
> > On my two hour drive in to work this morning, I got to thinking.
> > If this was an memcpy fault, this would be universally broken on arm64
> > which is obviously not the case.
> > So I started thinking, what is different here than with systems known to work:
> > 1. No IOMMU for the PCIe controller.
> > 2. The Outer Cache Issue.
>
> Oh, very good point. I would be interested in that as answer as well.
>
> Regards,
> Christian.
>
> >
> > Robin:
> > My questions for you, since you're the smartest person I know about
> > arm64 memory management:
> > Could cache snooping permit unaligned accesses to IO to be safe?
> > Or
> > Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> > Or
> > Am I insane here?
> >
> > Rockchip:
> > Please update on the status for the Outer Cache errata for ITS services.
> > Please provide an answer to the errata of the PCIe controller, in
> > regard to cache snooping and buffering, for both the rk356x and the
> > upcoming rk3588.
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ZL3jA2VrnynWbUdFG6naaqrZqcnKRq338n%2Bj50DRa74%3D&amp;reserved=0
> > [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=QZy%2Bt%2Fus5f3yxwrHmXpzerXngPpKp3i9ZsF1UJ%2BHvlU%3D&amp;reserved=0
> > [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=c29bc87hxyIvnsBK3Fo7FbF7RwJcFr%2FjgBrLIiBb%2FyY%3D&amp;reserved=0
> > [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C4ae2dfa3e8ec4a765f8a08da07ab1cb2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637830728762044450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fwygTk%2BDzdla67rdAYb44vlivlby9lFwtcgjLfJEH4A%3D&amp;reserved=0
> >
> > Thank you everyone for your time.
> >
> > Very Respectfully,
> > Peter Geis
> >
> > On Wed, May 26, 2021 at 7:21 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>> On 2021-05-26 10:42, Christian König wrote:
> >>>> Hi Robin,
> >>>>
> >>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>>>> On 2021-05-25 14:05, Alex Deucher wrote:
> >>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>> wrote:
> >>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >>>>>>> <alexdeucher@gmail.com> wrote:
> >>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> Good Evening,
> >>>>>>>>>
> >>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> >>>>>>>>> prototype SBC.
> >>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >>>>>>>>> controller, which makes a dGPU theoretically possible.
> >>>>>>>>> While attempting to light off a HD7570 card I manage to get a
> >>>>>>>>> modeset
> >>>>>>>>> console, but ring0 test fails and disables acceleration.
> >>>>>>>>>
> >>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >>>>>>>>> kernel.
> >>>>>>>>> Any insight you can provide would be much appreciated.
> >>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> >>>>>>>> for the driver to operate.
> >>>>>>> Ah, most likely not.
> >>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
> >>>>>>> the CPUs, so I doubt the PCIe controller can either.
> >>>>>>>
> >>>>>>> Is there no way to work around this or is it dead in the water?
> >>>>>> It's required by the pcie spec.  You could potentially work around it
> >>>>>> if you can allocate uncached memory for DMA, but I don't think that is
> >>>>>> possible currently.  Ideally we'd figure out some way to detect if a
> >>>>>> particular platform supports cache snooping or not as well.
> >>>>> There's device_get_dma_attr(), although I don't think it will work
> >>>>> currently for PCI devices without an OF or ACPI node - we could
> >>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
> >>>>> to the host bridge's firmware description as necessary.
> >>>>>
> >>>>> The common DMA ops *do* correctly keep track of per-device coherency
> >>>>> internally, but drivers aren't supposed to be poking at that
> >>>>> information directly.
> >>>> That sounds like you underestimate the problem. ARM has unfortunately
> >>>> made the coherency for PCI an optional IP.
> >>> Sorry to be that guy, but I'm involved a lot internally with our
> >>> system IP and interconnect, and I probably understand the situation
> >>> better than 99% of the community ;)
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >>> For the record, the SBSA specification (the closet thing we have to a
> >>> "system architecture") does require that PCIe is integrated in an
> >>> I/O-coherent manner, but we don't have any control over what people do
> >>> in embedded applications (note that we don't make PCIe IP at all, and
> >>> there is plenty of 3rd-party interconnect IP).
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >>>> So we are talking about a hardware limitation which potentially can't
> >>>> be fixed without replacing the hardware.
> >>> You expressed interest in "some way to detect if a particular platform
> >>> supports cache snooping or not", by which I assumed you meant a
> >>> software method for the amdgpu/radeon drivers to call, rather than,
> >>> say, a website that driver maintainers can look up SoC names on. I'm
> >>> saying that that API already exists (just may need a bit more work).
> >>> Note that it is emphatically not a platform-level thing since
> >>> coherency can and does vary per device within a system.
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >>> I wasn't suggesting that Linux could somehow make coherency magically
> >>> work when the signals don't physically exist in the interconnect - I
> >>> was assuming you'd merely want to do something like throw a big
> >>> warning and taint the kernel to help triage bug reports. Some drivers
> >>> like ahci_qoriq and panfrost simply need to know so they can program
> >>> their device to emit the appropriate memory attributes either way, and
> >>> rely on the DMA API to hide the rest of the difference, but if you
> >>> want to treat non-coherent use as unsupported because it would require
> >>> too invasive changes that's fine by me.
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >>> Robin.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17 12:19                     ` Peter Geis
@ 2022-03-18  7:51                       ` Kever Yang
  -1 siblings, 0 replies; 45+ messages in thread
From: Kever Yang @ 2022-03-18  7:51 UTC (permalink / raw)
  To: Peter Geis
  Cc: Robin Murphy, Shawn Lin, Christian König,
	Christian König, Alex Deucher, Deucher, Alexander,
	amd-gfx list, open list:ARM/Rockchip SoC...,
	Tao Huang


On 2022/3/17 20:19, Peter Geis wrote:
> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>> Hi Peter,
>>
>> On 2022/3/17 08:14, Peter Geis wrote:
>>> Good Evening,
>>>
>>> I apologize for raising this email chain from the dead, but there have
>>> been some developments that have introduced even more questions.
>>> I've looped the Rockchip mailing list into this too, as this affects
>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>
>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>> the rk3588) were built without any outer coherent cache.
>>> This means (unless Rockchip wants to clarify here) devices such as the
>>> ITS and PCIe cannot utilize cache snooping.
>>> This is based on the results of the email chain [2].
>>>
>>> The new circumstances are as follows:
>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>> attempting to get a dGPU working with the very broken Broadcom
>>> controller in the RPi CM4.
>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>> compatible with the CM4, and have taken to trying it out as well.
>>>
>>> This is how I got involved.
>>> It seems they found a trivial way to force the Radeon R600 driver to
>>> use Non-Cached memory for everything.
>>> This single line change, combined with using memset_io instead of
>>> memset, allows the ring tests to pass and the card probes successfully
>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>> interconnect).
>>> I discovered using this method that we start having unaligned io
>>> memory access faults (bus errors) when running glmark2-drm (running
>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>> early).
>>> I traced this to using what I thought at the time was an unsafe memcpy
>>> in the mesa stack.
>>> Rewriting this function to force aligned writes solved the problem and
>>> allows glmark2-drm to run to completion.
>>> With some extensive debugging, I found about half a dozen memcpy
>>> functions in mesa that if forced to be aligned would allow Wayland to
>>> start, but with hilarious display corruption (see [3]. [4]).
>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>> I'm not convinced it's that simple.
>>>
>>> On my two hour drive in to work this morning, I got to thinking.
>>> If this was an memcpy fault, this would be universally broken on arm64
>>> which is obviously not the case.
>>> So I started thinking, what is different here than with systems known to work:
>>> 1. No IOMMU for the PCIe controller.
>>> 2. The Outer Cache Issue.
>>>
>>> Robin:
>>> My questions for you, since you're the smartest person I know about
>>> arm64 memory management:
>>> Could cache snooping permit unaligned accesses to IO to be safe?
>>> Or
>>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>>> Or
>>> Am I insane here?
>>>
>>> Rockchip:
>>> Please update on the status for the Outer Cache errata for ITS services.
>> Our SoC design team has double check with ARM GIC/ITS IP team for many
>> times, and the GITS_CBASER
>> of GIC600 IP does not support hardware bind or config to a fix value, so
>> they insist this is an IP
>> limitation instead of a SoC bug, software should take  care of it :(
>> I will check again if we can provide errata for this issue.
> Thanks. This is necessary as the mbi-alias provides an imperfect
> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
> 10G NIC) to misbehave.
>
>>> Please provide an answer to the errata of the PCIe controller, in
>>> regard to cache snooping and buffering, for both the rk356x and the
>>> upcoming rk3588.
>>
>> Sorry, what is this?
> Part of the ITS bug is it expects to be cache coherent with the CPU
> cluster by design.
> Due to the rk356x being implemented without an outer accessible cache,
> the ITS and other devices that require cache coherency (PCIe for
> example) crash in fun ways.
Then this is still the ITS issue, not PCIe issue.
PCIe is a peripheral bus controller like USB and other device, the 
driver should maintain the "cache coherency" if there is any, and there 
is no requirement for hardware cache coherency between PCIe and CPU.
We didn't see any transfer error on rk356x PCIe till now, we can take a 
look if it's easy to reproduce.

Thanks,
- Kever


> This means that rk356x cannot implement a specification compliant ITS or PCIe.
> >From the rk3588 source dump it appears it was produced without an
> outer accessible cache, which means if true it also will be unable to
> use any PCIe cards that implement cache coherency as part of their
> design.
>
>>
>> Thanks,
>> - Kever
>>> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
>>> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
>>> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
>>> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>>>
>>> Thank you everyone for your time.
>>>
>>> Very Respectfully,
>>> Peter Geis
>>>
>>> On Wed, May 26, 2021 at 7:21 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Hi Robin,
>>>>
>>>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>>>> On 2021-05-26 10:42, Christian König wrote:
>>>>>> Hi Robin,
>>>>>>
>>>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Good Evening,
>>>>>>>>>>>
>>>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>>>> prototype SBC.
>>>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>>>> modeset
>>>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>>>
>>>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>>>> kernel.
>>>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>>>>>> for the driver to operate.
>>>>>>>>> Ah, most likely not.
>>>>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>>>
>>>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>>>> It's required by the pcie spec.  You could potentially work around it
>>>>>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>>>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>>>>>> particular platform supports cache snooping or not as well.
>>>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>>>> to the host bridge's firmware description as necessary.
>>>>>>>
>>>>>>> The common DMA ops *do* correctly keep track of per-device coherency
>>>>>>> internally, but drivers aren't supposed to be poking at that
>>>>>>> information directly.
>>>>>> That sounds like you underestimate the problem. ARM has unfortunately
>>>>>> made the coherency for PCI an optional IP.
>>>>> Sorry to be that guy, but I'm involved a lot internally with our
>>>>> system IP and interconnect, and I probably understand the situation
>>>>> better than 99% of the community ;)
>>>> I need to apologize, didn't realized who was answering :)
>>>>
>>>> It just sounded to me that you wanted to suggest to the end user that
>>>> this is fixable in software and I really wanted to avoid even more
>>>> customers coming around asking how to do this.
>>>>
>>>>> For the record, the SBSA specification (the closet thing we have to a
>>>>> "system architecture") does require that PCIe is integrated in an
>>>>> I/O-coherent manner, but we don't have any control over what people do
>>>>> in embedded applications (note that we don't make PCIe IP at all, and
>>>>> there is plenty of 3rd-party interconnect IP).
>>>> So basically it is not the fault of the ARM IP-core, but people are just
>>>> stitching together PCIe interconnect IP with a core where it is not
>>>> supposed to be used with.
>>>>
>>>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>>>
>>>>>> So we are talking about a hardware limitation which potentially can't
>>>>>> be fixed without replacing the hardware.
>>>>> You expressed interest in "some way to detect if a particular platform
>>>>> supports cache snooping or not", by which I assumed you meant a
>>>>> software method for the amdgpu/radeon drivers to call, rather than,
>>>>> say, a website that driver maintainers can look up SoC names on. I'm
>>>>> saying that that API already exists (just may need a bit more work).
>>>>> Note that it is emphatically not a platform-level thing since
>>>>> coherency can and does vary per device within a system.
>>>> Well, I think this is not something an individual driver should mess
>>>> with. What the driver should do is just express that it needs coherent
>>>> access to all of system memory and if that is not possible fail to load
>>>> with a warning why it is not possible.
>>>>
>>>>> I wasn't suggesting that Linux could somehow make coherency magically
>>>>> work when the signals don't physically exist in the interconnect - I
>>>>> was assuming you'd merely want to do something like throw a big
>>>>> warning and taint the kernel to help triage bug reports. Some drivers
>>>>> like ahci_qoriq and panfrost simply need to know so they can program
>>>>> their device to emit the appropriate memory attributes either way, and
>>>>> rely on the DMA API to hide the rest of the difference, but if you
>>>>> want to treat non-coherent use as unsupported because it would require
>>>>> too invasive changes that's fine by me.
>>>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>>>> at least the Vulkan userspace API specification requires devices to have
>>>> coherent access to system memory.
>>>>
>>>> So even if I would want to do this it is simply not possible because the
>>>> application doesn't tell the driver which memory is accessed by the
>>>> device and which by the CPU.
>>>>
>>>> Christian.
>>>>
>>>>> Robin.
>> _______________________________________________
>> Linux-rockchip mailing list
>> Linux-rockchip@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-rockchip

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-18  7:51                       ` Kever Yang
  0 siblings, 0 replies; 45+ messages in thread
From: Kever Yang @ 2022-03-18  7:51 UTC (permalink / raw)
  To: Peter Geis
  Cc: Tao Huang, open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, amd-gfx list, Deucher,
	Alexander, Alex Deucher, Robin Murphy, Christian König


On 2022/3/17 20:19, Peter Geis wrote:
> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>> Hi Peter,
>>
>> On 2022/3/17 08:14, Peter Geis wrote:
>>> Good Evening,
>>>
>>> I apologize for raising this email chain from the dead, but there have
>>> been some developments that have introduced even more questions.
>>> I've looped the Rockchip mailing list into this too, as this affects
>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>
>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>> the rk3588) were built without any outer coherent cache.
>>> This means (unless Rockchip wants to clarify here) devices such as the
>>> ITS and PCIe cannot utilize cache snooping.
>>> This is based on the results of the email chain [2].
>>>
>>> The new circumstances are as follows:
>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>> attempting to get a dGPU working with the very broken Broadcom
>>> controller in the RPi CM4.
>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>> compatible with the CM4, and have taken to trying it out as well.
>>>
>>> This is how I got involved.
>>> It seems they found a trivial way to force the Radeon R600 driver to
>>> use Non-Cached memory for everything.
>>> This single line change, combined with using memset_io instead of
>>> memset, allows the ring tests to pass and the card probes successfully
>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>> interconnect).
>>> I discovered using this method that we start having unaligned io
>>> memory access faults (bus errors) when running glmark2-drm (running
>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>> early).
>>> I traced this to using what I thought at the time was an unsafe memcpy
>>> in the mesa stack.
>>> Rewriting this function to force aligned writes solved the problem and
>>> allows glmark2-drm to run to completion.
>>> With some extensive debugging, I found about half a dozen memcpy
>>> functions in mesa that if forced to be aligned would allow Wayland to
>>> start, but with hilarious display corruption (see [3]. [4]).
>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>> I'm not convinced it's that simple.
>>>
>>> On my two hour drive in to work this morning, I got to thinking.
>>> If this was an memcpy fault, this would be universally broken on arm64
>>> which is obviously not the case.
>>> So I started thinking, what is different here than with systems known to work:
>>> 1. No IOMMU for the PCIe controller.
>>> 2. The Outer Cache Issue.
>>>
>>> Robin:
>>> My questions for you, since you're the smartest person I know about
>>> arm64 memory management:
>>> Could cache snooping permit unaligned accesses to IO to be safe?
>>> Or
>>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>>> Or
>>> Am I insane here?
>>>
>>> Rockchip:
>>> Please update on the status for the Outer Cache errata for ITS services.
>> Our SoC design team has double check with ARM GIC/ITS IP team for many
>> times, and the GITS_CBASER
>> of GIC600 IP does not support hardware bind or config to a fix value, so
>> they insist this is an IP
>> limitation instead of a SoC bug, software should take  care of it :(
>> I will check again if we can provide errata for this issue.
> Thanks. This is necessary as the mbi-alias provides an imperfect
> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
> 10G NIC) to misbehave.
>
>>> Please provide an answer to the errata of the PCIe controller, in
>>> regard to cache snooping and buffering, for both the rk356x and the
>>> upcoming rk3588.
>>
>> Sorry, what is this?
> Part of the ITS bug is it expects to be cache coherent with the CPU
> cluster by design.
> Due to the rk356x being implemented without an outer accessible cache,
> the ITS and other devices that require cache coherency (PCIe for
> example) crash in fun ways.
Then this is still the ITS issue, not PCIe issue.
PCIe is a peripheral bus controller like USB and other device, the 
driver should maintain the "cache coherency" if there is any, and there 
is no requirement for hardware cache coherency between PCIe and CPU.
We didn't see any transfer error on rk356x PCIe till now, we can take a 
look if it's easy to reproduce.

Thanks,
- Kever


> This means that rk356x cannot implement a specification compliant ITS or PCIe.
> >From the rk3588 source dump it appears it was produced without an
> outer accessible cache, which means if true it also will be unable to
> use any PCIe cards that implement cache coherency as part of their
> design.
>
>>
>> Thanks,
>> - Kever
>>> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
>>> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
>>> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
>>> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>>>
>>> Thank you everyone for your time.
>>>
>>> Very Respectfully,
>>> Peter Geis
>>>
>>> On Wed, May 26, 2021 at 7:21 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Hi Robin,
>>>>
>>>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>>>> On 2021-05-26 10:42, Christian König wrote:
>>>>>> Hi Robin,
>>>>>>
>>>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Good Evening,
>>>>>>>>>>>
>>>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>>>> prototype SBC.
>>>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>>>> modeset
>>>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>>>
>>>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>>>> kernel.
>>>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>>>>>> for the driver to operate.
>>>>>>>>> Ah, most likely not.
>>>>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>>>
>>>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>>>> It's required by the pcie spec.  You could potentially work around it
>>>>>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>>>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>>>>>> particular platform supports cache snooping or not as well.
>>>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>>>> to the host bridge's firmware description as necessary.
>>>>>>>
>>>>>>> The common DMA ops *do* correctly keep track of per-device coherency
>>>>>>> internally, but drivers aren't supposed to be poking at that
>>>>>>> information directly.
>>>>>> That sounds like you underestimate the problem. ARM has unfortunately
>>>>>> made the coherency for PCI an optional IP.
>>>>> Sorry to be that guy, but I'm involved a lot internally with our
>>>>> system IP and interconnect, and I probably understand the situation
>>>>> better than 99% of the community ;)
>>>> I need to apologize, didn't realized who was answering :)
>>>>
>>>> It just sounded to me that you wanted to suggest to the end user that
>>>> this is fixable in software and I really wanted to avoid even more
>>>> customers coming around asking how to do this.
>>>>
>>>>> For the record, the SBSA specification (the closet thing we have to a
>>>>> "system architecture") does require that PCIe is integrated in an
>>>>> I/O-coherent manner, but we don't have any control over what people do
>>>>> in embedded applications (note that we don't make PCIe IP at all, and
>>>>> there is plenty of 3rd-party interconnect IP).
>>>> So basically it is not the fault of the ARM IP-core, but people are just
>>>> stitching together PCIe interconnect IP with a core where it is not
>>>> supposed to be used with.
>>>>
>>>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>>>
>>>>>> So we are talking about a hardware limitation which potentially can't
>>>>>> be fixed without replacing the hardware.
>>>>> You expressed interest in "some way to detect if a particular platform
>>>>> supports cache snooping or not", by which I assumed you meant a
>>>>> software method for the amdgpu/radeon drivers to call, rather than,
>>>>> say, a website that driver maintainers can look up SoC names on. I'm
>>>>> saying that that API already exists (just may need a bit more work).
>>>>> Note that it is emphatically not a platform-level thing since
>>>>> coherency can and does vary per device within a system.
>>>> Well, I think this is not something an individual driver should mess
>>>> with. What the driver should do is just express that it needs coherent
>>>> access to all of system memory and if that is not possible fail to load
>>>> with a warning why it is not possible.
>>>>
>>>>> I wasn't suggesting that Linux could somehow make coherency magically
>>>>> work when the signals don't physically exist in the interconnect - I
>>>>> was assuming you'd merely want to do something like throw a big
>>>>> warning and taint the kernel to help triage bug reports. Some drivers
>>>>> like ahci_qoriq and panfrost simply need to know so they can program
>>>>> their device to emit the appropriate memory attributes either way, and
>>>>> rely on the DMA API to hide the rest of the difference, but if you
>>>>> want to treat non-coherent use as unsupported because it would require
>>>>> too invasive changes that's fine by me.
>>>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>>>> at least the Vulkan userspace API specification requires devices to have
>>>> coherent access to system memory.
>>>>
>>>> So even if I would want to do this it is simply not possible because the
>>>> application doesn't tell the driver which memory is accessed by the
>>>> device and which by the CPU.
>>>>
>>>> Christian.
>>>>
>>>>> Robin.
>> _______________________________________________
>> Linux-rockchip mailing list
>> Linux-rockchip@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-18  7:51                       ` Kever Yang
  (?)
@ 2022-03-18  8:35                       ` Christian König
  2022-03-18 11:24                           ` Peter Geis
  -1 siblings, 1 reply; 45+ messages in thread
From: Christian König @ 2022-03-18  8:35 UTC (permalink / raw)
  To: Kever Yang, Peter Geis
  Cc: Tao Huang, open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, amd-gfx list, Deucher,
	Alexander, Alex Deucher, Robin Murphy

[-- Attachment #1: Type: text/plain, Size: 13703 bytes --]



Am 18.03.22 um 08:51 schrieb Kever Yang:
>
> On 2022/3/17 20:19, Peter Geis wrote:
>> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang 
>> <kever.yang@rock-chips.com> wrote:
>>> Hi Peter,
>>>
>>> On 2022/3/17 08:14, Peter Geis wrote:
>>>> Good Evening,
>>>>
>>>> I apologize for raising this email chain from the dead, but there have
>>>> been some developments that have introduced even more questions.
>>>> I've looped the Rockchip mailing list into this too, as this affects
>>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>>
>>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>>> the rk3588) were built without any outer coherent cache.
>>>> This means (unless Rockchip wants to clarify here) devices such as the
>>>> ITS and PCIe cannot utilize cache snooping.
>>>> This is based on the results of the email chain [2].
>>>>
>>>> The new circumstances are as follows:
>>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>>> attempting to get a dGPU working with the very broken Broadcom
>>>> controller in the RPi CM4.
>>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>>> compatible with the CM4, and have taken to trying it out as well.
>>>>
>>>> This is how I got involved.
>>>> It seems they found a trivial way to force the Radeon R600 driver to
>>>> use Non-Cached memory for everything.
>>>> This single line change, combined with using memset_io instead of
>>>> memset, allows the ring tests to pass and the card probes successfully
>>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>>> interconnect).
>>>> I discovered using this method that we start having unaligned io
>>>> memory access faults (bus errors) when running glmark2-drm (running
>>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>>> early).
>>>> I traced this to using what I thought at the time was an unsafe memcpy
>>>> in the mesa stack.
>>>> Rewriting this function to force aligned writes solved the problem and
>>>> allows glmark2-drm to run to completion.
>>>> With some extensive debugging, I found about half a dozen memcpy
>>>> functions in mesa that if forced to be aligned would allow Wayland to
>>>> start, but with hilarious display corruption (see [3]. [4]).
>>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>>> I'm not convinced it's that simple.
>>>>
>>>> On my two hour drive in to work this morning, I got to thinking.
>>>> If this was an memcpy fault, this would be universally broken on arm64
>>>> which is obviously not the case.
>>>> So I started thinking, what is different here than with systems 
>>>> known to work:
>>>> 1. No IOMMU for the PCIe controller.
>>>> 2. The Outer Cache Issue.
>>>>
>>>> Robin:
>>>> My questions for you, since you're the smartest person I know about
>>>> arm64 memory management:
>>>> Could cache snooping permit unaligned accesses to IO to be safe?
>>>> Or
>>>> Is it the lack of an IOMMU that's causing the ali gnment faults to 
>>>> become fatal?
>>>> Or
>>>> Am I insane here?
>>>>
>>>> Rockchip:
>>>> Please update on the status for the Outer Cache errata for ITS 
>>>> services.
>>> Our SoC design team has double check with ARM GIC/ITS IP team for many
>>> times, and the GITS_CBASER
>>> of GIC600 IP does not support hardware bind or config to a fix 
>>> value, so
>>> they insist this is an IP
>>> limitation instead of a SoC bug, software should take  care of it :(
>>> I will check again if we can provide errata for this issue.
>> Thanks. This is necessary as the mbi-alias provides an imperfect
>> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
>> 10G NIC) to misbehave.
>>
>>>> Please provide an answer to the errata of the PCIe controller, in
>>>> regard to cache snooping and buffering, for both the rk356x and the
>>>> upcoming rk3588.
>>>
>>> Sorry, what is this?
>> Part of the ITS bug is it expects to be cache coherent with the CPU
>> cluster by design.
>> Due to the rk356x being implemented without an outer accessible cache,
>> the ITS and other devices that require cache coherency (PCIe for
>> example) crash in fun ways.
> Then this is still the ITS issue, not PCIe issue.
> PCIe is a peripheral bus controller like USB and other device, the 
> driver should maintain the "cache coherency" if there is any, and 
> there is no requirement for hardware cache coherency between PCIe and CPU.

Well then I suggest to re-read the PCIe specification.

Cache coherency is defined as mandatory there. Non-cache coherency is an 
optional feature.

See section 2.2.6.5 in the PCIe 2.0 specification for a good example.

Regards,
Christian.

>
> We didn't see any transfer error on rk356x PCIe till now, we can take 
> a look if it's easy to reproduce.
>
> Thanks,
> - Kever
>
>
>> This means that rk356x cannot implement a specification compliant ITS 
>> or PCIe.
>> >From the rk3588 source dump it appears it was produced without an
>> outer accessible cache, which means if true it also will be unable to
>> use any PCIe cards that implement cache coherency as part of their
>> design.
>>
>>>
>>> Thanks,
>>> - Kever
>>>> [1] 
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=LcwZggIwIqjvzjDH2DUnIDwxsgk7WmhE9LK13knx36E%3D&amp;reserved=0
>>>> [2] 
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fXALLO1EnGi2s8pClt6aMrUlzqDy2KDO8wzpi033qtU%3D&amp;reserved=0
>>>> [3] 
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=tx%2Bw9ayScUTftjWAFL0GY%2FADQswxEJGRUhgxDw2TSzQ%3D&amp;reserved=0
>>>> [4] 
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=8VXuZvQAhD%2FsQBJ6WEXe0YElD6wCI675oxqHesKhclY%3D&amp;reserved=0
>>>>
>>>> Thank you everyone for your time.
>>>>
>>>> Very Respectfully,
>>>> Peter Geis
>>>>
>>>> On Wed, May 26, 2021 at 7:21 AM Christian König
>>>> <christian.koenig@amd.com> wrote:
>>>>> Hi Robin,
>>>>>
>>>>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>>>>> On 2021-05-26 10:42, Christian König wrote:
>>>>>>> Hi Robin,
>>>>>>>
>>>>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>>>>> <alexdeucher@gmail.com> wrote:
>>>>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis 
>>>>>>>>>>> <pgwipeout@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Good Evening,
>>>>>>>>>>>>
>>>>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>>>>> prototype SBC.
>>>>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>>>>> modeset
>>>>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>>>>
>>>>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>>>>> kernel.
>>>>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>>>>> Does your platform support PCIe cache coherency with the 
>>>>>>>>>>> CPU?  I.e.,
>>>>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is 
>>>>>>>>>>> required
>>>>>>>>>>> for the driver to operate.
>>>>>>>>>> Ah, most likely not.
>>>>>>>>>> This issue has come up already as the GIC isn't permitted to 
>>>>>>>>>> snoop on
>>>>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>>>>
>>>>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>>>>> It's required by the pcie spec.  You could potentially work 
>>>>>>>>> around it
>>>>>>>>> if you can allocate uncached memory for DMA, but I don't think 
>>>>>>>>> that is
>>>>>>>>> possible currently.  Ideally we'd figure out some way to 
>>>>>>>>> detect if a
>>>>>>>>> particular platform supports cache snooping or not as well.
>>>>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>>>>> to the host bridge's firmware description as necessary.
>>>>>>>>
>>>>>>>> The common DMA ops *do* correctly keep track of per-device 
>>>>>>>> coherency
>>>>>>>> internally, but drivers aren't supposed to be poking at that
>>>>>>>> information directly.
>>>>>>> That sounds like you underestimate the problem. ARM has 
>>>>>>> unfortunately
>>>>>>> made the coherency for PCI an optional IP.
>>>>>> Sorry to be that guy, but I'm involved a lot internally with our
>>>>>> system IP and interconnect, and I probably understand the situation
>>>>>> better than 99% of the community ;)
>>>>> I need to apologize, didn't realized who was answering :)
>>>>>
>>>>> It just sounded to me that you wanted to suggest to the end user that
>>>>> this is fixable in software and I really wanted to avoid even more
>>>>> customers coming around asking how to do this.
>>>>>
>>>>>> For the record, the SBSA specification (the closet thing we have 
>>>>>> to a
>>>>>> "system architecture") does require that PCIe is integrated in an
>>>>>> I/O-coherent manner, but we don't have any control over what 
>>>>>> people do
>>>>>> in embedded applications (note that we don't make PCIe IP at all, 
>>>>>> and
>>>>>> there is plenty of 3rd-party interconnect IP).
>>>>> So basically it is not the fault of the ARM IP-core, but people 
>>>>> are just
>>>>> stitching together PCIe interconnect IP with a core where it is not
>>>>> supposed to be used with.
>>>>>
>>>>> Do I get that correctly? That's an interesting puzzle piece in the 
>>>>> picture.
>>>>>
>>>>>>> So we are talking about a hardware limitation which potentially 
>>>>>>> can't
>>>>>>> be fixed without replacing the hardware.
>>>>>> You expressed interest in "some way to detect if a particular 
>>>>>> platform
>>>>>> supports cache snooping or not", by which I assumed you meant a
>>>>>> software method for the amdgpu/radeon drivers to call, rather than,
>>>>>> say, a website that driver maintainers can look up SoC names on. I'm
>>>>>> saying that that API already exists (just may need a bit more work).
>>>>>> Note that it is emphatically not a platform-level thing since
>>>>>> coherency can and does vary per device within a system.
>>>>> Well, I think this is not something an individual driver should mess
>>>>> with. What the driver should do is just express that it needs 
>>>>> coherent
>>>>> access to all of system memory and if that is not possible fail to 
>>>>> load
>>>>> with a warning why it is not possible.
>>>>>
>>>>>> I wasn't suggesting that Linux could somehow make coherency 
>>>>>> magically
>>>>>> work when the signals don't physically exist in the interconnect - I
>>>>>> was assuming you'd merely want to do something like throw a big
>>>>>> warning and taint the kernel to help triage bug reports. Some 
>>>>>> drivers
>>>>>> like ahci_qoriq and panfrost simply need to know so they can program
>>>>>> their device to emit the appropriate memory attributes either 
>>>>>> way, and
>>>>>> rely on the DMA API to hide the rest of the difference, but if you
>>>>>> want to treat non-coherent use as unsupported because it would 
>>>>>> require
>>>>>> too invasive changes that's fine by me.
>>>>> Yes exactly that please. I mean not sure how panfrost is doing it, 
>>>>> but
>>>>> at least the Vulkan userspace API specification requires devices 
>>>>> to have
>>>>> coherent access to system memory.
>>>>>
>>>>> So even if I would want to do this it is simply not possible 
>>>>> because the
>>>>> application doesn't tell the driver which memory is accessed by the
>>>>> device and which by the CPU.
>>>>>
>>>>> Christian.
>>>>>
>>>>>> Robin.
>>> _______________________________________________
>>> Linux-rockchip mailing list
>>> Linux-rockchip@lists.infradead.org
>>> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fmailman%2Flistinfo%2Flinux-rockchip&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=F77FbO3SqslbzKu2%2FnjRLrQF45kljtD3%2FAEXEFd7NQs%3D&amp;reserved=0 
>>>

[-- Attachment #2: Type: text/html, Size: 26323 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-18  8:35                       ` Christian König
@ 2022-03-18 11:24                           ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-18 11:24 UTC (permalink / raw)
  To: Christian König
  Cc: Kever Yang, Robin Murphy, Shawn Lin, Christian König,
	Alex Deucher, Deucher, Alexander, amd-gfx list,
	open list:ARM/Rockchip SoC...,
	Tao Huang

On Fri, Mar 18, 2022 at 4:35 AM Christian König
<christian.koenig@amd.com> wrote:
>
>
>
> Am 18.03.22 um 08:51 schrieb Kever Yang:
>
>
> On 2022/3/17 20:19, Peter Geis wrote:
>
> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>
> Hi Peter,
>
> On 2022/3/17 08:14, Peter Geis wrote:
>
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
>
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the ali gnment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.
>
> Our SoC design team has double check with ARM GIC/ITS IP team for many
> times, and the GITS_CBASER
> of GIC600 IP does not support hardware bind or config to a fix value, so
> they insist this is an IP
> limitation instead of a SoC bug, software should take  care of it :(
> I will check again if we can provide errata for this issue.
>
> Thanks. This is necessary as the mbi-alias provides an imperfect
> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
> 10G NIC) to misbehave.
>
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.
>
>
> Sorry, what is this?
>
> Part of the ITS bug is it expects to be cache coherent with the CPU
> cluster by design.
> Due to the rk356x being implemented without an outer accessible cache,
> the ITS and other devices that require cache coherency (PCIe for
> example) crash in fun ways.
>
> Then this is still the ITS issue, not PCIe issue.
> PCIe is a peripheral bus controller like USB and other device, the driver should maintain the "cache coherency" if there is any, and there is no requirement for hardware cache coherency between PCIe and CPU.

Kever,

These issues are one and the same.
Certain hardware blocks *require* cache coherency as part of their design.
All of the *interesting* things PCIe can do stem from it.

When I saw you bumped the available window to the PCIe controller to
1GB I was really excited, because that meant we could finally support
devices that used these interesting features.
However, without cache coherency, having more than a 256MB window is a
waste, as any card that can take advantage of it *requires* coherency.
The same thing goes for a resizable BAR.
EP mode is the same, having the ability to connect one CPU to another
CPU over a PCIe bus loses the advantages when you don't have
coherency.
At that point, you might as well toss in a 2.5GB ethernet port and
just use that instead.

>
>
> Well then I suggest to re-read the PCIe specification.
>
> Cache coherency is defined as mandatory there. Non-cache coherency is an optional feature.
>
> See section 2.2.6.5 in the PCIe 2.0 specification for a good example.
>
> Regards,
> Christian.
>
>
> We didn't see any transfer error on rk356x PCIe till now, we can take a look if it's easy to reproduce.

It's easy to reproduce, just try to use any card that has a
significantly large enough BAR to warrant requiring coherency.
dGPUs are the most readily accessible device, but High Performance
Computing Acceleration devices and high power FPGAs also would work.
Was the resizable bar tested at all internally either?
Any current device that could use that requires coherency.
And like above, EP mode without coherency is a waste at best, and
unpleasant at worst.

Very Respectfully,
Peter

>
> Thanks,
> - Kever
>
>
> This means that rk356x cannot implement a specification compliant ITS or PCIe.
> >From the rk3588 source dump it appears it was produced without an
> outer accessible cache, which means if true it also will be unable to
> use any PCIe cards that implement cache coherency as part of their
> design.
>
>
> Thanks,
> - Kever
>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=LcwZggIwIqjvzjDH2DUnIDwxsgk7WmhE9LK13knx36E%3D&amp;reserved=0
> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fXALLO1EnGi2s8pClt6aMrUlzqDy2KDO8wzpi033qtU%3D&amp;reserved=0
> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=tx%2Bw9ayScUTftjWAFL0GY%2FADQswxEJGRUhgxDw2TSzQ%3D&amp;reserved=0
> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=8VXuZvQAhD%2FsQBJ6WEXe0YElD6wCI675oxqHesKhclY%3D&amp;reserved=0
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
>
> Hi Robin,
>
> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>
> On 2021-05-26 10:42, Christian König wrote:
>
> Hi Robin,
>
> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>
> On 2021-05-25 14:05, Alex Deucher wrote:
>
> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> wrote:
>
> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> <alexdeucher@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> wrote:
>
> Good Evening,
>
> I am stress testing the pcie controller on the rk3566-quartz64
> prototype SBC.
> This device has 1GB available at <0x3 0x00000000> for the PCIe
> controller, which makes a dGPU theoretically possible.
> While attempting to light off a HD7570 card I manage to get a
> modeset
> console, but ring0 test fails and disables acceleration.
>
> Note, we do not have UEFI, so all PCIe setup is from the Linux
> kernel.
> Any insight you can provide would be much appreciated.
>
> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> does the CPU allow cache snoops from PCIe devices?  That is required
> for the driver to operate.
>
> Ah, most likely not.
> This issue has come up already as the GIC isn't permitted to snoop on
> the CPUs, so I doubt the PCIe controller can either.
>
> Is there no way to work around this or is it dead in the water?
>
> It's required by the pcie spec.  You could potentially work around it
> if you can allocate uncached memory for DMA, but I don't think that is
> possible currently.  Ideally we'd figure out some way to detect if a
> particular platform supports cache snooping or not as well.
>
> There's device_get_dma_attr(), although I don't think it will work
> currently for PCI devices without an OF or ACPI node - we could
> perhaps do with a PCI-specific wrapper which can walk up and defer
> to the host bridge's firmware description as necessary.
>
> The common DMA ops *do* correctly keep track of per-device coherency
> internally, but drivers aren't supposed to be poking at that
> information directly.
>
> That sounds like you underestimate the problem. ARM has unfortunately
> made the coherency for PCI an optional IP.
>
> Sorry to be that guy, but I'm involved a lot internally with our
> system IP and interconnect, and I probably understand the situation
> better than 99% of the community ;)
>
> I need to apologize, didn't realized who was answering :)
>
> It just sounded to me that you wanted to suggest to the end user that
> this is fixable in software and I really wanted to avoid even more
> customers coming around asking how to do this.
>
> For the record, the SBSA specification (the closet thing we have to a
> "system architecture") does require that PCIe is integrated in an
> I/O-coherent manner, but we don't have any control over what people do
> in embedded applications (note that we don't make PCIe IP at all, and
> there is plenty of 3rd-party interconnect IP).
>
> So basically it is not the fault of the ARM IP-core, but people are just
> stitching together PCIe interconnect IP with a core where it is not
> supposed to be used with.
>
> Do I get that correctly? That's an interesting puzzle piece in the picture.
>
> So we are talking about a hardware limitation which potentially can't
> be fixed without replacing the hardware.
>
> You expressed interest in "some way to detect if a particular platform
> supports cache snooping or not", by which I assumed you meant a
> software method for the amdgpu/radeon drivers to call, rather than,
> say, a website that driver maintainers can look up SoC names on. I'm
> saying that that API already exists (just may need a bit more work).
> Note that it is emphatically not a platform-level thing since
> coherency can and does vary per device within a system.
>
> Well, I think this is not something an individual driver should mess
> with. What the driver should do is just express that it needs coherent
> access to all of system memory and if that is not possible fail to load
> with a warning why it is not possible.
>
> I wasn't suggesting that Linux could somehow make coherency magically
> work when the signals don't physically exist in the interconnect - I
> was assuming you'd merely want to do something like throw a big
> warning and taint the kernel to help triage bug reports. Some drivers
> like ahci_qoriq and panfrost simply need to know so they can program
> their device to emit the appropriate memory attributes either way, and
> rely on the DMA API to hide the rest of the difference, but if you
> want to treat non-coherent use as unsupported because it would require
> too invasive changes that's fine by me.
>
> Yes exactly that please. I mean not sure how panfrost is doing it, but
> at least the Vulkan userspace API specification requires devices to have
> coherent access to system memory.
>
> So even if I would want to do this it is simply not possible because the
> application doesn't tell the driver which memory is accessed by the
> device and which by the CPU.
>
> Christian.
>
> Robin.
>
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fmailman%2Flistinfo%2Flinux-rockchip&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=F77FbO3SqslbzKu2%2FnjRLrQF45kljtD3%2FAEXEFd7NQs%3D&amp;reserved=0
>
>

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-18 11:24                           ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-18 11:24 UTC (permalink / raw)
  To: Christian König
  Cc: Tao Huang, open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Alex Deucher, Robin Murphy

On Fri, Mar 18, 2022 at 4:35 AM Christian König
<christian.koenig@amd.com> wrote:
>
>
>
> Am 18.03.22 um 08:51 schrieb Kever Yang:
>
>
> On 2022/3/17 20:19, Peter Geis wrote:
>
> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>
> Hi Peter,
>
> On 2022/3/17 08:14, Peter Geis wrote:
>
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
>
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the ali gnment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.
>
> Our SoC design team has double check with ARM GIC/ITS IP team for many
> times, and the GITS_CBASER
> of GIC600 IP does not support hardware bind or config to a fix value, so
> they insist this is an IP
> limitation instead of a SoC bug, software should take  care of it :(
> I will check again if we can provide errata for this issue.
>
> Thanks. This is necessary as the mbi-alias provides an imperfect
> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
> 10G NIC) to misbehave.
>
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.
>
>
> Sorry, what is this?
>
> Part of the ITS bug is it expects to be cache coherent with the CPU
> cluster by design.
> Due to the rk356x being implemented without an outer accessible cache,
> the ITS and other devices that require cache coherency (PCIe for
> example) crash in fun ways.
>
> Then this is still the ITS issue, not PCIe issue.
> PCIe is a peripheral bus controller like USB and other device, the driver should maintain the "cache coherency" if there is any, and there is no requirement for hardware cache coherency between PCIe and CPU.

Kever,

These issues are one and the same.
Certain hardware blocks *require* cache coherency as part of their design.
All of the *interesting* things PCIe can do stem from it.

When I saw you bumped the available window to the PCIe controller to
1GB I was really excited, because that meant we could finally support
devices that used these interesting features.
However, without cache coherency, having more than a 256MB window is a
waste, as any card that can take advantage of it *requires* coherency.
The same thing goes for a resizable BAR.
EP mode is the same, having the ability to connect one CPU to another
CPU over a PCIe bus loses the advantages when you don't have
coherency.
At that point, you might as well toss in a 2.5GB ethernet port and
just use that instead.

>
>
> Well then I suggest to re-read the PCIe specification.
>
> Cache coherency is defined as mandatory there. Non-cache coherency is an optional feature.
>
> See section 2.2.6.5 in the PCIe 2.0 specification for a good example.
>
> Regards,
> Christian.
>
>
> We didn't see any transfer error on rk356x PCIe till now, we can take a look if it's easy to reproduce.

It's easy to reproduce, just try to use any card that has a
significantly large enough BAR to warrant requiring coherency.
dGPUs are the most readily accessible device, but High Performance
Computing Acceleration devices and high power FPGAs also would work.
Was the resizable bar tested at all internally either?
Any current device that could use that requires coherency.
And like above, EP mode without coherency is a waste at best, and
unpleasant at worst.

Very Respectfully,
Peter

>
> Thanks,
> - Kever
>
>
> This means that rk356x cannot implement a specification compliant ITS or PCIe.
> >From the rk3588 source dump it appears it was produced without an
> outer accessible cache, which means if true it also will be unable to
> use any PCIe cards that implement cache coherency as part of their
> design.
>
>
> Thanks,
> - Kever
>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=LcwZggIwIqjvzjDH2DUnIDwxsgk7WmhE9LK13knx36E%3D&amp;reserved=0
> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=fXALLO1EnGi2s8pClt6aMrUlzqDy2KDO8wzpi033qtU%3D&amp;reserved=0
> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=tx%2Bw9ayScUTftjWAFL0GY%2FADQswxEJGRUhgxDw2TSzQ%3D&amp;reserved=0
> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=8VXuZvQAhD%2FsQBJ6WEXe0YElD6wCI675oxqHesKhclY%3D&amp;reserved=0
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
>
> Hi Robin,
>
> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>
> On 2021-05-26 10:42, Christian König wrote:
>
> Hi Robin,
>
> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>
> On 2021-05-25 14:05, Alex Deucher wrote:
>
> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> wrote:
>
> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> <alexdeucher@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> wrote:
>
> Good Evening,
>
> I am stress testing the pcie controller on the rk3566-quartz64
> prototype SBC.
> This device has 1GB available at <0x3 0x00000000> for the PCIe
> controller, which makes a dGPU theoretically possible.
> While attempting to light off a HD7570 card I manage to get a
> modeset
> console, but ring0 test fails and disables acceleration.
>
> Note, we do not have UEFI, so all PCIe setup is from the Linux
> kernel.
> Any insight you can provide would be much appreciated.
>
> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> does the CPU allow cache snoops from PCIe devices?  That is required
> for the driver to operate.
>
> Ah, most likely not.
> This issue has come up already as the GIC isn't permitted to snoop on
> the CPUs, so I doubt the PCIe controller can either.
>
> Is there no way to work around this or is it dead in the water?
>
> It's required by the pcie spec.  You could potentially work around it
> if you can allocate uncached memory for DMA, but I don't think that is
> possible currently.  Ideally we'd figure out some way to detect if a
> particular platform supports cache snooping or not as well.
>
> There's device_get_dma_attr(), although I don't think it will work
> currently for PCI devices without an OF or ACPI node - we could
> perhaps do with a PCI-specific wrapper which can walk up and defer
> to the host bridge's firmware description as necessary.
>
> The common DMA ops *do* correctly keep track of per-device coherency
> internally, but drivers aren't supposed to be poking at that
> information directly.
>
> That sounds like you underestimate the problem. ARM has unfortunately
> made the coherency for PCI an optional IP.
>
> Sorry to be that guy, but I'm involved a lot internally with our
> system IP and interconnect, and I probably understand the situation
> better than 99% of the community ;)
>
> I need to apologize, didn't realized who was answering :)
>
> It just sounded to me that you wanted to suggest to the end user that
> this is fixable in software and I really wanted to avoid even more
> customers coming around asking how to do this.
>
> For the record, the SBSA specification (the closet thing we have to a
> "system architecture") does require that PCIe is integrated in an
> I/O-coherent manner, but we don't have any control over what people do
> in embedded applications (note that we don't make PCIe IP at all, and
> there is plenty of 3rd-party interconnect IP).
>
> So basically it is not the fault of the ARM IP-core, but people are just
> stitching together PCIe interconnect IP with a core where it is not
> supposed to be used with.
>
> Do I get that correctly? That's an interesting puzzle piece in the picture.
>
> So we are talking about a hardware limitation which potentially can't
> be fixed without replacing the hardware.
>
> You expressed interest in "some way to detect if a particular platform
> supports cache snooping or not", by which I assumed you meant a
> software method for the amdgpu/radeon drivers to call, rather than,
> say, a website that driver maintainers can look up SoC names on. I'm
> saying that that API already exists (just may need a bit more work).
> Note that it is emphatically not a platform-level thing since
> coherency can and does vary per device within a system.
>
> Well, I think this is not something an individual driver should mess
> with. What the driver should do is just express that it needs coherent
> access to all of system memory and if that is not possible fail to load
> with a warning why it is not possible.
>
> I wasn't suggesting that Linux could somehow make coherency magically
> work when the signals don't physically exist in the interconnect - I
> was assuming you'd merely want to do something like throw a big
> warning and taint the kernel to help triage bug reports. Some drivers
> like ahci_qoriq and panfrost simply need to know so they can program
> their device to emit the appropriate memory attributes either way, and
> rely on the DMA API to hide the rest of the difference, but if you
> want to treat non-coherent use as unsupported because it would require
> too invasive changes that's fine by me.
>
> Yes exactly that please. I mean not sure how panfrost is doing it, but
> at least the Vulkan userspace API specification requires devices to have
> coherent access to system memory.
>
> So even if I would want to do this it is simply not possible because the
> application doesn't tell the driver which memory is accessed by the
> device and which by the CPU.
>
> Christian.
>
> Robin.
>
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fmailman%2Flistinfo%2Flinux-rockchip&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C8bdb8c3a6a2e4643bbfd08da08b42da4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831867224766930%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=F77FbO3SqslbzKu2%2FnjRLrQF45kljtD3%2FAEXEFd7NQs%3D&amp;reserved=0
>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-18 11:24                           ` Peter Geis
@ 2022-03-18 12:31                             ` Christian König
  -1 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2022-03-18 12:31 UTC (permalink / raw)
  To: Peter Geis
  Cc: Kever Yang, Robin Murphy, Shawn Lin, Christian König,
	Alex Deucher, Deucher, Alexander, amd-gfx list,
	open list:ARM/Rockchip SoC...,
	Tao Huang

Am 18.03.22 um 12:24 schrieb Peter Geis:
> On Fri, Mar 18, 2022 at 4:35 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>>
>> Am 18.03.22 um 08:51 schrieb Kever Yang:
>>
>>
>> On 2022/3/17 20:19, Peter Geis wrote:
>>
>> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>>
>> Hi Peter,
>>
>> On 2022/3/17 08:14, Peter Geis wrote:
>>
>> Good Evening,
>>
>> I apologize for raising this email chain from the dead, but there have
>> been some developments that have introduced even more questions.
>> I've looped the Rockchip mailing list into this too, as this affects
>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>
>> TLDR for those not familiar: It seems the rk356x series (and possibly
>> the rk3588) were built without any outer coherent cache.
>> This means (unless Rockchip wants to clarify here) devices such as the
>> ITS and PCIe cannot utilize cache snooping.
>> This is based on the results of the email chain [2].
>>
>> The new circumstances are as follows:
>> The RPi CM4 Adventure Team as I've taken to calling them has been
>> attempting to get a dGPU working with the very broken Broadcom
>> controller in the RPi CM4.
>> Recently they acquired a SoQuartz rk3566 module which is pin
>> compatible with the CM4, and have taken to trying it out as well.
>>
>> This is how I got involved.
>> It seems they found a trivial way to force the Radeon R600 driver to
>> use Non-Cached memory for everything.
>> This single line change, combined with using memset_io instead of
>> memset, allows the ring tests to pass and the card probes successfully
>> (minus the DMA limitations of the rk356x due to the 32 bit
>> interconnect).
>> I discovered using this method that we start having unaligned io
>> memory access faults (bus errors) when running glmark2-drm (running
>> glmark2 directly was impossible, as both X and Wayland crashed too
>> early).
>> I traced this to using what I thought at the time was an unsafe memcpy
>> in the mesa stack.
>> Rewriting this function to force aligned writes solved the problem and
>> allows glmark2-drm to run to completion.
>> With some extensive debugging, I found about half a dozen memcpy
>> functions in mesa that if forced to be aligned would allow Wayland to
>> start, but with hilarious display corruption (see [3]. [4]).
>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>> I'm not convinced it's that simple.
>>
>> On my two hour drive in to work this morning, I got to thinking.
>> If this was an memcpy fault, this would be universally broken on arm64
>> which is obviously not the case.
>> So I started thinking, what is different here than with systems known to work:
>> 1. No IOMMU for the PCIe controller.
>> 2. The Outer Cache Issue.
>>
>> Robin:
>> My questions for you, since you're the smartest person I know about
>> arm64 memory management:
>> Could cache snooping permit unaligned accesses to IO to be safe?
>> Or
>> Is it the lack of an IOMMU that's causing the ali gnment faults to become fatal?
>> Or
>> Am I insane here?
>>
>> Rockchip:
>> Please update on the status for the Outer Cache errata for ITS services.
>>
>> Our SoC design team has double check with ARM GIC/ITS IP team for many
>> times, and the GITS_CBASER
>> of GIC600 IP does not support hardware bind or config to a fix value, so
>> they insist this is an IP
>> limitation instead of a SoC bug, software should take  care of it :(
>> I will check again if we can provide errata for this issue.
>>
>> Thanks. This is necessary as the mbi-alias provides an imperfect
>> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
>> 10G NIC) to misbehave.
>>
>> Please provide an answer to the errata of the PCIe controller, in
>> regard to cache snooping and buffering, for both the rk356x and the
>> upcoming rk3588.
>>
>>
>> Sorry, what is this?
>>
>> Part of the ITS bug is it expects to be cache coherent with the CPU
>> cluster by design.
>> Due to the rk356x being implemented without an outer accessible cache,
>> the ITS and other devices that require cache coherency (PCIe for
>> example) crash in fun ways.
>>
>> Then this is still the ITS issue, not PCIe issue.
>> PCIe is a peripheral bus controller like USB and other device, the driver should maintain the "cache coherency" if there is any, and there is no requirement for hardware cache coherency between PCIe and CPU.
> Kever,
>
> These issues are one and the same.

Well, that's not correct. You are still mixing two things up here:

1. The memory accesses from the device to the system memory must be 
coherent with the CPU cache. E.g. we root complex must snoop the CPU cache.
     That's a requirement of the PCIe spec. If you don't get that right 
a whole bunch of PCIe devices won't work correctly.

2. The memory accesses from the CPU to the devices PCIe BAR can be 
unaligned. E.g. a 64bit read can be aligned on a 32bit address.
     That is a requirement of the graphics stack. Other devices still 
might work fine without that.

Regards,
Christian.

> Certain hardware blocks *require* cache coherency as part of their design.
> All of the *interesting* things PCIe can do stem from it.
>
> When I saw you bumped the available window to the PCIe controller to
> 1GB I was really excited, because that meant we could finally support
> devices that used these interesting features.
> However, without cache coherency, having more than a 256MB window is a
> waste, as any card that can take advantage of it *requires* coherency.
> The same thing goes for a resizable BAR.
> EP mode is the same, having the ability to connect one CPU to another
> CPU over a PCIe bus loses the advantages when you don't have
> coherency.
> At that point, you might as well toss in a 2.5GB ethernet port and
> just use that instead.
>
>>
>> Well then I suggest to re-read the PCIe specification.
>>
>> Cache coherency is defined as mandatory there. Non-cache coherency is an optional feature.
>>
>> See section 2.2.6.5 in the PCIe 2.0 specification for a good example.
>>
>> Regards,
>> Christian.
>>
>>
>> We didn't see any transfer error on rk356x PCIe till now, we can take a look if it's easy to reproduce.
> It's easy to reproduce, just try to use any card that has a
> significantly large enough BAR to warrant requiring coherency.
> dGPUs are the most readily accessible device, but High Performance
> Computing Acceleration devices and high power FPGAs also would work.
> Was the resizable bar tested at all internally either?
> Any current device that could use that requires coherency.
> And like above, EP mode without coherency is a waste at best, and
> unpleasant at worst.
>
> Very Respectfully,
> Peter
>
>> Thanks,
>> - Kever
>>
>>
>> This means that rk356x cannot implement a specification compliant ITS or PCIe.
>> >From the rk3588 source dump it appears it was produced without an
>> outer accessible cache, which means if true it also will be unable to
>> use any PCIe cards that implement cache coherency as part of their
>> design.
>>
>>
>> Thanks,
>> - Kever
>>
>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=et3jUQ1Y2QaR56qTjl4LJ1vGurPwK8HfLosebUIV9bc%3D&amp;reserved=0
>> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=UrGSye7MpCUO9tppCCmgSGlNa6X0otJ8tkcOb2PXjA8%3D&amp;reserved=0
>> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=agZjpl0LvSf4Jo3SoETVkW72uN0WiHb%2FYUA7V7c0G88%3D&amp;reserved=0
>> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=tuBS9UfMegc1bc7U98zpsfQ1vUKsVmpscmNKpkn%2BHmk%3D&amp;reserved=0
>>
>> Thank you everyone for your time.
>>
>> Very Respectfully,
>> Peter Geis
>>
>> On Wed, May 26, 2021 at 7:21 AM Christian König
>> <christian.koenig@amd.com> wrote:
>>
>> Hi Robin,
>>
>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>
>> On 2021-05-26 10:42, Christian König wrote:
>>
>> Hi Robin,
>>
>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>
>> On 2021-05-25 14:05, Alex Deucher wrote:
>>
>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>> wrote:
>>
>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>> <alexdeucher@gmail.com> wrote:
>>
>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>> wrote:
>>
>> Good Evening,
>>
>> I am stress testing the pcie controller on the rk3566-quartz64
>> prototype SBC.
>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>> controller, which makes a dGPU theoretically possible.
>> While attempting to light off a HD7570 card I manage to get a
>> modeset
>> console, but ring0 test fails and disables acceleration.
>>
>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>> kernel.
>> Any insight you can provide would be much appreciated.
>>
>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>> does the CPU allow cache snoops from PCIe devices?  That is required
>> for the driver to operate.
>>
>> Ah, most likely not.
>> This issue has come up already as the GIC isn't permitted to snoop on
>> the CPUs, so I doubt the PCIe controller can either.
>>
>> Is there no way to work around this or is it dead in the water?
>>
>> It's required by the pcie spec.  You could potentially work around it
>> if you can allocate uncached memory for DMA, but I don't think that is
>> possible currently.  Ideally we'd figure out some way to detect if a
>> particular platform supports cache snooping or not as well.
>>
>> There's device_get_dma_attr(), although I don't think it will work
>> currently for PCI devices without an OF or ACPI node - we could
>> perhaps do with a PCI-specific wrapper which can walk up and defer
>> to the host bridge's firmware description as necessary.
>>
>> The common DMA ops *do* correctly keep track of per-device coherency
>> internally, but drivers aren't supposed to be poking at that
>> information directly.
>>
>> That sounds like you underestimate the problem. ARM has unfortunately
>> made the coherency for PCI an optional IP.
>>
>> Sorry to be that guy, but I'm involved a lot internally with our
>> system IP and interconnect, and I probably understand the situation
>> better than 99% of the community ;)
>>
>> I need to apologize, didn't realized who was answering :)
>>
>> It just sounded to me that you wanted to suggest to the end user that
>> this is fixable in software and I really wanted to avoid even more
>> customers coming around asking how to do this.
>>
>> For the record, the SBSA specification (the closet thing we have to a
>> "system architecture") does require that PCIe is integrated in an
>> I/O-coherent manner, but we don't have any control over what people do
>> in embedded applications (note that we don't make PCIe IP at all, and
>> there is plenty of 3rd-party interconnect IP).
>>
>> So basically it is not the fault of the ARM IP-core, but people are just
>> stitching together PCIe interconnect IP with a core where it is not
>> supposed to be used with.
>>
>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>
>> So we are talking about a hardware limitation which potentially can't
>> be fixed without replacing the hardware.
>>
>> You expressed interest in "some way to detect if a particular platform
>> supports cache snooping or not", by which I assumed you meant a
>> software method for the amdgpu/radeon drivers to call, rather than,
>> say, a website that driver maintainers can look up SoC names on. I'm
>> saying that that API already exists (just may need a bit more work).
>> Note that it is emphatically not a platform-level thing since
>> coherency can and does vary per device within a system.
>>
>> Well, I think this is not something an individual driver should mess
>> with. What the driver should do is just express that it needs coherent
>> access to all of system memory and if that is not possible fail to load
>> with a warning why it is not possible.
>>
>> I wasn't suggesting that Linux could somehow make coherency magically
>> work when the signals don't physically exist in the interconnect - I
>> was assuming you'd merely want to do something like throw a big
>> warning and taint the kernel to help triage bug reports. Some drivers
>> like ahci_qoriq and panfrost simply need to know so they can program
>> their device to emit the appropriate memory attributes either way, and
>> rely on the DMA API to hide the rest of the difference, but if you
>> want to treat non-coherent use as unsupported because it would require
>> too invasive changes that's fine by me.
>>
>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>> at least the Vulkan userspace API specification requires devices to have
>> coherent access to system memory.
>>
>> So even if I would want to do this it is simply not possible because the
>> application doesn't tell the driver which memory is accessed by the
>> device and which by the CPU.
>>
>> Christian.
>>
>> Robin.
>>
>> _______________________________________________
>> Linux-rockchip mailing list
>> Linux-rockchip@lists.infradead.org
>> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fmailman%2Flistinfo%2Flinux-rockchip&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=gyKyym%2BH%2F9u%2FfBP953N97x%2BOJBt9EaR2aPivWITwlPo%3D&amp;reserved=0
>>
>>


_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-18 12:31                             ` Christian König
  0 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2022-03-18 12:31 UTC (permalink / raw)
  To: Peter Geis
  Cc: Tao Huang, open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Alex Deucher, Robin Murphy

Am 18.03.22 um 12:24 schrieb Peter Geis:
> On Fri, Mar 18, 2022 at 4:35 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>>
>> Am 18.03.22 um 08:51 schrieb Kever Yang:
>>
>>
>> On 2022/3/17 20:19, Peter Geis wrote:
>>
>> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
>>
>> Hi Peter,
>>
>> On 2022/3/17 08:14, Peter Geis wrote:
>>
>> Good Evening,
>>
>> I apologize for raising this email chain from the dead, but there have
>> been some developments that have introduced even more questions.
>> I've looped the Rockchip mailing list into this too, as this affects
>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>
>> TLDR for those not familiar: It seems the rk356x series (and possibly
>> the rk3588) were built without any outer coherent cache.
>> This means (unless Rockchip wants to clarify here) devices such as the
>> ITS and PCIe cannot utilize cache snooping.
>> This is based on the results of the email chain [2].
>>
>> The new circumstances are as follows:
>> The RPi CM4 Adventure Team as I've taken to calling them has been
>> attempting to get a dGPU working with the very broken Broadcom
>> controller in the RPi CM4.
>> Recently they acquired a SoQuartz rk3566 module which is pin
>> compatible with the CM4, and have taken to trying it out as well.
>>
>> This is how I got involved.
>> It seems they found a trivial way to force the Radeon R600 driver to
>> use Non-Cached memory for everything.
>> This single line change, combined with using memset_io instead of
>> memset, allows the ring tests to pass and the card probes successfully
>> (minus the DMA limitations of the rk356x due to the 32 bit
>> interconnect).
>> I discovered using this method that we start having unaligned io
>> memory access faults (bus errors) when running glmark2-drm (running
>> glmark2 directly was impossible, as both X and Wayland crashed too
>> early).
>> I traced this to using what I thought at the time was an unsafe memcpy
>> in the mesa stack.
>> Rewriting this function to force aligned writes solved the problem and
>> allows glmark2-drm to run to completion.
>> With some extensive debugging, I found about half a dozen memcpy
>> functions in mesa that if forced to be aligned would allow Wayland to
>> start, but with hilarious display corruption (see [3]. [4]).
>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>> I'm not convinced it's that simple.
>>
>> On my two hour drive in to work this morning, I got to thinking.
>> If this was an memcpy fault, this would be universally broken on arm64
>> which is obviously not the case.
>> So I started thinking, what is different here than with systems known to work:
>> 1. No IOMMU for the PCIe controller.
>> 2. The Outer Cache Issue.
>>
>> Robin:
>> My questions for you, since you're the smartest person I know about
>> arm64 memory management:
>> Could cache snooping permit unaligned accesses to IO to be safe?
>> Or
>> Is it the lack of an IOMMU that's causing the ali gnment faults to become fatal?
>> Or
>> Am I insane here?
>>
>> Rockchip:
>> Please update on the status for the Outer Cache errata for ITS services.
>>
>> Our SoC design team has double check with ARM GIC/ITS IP team for many
>> times, and the GITS_CBASER
>> of GIC600 IP does not support hardware bind or config to a fix value, so
>> they insist this is an IP
>> limitation instead of a SoC bug, software should take  care of it :(
>> I will check again if we can provide errata for this issue.
>>
>> Thanks. This is necessary as the mbi-alias provides an imperfect
>> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
>> 10G NIC) to misbehave.
>>
>> Please provide an answer to the errata of the PCIe controller, in
>> regard to cache snooping and buffering, for both the rk356x and the
>> upcoming rk3588.
>>
>>
>> Sorry, what is this?
>>
>> Part of the ITS bug is it expects to be cache coherent with the CPU
>> cluster by design.
>> Due to the rk356x being implemented without an outer accessible cache,
>> the ITS and other devices that require cache coherency (PCIe for
>> example) crash in fun ways.
>>
>> Then this is still the ITS issue, not PCIe issue.
>> PCIe is a peripheral bus controller like USB and other device, the driver should maintain the "cache coherency" if there is any, and there is no requirement for hardware cache coherency between PCIe and CPU.
> Kever,
>
> These issues are one and the same.

Well, that's not correct. You are still mixing two things up here:

1. The memory accesses from the device to the system memory must be 
coherent with the CPU cache. E.g. we root complex must snoop the CPU cache.
     That's a requirement of the PCIe spec. If you don't get that right 
a whole bunch of PCIe devices won't work correctly.

2. The memory accesses from the CPU to the devices PCIe BAR can be 
unaligned. E.g. a 64bit read can be aligned on a 32bit address.
     That is a requirement of the graphics stack. Other devices still 
might work fine without that.

Regards,
Christian.

> Certain hardware blocks *require* cache coherency as part of their design.
> All of the *interesting* things PCIe can do stem from it.
>
> When I saw you bumped the available window to the PCIe controller to
> 1GB I was really excited, because that meant we could finally support
> devices that used these interesting features.
> However, without cache coherency, having more than a 256MB window is a
> waste, as any card that can take advantage of it *requires* coherency.
> The same thing goes for a resizable BAR.
> EP mode is the same, having the ability to connect one CPU to another
> CPU over a PCIe bus loses the advantages when you don't have
> coherency.
> At that point, you might as well toss in a 2.5GB ethernet port and
> just use that instead.
>
>>
>> Well then I suggest to re-read the PCIe specification.
>>
>> Cache coherency is defined as mandatory there. Non-cache coherency is an optional feature.
>>
>> See section 2.2.6.5 in the PCIe 2.0 specification for a good example.
>>
>> Regards,
>> Christian.
>>
>>
>> We didn't see any transfer error on rk356x PCIe till now, we can take a look if it's easy to reproduce.
> It's easy to reproduce, just try to use any card that has a
> significantly large enough BAR to warrant requiring coherency.
> dGPUs are the most readily accessible device, but High Performance
> Computing Acceleration devices and high power FPGAs also would work.
> Was the resizable bar tested at all internally either?
> Any current device that could use that requires coherency.
> And like above, EP mode without coherency is a waste at best, and
> unpleasant at worst.
>
> Very Respectfully,
> Peter
>
>> Thanks,
>> - Kever
>>
>>
>> This means that rk356x cannot implement a specification compliant ITS or PCIe.
>> >From the rk3588 source dump it appears it was produced without an
>> outer accessible cache, which means if true it also will be unable to
>> use any PCIe cards that implement cache coherency as part of their
>> design.
>>
>>
>> Thanks,
>> - Kever
>>
>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=et3jUQ1Y2QaR56qTjl4LJ1vGurPwK8HfLosebUIV9bc%3D&amp;reserved=0
>> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=UrGSye7MpCUO9tppCCmgSGlNa6X0otJ8tkcOb2PXjA8%3D&amp;reserved=0
>> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=agZjpl0LvSf4Jo3SoETVkW72uN0WiHb%2FYUA7V7c0G88%3D&amp;reserved=0
>> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=tuBS9UfMegc1bc7U98zpsfQ1vUKsVmpscmNKpkn%2BHmk%3D&amp;reserved=0
>>
>> Thank you everyone for your time.
>>
>> Very Respectfully,
>> Peter Geis
>>
>> On Wed, May 26, 2021 at 7:21 AM Christian König
>> <christian.koenig@amd.com> wrote:
>>
>> Hi Robin,
>>
>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>
>> On 2021-05-26 10:42, Christian König wrote:
>>
>> Hi Robin,
>>
>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>
>> On 2021-05-25 14:05, Alex Deucher wrote:
>>
>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
>> wrote:
>>
>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>> <alexdeucher@gmail.com> wrote:
>>
>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
>> wrote:
>>
>> Good Evening,
>>
>> I am stress testing the pcie controller on the rk3566-quartz64
>> prototype SBC.
>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>> controller, which makes a dGPU theoretically possible.
>> While attempting to light off a HD7570 card I manage to get a
>> modeset
>> console, but ring0 test fails and disables acceleration.
>>
>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>> kernel.
>> Any insight you can provide would be much appreciated.
>>
>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>> does the CPU allow cache snoops from PCIe devices?  That is required
>> for the driver to operate.
>>
>> Ah, most likely not.
>> This issue has come up already as the GIC isn't permitted to snoop on
>> the CPUs, so I doubt the PCIe controller can either.
>>
>> Is there no way to work around this or is it dead in the water?
>>
>> It's required by the pcie spec.  You could potentially work around it
>> if you can allocate uncached memory for DMA, but I don't think that is
>> possible currently.  Ideally we'd figure out some way to detect if a
>> particular platform supports cache snooping or not as well.
>>
>> There's device_get_dma_attr(), although I don't think it will work
>> currently for PCI devices without an OF or ACPI node - we could
>> perhaps do with a PCI-specific wrapper which can walk up and defer
>> to the host bridge's firmware description as necessary.
>>
>> The common DMA ops *do* correctly keep track of per-device coherency
>> internally, but drivers aren't supposed to be poking at that
>> information directly.
>>
>> That sounds like you underestimate the problem. ARM has unfortunately
>> made the coherency for PCI an optional IP.
>>
>> Sorry to be that guy, but I'm involved a lot internally with our
>> system IP and interconnect, and I probably understand the situation
>> better than 99% of the community ;)
>>
>> I need to apologize, didn't realized who was answering :)
>>
>> It just sounded to me that you wanted to suggest to the end user that
>> this is fixable in software and I really wanted to avoid even more
>> customers coming around asking how to do this.
>>
>> For the record, the SBSA specification (the closet thing we have to a
>> "system architecture") does require that PCIe is integrated in an
>> I/O-coherent manner, but we don't have any control over what people do
>> in embedded applications (note that we don't make PCIe IP at all, and
>> there is plenty of 3rd-party interconnect IP).
>>
>> So basically it is not the fault of the ARM IP-core, but people are just
>> stitching together PCIe interconnect IP with a core where it is not
>> supposed to be used with.
>>
>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>
>> So we are talking about a hardware limitation which potentially can't
>> be fixed without replacing the hardware.
>>
>> You expressed interest in "some way to detect if a particular platform
>> supports cache snooping or not", by which I assumed you meant a
>> software method for the amdgpu/radeon drivers to call, rather than,
>> say, a website that driver maintainers can look up SoC names on. I'm
>> saying that that API already exists (just may need a bit more work).
>> Note that it is emphatically not a platform-level thing since
>> coherency can and does vary per device within a system.
>>
>> Well, I think this is not something an individual driver should mess
>> with. What the driver should do is just express that it needs coherent
>> access to all of system memory and if that is not possible fail to load
>> with a warning why it is not possible.
>>
>> I wasn't suggesting that Linux could somehow make coherency magically
>> work when the signals don't physically exist in the interconnect - I
>> was assuming you'd merely want to do something like throw a big
>> warning and taint the kernel to help triage bug reports. Some drivers
>> like ahci_qoriq and panfrost simply need to know so they can program
>> their device to emit the appropriate memory attributes either way, and
>> rely on the DMA API to hide the rest of the difference, but if you
>> want to treat non-coherent use as unsupported because it would require
>> too invasive changes that's fine by me.
>>
>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>> at least the Vulkan userspace API specification requires devices to have
>> coherent access to system memory.
>>
>> So even if I would want to do this it is simply not possible because the
>> application doesn't tell the driver which memory is accessed by the
>> device and which by the CPU.
>>
>> Christian.
>>
>> Robin.
>>
>> _______________________________________________
>> Linux-rockchip mailing list
>> Linux-rockchip@lists.infradead.org
>> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fmailman%2Flistinfo%2Flinux-rockchip&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=gyKyym%2BH%2F9u%2FfBP953N97x%2BOJBt9EaR2aPivWITwlPo%3D&amp;reserved=0
>>
>>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-18 12:31                             ` Christian König
@ 2022-03-18 12:45                               ` Peter Geis
  -1 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-18 12:45 UTC (permalink / raw)
  To: Christian König
  Cc: Kever Yang, Robin Murphy, Shawn Lin, Christian König,
	Alex Deucher, Deucher, Alexander, amd-gfx list,
	open list:ARM/Rockchip SoC...,
	Tao Huang

On Fri, Mar 18, 2022 at 8:31 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 18.03.22 um 12:24 schrieb Peter Geis:
> > On Fri, Mar 18, 2022 at 4:35 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >>
> >> Am 18.03.22 um 08:51 schrieb Kever Yang:
> >>
> >>
> >> On 2022/3/17 20:19, Peter Geis wrote:
> >>
> >> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2022/3/17 08:14, Peter Geis wrote:
> >>
> >> Good Evening,
> >>
> >> I apologize for raising this email chain from the dead, but there have
> >> been some developments that have introduced even more questions.
> >> I've looped the Rockchip mailing list into this too, as this affects
> >> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >>
> >> TLDR for those not familiar: It seems the rk356x series (and possibly
> >> the rk3588) were built without any outer coherent cache.
> >> This means (unless Rockchip wants to clarify here) devices such as the
> >> ITS and PCIe cannot utilize cache snooping.
> >> This is based on the results of the email chain [2].
> >>
> >> The new circumstances are as follows:
> >> The RPi CM4 Adventure Team as I've taken to calling them has been
> >> attempting to get a dGPU working with the very broken Broadcom
> >> controller in the RPi CM4.
> >> Recently they acquired a SoQuartz rk3566 module which is pin
> >> compatible with the CM4, and have taken to trying it out as well.
> >>
> >> This is how I got involved.
> >> It seems they found a trivial way to force the Radeon R600 driver to
> >> use Non-Cached memory for everything.
> >> This single line change, combined with using memset_io instead of
> >> memset, allows the ring tests to pass and the card probes successfully
> >> (minus the DMA limitations of the rk356x due to the 32 bit
> >> interconnect).
> >> I discovered using this method that we start having unaligned io
> >> memory access faults (bus errors) when running glmark2-drm (running
> >> glmark2 directly was impossible, as both X and Wayland crashed too
> >> early).
> >> I traced this to using what I thought at the time was an unsafe memcpy
> >> in the mesa stack.
> >> Rewriting this function to force aligned writes solved the problem and
> >> allows glmark2-drm to run to completion.
> >> With some extensive debugging, I found about half a dozen memcpy
> >> functions in mesa that if forced to be aligned would allow Wayland to
> >> start, but with hilarious display corruption (see [3]. [4]).
> >> The CM4 team is convinced this is an issue with memcpy in glibc, but
> >> I'm not convinced it's that simple.
> >>
> >> On my two hour drive in to work this morning, I got to thinking.
> >> If this was an memcpy fault, this would be universally broken on arm64
> >> which is obviously not the case.
> >> So I started thinking, what is different here than with systems known to work:
> >> 1. No IOMMU for the PCIe controller.
> >> 2. The Outer Cache Issue.
> >>
> >> Robin:
> >> My questions for you, since you're the smartest person I know about
> >> arm64 memory management:
> >> Could cache snooping permit unaligned accesses to IO to be safe?
> >> Or
> >> Is it the lack of an IOMMU that's causing the ali gnment faults to become fatal?
> >> Or
> >> Am I insane here?
> >>
> >> Rockchip:
> >> Please update on the status for the Outer Cache errata for ITS services.
> >>
> >> Our SoC design team has double check with ARM GIC/ITS IP team for many
> >> times, and the GITS_CBASER
> >> of GIC600 IP does not support hardware bind or config to a fix value, so
> >> they insist this is an IP
> >> limitation instead of a SoC bug, software should take  care of it :(
> >> I will check again if we can provide errata for this issue.
> >>
> >> Thanks. This is necessary as the mbi-alias provides an imperfect
> >> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
> >> 10G NIC) to misbehave.
> >>
> >> Please provide an answer to the errata of the PCIe controller, in
> >> regard to cache snooping and buffering, for both the rk356x and the
> >> upcoming rk3588.
> >>
> >>
> >> Sorry, what is this?
> >>
> >> Part of the ITS bug is it expects to be cache coherent with the CPU
> >> cluster by design.
> >> Due to the rk356x being implemented without an outer accessible cache,
> >> the ITS and other devices that require cache coherency (PCIe for
> >> example) crash in fun ways.
> >>
> >> Then this is still the ITS issue, not PCIe issue.
> >> PCIe is a peripheral bus controller like USB and other device, the driver should maintain the "cache coherency" if there is any, and there is no requirement for hardware cache coherency between PCIe and CPU.
> > Kever,
> >
> > These issues are one and the same.
>
> Well, that's not correct. You are still mixing two things up here:
>
> 1. The memory accesses from the device to the system memory must be
> coherent with the CPU cache. E.g. we root complex must snoop the CPU cache.
>      That's a requirement of the PCIe spec. If you don't get that right
> a whole bunch of PCIe devices won't work correctly.

The ITS issue referred to here is the same root problem.
See:
https://lore.kernel.org/lkml/874kg0q6lc.wl-maz@kernel.org/raw
for the description of that issue.
(It's actually two issues, lack of cache snooping, and the 32 bit bus
forcing DMA to be limited to <4G ram)

>
> 2. The memory accesses from the CPU to the devices PCIe BAR can be
> unaligned. E.g. a 64bit read can be aligned on a 32bit address.
>      That is a requirement of the graphics stack. Other devices still
> might work fine without that.

Correct, this is a separate issue, but only becomes obvious when the
cache issue is bypassed.
At least for Radeon, the ring tests fail immediately due to issue 1.
I'm waiting for the DWC-PCIe maintainers to weigh in here, but in the
meantime I've been reading up on the way it was supposed to be
implemented.
IF (big IF here) I'm understanding it correctly, they permit synthesis
of the PCIe controller with or without support for unaligned accesses.

>
> Regards,
> Christian.

Thanks for everything so far!
Peter

>
> > Certain hardware blocks *require* cache coherency as part of their design.
> > All of the *interesting* things PCIe can do stem from it.
> >
> > When I saw you bumped the available window to the PCIe controller to
> > 1GB I was really excited, because that meant we could finally support
> > devices that used these interesting features.
> > However, without cache coherency, having more than a 256MB window is a
> > waste, as any card that can take advantage of it *requires* coherency.
> > The same thing goes for a resizable BAR.
> > EP mode is the same, having the ability to connect one CPU to another
> > CPU over a PCIe bus loses the advantages when you don't have
> > coherency.
> > At that point, you might as well toss in a 2.5GB ethernet port and
> > just use that instead.
> >
> >>
> >> Well then I suggest to re-read the PCIe specification.
> >>
> >> Cache coherency is defined as mandatory there. Non-cache coherency is an optional feature.
> >>
> >> See section 2.2.6.5 in the PCIe 2.0 specification for a good example.
> >>
> >> Regards,
> >> Christian.
> >>
> >>
> >> We didn't see any transfer error on rk356x PCIe till now, we can take a look if it's easy to reproduce.
> > It's easy to reproduce, just try to use any card that has a
> > significantly large enough BAR to warrant requiring coherency.
> > dGPUs are the most readily accessible device, but High Performance
> > Computing Acceleration devices and high power FPGAs also would work.
> > Was the resizable bar tested at all internally either?
> > Any current device that could use that requires coherency.
> > And like above, EP mode without coherency is a waste at best, and
> > unpleasant at worst.
> >
> > Very Respectfully,
> > Peter
> >
> >> Thanks,
> >> - Kever
> >>
> >>
> >> This means that rk356x cannot implement a specification compliant ITS or PCIe.
> >> >From the rk3588 source dump it appears it was produced without an
> >> outer accessible cache, which means if true it also will be unable to
> >> use any PCIe cards that implement cache coherency as part of their
> >> design.
> >>
> >>
> >> Thanks,
> >> - Kever
> >>
> >> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=et3jUQ1Y2QaR56qTjl4LJ1vGurPwK8HfLosebUIV9bc%3D&amp;reserved=0
> >> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=UrGSye7MpCUO9tppCCmgSGlNa6X0otJ8tkcOb2PXjA8%3D&amp;reserved=0
> >> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=agZjpl0LvSf4Jo3SoETVkW72uN0WiHb%2FYUA7V7c0G88%3D&amp;reserved=0
> >> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=tuBS9UfMegc1bc7U98zpsfQ1vUKsVmpscmNKpkn%2BHmk%3D&amp;reserved=0
> >>
> >> Thank you everyone for your time.
> >>
> >> Very Respectfully,
> >> Peter Geis
> >>
> >> On Wed, May 26, 2021 at 7:21 AM Christian König
> >> <christian.koenig@amd.com> wrote:
> >>
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>
> >> On 2021-05-26 10:42, Christian König wrote:
> >>
> >> Hi Robin,
> >>
> >> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>
> >> On 2021-05-25 14:05, Alex Deucher wrote:
> >>
> >> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >> wrote:
> >>
> >> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >> <alexdeucher@gmail.com> wrote:
> >>
> >> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >> wrote:
> >>
> >> Good Evening,
> >>
> >> I am stress testing the pcie controller on the rk3566-quartz64
> >> prototype SBC.
> >> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >> controller, which makes a dGPU theoretically possible.
> >> While attempting to light off a HD7570 card I manage to get a
> >> modeset
> >> console, but ring0 test fails and disables acceleration.
> >>
> >> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >> kernel.
> >> Any insight you can provide would be much appreciated.
> >>
> >> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >> does the CPU allow cache snoops from PCIe devices?  That is required
> >> for the driver to operate.
> >>
> >> Ah, most likely not.
> >> This issue has come up already as the GIC isn't permitted to snoop on
> >> the CPUs, so I doubt the PCIe controller can either.
> >>
> >> Is there no way to work around this or is it dead in the water?
> >>
> >> It's required by the pcie spec.  You could potentially work around it
> >> if you can allocate uncached memory for DMA, but I don't think that is
> >> possible currently.  Ideally we'd figure out some way to detect if a
> >> particular platform supports cache snooping or not as well.
> >>
> >> There's device_get_dma_attr(), although I don't think it will work
> >> currently for PCI devices without an OF or ACPI node - we could
> >> perhaps do with a PCI-specific wrapper which can walk up and defer
> >> to the host bridge's firmware description as necessary.
> >>
> >> The common DMA ops *do* correctly keep track of per-device coherency
> >> internally, but drivers aren't supposed to be poking at that
> >> information directly.
> >>
> >> That sounds like you underestimate the problem. ARM has unfortunately
> >> made the coherency for PCI an optional IP.
> >>
> >> Sorry to be that guy, but I'm involved a lot internally with our
> >> system IP and interconnect, and I probably understand the situation
> >> better than 99% of the community ;)
> >>
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >> For the record, the SBSA specification (the closet thing we have to a
> >> "system architecture") does require that PCIe is integrated in an
> >> I/O-coherent manner, but we don't have any control over what people do
> >> in embedded applications (note that we don't make PCIe IP at all, and
> >> there is plenty of 3rd-party interconnect IP).
> >>
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >> So we are talking about a hardware limitation which potentially can't
> >> be fixed without replacing the hardware.
> >>
> >> You expressed interest in "some way to detect if a particular platform
> >> supports cache snooping or not", by which I assumed you meant a
> >> software method for the amdgpu/radeon drivers to call, rather than,
> >> say, a website that driver maintainers can look up SoC names on. I'm
> >> saying that that API already exists (just may need a bit more work).
> >> Note that it is emphatically not a platform-level thing since
> >> coherency can and does vary per device within a system.
> >>
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >> I wasn't suggesting that Linux could somehow make coherency magically
> >> work when the signals don't physically exist in the interconnect - I
> >> was assuming you'd merely want to do something like throw a big
> >> warning and taint the kernel to help triage bug reports. Some drivers
> >> like ahci_qoriq and panfrost simply need to know so they can program
> >> their device to emit the appropriate memory attributes either way, and
> >> rely on the DMA API to hide the rest of the difference, but if you
> >> want to treat non-coherent use as unsupported because it would require
> >> too invasive changes that's fine by me.
> >>
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >> Robin.
> >>
> >> _______________________________________________
> >> Linux-rockchip mailing list
> >> Linux-rockchip@lists.infradead.org
> >> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fmailman%2Flistinfo%2Flinux-rockchip&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=gyKyym%2BH%2F9u%2FfBP953N97x%2BOJBt9EaR2aPivWITwlPo%3D&amp;reserved=0
> >>
> >>
>

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-18 12:45                               ` Peter Geis
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Geis @ 2022-03-18 12:45 UTC (permalink / raw)
  To: Christian König
  Cc: Tao Huang, open list:ARM/Rockchip SoC...,
	Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	Deucher, Alexander, Alex Deucher, Robin Murphy

On Fri, Mar 18, 2022 at 8:31 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 18.03.22 um 12:24 schrieb Peter Geis:
> > On Fri, Mar 18, 2022 at 4:35 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >>
> >> Am 18.03.22 um 08:51 schrieb Kever Yang:
> >>
> >>
> >> On 2022/3/17 20:19, Peter Geis wrote:
> >>
> >> On Wed, Mar 16, 2022 at 11:08 PM Kever Yang <kever.yang@rock-chips.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2022/3/17 08:14, Peter Geis wrote:
> >>
> >> Good Evening,
> >>
> >> I apologize for raising this email chain from the dead, but there have
> >> been some developments that have introduced even more questions.
> >> I've looped the Rockchip mailing list into this too, as this affects
> >> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >>
> >> TLDR for those not familiar: It seems the rk356x series (and possibly
> >> the rk3588) were built without any outer coherent cache.
> >> This means (unless Rockchip wants to clarify here) devices such as the
> >> ITS and PCIe cannot utilize cache snooping.
> >> This is based on the results of the email chain [2].
> >>
> >> The new circumstances are as follows:
> >> The RPi CM4 Adventure Team as I've taken to calling them has been
> >> attempting to get a dGPU working with the very broken Broadcom
> >> controller in the RPi CM4.
> >> Recently they acquired a SoQuartz rk3566 module which is pin
> >> compatible with the CM4, and have taken to trying it out as well.
> >>
> >> This is how I got involved.
> >> It seems they found a trivial way to force the Radeon R600 driver to
> >> use Non-Cached memory for everything.
> >> This single line change, combined with using memset_io instead of
> >> memset, allows the ring tests to pass and the card probes successfully
> >> (minus the DMA limitations of the rk356x due to the 32 bit
> >> interconnect).
> >> I discovered using this method that we start having unaligned io
> >> memory access faults (bus errors) when running glmark2-drm (running
> >> glmark2 directly was impossible, as both X and Wayland crashed too
> >> early).
> >> I traced this to using what I thought at the time was an unsafe memcpy
> >> in the mesa stack.
> >> Rewriting this function to force aligned writes solved the problem and
> >> allows glmark2-drm to run to completion.
> >> With some extensive debugging, I found about half a dozen memcpy
> >> functions in mesa that if forced to be aligned would allow Wayland to
> >> start, but with hilarious display corruption (see [3]. [4]).
> >> The CM4 team is convinced this is an issue with memcpy in glibc, but
> >> I'm not convinced it's that simple.
> >>
> >> On my two hour drive in to work this morning, I got to thinking.
> >> If this was an memcpy fault, this would be universally broken on arm64
> >> which is obviously not the case.
> >> So I started thinking, what is different here than with systems known to work:
> >> 1. No IOMMU for the PCIe controller.
> >> 2. The Outer Cache Issue.
> >>
> >> Robin:
> >> My questions for you, since you're the smartest person I know about
> >> arm64 memory management:
> >> Could cache snooping permit unaligned accesses to IO to be safe?
> >> Or
> >> Is it the lack of an IOMMU that's causing the ali gnment faults to become fatal?
> >> Or
> >> Am I insane here?
> >>
> >> Rockchip:
> >> Please update on the status for the Outer Cache errata for ITS services.
> >>
> >> Our SoC design team has double check with ARM GIC/ITS IP team for many
> >> times, and the GITS_CBASER
> >> of GIC600 IP does not support hardware bind or config to a fix value, so
> >> they insist this is an IP
> >> limitation instead of a SoC bug, software should take  care of it :(
> >> I will check again if we can provide errata for this issue.
> >>
> >> Thanks. This is necessary as the mbi-alias provides an imperfect
> >> implementation of the ITS and causes certain PCIe cards (eg x520 Intel
> >> 10G NIC) to misbehave.
> >>
> >> Please provide an answer to the errata of the PCIe controller, in
> >> regard to cache snooping and buffering, for both the rk356x and the
> >> upcoming rk3588.
> >>
> >>
> >> Sorry, what is this?
> >>
> >> Part of the ITS bug is it expects to be cache coherent with the CPU
> >> cluster by design.
> >> Due to the rk356x being implemented without an outer accessible cache,
> >> the ITS and other devices that require cache coherency (PCIe for
> >> example) crash in fun ways.
> >>
> >> Then this is still the ITS issue, not PCIe issue.
> >> PCIe is a peripheral bus controller like USB and other device, the driver should maintain the "cache coherency" if there is any, and there is no requirement for hardware cache coherency between PCIe and CPU.
> > Kever,
> >
> > These issues are one and the same.
>
> Well, that's not correct. You are still mixing two things up here:
>
> 1. The memory accesses from the device to the system memory must be
> coherent with the CPU cache. E.g. we root complex must snoop the CPU cache.
>      That's a requirement of the PCIe spec. If you don't get that right
> a whole bunch of PCIe devices won't work correctly.

The ITS issue referred to here is the same root problem.
See:
https://lore.kernel.org/lkml/874kg0q6lc.wl-maz@kernel.org/raw
for the description of that issue.
(It's actually two issues, lack of cache snooping, and the 32 bit bus
forcing DMA to be limited to <4G ram)

>
> 2. The memory accesses from the CPU to the devices PCIe BAR can be
> unaligned. E.g. a 64bit read can be aligned on a 32bit address.
>      That is a requirement of the graphics stack. Other devices still
> might work fine without that.

Correct, this is a separate issue, but only becomes obvious when the
cache issue is bypassed.
At least for Radeon, the ring tests fail immediately due to issue 1.
I'm waiting for the DWC-PCIe maintainers to weigh in here, but in the
meantime I've been reading up on the way it was supposed to be
implemented.
IF (big IF here) I'm understanding it correctly, they permit synthesis
of the PCIe controller with or without support for unaligned accesses.

>
> Regards,
> Christian.

Thanks for everything so far!
Peter

>
> > Certain hardware blocks *require* cache coherency as part of their design.
> > All of the *interesting* things PCIe can do stem from it.
> >
> > When I saw you bumped the available window to the PCIe controller to
> > 1GB I was really excited, because that meant we could finally support
> > devices that used these interesting features.
> > However, without cache coherency, having more than a 256MB window is a
> > waste, as any card that can take advantage of it *requires* coherency.
> > The same thing goes for a resizable BAR.
> > EP mode is the same, having the ability to connect one CPU to another
> > CPU over a PCIe bus loses the advantages when you don't have
> > coherency.
> > At that point, you might as well toss in a 2.5GB ethernet port and
> > just use that instead.
> >
> >>
> >> Well then I suggest to re-read the PCIe specification.
> >>
> >> Cache coherency is defined as mandatory there. Non-cache coherency is an optional feature.
> >>
> >> See section 2.2.6.5 in the PCIe 2.0 specification for a good example.
> >>
> >> Regards,
> >> Christian.
> >>
> >>
> >> We didn't see any transfer error on rk356x PCIe till now, we can take a look if it's easy to reproduce.
> > It's easy to reproduce, just try to use any card that has a
> > significantly large enough BAR to warrant requiring coherency.
> > dGPUs are the most readily accessible device, but High Performance
> > Computing Acceleration devices and high power FPGAs also would work.
> > Was the resizable bar tested at all internally either?
> > Any current device that could use that requires coherency.
> > And like above, EP mode without coherency is a waste at best, and
> > unpleasant at worst.
> >
> > Very Respectfully,
> > Peter
> >
> >> Thanks,
> >> - Kever
> >>
> >>
> >> This means that rk356x cannot implement a specification compliant ITS or PCIe.
> >> >From the rk3588 source dump it appears it was produced without an
> >> outer accessible cache, which means if true it also will be unable to
> >> use any PCIe cards that implement cache coherency as part of their
> >> design.
> >>
> >>
> >> Thanks,
> >> - Kever
> >>
> >> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FJeffyCN%2Fmirrors%2Fcommit%2F0b985f29304dcb9d644174edacb67298e8049d4f&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=et3jUQ1Y2QaR56qTjl4LJ1vGurPwK8HfLosebUIV9bc%3D&amp;reserved=0
> >> [2] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F871rbdt4tu.wl-maz%40kernel.org%2FT%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=UrGSye7MpCUO9tppCCmgSGlNa6X0otJ8tkcOb2PXjA8%3D&amp;reserved=0
> >> [3] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953414755970850816%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=agZjpl0LvSf4Jo3SoETVkW72uN0WiHb%2FYUA7V7c0G88%3D&amp;reserved=0
> >> [4] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcdn.discordapp.com%2Fattachments%2F926487797844541510%2F953424952042852422%2Funknown.png&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=tuBS9UfMegc1bc7U98zpsfQ1vUKsVmpscmNKpkn%2BHmk%3D&amp;reserved=0
> >>
> >> Thank you everyone for your time.
> >>
> >> Very Respectfully,
> >> Peter Geis
> >>
> >> On Wed, May 26, 2021 at 7:21 AM Christian König
> >> <christian.koenig@amd.com> wrote:
> >>
> >> Hi Robin,
> >>
> >> Am 26.05.21 um 12:59 schrieb Robin Murphy:
> >>
> >> On 2021-05-26 10:42, Christian König wrote:
> >>
> >> Hi Robin,
> >>
> >> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> >>
> >> On 2021-05-25 14:05, Alex Deucher wrote:
> >>
> >> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> >> wrote:
> >>
> >> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> >> <alexdeucher@gmail.com> wrote:
> >>
> >> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> >> wrote:
> >>
> >> Good Evening,
> >>
> >> I am stress testing the pcie controller on the rk3566-quartz64
> >> prototype SBC.
> >> This device has 1GB available at <0x3 0x00000000> for the PCIe
> >> controller, which makes a dGPU theoretically possible.
> >> While attempting to light off a HD7570 card I manage to get a
> >> modeset
> >> console, but ring0 test fails and disables acceleration.
> >>
> >> Note, we do not have UEFI, so all PCIe setup is from the Linux
> >> kernel.
> >> Any insight you can provide would be much appreciated.
> >>
> >> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> >> does the CPU allow cache snoops from PCIe devices?  That is required
> >> for the driver to operate.
> >>
> >> Ah, most likely not.
> >> This issue has come up already as the GIC isn't permitted to snoop on
> >> the CPUs, so I doubt the PCIe controller can either.
> >>
> >> Is there no way to work around this or is it dead in the water?
> >>
> >> It's required by the pcie spec.  You could potentially work around it
> >> if you can allocate uncached memory for DMA, but I don't think that is
> >> possible currently.  Ideally we'd figure out some way to detect if a
> >> particular platform supports cache snooping or not as well.
> >>
> >> There's device_get_dma_attr(), although I don't think it will work
> >> currently for PCI devices without an OF or ACPI node - we could
> >> perhaps do with a PCI-specific wrapper which can walk up and defer
> >> to the host bridge's firmware description as necessary.
> >>
> >> The common DMA ops *do* correctly keep track of per-device coherency
> >> internally, but drivers aren't supposed to be poking at that
> >> information directly.
> >>
> >> That sounds like you underestimate the problem. ARM has unfortunately
> >> made the coherency for PCI an optional IP.
> >>
> >> Sorry to be that guy, but I'm involved a lot internally with our
> >> system IP and interconnect, and I probably understand the situation
> >> better than 99% of the community ;)
> >>
> >> I need to apologize, didn't realized who was answering :)
> >>
> >> It just sounded to me that you wanted to suggest to the end user that
> >> this is fixable in software and I really wanted to avoid even more
> >> customers coming around asking how to do this.
> >>
> >> For the record, the SBSA specification (the closet thing we have to a
> >> "system architecture") does require that PCIe is integrated in an
> >> I/O-coherent manner, but we don't have any control over what people do
> >> in embedded applications (note that we don't make PCIe IP at all, and
> >> there is plenty of 3rd-party interconnect IP).
> >>
> >> So basically it is not the fault of the ARM IP-core, but people are just
> >> stitching together PCIe interconnect IP with a core where it is not
> >> supposed to be used with.
> >>
> >> Do I get that correctly? That's an interesting puzzle piece in the picture.
> >>
> >> So we are talking about a hardware limitation which potentially can't
> >> be fixed without replacing the hardware.
> >>
> >> You expressed interest in "some way to detect if a particular platform
> >> supports cache snooping or not", by which I assumed you meant a
> >> software method for the amdgpu/radeon drivers to call, rather than,
> >> say, a website that driver maintainers can look up SoC names on. I'm
> >> saying that that API already exists (just may need a bit more work).
> >> Note that it is emphatically not a platform-level thing since
> >> coherency can and does vary per device within a system.
> >>
> >> Well, I think this is not something an individual driver should mess
> >> with. What the driver should do is just express that it needs coherent
> >> access to all of system memory and if that is not possible fail to load
> >> with a warning why it is not possible.
> >>
> >> I wasn't suggesting that Linux could somehow make coherency magically
> >> work when the signals don't physically exist in the interconnect - I
> >> was assuming you'd merely want to do something like throw a big
> >> warning and taint the kernel to help triage bug reports. Some drivers
> >> like ahci_qoriq and panfrost simply need to know so they can program
> >> their device to emit the appropriate memory attributes either way, and
> >> rely on the DMA API to hide the rest of the difference, but if you
> >> want to treat non-coherent use as unsupported because it would require
> >> too invasive changes that's fine by me.
> >>
> >> Yes exactly that please. I mean not sure how panfrost is doing it, but
> >> at least the Vulkan userspace API specification requires devices to have
> >> coherent access to system memory.
> >>
> >> So even if I would want to do this it is simply not possible because the
> >> application doesn't tell the driver which memory is accessed by the
> >> device and which by the CPU.
> >>
> >> Christian.
> >>
> >> Robin.
> >>
> >> _______________________________________________
> >> Linux-rockchip mailing list
> >> Linux-rockchip@lists.infradead.org
> >> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.infradead.org%2Fmailman%2Flistinfo%2Flinux-rockchip&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C618d68406abf46aceb1708da08d1f61e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637831995714063605%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=gyKyym%2BH%2F9u%2FfBP953N97x%2BOJBt9EaR2aPivWITwlPo%3D&amp;reserved=0
> >>
> >>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
  2022-03-17  0:14                 ` Peter Geis
@ 2022-03-23 21:06                   ` Alex Deucher
  -1 siblings, 0 replies; 45+ messages in thread
From: Alex Deucher @ 2022-03-23 21:06 UTC (permalink / raw)
  To: Peter Geis
  Cc: Kever Yang, Robin Murphy, Shawn Lin, Christian König,
	Christian König, Deucher, Alexander, amd-gfx list,
	open list:ARM/Rockchip SoC...

On Wed, Mar 16, 2022 at 8:14 PM Peter Geis <pgwipeout@gmail.com> wrote:
>
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.

another similar datapoint for reference:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/3274

Alex

>
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.
>
> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Hi Robin,
> >
> > Am 26.05.21 um 12:59 schrieb Robin Murphy:
> > > On 2021-05-26 10:42, Christian König wrote:
> > >> Hi Robin,
> > >>
> > >> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> > >>> On 2021-05-25 14:05, Alex Deucher wrote:
> > >>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> > >>>>> <alexdeucher@gmail.com> wrote:
> > >>>>>>
> > >>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> Good Evening,
> > >>>>>>>
> > >>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> > >>>>>>> prototype SBC.
> > >>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> > >>>>>>> controller, which makes a dGPU theoretically possible.
> > >>>>>>> While attempting to light off a HD7570 card I manage to get a
> > >>>>>>> modeset
> > >>>>>>> console, but ring0 test fails and disables acceleration.
> > >>>>>>>
> > >>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> > >>>>>>> kernel.
> > >>>>>>> Any insight you can provide would be much appreciated.
> > >>>>>>
> > >>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> > >>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> > >>>>>> for the driver to operate.
> > >>>>>
> > >>>>> Ah, most likely not.
> > >>>>> This issue has come up already as the GIC isn't permitted to snoop on
> > >>>>> the CPUs, so I doubt the PCIe controller can either.
> > >>>>>
> > >>>>> Is there no way to work around this or is it dead in the water?
> > >>>>
> > >>>> It's required by the pcie spec.  You could potentially work around it
> > >>>> if you can allocate uncached memory for DMA, but I don't think that is
> > >>>> possible currently.  Ideally we'd figure out some way to detect if a
> > >>>> particular platform supports cache snooping or not as well.
> > >>>
> > >>> There's device_get_dma_attr(), although I don't think it will work
> > >>> currently for PCI devices without an OF or ACPI node - we could
> > >>> perhaps do with a PCI-specific wrapper which can walk up and defer
> > >>> to the host bridge's firmware description as necessary.
> > >>>
> > >>> The common DMA ops *do* correctly keep track of per-device coherency
> > >>> internally, but drivers aren't supposed to be poking at that
> > >>> information directly.
> > >>
> > >> That sounds like you underestimate the problem. ARM has unfortunately
> > >> made the coherency for PCI an optional IP.
> > >
> > > Sorry to be that guy, but I'm involved a lot internally with our
> > > system IP and interconnect, and I probably understand the situation
> > > better than 99% of the community ;)
> >
> > I need to apologize, didn't realized who was answering :)
> >
> > It just sounded to me that you wanted to suggest to the end user that
> > this is fixable in software and I really wanted to avoid even more
> > customers coming around asking how to do this.
> >
> > > For the record, the SBSA specification (the closet thing we have to a
> > > "system architecture") does require that PCIe is integrated in an
> > > I/O-coherent manner, but we don't have any control over what people do
> > > in embedded applications (note that we don't make PCIe IP at all, and
> > > there is plenty of 3rd-party interconnect IP).
> >
> > So basically it is not the fault of the ARM IP-core, but people are just
> > stitching together PCIe interconnect IP with a core where it is not
> > supposed to be used with.
> >
> > Do I get that correctly? That's an interesting puzzle piece in the picture.
> >
> > >> So we are talking about a hardware limitation which potentially can't
> > >> be fixed without replacing the hardware.
> > >
> > > You expressed interest in "some way to detect if a particular platform
> > > supports cache snooping or not", by which I assumed you meant a
> > > software method for the amdgpu/radeon drivers to call, rather than,
> > > say, a website that driver maintainers can look up SoC names on. I'm
> > > saying that that API already exists (just may need a bit more work).
> > > Note that it is emphatically not a platform-level thing since
> > > coherency can and does vary per device within a system.
> >
> > Well, I think this is not something an individual driver should mess
> > with. What the driver should do is just express that it needs coherent
> > access to all of system memory and if that is not possible fail to load
> > with a warning why it is not possible.
> >
> > >
> > > I wasn't suggesting that Linux could somehow make coherency magically
> > > work when the signals don't physically exist in the interconnect - I
> > > was assuming you'd merely want to do something like throw a big
> > > warning and taint the kernel to help triage bug reports. Some drivers
> > > like ahci_qoriq and panfrost simply need to know so they can program
> > > their device to emit the appropriate memory attributes either way, and
> > > rely on the DMA API to hide the rest of the difference, but if you
> > > want to treat non-coherent use as unsupported because it would require
> > > too invasive changes that's fine by me.
> >
> > Yes exactly that please. I mean not sure how panfrost is doing it, but
> > at least the Vulkan userspace API specification requires devices to have
> > coherent access to system memory.
> >
> > So even if I would want to do this it is simply not possible because the
> > application doesn't tell the driver which memory is accessed by the
> > device and which by the CPU.
> >
> > Christian.
> >
> > >
> > > Robin.
> >

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: radeon ring 0 test failed on arm64
@ 2022-03-23 21:06                   ` Alex Deucher
  0 siblings, 0 replies; 45+ messages in thread
From: Alex Deucher @ 2022-03-23 21:06 UTC (permalink / raw)
  To: Peter Geis
  Cc: Christian König, Shawn Lin, Kever Yang, amd-gfx list,
	open list:ARM/Rockchip SoC...,
	Deucher, Alexander, Robin Murphy, Christian König

On Wed, Mar 16, 2022 at 8:14 PM Peter Geis <pgwipeout@gmail.com> wrote:
>
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team as I've taken to calling them has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3]. [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.

another similar datapoint for reference:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/3274

Alex

>
> On my two hour drive in to work this morning, I got to thinking.
> If this was an memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.
>
> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Hi Robin,
> >
> > Am 26.05.21 um 12:59 schrieb Robin Murphy:
> > > On 2021-05-26 10:42, Christian König wrote:
> > >> Hi Robin,
> > >>
> > >> Am 25.05.21 um 22:09 schrieb Robin Murphy:
> > >>> On 2021-05-25 14:05, Alex Deucher wrote:
> > >>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> > >>>>> <alexdeucher@gmail.com> wrote:
> > >>>>>>
> > >>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> Good Evening,
> > >>>>>>>
> > >>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> > >>>>>>> prototype SBC.
> > >>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> > >>>>>>> controller, which makes a dGPU theoretically possible.
> > >>>>>>> While attempting to light off a HD7570 card I manage to get a
> > >>>>>>> modeset
> > >>>>>>> console, but ring0 test fails and disables acceleration.
> > >>>>>>>
> > >>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> > >>>>>>> kernel.
> > >>>>>>> Any insight you can provide would be much appreciated.
> > >>>>>>
> > >>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
> > >>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
> > >>>>>> for the driver to operate.
> > >>>>>
> > >>>>> Ah, most likely not.
> > >>>>> This issue has come up already as the GIC isn't permitted to snoop on
> > >>>>> the CPUs, so I doubt the PCIe controller can either.
> > >>>>>
> > >>>>> Is there no way to work around this or is it dead in the water?
> > >>>>
> > >>>> It's required by the pcie spec.  You could potentially work around it
> > >>>> if you can allocate uncached memory for DMA, but I don't think that is
> > >>>> possible currently.  Ideally we'd figure out some way to detect if a
> > >>>> particular platform supports cache snooping or not as well.
> > >>>
> > >>> There's device_get_dma_attr(), although I don't think it will work
> > >>> currently for PCI devices without an OF or ACPI node - we could
> > >>> perhaps do with a PCI-specific wrapper which can walk up and defer
> > >>> to the host bridge's firmware description as necessary.
> > >>>
> > >>> The common DMA ops *do* correctly keep track of per-device coherency
> > >>> internally, but drivers aren't supposed to be poking at that
> > >>> information directly.
> > >>
> > >> That sounds like you underestimate the problem. ARM has unfortunately
> > >> made the coherency for PCI an optional IP.
> > >
> > > Sorry to be that guy, but I'm involved a lot internally with our
> > > system IP and interconnect, and I probably understand the situation
> > > better than 99% of the community ;)
> >
> > I need to apologize, didn't realized who was answering :)
> >
> > It just sounded to me that you wanted to suggest to the end user that
> > this is fixable in software and I really wanted to avoid even more
> > customers coming around asking how to do this.
> >
> > > For the record, the SBSA specification (the closet thing we have to a
> > > "system architecture") does require that PCIe is integrated in an
> > > I/O-coherent manner, but we don't have any control over what people do
> > > in embedded applications (note that we don't make PCIe IP at all, and
> > > there is plenty of 3rd-party interconnect IP).
> >
> > So basically it is not the fault of the ARM IP-core, but people are just
> > stitching together PCIe interconnect IP with a core where it is not
> > supposed to be used with.
> >
> > Do I get that correctly? That's an interesting puzzle piece in the picture.
> >
> > >> So we are talking about a hardware limitation which potentially can't
> > >> be fixed without replacing the hardware.
> > >
> > > You expressed interest in "some way to detect if a particular platform
> > > supports cache snooping or not", by which I assumed you meant a
> > > software method for the amdgpu/radeon drivers to call, rather than,
> > > say, a website that driver maintainers can look up SoC names on. I'm
> > > saying that that API already exists (just may need a bit more work).
> > > Note that it is emphatically not a platform-level thing since
> > > coherency can and does vary per device within a system.
> >
> > Well, I think this is not something an individual driver should mess
> > with. What the driver should do is just express that it needs coherent
> > access to all of system memory and if that is not possible fail to load
> > with a warning why it is not possible.
> >
> > >
> > > I wasn't suggesting that Linux could somehow make coherency magically
> > > work when the signals don't physically exist in the interconnect - I
> > > was assuming you'd merely want to do something like throw a big
> > > warning and taint the kernel to help triage bug reports. Some drivers
> > > like ahci_qoriq and panfrost simply need to know so they can program
> > > their device to emit the appropriate memory attributes either way, and
> > > rely on the DMA API to hide the rest of the difference, but if you
> > > want to treat non-coherent use as unsupported because it would require
> > > too invasive changes that's fine by me.
> >
> > Yes exactly that please. I mean not sure how panfrost is doing it, but
> > at least the Vulkan userspace API specification requires devices to have
> > coherent access to system memory.
> >
> > So even if I would want to do this it is simply not possible because the
> > application doesn't tell the driver which memory is accessed by the
> > device and which by the CPU.
> >
> > Christian.
> >
> > >
> > > Robin.
> >

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2022-03-23 21:06 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-25  2:34 radeon ring 0 test failed on arm64 Peter Geis
2021-05-25 12:46 ` Alex Deucher
2021-05-25 12:55   ` Peter Geis
2021-05-25 13:05     ` Alex Deucher
2021-05-25 13:18       ` Peter Geis
2021-05-25 20:09       ` Robin Murphy
2021-05-26  9:42         ` Christian König
2021-05-26 10:59           ` Robin Murphy
2021-05-26 11:21             ` Christian König
2022-03-17  0:14               ` Peter Geis
2022-03-17  0:14                 ` Peter Geis
2022-03-17  3:07                 ` Kever Yang
2022-03-17  3:07                   ` Kever Yang
2022-03-17 12:19                   ` Peter Geis
2022-03-17 12:19                     ` Peter Geis
2022-03-18  7:51                     ` Kever Yang
2022-03-18  7:51                       ` Kever Yang
2022-03-18  8:35                       ` Christian König
2022-03-18 11:24                         ` Peter Geis
2022-03-18 11:24                           ` Peter Geis
2022-03-18 12:31                           ` Christian König
2022-03-18 12:31                             ` Christian König
2022-03-18 12:45                             ` Peter Geis
2022-03-18 12:45                               ` Peter Geis
2022-03-17  9:14                 ` Christian König
2022-03-17  9:14                   ` Christian König
2022-03-17 12:21                   ` Peter Geis
2022-03-17 12:21                     ` Peter Geis
2022-03-17 20:27                   ` Alex Deucher
2022-03-17 20:27                     ` Alex Deucher
2022-03-17 10:37                 ` Robin Murphy
2022-03-17 10:37                   ` Robin Murphy
2022-03-17 12:26                   ` Peter Geis
2022-03-17 12:26                     ` Peter Geis
2022-03-17 12:51                     ` Christian König
2022-03-17 12:51                       ` Christian König
2022-03-17 13:17                     ` Robin Murphy
2022-03-17 13:17                       ` Robin Murphy
2022-03-17 14:21                       ` Peter Geis
2022-03-17 14:21                         ` Peter Geis
2022-03-23 21:06                 ` Alex Deucher
2022-03-23 21:06                   ` Alex Deucher
2021-05-25 14:08 ` Christian König
2021-05-25 14:19   ` Peter Geis
2021-05-25 15:09     ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.