From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 105113] [hawaii, radeonsi, clover] Running Piglit cl/program/execute/{, tail-}calls{, -struct, -workitem-id}.cl cause GPU VM error and ring stalled GPU lockup
Date: Sun, 29 Apr 2018 22:23:24 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=105113

Maciej S. Szmigiero changed bug 105113:
What    |Removed |Added
----------------------------------------------
CC      |        |mail@maciej.szmigiero.name

--- Comment #2 on bug 105113 from Maciej S. Szmigiero ---
I've also hit this issue on "Oland PRO [Radeon R7 240/340] (rev 87)" with
mesa-18.1.0_rc2, llvm-6.0.0 and kernel 4.16.5.

The crash happens at "cl/program/execute/calls-struct.cl" from piglit as well.
It happens both in an X session and on a KMS console.

The exact crash looks like this:
[  171.969488] radeon 0000:20:00.0: GPU fault detected: 147 0x06106001
[  171.969489] radeon 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500030
[  171.969490] radeon 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x10060001
[  171.969491] VM fault (0x01, vmid 8) at page 5242928, read from CB (96)

Then the radeon driver tries to reset the GPU endlessly.
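For cross-checking, the fields printed in the "VM fault" line can be recovered from the raw VM_CONTEXT1_PROTECTION_FAULT_STATUS value. A minimal sketch, assuming the bit layout used by the SI/CIK register definitions in the Linux radeon driver (protection flags in bits 0-3, memory client ID in bits 12-19, read/write flag in bit 24, VMID in bits 25-28); the helper name is hypothetical.

```python
from typing import NamedTuple


class FaultStatus(NamedTuple):
    protections: int  # protection fault flags, printed as "(0xNN, ..."
    client_id: int    # memory client ID, printed as the trailing "(NN)"
    is_write: bool    # False means the faulting access was a read
    vmid: int         # VM context that faulted


def decode_fault_status(status: int) -> FaultStatus:
    """Split a FAULT_STATUS register value into its fields.

    Assumption: the masks below mirror the SI/CIK layout in the radeon driver
    headers; other ASIC generations may use a different layout.
    """
    return FaultStatus(
        protections=status & 0xF,
        client_id=(status >> 12) & 0xFF,
        is_write=bool((status >> 24) & 0x1),
        vmid=(status >> 25) & 0xF,
    )


if __name__ == "__main__":
    # STATUS value from the radeon log above.
    print(decode_fault_status(0x10060001))
```

Applied to 0x10060001 this yields protections 1, client 96, a read, and vmid 8, matching "VM fault (0x01, vmid 8) ... read from CB (96)". The amdgpu STATUS values below (0x08060002, 0x08050002, 0x08010002) decode to vmid 4 with clients 96, 80 and 16, again matching the printed lines.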
I've tried the radeon module parameters pcie_gen2=0, msi=0, dpm=0, hard_reset=1
and vm_size=16 in various combinations; nothing seems to help (msi=0 gives a
ton of IOMMU errors, BTW).
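Trying parameter workarounds like these "in various combinations" can be made systematic by enumerating every non-empty subset. A minimal sketch, assuming the parameters are passed on the kernel command line in the usual "radeon.<param>=<value>" form; the helper name is hypothetical and the values are the ones listed above.

```python
from itertools import combinations

# Parameter values taken from the comment above.
PARAMS = {
    "pcie_gen2": "0",
    "msi": "0",
    "dpm": "0",
    "hard_reset": "1",
    "vm_size": "16",
}


def cmdline_variants(params: dict[str, str]) -> list[str]:
    """Return one kernel command-line fragment per non-empty subset of params."""
    names = sorted(params)  # stable ordering so runs are reproducible
    variants = []
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            variants.append(" ".join(f"radeon.{n}={params[n]}" for n in subset))
    return variants


if __name__ == "__main__":
    for fragment in cmdline_variants(PARAMS):
        print(fragment)
```

Five parameters give 31 variants, starting with "radeon.dpm=0". When radeon is built as a module, the same settings can instead go on an "options radeon ..." line in a modprobe.d configuration file.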

I've also tried amdgpu, which gives a similar crash (although this driver
doesn't appear to attempt a GPU reset afterwards):
[  435.596230] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[  435.596233] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500060
[  435.596235] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08060002
[  435.596239] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (96)
[  435.596245] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[  435.596247] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500060
[  435.596248] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08050002
[  435.596252] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (80)
[  435.596256] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[  435.596258] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500060
[  435.596260] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08010002
[  435.596263] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (16)
[  435.596267] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c085002
[  435.596269] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500060
[  435.596271] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08050002
[  435.596274] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (80)
[  435.596278] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c085002
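Every amdgpu fault above reports the same page, 5242976, which is simply the decimal form of the FAULT_ADDR value 0x00500060 (likewise, 5242928 in the radeon log is 0x00500030). A small sketch for pulling the fault lines out of a dmesg excerpt and converting each page to a byte address, assuming 4 KiB GPU pages; the regular expression is written against the lines quoted here and the helper name is hypothetical.

```python
import re

# Matches lines such as:
#   "VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (96)"
#   "VM fault (0x01, vmid 8) at page 5242928, read from CB (96)"
VM_FAULT_RE = re.compile(
    r"VM fault \(0x[0-9a-f]+, vmid (?P<vmid>\d+)\) "
    r"at page (?P<page>\d+), (?:read|write) from .*\((?P<client>\d+)\)$"
)
GPU_PAGE_SIZE = 4096  # assumption: 4 KiB GPU pages


def parse_vm_faults(log: str) -> list[tuple[int, int, int, int]]:
    """Return (vmid, page, byte_address, client_id) for each VM fault line."""
    faults = []
    for line in log.splitlines():
        m = VM_FAULT_RE.search(line.strip())
        if m:
            page = int(m["page"])
            faults.append((int(m["vmid"]), page, page * GPU_PAGE_SIZE, int(m["client"])))
    return faults
```

Under that page-size assumption, page 5242976 corresponds to GPU virtual address 0x500060000, roughly 20 GiB into the VM address space, and the radeon fault at page 5242928 to 0x500030000.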

This might (also?) be a kernel bug, since a userspace program should not be
able to crash the GPU, no matter how incorrect the command stream it submits.


You are receiving this mail because:
  • You are the assignee for the bug.