From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung!
Date: Wed, 22 Aug 2018 14:33:03 +0000
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============1933027203=="
Received: from culpepper.freedesktop.org (culpepper.freedesktop.org
[IPv6:2610:10:20:722:a800:ff:fe98:4b55])
by gabe.freedesktop.org (Postfix) with ESMTP id 144706E1F7
for ; Wed, 22 Aug 2018 14:33:03 +0000 (UTC)
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org
--===============1933027203==
Content-Type: multipart/alternative; boundary="15349483820.99a31a.9759"
Content-Transfer-Encoding: 7bit
--15349483820.99a31a.9759
Date: Wed, 22 Aug 2018 14:33:02 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated
https://bugs.freedesktop.org/show_bug.cgi?id=102322
--- Comment #60 from Andrey Grodzovsky ---
(In reply to dwagner from comment #58)
> Here comes another trace log, with your info2.patch applied.
>
> Something must have changed since the last test, as it took pretty long this
> time to reproduce the crash. Could that have been caused by
> https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8
> now being part of the kernel?
I don't think it's related. That code mostly concerns virtualization.
>
> However, the latest trace you find attached below is not much different to
> the last one, xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you:
>
> [ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=475104, emitted seq=475106
> [ 1510.023117] [drm] GPU recovery disabled.
That just means you are again running with the GPU VM update mode set to use
SDMA, which is visible in your dmesg (amdgpu.vm_update_mode=0), so you are
again experiencing the original SDMA hang. Please use
amdgpu.vm_update_mode=3 to get back to the VM_FAULT issue.
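
To double-check which mode a given boot used, the option can be pulled out of the kernel command line. This is only a minimal sketch: it assumes the option was passed on the command line (not via modprobe.d), the value meanings are recalled from the amdgpu module parameter description rather than taken from this bug, and the helper names and example command line are made up for illustration.

```python
import re

# Meanings as recalled from the amdgpu module parameter description
# (an assumption, not taken from this bug report): the value selects
# which VMs get their page tables updated by the CPU; the rest are
# updated through SDMA.
VM_UPDATE_MODES = {
    0: "never use the CPU (page tables updated via SDMA)",
    1: "CPU updates for graphics VMs only",
    2: "CPU updates for compute VMs only",
    3: "CPU updates for both graphics and compute VMs",
}


def vm_update_mode_from_cmdline(cmdline):
    """Return the amdgpu.vm_update_mode value in cmdline, or None if absent."""
    match = re.search(r"(?:^|\s)amdgpu\.vm_update_mode=(-?\d+)(?=\s|$)", cmdline)
    return int(match.group(1)) if match else None


def describe_vm_update_mode(mode):
    """Human-readable description of a vm_update_mode value."""
    if mode is None or mode < 0:
        return "driver default applies"
    return VM_UPDATE_MODES.get(mode, "unknown value %d" % mode)


if __name__ == "__main__":
    # Hypothetical command line; on a live system read /proc/cmdline instead.
    example = "root=/dev/sda2 ro amdgpu.vm_update_mode=0"
    mode = vm_update_mode_from_cmdline(example)
    print(mode, "->", describe_vm_update_mode(mode))
```

Run against the real /proc/cmdline, a result of 0 would correspond to the SDMA path described above, and 3 to the CPU path requested for the next test.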
>
> amdgpu_cs:0-806 [012] .... 1787.493126: amdgpu_vm_bo_cs: soffs=00001001a0, eoffs=00001001b9, flags=70
> amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: soffs=0000100200, eoffs=00001021e0, flags=70
> amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: soffs=0000102200, eoffs=00001041e0, flags=70
> amdgpu_cs:0-806 [012] .... 1787.493129: amdgpu_vm_bo_cs: soffs=000010c1e0, eoffs=000010c2e1, flags=70
> amdgpu_cs:0-806 [012] .... 1787.493131: drm_sched_job: entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job count:8, hw job count:0
>
> And later in the file you can find:
> ------------------------------------------------------
> crash detected!
>
> executing umr -O halt_waves -wa
> No active waves!
>
> executing umr -O verbose -R gfx[.]
>
> polaris11.gfx.rptr == 512
> polaris11.gfx.wptr == 512
> polaris11.gfx.drv_wptr == 512
> polaris11.gfx.ring[ 481] == 0xffff1000 ...
> polaris11.gfx.ring[ 482] == 0xffff1000 ...
> polaris11.gfx.ring[ 483] == 0xffff1000 ...
> polaris11.gfx.ring[ 484] == 0xffff1000 ...
> polaris11.gfx.ring[ 485] == 0xffff1000 ...
> polaris11.gfx.ring[ 486] == 0xffff1000 ...
> polaris11.gfx.ring[ 487] == 0xffff1000 ...
> polaris11.gfx.ring[ 488] == 0xffff1000 ...
> polaris11.gfx.ring[ 489] == 0xffff1000 ...
> polaris11.gfx.ring[ 490] == 0xffff1000 ...
> polaris11.gfx.ring[ 491] == 0xffff1000 ...
> polaris11.gfx.ring[ 492] == 0xffff1000 ...
> polaris11.gfx.ring[ 493] == 0xffff1000 ...
> polaris11.gfx.ring[ 494] == 0xffff1000 ...
> polaris11.gfx.ring[ 495] == 0xffff1000 ...
> polaris11.gfx.ring[ 496] == 0xffff1000 ...
> polaris11.gfx.ring[ 497] == 0xffff1000 ...
> polaris11.gfx.ring[ 498] == 0xffff1000 ...
> polaris11.gfx.ring[ 499] == 0xffff1000 ...
> polaris11.gfx.ring[ 500] == 0xffff1000 ...
> polaris11.gfx.ring[ 501] == 0xffff1000 ...
> polaris11.gfx.ring[ 502] == 0xffff1000 ...
> polaris11.gfx.ring[ 503] == 0xffff1000 ...
> polaris11.gfx.ring[ 504] == 0xffff1000 ...
> polaris11.gfx.ring[ 505] == 0xffff1000 ...
> polaris11.gfx.ring[ 506] == 0xffff1000 ...
> polaris11.gfx.ring[ 507] == 0xffff1000 ...
> polaris11.gfx.ring[ 508] == 0xffff1000 ...
> polaris11.gfx.ring[ 509] == 0xffff1000 ...
> polaris11.gfx.ring[ 510] == 0xffff1000 ...
> polaris11.gfx.ring[ 511] == 0xffff1000 ...
> polaris11.gfx.ring[ 512] == 0xc0032200 rwD
>
>
> trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
> trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
>
> done after crash.
> -------------------------------------------
>
> So even without GPU reset, still no "waves". And the error message also does
> not state any VM fault address.
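
For pulling the two relevant facts out of a log like the one quoted above, a rough sketch follows. It assumes the dmesg timeout line and the umr ring dump have exactly the layout shown in the quote. The function names are hypothetical. Reading rptr == wptr as "the ring has consumed everything submitted" is an interpretation, not something umr itself reports.

```python
import re

# Matches the "ring <name> timeout, signaled seq=N, emitted seq=M" part of
# the amdgpu_job_timedout message as it appears in the quoted dmesg output.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+)\s+timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)
# Matches lines such as "polaris11.gfx.rptr == 512" from the umr ring dump.
POINTER_RE = re.compile(r"\.(?P<name>rptr|wptr|drv_wptr) == (?P<value>\d+)")


def parse_job_timeout(line):
    """Return (ring, signaled, emitted) from a job timeout line, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("signaled")), int(m.group("emitted"))


def parse_ring_pointers(text):
    """Collect the rptr/wptr/drv_wptr values from a umr ring dump."""
    return {m.group("name"): int(m.group("value")) for m in POINTER_RE.finditer(text)}


if __name__ == "__main__":
    dmesg_line = (
        "[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring sdma0 timeout, signaled seq=475104, emitted seq=475106"
    )
    ring, signaled, emitted = parse_job_timeout(dmesg_line)
    print("%s: %d fence(s) emitted but not signaled" % (ring, emitted - signaled))

    dump = (
        "polaris11.gfx.rptr == 512\n"
        "polaris11.gfx.wptr == 512\n"
        "polaris11.gfx.drv_wptr == 512\n"
    )
    ptrs = parse_ring_pointers(dump)
    print("gfx ring drained:", ptrs["rptr"] == ptrs["wptr"] == ptrs["drv_wptr"])
```

With the values from the quoted log, the timeout is on sdma0 while the gfx ring pointers all agree, which fits the SDMA hang rather than a stuck gfx ring.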
-- 
You are receiving this mail because:
You are the assignee for the bug.
--15349483820.99a31a.9759--
--===============1933027203==--