From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung!
Date: Wed, 22 Aug 2018 14:33:03 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=102322

Comment #60 on bug 102322 from Andrey Grodzovsky
(In reply to dwagner from comment #58)
> Here comes another trace log, with your info2.patch applied.
>
> Something must have changed since the last test, as it took pretty long this
> time to reproduce the crash. Could that have been caused by
> https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8
> now being part of the kernel?

I don't think it's related. That code is mostly related to virtualization.

>
> However, the latest trace you find attached below is not much different to
> the last one, xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you:
>
> [ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=475104, emitted seq=475106
> [ 1510.023117] [drm] GPU recovery disabled.

That just means you are again running with the GPU VM update mode set to use
SDMA, as seen in your dmesg (amdgpu.vm_update_mode=0), so you are again
experiencing the original SDMA hang issue. Please use
amdgpu.vm_update_mode=3 to get back to the VM_FAULT issue.
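
To make the difference between the two settings concrete, here is a minimal sketch that decodes a vm_update_mode value. It assumes the bit layout given in the amdgpu module parameter description (0 = never use the CPU, 1 = graphics only, 2 = compute only, 3 = both). The helper names are hypothetical and are not part of the driver or of any tool.

```python
import re
from typing import Optional

# Assumed bit layout, taken from the amdgpu module parameter description:
# 0 = never use the CPU, 1 = graphics only, 2 = compute only, 3 = both.
CPU_FOR_GFX = 1 << 0
CPU_FOR_COMPUTE = 1 << 1


def describe_vm_update_mode(mode: int) -> str:
    """Return a short description of who updates the VM page tables."""
    if mode == 0:
        return "SDMA updates page tables for graphics and compute"
    users = []
    if mode & CPU_FOR_GFX:
        users.append("graphics")
    if mode & CPU_FOR_COMPUTE:
        users.append("compute")
    return "CPU updates page tables for " + " and ".join(users)


def mode_from_cmdline(cmdline: str) -> Optional[int]:
    """Extract amdgpu.vm_update_mode from a kernel command line string.

    Returns None when the parameter is not present.
    """
    match = re.search(r"\bamdgpu\.vm_update_mode=(\d+)", cmdline)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical command lines, for illustration only.
    for cmdline in ("quiet amdgpu.vm_update_mode=0",
                    "quiet amdgpu.vm_update_mode=3"):
        mode = mode_from_cmdline(cmdline)
        print(mode, "->", describe_vm_update_mode(mode))
```

With mode 0 the SDMA ring is the one doing page table updates, which would match the sdma0 timeout quoted above. With mode 3 the CPU does them instead.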

>
>      amdgpu_cs:0-806   [012] ....  1787.493126: amdgpu_vm_bo_cs: soffs=00001001a0, eoffs=00001001b9, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000100200, eoffs=00001021e0, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000102200, eoffs=00001041e0, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493129: amdgpu_vm_bo_cs: soffs=000010c1e0, eoffs=000010c2e1, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493131: drm_sched_job: entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job count:8, hw job count:0
>
> And later in the file you can find:
> ------------------------------------------------------
> crash detected!
>
> executing umr -O halt_waves -wa
> No active waves!
>
> executing umr -O verbose -R gfx[.]
>
> polaris11.gfx.rptr == 512
> polaris11.gfx.wptr == 512
> polaris11.gfx.drv_wptr == 512
> polaris11.gfx.ring[ 481] == 0xffff1000    ...
> polaris11.gfx.ring[ 482] == 0xffff1000    ...
> polaris11.gfx.ring[ 483] == 0xffff1000    ...
> polaris11.gfx.ring[ 484] == 0xffff1000    ...
> polaris11.gfx.ring[ 485] == 0xffff1000    ...
> polaris11.gfx.ring[ 486] == 0xffff1000    ...
> polaris11.gfx.ring[ 487] == 0xffff1000    ...
> polaris11.gfx.ring[ 488] == 0xffff1000    ...
> polaris11.gfx.ring[ 489] == 0xffff1000    ...
> polaris11.gfx.ring[ 490] == 0xffff1000    ...
> polaris11.gfx.ring[ 491] == 0xffff1000    ...
> polaris11.gfx.ring[ 492] == 0xffff1000    ...
> polaris11.gfx.ring[ 493] == 0xffff1000    ...
> polaris11.gfx.ring[ 494] == 0xffff1000    ...
> polaris11.gfx.ring[ 495] == 0xffff1000    ...
> polaris11.gfx.ring[ 496] == 0xffff1000    ...
> polaris11.gfx.ring[ 497] == 0xffff1000    ...
> polaris11.gfx.ring[ 498] == 0xffff1000    ...
> polaris11.gfx.ring[ 499] == 0xffff1000    ...
> polaris11.gfx.ring[ 500] == 0xffff1000    ...
> polaris11.gfx.ring[ 501] == 0xffff1000    ...
> polaris11.gfx.ring[ 502] == 0xffff1000    ...
> polaris11.gfx.ring[ 503] == 0xffff1000    ...
> polaris11.gfx.ring[ 504] == 0xffff1000    ...
> polaris11.gfx.ring[ 505] == 0xffff1000    ...
> polaris11.gfx.ring[ 506] == 0xffff1000    ...
> polaris11.gfx.ring[ 507] == 0xffff1000    ...
> polaris11.gfx.ring[ 508] == 0xffff1000    ...
> polaris11.gfx.ring[ 509] == 0xffff1000    ...
> polaris11.gfx.ring[ 510] == 0xffff1000    ...
> polaris11.gfx.ring[ 511] == 0xffff1000    ...
> polaris11.gfx.ring[ 512] == 0xc0032200    rwD
>
>
> trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
> trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
>
> done after crash.
> -------------------------------------------
>
> So even without GPU reset, still no "waves". And the error message also does
> not state any VM fault address.
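
For reading ring dumps like the one quoted above, a small sketch such as the following can pull out the pointer values and check whether the read pointer has caught up with the write pointer. The function names are hypothetical, and the line format is assumed from the quoted umr output only. The "drained" interpretation is the usual ring-buffer convention, not something the tool itself reports.

```python
import re

# Matches lines such as "polaris11.gfx.rptr == 512" from the quoted dump.
POINTER_RE = re.compile(r"(\w+)\.(\w+)\.(rptr|wptr|drv_wptr)\s*==\s*(\d+)")


def ring_pointers(dump: str) -> dict:
    """Collect the rptr/wptr/drv_wptr values found in a ring dump."""
    return {m.group(3): int(m.group(4)) for m in POINTER_RE.finditer(dump)}


def ring_is_drained(pointers: dict) -> bool:
    """True when the read pointer has caught up with the write pointer."""
    return "rptr" in pointers and pointers["rptr"] == pointers.get("wptr")


if __name__ == "__main__":
    sample = """\
> polaris11.gfx.rptr == 512
> polaris11.gfx.wptr == 512
> polaris11.gfx.drv_wptr == 512
"""
    ptrs = ring_pointers(sample)
    print(ptrs, "drained:", ring_is_drained(ptrs))
```

In the dump above all three pointers are 512, so by this check the gfx ring has nothing left to fetch, which is consistent with umr reporting no active waves.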


You are receiving this mail because:
  • You are the assignee for the bug.