From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 100465] Hard lockup with radeonsi driver on FirePro W600, W9000 and W9100
Date: Thu, 06 Apr 2017 16:27:39 +0000
Message-ID:
References:
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============1554908520=="
Return-path:
Received: from culpepper.freedesktop.org (culpepper.freedesktop.org [IPv6:2610:10:20:722:a800:ff:fe98:4b55])
 by gabe.freedesktop.org (Postfix) with ESMTP id A96326E9D4
 for ; Thu, 6 Apr 2017 16:27:39 +0000 (UTC)
In-Reply-To:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

--===============1554908520==
Content-Type: multipart/alternative; boundary="14914960590.dBdc.18660"; charset="UTF-8"

--14914960590.dBdc.18660
Date: Thu, 6 Apr 2017 16:27:39 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated

https://bugs.freedesktop.org/show_bug.cgi?id=100465

--- Comment #9 from Julien Isorce ---
When using R600_DEBUG=check_vm on both Xorg and the GL app, I can get some
output in kern.log. It looks like a "ring 0 stalled" is detected, followed by
a GPU soft reset which succeeds ("GPU reset succeeded, trying to resume") but
then fails to resume because:

[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombios stuck executing C483 (len 254, WS 0, PS 4) @ 0xC4AD
[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombios stuck executing BC59 (len 74, WS 0, PS 8) @ 0xBC8E

Then there are two radeon_mc_wait_for_idle failures ("Wait for MC idle
timedout") from si_mc_program.

Finally, si_startup fails because si_cp_resume fails because r600_ring_test
fails with: "radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)"

But it seems to keep looping, trying to do a GPU soft reset, and at some point
it freezes. I still need to confirm this ending scenario, but these atombios
failures are worrying in the first place.

At the same time I get some "radeon_ttm_bo_destroy" warnings triggered by
"WARN_ON(!list_empty(&bo->va));" in the kernel radeon driver. So it seems to
leak some buffers.

I will attach the full log tomorrow; it is mixed up with my traces at the
moment, but the essentials are above, I hope.

So I have 4 questions:
 1: Can an application cause a "ring 0 stalled"? Or is it a driver bug
(kernel side, mesa/drm or xserver)?
 2: About these atombios failures, does it mean that it fails to load the GPU
microcode/firmware?
 3: Does it try to do a GPU soft reset because I added R600_DEBUG=check_vm? Or
does that just help to flush the traces on a VM fault (as mentioned in a commit
message related to that env var in mesa)?
 4: For the deallocation failure / leak above (radeon_ttm_bo_destroy warning),
does it mean the memory is lost until the next reboot, or does a GPU soft reset
allow these leaks to be recovered?

Thx!
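
One way to check the "keeps looping" scenario described above is to count how often each of the quoted kernel messages appears in the log. The following is only a rough sketch, not part of the original report: the function names (summarize, summarize_file) and the labels in PATTERNS are hypothetical, and the patterns are limited to the message fragments quoted in this comment.

```python
import re
from collections import Counter

# Message fragments quoted in the comment above. The dictionary keys are
# labels invented for this sketch; they are not kernel identifiers.
PATTERNS = {
    "ring_stalled": re.compile(r"ring 0 stalled"),
    "reset_succeeded": re.compile(r"GPU reset succeeded"),
    "atombios_stuck": re.compile(r"atombios stuck executing"),
    "mc_idle_timeout": re.compile(r"Wait for MC idle timedout"),
    "ring_test_failed": re.compile(r"ring 0 test failed"),
    "bo_destroy_warn": re.compile(r"radeon_ttm_bo_destroy"),
}


def summarize(lines):
    """Count how many lines contain each known message fragment."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def summarize_file(path):
    """Run summarize() over a log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as fh:
        return summarize(fh)
```

Calling summarize_file() on the kernel log (its location, for example /var/log/kern.log, is an assumption that depends on the distribution) and seeing "reset_succeeded" and "ring_test_failed" both well above 1 would be consistent with repeated reset attempts, though it does not by itself show where the final freeze happens.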
-- 
You are receiving this mail because:
You are the assignee for the bug.

--14914960590.dBdc.18660
Date: Thu, 6 Apr 2017 16:27:39 +0000
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated

--14914960590.dBdc.18660--

--===============1554908520==--