From mboxrd@z Thu Jan  1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 100465] Hard lockup with radeonsi driver on FirePro W600, W9000
 and W9100
Date: Thu, 06 Apr 2017 16:27:39 +0000
Message-ID: <bug-100465-502-OKVaiog0pr@http.bugs.freedesktop.org/>
References: <bug-100465-502@http.bugs.freedesktop.org/>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============1554908520=="
Return-path: <dri-devel-bounces@lists.freedesktop.org>
Received: from culpepper.freedesktop.org (culpepper.freedesktop.org
 [IPv6:2610:10:20:722:a800:ff:fe98:4b55])
 by gabe.freedesktop.org (Postfix) with ESMTP id A96326E9D4
 for <dri-devel@lists.freedesktop.org>; Thu,  6 Apr 2017 16:27:39 +0000 (UTC)
In-Reply-To: <bug-100465-502@http.bugs.freedesktop.org/>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org


--===============1554908520==
Content-Type: multipart/alternative; boundary="14914960590.dBdc.18660";
 charset="UTF-8"


--14914960590.dBdc.18660
Date: Thu, 6 Apr 2017 16:27:39 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated

https://bugs.freedesktop.org/show_bug.cgi?id=3D100465

--- Comment #9 from Julien Isorce <julien.isorce@gmail.com> ---
When using R600_DEBUG=3Dcheck_vm on both Xorg and the gl app I can get some
output in kern.log. It looks like a "ring 0 stalled" is detected, followed
by a gpu softreset which succeeds ("GPU reset succeeded, trying to resume")
but fails to resume because:

[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombi=
os
stuck executing C483 (len 254, WS 0, PS 4) @ 0xC4AD
[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombi=
os
stuck executing BC59 (len 74, WS 0, PS 8) @ 0xBC8E

Then there are two radeon_mc_wait_for_idle failures ("Wait for MC idle
timedout") from si_mc_program.

Finally si_startup fails because si_cp_resume fails because r600_ring_test
fails with: "radeon: ring 0 test failed (scratch(0x850C)=3D0xCAFEDEAD)"

But it seems to keep looping, trying to do a gpu softreset, and at some
point it freezes. I still need to confirm this ending scenario, but these
atombios failures are worrying in the first place.

At the same time I get some "radeon_ttm_bo_destroy" warnings triggered by
"WARN_ON(!list_empty(&bo->va));" in the kernel radeon driver. So it seems
to leak some buffers.

I will attach the full log tomorrow; it is mixed up with my traces atm, but
the essentials are above, I hope.

So I have 4 questions:
 1: Can an application cause a "ring 0 stalled"? Or is it a driver bug
(kernel side or mesa/drm or xserver)?
 2: About these atombios failures, does it mean that it fails to load the
gpu microcode/firmware?
 3: Does it try to do a gpu softreset because I added
R600_DEBUG=3Dcheck_vm? Or does this one just help to flush the traces on a
vm fault (as mentioned in a commit msg related to that env var in mesa)?
 4: For the deallocation failure / leak above (radeon_ttm_bo_destroy
warning), does it mean the memory is lost until the next reboot, or does a
gpu soft reset allow these leaks to be recovered?

Thx !

--=20
You are receiving this mail because:
You are the assignee for the bug.=

--14914960590.dBdc.18660
Date: Thu, 6 Apr 2017 16:27:39 +0000
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated

<html>
    <head>
      <base href=3D"https://bugs.freedesktop.org/">
    </head>
    <body>
      <p>
        <div>
            <b><a class=3D"bz_bug_link=20
          bz_status_NEW "
   title=3D"NEW - Hard lockup with radeonsi driver on FirePro W600, W9000 a=
nd W9100"
   href=3D"https://bugs.freedesktop.org/show_bug.cgi?id=3D100465#c9">Commen=
t # 9</a>
              on <a class=3D"bz_bug_link=20
          bz_status_NEW "
   title=3D"NEW - Hard lockup with radeonsi driver on FirePro W600, W9000 a=
nd W9100"
   href=3D"https://bugs.freedesktop.org/show_bug.cgi?id=3D100465">bug 10046=
5</a>
              from <span class=3D"vcard"><a class=3D"email" href=3D"mailto:=
julien.isorce&#64;gmail.com" title=3D"Julien Isorce &lt;julien.isorce&#64;g=
mail.com&gt;"> <span class=3D"fn">Julien Isorce</span></a>
</span></b>
        <pre>When using R600_DEBUG=3Dcheck_vm on both Xorg and the gl app I=
 can get some
output in kern.log. It looks like a &quot;ring 0 stalled&quot; is detected,=
 followed
by a gpu softreset which succeeds (&quot;GPU reset succeeded, trying to=
 resume&quot;)
but fails to resume because:

[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombi=
os
stuck executing C483 (len 254, WS 0, PS 4) &#64; 0xC4AD
[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombi=
os
stuck executing BC59 (len 74, WS 0, PS 8) &#64; 0xBC8E

Then there are two radeon_mc_wait_for_idle failures (&quot;Wait for MC=
 idle
timedout&quot;) from si_mc_program.

Finally si_startup fails because si_cp_resume fails because r600_ring_test
fails with: &quot;radeon: ring 0 test failed (scratch(0x850C)=3D0xCAFEDEAD)=
&quot;

But it seems to keep looping, trying to do a gpu softreset, and at some
point it freezes. I still need to confirm this ending scenario, but these
atombios failures are worrying in the first place.

At the same time I get some &quot;radeon_ttm_bo_destroy&quot; warnings=
 triggered by
&quot;WARN_ON(!list_empty(&amp;bo-&gt;va));&quot; in the kernel radeon=
 driver. So it seems
to leak some buffers.

I will attach the full log tomorrow; it is mixed up with my traces atm, but
the essentials are above, I hope.

So I have 4 questions:
 1: Can an application cause a &quot;ring 0 stalled&quot;? Or is it a=
 driver bug
(kernel side or mesa/drm or xserver)?
 2: About these atombios failures, does it mean that it fails to load the
gpu microcode/firmware?
 3: Does it try to do a gpu softreset because I added
R600_DEBUG=3Dcheck_vm? Or does this one just help to flush the traces on a
vm fault (as mentioned in a commit msg related to that env var in mesa)?
 4: For the deallocation failure / leak above (radeon_ttm_bo_destroy
warning), does it mean the memory is lost until the next reboot, or does a
gpu soft reset allow these leaks to be recovered?

Thx !</pre>
        </div>
      </p>


      <hr>
      <span>You are receiving this mail because:</span>

      <ul>
          <li>You are the assignee for the bug.</li>
      </ul>
    </body>
</html>=

--14914960590.dBdc.18660--

--===============1554908520==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: inline

X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KZHJpLWRldmVs
IG1haWxpbmcgbGlzdApkcmktZGV2ZWxAbGlzdHMuZnJlZWRlc2t0b3Aub3JnCmh0dHBzOi8vbGlz
dHMuZnJlZWRlc2t0b3Aub3JnL21haWxtYW4vbGlzdGluZm8vZHJpLWRldmVsCg==

--===============1554908520==--