From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 110509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Date: Tue, 30 Apr 2019 01:26:49 +0000
List-Id: dri-devel@lists.freedesktop.org

https://bugs.freedesktop.org/show_bug.cgi?id=110509

Comment #9 on bug 110509 from Alex Deucher
(In reply to James.Dutton from comment #8)
> Thank you for the feedback.
> Is there a data sheet somewhere that might help me work out a fix for this.
> What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.

If the gpu scheduler for a queue on a particular engine times out, you can be
pretty sure the engine has hung.  At that point you can check the current busy
status for the block (IP is_idle() callback).
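
As a rough illustration of the check described above, here is a minimal
userspace sketch in C. It is not the amdgpu code: struct ip_block,
find_busy_blocks() and the stub callbacks are hypothetical stand-ins for the
driver's per-IP callback table. Only the idea of calling each block's
is_idle() callback after a scheduler timeout comes from this reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of an IP block; not the real amdgpu types. */
struct ip_block {
    const char *name;
    bool (*is_idle)(void *handle);  /* returns true when the block is not busy */
    void *handle;                   /* opaque per-block state passed to callbacks */
};

/* Stub callbacks standing in for hardware busy-status reads (assumptions). */
bool stub_idle(void *handle) { (void)handle; return true; }
bool stub_busy(void *handle) { (void)handle; return false; }

/*
 * After a scheduler timeout, walk every block and record the ones whose
 * is_idle() callback reports busy. Stores at most `max` indices in
 * busy_idx and returns the total number of busy blocks found.
 */
size_t find_busy_blocks(const struct ip_block *blocks, size_t n,
                        size_t *busy_idx, size_t max)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* Blocks without an is_idle callback are skipped, not treated as hung. */
        if (blocks[i].is_idle && !blocks[i].is_idle(blocks[i].handle)) {
            if (count < max)
                busy_idx[count] = i;
            count++;
        }
    }
    return count;
}
```

A caller would fill an array of struct ip_block, run find_busy_blocks() once
the timeout fires, and treat the returned indices as candidates for reset.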

> 2) A way to intentionally halt an engine and tidy up. So that the modprobe,
> rmmod, modprobe scenario works.

hw_fini() IP callback.

> 3) data sheet details regarding how to un-hang each engine.
> Specifically, in this case the IP block <dm>.

Each IP has a soft reset (implemented via the IP soft_reset() callback), but
depending on the hang, in some cases, you may have to do a full GPU reset to
recover.  This is not a hw hang, it's a sw deadlock.
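
The escalation described above (try a per-IP soft reset first, fall back to
a full GPU reset if the block is still busy) can be sketched as a toy model
in C. Everything here is hypothetical: struct ip_block_model, the model_*()
helpers and the enum are invented for illustration and do not reproduce the
driver's actual recovery sequence, which lives in amdgpu_device_gpu_recover().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one IP block; fields simulate hardware behaviour. */
struct ip_block_model {
    const char *name;
    bool hung;               /* simulated hang state */
    bool soft_reset_clears;  /* whether a soft reset is enough for this hang */
};

enum recover_result {
    RECOVER_NOT_NEEDED,
    RECOVER_SOFT_RESET,
    RECOVER_FULL_RESET
};

/* Stand-in for the is_idle() callback. */
bool model_is_idle(const struct ip_block_model *b) { return !b->hung; }

/* Stand-in for the soft_reset() callback: may or may not clear the hang. */
void model_soft_reset(struct ip_block_model *b)
{
    if (b->soft_reset_clears)
        b->hung = false;
}

/* Stand-in for a SOC-level full reset: in this model it clears every block. */
void model_full_reset(struct ip_block_model *blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        blocks[i].hung = false;
}

/* Soft-reset each busy block, then escalate if any block is still busy. */
enum recover_result model_recover(struct ip_block_model *blocks, size_t n)
{
    bool any_hung = false;

    for (size_t i = 0; i < n; i++) {
        if (!model_is_idle(&blocks[i])) {
            any_hung = true;
            model_soft_reset(&blocks[i]);
        }
    }
    if (!any_hung)
        return RECOVER_NOT_NEEDED;

    for (size_t i = 0; i < n; i++) {
        if (!model_is_idle(&blocks[i])) {
            model_full_reset(blocks, n);
            return RECOVER_FULL_RESET;
        }
    }
    return RECOVER_SOFT_RESET;
}
```

Note that neither reset helps with a software deadlock like the one in this
bug; the model only covers the hardware-hang case.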

>
> Maybe that is not possible, and (I think you are hinting at it), one cannot
> reset an individual IP block. So the approach is to suspend the card, and
> then do a full reset of the entire card, then resume.

All asics support full GPU reset which is implemented via the SOC level
amdgpu_asic_funcs reset() callback.

>
> I think a different suspend process would be better.
> We have a for_each within the suspend code. The output of that code should
> not be a single error code, but instead an array indicating the current
> state of each engine (running/hung), the intended state and status of
> whether the intention worked or failed. If the loop through the for_each, it
> could compare the current state and intended state, and attempt to reach the
> intended state, and report an error code for each engine. Then the code to
> achieve the transition can be different depending on the current ->
> intended transition.
> i.e. code for running -> suspended, can be different than code for hung ->
> suspended. The code already needs to know which engines are enabled/disabled
> (Vega 56 vs Vega 64)

We don't really care if the suspend fails or not.  See
amdgpu_device_gpu_recover() for the full sequence.

>
> I can hang this IP block <dm> at will. I have 2 games that hang it within
> seconds of starting.

There was a deadlock in the dm code which has been fixed.  Please try a new
code base.  e.g.,
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip

