From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 110509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Date: Tue, 30 Apr 2019 01:26:49 +0000
Message-ID:
References:
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0448776659=="
Return-path:
Received: from culpepper.freedesktop.org (culpepper.freedesktop.org
[131.252.210.165])
by gabe.freedesktop.org (Postfix) with ESMTP id 7249289361
for ; Tue, 30 Apr 2019 01:26:49 +0000 (UTC)
In-Reply-To:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org
--===============0448776659==
Content-Type: multipart/alternative; boundary="15565876091.65c877.4165"
Content-Transfer-Encoding: 7bit
--15565876091.65c877.4165
Date: Tue, 30 Apr 2019 01:26:49 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://bugs.freedesktop.org/
Auto-Submitted: auto-generated
https://bugs.freedesktop.org/show_bug.cgi?id=110509
--- Comment #9 from Alex Deucher ---
(In reply to James.Dutton from comment #8)
> Thank you for the feedback.
> Is there a data sheet somewhere that might help me work out a fix for this.
> What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.
If the GPU scheduler for a queue on a particular engine times out, you can be
pretty sure the engine has hung. At that point you can check the current busy
status for the block (IP is_idle() callback).
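
As a rough illustration of that check, the sketch below walks a table of
blocks and records which ones still report busy. It is a standalone model,
not driver code: struct ip_block, flag_is_idle() and find_busy_blocks() are
hypothetical names, and the real is_idle() callback in amdgpu may have a
different signature.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-IP-block entry; not the real amdgpu structs. */
struct ip_block {
    const char *name;
    bool (*is_idle)(void *handle); /* true when the block reports no pending work */
    void *handle;
};

/* Hypothetical test callback: the handle points at a bool holding the idle state. */
bool flag_is_idle(void *handle)
{
    return *(bool *)handle;
}

/*
 * Record the indices of blocks that still report busy.
 * Returns the number of indices written to busy[] (at most busy_cap).
 */
size_t find_busy_blocks(const struct ip_block *blocks, size_t n,
                        size_t *busy, size_t busy_cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!blocks[i].is_idle)
            continue; /* block provides no idle check, so skip it */
        if (!blocks[i].is_idle(blocks[i].handle) && count < busy_cap)
            busy[count++] = i;
    }
    return count;
}
```

In this model a scheduler timeout would be the trigger for calling
find_busy_blocks(); the busy list only narrows down where to look.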
> 2) A way to intentionally halt an engine and tidy up. So that the modprobe,
> rmmod, modprobe scenario works.
hw_fini() IP callback.
> 3) data sheet details regarding how to un-hang each engine.
> Specifically, in this case the IP block <dm>.
Each IP has a soft reset (implemented via the IP soft_reset() callback), but
depending on the hang, in some cases, you may have to do a full GPU reset to
recover. This is not a hw hang; it's a sw deadlock.
>
> Maybe that is not possible, and (I think you are hinting at it), one cannot
> reset an individual IP block. So the approach is to suspend the card, and
> then do a full reset of the entire card, then resume.
All ASICs support full GPU reset, which is implemented via the SOC-level
amdgpu_asic_funcs reset() callback.
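
The escalation described here (try the per-IP soft reset, fall back to a full
GPU reset if the block is still busy) could be modelled roughly as below. This
is a simplified standalone sketch, not the driver's actual recovery path:
struct ip_block_ops, struct fake_block, recover_block() and the fake_*
helpers are all hypothetical, and the callback signatures are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

enum recover_result { RECOVERED_SOFT, RECOVERED_FULL, RECOVER_FAILED };

/* Hypothetical per-IP callback table; the real amdgpu callbacks may differ. */
struct ip_block_ops {
    bool (*is_idle)(void *handle);
    int (*soft_reset)(void *handle); /* returns 0 on success */
};

/* Hypothetical block state used only to exercise the sketch. */
struct fake_block {
    bool idle;
    bool soft_reset_works;
    int soft_resets;
};

bool fake_is_idle(void *handle)
{
    return ((struct fake_block *)handle)->idle;
}

int fake_soft_reset(void *handle)
{
    struct fake_block *b = handle;

    b->soft_resets++;
    if (b->soft_reset_works)
        b->idle = true;
    return 0;
}

/* Hypothetical full reset: dev points at a counter of resets performed. */
int fake_full_reset(void *dev)
{
    (*(int *)dev)++;
    return 0;
}

/* Try the block's soft reset first; fall back to a full reset if it stays busy. */
enum recover_result recover_block(const struct ip_block_ops *ops, void *handle,
                                  int (*full_reset)(void *dev), void *dev)
{
    if (ops->soft_reset && ops->is_idle &&
        ops->soft_reset(handle) == 0 && ops->is_idle(handle))
        return RECOVERED_SOFT;

    if (full_reset && full_reset(dev) == 0)
        return RECOVERED_FULL;

    return RECOVER_FAILED;
}
```

The point of the model is only the ordering: the cheaper per-block reset is
attempted first, and the full reset remains available as the fallback.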
>
> I think a different suspend process would be better.
> We have a for_each within the suspend code. The output of that code should
> not be a single error code, but instead an array indicating the current
> state of each engine (running/hung), the intended state and status of
> whether the intention worked or failed. If the loop through the for_each, it
> could compare the current state and intended state, and attempt to reach the
> intended state, and report an error code for each engine. Then the code to
> achieve the transition can been different depending on the current ->
> intended transition.
> i.e. code for running -> suspended, can be different than code for hung ->
> suspended. The code already needs to know which engines are enabled/disabled
> (Vega 56 vs Vega 64)
We don't really care if the suspend fails or not. See
amdgpu_device_gpu_recover() for the full sequence.
>
> I can hang this IP block <dm> at will. I have 2 games that hang it within
> seconds of starting.
There was a deadlock in the dm code, which has been fixed. Please try a new
code base, e.g.,
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip
-- 
You are receiving this mail because:
You are the assignee for the bug.
--15565876091.65c877.4165--
--===============0448776659==--