From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 110509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Date: Tue, 30 Apr 2019 01:26:49 +0000
Message-ID: <bug-110509-502-d9ZlarR5ud@http.bugs.freedesktop.org/>
In-Reply-To: <bug-110509-502@http.bugs.freedesktop.org/>


https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #9 from Alex Deucher <alexdeucher@gmail.com> ---
(In reply to James.Dutton from comment #8)
> Thank you for the feedback.
> Is there a data sheet somewhere that might help me work out a fix for this.
> What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.

If the GPU scheduler for a queue on a particular engine times out, you can be
pretty sure the engine has hung.  At that point you can check the current busy
status for the block (the IP is_idle() callback).
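
An illustrative sketch of that check, not taken from the driver: a small
userspace model in which every block exposes an is_idle() callback and, after
a timeout, the blocks that still report busy are collected.  The names
mock_ip_block, mock_is_idle and mock_find_busy_blocks are hypothetical; the
real driver structures and signatures differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; NOT the real amdgpu structures. */
struct mock_ip_block {
    const char *name;
    bool busy; /* stand-in for the hardware busy status */
    bool (*is_idle)(const struct mock_ip_block *blk);
};

/* Model of an is_idle() callback: idle means "not busy". */
static bool mock_is_idle(const struct mock_ip_block *blk)
{
    return !blk->busy;
}

/*
 * After a scheduler timeout, record the index of every block that still
 * reports busy. busy_idx must have room for count entries.
 * Returns the number of busy blocks found.
 */
static size_t mock_find_busy_blocks(const struct mock_ip_block *blocks,
                                    size_t count, size_t *busy_idx)
{
    size_t n = 0;

    assert(blocks != NULL && busy_idx != NULL);
    for (size_t i = 0; i < count; i++) {
        if (!blocks[i].is_idle(&blocks[i]))
            busy_idx[n++] = i;
    }
    return n;
}
```

In this model a caller would run mock_find_busy_blocks() over its block array
once a timeout has been reported and treat the returned indices as the
candidates for a reset.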

> 2) A way to intentionally halt an engine and tidy up. So that the modprobe,
> rmmod, modprobe scenario works. 

Use the IP hw_fini() callback.

> 3) data sheet details regarding how to un-hang each engine.
> Specifically, in this case the IP block <dm>.

Each IP has a soft reset (implemented via the IP soft_reset() callback), but
depending on the hang, in some cases, you may have to do a full GPU reset to
recover.  In this bug, though, it is not a hw hang; it's a sw deadlock.

> 
> Maybe that is not possible, and (I think you are hinting at it), one cannot
> reset an individual IP block. So the approach is to suspend the card, and
> then do a full reset of the entire card, then resume.

All ASICs support full GPU reset, which is implemented via the SoC-level
amdgpu_asic_funcs reset() callback.
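
An illustrative sketch of how the two reset levels relate, not taken from the
driver.  It assumes a policy of trying the per-IP soft reset first and falling
back to the full reset only if the block is still busy; that ordering is an
assumption made for the example, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; NOT the real amdgpu structures. */
enum model_recovery {
    MODEL_NO_RESET_NEEDED,
    MODEL_SOFT_RESET_OK,
    MODEL_FULL_RESET_OK,
    MODEL_RECOVERY_FAILED,
};

struct model_ip {
    bool hung;
    bool soft_reset_works; /* does this kind of hang clear with a soft reset? */
};

struct model_gpu {
    struct model_ip *ips;
    size_t num_ips;
    bool full_reset_works;
};

static bool model_ip_is_idle(const struct model_ip *ip)
{
    return !ip->hung;
}

/* Per-IP soft reset: clears the hang only for hangs it can handle. */
static void model_ip_soft_reset(struct model_ip *ip)
{
    if (ip->soft_reset_works)
        ip->hung = false;
}

/* Full (SoC-level) reset: clears every block, or fails as a whole. */
static int model_asic_reset(struct model_gpu *gpu)
{
    if (!gpu->full_reset_works)
        return -1;
    for (size_t i = 0; i < gpu->num_ips; i++)
        gpu->ips[i].hung = false;
    return 0;
}

/* Assumed policy: soft reset first, full reset as the fallback. */
static enum model_recovery model_recover(struct model_gpu *gpu, size_t idx)
{
    struct model_ip *ip;

    assert(gpu != NULL && idx < gpu->num_ips);
    ip = &gpu->ips[idx];

    if (model_ip_is_idle(ip))
        return MODEL_NO_RESET_NEEDED;

    model_ip_soft_reset(ip);
    if (model_ip_is_idle(ip))
        return MODEL_SOFT_RESET_OK;

    if (model_asic_reset(gpu) == 0)
        return MODEL_FULL_RESET_OK;

    return MODEL_RECOVERY_FAILED;
}
```

Neither level helps with a software deadlock: in this model that would be a
block whose hung flag is set again right after the reset returns.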

> 
> I think a different suspend process would be better.
> We have a for_each within the suspend code. The output of that code should
> not be a single error code, but instead an array indicating the current
> state of each engine (running/hung), the intended state and status of
> whether the intention worked or failed. As it loops through the for_each, it
> could compare the current state and intended state, and attempt to reach the
> intended state, and report an error code for each engine. Then the code to
> achieve the transition can be different depending on the current ->
> intended transition.
> i.e. code for running -> suspended, can be different than code for hung ->
> suspended. The code already needs to know which engines are enabled/disabled
> (Vega 56 vs Vega 64)

We don't really care whether the suspend fails or not.  See
amdgpu_device_gpu_recover() for the full sequence.

> 
> I can hang this IP block <dm> at will. I have 2 games that hang it within
> seconds of starting.

There was a deadlock in the dm code which has been fixed.  Please try a new
code base.  e.g.,
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip
