From: bugzilla-daemon@freedesktop.org
Subject: [Bug 110509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Date: Wed, 24 Apr 2019 17:26:39 +0000
Bug ID: 110509
Summary: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Product: Mesa
Version: git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: James.Dutton@gmail.com
QA Contact: dri-devel@lists.freedesktop.org
AMD Vega 56 fails to reset:
[ 188.771043] Evicting PASID 32782 queues
[ 188.782094] Restoring PASID 32782 queues
[ 214.563362] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=19285, emitted seq=19287
[ 214.563432] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ACOdyssey.exe pid 3761 thread ACOdyssey.exe pid 3761
[ 214.563439] amdgpu 0000:43:00.0: GPU reset begin!
[ 214.563445] Evicting PASID 32782 queues
[ 224.793032] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out
How do I go about diagnosing this problem?
Created attachment 144084 [details]
./umr -O bits -r *.*.mmGRBM_STATUS
Output while GPU failed to reset.
Created attachment 144085 [details]
/usr/src/umr/build/src/app/umr -wa
Output of the wave.
Created attachment 144086 [details]
dmesg
dmesg during reset.
What | Removed | Added |
---|---|---|
Attachment #144086 is obsolete | | 1 |
Created attachment 144087 [details]
dmesg
dmesg
This is a result of trying to play games in wine and dxvk. It used to work, but the latest mesa git fails. Games that fail are:
Assassin's Creed Odyssey
Devil May Cry 5
Both these games get through the title sequences, but fail when you reach the actual game play. The GPU hangs and tries to reset, but fails to reset.

So, there are two problems:
1) Why does it hang in the first place?
2) Why does it fail to recover and reset itself?

I can ssh into the PC.
poweroff <- Attempts to power off but never actually reaches the off state.
echo b > /proc/sysrq-trigger <- reboots the box, and everything is then ok again, so long as one does not try to play a game.
I think I have found the problem.

[ 657.526313] amdgpu 0000:43:00.0: GPU reset begin!
[ 657.526318] Evicting PASID 32782 queues
[ 667.756000] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out

The intention is to do a GPU reset, but the implementation in the code just tries to do a suspend. Part of the suspend does this:

Apr 29 14:29:19 thread kernel: [ 363.445607] INFO: task kworker/u258:0:55 blocked for more than 120 seconds.
Apr 29 14:29:19 thread kernel: [ 363.445612] Not tainted 5.0.10-dirty #26
Apr 29 14:29:19 thread kernel: [ 363.445613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 14:29:19 thread kernel: [ 363.445615] kworker/u258:0  D    0    55      2 0x80000000
Apr 29 14:29:19 thread kernel: [ 363.445628] Workqueue: events_unbound commit_work [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445629] Call Trace:
Apr 29 14:29:19 thread kernel: [ 363.445635]  __schedule+0x2c0/0x880
Apr 29 14:29:19 thread kernel: [ 363.445637]  schedule+0x2c/0x70
Apr 29 14:29:19 thread kernel: [ 363.445639]  schedule_timeout+0x1db/0x360
Apr 29 14:29:19 thread kernel: [ 363.445641]  ? update_load_avg+0x8b/0x590
Apr 29 14:29:19 thread kernel: [ 363.445645]  dma_fence_default_wait+0x1eb/0x270
Apr 29 14:29:19 thread kernel: [ 363.445647]  ? dma_fence_release+0xa0/0xa0
Apr 29 14:29:19 thread kernel: [ 363.445649]  dma_fence_wait_timeout+0xfd/0x110
Apr 29 14:29:19 thread kernel: [ 363.445651]  reservation_object_wait_timeout_rcu+0x17d/0x370
Apr 29 14:29:19 thread kernel: [ 363.445710]  amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445767]  amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445820]  ? amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445828]  commit_tail+0x42/0x70 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445835]  commit_work+0x12/0x20 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445838]  process_one_work+0x1fd/0x400
Apr 29 14:29:19 thread kernel: [ 363.445840]  worker_thread+0x34/0x410
Apr 29 14:29:19 thread kernel: [ 363.445841]  kthread+0x121/0x140
Apr 29 14:29:19 thread kernel: [ 363.445843]  ? process_one_work+0x400/0x400
Apr 29 14:29:19 thread kernel: [ 363.445844]  ? kthread_park+0x90/0x90
Apr 29 14:29:19 thread kernel: [ 363.445847]  ret_from_fork+0x22/0x40

So, amdgpu_dm_do_flip() is the bit that hangs. If the GPU needs to be reset because some of it has hung, trying a "flip" is unlikely to work. It is failing/hanging when doing "suspend of IP block <dm>" in amdgpu_device_ip_suspend_phase1().

I would suggest creating code that actually tries to reset the GPU, instead of trying to suspend it while the GPU is hung.
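[Editor's note] The trace above blocks in a dma_fence wait that never returns while the GPU is hung. As a hedged sketch of the alternative being argued for (the helper name and the 5-second budget are assumptions, not the actual amdgpu code), a bounded wait would at least let the commit worker bail out:

/*
 * Hedged sketch, not the actual amdgpu fix: wait on the flip fence
 * with a finite timeout so a hung engine cannot block the commit
 * worker forever.  wait_for_flip_fence() is a hypothetical helper.
 */
#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

static int wait_for_flip_fence(struct dma_fence *fence)
{
	/* Interruptible wait, capped at 5 seconds. */
	long r = dma_fence_wait_timeout(fence, true,
					msecs_to_jiffies(5000));

	if (r == 0)
		return -ETIMEDOUT; /* fence never signaled: GPU likely hung */
	if (r < 0)
		return r;          /* interrupted by a signal */
	return 0;                  /* fence signaled in time */
}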
(In reply to James.Dutton from comment #6)
> I would suggest creating code that actually tries to reset the GPU, instead
> of trying to suspend it while GPU is hung.

That is part of the GPU reset sequence. We need to attempt to stop the engines before resetting the GPU. That is what the suspend code does. Not all of the engines are necessarily hung so you need to stop and drain them properly.
Thank you for the feedback.
Is there a data sheet somewhere that might help me work out a fix for this?
What I would like is:
1) A way to scan all the engines and detect which ones have hung.
2) A way to intentionally halt an engine and tidy up, so that the modprobe, rmmod, modprobe scenario works.
3) Data sheet details regarding how to un-hang each engine. Specifically, in this case, the IP block <dm>.

Maybe that is not possible, and (I think you are hinting at it) one cannot reset an individual IP block. So the approach is to suspend the card, then do a full reset of the entire card, then resume.

I think a different suspend process would be better. We have a for_each within the suspend code. The output of that code should not be a single error code, but instead an array indicating the current state of each engine (running/hung), the intended state, and the status of whether the intention worked or failed (a sketch of such a table follows this comment). In the loop through the for_each, it could compare the current state and intended state, attempt to reach the intended state, and report an error code for each engine. Then the code to achieve the transition can be different depending on the current -> intended transition, i.e. code for running -> suspended can be different than code for hung -> suspended. The code already needs to know which engines are enabled/disabled (Vega 56 vs Vega 64).

I can hang this IP block <dm> at will. I have 2 games that hang it within seconds of starting.
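[Editor's note] As an illustration of that proposal (all names here are hypothetical; nothing like this exists in the driver), the per-engine bookkeeping could look like:

/*
 * Hypothetical sketch of the per-engine state table proposed above.
 * None of these names exist in amdgpu; this only illustrates the idea.
 */
enum ip_state {
	IP_RUNNING,
	IP_HUNG,
	IP_SUSPENDED,
};

struct ip_transition {
	const char   *name;      /* e.g. "gfx", "sdma0", "dm" */
	enum ip_state current;   /* state observed before the attempt */
	enum ip_state intended;  /* state we want to reach */
	int           result;    /* 0 on success, -errno on failure */
};

/* The suspend loop would fill one entry per IP block and pick a
 * transition handler based on (current, intended), e.g. a gentler
 * path for RUNNING -> SUSPENDED than for HUNG -> SUSPENDED. */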
(In reply to James.Dutton from comment #8)
> Thank you for the feedback.
> Is there a data sheet somewhere that might help me work out a fix for this?
> What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.

If the gpu scheduler for a queue on a particular engine times out, you can be pretty sure the engine has hung. At that point you can check the current busy status for the block (IP is_idle() callback).

> 2) A way to intentionally halt an engine and tidy up. So that the modprobe,
> rmmod, modprobe scenario works.

hw_fini() IP callback.

> 3) data sheet details regarding how to un-hang each engine.
> Specifically, in this case the IP block <dm>.

Each IP has a soft reset (implemented via the IP soft_reset() callback), but depending on the hang, in some cases, you may have to do a full GPU reset to recover. This is not a hw hang, it's a sw deadlock.

> Maybe that is not possible, and (I think you are hinting at it), one cannot
> reset an individual IP block. So the approach is to suspend the card, and
> then do a full reset of the entire card, then resume.

All asics support full GPU reset which is implemented via the SOC level amdgpu_asic_funcs reset() callback.

> I think a different suspend process would be better.
> We have a for_each within the suspend code. The output of that code should
> not be a single error code, but instead an array indicating the current
> state of each engine (running/hung), the intended state and status of
> whether the intention worked or failed. In the loop through the for_each, it
> could compare the current state and intended state, and attempt to reach the
> intended state, and report an error code for each engine. Then the code to
> achieve the transition can be different depending on the current ->
> intended transition.
> i.e. code for running -> suspended, can be different than code for hung ->
> suspended. The code already needs to know which engines are enabled/disabled
> (Vega 56 vs Vega 64)

We don't really care if the suspend fails or not. See amdgpu_device_gpu_recover() for the full sequence.

> I can hang this IP block <dm> at will. I have 2 games that hang it within
> seconds of starting.

There was a deadlock in the dm code which has been fixed. Please try a newer code base, e.g.:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip
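[Editor's note] To make those callbacks concrete, here is a minimal sketch of the "scan all engines" step, loosely modeled on struct amd_ip_funcs from kernels of that era. The loop itself is an illustration, not code from amdgpu_device_gpu_recover():

/*
 * Hedged sketch: survey the IP blocks with the is_idle() callback
 * mentioned above.  Illustrative only; not actual amdgpu code.
 */
static void survey_ip_blocks(struct amdgpu_device *adev)
{
	int i;

	for (i = 0; i < adev->num_ip_blocks; i++) {
		const struct amd_ip_funcs *funcs =
			adev->ip_blocks[i].version->funcs;

		if (!adev->ip_blocks[i].status.valid || !funcs->is_idle)
			continue;

		/* is_idle() reports whether the block's hardware is busy;
		 * after a scheduler timeout, a busy block is likely hung. */
		if (!funcs->is_idle((void *)adev))
			dev_info(adev->dev, "IP block <%s> still busy\n",
				 funcs->name);
	}
}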
Created attachment 144118 [details]
dmesg with drm-next-5.2-wip
I tried with drm-next-5.2-wip. It does not hang any more, but I have a new error now. It is better, in the sense that I can now reboot the system normally, and not resort to echo b > /proc/sysrq-trigger.

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

After the GPU reset, the screen is corrupted. I can do, via ssh, service gdm stop, service gdm start, and I then get a working login screen. (Mouse moves, I can type in the password.) I cannot actually log in because X fails. The desktop fails to appear and it returns to the login greeter screen. I will try to get more details when I have time later.
The error is from this bit of code in amdgpu_cs.c, around line 232, in function amdgpu_cs_parser_init():

	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
		ret = -ECANCELED;
		goto free_all_kdata;
	}

So, I guess, somewhere in the gpu reset, those values need to be fixed up.
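[Editor's note] -ECANCELED is errno 125, which matches the "-125" in the dmesg above. A hedged sketch of the counter handshake behind that check, simplified from the real driver (these struct and function names are illustrative, not the literal amdgpu code):

/*
 * Hedged sketch of the vram_lost_counter handshake; names are
 * illustrative, not the actual driver structures.
 */
#include <linux/atomic.h>
#include <linux/types.h>

struct dev_state { atomic_t vram_lost_counter; };
struct gpu_ctx   { u32      vram_lost_counter; };

/* After a reset that could not preserve VRAM, bump the device counter. */
static void on_gpu_reset(struct dev_state *dev, bool vram_lost)
{
	if (vram_lost)
		atomic_inc(&dev->vram_lost_counter);
}

/* A context snapshots the counter when it is created. */
static void ctx_init(struct gpu_ctx *ctx, struct dev_state *dev)
{
	ctx->vram_lost_counter = atomic_read(&dev->vram_lost_counter);
}

/* A submission from a pre-reset context sees a stale snapshot and is
 * rejected with -ECANCELED (125), the error in the dmesg above. */
static bool ctx_is_stale(struct gpu_ctx *ctx, struct dev_state *dev)
{
	return ctx->vram_lost_counter !=
	       atomic_read(&dev->vram_lost_counter);
}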
(In reply to James.Dutton from comment #12)
> In function: amdgpu_cs_parser_init:
>         if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
>                 ret = -ECANCELED;
>                 goto free_all_kdata;
>         }
>
> So, I guess, somewhere in the gpu reset, those values need to be fixed up.

It means the VRAM contents were lost during the GPU reset, so any existing userspace contexts are invalid and need to be re-created (which at this point boils down to restarting any processes using the GPU for rendering).
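[Editor's note] On the userspace side, a client could in principle detect this condition and rebuild its context rather than be restarted. A hypothetical sketch (the wrapper and helper names stand in for real driver plumbing; this is not Mesa's actual recovery path):

/*
 * Hypothetical userspace sketch of reacting to -ECANCELED after a GPU
 * reset: tear down the stale context and rebuild it before retrying.
 */
#include <errno.h>

struct render_ctx;                             /* opaque driver context */
int submit_once(struct render_ctx *ctx);       /* hypothetical */
int recreate_context(struct render_ctx **ctx); /* hypothetical */

static int submit_with_recovery(struct render_ctx **ctx)
{
	int r = submit_once(*ctx);

	if (r == -ECANCELED) {
		/* VRAM was lost: every buffer in this context is gone.
		 * Recreate the context and re-upload resources before
		 * retrying the submission. */
		r = recreate_context(ctx);
		if (r == 0)
			r = submit_once(*ctx);
	}
	return r;
}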
I stop gdm and kill any remaining X processes. When I start gdm and login, it works, and displays the desktop. Previously, I was leaving one of the X processes running. So, I think this (drm-next-5.2-wip) has fixed this bug.
What | Removed | Added |
---|---|---|
CC | | lifeisfoo@gmail.com |
Created attachment 145050 [details]
dmsg drm amdgpu
I'm facing the same issue with 5.2.x and 5.3-rc4 kernel and a Radeon RX 580.
What | Removed | Added |
---|---|---|
Attachment #145050 description | dmsg drm amdgpu | dmsg drm amdgpu linux 5.3-rc4 from ubuntu ppa |
What | Removed | Added |
---|---|---|
Status | NEW | RESOLVED |
Resolution | --- | MOVED |
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1389.