From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 110509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Date: Mon, 29 Apr 2019 13:41:31 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=110509

Comment #6 on bug 110509 from James.Dutton@gmail.com
I think I have found the problem.
[  657.526313] amdgpu 0000:43:00.0: GPU reset begin!
[  657.526318] Evicting PASID 32782 queues
[  667.756000] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out


The intention is to do a GPU reset, but the code as implemented just tries
to do a suspend.
Part of the suspend does this:

Apr 29 14:29:19 thread kernel: [  363.445607] INFO: task kworker/u258:0:55 blocked for more than 120 seconds.
Apr 29 14:29:19 thread kernel: [  363.445612]       Not tainted 5.0.10-dirty #26
Apr 29 14:29:19 thread kernel: [  363.445613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 14:29:19 thread kernel: [  363.445615] kworker/u258:0  D    0    55      2 0x80000000
Apr 29 14:29:19 thread kernel: [  363.445628] Workqueue: events_unbound commit_work [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [  363.445629] Call Trace:
Apr 29 14:29:19 thread kernel: [  363.445635]  __schedule+0x2c0/0x880
Apr 29 14:29:19 thread kernel: [  363.445637]  schedule+0x2c/0x70
Apr 29 14:29:19 thread kernel: [  363.445639]  schedule_timeout+0x1db/0x360
Apr 29 14:29:19 thread kernel: [  363.445641]  ? update_load_avg+0x8b/0x590
Apr 29 14:29:19 thread kernel: [  363.445645]  dma_fence_default_wait+0x1eb/0x270
Apr 29 14:29:19 thread kernel: [  363.445647]  ? dma_fence_release+0xa0/0xa0
Apr 29 14:29:19 thread kernel: [  363.445649]  dma_fence_wait_timeout+0xfd/0x110
Apr 29 14:29:19 thread kernel: [  363.445651]  reservation_object_wait_timeout_rcu+0x17d/0x370
Apr 29 14:29:19 thread kernel: [  363.445710]  amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]
Apr 29 14:29:19 thread kernel: [  363.445767]  amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [  363.445820]  ? amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [  363.445828]  commit_tail+0x42/0x70 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [  363.445835]  commit_work+0x12/0x20 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [  363.445838]  process_one_work+0x1fd/0x400
Apr 29 14:29:19 thread kernel: [  363.445840]  worker_thread+0x34/0x410
Apr 29 14:29:19 thread kernel: [  363.445841]  kthread+0x121/0x140
Apr 29 14:29:19 thread kernel: [  363.445843]  ? process_one_work+0x400/0x400
Apr 29 14:29:19 thread kernel: [  363.445844]  ? kthread_park+0x90/0x90
Apr 29 14:29:19 thread kernel: [  363.445847]  ret_from_fork+0x22/0x40


So, amdgpu_dm_do_flip() is the bit that hangs.
If the GPU needs to be reset because some of it has hung, trying a "flip" is
unlikely to work.
It is failing/hanging when doing "suspend of IP block <dm>" in
amdgpu_device_ip_suspend_phase1().
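
For anyone wanting to repeat this reading of the trace on their own logs, here is a rough sketch in Python. It is not part of any kernel tooling; the helper names (parse_frames, first_module_frame) and the regex are my own assumptions about the usual "func+0xoff/0xsize [module]" frame layout. It lists the frames innermost first and picks the first reliable frame that belongs to a given module, which for the trace above is the function that entered the fence wait.

```python
import re

# Matches a stack frame such as "amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\]\s+(?P<unreliable>\?\s+)?(?P<func>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module, reliable) tuples, innermost frame first."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            reliable = m.group("unreliable") is None
            frames.append((m.group("func"), m.group("module"), reliable))
    return frames


def first_module_frame(frames, module):
    """First reliable frame from `module`, or None if there is none."""
    for func, mod, reliable in frames:
        if reliable and mod == module:
            return func
    return None


# A subset of the call trace quoted in this comment.
TRACE = """\
[  363.445635]  __schedule+0x2c0/0x880
[  363.445637]  schedule+0x2c/0x70
[  363.445639]  schedule_timeout+0x1db/0x360
[  363.445641]  ? update_load_avg+0x8b/0x590
[  363.445645]  dma_fence_default_wait+0x1eb/0x270
[  363.445649]  dma_fence_wait_timeout+0xfd/0x110
[  363.445651]  reservation_object_wait_timeout_rcu+0x17d/0x370
[  363.445710]  amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]
[  363.445767]  amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
[  363.445828]  commit_tail+0x42/0x70 [drm_kms_helper]
"""

print(first_module_frame(parse_frames(TRACE), "amdgpu"))  # amdgpu_dm_do_flip
```

Everything above amdgpu_dm_do_flip in the list is generic scheduler and dma_fence wait code, which is why I read the flip as the place where the suspend path gets stuck.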

I would suggest creating code that actually tries to reset the GPU, instead of
trying to suspend it while the GPU is hung.
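
To make the point concrete, here is a toy user-space model in Python. It is only an illustration of the ordering problem, not driver code: ToyFence, suspend_display and reset_gpu are hypothetical names, and a threading.Event stands in for a fence that the hung hardware will never signal. A suspend step that waits on that fence cannot finish, while a reset that abandons the outstanding work first lets the same wait return.

```python
import threading


class ToyFence:
    """Stand-in for a GPU fence; signalled only when work completes."""

    def __init__(self):
        self._done = threading.Event()

    def signal(self):
        self._done.set()

    def wait(self, timeout_s):
        # True if the fence was signalled, False if the timeout expired.
        return self._done.wait(timeout_s)


def suspend_display(fence, timeout_s):
    """Toy suspend step: waits for outstanding work before flipping."""
    if not fence.wait(timeout_s):
        return "timed out waiting for fence"
    return "flip done"


def reset_gpu(fence):
    """Toy reset: gives up on outstanding work instead of waiting for it."""
    fence.signal()  # release anyone waiting on the abandoned work
    return "reset done"


hung = ToyFence()                   # the hung GPU never signals this
print(suspend_display(hung, 0.05))  # suspend first: the wait cannot succeed
print(reset_gpu(hung))              # reset first: waiters are released
print(suspend_display(hung, 0.05))  # the same wait now returns
```

The short timeout is only there so the toy terminates; in the trace above the worker was still blocked after 120 seconds.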

