* Bug: amdgpu drm driver cause process into Disk sleep state
@ 2019-09-03 7:35 78666679
[not found] ` <tencent_4DEABBEB3BB4C6A6D84CA9F0DB225FBF5809-9uewiaClKEY@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: 78666679 @ 2019-09-03 7:35 UTC (permalink / raw)
To: amd-gfx
Cc: alexander.deucher, Christian König
Hi, Sirs:
I have a WX 5100 card driven by amdgpu, and it fails at random. Sometimes it leaves processes stuck in the uninterruptible (D) wait state.
cps-new-ondemand-0587:~ # ps aux|grep -w D
root 11268 0.0 0.0 260628 3516 ? Ssl 8月26 0:00 /usr/sbin/gssproxy -D
root 136482 0.0 0.0 212500 572 pts/0 S+ 15:25 0:00 grep --color=auto -w D
root 370684 0.0 0.0 17972 7428 ? Ss 9月02 0:04 /usr/sbin/sshd -D
10066 432951 0.0 0.0 0 0 ? D 9月02 0:00 [FakeFinalizerDa]
root 496774 0.0 0.0 0 0 ? D 9月02 0:17 [kworker/8:1+eve]
cps-new-ondemand-0587:~ # cat /proc/496774/stack
[<0>] __switch_to+0x94/0xe8
[<0>] drm_sched_entity_flush+0xf8/0x248 [gpu_sched]
[<0>] amdgpu_ctx_mgr_entity_flush+0xac/0x148 [amdgpu]
[<0>] amdgpu_flush+0x2c/0x50 [amdgpu]
[<0>] filp_close+0x40/0xa0
[<0>] put_files_struct+0x118/0x120
[<0>] put_files_struct+0x30/0x68 [binder_linux]
[<0>] binder_deferred_func+0x4d4/0x658 [binder_linux]
[<0>] process_one_work+0x1b4/0x3f8
[<0>] worker_thread+0x54/0x470
[<0>] kthread+0x134/0x138
[<0>] ret_from_fork+0x10/0x18
[<0>] 0xffffffffffffffff
This issue has troubled me for a long time. I am looking forward to your help!
-----
Yanhua
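A side note on the listing above: grep -w D also matches command lines that contain a -D argument (gssproxy, sshd) and the grep process itself, so only two of the five rows are really in state D. As a minimal sketch, assuming a Linux /proc layout, the state field can be read directly from /proc/<pid>/task/<tid>/stat instead. The helper names parse_stat and d_state_tasks are made up for this illustration.

```python
import os


def parse_stat(text):
    """Parse the contents of a /proc/<pid>/stat file into (pid, comm, state).

    The comm field is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the last ')' instead of on whitespace.
    """
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def d_state_tasks(proc_root="/proc"):
    """Return (pid, tid, comm) for every thread currently in state D."""
    found = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            stat_path = os.path.join(task_dir, tid, "stat")
            try:
                with open(stat_path, errors="replace") as f:
                    _, comm, state = parse_stat(f.read())
            except OSError:
                continue  # the thread exited while we were scanning
            if state == "D":
                found.append((int(pid), int(tid), comm))
    return found
```

Calling d_state_tasks() as root on the affected machine should list only the blocked threads; the matching /proc/<pid>/task/<tid>/stack file can then be read the same way as in the report above.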
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
* Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <tencent_4DEABBEB3BB4C6A6D84CA9F0DB225FBF5809-9uewiaClKEY@public.gmane.org>
@ 2019-09-03 8:21 ` Koenig, Christian
[not found] ` <f761fec0-c0cc-426c-6bcb-c3fd23808888-5C7GfCeVMHo@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: Koenig, Christian @ 2019-09-03 8:21 UTC (permalink / raw)
To: 78666679, amd-gfx; +Cc: Deucher, Alexander
Hi Yanhua,
please update your kernel first, because that looks like a known issue
which was recently fixed by the patch "drm/scheduler: use job count instead
of peek".
Probably best to try the latest bleeding-edge kernel, and if that doesn't
help, please open a bug report on https://bugs.freedesktop.org/.
Regards,
Christian.
* Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <f761fec0-c0cc-426c-6bcb-c3fd23808888-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-03 8:27 ` 78666679
2019-09-03 12:50 ` Re: Bug: amdgpu drm driver cause process into Disk sleep state 78666679
1 sibling, 0 replies; 11+ messages in thread
From: 78666679 @ 2019-09-03 8:27 UTC (permalink / raw)
To: Koenig, Christian, amd-gfx
Cc: Deucher, Alexander
Hi, Christian:
Thanks very much for your fast reply. I will try that patch first. My kernel version is 4.19.36 (for various reasons, this version cannot be upgraded).
----
yanhua
* Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <f761fec0-c0cc-426c-6bcb-c3fd23808888-5C7GfCeVMHo@public.gmane.org>
2019-09-03 8:27 ` Re: Bug: amdgpu drm driver cause process into Disk sleep state 78666679
@ 2019-09-03 12:50 ` 78666679
[not found] ` <tencent_7DC9F5195A4D538FA626F85991875FC5F508-9uewiaClKEY@public.gmane.org>
1 sibling, 1 reply; 11+ messages in thread
From: 78666679 @ 2019-09-03 12:50 UTC (permalink / raw)
To: Koenig, Christian, amd-gfx
Cc: Deucher, Alexander
Hi Christian,
Sometimes the thread blocks in disk sleep inside a call to amdgpu_sa_bo_new. The stack trace follows. It seems the SA (sub-allocator) buffer objects are used up, so the caller blocks waiting for someone to free SA resources.
D 206833 227656 [surfaceflinger] <defunct> Binder:45_5
cat /proc/206833/task/227656/stack
[<0>] __switch_to+0x94/0xe8
[<0>] dma_fence_wait_any_timeout+0x234/0x2d0
[<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
[<0>] amdgpu_ib_get+0x60/0xc8 [amdgpu]
[<0>] amdgpu_job_alloc_with_ib+0x70/0xb0 [amdgpu]
[<0>] amdgpu_vm_bo_update_mapping+0x2e0/0x3d8 [amdgpu]
[<0>] amdgpu_vm_bo_update+0x2a0/0x710 [amdgpu]
[<0>] amdgpu_gem_va_ioctl+0x46c/0x4c8 [amdgpu]
[<0>] drm_ioctl_kernel+0x94/0x118 [drm]
[<0>] drm_ioctl+0x1f0/0x438 [drm]
[<0>] amdgpu_drm_ioctl+0x58/0x90 [amdgpu]
[<0>] do_vfs_ioctl+0xc4/0x8c0
[<0>] ksys_ioctl+0x8c/0xa0
[<0>] __arm64_sys_ioctl+0x28/0x38
[<0>] el0_svc_common+0xa0/0x180
[<0>] el0_svc_handler+0x38/0x78
[<0>] el0_svc+0x8/0xc
[<0>] 0xffffffffffffffff
--------------------
YanHua
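To compare traces like this one across many stuck threads, the frames can be split into function, offset and module. This is only a sketch, assuming the /proc/<pid>/stack line format shown above; Frame, parse_stack and first_frame_in are names invented for the example.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    func: str
    offset: int
    size: int
    module: Optional[str]  # None for frames in the core kernel image


# Matches lines such as:
#   [<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
# The trailing "[<0>] 0xffffffffffffffff" marker has no "+offset/size" part
# and is therefore skipped.
FRAME_RE = re.compile(
    r"\[<[0-9a-fx]+>\]\s+"
    r"(?P<func>[^\s+]+)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_stack(text: str) -> List[Frame]:
    """Turn the text of /proc/<pid>/stack into a list of frames, innermost first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue
        frames.append(
            Frame(m["func"], int(m["off"], 16), int(m["size"], 16), m["module"])
        )
    return frames


def first_frame_in(frames: List[Frame], module: str) -> Optional[Frame]:
    """Return the innermost frame that belongs to the given module, if any."""
    for frame in frames:
        if frame.module == module:
            return frame
    return None
```

Applied to the trace above, first_frame_in(frames, "amdgpu") picks out amdgpu_sa_bo_new, while the same call on the first trace in this thread picks out amdgpu_ctx_mgr_entity_flush, which makes it easy to tell the two wait sites apart.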
* Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <tencent_7DC9F5195A4D538FA626F85991875FC5F508-9uewiaClKEY@public.gmane.org>
@ 2019-09-03 13:07 ` Koenig, Christian
[not found] ` <2162676e-dbfa-a67d-248c-98e9eb2099c2-5C7GfCeVMHo@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: Koenig, Christian @ 2019-09-03 13:07 UTC (permalink / raw)
To: 78666679, amd-gfx; +Cc: Deucher, Alexander
Well that looks like the hardware got stuck.
Do you get something in the logs about a timeout on the SDMA ring?
Regards,
Christian.
* Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <2162676e-dbfa-a67d-248c-98e9eb2099c2-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-03 13:16 ` 78666679
[not found] ` <tencent_DFCD5A0853FDA639F81F91375F8DF55AF508-9uewiaClKEY@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: 78666679 @ 2019-09-03 13:16 UTC (permalink / raw)
To: Koenig, Christian, amd-gfx
Cc: Deucher, Alexander
Yes, with dmesg|grep drm, I get the following:
[348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
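The two sequence numbers in that message can be compared mechanically. The sketch below assumes the message format shown above and treats signaled seq as the last fence the ring completed and emitted seq as the last one submitted, so their difference is the number of submissions still outstanding. The function name parse_ring_timeout is hypothetical.

```python
import re

# Pattern for the amdgpu job timeout message as it appears in the dmesg
# output quoted in this thread; other kernel versions may word it differently.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Extract ring name and fence sequence numbers from one dmesg line.

    Returns None when the line is not a ring timeout message.
    """
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m["signaled"])
    emitted = int(m["emitted"])
    return {
        "ring": m["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Submissions handed to the ring that have not signaled yet.
        "pending": emitted - signaled,
    }
```

For the line above this gives ring sdma1 with 3 submissions outstanding, which fits a ring that stopped making progress rather than one that is merely slow.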
* Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <tencent_DFCD5A0853FDA639F81F91375F8DF55AF508-9uewiaClKEY@public.gmane.org>
@ 2019-09-03 13:19 ` Koenig, Christian
[not found] ` <88a08dcc-2e95-9379-693f-2d3fd928aa11-5C7GfCeVMHo@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: Koenig, Christian @ 2019-09-03 13:19 UTC (permalink / raw)
To: 78666679, amd-gfx; +Cc: Deucher, Alexander
This is just a GPU lockup. Please open a bug report on freedesktop.org, attach the full dmesg, and say which version of Mesa you are using.
Regards,
Christian.
* Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <88a08dcc-2e95-9379-693f-2d3fd928aa11-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-03 13:44 ` 78666679
2019-09-05 1:36 ` Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state yanhua
1 sibling, 0 replies; 11+ messages in thread
From: 78666679 @ 2019-09-03 13:44 UTC (permalink / raw)
To: Koenig, Christian, amd-gfx
Cc: Deucher, Alexander
The bug url is:
https://bugs.freedesktop.org/show_bug.cgi?id=111551
* Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <88a08dcc-2e95-9379-693f-2d3fd928aa11-5C7GfCeVMHo@public.gmane.org>
2019-09-03 13:44 ` Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state 78666679
@ 2019-09-05 1:36 ` yanhua
[not found] ` <tencent_20683D4D4999B2E0A746EA7D01D677D6070A-9uewiaClKEY@public.gmane.org>
1 sibling, 1 reply; 11+ messages in thread
From: yanhua @ 2019-09-05 1:36 UTC (permalink / raw)
To: Koenig, Christian, amd-gfx
Cc: Deucher, Alexander
Hi, Christian,
I noticed that you said 'amdgpu is known to not work on arm64 until very recently'. The CPU-related DRM commit I found is "drm: disable uncached DMA optimization for ARM and arm64":
@@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
 	return false;
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+	/*
+	 * The DRM driver stack is designed to work with cache coherent devices
+	 * only, but permits an optimization to be enabled in some cases, where
+	 * for some buffers, both the CPU and the GPU use uncached mappings,
+	 * removing the need for DMA snooping and allocation in the CPU caches.
+	 *
+	 * The use of uncached GPU mappings relies on the correct implementation
+	 * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
+	 * will use cached mappings nonetheless. On x86 platforms, this does not
+	 * seem to matter, as uncached CPU mappings will snoop the caches in any
+	 * case. However, on ARM and arm64, enabling this optimization on a
+	 * platform where NoSnoop is ignored results in loss of coherency, which
+	 * breaks correct operation of the device. Since we have no way of
+	 * detecting whether NoSnoop works or not, just disable this
+	 * optimization entirely for ARM and arm64.
+	 */
+	return false;
 #else
 	return true;
 #endif
Its real effect is in amdgpu_object.c:
if (!drm_arch_can_wc_memory())
bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
We already have AMDGPU_GEM_CREATE_CPU_GTT_USWC turned off in our 4.19.36 kernel, so I think this is not the cause of my bug. Is there anything I have missed?
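For readers following the thread, the effect of that check can be modelled in a few lines of standalone C. This is only a sketch under stated assumptions, not the driver code: SKETCH_GEM_CREATE_CPU_GTT_USWC and sketch_filter_bo_flags() are hypothetical names, the bit value is an assumption for illustration, and can_wc stands in for the return value of drm_arch_can_wc_memory().

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for AMDGPU_GEM_CREATE_CPU_GTT_USWC. The bit position is an
 * assumption made for this sketch, not copied from the kernel headers. */
#define SKETCH_GEM_CREATE_CPU_GTT_USWC (UINT64_C(1) << 2)

/* Hypothetical helper modelling the check quoted from amdgpu_object.c:
 * when the architecture cannot use write-combined memory, the USWC
 * request is silently dropped and every other flag is left untouched. */
static uint64_t sketch_filter_bo_flags(uint64_t flags, bool can_wc)
{
    if (!can_wc)
        flags &= ~SKETCH_GEM_CREATE_CPU_GTT_USWC;
    return flags;
}
```

With can_wc false, as on ARM and arm64 after the commit above, a request carrying the USWC bit comes back without it, so the buffer ends up with an ordinary cached CPU mapping.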
I suggested that the machine supplier use a newer kernel such as 5.2.2, but they failed to do so after several attempts. We also backported a series of patches from newer kernels, but we still get the ring timeout.
We have been digging into the amdgpu DRM driver for a long time, but it is really difficult for me, especially the hardware-related ring timeout.
------------------
Yanhua
------------------ Original Message ------------------
From: "Koenig, Christian"<Christian.Koenig@amd.com>;
Sent: Tuesday, September 3, 2019, 9:19 PM
To: "yanhua"<78666679@qq.com>;"amd-gfx"<amd-gfx@lists.freedesktop.org>;
Cc: "Deucher, Alexander"<Alexander.Deucher@amd.com>;
Subject: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
This is just a GPU lock, please open up a bug report on freedesktop.org and attach the full dmesg and which version of Mesa you are using.
Regards,
Christian.
On 03.09.19 at 15:16, 78666679 wrote:
Yes, with dmesg|grep drm , I get following.
[348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
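The two sequence numbers in that line can be read as a count of fences that were emitted on the ring but never signaled. A minimal C sketch of that arithmetic follows, assuming the message keeps the "signaled seq=N, emitted seq=M" shape shown above; parse_ring_timeout() is a hypothetical helper written for this note, not part of the driver or of any tool.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract both sequence numbers from a ring-timeout
 * log line. Returns 0 on success, -1 if the expected pattern is absent. */
static int parse_ring_timeout(const char *line, unsigned long *signaled,
                              unsigned long *emitted)
{
    const char *p = strstr(line, "signaled seq=");

    if (p == NULL)
        return -1;
    if (sscanf(p, "signaled seq=%lu, emitted seq=%lu", signaled, emitted) != 2)
        return -1;
    return 0;
}

/* Number of fences emitted but not yet signaled, or -1 on a parse error. */
static long outstanding_fences(const char *line)
{
    unsigned long signaled, emitted;

    if (parse_ring_timeout(line, &signaled, &emitted) != 0)
        return -1;
    return (long)(emitted - signaled);
}
```

Applied to the line above, the difference is 24423865 - 24423862 = 3, which suggests three submissions were still pending on sdma1 when the timeout fired.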
------------------ Original Message ------------------
From: "Koenig, Christian"<Christian.Koenig@amd.com>;
Sent: Tuesday, September 3, 2019, 9:07 PM
To: ""<78666679@qq.com>;"amd-gfx"<amd-gfx@lists.freedesktop.org>;
Cc: "Deucher, Alexander"<Alexander.Deucher@amd.com>;
Subject: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
Well that looks like the hardware got stuck.
Do you get something in the locks about a timeout on the SDMA ring?
Regards,
Christian.
On 03.09.19 at 14:50, 78666679 wrote:
Hi Christian,
Sometimes the thread blocked disk sleeping in call to amdgpu_sa_bo_new. following is the stack trace. it seems the sa bo is used up , so the caller blocked waiting someone to free sa resources.
D 206833 227656 [surfaceflinger] <defunct> Binder:45_5
cat /proc/206833/task/227656/stack
[<0>] __switch_to+0x94/0xe8
[<0>] dma_fence_wait_any_timeout+0x234/0x2d0
[<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
[<0>] amdgpu_ib_get+0x60/0xc8 [amdgpu]
[<0>] amdgpu_job_alloc_with_ib+0x70/0xb0 [amdgpu]
[<0>] amdgpu_vm_bo_update_mapping+0x2e0/0x3d8 [amdgpu]
[<0>] amdgpu_vm_bo_update+0x2a0/0x710 [amdgpu]
[<0>] amdgpu_gem_va_ioctl+0x46c/0x4c8 [amdgpu]
[<0>] drm_ioctl_kernel+0x94/0x118 [drm]
[<0>] drm_ioctl+0x1f0/0x438 [drm]
[<0>] amdgpu_drm_ioctl+0x58/0x90 [amdgpu]
[<0>] do_vfs_ioctl+0xc4/0x8c0
[<0>] ksys_ioctl+0x8c/0xa0
[<0>] __arm64_sys_ioctl+0x28/0x38
[<0>] el0_svc_common+0xa0/0x180
[<0>] el0_svc_handler+0x38/0x78
[<0>] el0_svc+0x8/0xc
[<0>] 0xffffffffffffffff
--------------------
YanHua
------------------ Original Message ------------------
From: "Koenig, Christian"<Christian.Koenig@amd.com>;
Sent: Tuesday, September 3, 2019, 4:21 PM
To: ""<78666679@qq.com>;"amd-gfx"<amd-gfx@lists.freedesktop.org>;
Cc: "Deucher, Alexander"<Alexander.Deucher@amd.com>;
Subject: Re: Bug: amdgpu drm driver cause process into Disk sleep state
Hi Yanhua,
please update your kernel first, cause that looks like a known issue
which was recently fixed by patch "drm/scheduler: use job count instead
of peek".
Probably best to try the latest bleeding edge kernel and if that doesn't
help please open up a bug report on https://bugs.freedesktop.org/.
Regards,
Christian.
On 03.09.19 at 09:35, 78666679 wrote:
> Hi, Sirs:
> I have a wx5100 amdgpu card, It randomly come into failure. sometimes, it will cause processes into uninterruptible wait state.
>
>
> cps-new-ondemand-0587:~ # ps aux|grep -w D
> root 11268 0.0 0.0 260628 3516 ? Ssl Aug26 0:00 /usr/sbin/gssproxy -D
> root 136482 0.0 0.0 212500 572 pts/0 S+ 15:25 0:00 grep --color=auto -w D
> root 370684 0.0 0.0 17972 7428 ? Ss Sep02 0:04 /usr/sbin/sshd -D
> 10066 432951 0.0 0.0 0 0 ? D Sep02 0:00 [FakeFinalizerDa]
> root 496774 0.0 0.0 0 0 ? D Sep02 0:17 [kworker/8:1+eve]
> cps-new-ondemand-0587:~ # cat /proc/496774/stack
> [<0>] __switch_to+0x94/0xe8
> [<0>] drm_sched_entity_flush+0xf8/0x248 [gpu_sched]
> [<0>] amdgpu_ctx_mgr_entity_flush+0xac/0x148 [amdgpu]
> [<0>] amdgpu_flush+0x2c/0x50 [amdgpu]
> [<0>] filp_close+0x40/0xa0
> [<0>] put_files_struct+0x118/0x120
> [<0>] put_files_struct+0x30/0x68 [binder_linux]
> [<0>] binder_deferred_func+0x4d4/0x658 [binder_linux]
> [<0>] process_one_work+0x1b4/0x3f8
> [<0>] worker_thread+0x54/0x470
> [<0>] kthread+0x134/0x138
> [<0>] ret_from_fork+0x10/0x18
> [<0>] 0xffffffffffffffff
>
>
>
> This issue troubled me a long time. looking eagerly to get help from you!
>
>
> -----
> Yanhua
* Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <tencent_20683D4D4999B2E0A746EA7D01D677D6070A-9uewiaClKEY@public.gmane.org>
@ 2019-09-06 11:23 ` Koenig, Christian
[not found] ` <badd9ea1-6f78-abbc-bdbe-e11271188524-5C7GfCeVMHo@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: Koenig, Christian @ 2019-09-06 11:23 UTC (permalink / raw)
To: yanhua, amd-gfx; +Cc: Deucher, Alexander
> Is there anything I have missed?
Yeah, unfortunately quite a bunch of things. The fact that arm64 doesn't support the PCIe NoSnoop TLP attribute is only the tip of the iceberg.
You need a full "recent" driver stack, e.g. not older than a few months to a year, for this to work. And not only the kernel, but also recent userspace components.
Maybe that's something you could try first, e.g. install a recent version of Mesa and/or tell Mesa to not use the SDMA at all. But since you are running into an SDMA lockup with a kernel-triggered page table update, I see little chance that this works.
The only other alternative I can see is the DKMS package of the pro-driver. With that one you might be able to compile the recent driver for an older kernel version.
But I can't guarantee at all that this actually works on ARM64.
Sorry that I don't have better news for you,
Christian.
* Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
[not found] ` <badd9ea1-6f78-abbc-bdbe-e11271188524-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-11 2:43 ` yanhua
0 siblings, 0 replies; 11+ messages in thread
From: yanhua @ 2019-09-11 2:43 UTC (permalink / raw)
To: Koenig, Christian, amd-gfx
Cc: Deucher, Alexander
I will be away for a few days. Thank you very much for your detailed answer.
Best Regards.
Yanhua
Thread overview: 11+ messages
2019-09-03 7:35 Bug: amdgpu drm driver cause process into Disk sleep state 78666679
[not found] ` <tencent_4DEABBEB3BB4C6A6D84CA9F0DB225FBF5809-9uewiaClKEY@public.gmane.org>
2019-09-03 8:21 ` Koenig, Christian
[not found] ` <f761fec0-c0cc-426c-6bcb-c3fd23808888-5C7GfCeVMHo@public.gmane.org>
2019-09-03 8:27 ` Re: Bug: amdgpu drm driver cause process into Disk sleep state 78666679
2019-09-03 12:50 ` Re: Bug: amdgpu drm driver cause process into Disk sleep state 78666679
[not found] ` <tencent_7DC9F5195A4D538FA626F85991875FC5F508-9uewiaClKEY@public.gmane.org>
2019-09-03 13:07 ` Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state Koenig, Christian
[not found] ` <2162676e-dbfa-a67d-248c-98e9eb2099c2-5C7GfCeVMHo@public.gmane.org>
2019-09-03 13:16 ` Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state 78666679
[not found] ` <tencent_DFCD5A0853FDA639F81F91375F8DF55AF508-9uewiaClKEY@public.gmane.org>
2019-09-03 13:19 ` Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state Koenig, Christian
[not found] ` <88a08dcc-2e95-9379-693f-2d3fd928aa11-5C7GfCeVMHo@public.gmane.org>
2019-09-03 13:44 ` Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state 78666679
2019-09-05 1:36 ` Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state yanhua
[not found] ` <tencent_20683D4D4999B2E0A746EA7D01D677D6070A-9uewiaClKEY@public.gmane.org>
2019-09-06 11:23 ` Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state Koenig, Christian
[not found] ` <badd9ea1-6f78-abbc-bdbe-e11271188524-5C7GfCeVMHo@public.gmane.org>
2019-09-11 2:43 ` Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state yanhua