All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdkfd: kfd open return failed if device is locked
@ 2019-10-18 14:27 Yang, Philip
       [not found] ` <20191018142711.19261-1-Philip.Yang-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Yang, Philip @ 2019-10-18 14:27 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Yang, Philip

If device is locked for suspend and resume, kfd open should return
failed -EAGAIN without creating process, otherwise the application exit
to release the process will hang to wait for resume is done if the suspend
and resume is stuck somewhere. This is backtrace:

[Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
than 120 seconds.
[Thu Oct 17 16:43:37 2019]       Not tainted
5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
[Thu Oct 17 16:43:37 2019] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947
0x80000000
[Thu Oct 17 16:43:37 2019] Call Trace:
[Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
[Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
[Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
[Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
[Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
[Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
[amdgpu]
[Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
[amdgpu]
[Thu Oct 17 16:43:37 2019]
kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
[amdgpu]
[Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
[Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
[Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
[Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
[Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
[Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
[Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
[Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
[Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
[Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
[Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index d9e36dbf13d5..40d75c39f08e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
 		return -EPERM;
 	}
 
+	if (kfd_is_locked())
+		return -EAGAIN;
+
 	process = kfd_create_process(filep);
 	if (IS_ERR(process))
 		return PTR_ERR(process);
 
-	if (kfd_is_locked())
-		return -EAGAIN;
-
 	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
 		process->pasid, process->is_32bit_user_mode);
 
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/amdkfd: kfd open return failed if device is locked
       [not found] ` <20191018142711.19261-1-Philip.Yang-5C7GfCeVMHo@public.gmane.org>
@ 2019-10-18 15:40   ` Kuehling, Felix
       [not found]     ` <73684157-12eb-9167-fb19-63e3ad02b16a-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Kuehling, Felix @ 2019-10-18 15:40 UTC (permalink / raw)
  To: Yang, Philip, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 2019-10-18 10:27 a.m., Yang, Philip wrote:
> If device is locked for suspend and resume, kfd open should return
> failed -EAGAIN without creating process, otherwise the application exit
> to release the process will hang to wait for resume is done if the suspend
> and resume is stuck somewhere. This is backtrace:

This doesn't fix processes that were created before suspend/resume got 
stuck. They would still get stuck with the same backtrace. So this is 
jut a band-aid. The real underlying problem, that is not getting 
addressed, is suspend/resume getting stuck.

Am I missing something?

Regards,
   Felix


>
> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
> than 120 seconds.
> [Thu Oct 17 16:43:37 2019]       Not tainted
> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
> [Thu Oct 17 16:43:37 2019] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947
> 0x80000000
> [Thu Oct 17 16:43:37 2019] Call Trace:
> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]
> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index d9e36dbf13d5..40d75c39f08e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
>   		return -EPERM;
>   	}
>   
> +	if (kfd_is_locked())
> +		return -EAGAIN;
> +
>   	process = kfd_create_process(filep);
>   	if (IS_ERR(process))
>   		return PTR_ERR(process);
>   
> -	if (kfd_is_locked())
> -		return -EAGAIN;
> -
>   	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>   		process->pasid, process->is_32bit_user_mode);
>   
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/amdkfd: kfd open return failed if device is locked
       [not found]     ` <73684157-12eb-9167-fb19-63e3ad02b16a-5C7GfCeVMHo@public.gmane.org>
@ 2019-10-18 17:34       ` Yang, Philip
  0 siblings, 0 replies; 3+ messages in thread
From: Yang, Philip @ 2019-10-18 17:34 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 2019-10-18 11:40 a.m., Kuehling, Felix wrote:
> On 2019-10-18 10:27 a.m., Yang, Philip wrote:
>> If device is locked for suspend and resume, kfd open should return
>> failed -EAGAIN without creating process, otherwise the application exit
>> to release the process will hang to wait for resume is done if the suspend
>> and resume is stuck somewhere. This is backtrace:
> 
> This doesn't fix processes that were created before suspend/resume got
> stuck. They would still get stuck with the same backtrace. So this is
> jut a band-aid. The real underlying problem, that is not getting
> addressed, is suspend/resume getting stuck.
> 
> Am I missing something?
> 
This is to address application stuck to quit issue after suspend/resume 
got stuck. The real underlying suspend/resume issue should be addressed 
separately.

I will submit v2 patch to fix processes that were created before 
suspend/resume got stuck.

Philip

> Regards,
>     Felix
> 
> 
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]       Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947
>> 0x80000000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]
>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++---
>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index d9e36dbf13d5..40d75c39f08e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
>>    		return -EPERM;
>>    	}
>>    
>> +	if (kfd_is_locked())
>> +		return -EAGAIN;
>> +
>>    	process = kfd_create_process(filep);
>>    	if (IS_ERR(process))
>>    		return PTR_ERR(process);
>>    
>> -	if (kfd_is_locked())
>> -		return -EAGAIN;
>> -
>>    	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>>    		process->pasid, process->is_32bit_user_mode);
>>    
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2019-10-18 17:34 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-18 14:27 [PATCH] drm/amdkfd: kfd open return failed if device is locked Yang, Philip
     [not found] ` <20191018142711.19261-1-Philip.Yang-5C7GfCeVMHo@public.gmane.org>
2019-10-18 15:40   ` Kuehling, Felix
     [not found]     ` <73684157-12eb-9167-fb19-63e3ad02b16a-5C7GfCeVMHo@public.gmane.org>
2019-10-18 17:34       ` Yang, Philip

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.