* [PATCH] drm/amdgpu:fix gpu recover missing skipping(v2)
@ 2017-11-08  7:08 Monk Liu
From: Monk Liu @ 2017-11-08  7:08 UTC (permalink / raw)
  To: amd-gfx@lists.freedesktop.org; +Cc: Monk Liu

If an app closes its context (CTX) right after submitting an IB, GPU
recovery fails to find the entity behind the guilty job, so the
guilty job is never skipped.

To fix this corner case, move the increment of job->karma out of
the entity iteration.

v2:
Only increment karma if bad->s_priority != KERNEL, because we always
consider a KERNEL job to be correct and always want to recover an
unfinished kernel job (a kernel job is sometimes interrupted by a
VF FLR or another GPU hang event).

Change-Id: I33e9e959e182d7e002a2108e565cb898acac4f9c
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 7aa6455..c999026 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -463,7 +463,8 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
 	}
 	spin_unlock(&sched->job_list_lock);
 
-	if (bad) {
+	if (bad && bad->s_priority != AMD_SCHED_PRIORITY_KERNEL) {
+		atomic_inc(&bad->karma);
 		/* don't increase @bad's karma if it's from KERNEL RQ,
 		 * becuase sometimes GPU hang would cause kernel jobs (like VM updating jobs)
 		 * corrupt but keep in mind that kernel jobs always considered good.
@@ -474,7 +475,7 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
 			spin_lock(&rq->lock);
 			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
 				if (bad->s_fence->scheduled.context == entity->fence_context) {
-				    if (atomic_inc_return(&bad->karma) > bad->sched->hang_limit)
+				    if (atomic_read(&bad->karma) > bad->sched->hang_limit)
 						if (entity->guilty)
 							atomic_set(entity->guilty, 1);
 					break;
-- 
2.7.4
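
The effect of moving the karma increment can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the scheduler code: `struct job`, `struct entity`, `enum prio`, `reset_old()` and `reset_new()` are hypothetical stand-ins, C11 `<stdatomic.h>` replaces the kernel's atomic helpers, the run queues are flattened into one array, and any priority handling the pre-patch code may have had outside the hunk is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler types; not the real structs. */
enum prio { PRIO_NORMAL, PRIO_KERNEL };

struct entity {
	unsigned long fence_context;	/* identifies the submitting context */
	atomic_int guilty;		/* set to 1 once the hang limit is exceeded */
};

struct job {
	unsigned long fence_context;	/* context the job was submitted from */
	enum prio priority;
	atomic_int karma;		/* number of hangs blamed on this job */
	int hang_limit;
};

/* Pre-patch shape: karma only grows if the owning entity is still listed. */
void reset_old(struct job *bad, struct entity *ents, size_t n)
{
	assert(bad != NULL);
	for (size_t i = 0; i < n; i++) {
		if (ents[i].fence_context != bad->fence_context)
			continue;
		/* atomic_fetch_add() returns the old value, hence the "+ 1" */
		if (atomic_fetch_add(&bad->karma, 1) + 1 > bad->hang_limit)
			atomic_store(&ents[i].guilty, 1);
		break;
	}
}

/* Patched shape: karma grows before the lookup, except for kernel jobs. */
void reset_new(struct job *bad, struct entity *ents, size_t n)
{
	assert(bad != NULL);
	if (bad->priority == PRIO_KERNEL)
		return;			/* kernel jobs are never blamed */

	atomic_fetch_add(&bad->karma, 1);
	for (size_t i = 0; i < n; i++) {
		if (ents[i].fence_context != bad->fence_context)
			continue;
		if (atomic_load(&bad->karma) > bad->hang_limit)
			atomic_store(&ents[i].guilty, 1);
		break;
	}
}
```

Calling `reset_old()` with an empty entity array (the context was already closed) leaves karma at 0, so in this model the job would never be skipped. `reset_new()` raises karma to 1 in the same situation and leaves a `PRIO_KERNEL` job untouched.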

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* RE: [PATCH] drm/amdgpu:fix gpu recover missing skipping(v2)
From: Yu, Xiangliang @ 2017-11-08  7:25 UTC (permalink / raw)
  To: amd-gfx@lists.freedesktop.org; +Cc: Liu, Monk

Reviewed-by: Xiangliang Yu <Xiangliang.Yu@amd.com>



* Re: [PATCH] drm/amdgpu:fix gpu recover missing skipping(v2)
From: Christian König @ 2017-11-08  9:45 UTC (permalink / raw)
  To: Monk Liu, amd-gfx@lists.freedesktop.org

On 08.11.2017 at 08:08, Monk Liu wrote:
> If an app closes its context (CTX) right after submitting an IB, GPU
> recovery fails to find the entity behind the guilty job, so the
> guilty job is never skipped.
>
> To fix this corner case, move the increment of job->karma out of
> the entity iteration.
>
> v2:
> Only increment karma if bad->s_priority != KERNEL, because we always
> consider a KERNEL job to be correct and always want to recover an
> unfinished kernel job (a kernel job is sometimes interrupted by a
> VF FLR or another GPU hang event).

Good point, my Reviewed-by still stands for that version.

Christian.


