* [PATCH] drm/amd/amdgpu: consider paging job always not guilty
@ 2021-07-20 11:02 Jingwen Chen
  2021-07-20 12:13 ` Christian König
  0 siblings, 1 reply; 3+ messages in thread
From: Jingwen Chen @ 2021-07-20 11:02 UTC (permalink / raw)
  To: amd-gfx; +Cc: horace.chen, Jingwen Chen, monk.liu

[Why]
Currently every timed-out job is considered guilty. In the SRIOV
multi-VF use case, the VF FLR happens first and the job timeout is
detected afterwards. Several jobs can time out within a very small time
slice. If the timeout of an innocent SDMA job is detected before that of
the real bad job, the innocent SDMA job is marked guilty. This leads
to a page fault after the job is resubmitted.

[How]
If the job is a paging job, always consider it not guilty.

Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 37fa199be8b3..40461547701a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4410,7 +4410,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_fence_driver_force_completion(ring);
 	}
 
-	if(job)
+	if (job && job->vm)
 		drm_sched_increase_karma(&job->base);
 
 	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
@@ -4874,7 +4874,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
 				job ? job->base.id : -1, hive->hive_id);
 			amdgpu_put_xgmi_hive(hive);
-			if (job)
+			if (job && job->vm)
 				drm_sched_increase_karma(&job->base);
 			return 0;
 		}
@@ -4898,7 +4898,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 					job ? job->base.id : -1);
 
 		/* even we skipped this reset, still need to set the job to guilty */
-		if (job)
+		if (job && job->vm)
 			drm_sched_increase_karma(&job->base);
 		goto skip_recovery;
 	}
-- 
2.25.1
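
As a reading aid for the patch above, here is a minimal standalone sketch in C of the rule it introduces: a timed-out job is only blamed when it has a VM attached. The types and function names (`sketch_vm`, `sketch_job`, `sketch_job_can_be_guilty`, `sketch_handle_timeout`) are hypothetical stand-ins and are not part of amdgpu or the DRM scheduler. The sketch also assumes that a job with a NULL `vm` is one submitted by the kernel itself, which is what the `job->vm` check in the patch appears to rely on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real amdgpu/drm_sched types. */
struct sketch_vm {
	int id;
};

struct sketch_job {
	struct sketch_vm *vm;	/* assumption: NULL models a kernel-submitted job */
	int karma;		/* models the counter bumped by drm_sched_increase_karma() */
};

/* Mirrors the patched condition: if (job && job->vm) */
bool sketch_job_can_be_guilty(const struct sketch_job *job)
{
	return job != NULL && job->vm != NULL;
}

/*
 * Called for a job whose timeout was detected. Bumps karma only when the
 * job passes the check above; returns true if the job was blamed.
 */
bool sketch_handle_timeout(struct sketch_job *job)
{
	if (!sketch_job_can_be_guilty(job))
		return false;

	/* Sanity check for the sketch only: karma never goes negative. */
	assert(job->karma >= 0);
	job->karma++;
	return true;
}
```

With this model, calling `sketch_handle_timeout()` on a job whose `vm` is NULL leaves its `karma` untouched, while a job with a VM attached has its `karma` incremented, matching the three `if (job && job->vm)` sites in the diff.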

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH] drm/amd/amdgpu: consider paging job always not guilty
  2021-07-20 11:02 [PATCH] drm/amd/amdgpu: consider paging job always not guilty Jingwen Chen
@ 2021-07-20 12:13 ` Christian König
  2021-07-21  4:20   ` Chen, JingWen
  0 siblings, 1 reply; 3+ messages in thread
From: Christian König @ 2021-07-20 12:13 UTC (permalink / raw)
  To: Jingwen Chen, amd-gfx; +Cc: horace.chen, monk.liu



On 20.07.21 at 13:02, Jingwen Chen wrote:
> [Why]
> Currently every timed-out job is considered guilty. In the SRIOV
> multi-VF use case, the VF FLR happens first and the job timeout is
> detected afterwards. Several jobs can time out within a very small time
> slice. If the timeout of an innocent SDMA job is detected before that of
> the real bad job, the innocent SDMA job is marked guilty. This leads
> to a page fault after the job is resubmitted.
>
> [How]
> If the job is a paging job, always consider it not guilty.

Don't say "paging job"; "kernel job" is better, since the PTE updates we
are using here are not even remotely related to paging.

Regards,
Christian.


* RE: [PATCH] drm/amd/amdgpu: consider paging job always not guilty
  2021-07-20 12:13 ` Christian König
@ 2021-07-21  4:20   ` Chen, JingWen
  0 siblings, 0 replies; 3+ messages in thread
From: Chen, JingWen @ 2021-07-21  4:20 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Chen, Horace, Liu, Monk

[AMD Official Use Only]

Hi Christian,

I have uploaded the latest patch according to your suggestion.

Best Regards,
JingWen Chen


