* [PATCH] drm/sched: fix the bug of time out calculation
@ 2021-08-24  9:51 Monk Liu
  2021-08-24 14:46 ` Andrey Grodzovsky
  0 siblings, 1 reply; 4+ messages in thread
From: Monk Liu @ 2021-08-24  9:51 UTC (permalink / raw)
  To: amd-gfx; +Cc: Monk Liu

The original logic is wrong in that the timeout is not retriggered
after the previous job signals; this leads to all jobs in the same
scheduler sharing the single timeout timer armed for the very first
job on that scheduler.

We should modify the timer every time a previous job signals.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..fb27025 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
 		schedule_delayed_work(&sched->work_tdr, sched->timeout);
 }
 
+static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
+{
+	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
+	    !list_empty(&sched->pending_list))
+		mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
+}
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
@@ -693,6 +700,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
+
+		/* once the job is removed from the pending list, restart
+		 * the timeout calculation for the next job.
+		 */
+		drm_sched_restart_timeout(sched);
 		/* make the scheduled timestamp more accurate */
 		next = list_first_entry_or_null(&sched->pending_list,
 						typeof(*next), list);
-- 
2.7.4



* Re: [PATCH] drm/sched: fix the bug of time out calculation
  2021-08-24  9:51 [PATCH] drm/sched: fix the bug of time out calculation Monk Liu
@ 2021-08-24 14:46 ` Andrey Grodzovsky
  2021-08-24 15:01   ` Andrey Grodzovsky
  0 siblings, 1 reply; 4+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 14:46 UTC (permalink / raw)
  To: Monk Liu, amd-gfx


On 2021-08-24 5:51 a.m., Monk Liu wrote:
> The original logic is wrong in that the timeout is not retriggered
> after the previous job signals; this leads to all jobs in the same
> scheduler sharing the single timeout timer armed for the very first
> job on that scheduler.
>
> We should modify the timer every time a previous job signals.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..fb27025 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>   		schedule_delayed_work(&sched->work_tdr, sched->timeout);
>   }
>   
> +static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
> +{
> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> +	    !list_empty(&sched->pending_list))
> +		mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
> +}
> +
>   /**
>    * drm_sched_fault - immediately start timeout handler
>    *
> @@ -693,6 +700,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>   	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>   		/* remove job from pending_list */
>   		list_del_init(&job->list);
> +
> +		/* once the job is removed from the pending list, restart
> +		 * the timeout calculation for the next job.
> +		 */
> +		drm_sched_restart_timeout(sched);


I think this should work, but 2 points -

1st - you should probably remove this call now:
https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L797
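
(For reference: if that link points at the drm_sched_start_timeout()
call in the else branch of drm_sched_get_cleanup_job(), the removal
would look roughly like this - a sketch only, not a tested diff:)

 	} else {
 		job = NULL;
-		/* queue timeout for next job */
-		drm_sched_start_timeout(sched);
 	}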

2nd - if two adjacent jobs start very close together, you effectively
let the second job hang for up to twice the timeout without TDR,
because you reset its TDR timer when that timer is almost expired. If
we kept a TTL (time-to-live counter) for each job and did
mod_delayed_work with the remaining TTL of the following job instead
of a full timer reset, this would be more precise. But this is more of
a recommendation for improvement.
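
For illustration, a rough sketch of the TTL idea, called from
drm_sched_get_cleanup_job() instead of the full reset (the submit_ts
field and the helper name are hypothetical - nothing like this exists
in drm_sched_job today):

/* Sketch only: assumes a new submit_ts jiffies stamp recorded on each
 * drm_sched_job when it actually starts running on the hardware.
 */
static void drm_sched_rearm_timeout_ttl(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *next;
	unsigned long elapsed;

	if (sched->timeout == MAX_SCHEDULE_TIMEOUT)
		return;

	next = list_first_entry_or_null(&sched->pending_list,
					struct drm_sched_job, list);
	if (!next)
		return;

	/* time the head job has already spent on the hardware */
	elapsed = jiffies - next->submit_ts;

	/* re-arm with the remaining TTL instead of the full timeout */
	if (elapsed >= sched->timeout)
		mod_delayed_work(system_wq, &sched->work_tdr, 0);
	else
		mod_delayed_work(system_wq, &sched->work_tdr,
				 sched->timeout - elapsed);
}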

Andrey


>   		/* make the scheduled timestamp more accurate */
>   		next = list_first_entry_or_null(&sched->pending_list,
>   						typeof(*next), list);


* Re: [PATCH] drm/sched: fix the bug of time out calculation
  2021-08-24 14:46 ` Andrey Grodzovsky
@ 2021-08-24 15:01   ` Andrey Grodzovsky
  2021-08-25  4:13     ` Liu, Monk
  0 siblings, 1 reply; 4+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 15:01 UTC (permalink / raw)
  To: Monk Liu, amd-gfx


On 2021-08-24 10:46 a.m., Andrey Grodzovsky wrote:
>
> On 2021-08-24 5:51 a.m., Monk Liu wrote:
>> The original logic is wrong in that the timeout is not retriggered
>> after the previous job signals; this leads to all jobs in the same
>> scheduler sharing the single timeout timer armed for the very first
>> job on that scheduler.
>>
>> We should modify the timer every time a previous job signals.
>>
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index a2a9536..fb27025 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct 
>> drm_gpu_scheduler *sched)
>>           schedule_delayed_work(&sched->work_tdr, sched->timeout);
>>   }
>>   +static void drm_sched_restart_timeout(struct drm_gpu_scheduler 
>> *sched)
>> +{
>> +    if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>> +        !list_empty(&sched->pending_list))
>> +        mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);


3rd point - if the list is empty you need to cancel the timer and let
the next incoming job restart it.
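
Something like this, for illustration (a sketch of the idea only, not
a tested patch):

static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
{
	if (sched->timeout == MAX_SCHEDULE_TIMEOUT)
		return;

	if (list_empty(&sched->pending_list))
		/* nothing pending: stop the timer; the next incoming
		 * job re-arms it via drm_sched_start_timeout()
		 */
		cancel_delayed_work(&sched->work_tdr);
	else
		mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
}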

Andrey


>> +}
>> +
>>   /**
>>    * drm_sched_fault - immediately start timeout handler
>>    *
>> @@ -693,6 +700,11 @@ drm_sched_get_cleanup_job(struct 
>> drm_gpu_scheduler *sched)
>>       if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>           /* remove job from pending_list */
>>           list_del_init(&job->list);
>> +
>> +        /* once the job is removed from the pending list, restart
>> +         * the timeout calculation for the next job.
>> +         */
>> +        drm_sched_restart_timeout(sched);
>
>
> I think this should work, but 2 points -
>
> 1st - you should probably remove this call now:
> https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L797
>
> 2nd - if two adjacent jobs start very close together, you effectively
> let the second job hang for up to twice the timeout without TDR,
> because you reset its TDR timer when that timer is almost expired. If
> we kept a TTL (time-to-live counter) for each job and did
> mod_delayed_work with the remaining TTL of the following job instead
> of a full timer reset, this would be more precise. But this is more
> of a recommendation for improvement.
>
> Andrey
>
>
>>           /* make the scheduled timestamp more accurate */
>>           next = list_first_entry_or_null(&sched->pending_list,
>>                           typeof(*next), list);


* RE: [PATCH] drm/sched: fix the bug of time out calculation
  2021-08-24 15:01   ` Andrey Grodzovsky
@ 2021-08-25  4:13     ` Liu, Monk
  0 siblings, 0 replies; 4+ messages in thread
From: Liu, Monk @ 2021-08-25  4:13 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx


>>
3rd point - if the list is empty you need to cancel the timer and let the next incoming job restart it.

>>
2nd - if two adjacent jobs start very close together, you effectively let the second job hang for up to twice the timeout without TDR, because you reset its TDR timer when that timer is almost expired. If we kept a TTL (time-to-live counter) for each job and did mod_delayed_work with the remaining TTL of the following job instead of a full timer reset, this would be more precise. But this is more of a recommendation for improvement.

>>
1st - you should probably remove this call now:
https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L797


I checked and thought through all of the points above, and ended up with a V2 patch; please take a look again.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Tuesday, August 24, 2021 11:02 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation


On 2021-08-24 10:46 a.m., Andrey Grodzovsky wrote:
>
> On 2021-08-24 5:51 a.m., Monk Liu wrote:
>> The original logic is wrong in that the timeout is not retriggered
>> after the previous job signals; this leads to all jobs in the same
>> scheduler sharing the single timeout timer armed for the very first
>> job on that scheduler.
>>
>> We should modify the timer every time a previous job signals.
>>
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index a2a9536..fb27025 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct 
>> drm_gpu_scheduler *sched)
>>           schedule_delayed_work(&sched->work_tdr, sched->timeout);
>>   }
>>   +static void drm_sched_restart_timeout(struct drm_gpu_scheduler
>> *sched)
>> +{
>> +    if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>> +        !list_empty(&sched->pending_list))
>> +        mod_delayed_work(system_wq, &sched->work_tdr, 
>> +sched->timeout);


3rd point - if the list is empty you need to cancel the timer and let the next incoming job restart it.

Andrey


>> +}
>> +
>>   /**
>>    * drm_sched_fault - immediately start timeout handler
>>    *
>> @@ -693,6 +700,11 @@ drm_sched_get_cleanup_job(struct 
>> drm_gpu_scheduler *sched)
>>       if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>           /* remove job from pending_list */
>>           list_del_init(&job->list);
>> +
>> +        /* once the job is removed from the pending list, restart
>> +         * the timeout calculation for the next job.
>> +         */
>> +        drm_sched_restart_timeout(sched);
>
>
> I think this should work, but 2 points -
>
> 1st - you should probably remove this call now:
> https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L797
>
> 2nd - if two adjacent jobs start very close together, you effectively
> let the second job hang for up to twice the timeout without TDR,
> because you reset its TDR timer when that timer is almost expired. If
> we kept a TTL (time-to-live counter) for each job and did
> mod_delayed_work with the remaining TTL of the following job instead
> of a full timer reset, this would be more precise. But this is more
> of a recommendation for improvement.
>
> Andrey
>
>
>>           /* make the scheduled timestamp more accurate */
>>           next = list_first_entry_or_null(&sched->pending_list,
>>                           typeof(*next), list);

