From: "Liu, Monk" <Monk.Liu@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
Subject: RE: [PATCH] drm/sched: fix the bug of time out calculation(v3)
Date: Thu, 26 Aug 2021 11:55:36 +0000	[thread overview]
Message-ID: <BL1PR12MB5269107A0A927EC3D7A7B6E784C79@BL1PR12MB5269.namprd12.prod.outlook.com> (raw)
In-Reply-To: <e000dc1a-8fe8-ea69-e16b-bf0b64d773f2@gmail.com>

[AMD Official Use Only]

>>I'm not sure if the work_tdr is initialized when a maximum timeout is specified. Please double check.

Ok, will do
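
Roughly what I have in mind for that double check (just a sketch, reusing the MAX_SCHEDULE_TIMEOUT guard the old code in drm_sched_get_cleanup_job had, in case work_tdr is never armed when an infinite timeout is configured):

	/* sketch only: keep the old guard so we never touch work_tdr
	 * when the scheduler was created with an infinite timeout
	 */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
		cancel_delayed_work(&sched->work_tdr);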

>>BTW: Can we please drop the "tdr" naming from the scheduler? That is just timeout functionality and not related to recovery in any way.
>>We don't even start hardware recovery in a lot of cases now (when the wave kill is successful).

Umm, sounds reasonable. I can rename it to "to" in another patch.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com> 
Sent: Thursday, August 26, 2021 6:09 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

Am 26.08.21 um 06:55 schrieb Monk Liu:
> issue:
> in cleanup_job, cancel_delayed_work will cancel the TO timer even if
> its corresponding job is still running.

Yeah, that makes a lot more sense.

>
> fix:
> do not cancel the timer in cleanup_job; instead, do the cancelling only
> when the head job is signaled, and if there is a "next" job we
> start the timeout again.
>
> v2:
> further clean up the logic, and do the TDR timer cancelling if the
> signaled job is the last one in its scheduler.
>
> v3:
> change the issue description
> remove the cancel_delayed_work at the beginning of cleanup_job
> restore the original implementation of drm_sched_job_begin.
>
> TODO:
> 1) introduce pause/resume of the scheduler in job_timeout to serialize
> the handling of the scheduler and job_timeout.
> 2) drop the bad job's del and re-insert in the scheduler, given the above
> serialization (no race issue anymore with the serialization)
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
>   1 file changed, 10 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..ecf8140 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>   {
>   	struct drm_sched_job *job, *next;
>   
> -	/*
> -	 * Don't destroy jobs while the timeout worker is running  OR thread
> -	 * is being parked and hence assumed to not touch pending_list
> -	 */
> -	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> -	    !cancel_delayed_work(&sched->work_tdr)) ||
> -	    kthread_should_park())
> +	if (kthread_should_park())
>   		return NULL;
>   
>   	spin_lock(&sched->job_list_lock);
> @@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>   	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>   		/* remove job from pending_list */
>   		list_del_init(&job->list);
> +
> +		/* cancel this job's TO timer */
> +		cancel_delayed_work(&sched->work_tdr);

I'm not sure if the work_tdr is initialized when a maximum timeout is specified. Please double check.

BTW: Can we please drop the "tdr" naming from the scheduler? That is just timeout functionality and not related to recovery in any way.

We don't even start hardware recovery in a lot of cases now (when the wave kill is successful).

Regards,
Christian.

>   		/* make the scheduled timestamp more accurate */
>   		next = list_first_entry_or_null(&sched->pending_list,
>   						typeof(*next), list);
> -		if (next)
> +
> +		if (next) {
>   			next->s_fence->scheduled.timestamp =
>   				job->s_fence->finished.timestamp;
> -
> +			/* start TO timer for next job */
> +			drm_sched_start_timeout(sched);
> +		}
>   	} else {
>   		job = NULL;
> -		/* queue timeout for next job */
> -		drm_sched_start_timeout(sched);
>   	}
>   
>   	spin_unlock(&sched->job_list_lock);
> @@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
>   					  (entity = drm_sched_select_entity(sched))) ||
>   					 kthread_should_stop());
>   
> -		if (cleanup_job) {
> +		if (cleanup_job)
>   			sched->ops->free_job(cleanup_job);
> -			/* queue timeout for next job */
> -			drm_sched_start_timeout(sched);
> -		}
>   
>   		if (!entity)
>   			continue;
