Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

From: Luben Tuikov <luben.tuikov@amd.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	dri-devel@lists.freedesktop.org
Cc: ckoenig.leichtzumerken@gmail.com,
	Li Yunxiang <Yunxiang.Li@amd.com>,
	amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq
Date: Tue, 23 Aug 2022 14:30:26 -0400	[thread overview]
Message-ID: <1eadb06c-7d2c-0317-a34a-c219c68b93c3@amd.com> (raw)
In-Reply-To: <bf0791b0-c917-6b6e-75bf-0c16869aa045@amd.com>

On 2022-08-23 14:13, Andrey Grodzovsky wrote:
> 
> On 2022-08-23 12:58, Luben Tuikov wrote:
>> Inlined:
>>
>> On 2022-08-22 16:09, Andrey Grodzovsky wrote:
>>> Poblem: Given many entities competing for same rq on
>> ^Problem
>>
>>> same scheduler an uncceptabliy long wait time for some
>> ^unacceptably
>>
>>> jobs waiting stuck in rq before being picked up are
>>> observed (seen using  GPUVis).
>>> The issue is due to Round Robin policy used by scheduler
>>> to pick up the next entity for execution. Under stress
>>> of many entities and long job queus within entity some
>> ^queues
>>
>>> jobs could be stack for very long time in it's entity's
>>> queue before being popped from the queue and executed
>>> while for other entites with samller job queues a job
>> ^entities; smaller
>>
>>> might execute ealier even though that job arrived later
>> ^earlier
>>
>>> then the job in the long queue.
>>>
>>> Fix:
>>> Add FIFO selection policy to entites in RQ, chose next enitity
>>> on rq in such order that if job on one entity arrived
>>> ealrier then job on another entity the first job will start
>>> executing ealier regardless of the length of the entity's job
>>> queue.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Tested-by: Li Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_entity.c |  2 +
>>>   drivers/gpu/drm/scheduler/sched_main.c   | 65 ++++++++++++++++++++++--
>>>   include/drm/gpu_scheduler.h              |  8 +++
>>>   3 files changed, 71 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 6b25b2f4f5a3..3bb7f69306ef 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -507,6 +507,8 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>   	atomic_inc(entity->rq->sched->score);
>>>   	WRITE_ONCE(entity->last_user, current->group_leader);
>>>   	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>>> +	sched_job->submit_ts = ktime_get();
>>> +
>>>   
>>>   	/* first job wakes up scheduler */
>>>   	if (first) {
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 68317d3a7a27..c123aa120d06 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -59,6 +59,19 @@
>>>   #define CREATE_TRACE_POINTS
>>>   #include "gpu_scheduler_trace.h"
>>>   
>>> +
>>> +
>>> +int drm_sched_policy = -1;
>>> +
>>> +/**
>>> + * DOC: sched_policy (int)
>>> + * Used to override default entites scheduling policy in a run queue.
>>> + */
>>> +MODULE_PARM_DESC(sched_policy,
>>> +		"specify schedule policy for entites on a runqueue (-1 = auto(default) value, 0 = Round Robin,1  = use FIFO");
>>> +module_param_named(sched_policy, drm_sched_policy, int, 0444);
>> As per Christian's comments, you can drop the "auto" and perhaps leave one as the default,
>> say the RR.
>>
>> I do think it is beneficial to have a module parameter control the scheduling policy, as shown above.
> 
> 
> Christian is not against it, just against adding 'auto' here - like the 
> default.

Exactly what I said.

Also, I still think an O(1) scheduling (picking next to run) should be
what we strive for in such a FIFO patch implementation.
A FIFO mechanism is by it's nature an O(1) mechanism for picking the next
element.

Regards,
Luben

> 
> 
>>
>>> +
>>> +
>>>   #define to_drm_sched_job(sched_job)		\
>>>   		container_of((sched_job), struct drm_sched_job, queue_node)
>>>   
>>> @@ -120,14 +133,16 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>   }
>>>   
>>>   /**
>>> - * drm_sched_rq_select_entity - Select an entity which could provide a job to run
>>> + * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>    *
>>>    * @rq: scheduler run queue to check.
>>>    *
>>> - * Try to find a ready entity, returns NULL if none found.
>>> + * Try to find a ready entity, in round robin manner.
>>> + *
>>> + * Returns NULL if none found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   {
>>>   	struct drm_sched_entity *entity;
>>>   
>>> @@ -163,6 +178,45 @@ drm_sched_rq_select_entity(struct drm_sched_rq *rq)
>>>   	return NULL;
>>>   }
>>>   
>>> +/**
>>> + * drm_sched_rq_select_entity_fifo - Select an entity which could provide a job to run
>>> + *
>>> + * @rq: scheduler run queue to check.
>>> + *
>>> + * Try to find a ready entity, based on FIFO order of jobs arrivals.
>>> + *
>>> + * Returns NULL if none found.
>>> + */
>>> +static struct drm_sched_entity *
>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +{
>>> +	struct drm_sched_entity *tmp, *entity = NULL;
>>> +	ktime_t oldest_ts = KTIME_MAX;
>>> +	struct drm_sched_job *sched_job;
>>> +
>>> +	spin_lock(&rq->lock);
>>> +
>>> +	list_for_each_entry(tmp, &rq->entities, list) {
>>> +
>>> +		if (drm_sched_entity_is_ready(tmp)) {
>>> +			sched_job = to_drm_sched_job(spsc_queue_peek(&tmp->job_queue));
>>> +
>>> +			if (ktime_before(sched_job->submit_ts, oldest_ts)) {
>>> +				oldest_ts = sched_job->submit_ts;
>>> +				entity = tmp;
>>> +			}
>>> +		}
>>> +	}
>> Here I think we need an O(1) lookup of the next job to pick out to run.
>> I see a number of optimizations, for instance keeping the current/oldest
>> timestamp in the rq struct itself,
> 
> 
> This was my original design with rb tree based min heap per rq based on 
> time stamp of
> the oldest job waiting in each entity's job queue (head of entity's SPCP 
> job queue). But how in this
> case you record the timestamps of all the jobs waiting in entity's the 
> SPCP queue behind the head job ?
> If you record only the oldest job and more jobs come you have no place 
> to store their timestamps and so
> on next job select after current HEAD how you will know who came before 
> or after between 2 job queues of
> of 2 entities ?
> 
> 
>> or better yet keeping the next job
>> to pick out to run at a head of list (a la timer wheel implementation).
>> For suck an optimization to work, you'd prep things up on job insertion, rather
>> than on job removal/pick to run.
> 
> 
> I looked at timer wheel and I don't see how this applies here - it 
> assumes you know before
> job submission your deadline time for job's execution to start - which 
> we don't so I don't see
> how we can use it. This seems more suitable for real time scheduler 
> implementation where you
> have a hard requirement to execute a job by some specific time T.
> 
> I also mentioned a list, obviously there is the brute force solution of 
> just ordering all jobs in one giant list and get
> naturally a FIFO ordering this way with O(1) insertions and extractions 
> but I assume we don't want one giant jobs queue
> for all the entities to avoid more dependeies between them (like locking 
> the entire list when one specific entity is killed and
> needs to remove it's jobs from SW queue).
> 
>>
>> I'm also surprised that there is no job transition from one queue to another,
>> as it is picked to run next--for the above optimizations to implemented, you'd
>> want a state transition from (state) queue to queue.
> 
> 
> Not sure what you have in mind here ? How this helps ?
> 
> Andrey
> 
> 
>>
>> Regards,
>> Luben
>>
> 
> In my origianl design
> 
>>> +
>>> +	if (entity) {
>>> +		rq->current_entity = entity;
>>> +		reinit_completion(&entity->entity_idle);
>>> +	}
>>> +
>>> +	spin_unlock(&rq->lock);
>>> +	return entity;
>>> +}
>>> +
>>>   /**
>>>    * drm_sched_job_done - complete a job
>>>    * @s_job: pointer to the job which is done
>>> @@ -804,7 +858,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>   
>>>   	/* Kernel run queue has higher priority than normal run queue*/
>>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>> -		entity = drm_sched_rq_select_entity(&sched->sched_rq[i]);
>>> +		entity = drm_sched_policy != 1 ?
>>> +				drm_sched_rq_select_entity_rr(&sched->sched_rq[i]) :
>>> +				drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]);
>>> +
>>>   		if (entity)
>>>   			break;
>>>   	}
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index addb135eeea6..95865881bfcf 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -314,6 +314,14 @@ struct drm_sched_job {
>>>   
>>>   	/** @last_dependency: tracks @dependencies as they signal */
>>>   	unsigned long			last_dependency;
>>> +
>>> +       /**
>>> +	* @submit_ts:
>>> +	*
>>> +	* Marks job submit time
>>> +	*/
>>> +       ktime_t				submit_ts;
>>> +
>>>   };
>>>   
>>>   static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,

Regards,
-- 
Luben