From: "Christian König" <christian.koenig@amd.com>
To: Matthew Brost <matthew.brost@intel.com>, Daniel Vetter <daniel@ffwll.ch>
Cc: robdclark@chromium.org, thomas.hellstrom@linux.intel.com,
	sarah.walker@imgtec.com, ketil.johnsen@arm.com,
	Liviu.Dudau@arm.com, dri-devel@lists.freedesktop.org,
	luben.tuikov@amd.com, lina@asahilina.net,
	donald.robson@imgtec.com, boris.brezillon@collabora.com,
	intel-xe@lists.freedesktop.org, faith.ekstrand@collabora.com
Subject: Re: [PATCH 4/8] drm/sched: Add generic scheduler message interface
Date: Mon, 7 Aug 2023 17:46:16 +0200	[thread overview]
Message-ID: <b19bb89c-a814-2519-e037-ee1c33dce02f@amd.com> (raw)
In-Reply-To: <ZM0HfRR8CczTLpao@DUT025-TGLU.fm.intel.com>

On 04.08.23 16:13, Matthew Brost wrote:
> [SNIP]
> Christian / Daniel - I've read both of your comments and am having a
> hard time parsing them. I do not really understand the issue with this
> patch or exactly what is being suggested instead. Let's try to work
> through this.
>
>>>>> I still strongly frown on this.
>>>>>
>>>>> If you need this functionality then let the drivers decide which
>>>>> runqueue the scheduler should use.
> What do you mean by runqueue here? Do you mean 'struct
> workqueue_struct'? The scheduler in this context is 'struct
> drm_gpu_scheduler', right?

Sorry for the confusing wording; your understanding is correct.

> Yes, we have added this functionality in the first patch.
>
>>>>> When you then create a single threaded runqueue you can just submit work
>>>>> to it and serialize this with the scheduler work.
>>>>>
> We don't want to use a single threaded workqueue_struct in Xe, we want
> to use the system_wq, as run_job() can be executed in parallel across
> multiple entities (or drm_gpu_schedulers, since in Xe we have a 1 to 1
> relationship between entity and scheduler). What we want is to be able,
> at per entity / scheduler granularity, to communicate a message into
> the backend synchronously (run_job / free_job not executing, scheduler
> execution not paused for a reset).
>
> If I'm understanding what you are suggesting, in Xe we'd create an
> ordered workqueue_struct per drm_gpu_scheduler and then queue messages
> on that ordered workqueue_struct?

Yes, correct.
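
To make that concrete, here is a minimal driver-side sketch of what I
have in mind (the xe_* names and the message handler are purely
illustrative, not existing API; it assumes the scheduler instance was
created on an ordered workqueue, as patch 1 of this series allows):

#include <linux/slab.h>
#include <linux/workqueue.h>

struct xe_sched_msg_work {
	struct work_struct work;
	unsigned int opcode;
	void *data;
};

static void xe_sched_msg_fn(struct work_struct *w)
{
	struct xe_sched_msg_work *msg =
		container_of(w, struct xe_sched_msg_work, work);

	/* The workqueue is ordered, so this never runs concurrently with
	 * the scheduler work item of the same instance, i.e. never in
	 * parallel with run_job()/free_job() of that scheduler. */
	xe_process_msg(msg->opcode, msg->data); /* illustrative handler */
	kfree(msg);
}

static int xe_sched_send_msg(struct workqueue_struct *sched_wq,
			     unsigned int opcode, void *data)
{
	struct xe_sched_msg_work *msg = kzalloc(sizeof(*msg), GFP_KERNEL);

	if (!msg)
		return -ENOMEM;

	msg->opcode = opcode;
	msg->data = data;
	INIT_WORK(&msg->work, xe_sched_msg_fn);

	/* sched_wq is the ordered workqueue the scheduler runs on. */
	queue_work(sched_wq, &msg->work);
	return 0;
}

If a caller needs synchronous delivery, the message can just as well be
owned by the caller and waited for with flush_work() (or a completion
signaled from the handler) instead of the fire-and-forget allocation
shown above.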

> This seems pretty messy to me, as now we have open coded a solution
> bypassing the scheduler, every drm_gpu_scheduler creates its own
> workqueue_struct, and we'd also have to open code the pausing of these
> messages for resets too.
>
> IMO this is a pretty clean solution that follows the pattern of the
> cleanup jobs already in place.

Yeah, exactly, that's the point. Moving the job cleanup into the
scheduler thread is something I consider a very, very bad idea.

And I really don't want to exercise that again for different use cases.

>
>>>>> This way we wouldn't duplicate this core kernel function inside the
>>>>> scheduler.
>>>> Yeah that's essentially the design we picked for the tdr workers,
>>>> where some drivers have requirements that all tdr work must be done on
>>>> the same thread (because of cross-engine coordination issues). But
>>>> that would require that we rework the scheduler as a pile of
>>>> self-submitting work items, and I'm not sure that actually fits all
>>>> that well into the core workqueue interfaces either.
> That is about ordering between TDRs firing on different
> drm_gpu_schedulers and larger external resets (a GT reset in Xe); an
> ordered workqueue_struct makes sense for that. Here we are talking
> about ordering jobs and messages within a single drm_gpu_scheduler.
> Using the main execution thread to do the ordering makes sense in my
> opinion.

I completely disagree with that.

Take a look at how this came to be. This is a very, very ugly hack, and
we already had a hard time making lockdep understand the different fence
signaling dependencies involved in freeing the job, and I'm pretty sure
that is still not 100% correct.

>
>>> There were already patches floating around which did exactly that.
>>>
>>> Last time I checked those were actually looking pretty good.
>>>
> Could you link the patches for reference?
>
>>> In addition to the message passing advantage, the really big issue
>>> with the scheduler and the 1 to 1 mapping is that we create a kernel
>>> thread for each instance, which results in tons of overhead.
> The first patch in the series switches from a kthread to a work queue;
> that is still a good idea.

This was the patch I was referring to. Sorry, I didn't remember that it
was in the same patch set.

>
>>> Just using a work item which is submitted to a work queue completely avoids
>>> that.
>> Hm, I should have read the entire series first, since that still does
>> the conversion. Apologies for the confusion, and yeah, we should be
>> able to just submit other work to the same wq with the first patch?
>> And so hand-rolling this infra here isn't needed at all?
>>
> I wouldn't call this hand rolling; rather, it follows the pattern
> already in place.

Basically, workqueues are the in-kernel infrastructure for exactly this
use case, and trying to re-create that inside the scheduler is usually a
rather bad idea.
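
For illustration, the setup side can then be as small as this (again a
sketch; the workqueue name is made up and the drm_sched_init() hookup
from patch 1 is elided):

#include <linux/workqueue.h>

static struct workqueue_struct *xe_sched_create_ordered_wq(void)
{
	/* Ordered workqueue: at most one work item executes at a time,
	 * in submission order, so the scheduler work item (patch 1) and
	 * any driver message work items serialize against each other
	 * without extra locking in the scheduler core. */
	return alloc_ordered_workqueue("xe-sched-msg", 0);
}

A flush_workqueue() on that wq also gives a natural point to wait for
everything in flight, e.g. as part of a reset sequence.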

Regards,
Christian.

>
> Matt
>
>> Or what am I missing?
>>
>>> Regards,
>>> Christian.
>>>
>>>> Worst case I think this isn't a dead-end and can be refactored to
>>>> internally use the workqueue services, with the new functions here
>>>> just being dumb wrappers until everyone is converted over. So it
>>>> doesn't look like an expensive mistake, if it turns out to be a
>>>> mistake.
>>>> -Daniel
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++-
>>>>>>     include/drm/gpu_scheduler.h            | 29 +++++++++++++-
>>>>>>     2 files changed, 78 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 2597fb298733..84821a124ca2 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>>>>>>     }
>>>>>>     EXPORT_SYMBOL(drm_sched_pick_best);
>>>>>>
>>>>>> +/**
>>>>>> + * drm_sched_add_msg - add scheduler message
>>>>>> + *
>>>>>> + * @sched: scheduler instance
>>>>>> + * @msg: message to be added
>>>>>> + *
>>>>>> + * Can and will pass any jobs waiting on dependencies or in a runnable queue.
>>>>>> + * Message processing will stop if the scheduler run wq is stopped and resume
>>>>>> + * when the run wq is started.
>>>>>> + */
>>>>>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>>>>>> +                    struct drm_sched_msg *msg)
>>>>>> +{
>>>>>> +     spin_lock(&sched->job_list_lock);
>>>>>> +     list_add_tail(&msg->link, &sched->msgs);
>>>>>> +     spin_unlock(&sched->job_list_lock);
>>>>>> +
>>>>>> +     drm_sched_run_wq_queue(sched);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_sched_add_msg);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_sched_get_msg - get scheduler message
>>>>>> + *
>>>>>> + * @sched: scheduler instance
>>>>>> + *
>>>>>> + * Returns NULL or message
>>>>>> + */
>>>>>> +static struct drm_sched_msg *
>>>>>> +drm_sched_get_msg(struct drm_gpu_scheduler *sched)
>>>>>> +{
>>>>>> +     struct drm_sched_msg *msg;
>>>>>> +
>>>>>> +     spin_lock(&sched->job_list_lock);
>>>>>> +     msg = list_first_entry_or_null(&sched->msgs,
>>>>>> +                                    struct drm_sched_msg, link);
>>>>>> +     if (msg)
>>>>>> +             list_del(&msg->link);
>>>>>> +     spin_unlock(&sched->job_list_lock);
>>>>>> +
>>>>>> +     return msg;
>>>>>> +}
>>>>>> +
>>>>>>     /**
>>>>>>      * drm_sched_main - main scheduler thread
>>>>>>      *
>>>>>> @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w)
>>>>>>                 container_of(w, struct drm_gpu_scheduler, work_run);
>>>>>>         struct drm_sched_entity *entity;
>>>>>>         struct drm_sched_job *cleanup_job;
>>>>>> +     struct drm_sched_msg *msg;
>>>>>>         int r;
>>>>>>
>>>>>>         if (READ_ONCE(sched->pause_run_wq))
>>>>>> @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w)
>>>>>>
>>>>>>         cleanup_job = drm_sched_get_cleanup_job(sched);
>>>>>>         entity = drm_sched_select_entity(sched);
>>>>>> +     msg = drm_sched_get_msg(sched);
>>>>>>
>>>>>> -     if (!entity && !cleanup_job)
>>>>>> +     if (!entity && !cleanup_job && !msg)
>>>>>>                 return; /* No more work */
>>>>>>
>>>>>>         if (cleanup_job)
>>>>>>                 sched->ops->free_job(cleanup_job);
>>>>>> +     if (msg)
>>>>>> +             sched->ops->process_msg(msg);
>>>>>>
>>>>>>         if (entity) {
>>>>>>                 struct dma_fence *fence;
>>>>>> @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w)
>>>>>>                 sched_job = drm_sched_entity_pop_job(entity);
>>>>>>                 if (!sched_job) {
>>>>>>                         complete_all(&entity->entity_idle);
>>>>>> -                     if (!cleanup_job)
>>>>>> +                     if (!cleanup_job && !msg)
>>>>>>                                 return; /* No more work */
>>>>>>                         goto again;
>>>>>>                 }
>>>>>> @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>>
>>>>>>         init_waitqueue_head(&sched->job_scheduled);
>>>>>>         INIT_LIST_HEAD(&sched->pending_list);
>>>>>> +     INIT_LIST_HEAD(&sched->msgs);
>>>>>>         spin_lock_init(&sched->job_list_lock);
>>>>>>         atomic_set(&sched->hw_rq_count, 0);
>>>>>>         INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>>> index df1993dd44ae..267bd060d178 100644
>>>>>> --- a/include/drm/gpu_scheduler.h
>>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>>> @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat {
>>>>>>         DRM_GPU_SCHED_STAT_ENODEV,
>>>>>>     };
>>>>>>
>>>>>> +/**
>>>>>> + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue)
>>>>>> + * message
>>>>>> + *
>>>>>> + * Generic enough for backend defined messages, backend can expand if needed.
>>>>>> + */
>>>>>> +struct drm_sched_msg {
>>>>>> +     /** @link: list link into the gpu scheduler list of messages */
>>>>>> +     struct list_head                link;
>>>>>> +     /**
>>>>>> +      * @private_data: opaque pointer to message private data (backend defined)
>>>>>> +      */
>>>>>> +     void                            *private_data;
>>>>>> +     /** @opcode: opcode of message (backend defined) */
>>>>>> +     unsigned int                    opcode;
>>>>>> +};
>>>>>> +
>>>>>>     /**
>>>>>>      * struct drm_sched_backend_ops - Define the backend operations
>>>>>>      *  called by the scheduler
>>>>>> @@ -471,6 +488,12 @@ struct drm_sched_backend_ops {
>>>>>>              * and it's time to clean it up.
>>>>>>          */
>>>>>>         void (*free_job)(struct drm_sched_job *sched_job);
>>>>>> +
>>>>>> +     /**
>>>>>> +      * @process_msg: Process a message. Allowed to block, it is this
>>>>>> +      * function's responsibility to free message if dynamically allocated.
>>>>>> +      */
>>>>>> +     void (*process_msg)(struct drm_sched_msg *msg);
>>>>>>     };
>>>>>>
>>>>>>     /**
>>>>>> @@ -482,6 +505,7 @@ struct drm_sched_backend_ops {
>>>>>>      * @timeout: the time after which a job is removed from the scheduler.
>>>>>>      * @name: name of the ring for which this scheduler is being used.
>>>>>>      * @sched_rq: priority wise array of run queues.
>>>>>> + * @msgs: list of messages to be processed in @work_run
>>>>>>      * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>>>>>>      *                 waits on this wait queue until all the scheduled jobs are
>>>>>>      *                 finished.
>>>>>> @@ -489,7 +513,7 @@ struct drm_sched_backend_ops {
>>>>>>      * @job_id_count: used to assign unique id to the each job.
>>>>>>      * @run_wq: workqueue used to queue @work_run
>>>>>>      * @timeout_wq: workqueue used to queue @work_tdr
>>>>>> - * @work_run: schedules jobs and cleans up entities
>>>>>> + * @work_run: schedules jobs, cleans up jobs, and processes messages
>>>>>>      * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>>>>>>      *            timeout interval is over.
>>>>>>      * @pending_list: the list of jobs which are currently in the job queue.
>>>>>> @@ -513,6 +537,7 @@ struct drm_gpu_scheduler {
>>>>>>         long                            timeout;
>>>>>>         const char                      *name;
>>>>>>         struct drm_sched_rq             sched_rq[DRM_SCHED_PRIORITY_COUNT];
>>>>>> +     struct list_head                msgs;
>>>>>>         wait_queue_head_t               job_scheduled;
>>>>>>         atomic_t                        hw_rq_count;
>>>>>>         atomic64_t                      job_id_count;
>>>>>> @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>>>>
>>>>>>     void drm_sched_job_cleanup(struct drm_sched_job *job);
>>>>>>     void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>>>>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>>>>>> +                    struct drm_sched_msg *msg);
>>>>>>     void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
>>>>>>     void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
>>>>>>     void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch


Thread overview: 24+ messages
2023-08-01 20:50 [PATCH 0/8] DRM scheduler changes for Xe Matthew Brost
2023-08-01 20:50 ` [PATCH 1/8] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
2023-08-03 10:11   ` Tvrtko Ursulin
2023-08-03 14:43     ` Matthew Brost
2023-08-03 14:56       ` Christian König
2023-08-03 15:19         ` Tvrtko Ursulin
2023-08-03 15:39       ` Tvrtko Ursulin
2023-08-01 20:50 ` [PATCH 2/8] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
2023-08-01 20:50 ` [PATCH 3/8] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
2023-08-03  8:50   ` Christian König
2023-08-01 20:50 ` [PATCH 4/8] drm/sched: Add generic scheduler message interface Matthew Brost
2023-08-03  8:53   ` Christian König
2023-08-03  8:58     ` Daniel Vetter
2023-08-03  9:35       ` Christian König
2023-08-04  8:50         ` Daniel Vetter
2023-08-04 14:13           ` Matthew Brost
2023-08-07 15:46             ` Christian König [this message]
2023-08-08 14:06               ` Matthew Brost
2023-08-08 14:14                 ` Christian König
2023-08-09 14:36                   ` Matthew Brost
2023-08-01 20:51 ` [PATCH 5/8] drm/sched: Add drm_sched_start_timeout_unlocked helper Matthew Brost
2023-08-01 20:51 ` [PATCH 6/8] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
2023-08-01 20:51 ` [PATCH 7/8] drm/sched: Submit job before starting TDR Matthew Brost
2023-08-01 20:51 ` [PATCH 8/8] drm/sched: Add helper to set TDR timeout Matthew Brost
