Was this fix independent of the other discussions? Should this be
applied to drm-misc?

Alex
On Wed, Sep 1, 2021 at 4:42 PM Alex Deucher
<alexdeucher@gmail.com> wrote:
>
> On Wed, Sep 1, 2021 at 2:50 AM Christian König
>
<ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Am 01.09.21 um 02:46 schrieb Monk Liu:
> > > issue:
> > > in cleanup_job the cancel_delayed_work will cancel a TO timer
> > > even if its corresponding job is still running.
> > >
> > > fix:
> > > do not cancel the timer in cleanup_job; instead, do the cancelling
> > > only when the head job is signaled, and if there is a "next" job,
> > > start the timeout again via drm_sched_start_timeout().
> > >
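
[Aside for readers following along: drm_sched_start_timeout() is the
helper that (re)arms the work_tdr delayed work. Modulo tree drift it
looks roughly like the sketch below; check drm-misc for the exact body.]

  static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
  {
          /* arm the TDR timer only if timeouts are enabled and there
           * is still a job on the pending list to watch over
           */
          if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
              !list_empty(&sched->pending_list))
                  queue_delayed_work(sched->timeout_wq, &sched->work_tdr,
                                     sched->timeout);
  }
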
> > > v2:
> > > further clean up the logic, and do the TDR timer cancelling if the
> > > signaled job is the last one in its scheduler.
> > >
> > > v3:
> > > change the issue description
> > > remove the cancel_delayed_work at the beginning of cleanup_job
> > > restore the implementation of drm_sched_job_begin.
> > >
> > > v4:
> > > remove the kthread_should_park() check in the cleanup_job routine;
> > > we should clean up the signaled job asap
> > >
> > > TODO:
> > > 1) introduce pause/resume of the scheduler in job_timeout to
> > > serialize the handling of the scheduler and job_timeout.
> > > 2) drop the bad job's delete-and-reinsert in the scheduler, which
> > > the above serialization makes unnecessary (no race issue anymore
> > > with the serialization).
> > >
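
[Aside: a rough sketch of what TODO 1) might look like, purely
illustrative. The kthread_park()/kthread_unpark() pair is my assumed
mechanism, not something this patch implements; drm_sched_job_timedout()
is the existing work_tdr handler.]

  static void drm_sched_job_timedout(struct work_struct *work)
  {
          struct drm_gpu_scheduler *sched =
                  container_of(work, struct drm_gpu_scheduler,
                               work_tdr.work);

          /* pause: once parked, the scheduler main loop no longer
           * touches pending_list, so the handler owns it here
           */
          kthread_park(sched->thread);

          /* ... pick the bad job off sched->pending_list and call
           * sched->ops->timedout_job() on it ...
           */

          /* resume normal scheduling */
          kthread_unpark(sched->thread);
  }
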
> > > Tested-by: jingwen <jingwen.chen@amd.com>
> > > Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> >
> > Reviewed-by: Christian König <christian.koenig@amd.com>
> >
>
> Are you planning to push this to drm-misc?
>
> Alex
>
>
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_main.c | 26 +++++++++-----------------
> > >  1 file changed, 9 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index a2a9536..3e0bbc7 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> > >  {
> > >         struct drm_sched_job *job, *next;
> > >
> > > -       /*
> > > -        * Don't destroy jobs while the timeout worker is running OR thread
> > > -        * is being parked and hence assumed to not touch pending_list
> > > -        */
> > > -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > > -           !cancel_delayed_work(&sched->work_tdr)) ||
> > > -           kthread_should_park())
> > > -               return NULL;
> > > -
> > >         spin_lock(&sched->job_list_lock);
> > >
> > >         job = list_first_entry_or_null(&sched->pending_list,
> > > @@ -693,17 +684,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> > >         if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> > >                 /* remove job from pending_list */
> > >                 list_del_init(&job->list);
> > > +
> > > +               /* cancel this job's TO timer */
> > > +               cancel_delayed_work(&sched->work_tdr);
> > >                 /* make the scheduled timestamp more accurate */
> > >                 next = list_first_entry_or_null(&sched->pending_list,
> > >                                                 typeof(*next), list);
> > > -               if (next)
> > > +
> > > +               if (next) {
> > >                         next->s_fence->scheduled.timestamp =
> > >                                 job->s_fence->finished.timestamp;
> > > -
> > > +                       /* start TO timer for next job */
> > > +                       drm_sched_start_timeout(sched);
> > > +               }
> > >         } else {
> > >                 job = NULL;
> > > -               /* queue timeout for next job */
> > > -               drm_sched_start_timeout(sched);
> > >         }
> > >
> > >         spin_unlock(&sched->job_list_lock);
> > > @@ -791,11 +786,8 @@ static int drm_sched_main(void *param)
> > >                     (entity = drm_sched_select_entity(sched))) ||
> > >                     kthread_should_stop());
> > >
> > > -               if (cleanup_job) {
> > > +               if (cleanup_job)
> > >                         sched->ops->free_job(cleanup_job);
> > > -                       /* queue timeout for next job */
> > > -                       drm_sched_start_timeout(sched);
> > > -               }
> > >
> > >                 if (!entity)
> > >                         continue;
> >
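
[For anyone trying to follow the issue being fixed: as I read the old
code, the pre-patch problem is roughly this timeline (my paraphrase,
not from the patch itself):

  t0: job H is submitted, drm_sched_job_begin() arms work_tdr
  t1: some other job signals and wakes the scheduler main loop
  t2: drm_sched_get_cleanup_job() calls cancel_delayed_work(&sched->work_tdr),
      killing H's timer even though H is still running
  t3: H has not signaled, so the function returns NULL and
      drm_sched_start_timeout() re-arms the timer from zero
  t4: t1..t3 repeat on every wakeup, so a hung H can keep having its
      timeout pushed back and may never trigger TDR

With the patch, the timer is only cancelled once the head job has
actually signaled, and is then re-armed for the next job, which closes
that window.]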