I don't think I can apply this patch as is, since it depends on a patch by Steven which also wasn't applied yet:

588b982 Steven Price        6 weeks ago    drm: Don't free jobs in wait_event_interruptible()

Andrey

On 12/3/19 2:44 PM, Deucher, Alexander wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Please go ahead and apply whatever version is necessary for
> amd-staging-drm-next.
>
> Alex
>
> ------------------------------------------------------------------------
> *From:* Grodzovsky, Andrey
> *Sent:* Tuesday, December 3, 2019 2:10 PM
> *To:* Deng, Emily; Deucher, Alexander
> *Cc:* dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> Koenig, Christian; steven.price@arm.com
> *Subject:* Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
>
> Yes - Christian just pushed it to drm-misc-next - I guess Alex/Christian
> didn't pull to amd-staging-drm-next yet.
>
> Andrey
>
> On 12/2/19 2:24 PM, Deng, Emily wrote:
> > [AMD Official Use Only - Internal Distribution Only]
> >
> > Hi Andrey,
> >      Seems this patch is still not in amd-staging-drm-next?
> >
> > Best wishes
> > Emily Deng
> >
> >> -----Original Message-----
> >> From: Deng, Emily
> >> Sent: Tuesday, November 26, 2019 4:41 PM
> >> To: Grodzovsky, Andrey
> >> Cc: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> >> Koenig, Christian; steven.price@arm.com
> >> Subject: RE: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
> >>
> >> [AMD Official Use Only - Internal Distribution Only]
> >>
> >> Reviewed-by: Emily Deng
> >>
> >>> -----Original Message-----
> >>> From: Grodzovsky, Andrey
> >>> Sent: Tuesday, November 26, 2019 7:37 AM
> >>> Cc: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> >>> Koenig, Christian; Deng, Emily; steven.price@arm.com
> >>> Subject: Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
> >>>
> >>> Ping
> >>>
> >>> Andrey
> >>>
> >>> On 11/25/19 3:51 PM, Andrey Grodzovsky wrote:
> >>>> Problem:
> >>>> Due to a race between drm_sched_cleanup_jobs in the sched thread and
> >>>> drm_sched_job_timedout in the timeout work there is a possibility
> >>>> that the bad job was already freed while still being accessed from
> >>>> the timeout thread.
> >>>>
> >>>> Fix:
> >>>> Instead of just peeking at the bad job in the mirror list, remove it
> >>>> from the list under lock and then put it back later, when we are
> >>>> guaranteed that no race with the main sched thread is possible, which
> >>>> is after the thread is parked.
> >>>>
> >>>> v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs.
> >>>>
> >>>> v3: Rebase on top of drm-misc-next. v2 is not needed anymore as
> >>>> drm_sched_get_cleanup_job already has a lock there.
> >>>>
> >>>> v4: Fix comments to reflect latest code in drm-misc.
> >>>>
> >>>> Signed-off-by: Andrey Grodzovsky
> >>>> Reviewed-by: Christian König
> >>>> Tested-by: Emily Deng
> >>>> ---
> >>>>  drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++++
> >>>>  1 file changed, 27 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>>> index 6774955..1bf9c40 100644
> >>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>>> @@ -284,10 +284,21 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >>>>  	unsigned long flags;
> >>>>
> >>>>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >>>> +
> >>>> +	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> >>>> +	spin_lock_irqsave(&sched->job_list_lock, flags);
> >>>>  	job = list_first_entry_or_null(&sched->ring_mirror_list,
> >>>>  				       struct drm_sched_job, node);
> >>>>
> >>>>  	if (job) {
> >>>> +		/*
> >>>> +		 * Remove the bad job so it cannot be freed by concurrent
> >>>> +		 * drm_sched_cleanup_jobs. It will be reinserted back after
> >>>> +		 * sched->thread is parked, at which point it's safe.
> >>>> +		 */
> >>>> +		list_del_init(&job->node);
> >>>> +		spin_unlock_irqrestore(&sched->job_list_lock, flags);
> >>>> +
> >>>>  		job->sched->ops->timedout_job(job);
> >>>>
> >>>>  		/*
> >>>> @@ -298,6 +309,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >>>>  			job->sched->ops->free_job(job);
> >>>>  			sched->free_guilty = false;
> >>>>  		}
> >>>> +	} else {
> >>>> +		spin_unlock_irqrestore(&sched->job_list_lock, flags);
> >>>>  	}
> >>>>
> >>>>  	spin_lock_irqsave(&sched->job_list_lock, flags);
> >>>> @@ -370,6 +383,20 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >>>>  	kthread_park(sched->thread);
> >>>>
> >>>>  	/*
> >>>> +	 * Reinsert back the bad job here - now it's safe as
> >>>> +	 * drm_sched_get_cleanup_job cannot race against us and release the
> >>>> +	 * bad job at this point - we parked (waited for) any in-progress
> >>>> +	 * (earlier) cleanups, and drm_sched_get_cleanup_job will not be
> >>>> +	 * called now until the scheduler thread is unparked.
> >>>> +	 */
> >>>> +	if (bad && bad->sched == sched)
> >>>> +		/*
> >>>> +		 * Add at the head of the queue to reflect it was the earliest
> >>>> +		 * job extracted.
> >>>> +		 */
> >>>> +		list_add(&bad->node, &sched->ring_mirror_list);
> >>>> +
> >>>> +	/*
> >>>>  	 * Iterate the job list from later to earlier one and either
> >>>>  	 * deactivate their HW callbacks or remove them from the mirror
> >>>>  	 * list if they already signaled.
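
For anyone who wants to reason about the locking pattern outside the kernel tree, here is a rough userspace analogue of what the patch does: unlink the suspect entry from the shared list *under the lock*, do the slow timeout handling, and reinsert it only once the other thread is known to be stopped. This is a minimal sketch only - the struct, the function names (grab_bad_job, reinsert_bad_job), and the pthread mutex standing in for job_list_lock and kthread parking are all made up for illustration, not the actual drm_sched code:

	#include <pthread.h>
	#include <stdio.h>

	struct job {
		struct job *prev, *next;   /* doubly-linked "mirror list" */
		int id;
	};

	/* Sentinel list head, like the kernel's struct list_head. */
	static struct job head = { &head, &head, -1 };

	/* Stand-in for sched->job_list_lock. */
	static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

	static void list_del_init_job(struct job *j)
	{
		j->prev->next = j->next;
		j->next->prev = j->prev;
		j->next = j->prev = j;     /* re-init, like list_del_init() */
	}

	static void list_add_head_job(struct job *j)
	{
		j->next = head.next;
		j->prev = &head;
		head.next->prev = j;
		head.next = j;
	}

	/*
	 * Timeout path: peek AND unlink under the lock, so a concurrent
	 * cleanup thread can never free the job while we still hold a
	 * pointer to it.
	 */
	static struct job *grab_bad_job(void)
	{
		struct job *bad;

		pthread_mutex_lock(&list_lock);
		bad = (head.next != &head) ? head.next : NULL;
		if (bad)
			list_del_init_job(bad);
		pthread_mutex_unlock(&list_lock);
		return bad;
	}

	/*
	 * Stop path: call this only once the cleanup thread is parked or
	 * joined, i.e. after the equivalent of kthread_park(sched->thread).
	 * Only then is it safe to put the bad job back at the head.
	 */
	static void reinsert_bad_job(struct job *bad)
	{
		if (bad)
			list_add_head_job(bad);
	}

	int main(void)
	{
		struct job j = { &j, &j, 42 };

		list_add_head_job(&j);
		struct job *bad = grab_bad_job();
		printf("grabbed job %d off the list\n", bad ? bad->id : -1);
		/* ... handle the timeout, park the worker ... */
		reinsert_bad_job(bad);
		printf("head of list is job %d again\n", head.next->id);
		return 0;
	}

The ordering is the whole point: the entry goes back on the list only after the freeing path can no longer run, which is exactly what reinserting after kthread_park() guarantees in the real patch.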