Turns out Steven's patch was already in, so I just cherry-picked the change
from drm-next-misc. Emily - it's in.

Andrey

On 12/3/19 2:59 PM, Deucher, Alexander wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Cherry pick whatever dependencies you need or pick the older version
> of the patch. Either way works.
>
> Alex
> ------------------------------------------------------------------------
> *From:* Grodzovsky, Andrey
> *Sent:* Tuesday, December 3, 2019 2:57 PM
> *To:* Deucher, Alexander; Deng, Emily
> *Cc:* dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> Koenig, Christian; steven.price@arm.com
> *Subject:* Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
>
> I don't think I can apply this patch 'as is', as it has a dependency on a
> patch by Steven which also wasn't applied yet - 588b982 Steven Price
> 6 weeks ago    drm: Don't free jobs in wait_event_interruptible()
>
> Andrey
>
> On 12/3/19 2:44 PM, Deucher, Alexander wrote:
>>
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Please go ahead and apply whatever version is necessary for
>> amd-staging-drm-next.
>>
>> Alex
>>
>> ------------------------------------------------------------------------
>> *From:* Grodzovsky, Andrey
>> *Sent:* Tuesday, December 3, 2019 2:10 PM
>> *To:* Deng, Emily; Deucher, Alexander
>> *Cc:* dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> Koenig, Christian; steven.price@arm.com
>> *Subject:* Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
>>
>> Yes - Christian just pushed it to drm-next-misc - I guess Alex/Christian
>> didn't pull to amd-staging-drm-next yet.
>>
>> Andrey
>>
>> On 12/2/19 2:24 PM, Deng, Emily wrote:
>> > [AMD Official Use Only - Internal Distribution Only]
>> >
>> > Hi Andrey,
>> >     Seems this patch is still not in amd-staging-drm-next?
>> >
>> > Best wishes
>> > Emily Deng
>> >
>> >> -----Original Message-----
>> >> From: Deng, Emily
>> >> Sent: Tuesday, November 26, 2019 4:41 PM
>> >> To: Grodzovsky, Andrey
>> >> Cc: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> >> Koenig, Christian; steven.price@arm.com
>> >> Subject: RE: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
>> >>
>> >> [AMD Official Use Only - Internal Distribution Only]
>> >>
>> >> Reviewed-by: Emily Deng
>> >>
>> >>> -----Original Message-----
>> >>> From: Grodzovsky, Andrey
>> >>> Sent: Tuesday, November 26, 2019 7:37 AM
>> >>> Cc: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> >>> Koenig, Christian; Deng, Emily; steven.price@arm.com
>> >>> Subject: Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
>> >>>
>> >>> Ping
>> >>>
>> >>> Andrey
>> >>>
>> >>> On 11/25/19 3:51 PM, Andrey Grodzovsky wrote:
>> >>>> Problem:
>> >>>> Due to a race between drm_sched_cleanup_jobs in the sched thread and
>> >>>> drm_sched_job_timedout in the timeout work, there is a possibility
>> >>>> that the bad job was already freed while still being accessed from
>> >>>> the timeout thread.
>> >>>>
>> >>>> Fix:
>> >>>> Instead of just peeking at the bad job in the mirror list, remove it
>> >>>> from the list under lock and then put it back later when we are
>> >>>> guaranteed no race with the main sched thread is possible, which is
>> >>>> after the thread is parked.
>> >>>>
>> >>>> v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs.
>> >>>>
>> >>>> v3: Rebase on top of drm-misc-next. v2 is not needed anymore as
>> >>>> drm_sched_get_cleanup_job already has a lock there.
>> >>>>
>> >>>> v4: Fix comments to reflect latest code in drm-misc.
>> >>>>
>> >>>> Signed-off-by: Andrey Grodzovsky
>> >>>> Reviewed-by: Christian König
>> >>>> Tested-by: Emily Deng
>> >>>> ---
>> >>>>  drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++++
>> >>>>  1 file changed, 27 insertions(+)
>> >>>>
>> >>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> >>>> index 6774955..1bf9c40 100644
>> >>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> >>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> >>>> @@ -284,10 +284,21 @@ static void drm_sched_job_timedout(struct work_struct *work)
>> >>>>  	unsigned long flags;
>> >>>>
>> >>>>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>> >>>> +
>> >>>> +	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>> >>>> +	spin_lock_irqsave(&sched->job_list_lock, flags);
>> >>>>  	job = list_first_entry_or_null(&sched->ring_mirror_list,
>> >>>>  				       struct drm_sched_job, node);
>> >>>>
>> >>>>  	if (job) {
>> >>>> +		/*
>> >>>> +		 * Remove the bad job so it cannot be freed by concurrent
>> >>>> +		 * drm_sched_cleanup_jobs. It will be reinserted back after
>> >>>> +		 * sched->thread is parked at which point it's safe.
>> >>>> +		 */
>> >>>> +		list_del_init(&job->node);
>> >>>> +		spin_unlock_irqrestore(&sched->job_list_lock, flags);
>> >>>> +
>> >>>>  		job->sched->ops->timedout_job(job);
>> >>>>
>> >>>>  		/*
>> >>>> @@ -298,6 +309,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
>> >>>>  			job->sched->ops->free_job(job);
>> >>>>  			sched->free_guilty = false;
>> >>>>  		}
>> >>>> +	} else {
>> >>>> +		spin_unlock_irqrestore(&sched->job_list_lock, flags);
>> >>>>  	}
>> >>>>
>> >>>>  	spin_lock_irqsave(&sched->job_list_lock, flags);
>> >>>> @@ -370,6 +383,20 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>> >>>>  	kthread_park(sched->thread);
>> >>>>
>> >>>>  	/*
>> >>>> +	 * Reinsert back the bad job here - now it's safe as
>> >>>> +	 * drm_sched_get_cleanup_job cannot race against us and release the
>> >>>> +	 * bad job at this point - we parked (waited for) any in progress
>> >>>> +	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be
>> >>>> +	 * called now until the scheduler thread is unparked.
>> >>>> +	 */
>> >>>> +	if (bad && bad->sched == sched)
>> >>>> +		/*
>> >>>> +		 * Add at the head of the queue to reflect it was the earliest
>> >>>> +		 * job extracted.
>> >>>> +		 */
>> >>>> +		list_add(&bad->node, &sched->ring_mirror_list);
>> >>>> +
>> >>>> +	/*
>> >>>>  	 * Iterate the job list from later to earlier one and either deactive
>> >>>>  	 * their HW callbacks or remove them from mirror list if they already
>> >>>>  	 * signaled.