[PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."


* [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
@ 2021-08-18 11:21 Jingwen Chen
  2021-08-18 14:02 ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Jingwen Chen @ 2021-08-18 11:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: monk.liu, christian.koenig, Andrey.Grodzovsky, Jingwen Chen

[Why]
For a bailing job, the commit being reverted deletes it from the pending
list, so the bailing job never has a chance to be resubmitted, even in
advance tdr mode.

[How]
After hw_fence was embedded into amdgpu_job, the race condition that the
reverted commit tries to work around is completely solved, so revert
that commit.
This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
v2:
add dma_fence_get/put() around timedout_job to avoid concurrent delete
during processing timedout_job

Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a953693b45..f9b9b3aefc4a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	struct dma_fence *fence;
 	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
@@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
 
 	if (job) {
 		/*
-		 * Remove the bad job so it cannot be freed by concurrent
-		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-		 * is parked at which point it's safe.
+		 * Get job->s_fence->parent here to avoid concurrent delete during
+		 * processing timedout_job
 		 */
-		list_del_init(&job->list);
+		fence = dma_fence_get(job->s_fence->parent);
 		spin_unlock(&sched->job_list_lock);
 
 		status = job->sched->ops->timedout_job(job);
@@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 			job->sched->ops->free_job(job);
 			sched->free_guilty = false;
 		}
+		dma_fence_put(fence);
 	} else {
 		spin_unlock(&sched->job_list_lock);
 	}
@@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 
 	kthread_park(sched->thread);
 
-	/*
-	 * Reinsert back the bad job here - now it's safe as
-	 * drm_sched_get_cleanup_job cannot race against us and release the
-	 * bad job at this point - we parked (waited for) any in progress
-	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-	 * now until the scheduler thread is unparked.
-	 */
-	if (bad && bad->sched == sched)
-		/*
-		 * Add at the head of the queue to reflect it was the earliest
-		 * job extracted.
-		 */
-		list_add(&bad->list, &sched->pending_list);
-
 	/*
 	 * Iterate the job list from later to  earlier one and either deactive
 	 * their HW callbacks or remove them from pending list if they already
-- 
2.25.1



* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 11:21 [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job." Jingwen Chen
@ 2021-08-18 14:02 ` Alex Deucher
  2021-08-18 14:26   ` Andrey Grodzovsky
  2021-08-18 14:29   ` Daniel Vetter
  0 siblings, 2 replies; 30+ messages in thread
From: Alex Deucher @ 2021-08-18 14:02 UTC (permalink / raw)
  To: Jingwen Chen, Maling list - DRI developers
  Cc: amd-gfx list, monk.liu, Christian Koenig, Andrey Grodzovsky

+ dri-devel

Since scheduler is a shared component, please add dri-devel on all
scheduler patches.

On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>
> [Why]
> for bailing job, this commit will delete it from pending list thus the
> bailing job will never have a chance to be resubmitted even in advance
> tdr mode.
>
> [How]
> after embeded hw_fence into amdgpu_job is done, the race condition that
> this commit tries to work around is completely solved.So revert this
> commit.
> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> v2:
> add dma_fence_get/put() around timedout_job to avoid concurrent delete
> during processing timedout_job
>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a953693b45..f9b9b3aefc4a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  {
>         struct drm_gpu_scheduler *sched;
>         struct drm_sched_job *job;
> +       struct dma_fence *fence;
>         enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>
>         sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>
>         if (job) {
>                 /*
> -                * Remove the bad job so it cannot be freed by concurrent
> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -                * is parked at which point it's safe.
> +                * Get job->s_fence->parent here to avoid concurrent delete during
> +                * processing timedout_job
>                  */
> -               list_del_init(&job->list);
> +               fence = dma_fence_get(job->s_fence->parent);
>                 spin_unlock(&sched->job_list_lock);
>
>                 status = job->sched->ops->timedout_job(job);
> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>                         job->sched->ops->free_job(job);
>                         sched->free_guilty = false;
>                 }
> +               dma_fence_put(fence);
>         } else {
>                 spin_unlock(&sched->job_list_lock);
>         }
> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>
>         kthread_park(sched->thread);
>
> -       /*
> -        * Reinsert back the bad job here - now it's safe as
> -        * drm_sched_get_cleanup_job cannot race against us and release the
> -        * bad job at this point - we parked (waited for) any in progress
> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -        * now until the scheduler thread is unparked.
> -        */
> -       if (bad && bad->sched == sched)
> -               /*
> -                * Add at the head of the queue to reflect it was the earliest
> -                * job extracted.
> -                */
> -               list_add(&bad->list, &sched->pending_list);
> -
>         /*
>          * Iterate the job list from later to  earlier one and either deactive
>          * their HW callbacks or remove them from pending list if they already
> --
> 2.25.1
>


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:02 ` Alex Deucher
@ 2021-08-18 14:26   ` Andrey Grodzovsky
  2021-08-18 14:32     ` Daniel Vetter
  2021-08-18 14:29   ` Daniel Vetter
  1 sibling, 1 reply; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-18 14:26 UTC (permalink / raw)
  To: Alex Deucher, Jingwen Chen, Maling list - DRI developers
  Cc: amd-gfx list, monk.liu, Christian Koenig

On 2021-08-18 10:02 a.m., Alex Deucher wrote:

> + dri-devel
>
> Since scheduler is a shared component, please add dri-devel on all
> scheduler patches.
>
> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>> [Why]
>> for bailing job, this commit will delete it from pending list thus the
>> bailing job will never have a chance to be resubmitted even in advance
>> tdr mode.
>>
>> [How]
>> after embeded hw_fence into amdgpu_job is done, the race condition that
>> this commit tries to work around is completely solved.So revert this
>> commit.
>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>> v2:
>> add dma_fence_get/put() around timedout_job to avoid concurrent delete
>> during processing timedout_job
>>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>   1 file changed, 5 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a2a953693b45..f9b9b3aefc4a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>   {
>>          struct drm_gpu_scheduler *sched;
>>          struct drm_sched_job *job;
>> +       struct dma_fence *fence;
>>          enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>
>>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>
>>          if (job) {
>>                  /*
>> -                * Remove the bad job so it cannot be freed by concurrent
>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>> -                * is parked at which point it's safe.
>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>> +                * processing timedout_job
>>                   */
>> -               list_del_init(&job->list);
>> +               fence = dma_fence_get(job->s_fence->parent);


While this is true for amdgpu, it has no meaning for other drivers for
which we haven't done the refactoring of embedding the HW fence (parent)
into the job structure. In fact, thinking about it, unless you do the HW
fence embedding for all the drivers using the scheduler, you cannot
revert this patch or you will just break them.

Andrey
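
To make the objection above concrete, here is a minimal user-space sketch
in C. It is not kernel code: toy_fence, toy_job and every toy_* function
are hypothetical names, and the "embedded" flag is an assumption that
models a driver whose job memory is released only when the last reference
on its HW fence is dropped, as the thread describes for amdgpu. The sketch
suggests that an extra fence reference keeps the job alive only in that
embedded layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for illustration; none of these are kernel types. */
struct toy_fence {
	int refcount;
	bool released;            /* set when the last reference is dropped */
};

struct toy_job {
	struct toy_fence *parent; /* HW fence the job is tied to */
	bool embedded;            /* assumption: fence storage lives in the job */
	bool freed;               /* set when the job memory would be released */
};

/* NULL-tolerant get, returning the fence it was given. */
static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
	if (f)
		f->refcount++;
	return f;
}

/* NULL-tolerant put; returns true when this call dropped the last reference. */
static bool toy_fence_put(struct toy_fence *f)
{
	if (!f)
		return false;
	assert(f->refcount > 0);
	if (--f->refcount == 0) {
		f->released = true;
		return true;
	}
	return false;
}

/* Models a free_job() call arriving from the cleanup path. */
static void toy_free_job(struct toy_job *job)
{
	bool last = toy_fence_put(job->parent);

	/*
	 * Embedded layout: the job goes away only with the last fence
	 * reference. Separate layout: the job is freed immediately.
	 */
	if (!job->embedded || last)
		job->freed = true;
}

/* Returns true if the job is still alive while the timeout handler uses it. */
static bool toy_timedout_sees_live_job(bool embedded)
{
	struct toy_fence fence = { .refcount = 1, .released = false };
	struct toy_job job = { .parent = &fence, .embedded = embedded,
			       .freed = false };
	struct toy_fence *held;
	bool alive;

	held = toy_fence_get(job.parent); /* extra reference, as in v2 */
	toy_free_job(&job);               /* racing cleanup drops the job */
	alive = !job.freed;               /* the handler would touch job here */

	/* Final put: in the embedded layout this is what frees the job. */
	if (toy_fence_put(held) && job.embedded)
		job.freed = true;

	return alive;
}
```

In this model toy_timedout_sees_live_job(true) returns true and
toy_timedout_sees_live_job(false) returns false: with a separately
allocated fence, the extra reference protects the fence but not the job.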


>>                  spin_unlock(&sched->job_list_lock);
>>
>>                  status = job->sched->ops->timedout_job(job);
>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>                          job->sched->ops->free_job(job);
>>                          sched->free_guilty = false;
>>                  }
>> +               dma_fence_put(fence);
>>          } else {
>>                  spin_unlock(&sched->job_list_lock);
>>          }
>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>
>>          kthread_park(sched->thread);
>>
>> -       /*
>> -        * Reinsert back the bad job here - now it's safe as
>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>> -        * bad job at this point - we parked (waited for) any in progress
>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>> -        * now until the scheduler thread is unparked.
>> -        */
>> -       if (bad && bad->sched == sched)
>> -               /*
>> -                * Add at the head of the queue to reflect it was the earliest
>> -                * job extracted.
>> -                */
>> -               list_add(&bad->list, &sched->pending_list);
>> -
>>          /*
>>           * Iterate the job list from later to  earlier one and either deactive
>>           * their HW callbacks or remove them from pending list if they already
>> --
>> 2.25.1
>>


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:02 ` Alex Deucher
  2021-08-18 14:26   ` Andrey Grodzovsky
@ 2021-08-18 14:29   ` Daniel Vetter
  1 sibling, 0 replies; 30+ messages in thread
From: Daniel Vetter @ 2021-08-18 14:29 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Jingwen Chen, Maling list - DRI developers, amd-gfx list,
	monk.liu, Christian Koenig, Andrey Grodzovsky

On Wed, Aug 18, 2021 at 10:02:06AM -0400, Alex Deucher wrote:
> + dri-devel
> 
> Since scheduler is a shared component, please add dri-devel on all
> scheduler patches.

Do we need a MAINTAINERS entry specifically for this, or was it just an
oversight?

> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> >
> > [Why]
> > for bailing job, this commit will delete it from pending list thus the
> > bailing job will never have a chance to be resubmitted even in advance
> > tdr mode.
> >
> > [How]
> > after embeded hw_fence into amdgpu_job is done, the race condition that
> > this commit tries to work around is completely solved.So revert this
> > commit.

Does this also hold for all other drivers? In general the commit message
feels rather rushed and I have no idea what's really going on.

Also at least around tdr there's been some solid clarifications around
how this is supposed to work between tdr and main scheduler thread, would
be good to explain how that all fits together. Or should fit together.
-Daniel

> > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > v2:
> > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > during processing timedout_job
> >
> > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> >  1 file changed, 5 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index a2a953693b45..f9b9b3aefc4a 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >  {
> >         struct drm_gpu_scheduler *sched;
> >         struct drm_sched_job *job;
> > +       struct dma_fence *fence;
> >         enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> >
> >         sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >
> >         if (job) {
> >                 /*
> > -                * Remove the bad job so it cannot be freed by concurrent
> > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > -                * is parked at which point it's safe.
> > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > +                * processing timedout_job
> >                  */
> > -               list_del_init(&job->list);
> > +               fence = dma_fence_get(job->s_fence->parent);
> >                 spin_unlock(&sched->job_list_lock);
> >
> >                 status = job->sched->ops->timedout_job(job);
> > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >                         job->sched->ops->free_job(job);
> >                         sched->free_guilty = false;
> >                 }
> > +               dma_fence_put(fence);
> >         } else {
> >                 spin_unlock(&sched->job_list_lock);
> >         }
> > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >
> >         kthread_park(sched->thread);
> >
> > -       /*
> > -        * Reinsert back the bad job here - now it's safe as
> > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > -        * bad job at this point - we parked (waited for) any in progress
> > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > -        * now until the scheduler thread is unparked.
> > -        */
> > -       if (bad && bad->sched == sched)
> > -               /*
> > -                * Add at the head of the queue to reflect it was the earliest
> > -                * job extracted.
> > -                */
> > -               list_add(&bad->list, &sched->pending_list);
> > -
> >         /*
> >          * Iterate the job list from later to  earlier one and either deactive
> >          * their HW callbacks or remove them from pending list if they already
> > --
> > 2.25.1
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:26   ` Andrey Grodzovsky
@ 2021-08-18 14:32     ` Daniel Vetter
  2021-08-18 14:36       ` Andrey Grodzovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Vetter @ 2021-08-18 14:32 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Alex Deucher, Jingwen Chen, Maling list - DRI developers,
	amd-gfx list, monk.liu, Christian Koenig

On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> 
> > + dri-devel
> > 
> > Since scheduler is a shared component, please add dri-devel on all
> > scheduler patches.
> > 
> > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > [Why]
> > > for bailing job, this commit will delete it from pending list thus the
> > > bailing job will never have a chance to be resubmitted even in advance
> > > tdr mode.
> > > 
> > > [How]
> > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > this commit tries to work around is completely solved.So revert this
> > > commit.
> > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > v2:
> > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > during processing timedout_job
> > > 
> > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > >   1 file changed, 5 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index a2a953693b45..f9b9b3aefc4a 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > >   {
> > >          struct drm_gpu_scheduler *sched;
> > >          struct drm_sched_job *job;
> > > +       struct dma_fence *fence;
> > >          enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > 
> > >          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > 
> > >          if (job) {
> > >                  /*
> > > -                * Remove the bad job so it cannot be freed by concurrent
> > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > -                * is parked at which point it's safe.
> > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > +                * processing timedout_job
> > >                   */
> > > -               list_del_init(&job->list);
> > > +               fence = dma_fence_get(job->s_fence->parent);
> 
> 
> While this is true for amdgpu, it has no meaning for other drivers for whom
> we haven't
> done the refactoring of embedding HW fence (parent) into the job structure.
> In fact thinking
> about it, unless you do the HW fence embedding for all the drivers using the
> scheduler you cannot
> revert this patch or you will just break them.

btw, why did you do that embedding? I do still have my patches with
dma_fence annotations floating around, but my idea at least was to fix
that issue with a mempool, not with embedding. What was the motivation
for embedding the hw fence?
-Daniel


> 
> Andrey
> 
> 
> > >                  spin_unlock(&sched->job_list_lock);
> > > 
> > >                  status = job->sched->ops->timedout_job(job);
> > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > >                          job->sched->ops->free_job(job);
> > >                          sched->free_guilty = false;
> > >                  }
> > > +               dma_fence_put(fence);
> > >          } else {
> > >                  spin_unlock(&sched->job_list_lock);
> > >          }
> > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > 
> > >          kthread_park(sched->thread);
> > > 
> > > -       /*
> > > -        * Reinsert back the bad job here - now it's safe as
> > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > -        * bad job at this point - we parked (waited for) any in progress
> > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > -        * now until the scheduler thread is unparked.
> > > -        */
> > > -       if (bad && bad->sched == sched)
> > > -               /*
> > > -                * Add at the head of the queue to reflect it was the earliest
> > > -                * job extracted.
> > > -                */
> > > -               list_add(&bad->list, &sched->pending_list);
> > > -
> > >          /*
> > >           * Iterate the job list from later to  earlier one and either deactive
> > >           * their HW callbacks or remove them from pending list if they already
> > > --
> > > 2.25.1
> > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:32     ` Daniel Vetter
@ 2021-08-18 14:36       ` Andrey Grodzovsky
  2021-08-18 14:42         ` Daniel Vetter
  0 siblings, 1 reply; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-18 14:36 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Deucher, Jingwen Chen, Maling list - DRI developers,
	amd-gfx list, monk.liu, Christian Koenig


On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>
>>> + dri-devel
>>>
>>> Since scheduler is a shared component, please add dri-devel on all
>>> scheduler patches.
>>>
>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>> [Why]
>>>> for bailing job, this commit will delete it from pending list thus the
>>>> bailing job will never have a chance to be resubmitted even in advance
>>>> tdr mode.
>>>>
>>>> [How]
>>>> after embeded hw_fence into amdgpu_job is done, the race condition that
>>>> this commit tries to work around is completely solved.So revert this
>>>> commit.
>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>> v2:
>>>> add dma_fence_get/put() around timedout_job to avoid concurrent delete
>>>> during processing timedout_job
>>>>
>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>    1 file changed, 5 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>    {
>>>>           struct drm_gpu_scheduler *sched;
>>>>           struct drm_sched_job *job;
>>>> +       struct dma_fence *fence;
>>>>           enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>
>>>>           sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>
>>>>           if (job) {
>>>>                   /*
>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>> -                * is parked at which point it's safe.
>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>> +                * processing timedout_job
>>>>                    */
>>>> -               list_del_init(&job->list);
>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>
>> While this is true for amdgpu, it has no meaning for other drivers for whom
>> we haven't
>> done the refactoring of embedding HW fence (parent) into the job structure.
>> In fact thinking
>> about it, unless you do the HW fence embedding for all the drivers using the
>> scheduler you cannot
>> revert this patch or you will just break them.
> btw, why did you do that embedding? I do still have my patches with
> dma_fence annotations floating around, but my idea at least was to fix
> that issue with a mempool, not with embeddeding. What was the motivation
> for embedding the wh fence?
> -Daniel


The motivation was twofold. First, to avoid memory allocation during job
submission (HW fence allocation), because, as Christian explained, this
leads to a deadlock with mm code during evictions due to memory pressure
(Christian can clarify if I messed up this explanation). Second, to be
able to revert exactly this patch: while it solved the issue described
in the patch, it created another one with drivers that bail out early
during TDR handling for various reasons, where the job would just leak
because it was already removed from the pending list.

Andrey
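
The leak described above can be sketched with a small user-space model in
C. This is an illustration under stated assumptions, not scheduler code:
toy_sched and the toy_* helpers are hypothetical, job ids stand in for
job structures, and a boolean stands in for a driver callback that bails
out before reaching the stop path that would reinsert the job.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_JOBS 8

/* Toy pending list: job ids stored oldest first. */
struct toy_sched {
	int pending[TOY_MAX_JOBS];
	size_t count;
};

/* Append a job id at the tail of the pending list. */
static void toy_push(struct toy_sched *s, int id)
{
	assert(s->count < TOY_MAX_JOBS);
	s->pending[s->count++] = id;
}

/* Remove and return the oldest job id, or -1 if the list is empty. */
static int toy_pop_head(struct toy_sched *s)
{
	size_t i;
	int id;

	if (s->count == 0)
		return -1;
	id = s->pending[0];
	for (i = 1; i < s->count; i++)
		s->pending[i - 1] = s->pending[i];
	s->count--;
	return id;
}

/* Put a job id back at the head of the pending list. */
static void toy_push_head(struct toy_sched *s, int id)
{
	size_t i;

	assert(s->count < TOY_MAX_JOBS);
	for (i = s->count; i > 0; i--)
		s->pending[i] = s->pending[i - 1];
	s->pending[0] = id;
	s->count++;
}

/*
 * Timeout handling in the style of the commit being reverted: the bad job
 * is taken off the list first. Full recovery reinserts it; an early
 * bail-out does not. Returns how many jobs became unreachable.
 */
static int toy_timeout(struct toy_sched *s, bool driver_bails_out)
{
	int bad = toy_pop_head(s);  /* stands in for list_del_init() */

	if (bad < 0)
		return 0;
	if (driver_bails_out)
		return 1;           /* nobody reinserts: the job is lost */
	toy_push_head(s, bad);      /* stands in for list_add() in stop */
	return 0;
}
```

With two jobs pending, a timeout that runs full recovery leaves both on
the list, while a timeout where the driver bails out leaves one job that
no later cleanup pass can find.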


>
>
>> Andrey
>>
>>
>>>>                   spin_unlock(&sched->job_list_lock);
>>>>
>>>>                   status = job->sched->ops->timedout_job(job);
>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>                           job->sched->ops->free_job(job);
>>>>                           sched->free_guilty = false;
>>>>                   }
>>>> +               dma_fence_put(fence);
>>>>           } else {
>>>>                   spin_unlock(&sched->job_list_lock);
>>>>           }
>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>
>>>>           kthread_park(sched->thread);
>>>>
>>>> -       /*
>>>> -        * Reinsert back the bad job here - now it's safe as
>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>> -        * now until the scheduler thread is unparked.
>>>> -        */
>>>> -       if (bad && bad->sched == sched)
>>>> -               /*
>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>> -                * job extracted.
>>>> -                */
>>>> -               list_add(&bad->list, &sched->pending_list);
>>>> -
>>>>           /*
>>>>            * Iterate the job list from later to  earlier one and either deactive
>>>>            * their HW callbacks or remove them from pending list if they already
>>>> --
>>>> 2.25.1
>>>>


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:36       ` Andrey Grodzovsky
@ 2021-08-18 14:42         ` Daniel Vetter
  2021-08-18 14:51           ` Andrey Grodzovsky
  2021-08-19  3:01           ` Liu, Monk
  0 siblings, 2 replies; 30+ messages in thread
From: Daniel Vetter @ 2021-08-18 14:42 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Daniel Vetter, Alex Deucher, Jingwen Chen,
	Maling list - DRI developers, amd-gfx list, monk.liu,
	Christian Koenig

On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > 
> > > > + dri-devel
> > > > 
> > > > Since scheduler is a shared component, please add dri-devel on all
> > > > scheduler patches.
> > > > 
> > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > [Why]
> > > > > for bailing job, this commit will delete it from pending list thus the
> > > > > bailing job will never have a chance to be resubmitted even in advance
> > > > > tdr mode.
> > > > > 
> > > > > [How]
> > > > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > > > this commit tries to work around is completely solved.So revert this
> > > > > commit.
> > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > v2:
> > > > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > > > during processing timedout_job
> > > > > 
> > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > >    1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > >    {
> > > > >           struct drm_gpu_scheduler *sched;
> > > > >           struct drm_sched_job *job;
> > > > > +       struct dma_fence *fence;
> > > > >           enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > 
> > > > >           sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > 
> > > > >           if (job) {
> > > > >                   /*
> > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > -                * is parked at which point it's safe.
> > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > +                * processing timedout_job
> > > > >                    */
> > > > > -               list_del_init(&job->list);
> > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > 
> > > While this is true for amdgpu, it has no meaning for other drivers for whom
> > > we haven't
> > > done the refactoring of embedding HW fence (parent) into the job structure.
> > > In fact thinking
> > > about it, unless you do the HW fence embedding for all the drivers using the
> > > scheduler you cannot
> > > revert this patch or you will just break them.
> > btw, why did you do that embedding? I do still have my patches with
> > dma_fence annotations floating around, but my idea at least was to fix
> > that issue with a mempool, not with embeddeding. What was the motivation
> > for embedding the wh fence?
> > -Daniel
> 
> 
> The motivation was 2 fold, avoid memory allocation during jobs submissions
> (HW fence allocation) because as Christian explained this leads to deadlock
> with
> mm code during evictions due to memory pressure (Christian can clarify if I
> messed

Yeah, that's exactly the same thing I've chased with my dma_fence
annotations, but thus far there has been little to no interest in getting
it sorted. I think it'd be good to have some cross-driver agreement on how
this should be solved before someone just charges ahead ...

> this explanation). Second is to exactly revert this patch because while it
> solved the issue
> described in the patch it created another with drivers who baildc out early
> during TDR handling
> for various reason and the job would just leak because it was already
> removed form pending list.

Can't we reinsert it before we restart the scheduler thread? It might need
a separate list for that due to the lockless queue tricks. Or am I
thinking about the wrong kind of "we lost the job"?
-Daniel

> 
> Andrey
> 
> 
> > 
> > 
> > > Andrey
> > > 
> > > 
> > > > >                   spin_unlock(&sched->job_list_lock);
> > > > > 
> > > > >                   status = job->sched->ops->timedout_job(job);
> > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > >                           job->sched->ops->free_job(job);
> > > > >                           sched->free_guilty = false;
> > > > >                   }
> > > > > +               dma_fence_put(fence);
> > > > >           } else {
> > > > >                   spin_unlock(&sched->job_list_lock);
> > > > >           }
> > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > 
> > > > >           kthread_park(sched->thread);
> > > > > 
> > > > > -       /*
> > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > -        * now until the scheduler thread is unparked.
> > > > > -        */
> > > > > -       if (bad && bad->sched == sched)
> > > > > -               /*
> > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > -                * job extracted.
> > > > > -                */
> > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > -
> > > > >           /*
> > > > >            * Iterate the job list from later to  earlier one and either deactive
> > > > >            * their HW callbacks or remove them from pending list if they already
> > > > > --
> > > > > 2.25.1
> > > > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:42         ` Daniel Vetter
@ 2021-08-18 14:51           ` Andrey Grodzovsky
  2021-08-19  9:30             ` Daniel Vetter
  2021-08-19  3:01           ` Liu, Monk
  1 sibling, 1 reply; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-18 14:51 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Deucher, Jingwen Chen, Maling list - DRI developers,
	amd-gfx list, monk.liu, Christian Koenig


On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>
>>>>> + dri-devel
>>>>>
>>>>> Since scheduler is a shared component, please add dri-devel on all
>>>>> scheduler patches.
>>>>>
>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>> [Why]
>>>>>> for bailing job, this commit will delete it from pending list thus the
>>>>>> bailing job will never have a chance to be resubmitted even in advance
>>>>>> tdr mode.
>>>>>>
>>>>>> [How]
>>>>>> after embeded hw_fence into amdgpu_job is done, the race condition that
>>>>>> this commit tries to work around is completely solved.So revert this
>>>>>> commit.
>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>> v2:
>>>>>> add dma_fence_get/put() around timedout_job to avoid concurrent delete
>>>>>> during processing timedout_job
>>>>>>
>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>     1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>     {
>>>>>>            struct drm_gpu_scheduler *sched;
>>>>>>            struct drm_sched_job *job;
>>>>>> +       struct dma_fence *fence;
>>>>>>            enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>
>>>>>>            sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>
>>>>>>            if (job) {
>>>>>>                    /*
>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>> -                * is parked at which point it's safe.
>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>> +                * processing timedout_job
>>>>>>                     */
>>>>>> -               list_del_init(&job->list);
>>>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>>> While this is true for amdgpu, it has no meaning for other drivers for whom
>>>> we haven't
>>>> done the refactoring of embedding HW fence (parent) into the job structure.
>>>> In fact thinking
>>>> about it, unless you do the HW fence embedding for all the drivers using the
>>>> scheduler you cannot
>>>> revert this patch or you will just break them.
>>> btw, why did you do that embedding? I do still have my patches with
>>> dma_fence annotations floating around, but my idea at least was to fix
>>> that issue with a mempool, not with embeddeding. What was the motivation
>>> for embedding the wh fence?
>>> -Daniel
>>
>> The motivation was 2 fold, avoid memory allocation during jobs submissions
>> (HW fence allocation) because as Christian explained this leads to deadlock
>> with
>> mm code during evictions due to memory pressure (Christian can clarify if I
>> messed
> Yeah that's the exact same thing I've chased with my dma_fence
> annotations, but thus far zero to none interested in getting it sorted. I
> think it'd be good to have some cross-driver agreement on how this should
> be solved before someone just charges ahead ...
>
>> this explanation). Second is to exactly revert this patch because while it
>> solved the issue
>> described in the patch it created another with drivers who baildc out early
>> during TDR handling
>> for various reason and the job would just leak because it was already
>> removed form pending list.
> Can't we reinsert it before we restart the scheduler thread? It might need
> a separate list for that due to the lockless queue tricks. Or am I
> thinking about the wrong kind of "we lost the job"?
> -Danile


If you look at the original patch, it would reinsert it even earlier,
right after stopping the SW scheduler thread, and even then it was too
late for some drivers, as they would decide to return from their TDR
handler even before that. It is solvable, but in an ugly way as far as
I can see: you need to require each driver to put the job back in the
list in its own code if it returns before reaching the place where the
scheduler framework does it. That seems like spaghetti code to me.
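
To illustrate the leak described above, here is a minimal userspace
sketch (not kernel code). All names in it (struct job, struct sched,
pending_add_head(), pending_pop_head(), sched_stop(),
driver_timedout_job(), timeout_handler()) are hypothetical stand-ins;
only the ordering of the steps follows the discussion: the timeout
handler removes the bad job from the pending list, and the job only
gets back on the list if the driver's handler reaches the stop step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_JOBS 4

/* Hypothetical stand-in for a scheduler job. */
struct job {
	int id;
	bool on_pending_list;
};

/* Hypothetical stand-in for the scheduler and its pending list. */
struct sched {
	struct job *pending[MAX_JOBS];
	int count;
};

/* Insert at the head, as the reinsert in the stop path does. */
static void pending_add_head(struct sched *s, struct job *j)
{
	assert(s->count < MAX_JOBS);
	for (int i = s->count; i > 0; i--)
		s->pending[i] = s->pending[i - 1];
	s->pending[0] = j;
	s->count++;
	j->on_pending_list = true;
}

/* Remove and return the oldest job, or NULL if the list is empty. */
static struct job *pending_pop_head(struct sched *s)
{
	struct job *j;

	if (s->count == 0)
		return NULL;
	j = s->pending[0];
	for (int i = 1; i < s->count; i++)
		s->pending[i - 1] = s->pending[i];
	s->count--;
	j->on_pending_list = false;
	return j;
}

/* Models the stop step: the only place the framework reinserts the job. */
static void sched_stop(struct sched *s, struct job *bad)
{
	pending_add_head(s, bad);
}

/* Models a driver TDR handler that may return before the stop step. */
static void driver_timedout_job(struct sched *s, struct job *bad,
				bool bail_early)
{
	if (bail_early)
		return;		/* the reinsert in sched_stop() never runs */
	sched_stop(s, bad);
}

/* Returns true if the bad job is still tracked on the pending list. */
static bool timeout_handler(struct sched *s, bool bail_early)
{
	struct job *bad = pending_pop_head(s);

	if (!bad)
		return true;
	driver_timedout_job(s, bad, bail_early);
	return bad->on_pending_list;
}
```

With bail_early set, timeout_handler() returns false: the job is on no
list, so nothing will ever free or resubmit it. That is the case where
each driver would have to do the reinsert itself.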

Andrey


>
>> Andrey
>>
>>
>>>
>>>> Andrey
>>>>
>>>>
>>>>>>                    spin_unlock(&sched->job_list_lock);
>>>>>>
>>>>>>                    status = job->sched->ops->timedout_job(job);
>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>                            job->sched->ops->free_job(job);
>>>>>>                            sched->free_guilty = false;
>>>>>>                    }
>>>>>> +               dma_fence_put(fence);
>>>>>>            } else {
>>>>>>                    spin_unlock(&sched->job_list_lock);
>>>>>>            }
>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>
>>>>>>            kthread_park(sched->thread);
>>>>>>
>>>>>> -       /*
>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>> -        * now until the scheduler thread is unparked.
>>>>>> -        */
>>>>>> -       if (bad && bad->sched == sched)
>>>>>> -               /*
>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>> -                * job extracted.
>>>>>> -                */
>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>> -
>>>>>>            /*
>>>>>>             * Iterate the job list from later to  earlier one and either deactive
>>>>>>             * their HW callbacks or remove them from pending list if they already
>>>>>> --
>>>>>> 2.25.1
>>>>>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:42         ` Daniel Vetter
  2021-08-18 14:51           ` Andrey Grodzovsky
@ 2021-08-19  3:01           ` Liu, Monk
  2021-08-19  9:24             ` Daniel Vetter
  1 sibling, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2021-08-19  3:01 UTC (permalink / raw)
  To: Daniel Vetter, Grodzovsky, Andrey
  Cc: Alex Deucher, Chen, JingWen, Maling list - DRI developers,
	amd-gfx list, Koenig, Christian

[AMD Official Use Only]

Hi Andrey and Daniel

We worked for a really long time on this new feature for AMD that can finally pick out the bad job from all the timed-out ones, and the change in the scheduler (get/put the fence in drm_sched_job_timedout, and remove the bad job delete and put-back) is the last piece for us.

We understand that once the "bad job list node delete logic" is removed from job_timedout, race issues will be introduced if a vendor's timedout_job callback accesses the bad job in parallel with the scheduler doing "sched->ops->free_job(cleanup_job)".

To avoid any impact on those vendors, I'd like to propose a very simple change (which introduces a new bool member in the scheduler to indicate whether the del/put-back logic is needed); see the patch below:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea468..5e0bdc4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -495,6 +495,8 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 		return r;
 	}
 
+	ring->sched.keep_bad_job = true;
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 92d8de2..e7ac384 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	struct dma_fence *f = NULL;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
@@ -328,7 +329,11 @@ static void drm_sched_job_timedout(struct work_struct *work)
 		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
 		 * is parked at which point it's safe.
 		 */
-		list_del_init(&job->list);
+		if (sched->keep_bad_job == false)
+			list_del_init(&job->list);
+		else
+			f = dma_fence_get(job->s_fence->parent);//get parent fence here to prevent hw_fence dropping to zero due to sched-main's cleanup_jobs, for amdgpu once parent fence drop to zero the sched_job will be kfree-ed 
+
 		spin_unlock(&sched->job_list_lock);
 
 		job->sched->ops->timedout_job(job);
@@ -341,6 +346,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 			job->sched->ops->free_job(job);
 			sched->free_guilty = false;
 		}
+
+		dma_fence_put(f);
 	} else {
 		spin_unlock(&sched->job_list_lock);
 	}
@@ -396,7 +403,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
 	 * now until the scheduler thread is unparked.
 	 */
-	if (bad && bad->sched == sched)
+	if (bad && bad->sched == sched && sched->keep_bad_job == false)
 		/*
 		 * Add at the head of the queue to reflect it was the earliest
 		 * job extracted.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 4ea8606..5f9a640 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -301,6 +301,7 @@ struct drm_gpu_scheduler {
 	atomic_t                        _score;
 	bool				ready;
 	bool				free_guilty;
+	bool keep_bad_job;
 };
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,
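
The get/put pair in the patch above relies on ordinary reference
counting. The following is a minimal single-threaded userspace sketch
of that idea, not the kernel's dma_fence implementation: struct fence,
fence_init(), fence_get(), fence_put() and handler_sees_live_fence()
are hypothetical names, the counter is a plain int rather than an
atomic kref, and "released" is a flag standing in for the real free.
Like dma_fence_get()/dma_fence_put(), the helpers accept NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted fence. */
struct fence {
	int refcount;
	bool released;	/* set instead of freeing, so it can be inspected */
};

/* The creator starts out holding one reference. */
static void fence_init(struct fence *f)
{
	f->refcount = 1;
	f->released = false;
}

/* Take an extra reference; NULL is passed through unchanged. */
static struct fence *fence_get(struct fence *f)
{
	if (f) {
		assert(!f->released);
		f->refcount++;
	}
	return f;
}

/* Drop a reference; the last put releases the object. NULL is a no-op. */
static void fence_put(struct fence *f)
{
	if (!f)
		return;
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->released = true;
}

/*
 * Models the timeout handler: take a reference first, let a simulated
 * concurrent cleanup drop the owner's reference, and report whether
 * the fence was still alive while the handler was working on it.
 */
static bool handler_sees_live_fence(struct fence *parent)
{
	struct fence *held = fence_get(parent);
	bool alive;

	fence_put(parent);		/* simulated cleanup by the scheduler */
	alive = held && !held->released;
	fence_put(held);		/* handler is done; last reference */
	return alive;
}
```

The extra reference only delays the release until the handler's own
put. In the patch this protects the job only because, as the inline
comment says, for amdgpu the job is freed when the parent fence's
count drops to zero.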


Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@ffwll.ch> 
Sent: Wednesday, August 18, 2021 10:43 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > 
> > > > + dri-devel
> > > > 
> > > > Since scheduler is a shared component, please add dri-devel on 
> > > > all scheduler patches.
> > > > 
> > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > [Why]
> > > > > for bailing job, this commit will delete it from pending list 
> > > > > thus the bailing job will never have a chance to be 
> > > > > resubmitted even in advance tdr mode.
> > > > > 
> > > > > [How]
> > > > > after embeded hw_fence into amdgpu_job is done, the race 
> > > > > condition that this commit tries to work around is completely 
> > > > > solved.So revert this commit.
> > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > v2:
> > > > > add dma_fence_get/put() around timedout_job to avoid 
> > > > > concurrent delete during processing timedout_job
> > > > > 
> > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > >    1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > >    {
> > > > >           struct drm_gpu_scheduler *sched;
> > > > >           struct drm_sched_job *job;
> > > > > +       struct dma_fence *fence;
> > > > >           enum drm_gpu_sched_stat status = 
> > > > > DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > 
> > > > >           sched = container_of(work, struct drm_gpu_scheduler, 
> > > > > work_tdr.work); @@ -325,11 +326,10 @@ static void 
> > > > > drm_sched_job_timedout(struct work_struct *work)
> > > > > 
> > > > >           if (job) {
> > > > >                   /*
> > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > -                * is parked at which point it's safe.
> > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > +                * processing timedout_job
> > > > >                    */
> > > > > -               list_del_init(&job->list);
> > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > 
> > > While this is true for amdgpu, it has no meaning for other drivers 
> > > for whom we haven't done the refactoring of embedding HW fence 
> > > (parent) into the job structure.
> > > In fact thinking
> > > about it, unless you do the HW fence embedding for all the drivers 
> > > using the scheduler you cannot revert this patch or you will just 
> > > break them.
> > btw, why did you do that embedding? I do still have my patches with 
> > dma_fence annotations floating around, but my idea at least was to 
> > fix that issue with a mempool, not with embeddeding. What was the 
> > motivation for embedding the wh fence?
> > -Daniel
> 
> 
> The motivation was 2 fold, avoid memory allocation during jobs 
> submissions (HW fence allocation) because as Christian explained this 
> leads to deadlock with mm code during evictions due to memory pressure 
> (Christian can clarify if I messed

Yeah that's the exact same thing I've chased with my dma_fence annotations, but thus far zero to none interested in getting it sorted. I think it'd be good to have some cross-driver agreement on how this should be solved before someone just charges ahead ...

> this explanation). Second is to exactly revert this patch because 
> while it solved the issue described in the patch it created another 
> with drivers who baildc out early during TDR handling for various 
> reason and the job would just leak because it was already removed form 
> pending list.

Can't we reinsert it before we restart the scheduler thread? It might need a separate list for that due to the lockless queue tricks. Or am I thinking about the wrong kind of "we lost the job"?
-Danile

> 
> Andrey
> 
> 
> > 
> > 
> > > Andrey
> > > 
> > > 
> > > > >                   spin_unlock(&sched->job_list_lock);
> > > > > 
> > > > >                   status = job->sched->ops->timedout_job(job);
> > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > >                           job->sched->ops->free_job(job);
> > > > >                           sched->free_guilty = false;
> > > > >                   }
> > > > > +               dma_fence_put(fence);
> > > > >           } else {
> > > > >                   spin_unlock(&sched->job_list_lock);
> > > > >           }
> > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct 
> > > > > drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > 
> > > > >           kthread_park(sched->thread);
> > > > > 
> > > > > -       /*
> > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > -        * now until the scheduler thread is unparked.
> > > > > -        */
> > > > > -       if (bad && bad->sched == sched)
> > > > > -               /*
> > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > -                * job extracted.
> > > > > -                */
> > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > -
> > > > >           /*
> > > > >            * Iterate the job list from later to  earlier one and either deactive
> > > > >            * their HW callbacks or remove them from pending 
> > > > > list if they already
> > > > > --
> > > > > 2.25.1
> > > > > 

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-19  3:01           ` Liu, Monk
@ 2021-08-19  9:24             ` Daniel Vetter
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Vetter @ 2021-08-19  9:24 UTC (permalink / raw)
  To: Liu, Monk
  Cc: Daniel Vetter, Grodzovsky, Andrey, Alex Deucher, Chen, JingWen,
	Maling list - DRI developers, amd-gfx list, Koenig, Christian

On Thu, Aug 19, 2021 at 03:01:26AM +0000, Liu, Monk wrote:
> [AMD Official Use Only]
> 
> Hi Andrey and Daniel
> 
> We worked for a really long time on this new feature to AMD that finally
> can pick up the bad job from all timedout ones, and the change in
> scheduler (get/put fence in drm_sched_job_timedout, and remove the bad
> job delete and put back) is the last piece for us.
> 
> While we understand and realized that after the "bad job list node
> delete logic" being removed from job_timedout,  there will be race
> issues introduced if vendor's job_timeout calback is accessing the bad
> job  in parallel of scheduler doing "sched->ops->free_job(leanup_job)".
> 
> And to not introduce impact at all on those vendors I'd like to proposal
> a very simple change (which introduced a new bool member for scheduler
> to indicate if the del/put-back logic is needed or not) , check  patch
> here below:

If everyone operates like that then the shared code becomes a massive mess
of incompatible options and unmaintainable. I don't think that's a good
path forward.
-Daniel

> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 47ea468..5e0bdc4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -495,6 +495,8 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>  		return r;
>  	}
>  
> +	ring->sched.keep_bad_job = true;
> +
>  	return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 92d8de2..e7ac384 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  {
>  	struct drm_gpu_scheduler *sched;
>  	struct drm_sched_job *job;
> +	struct dma_fence *f = NULL;
>  
>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>  
> @@ -328,7 +329,11 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>  		 * is parked at which point it's safe.
>  		 */
> -		list_del_init(&job->list);
> +		if (sched->keep_bad_job == false)
> +			list_del_init(&job->list);
> +		else
> +			f = dma_fence_get(job->s_fence->parent);//get parent fence here to prevent hw_fence dropping to zero due to sched-main's cleanup_jobs, for amdgpu once parent fence drop to zero the sched_job will be kfree-ed 
> +
>  		spin_unlock(&sched->job_list_lock);
>  
>  		job->sched->ops->timedout_job(job);
> @@ -341,6 +346,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  			job->sched->ops->free_job(job);
>  			sched->free_guilty = false;
>  		}
> +
> +		dma_fence_put(f);
>  	} else {
>  		spin_unlock(&sched->job_list_lock);
>  	}
> @@ -396,7 +403,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>  	 * now until the scheduler thread is unparked.
>  	 */
> -	if (bad && bad->sched == sched)
> +	if (bad && bad->sched == sched && sched->keep_bad_job == false)
>  		/*
>  		 * Add at the head of the queue to reflect it was the earliest
>  		 * job extracted.
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 4ea8606..5f9a640 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -301,6 +301,7 @@ struct drm_gpu_scheduler {
>  	atomic_t                        _score;
>  	bool				ready;
>  	bool				free_guilty;
> +	bool keep_bad_job;
>  };
>  
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
> 
> 
> Thanks 
> 
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
> 
> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch> 
> Sent: Wednesday, August 18, 2021 10:43 PM
> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
> 
> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > 
> > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > 
> > > > > + dri-devel
> > > > > 
> > > > > Since scheduler is a shared component, please add dri-devel on 
> > > > > all scheduler patches.
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > [Why]
> > > > > > for bailing job, this commit will delete it from pending list 
> > > > > > thus the bailing job will never have a chance to be 
> > > > > > resubmitted even in advance tdr mode.
> > > > > > 
> > > > > > [How]
> > > > > > after embeded hw_fence into amdgpu_job is done, the race 
> > > > > > condition that this commit tries to work around is completely 
> > > > > > solved.So revert this commit.
> > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > v2:
> > > > > > add dma_fence_get/put() around timedout_job to avoid 
> > > > > > concurrent delete during processing timedout_job
> > > > > > 
> > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > ---
> > > > > >    drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > >    1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > >    {
> > > > > >           struct drm_gpu_scheduler *sched;
> > > > > >           struct drm_sched_job *job;
> > > > > > +       struct dma_fence *fence;
> > > > > >           enum drm_gpu_sched_stat status = 
> > > > > > DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > 
> > > > > >           sched = container_of(work, struct drm_gpu_scheduler, 
> > > > > > work_tdr.work); @@ -325,11 +326,10 @@ static void 
> > > > > > drm_sched_job_timedout(struct work_struct *work)
> > > > > > 
> > > > > >           if (job) {
> > > > > >                   /*
> > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > -                * is parked at which point it's safe.
> > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > +                * processing timedout_job
> > > > > >                    */
> > > > > > -               list_del_init(&job->list);
> > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > 
> > > > While this is true for amdgpu, it has no meaning for other drivers 
> > > > for whom we haven't done the refactoring of embedding HW fence 
> > > > (parent) into the job structure.
> > > > In fact thinking
> > > > about it, unless you do the HW fence embedding for all the drivers 
> > > > using the scheduler you cannot revert this patch or you will just 
> > > > break them.
> > > btw, why did you do that embedding? I do still have my patches with 
> > > dma_fence annotations floating around, but my idea at least was to 
> > > fix that issue with a mempool, not with embeddeding. What was the 
> > > motivation for embedding the wh fence?
> > > -Daniel
> > 
> > 
> > The motivation was 2 fold, avoid memory allocation during jobs 
> > submissions (HW fence allocation) because as Christian explained this 
> > leads to deadlock with mm code during evictions due to memory pressure 
> > (Christian can clarify if I messed
> 
> Yeah that's the exact same thing I've chased with my dma_fence annotations, but thus far zero to none interested in getting it sorted. I think it'd be good to have some cross-driver agreement on how this should be solved before someone just charges ahead ...
> 
> > this explanation). Second is to exactly revert this patch because 
> > while it solved the issue described in the patch it created another 
> > with drivers who baildc out early during TDR handling for various 
> > reason and the job would just leak because it was already removed form 
> > pending list.
> 
> Can't we reinsert it before we restart the scheduler thread? It might need a separate list for that due to the lockless queue tricks. Or am I thinking about the wrong kind of "we lost the job"?
> -Danile
> 
> > 
> > Andrey
> > 
> > 
> > > 
> > > 
> > > > Andrey
> > > > 
> > > > 
> > > > > >                   spin_unlock(&sched->job_list_lock);
> > > > > > 
> > > > > >                   status = job->sched->ops->timedout_job(job);
> > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > >                           job->sched->ops->free_job(job);
> > > > > >                           sched->free_guilty = false;
> > > > > >                   }
> > > > > > +               dma_fence_put(fence);
> > > > > >           } else {
> > > > > >                   spin_unlock(&sched->job_list_lock);
> > > > > >           }
> > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct 
> > > > > > drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > 
> > > > > >           kthread_park(sched->thread);
> > > > > > 
> > > > > > -       /*
> > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > -        * now until the scheduler thread is unparked.
> > > > > > -        */
> > > > > > -       if (bad && bad->sched == sched)
> > > > > > -               /*
> > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > -                * job extracted.
> > > > > > -                */
> > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > -
> > > > > >           /*
> > > > > >            * Iterate the job list from later to  earlier one and either deactive
> > > > > >            * their HW callbacks or remove them from pending 
> > > > > > list if they already
> > > > > > --
> > > > > > 2.25.1
> > > > > > 
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-18 14:51           ` Andrey Grodzovsky
@ 2021-08-19  9:30             ` Daniel Vetter
  2021-08-19 10:25               ` Liu, Monk
  2021-08-19 15:25               ` Andrey Grodzovsky
  0 siblings, 2 replies; 30+ messages in thread
From: Daniel Vetter @ 2021-08-19  9:30 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Daniel Vetter, Alex Deucher, Jingwen Chen,
	Maling list - DRI developers, amd-gfx list, monk.liu,
	Christian Koenig

On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > 
> > > > > > + dri-devel
> > > > > > 
> > > > > > Since scheduler is a shared component, please add dri-devel on all
> > > > > > scheduler patches.
> > > > > > 
> > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > [Why]
> > > > > > > for bailing job, this commit will delete it from pending list thus the
> > > > > > > bailing job will never have a chance to be resubmitted even in advance
> > > > > > > tdr mode.
> > > > > > > 
> > > > > > > [How]
> > > > > > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > > > > > this commit tries to work around is completely solved.So revert this
> > > > > > > commit.
> > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > v2:
> > > > > > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > > > > > during processing timedout_job
> > > > > > > 
> > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > ---
> > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > >     1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >     {
> > > > > > >            struct drm_gpu_scheduler *sched;
> > > > > > >            struct drm_sched_job *job;
> > > > > > > +       struct dma_fence *fence;
> > > > > > >            enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > 
> > > > > > >            sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > 
> > > > > > >            if (job) {
> > > > > > >                    /*
> > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > -                * is parked at which point it's safe.
> > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > +                * processing timedout_job
> > > > > > >                     */
> > > > > > > -               list_del_init(&job->list);
> > > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > > While this is true for amdgpu, it has no meaning for other drivers for whom
> > > > > we haven't
> > > > > done the refactoring of embedding HW fence (parent) into the job structure.
> > > > > In fact thinking
> > > > > about it, unless you do the HW fence embedding for all the drivers using the
> > > > > scheduler you cannot
> > > > > revert this patch or you will just break them.
> > > > btw, why did you do that embedding? I do still have my patches with
> > > > dma_fence annotations floating around, but my idea at least was to fix
> > > > that issue with a mempool, not with embeddeding. What was the motivation
> > > > for embedding the wh fence?
> > > > -Daniel
> > > 
> > > The motivation was 2 fold, avoid memory allocation during jobs submissions
> > > (HW fence allocation) because as Christian explained this leads to deadlock
> > > with
> > > mm code during evictions due to memory pressure (Christian can clarify if I
> > > messed
> > Yeah that's the exact same thing I've chased with my dma_fence
> > annotations, but thus far zero to none interested in getting it sorted. I
> > think it'd be good to have some cross-driver agreement on how this should
> > be solved before someone just charges ahead ...
> > 
> > > this explanation). Second is to exactly revert this patch because while it
> > > solved the issue
> > > described in the patch it created another with drivers who baildc out early
> > > during TDR handling
> > > for various reason and the job would just leak because it was already
> > > removed form pending list.
> > Can't we reinsert it before we restart the scheduler thread? It might need
> > a separate list for that due to the lockless queue tricks. Or am I
> > thinking about the wrong kind of "we lost the job"?
> > -Danile
> 
> 
> If you look at the original patch it would reinsert it even earlier - right
> after stopping the  SW scheduler thread, and even then it was to late for
> some drivers as they would decide to return back from their TDR handler even
> before that. It is solvable but in an ugly way as far as I see, you need to
> require each driver in his code to put the job back in the list if they do
> it before reaching the place where scheduler framework does it. Kind of
> spaghetti code seems to me.

Hm yeah I didn't realize this all happens before we stop the scheduler
thread.

Why can't we stop the scheduler thread first, so that there's guaranteed
no race? I've recently had a lot of discussions with panfrost folks about
their reset that spans across engines, and without stopping the scheduler
thread first before you touch anything it's just plain impossible.

I'm also still not understanding what exactly you guys have done.
Can someone please dig out the amdgpu patches that motivate all this?
Maybe that's clearer. A full explanation would still be good since I've
only started in scheduler stuff.

Another thing I recently pondered for tdr races looking at i915 code is
whether the tdr should first block the completion fence for that job. My
motivation is to have a race-free error capture (if the completion races
then we might start evicting memory and everything goes boom), but maybe
that helps here too. Some kind of atomic "block this fence from
completing" thing.

Or am I completely guessing in the wrong direction?
-Daniel

> 
> Andrey
> 
> 
> > 
> > > Andrey
> > > 
> > > 
> > > > 
> > > > > Andrey
> > > > > 
> > > > > 
> > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > > 
> > > > > > >                    status = job->sched->ops->timedout_job(job);
> > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >                            job->sched->ops->free_job(job);
> > > > > > >                            sched->free_guilty = false;
> > > > > > >                    }
> > > > > > > +               dma_fence_put(fence);
> > > > > > >            } else {
> > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > >            }
> > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > 
> > > > > > >            kthread_park(sched->thread);
> > > > > > > 
> > > > > > > -       /*
> > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > -        */
> > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > -               /*
> > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > -                * job extracted.
> > > > > > > -                */
> > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > -
> > > > > > >            /*
> > > > > > >             * Iterate the job list from later to  earlier one and either deactive
> > > > > > >             * their HW callbacks or remove them from pending list if they already
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-19  9:30             ` Daniel Vetter
@ 2021-08-19 10:25               ` Liu, Monk
  2021-08-20  7:12                 ` Liu, Monk
  2021-08-19 15:25               ` Andrey Grodzovsky
  1 sibling, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2021-08-19 10:25 UTC (permalink / raw)
  To: Daniel Vetter, Grodzovsky, Andrey
  Cc: Alex Deucher, Chen, JingWen, Maling list - DRI developers,
	amd-gfx list, Koenig, Christian

[AMD Official Use Only]

Hi Daniel

>> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.

Yeah, we had this thought in mind as well.

Our second approach is to call kthread_park() in the job_timedout() routine, so that the "bad" job is guaranteed to be usable without the scheduler touching or freeing it.
Please check this sample patch as well:

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..50a49cb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
        /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+       kthread_park(sched->thread);
        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->pending_list,
                                       struct drm_sched_job, list);
 
        if (job) {
-               /*
-                * Remove the bad job so it cannot be freed by concurrent
-                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-                * is parked at which point it's safe.
-                */
-               list_del_init(&job->list);
                spin_unlock(&sched->job_list_lock);
 
                status = job->sched->ops->timedout_job(job);
@@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
        } else {
                spin_unlock(&sched->job_list_lock);
        }
+       kthread_unpark(sched->thread);
 
        if (status != DRM_GPU_SCHED_STAT_ENODEV) {
                spin_lock(&sched->job_list_lock);
@@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
        kthread_park(sched->thread);
 
        /*
-        * Reinsert back the bad job here - now it's safe as
-        * drm_sched_get_cleanup_job cannot race against us and release the
-        * bad job at this point - we parked (waited for) any in progress
-        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-        * now until the scheduler thread is unparked.
-        */
-       if (bad && bad->sched == sched)
-               /*
-                * Add at the head of the queue to reflect it was the earliest
-                * job extracted.
-                */
-               list_add(&bad->list, &sched->pending_list);
-
-       /*
         * Iterate the job list from later to  earlier one and either deactive
         * their HW callbacks or remove them from pending list if they already
         * signaled.
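
A minimal userspace sketch of the ordering in the sample patch above, under stated assumptions: `struct job`, `struct sched`, the `parked` flag and `bail_out_early()` are hypothetical stand-ins, not the real drm_sched structures or the kthread API, and the model is single-threaded. It only illustrates the intended invariant: if the scheduler is parked before the head job is looked at, the job can stay on the pending list, cleanup cannot free it under the handler, and a driver that bails out early does not lose it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real drm_sched structures. */
struct job {
	struct job *next;
	bool finished;          /* set once the hardware fence has signaled */
};

struct sched {
	struct job *pending;    /* head is the oldest job */
	bool parked;            /* models kthread_park()/kthread_unpark() */
};

typedef bool (*timedout_fn)(struct sched *s, struct job *bad);

/*
 * Models the cleanup path of the scheduler thread: pop the oldest job
 * if it has finished. A parked thread does not run, so nothing is
 * removed from the list while parked.
 */
struct job *sched_get_cleanup_job(struct sched *s)
{
	struct job *j = s->pending;

	if (s->parked || !j || !j->finished)
		return NULL;
	s->pending = j->next;
	j->next = NULL;
	return j;
}

/*
 * Models the proposed ordering: park first, then inspect the head job
 * without taking it off the pending list, then unpark.
 * Returns the callback's result, or false if nothing was pending.
 */
bool sched_job_timedout(struct sched *s, timedout_fn cb)
{
	struct job *bad;
	bool ok = false;

	assert(!s->parked);     /* the handler is not re-entered in this model */
	s->parked = true;
	bad = s->pending;
	if (bad)
		ok = cb(s, bad);
	s->parked = false;
	return ok;
}

/*
 * A driver callback that bails out early. The job "completes" while the
 * handler runs, and a cleanup attempt is made to show that the parked
 * scheduler cannot free it. Returns true if the cleanup was blocked.
 */
bool bail_out_early(struct sched *s, struct job *bad)
{
	bad->finished = true;
	return sched_get_cleanup_job(s) == NULL;
}
```

In this model, once the handler returns and the scheduler is unparked, the normal cleanup path picks the job up from the pending list as usual, so no reinsertion step is needed.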


Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@ffwll.ch> 
Sent: Thursday, August 19, 2021 5:31 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > 
> > > > > > + dri-devel
> > > > > > 
> > > > > > Since scheduler is a shared component, please add dri-devel 
> > > > > > on all scheduler patches.
> > > > > > 
> > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > [Why]
> > > > > > > for bailing job, this commit will delete it from pending 
> > > > > > > list thus the bailing job will never have a chance to be 
> > > > > > > resubmitted even in advance tdr mode.
> > > > > > > 
> > > > > > > [How]
> > > > > > > after embeded hw_fence into amdgpu_job is done, the race 
> > > > > > > condition that this commit tries to work around is 
> > > > > > > completely solved.So revert this commit.
> > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > v2:
> > > > > > > add dma_fence_get/put() around timedout_job to avoid 
> > > > > > > concurrent delete during processing timedout_job
> > > > > > > 
> > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > ---
> > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > >     1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >     {
> > > > > > >            struct drm_gpu_scheduler *sched;
> > > > > > >            struct drm_sched_job *job;
> > > > > > > +       struct dma_fence *fence;
> > > > > > >            enum drm_gpu_sched_stat status = 
> > > > > > > DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > 
> > > > > > >            sched = container_of(work, struct 
> > > > > > > drm_gpu_scheduler, work_tdr.work); @@ -325,11 +326,10 @@ 
> > > > > > > static void drm_sched_job_timedout(struct work_struct 
> > > > > > > *work)
> > > > > > > 
> > > > > > >            if (job) {
> > > > > > >                    /*
> > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > -                * is parked at which point it's safe.
> > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > +                * processing timedout_job
> > > > > > >                     */
> > > > > > > -               list_del_init(&job->list);
> > > > > > > +               fence = 
> > > > > > > + dma_fence_get(job->s_fence->parent);
> > > > > While this is true for amdgpu, it has no meaning for other 
> > > > > drivers for whom we haven't done the refactoring of embedding 
> > > > > HW fence (parent) into the job structure.
> > > > > In fact thinking
> > > > > about it, unless you do the HW fence embedding for all the 
> > > > > drivers using the scheduler you cannot revert this patch or 
> > > > > you will just break them.
> > > > btw, why did you do that embedding? I do still have my patches 
> > > > with dma_fence annotations floating around, but my idea at least 
> > > > was to fix that issue with a mempool, not with embeddeding. What 
> > > > was the motivation for embedding the wh fence?
> > > > -Daniel
> > > 
> > > The motivation was 2 fold, avoid memory allocation during jobs 
> > > submissions (HW fence allocation) because as Christian explained 
> > > this leads to deadlock with mm code during evictions due to memory 
> > > pressure (Christian can clarify if I messed
> > Yeah that's the exact same thing I've chased with my dma_fence 
> > annotations, but thus far zero to none interested in getting it 
> > sorted. I think it'd be good to have some cross-driver agreement on 
> > how this should be solved before someone just charges ahead ...
> > 
> > > this explanation). Second is to exactly revert this patch because 
> > > while it solved the issue described in the patch it created 
> > > another with drivers who baildc out early during TDR handling for 
> > > various reason and the job would just leak because it was already 
> > > removed form pending list.
> > Can't we reinsert it before we restart the scheduler thread? It 
> > might need a separate list for that due to the lockless queue 
> > tricks. Or am I thinking about the wrong kind of "we lost the job"?
> > -Danile
> 
> 
> If you look at the original patch it would reinsert it even earlier - 
> right after stopping the  SW scheduler thread, and even then it was to 
> late for some drivers as they would decide to return back from their 
> TDR handler even before that. It is solvable but in an ugly way as far 
> as I see, you need to require each driver in his code to put the job 
> back in the list if they do it before reaching the place where 
> scheduler framework does it. Kind of spaghetti code seems to me.

Hm yeah I didn't realize this all happens before we stop the scheduler thread.

Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.

I'm also still not understanding what exactly you guys have done, can someone please dig out the the amdgpu patches that motivate all this maybe that's clearer? A full explanation would still be good since I've only started in scheduler stuff.

Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing thing.

Or I'm I completely guessing in the wrong direction?
-Daniel

> 
> Andrey
> 
> 
> > 
> > > Andrey
> > > 
> > > 
> > > > 
> > > > > Andrey
> > > > > 
> > > > > 
> > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > > 
> > > > > > >                    status = 
> > > > > > > job->sched->ops->timedout_job(job);
> > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >                            job->sched->ops->free_job(job);
> > > > > > >                            sched->free_guilty = false;
> > > > > > >                    }
> > > > > > > +               dma_fence_put(fence);
> > > > > > >            } else {
> > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > >            }
> > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct 
> > > > > > > drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > 
> > > > > > >            kthread_park(sched->thread);
> > > > > > > 
> > > > > > > -       /*
> > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > -        */
> > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > -               /*
> > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > -                * job extracted.
> > > > > > > -                */
> > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > -
> > > > > > >            /*
> > > > > > >             * Iterate the job list from later to  earlier one and either deactive
> > > > > > >             * their HW callbacks or remove them from 
> > > > > > > pending list if they already
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > > 

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-19  9:30             ` Daniel Vetter
  2021-08-19 10:25               ` Liu, Monk
@ 2021-08-19 15:25               ` Andrey Grodzovsky
  2021-08-26  9:04                 ` Daniel Vetter
  1 sibling, 1 reply; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-19 15:25 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Deucher, Jingwen Chen, Maling list - DRI developers,
	amd-gfx list, monk.liu, Christian Koenig


On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>
>>>>>>> + dri-devel
>>>>>>>
>>>>>>> Since scheduler is a shared component, please add dri-devel on all
>>>>>>> scheduler patches.
>>>>>>>
>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>>>> [Why]
>>>>>>>> for bailing job, this commit will delete it from pending list thus the
>>>>>>>> bailing job will never have a chance to be resubmitted even in advance
>>>>>>>> tdr mode.
>>>>>>>>
>>>>>>>> [How]
>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race condition that
>>>>>>>> this commit tries to work around is completely solved.So revert this
>>>>>>>> commit.
>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>> v2:
>>>>>>>> add dma_fence_get/put() around timedout_job to avoid concurrent delete
>>>>>>>> during processing timedout_job
>>>>>>>>
>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>      1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>      {
>>>>>>>>             struct drm_gpu_scheduler *sched;
>>>>>>>>             struct drm_sched_job *job;
>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>             enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>
>>>>>>>>             sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>
>>>>>>>>             if (job) {
>>>>>>>>                     /*
>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>> +                * processing timedout_job
>>>>>>>>                      */
>>>>>>>> -               list_del_init(&job->list);
>>>>>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>>>>> While this is true for amdgpu, it has no meaning for other drivers for whom
>>>>>> we haven't
>>>>>> done the refactoring of embedding HW fence (parent) into the job structure.
>>>>>> In fact thinking
>>>>>> about it, unless you do the HW fence embedding for all the drivers using the
>>>>>> scheduler you cannot
>>>>>> revert this patch or you will just break them.
>>>>> btw, why did you do that embedding? I do still have my patches with
>>>>> dma_fence annotations floating around, but my idea at least was to fix
>>>>> that issue with a mempool, not with embeddeding. What was the motivation
>>>>> for embedding the wh fence?
>>>>> -Daniel
>>>> The motivation was 2 fold, avoid memory allocation during jobs submissions
>>>> (HW fence allocation) because as Christian explained this leads to deadlock
>>>> with
>>>> mm code during evictions due to memory pressure (Christian can clarify if I
>>>> messed
>>> Yeah that's the exact same thing I've chased with my dma_fence
>>> annotations, but thus far zero to none interested in getting it sorted. I
>>> think it'd be good to have some cross-driver agreement on how this should
>>> be solved before someone just charges ahead ...
>>>
>>>> this explanation). Second is to exactly revert this patch because while it
>>>> solved the issue
>>>> described in the patch it created another with drivers who baildc out early
>>>> during TDR handling
>>>> for various reason and the job would just leak because it was already
>>>> removed form pending list.
>>> Can't we reinsert it before we restart the scheduler thread? It might need
>>> a separate list for that due to the lockless queue tricks. Or am I
>>> thinking about the wrong kind of "we lost the job"?
>>> -Danile
>>
>> If you look at the original patch it would reinsert it even earlier - right
>> after stopping the  SW scheduler thread, and even then it was to late for
>> some drivers as they would decide to return back from their TDR handler even
>> before that. It is solvable but in an ugly way as far as I see, you need to
>> require each driver in his code to put the job back in the list if they do
>> it before reaching the place where scheduler framework does it. Kind of
>> spaghetti code seems to me.
> Hm yeah I didn't realize this all happens before we stop the scheduler
> thread.
>
> Why can't we stop the scheduler thread first, so that there's guaranteed
> no race? I've recently had a lot of discussions with panfrost folks about
> their reset that spawns across engines, and without stopping the scheduler
> thread first before you touch anything it's just plain impossible.


Talked with Christian on that. For each TDR we actually stop all the
schedulers for all the rings, not only the hung ring, since an
ASIC reset will impact all the rings anyway. So we cannot allow
timeout handlers for other rings to run in parallel with ours,
as they would stop/restart the threads we just stopped and rely
on being stopped. So it's all done under a device-wide lock
inside the amdgpu TDR handler. Only inside the locked
section may we stop/restart the scheduler threads.
Christian also mentioned that you proposed at some point
to serialize all TDR handling into a single thread for all rings. This
seems like something that could be used: we would then not need any
locking against TDR handlers from other rings, and we could
stop the scheduler thread as the first step.
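
A rough userspace sketch of the serialization idea, under stated assumptions: `struct device`, `tdr_queue()`, `tdr_handle()` and `tdr_worker()` are hypothetical names made up for illustration, not amdgpu or drm_sched APIs, and the single worker is modeled as a plain loop. Timeout paths only queue a request; one worker drains the queue in order, so each handler can stop every ring's scheduler as its first step without a device-wide lock against other handlers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_RINGS 4
#define QUEUE_LEN 8

/* Hypothetical device model; not the amdgpu structures. */
struct device {
	bool stopped[NUM_RINGS];  /* per-ring scheduler state */
	int queue[QUEUE_LEN];     /* FIFO of rings whose jobs timed out */
	size_t head, tail;
	int resets;               /* number of device resets performed */
};

/* Called from a ring's timeout path: only records the request. */
bool tdr_queue(struct device *dev, int ring)
{
	assert(ring >= 0 && ring < NUM_RINGS);
	if (dev->tail - dev->head == QUEUE_LEN)
		return false;     /* queue full */
	dev->queue[dev->tail++ % QUEUE_LEN] = ring;
	return true;
}

/*
 * One TDR pass: stop all schedulers, reset, restart all schedulers.
 * Returns false if a scheduler was already stopped on entry, which
 * would mean two handlers overlapped.
 */
bool tdr_handle(struct device *dev, int ring)
{
	int i;

	(void)ring;               /* the reset affects every ring anyway */
	for (i = 0; i < NUM_RINGS; i++) {
		if (dev->stopped[i])
			return false;
		dev->stopped[i] = true;
	}
	dev->resets++;            /* stands in for the ASIC reset */
	for (i = 0; i < NUM_RINGS; i++)
		dev->stopped[i] = false;
	return true;
}

/* The single worker: handles queued timeouts strictly one at a time. */
bool tdr_worker(struct device *dev)
{
	while (dev->head != dev->tail) {
		int ring = dev->queue[dev->head++ % QUEUE_LEN];

		if (!tdr_handle(dev, ring))
			return false;
	}
	return true;
}
```

Because only the worker ever calls `tdr_handle()`, the "already stopped" check cannot trigger in this model; with per-ring handlers running in parallel, that is the case the device-wide lock currently guards against.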


>
> I'm also still not understanding what exactly you guys have done,
> can someone please dig out the the amdgpu patches that motivate all this
> maybe that's clearer? A full explanation would still be good since I've
> only started in scheduler stuff.


https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141


>
> Another thing I recently pondered for tdr races looking at i915 code is
> whether the tdr should first block the completion fence for that job. My
> motivation is to have a race-free error capture (if the completion races
> then we might start evicting memory and everything goes boom), but maybe
> that helps here too. Some kind of atomic "block this fence from
> completing thing.
>
> Or I'm I completely guessing in the wrong direction?


I think we already do it here - 
https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410

Andrey


> -Daniel
>
>> Andrey
>>
>>
>>>> Andrey
>>>>
>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>
>>>>>>>>                     status = job->sched->ops->timedout_job(job);
>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>                             job->sched->ops->free_job(job);
>>>>>>>>                             sched->free_guilty = false;
>>>>>>>>                     }
>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>             } else {
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>             }
>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>
>>>>>>>>             kthread_park(sched->thread);
>>>>>>>>
>>>>>>>> -       /*
>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>> -        */
>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>> -               /*
>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>> -                * job extracted.
>>>>>>>> -                */
>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>> -
>>>>>>>>             /*
>>>>>>>>              * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>              * their HW callbacks or remove them from pending list if they already
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-19 10:25               ` Liu, Monk
@ 2021-08-20  7:12                 ` Liu, Monk
  2021-08-20  7:20                   ` Christian König
                                     ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Liu, Monk @ 2021-08-20  7:12 UTC (permalink / raw)
  To: Daniel Vetter, Grodzovsky, Andrey, Koenig, Christian
  Cc: Alex Deucher, Chen, JingWen, Maling list - DRI developers, amd-gfx list

[AMD Official Use Only]

@Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
 
Do you have any concerns about the kthread_park() approach?

In theory, sched_main should run mutually exclusively with job_timeout, since they both touch jobs. Stopping the scheduler during job_timeout won't impact performance, because in that scenario something is already wrong or stuck on that ring/scheduler.
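
A minimal userspace sketch of that exclusivity argument, not kernel code. It assumes a single-threaded model in which a `parked` flag stands in for kthread_park()/kthread_unpark(), and the cleanup step is invoked by hand at the point where the real scheduler thread might race. All model_* names are hypothetical.

```c
/* Userspace model only: none of the model_* names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_job {
	bool done;	/* hardware fence signaled */
	bool freed;	/* job released by the cleanup path */
};

struct model_sched {
	struct model_job *head;	/* first entry of the pending list */
	bool parked;		/* stands in for a parked scheduler thread */
};

/* One cleanup iteration of the scheduler thread: free a finished head job. */
static bool model_cleanup_step(struct model_sched *s)
{
	if (s->parked || !s->head || !s->head->done)
		return false;
	s->head->freed = true;
	s->head = NULL;
	return true;
}

/*
 * Timeout handler following the park-first idea: while parked, the head job
 * stays on the pending list and cannot be freed underneath the handler.
 * Returns true if the job was still alive when the handler body finished.
 */
static bool model_job_timedout(struct model_sched *s)
{
	struct model_job *job;
	bool alive;

	s->parked = true;		/* models kthread_park(sched->thread) */
	job = s->head;
	if (job) {
		/* The job completes while the timeout is being handled... */
		job->done = true;
		/* ...and cleanup tries to run, but is a no-op while parked. */
		model_cleanup_step(s);
	}
	alive = job && !job->freed;
	s->parked = false;		/* models kthread_unpark(sched->thread) */
	return alive;
}
```

With the flag set, the cleanup attempt made while the handler runs does nothing, so the head job is still on the list and unfreed when the handler returns; cleanup succeeds once the model is unparked. The real kthread_park() additionally waits until the thread has actually parked, which this model does not capture.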

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Liu, Monk 
Sent: Thursday, August 19, 2021 6:26 PM
To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

[AMD Official Use Only]

Hi Daniel

>> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.

Yeah, we had this thought in mind as well.

Our second approach is to call kthread_park() in the job_timedout() routine so that the "bad" job is guaranteed to be used without the scheduler touching or freeing it. Please check this sample patch as well:

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..50a49cb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
        /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+       kthread_park(sched->thread);
        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->pending_list,
                                       struct drm_sched_job, list);
 
        if (job) {
-               /*
-                * Remove the bad job so it cannot be freed by concurrent
-                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-                * is parked at which point it's safe.
-                */
-               list_del_init(&job->list);
                spin_unlock(&sched->job_list_lock);
 
                status = job->sched->ops->timedout_job(job);
@@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
        } else {
                spin_unlock(&sched->job_list_lock);
        }
+       kthread_unpark(sched->thread);
 
        if (status != DRM_GPU_SCHED_STAT_ENODEV) {
                spin_lock(&sched->job_list_lock);
@@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
        kthread_park(sched->thread);
 
        /*
-        * Reinsert back the bad job here - now it's safe as
-        * drm_sched_get_cleanup_job cannot race against us and release the
-        * bad job at this point - we parked (waited for) any in progress
-        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-        * now until the scheduler thread is unparked.
-        */
-       if (bad && bad->sched == sched)
-               /*
-                * Add at the head of the queue to reflect it was the earliest
-                * job extracted.
-                */
-               list_add(&bad->list, &sched->pending_list);
-
-       /*
         * Iterate the job list from later to  earlier one and either deactive
         * their HW callbacks or remove them from pending list if they already
         * signaled.


Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@ffwll.ch>
Sent: Thursday, August 19, 2021 5:31 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > 
> > > > > > + dri-devel
> > > > > > 
> > > > > > Since scheduler is a shared component, please add dri-devel 
> > > > > > on all scheduler patches.
> > > > > > 
> > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > [Why]
> > > > > > > for bailing job, this commit will delete it from pending 
> > > > > > > list thus the bailing job will never have a chance to be 
> > > > > > > resubmitted even in advance tdr mode.
> > > > > > > 
> > > > > > > [How]
> > > > > > > after embeded hw_fence into amdgpu_job is done, the race 
> > > > > > > condition that this commit tries to work around is 
> > > > > > > completely solved.So revert this commit.
> > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > v2:
> > > > > > > add dma_fence_get/put() around timedout_job to avoid 
> > > > > > > concurrent delete during processing timedout_job
> > > > > > > 
> > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > ---
> > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > >     1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >     {
> > > > > > >            struct drm_gpu_scheduler *sched;
> > > > > > >            struct drm_sched_job *job;
> > > > > > > +       struct dma_fence *fence;
> > > > > > >            enum drm_gpu_sched_stat status = 
> > > > > > > DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > 
> > > > > > >            sched = container_of(work, struct 
> > > > > > > drm_gpu_scheduler, work_tdr.work); @@ -325,11 +326,10 @@ 
> > > > > > > static void drm_sched_job_timedout(struct work_struct
> > > > > > > *work)
> > > > > > > 
> > > > > > >            if (job) {
> > > > > > >                    /*
> > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > -                * is parked at which point it's safe.
> > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > +                * processing timedout_job
> > > > > > >                     */
> > > > > > > -               list_del_init(&job->list);
> > > > > > > +               fence =
> > > > > > > + dma_fence_get(job->s_fence->parent);
> > > > > While this is true for amdgpu, it has no meaning for other 
> > > > > drivers for whom we haven't done the refactoring of embedding 
> > > > > HW fence (parent) into the job structure.
> > > > > In fact thinking
> > > > > about it, unless you do the HW fence embedding for all the 
> > > > > drivers using the scheduler you cannot revert this patch or 
> > > > > you will just break them.
> > > > btw, why did you do that embedding? I do still have my patches 
> > > > with dma_fence annotations floating around, but my idea at least 
> > > > was to fix that issue with a mempool, not with embeddeding. What 
> > > > was the motivation for embedding the wh fence?
> > > > -Daniel
> > > 
> > > The motivation was 2 fold, avoid memory allocation during jobs 
> > > submissions (HW fence allocation) because as Christian explained 
> > > this leads to deadlock with mm code during evictions due to memory 
> > > pressure (Christian can clarify if I messed
> > Yeah that's the exact same thing I've chased with my dma_fence 
> > annotations, but thus far zero to none interested in getting it 
> > sorted. I think it'd be good to have some cross-driver agreement on 
> > how this should be solved before someone just charges ahead ...
> > 
> > > this explanation). Second is to exactly revert this patch because 
> > > while it solved the issue described in the patch it created 
> > > another with drivers who baildc out early during TDR handling for 
> > > various reason and the job would just leak because it was already 
> > > removed form pending list.
> > Can't we reinsert it before we restart the scheduler thread? It 
> > might need a separate list for that due to the lockless queue 
> > tricks. Or am I thinking about the wrong kind of "we lost the job"?
> > -Danile
> 
> 
> If you look at the original patch it would reinsert it even earlier - 
> right after stopping the  SW scheduler thread, and even then it was to 
> late for some drivers as they would decide to return back from their 
> TDR handler even before that. It is solvable but in an ugly way as far 
> as I see, you need to require each driver in his code to put the job 
> back in the list if they do it before reaching the place where 
> scheduler framework does it. Kind of spaghetti code seems to me.

Hm yeah I didn't realize this all happens before we stop the scheduler thread.

Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.

I'm also still not understanding what exactly you guys have done; can someone please dig out the amdgpu patches that motivate all this? Maybe that's clearer. A full explanation would still be good since I've only just started on scheduler stuff.

Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing" thing.

Or am I completely guessing in the wrong direction?
-Daniel

> 
> Andrey
> 
> 
> > 
> > > Andrey
> > > 
> > > 
> > > > 
> > > > > Andrey
> > > > > 
> > > > > 
> > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > > 
> > > > > > >                    status =
> > > > > > > job->sched->ops->timedout_job(job);
> > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >                            job->sched->ops->free_job(job);
> > > > > > >                            sched->free_guilty = false;
> > > > > > >                    }
> > > > > > > +               dma_fence_put(fence);
> > > > > > >            } else {
> > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > >            }
> > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct 
> > > > > > > drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > 
> > > > > > >            kthread_park(sched->thread);
> > > > > > > 
> > > > > > > -       /*
> > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > -        */
> > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > -               /*
> > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > -                * job extracted.
> > > > > > > -                */
> > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > -
> > > > > > >            /*
> > > > > > >             * Iterate the job list from later to  earlier one and either deactive
> > > > > > >             * their HW callbacks or remove them from 
> > > > > > > pending list if they already
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > > 

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-20  7:12                 ` Liu, Monk
@ 2021-08-20  7:20                   ` Christian König
  2021-08-20  8:09                     ` Jingwen Chen
  2021-08-26  8:59                     ` Daniel Vetter
  2021-08-20 14:07                   ` Andrey Grodzovsky
  2023-06-08 16:40                     ` Lucas Stach
  2 siblings, 2 replies; 30+ messages in thread
From: Christian König @ 2021-08-20  7:20 UTC (permalink / raw)
  To: Liu, Monk, Daniel Vetter, Grodzovsky, Andrey
  Cc: Alex Deucher, Chen, JingWen, Maling list - DRI developers, amd-gfx list

No, that works perfectly for me.

The problem we used to have with this approach was that we could
potentially have multiple timeouts at the same time.

But when we serialize the timeout handling by using a single workqueue,
as Daniel has now suggested as well, that isn't an issue any more.
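
The serialization point can be illustrated with a small userspace model, offered as a sketch only. It assumes that every scheduler's timeout is queued to one shared FIFO drained by a single worker, which is roughly what an ordered workqueue would give in the kernel (that mapping is an assumption here, not something spelled out in this thread). The model_* names are hypothetical.

```c
/* Userspace model only: the model_* names are illustrative, not kernel APIs. */
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_WORK 8

struct model_wq {
	int pending[MODEL_MAX_WORK];	/* scheduler ids with a queued timeout */
	size_t head, tail;
	int running;			/* handlers currently executing */
	int max_running;		/* high-water mark of 'running' */
	int handled;			/* handlers completed so far */
};

/* Queue a timeout for scheduler 'id'; returns 0 on success, -1 if full. */
static int model_queue_timeout(struct model_wq *wq, int id)
{
	if (wq->tail == MODEL_MAX_WORK)
		return -1;
	wq->pending[wq->tail++] = id;
	return 0;
}

/* Timeout handler: in this model a timeout on scheduler 0 also trips scheduler 1. */
static void model_handle_timeout(struct model_wq *wq, int id)
{
	wq->running++;
	if (wq->running > wq->max_running)
		wq->max_running = wq->running;
	if (id == 0)
		model_queue_timeout(wq, 1);	/* queued behind us, not run in place */
	wq->handled++;
	wq->running--;
}

/* Single worker: drains the queue in FIFO order, one handler at a time. */
static void model_drain(struct model_wq *wq)
{
	while (wq->head < wq->tail)
		model_handle_timeout(wq, wq->pending[wq->head++]);
}
```

Because a timeout raised from inside a handler is only appended to the queue, no two handlers ever overlap in this model, which is the property that makes parking the scheduler thread from the timeout path safe against a second, simultaneous timeout.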

Regards,
Christian.

Am 20.08.21 um 09:12 schrieb Liu, Monk:
> [AMD Official Use Only]
>
> @Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
>   
> Do you have any concern on the kthread_park() approach ?
>
> Theoretically speaking sched_main shall run there exclusively with job_timeout since they both touches jobs, and stop scheduler during job_timeout won't impact performance since in that scenario
> There was already something wrong/stuck on that ring/scheduler
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Thursday, August 19, 2021 6:26 PM
> To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>
> [AMD Official Use Only]
>
> Hi Daniel
>
>>> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> Yeah we had this though as well in our mind.
>
> Our second approach is to call ktrhead_stop() in job_timedout() routine so that  the "bad" job is guaranteed to be used without scheduler's touching or freeing, Check this sample patch one as well please:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..50a49cb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>   
>          /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> +       kthread_park(sched->thread);
>          spin_lock(&sched->job_list_lock);
>          job = list_first_entry_or_null(&sched->pending_list,
>                                         struct drm_sched_job, list);
>   
>          if (job) {
> -               /*
> -                * Remove the bad job so it cannot be freed by concurrent
> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -                * is parked at which point it's safe.
> -                */
> -               list_del_init(&job->list);
>                  spin_unlock(&sched->job_list_lock);
>   
>                  status = job->sched->ops->timedout_job(job);
> @@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>          } else {
>                  spin_unlock(&sched->job_list_lock);
>          }
> +       kthread_unpark(sched->thread);
>   
>          if (status != DRM_GPU_SCHED_STAT_ENODEV) {
>                  spin_lock(&sched->job_list_lock);
> @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>          kthread_park(sched->thread);
>   
>          /*
> -        * Reinsert back the bad job here - now it's safe as
> -        * drm_sched_get_cleanup_job cannot race against us and release the
> -        * bad job at this point - we parked (waited for) any in progress
> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -        * now until the scheduler thread is unparked.
> -        */
> -       if (bad && bad->sched == sched)
> -               /*
> -                * Add at the head of the queue to reflect it was the earliest
> -                * job extracted.
> -                */
> -               list_add(&bad->list, &sched->pending_list);
> -
> -       /*
>           * Iterate the job list from later to  earlier one and either deactive
>           * their HW callbacks or remove them from pending list if they already
>           * signaled.
>
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Thursday, August 19, 2021 5:31 PM
> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>
> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>
>>>>>>> + dri-devel
>>>>>>>
>>>>>>> Since scheduler is a shared component, please add dri-devel
>>>>>>> on all scheduler patches.
>>>>>>>
>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>>>> [Why]
>>>>>>>> for bailing job, this commit will delete it from pending
>>>>>>>> list thus the bailing job will never have a chance to be
>>>>>>>> resubmitted even in advance tdr mode.
>>>>>>>>
>>>>>>>> [How]
>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race
>>>>>>>> condition that this commit tries to work around is
>>>>>>>> completely solved.So revert this commit.
>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>> v2:
>>>>>>>> add dma_fence_get/put() around timedout_job to avoid
>>>>>>>> concurrent delete during processing timedout_job
>>>>>>>>
>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>      1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>      {
>>>>>>>>             struct drm_gpu_scheduler *sched;
>>>>>>>>             struct drm_sched_job *job;
>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>             enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>
>>>>>>>>             sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>
>>>>>>>>             if (job) {
>>>>>>>>                     /*
>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>> +                * processing timedout_job
>>>>>>>>                      */
>>>>>>>> -               list_del_init(&job->list);
>>>>>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>>>>> While this is true for amdgpu, it has no meaning for other
>>>>>> drivers for whom we haven't done the refactoring of embedding
>>>>>> HW fence (parent) into the job structure.
>>>>>> In fact thinking
>>>>>> about it, unless you do the HW fence embedding for all the
>>>>>> drivers using the scheduler you cannot revert this patch or
>>>>>> you will just break them.
>>>>> btw, why did you do that embedding? I do still have my patches
>>>>> with dma_fence annotations floating around, but my idea at least
>>>>> was to fix that issue with a mempool, not with embeddeding. What
>>>>> was the motivation for embedding the wh fence?
>>>>> -Daniel
>>>> The motivation was 2 fold, avoid memory allocation during jobs
>>>> submissions (HW fence allocation) because as Christian explained
>>>> this leads to deadlock with mm code during evictions due to memory
>>>> pressure (Christian can clarify if I messed
>>> Yeah that's the exact same thing I've chased with my dma_fence
>>> annotations, but thus far zero to none interested in getting it
>>> sorted. I think it'd be good to have some cross-driver agreement on
>>> how this should be solved before someone just charges ahead ...
>>>
>>>> this explanation). Second is to exactly revert this patch because
>>>> while it solved the issue described in the patch it created
>>>> another with drivers who baildc out early during TDR handling for
>>>> various reason and the job would just leak because it was already
>>>> removed form pending list.
>>> Can't we reinsert it before we restart the scheduler thread? It
>>> might need a separate list for that due to the lockless queue
>>> tricks. Or am I thinking about the wrong kind of "we lost the job"?
>>> -Danile
>>
>> If you look at the original patch it would reinsert it even earlier -
>> right after stopping the  SW scheduler thread, and even then it was to
>> late for some drivers as they would decide to return back from their
>> TDR handler even before that. It is solvable but in an ugly way as far
>> as I see, you need to require each driver in his code to put the job
>> back in the list if they do it before reaching the place where
>> scheduler framework does it. Kind of spaghetti code seems to me.
> Hm yeah I didn't realize this all happens before we stop the scheduler thread.
>
> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
>
> I'm also still not understanding what exactly you guys have done, can someone please dig out the the amdgpu patches that motivate all this maybe that's clearer? A full explanation would still be good since I've only started in scheduler stuff.
>
> Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing thing.
>
> Or I'm I completely guessing in the wrong direction?
> -Daniel
>
>> Andrey
>>
>>
>>>> Andrey
>>>>
>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>
>>>>>>>>                     status = job->sched->ops->timedout_job(job);
>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>                             job->sched->ops->free_job(job);
>>>>>>>>                             sched->free_guilty = false;
>>>>>>>>                     }
>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>             } else {
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>             }
>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>
>>>>>>>>             kthread_park(sched->thread);
>>>>>>>>
>>>>>>>> -       /*
>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>> -        */
>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>> -               /*
>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>> -                * job extracted.
>>>>>>>> -                */
>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>> -
>>>>>>>>             /*
>>>>>>>>              * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>              * their HW callbacks or remove them from pending list if they already
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-20  7:20                   ` Christian König
@ 2021-08-20  8:09                     ` Jingwen Chen
  2021-08-20 13:49                       ` Andrey Grodzovsky
  2021-08-26  8:59                     ` Daniel Vetter
  1 sibling, 1 reply; 30+ messages in thread
From: Jingwen Chen @ 2021-08-20  8:09 UTC (permalink / raw)
  To: Christian König, Liu, Monk, Daniel Vetter, Grodzovsky, Andrey
  Cc: Alex Deucher, Maling list - DRI developers, amd-gfx list

Hi all,

I just submitted a v3 patch that uses kthread_park instead, following
your suggestions.

Thanks,
Jingwen
On Fri Aug 20, 2021 at 09:20:42AM +0200, Christian König wrote:
> No, that perfectly works for me.
> 
> The problem we used to have with this approach was that we potentially have
> multiple timeouts at the same time.
> 
> But when we serialize the timeout handling by using a single workqueue as
> suggested by Daniel now as well then that isn't an issue any more.
> 
> Regards,
> Christian.
> 
> Am 20.08.21 um 09:12 schrieb Liu, Monk:
> > [AMD Official Use Only]
> > 
> > @Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
> > Do you have any concern on the kthread_park() approach ?
> > 
> > Theoretically speaking sched_main shall run there exclusively with job_timeout since they both touches jobs, and stop scheduler during job_timeout won't impact performance since in that scenario
> > There was already something wrong/stuck on that ring/scheduler
> > 
> > Thanks
> > 
> > ------------------------------------------
> > Monk Liu | Cloud-GPU Core team
> > ------------------------------------------
> > 
> > -----Original Message-----
> > From: Liu, Monk
> > Sent: Thursday, August 19, 2021 6:26 PM
> > To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> > Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Koenig, Christian <Christian.Koenig@amd.com>
> > Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
> > 
> > [AMD Official Use Only]
> > 
> > Hi Daniel
> > 
> > > > Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> > Yeah we had this though as well in our mind.
> > 
> > Our second approach is to call ktrhead_stop() in job_timedout() routine so that  the "bad" job is guaranteed to be used without scheduler's touching or freeing, Check this sample patch one as well please:
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index a2a9536..50a49cb 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >          /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> > +       kthread_park(sched->thread);
> >          spin_lock(&sched->job_list_lock);
> >          job = list_first_entry_or_null(&sched->pending_list,
> >                                         struct drm_sched_job, list);
> >          if (job) {
> > -               /*
> > -                * Remove the bad job so it cannot be freed by concurrent
> > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > -                * is parked at which point it's safe.
> > -                */
> > -               list_del_init(&job->list);
> >                  spin_unlock(&sched->job_list_lock);
> >                  status = job->sched->ops->timedout_job(job);
> > @@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >          } else {
> >                  spin_unlock(&sched->job_list_lock);
> >          }
> > +       kthread_unpark(sched->thread);
> >          if (status != DRM_GPU_SCHED_STAT_ENODEV) {
> >                  spin_lock(&sched->job_list_lock);
> > @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >          kthread_park(sched->thread);
> >          /*
> > -        * Reinsert back the bad job here - now it's safe as
> > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > -        * bad job at this point - we parked (waited for) any in progress
> > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > -        * now until the scheduler thread is unparked.
> > -        */
> > -       if (bad && bad->sched == sched)
> > -               /*
> > -                * Add at the head of the queue to reflect it was the earliest
> > -                * job extracted.
> > -                */
> > -               list_add(&bad->list, &sched->pending_list);
> > -
> > -       /*
> >           * Iterate the job list from later to  earlier one and either deactive
> >           * their HW callbacks or remove them from pending list if they already
> >           * signaled.
> > 
> > 
> > Thanks
> > 
> > ------------------------------------------
> > Monk Liu | Cloud-GPU Core team
> > ------------------------------------------
> > 
> > -----Original Message-----
> > From: Daniel Vetter <daniel@ffwll.ch>
> > Sent: Thursday, August 19, 2021 5:31 PM
> > To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> > Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
> > 
> > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > > 
> > > > > > > > + dri-devel
> > > > > > > > 
> > > > > > > > Since scheduler is a shared component, please add dri-devel
> > > > > > > > on all scheduler patches.
> > > > > > > > 
> > > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > > > [Why]
> > > > > > > > > for bailing job, this commit will delete it from pending
> > > > > > > > > list thus the bailing job will never have a chance to be
> > > > > > > > > resubmitted even in advance tdr mode.
> > > > > > > > > 
> > > > > > > > > [How]
> > > > > > > > > after embeded hw_fence into amdgpu_job is done, the race
> > > > > > > > > condition that this commit tries to work around is
> > > > > > > > > completely solved.So revert this commit.
> > > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > > v2:
> > > > > > > > > add dma_fence_get/put() around timedout_job to avoid
> > > > > > > > > concurrent delete during processing timedout_job
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > > > ---
> > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > > >      1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > >      {
> > > > > > > > >             struct drm_gpu_scheduler *sched;
> > > > > > > > >             struct drm_sched_job *job;
> > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > >             enum drm_gpu_sched_stat status =
> > > > > > > > > DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > 
> > > > > > > > >             sched = container_of(work, struct
> > > > > > > > > drm_gpu_scheduler, work_tdr.work); @@ -325,11 +326,10 @@
> > > > > > > > > static void drm_sched_job_timedout(struct work_struct
> > > > > > > > > *work)
> > > > > > > > > 
> > > > > > > > >             if (job) {
> > > > > > > > >                     /*
> > > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > > +                * processing timedout_job
> > > > > > > > >                      */
> > > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > +               fence =
> > > > > > > > > + dma_fence_get(job->s_fence->parent);
> > > > > > > While this is true for amdgpu, it has no meaning for other
> > > > > > > drivers for whom we haven't done the refactoring of embedding
> > > > > > > HW fence (parent) into the job structure.
> > > > > > > In fact thinking
> > > > > > > about it, unless you do the HW fence embedding for all the
> > > > > > > drivers using the scheduler you cannot revert this patch or
> > > > > > > you will just break them.
> > > > > > btw, why did you do that embedding? I do still have my patches
> > > > > > with dma_fence annotations floating around, but my idea at least
> > > > > > > was to fix that issue with a mempool, not with embedding. What
> > > > > > > was the motivation for embedding the HW fence?
> > > > > > -Daniel
> > > > > The motivation was 2 fold, avoid memory allocation during jobs
> > > > > submissions (HW fence allocation) because as Christian explained
> > > > > this leads to deadlock with mm code during evictions due to memory
> > > > > pressure (Christian can clarify if I messed
> > > > Yeah that's the exact same thing I've chased with my dma_fence
> > > > annotations, but thus far zero to none interested in getting it
> > > > sorted. I think it'd be good to have some cross-driver agreement on
> > > > how this should be solved before someone just charges ahead ...
> > > > 
> > > > > this explanation). Second is to exactly revert this patch because
> > > > > while it solved the issue described in the patch it created
> > > > > > another with drivers who bail out early during TDR handling for
> > > > > > various reasons and the job would just leak because it was already
> > > > > > removed from the pending list.
> > > > Can't we reinsert it before we restart the scheduler thread? It
> > > > might need a separate list for that due to the lockless queue
> > > > tricks. Or am I thinking about the wrong kind of "we lost the job"?
> > > > > -Daniel
> > > 
> > > If you look at the original patch it would reinsert it even earlier -
> > > > right after stopping the SW scheduler thread, and even then it was too
> > > > late for some drivers as they would decide to return back from their
> > > TDR handler even before that. It is solvable but in an ugly way as far
> > > as I see, you need to require each driver in his code to put the job
> > > back in the list if they do it before reaching the place where
> > > scheduler framework does it. Kind of spaghetti code seems to me.
> > Hm yeah I didn't realize this all happens before we stop the scheduler thread.
> > 
> > Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> > 
> > I also still don't understand what exactly you guys have done. Can someone please dig out the amdgpu patches that motivate all this? Maybe that would make it clearer. A full explanation would still be good since I've only started in scheduler stuff.
> > 
> > Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing" thing.
> > 
> > Or am I completely guessing in the wrong direction?
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > > > Andrey
> > > > > 
> > > > > 
> > > > > > > Andrey
> > > > > > > 
> > > > > > > 
> > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > > 
> > > > > > > > >                     status =
> > > > > > > > > job->sched->ops->timedout_job(job);
> > > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > >                             job->sched->ops->free_job(job);
> > > > > > > > >                             sched->free_guilty = false;
> > > > > > > > >                     }
> > > > > > > > > +               dma_fence_put(fence);
> > > > > > > > >             } else {
> > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > >             }
> > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct
> > > > > > > > > drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > > 
> > > > > > > > >             kthread_park(sched->thread);
> > > > > > > > > 
> > > > > > > > > -       /*
> > > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > > -        */
> > > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > > -               /*
> > > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > > -                * job extracted.
> > > > > > > > > -                */
> > > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > > -
> > > > > > > > >             /*
> > > > > > > > >              * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > >              * their HW callbacks or remove them from
> > > > > > > > > pending list if they already
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch/
> 


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-20  8:09                     ` Jingwen Chen
@ 2021-08-20 13:49                       ` Andrey Grodzovsky
  0 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-20 13:49 UTC (permalink / raw)
  To: Jingwen Chen, Christian König, Liu, Monk, Daniel Vetter
  Cc: Alex Deucher, Maling list - DRI developers, amd-gfx list

I believe we have some minor confusion here.

On 2021-08-20 4:09 a.m., Jingwen Chen wrote:
> Hi all,
>
> I just submitted a v3 patch, following your suggestion to use
> kthread_park instead.
>
> Thanks,
> Jingwen
> On Fri Aug 20, 2021 at 09:20:42AM +0200, Christian König wrote:
>> No, that perfectly works for me.
>>
>> The problem we used to have with this approach was that we potentially have
>> multiple timeouts at the same time.
>>
>> But when we serialize the timeout handling by using a single workqueue as
>> suggested by Daniel now as well then that isn't an issue any more.


While we do use a single workqueue by default (system_wq) for this, we
use different work items, one per scheduler, which means they can still
run in parallel. I didn't see the original mail by Daniel, but from what
Christian mentioned I assume he suggested serializing all TO (timeout)
handlers from all possible engines, either by using a single work item
for the TO handler or by using a single-threaded queue for all TO
handlers.

So I believe it's premature to send a V3 patch without also switching
all TDR handling to actual single-threaded handling per entire ASIC. In
the case of amdgpu we actually need to consider XGMI hives, so it goes
beyond a single device.
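
To make the distinction above concrete, here is a minimal userspace
sketch, not scheduler code. It assumes a toy model in which each
scheduler's TO work item has a queue time and a run time; the names
to_work and max_parallel_handlers are hypothetical. It only counts how
many handlers overlap when the items are independent versus when they
are forced to run one at a time, in order (the behaviour a kernel
ordered workqueue, e.g. one created with alloc_ordered_workqueue(),
is meant to provide).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one scheduler's timeout (TO) work item. */
struct to_work {
	int queued_at;	/* tick at which the work item becomes runnable */
	int duration;	/* ticks the handler needs to run */
};

/*
 * Return the largest number of handlers running at the same tick.
 *
 * ordered == 0: every item starts as soon as it is queued, i.e. the
 *               items are independent of each other.
 * ordered != 0: an item starts only after the previous one finished
 *               (single-threaded, in-order queue).
 *
 * Items must be sorted by queued_at; at most 16 items are supported.
 */
static int max_parallel_handlers(const struct to_work *w, size_t n, int ordered)
{
	int start[16], end[16];
	int prev_end = 0, best = 0;
	size_t i, j;

	assert(n <= 16);

	for (i = 0; i < n; i++) {
		start[i] = w[i].queued_at;
		if (ordered && start[i] < prev_end)
			start[i] = prev_end;	/* wait for the previous handler */
		end[i] = start[i] + w[i].duration;
		prev_end = end[i];
	}

	/* The overlap is maximal at some handler's start tick. */
	for (i = 0; i < n; i++) {
		int running = 0;

		for (j = 0; j < n; j++)
			if (start[j] <= start[i] && start[i] < end[j])
				running++;
		if (running > best)
			best = running;
	}
	return best;
}
```

With three items queued at ticks 0, 2 and 4 that each run for 10 ticks,
the independent case overlaps all three handlers, while the ordered
case never runs more than one at a time.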

Andrey




* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-20  7:12                 ` Liu, Monk
  2021-08-20  7:20                   ` Christian König
@ 2021-08-20 14:07                   ` Andrey Grodzovsky
  2021-08-24  7:24                     ` Liu, Monk
  2023-06-08 16:40                     ` Lucas Stach
  2 siblings, 1 reply; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-20 14:07 UTC (permalink / raw)
  To: Liu, Monk, Daniel Vetter, Koenig, Christian
  Cc: Alex Deucher, Chen, JingWen, Maling list - DRI developers, amd-gfx list


On 2021-08-20 3:12 a.m., Liu, Monk wrote:
> [AMD Official Use Only]
>
> @Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
>   
> Do you have any concern on the kthread_park() approach ?
>
> Theoretically speaking, sched_main should run mutually exclusively with job_timeout since they both touch jobs, and stopping the scheduler during job_timeout won't impact performance since in that scenario
> there was already something wrong/stuck on that ring/scheduler.


Regarding the last paragraph, and specifically the claim that something
was already wrong if the TO handler starts executing: I'm not sure about
this, and I wonder if we have a potential bug here.

We start the timeout timer in drm_sched_job_begin for each new incoming
job. In a constant rapid stream of jobs, each new job will try to start
the timer, but most of the time this operation just bails out because
there is already a pending timer from one of the previous jobs, which
makes the new request a no-op [1]. So when the TO handler eventually
does execute, it's not because something is wrong but simply because the
TO has expired. If the pending list is not empty in this case, a false
TDR will be triggered.

I think long ago we used a TO handler per job and not per scheduler;
that would solve this problem but hurt the serialization issue we are
trying to solve. So I'm not sure what to do.

[1] - 
https://elixir.bootlin.com/linux/v5.14-rc1/source/kernel/workqueue.c#L1665
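
The concern above can be illustrated with a small userspace sketch. It
is a toy model under stated assumptions, not the scheduler's code: the
names to_timer, start_timeout and oldest_job_age_at_expiry are
hypothetical, one job is submitted per tick, and job completion
deliberately never touches the timer (the real scheduler may re-arm it
on completion, which this model ignores).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a per-scheduler timeout timer. */
struct to_timer {
	bool pending;
	int expires;	/* tick at which the handler fires */
};

/* Arm the timer unless it is already pending; returns true if armed. */
static bool start_timeout(struct to_timer *t, int now, int timeout)
{
	if (t->pending)
		return false;	/* bail out: an earlier job's timer wins */
	t->pending = true;
	t->expires = now + timeout;
	return true;
}

/*
 * Submit one job per tick; each job completes job_len ticks after it
 * was submitted. Returns the age (in ticks) of the oldest job that is
 * still pending when the timer fires, or -1 if the timer does not fire
 * within `ticks` ticks.
 */
static int oldest_job_age_at_expiry(int ticks, int job_len, int timeout)
{
	struct to_timer t = { false, 0 };
	int now;

	assert(job_len >= 1);

	for (now = 0; now < ticks; now++) {
		/* A new job arrives and tries to start the timer. */
		start_timeout(&t, now, timeout);

		if (now >= t.expires) {
			/* A job submitted at s is pending while now < s + job_len. */
			int oldest = now - (job_len - 1);

			if (oldest < 0)
				oldest = 0;
			return now - oldest;
		}
	}
	return -1;
}
```

With a timeout of 10 ticks and jobs that each finish after 2 ticks, the
timer armed by the first job fires at tick 10 while the oldest pending
job is only 1 tick old, which is the "false TDR" case described above.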

Andrey

>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Thursday, August 19, 2021 6:26 PM
> To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>
> [AMD Official Use Only]
>
> Hi Daniel
>
>>> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> Yeah, we had this thought in mind as well.
>
> Our second approach is to call kthread_park() in the job_timedout() routine so that the "bad" job is guaranteed to be used without the scheduler touching or freeing it. Please check this sample patch as well:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..50a49cb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>   
>          /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> +       kthread_park(sched->thread);
>          spin_lock(&sched->job_list_lock);
>          job = list_first_entry_or_null(&sched->pending_list,
>                                         struct drm_sched_job, list);
>   
>          if (job) {
> -               /*
> -                * Remove the bad job so it cannot be freed by concurrent
> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -                * is parked at which point it's safe.
> -                */
> -               list_del_init(&job->list);
>                  spin_unlock(&sched->job_list_lock);
>   
>                  status = job->sched->ops->timedout_job(job);
> @@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>          } else {
>                  spin_unlock(&sched->job_list_lock);
>          }
> +       kthread_unpark(sched->thread);
>   
>          if (status != DRM_GPU_SCHED_STAT_ENODEV) {
>                  spin_lock(&sched->job_list_lock);
> @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>          kthread_park(sched->thread);
>   
>          /*
> -        * Reinsert back the bad job here - now it's safe as
> -        * drm_sched_get_cleanup_job cannot race against us and release the
> -        * bad job at this point - we parked (waited for) any in progress
> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -        * now until the scheduler thread is unparked.
> -        */
> -       if (bad && bad->sched == sched)
> -               /*
> -                * Add at the head of the queue to reflect it was the earliest
> -                * job extracted.
> -                */
> -               list_add(&bad->list, &sched->pending_list);
> -
> -       /*
>           * Iterate the job list from later to  earlier one and either deactive
>           * their HW callbacks or remove them from pending list if they already
>           * signaled.
>
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Thursday, August 19, 2021 5:31 PM
> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>
> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>
>>>>>>> + dri-devel
>>>>>>>
>>>>>>> Since scheduler is a shared component, please add dri-devel
>>>>>>> on all scheduler patches.
>>>>>>>
>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>>>> [Why]
>>>>>>>> for bailing job, this commit will delete it from pending
>>>>>>>> list thus the bailing job will never have a chance to be
>>>>>>>> resubmitted even in advance tdr mode.
>>>>>>>>
>>>>>>>> [How]
>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race
>>>>>>>> condition that this commit tries to work around is
>>>>>>>> completely solved.So revert this commit.
>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>> v2:
>>>>>>>> add dma_fence_get/put() around timedout_job to avoid
>>>>>>>> concurrent delete during processing timedout_job
>>>>>>>>
>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>      1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>      {
>>>>>>>>             struct drm_gpu_scheduler *sched;
>>>>>>>>             struct drm_sched_job *job;
>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>             enum drm_gpu_sched_stat status =
>>>>>>>> DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>
>>>>>>>>             sched = container_of(work, struct
>>>>>>>> drm_gpu_scheduler, work_tdr.work); @@ -325,11 +326,10 @@
>>>>>>>> static void drm_sched_job_timedout(struct work_struct
>>>>>>>> *work)
>>>>>>>>
>>>>>>>>             if (job) {
>>>>>>>>                     /*
>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>> +                * processing timedout_job
>>>>>>>>                      */
>>>>>>>> -               list_del_init(&job->list);
>>>>>>>> +               fence =
>>>>>>>> + dma_fence_get(job->s_fence->parent);
>>>>>> While this is true for amdgpu, it has no meaning for other
>>>>>> drivers for whom we haven't done the refactoring of embedding
>>>>>> HW fence (parent) into the job structure.
>>>>>> In fact thinking
>>>>>> about it, unless you do the HW fence embedding for all the
>>>>>> drivers using the scheduler you cannot revert this patch or
>>>>>> you will just break them.
>>>>> btw, why did you do that embedding? I do still have my patches
>>>>> with dma_fence annotations floating around, but my idea at least
>>>>> was to fix that issue with a mempool, not with embedding. What
>>>>> was the motivation for embedding the hw fence?
>>>>> -Daniel
>>>> The motivation was 2 fold, avoid memory allocation during jobs
>>>> submissions (HW fence allocation) because as Christian explained
>>>> this leads to deadlock with mm code during evictions due to memory
>>>> pressure (Christian can clarify if I messed
>>> Yeah that's the exact same thing I've chased with my dma_fence
>>> annotations, but thus far zero to none interested in getting it
>>> sorted. I think it'd be good to have some cross-driver agreement on
>>> how this should be solved before someone just charges ahead ...
>>>
>>>> this explanation). Second is to exactly revert this patch because
>>>> while it solved the issue described in the patch it created
>>>> another with drivers that bailed out early during TDR handling for
>>>> various reasons, and the job would just leak because it was already
>>>> removed from the pending list.
>>> Can't we reinsert it before we restart the scheduler thread? It
>>> might need a separate list for that due to the lockless queue
>>> tricks. Or am I thinking about the wrong kind of "we lost the job"?
>>> -Daniel
>>
>> If you look at the original patch it would reinsert it even earlier -
>> right after stopping the SW scheduler thread, and even then it was too
>> late for some drivers as they would decide to return from their
>> TDR handler even before that. It is solvable but in an ugly way as far
>> as I see, you need to require each driver in his code to put the job
>> back in the list if they do it before reaching the place where
>> scheduler framework does it. Kind of spaghetti code seems to me.
> Hm yeah I didn't realize this all happens before we stop the scheduler thread.
>
> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
>
> I'm also still not understanding what exactly you guys have done. Can someone please dig out the amdgpu patches that motivate all this? Maybe that's clearer. A full explanation would still be good since I've only started in scheduler stuff.
>
> Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing" thing.
>
> Or am I completely guessing in the wrong direction?
> -Daniel
>
>> Andrey
>>
>>
>>>> Andrey
>>>>
>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>
>>>>>>>>                     status =
>>>>>>>> job->sched->ops->timedout_job(job);
>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>                             job->sched->ops->free_job(job);
>>>>>>>>                             sched->free_guilty = false;
>>>>>>>>                     }
>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>             } else {
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>             }
>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct
>>>>>>>> drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>
>>>>>>>>             kthread_park(sched->thread);
>>>>>>>>
>>>>>>>> -       /*
>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>> -        */
>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>> -               /*
>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>> -                * job extracted.
>>>>>>>> -                */
>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>> -
>>>>>>>>             /*
>>>>>>>>              * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>              * their HW callbacks or remove them from
>>>>>>>> pending list if they already
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-20 14:07                   ` Andrey Grodzovsky
@ 2021-08-24  7:24                     ` Liu, Monk
  2021-08-24 14:23                       ` Andrey Grodzovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2021-08-24  7:24 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Daniel Vetter, Koenig, Christian
  Cc: Alex Deucher, Chen, JingWen, Maling list - DRI developers, amd-gfx list

[AMD Official Use Only]

Hi Andrey

Sorry, but it is really hard for me to identify any specific, concrete potential bug from your reply. Can you be more specific, e.g. what kind of race is introduced by this "kthread_park()/kthread_unpark()" approach?

To your other question/concern:
>> . In a constant rapid stream of jobs each new job coming in will try to start the timer but most of the time this operation just bails out as there is already a pending timer from one of the previous jobs which cancels out any new ones [1] so, when the TO handler does execute eventually it's not because something is wrong but simply because the TO has expired

I totally agree with you on this point, and I think I have a patch to address it, but this problem is not related to our current topic at all ... our current topic is the handling of the bailed-out bad job in advanced TDR mode.

The bug here is that our current TO handler only starts counting for the first job on a given scheduler, and subsequent jobs do not restart the TO at all. I can assure you that this is a regression, because when I implemented TDR years ago I had already planned for this problem.
Please check this change, which resolves it:

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..7b5f99a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
                schedule_delayed_work(&sched->work_tdr, sched->timeout);
 }
 
+static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
+{
+       if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
+           !list_empty(&sched->pending_list))
+               mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
+}
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
@@ -693,6 +682,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
        if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
                /* remove job from pending_list */
                list_del_init(&job->list);
+
+               /* once the job deleted from pending list we should restart
+                * the timeout calculation for the next job.
+                */
+               drm_sched_restart_timeout(sched);
                /* make the scheduled timestamp more accurate */
                next = list_first_entry_or_null(&sched->pending_list,
                                                typeof(*next), list);
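
For illustration only, the effect of restarting the timeout on job completion can be modelled in plain userspace C. This is a sketch under stated assumptions, not kernel code: struct model_timer, model_schedule_delayed(), model_mod_delayed() and first_timeout_with_restart() are hypothetical names, one job is submitted per tick, and each job completes `runtime` ticks later unless it was submitted at or after `hang_from`. The two helpers only imitate the "no-op if already pending" and "always re-arm" behaviour of schedule_delayed_work() and mod_delayed_work().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the scheduler's delayed TDR work. */
struct model_timer {
    bool pending;
    long expires;
};

/* Imitates schedule_delayed_work(): does nothing if already pending. */
static bool model_schedule_delayed(struct model_timer *t, long now, long timeout)
{
    if (t->pending)
        return false;
    t->pending = true;
    t->expires = now + timeout;
    return true;
}

/* Imitates mod_delayed_work(): always re-arms; returns whether it was pending. */
static bool model_mod_delayed(struct model_timer *t, long now, long timeout)
{
    bool was_pending = t->pending;

    t->pending = true;
    t->expires = now + timeout;
    return was_pending;
}

/*
 * Returns the first tick at which the timer fires while jobs are still
 * pending, or -1 if that never happens within `ticks`.
 */
static long first_timeout_with_restart(long ticks, long runtime, long timeout,
                                       long hang_from)
{
    struct model_timer t = { false, 0 };
    long submitted = 0, completed = 0;

    for (long now = 0; now < ticks; now++) {
        long s = now - runtime;   /* job that would complete at this tick */

        if (s >= 0 && s < hang_from) {
            completed = s + 1;
            /* Restart the timeout for the next job if work remains. */
            if (submitted > completed)
                model_mod_delayed(&t, now, timeout);
        }

        if (t.pending && now >= t.expires) {
            t.pending = false;
            if (submitted > completed)
                return now;
        }

        submitted++;
        model_schedule_delayed(&t, now, timeout);
    }
    return -1;
}
```

In this model, first_timeout_with_restart(100, 2, 10, 1000) never reports a timeout because every completion pushes the expiry back, while with hang_from set to 20 the timer fires 10 ticks after the last job that actually completed.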


If you guys do not have concerns I can submit this patch for review, but again, let's focus on handling the bailed-out bad job as our priority; we are very close to our goal. Let me know which race you are concerned about and we can address it.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Friday, August 20, 2021 10:07 PM
To: Liu, Monk <Monk.Liu@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>
Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."


On 2021-08-20 3:12 a.m., Liu, Monk wrote:
> [AMD Official Use Only]
>
> @Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
>   
> Do you have any concerns about the kthread_park() approach?
>
> Theoretically speaking, sched_main should run mutually exclusively with
> job_timeout since they both touch jobs, and stopping the scheduler during
> job_timeout won't impact performance since in that scenario there was
> already something wrong/stuck on that ring/scheduler.


Regarding the last paragraph, and specifically the claim that something was already wrong if the TO handler starts executing: I am not sure about this, and I wonder if we have a potential bug here. When we start the timeout timer in drm_sched_job_begin we do it for each new incoming job. In a constant rapid stream of jobs, each new job coming in will try to start the timer, but most of the time this operation just bails out because there is already a pending timer from one of the previous jobs, which cancels out any new ones [1]. So when the TO handler does eventually execute, it is not because something is wrong but simply because the TO has expired. If the pending list is not empty in this case, a false TDR will be triggered. I think long ago we used a TO handler per job and not per scheduler; this would solve this problem but hurt the serialization issue we are trying to solve. So I am not sure what to do.

[1] -
https://elixir.bootlin.com/linux/v5.14-rc1/source/kernel/workqueue.c#L1665
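
To make the concern concrete, here is a minimal userspace sketch of the behaviour as described in this mail; it is not the real scheduler. The names model_tdr_timer, model_arm_if_idle() and first_false_timeout() are made up for the illustration, and the model assumes one job is submitted per tick, every job finishes `runtime` ticks after submission, and arming an already pending timer is a no-op.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-scheduler timeout work. */
struct model_tdr_timer {
    bool pending;
    long expires;
};

/* Arms the timer only if it is idle, like queueing already-pending work. */
static bool model_arm_if_idle(struct model_tdr_timer *t, long now, long timeout)
{
    if (t->pending)
        return false;
    t->pending = true;
    t->expires = now + timeout;
    return true;
}

/*
 * Returns the first tick at which the timer expires while the pending
 * list is non-empty, or -1 if that does not happen within `ticks`.
 * No job in this model ever runs longer than `runtime` ticks.
 */
static long first_false_timeout(long ticks, long runtime, long timeout)
{
    struct model_tdr_timer t = { false, 0 };
    long submitted = 0, completed = 0;

    for (long now = 0; now < ticks; now++) {
        /* Jobs submitted at tick s are done at tick s + runtime. */
        if (now >= runtime)
            completed = now - runtime + 1;

        if (t.pending && now >= t.expires) {
            t.pending = false;
            if (submitted > completed)
                return now;   /* handler runs although nothing is stuck */
        }

        submitted++;
        model_arm_if_idle(&t, now, timeout);   /* mostly a no-op */
    }
    return -1;
}
```

With a runtime of 2 ticks and a timeout of 10 ticks, first_false_timeout(100, 2, 10) reports an expiry at tick 10 even though no individual job came close to the timeout.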

Andrey

>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Thursday, August 19, 2021 6:26 PM
> To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>
> Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen 
> <JingWen.Chen2@amd.com>; Maling list - DRI developers 
> <dri-devel@lists.freedesktop.org>; amd-gfx list 
> <amd-gfx@lists.freedesktop.org>; Koenig, Christian 
> <Christian.Koenig@amd.com>
> Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>
> [AMD Official Use Only]
>
> Hi Daniel
>
>>> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> Yeah, we had this thought in mind as well.
>
> Our second approach is to call kthread_park() in the job_timedout() routine so that the "bad" job is guaranteed to be used without the scheduler touching or freeing it. Please check this sample patch as well:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..50a49cb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>   
>          /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> +       kthread_park(sched->thread);
>          spin_lock(&sched->job_list_lock);
>          job = list_first_entry_or_null(&sched->pending_list,
>                                         struct drm_sched_job, list);
>   
>          if (job) {
> -               /*
> -                * Remove the bad job so it cannot be freed by concurrent
> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -                * is parked at which point it's safe.
> -                */
> -               list_del_init(&job->list);
>                  spin_unlock(&sched->job_list_lock);
>   
>                  status = job->sched->ops->timedout_job(job);
> @@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>          } else {
>                  spin_unlock(&sched->job_list_lock);
>          }
> +       kthread_unpark(sched->thread);
>   
>          if (status != DRM_GPU_SCHED_STAT_ENODEV) {
>                  spin_lock(&sched->job_list_lock);
> @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>          kthread_park(sched->thread);
>   
>          /*
> -        * Reinsert back the bad job here - now it's safe as
> -        * drm_sched_get_cleanup_job cannot race against us and release the
> -        * bad job at this point - we parked (waited for) any in progress
> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -        * now until the scheduler thread is unparked.
> -        */
> -       if (bad && bad->sched == sched)
> -               /*
> -                * Add at the head of the queue to reflect it was the earliest
> -                * job extracted.
> -                */
> -               list_add(&bad->list, &sched->pending_list);
> -
> -       /*
>           * Iterate the job list from later to  earlier one and either deactive
>           * their HW callbacks or remove them from pending list if they already
>           * signaled.
>
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Thursday, August 19, 2021 5:31 PM
> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher 
> <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling 
> list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list 
> <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, 
> Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>
> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>
>>>>>>> + dri-devel
>>>>>>>
>>>>>>> Since scheduler is a shared component, please add dri-devel on 
>>>>>>> all scheduler patches.
>>>>>>>
>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>>>> [Why]
>>>>>>>> for bailing job, this commit will delete it from pending list 
>>>>>>>> thus the bailing job will never have a chance to be resubmitted 
>>>>>>>> even in advance tdr mode.
>>>>>>>>
>>>>>>>> [How]
>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race 
>>>>>>>> condition that this commit tries to work around is completely 
>>>>>>>> solved.So revert this commit.
>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>> v2:
>>>>>>>> add dma_fence_get/put() around timedout_job to avoid concurrent 
>>>>>>>> delete during processing timedout_job
>>>>>>>>
>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>      1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>      {
>>>>>>>>             struct drm_gpu_scheduler *sched;
>>>>>>>>             struct drm_sched_job *job;
>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>             enum drm_gpu_sched_stat status = 
>>>>>>>> DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>
>>>>>>>>             sched = container_of(work, struct 
>>>>>>>> drm_gpu_scheduler, work_tdr.work); @@ -325,11 +326,10 @@ static 
>>>>>>>> void drm_sched_job_timedout(struct work_struct
>>>>>>>> *work)
>>>>>>>>
>>>>>>>>             if (job) {
>>>>>>>>                     /*
>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>> +                * processing timedout_job
>>>>>>>>                      */
>>>>>>>> -               list_del_init(&job->list);
>>>>>>>> +               fence =
>>>>>>>> + dma_fence_get(job->s_fence->parent);
>>>>>> While this is true for amdgpu, it has no meaning for other 
>>>>>> drivers for whom we haven't done the refactoring of embedding HW 
>>>>>> fence (parent) into the job structure.
>>>>>> In fact thinking
>>>>>> about it, unless you do the HW fence embedding for all the 
>>>>>> drivers using the scheduler you cannot revert this patch or you 
>>>>>> will just break them.
>>>>> btw, why did you do that embedding? I do still have my patches 
>>>>> with dma_fence annotations floating around, but my idea at least 
>>>>> was to fix that issue with a mempool, not with embedding. What
>>>>> was the motivation for embedding the hw fence?
>>>>> -Daniel
>>>> The motivation was 2 fold, avoid memory allocation during jobs 
>>>> submissions (HW fence allocation) because as Christian explained 
>>>> this leads to deadlock with mm code during evictions due to memory 
>>>> pressure (Christian can clarify if I messed
>>> Yeah that's the exact same thing I've chased with my dma_fence 
>>> annotations, but thus far zero to none interested in getting it 
>>> sorted. I think it'd be good to have some cross-driver agreement on 
>>> how this should be solved before someone just charges ahead ...
>>>
>>>> this explanation). Second is to exactly revert this patch because 
>>>> while it solved the issue described in the patch it created another 
>>>> with drivers that bailed out early during TDR handling for various
>>>> reasons, and the job would just leak because it was already removed
>>>> from the pending list.
>>> Can't we reinsert it before we restart the scheduler thread? It 
>>> might need a separate list for that due to the lockless queue 
>>> tricks. Or am I thinking about the wrong kind of "we lost the job"?
>>> -Daniel
>>
>> If you look at the original patch it would reinsert it even earlier - 
>> right after stopping the SW scheduler thread, and even then it was
>> too late for some drivers as they would decide to return from
>> their TDR handler even before that. It is solvable but in an ugly way 
>> as far as I see, you need to require each driver in his code to put 
>> the job back in the list if they do it before reaching the place 
>> where scheduler framework does it. Kind of spaghetti code seems to me.
> Hm yeah I didn't realize this all happens before we stop the scheduler thread.
>
> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
>
> I'm also still not understanding what exactly you guys have done. Can someone please dig out the amdgpu patches that motivate all this? Maybe that's clearer. A full explanation would still be good since I've only started in scheduler stuff.
>
> Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing" thing.
>
> Or am I completely guessing in the wrong direction?
> -Daniel
>
>> Andrey
>>
>>
>>>> Andrey
>>>>
>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>
>>>>>>>>                     status =
>>>>>>>> job->sched->ops->timedout_job(job);
>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>                             job->sched->ops->free_job(job);
>>>>>>>>                             sched->free_guilty = false;
>>>>>>>>                     }
>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>             } else {
>>>>>>>>                     spin_unlock(&sched->job_list_lock);
>>>>>>>>             }
>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct 
>>>>>>>> drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>
>>>>>>>>             kthread_park(sched->thread);
>>>>>>>>
>>>>>>>> -       /*
>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>> -        */
>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>> -               /*
>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>> -                * job extracted.
>>>>>>>> -                */
>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>> -
>>>>>>>>             /*
>>>>>>>>              * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>              * their HW callbacks or remove them from pending 
>>>>>>>> list if they already
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-24  7:24                     ` Liu, Monk
@ 2021-08-24 14:23                       ` Andrey Grodzovsky
  0 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 14:23 UTC (permalink / raw)
  To: Liu, Monk, Daniel Vetter, Koenig, Christian
  Cc: Alex Deucher, Chen, JingWen, Maling list - DRI developers, amd-gfx list

On 2021-08-24 3:24 a.m., Liu, Monk wrote:

> [AMD Official Use Only]
>
> Hi Andrey
>
> Sorry, but it is really hard for me to identify any specific, concrete potential bug from your reply. Can you be more specific, e.g. what kind of race is introduced by this "kthread_park()/kthread_unpark()" approach?


Hey, you might have missed my replies in the thread regarding this. 
Check them here.

https://www.spinics.net/lists/amd-gfx/msg67041.html
https://www.spinics.net/lists/amd-gfx/msg67090.html

In summary, IMHO we can park/unpark only within a section that is
serialized against all other possible TDR handlers (at the whole-ASIC or
even XGMI hive level).
Today we achieve this by locking. In the new proposal there is no
locking, so we either add a lock or serialize TDRs to single-thread
execution.
Let me know if you think it's not actually an issue - I might be missing
something.
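
As a rough illustration of the kind of overlap I mean, here is a userspace toy model, not kernel code. The names model_sched, tdr_enter(), tdr_exit(), run_overlapping() and run_serialized() are invented for this sketch, and it assumes the park state behaves like a plain flag with no nesting count, so the first handler to unpark lets the scheduler thread run again.

```c
#include <stdbool.h>

/* Hypothetical model of one scheduler touched by several TDR handlers. */
struct model_sched {
    bool parked;      /* scheduler thread is parked */
    int handlers;     /* TDR handlers still working on their bad job */
    int violations;   /* times the thread resumed while a handler was active */
};

static void tdr_enter(struct model_sched *s)
{
    s->parked = true;     /* kthread_park() analogue */
    s->handlers++;
}

static void tdr_exit(struct model_sched *s)
{
    s->handlers--;
    s->parked = false;    /* kthread_unpark() analogue */
    if (s->handlers > 0)
        s->violations++;  /* another handler has lost its protection */
}

/* Two handlers overlap: A enters, B enters, A exits, B exits. */
static int run_overlapping(void)
{
    struct model_sched s = { false, 0, 0 };

    tdr_enter(&s);
    tdr_enter(&s);
    tdr_exit(&s);
    tdr_exit(&s);
    return s.violations;
}

/* The same two handlers, run one after the other. */
static int run_serialized(void)
{
    struct model_sched s = { false, 0, 0 };

    tdr_enter(&s);
    tdr_exit(&s);
    tdr_enter(&s);
    tdr_exit(&s);
    return s.violations;
}
```

In this model run_overlapping() records one violation and run_serialized() records none, which is why I think park/unpark needs either a lock around the handlers or single-threaded TDR execution.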

Andrey


>
> To your other question/concern:
>>> . In a constant rapid stream of jobs each new job coming in will try to start the timer but most of the time this operation just bails out as there is already a pending timer from one of the previous jobs which cancels out any new ones [1] so, when the TO handler does execute eventually it's not because something is wrong but simply because the TO has expired
>
> I totally agree with you on this point, and I think I have a patch to address it, but this problem is not related to our current topic at all ... our current topic is the handling of the bailed-out bad job in advanced TDR mode.
>
> The bug here is that our current TO handler only starts counting for the first job on a given scheduler, and subsequent jobs do not restart the TO at all. I can assure you that this is a regression, because when I implemented TDR years ago I had already planned for this problem.
> Please check this change, which resolves it:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..7b5f99a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>                  schedule_delayed_work(&sched->work_tdr, sched->timeout);
>   }
>   
> +static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
> +{
> +       if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> +           !list_empty(&sched->pending_list))
> +               mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
> +}
> +
>   /**
>    * drm_sched_fault - immediately start timeout handler
>    *
> @@ -693,6 +682,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>          if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>                  /* remove job from pending_list */
>                  list_del_init(&job->list);
> +
> +               /* once the job deleted from pending list we should restart
> +                * the timeout calculation for the next job.
> +                */
> +               drm_sched_restart_timeout(sched);
>                  /* make the scheduled timestamp more accurate */
>                  next = list_first_entry_or_null(&sched->pending_list,
>                                                  typeof(*next), list);
>
>
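
A small userspace model of the timer behaviour discussed above may help
here. It only mimics the two workqueue semantics in question:
schedule_delayed_work() leaves an already pending work item alone, while
mod_delayed_work() resets its deadline. struct delayed_work_model, the
model_* helpers and deadline_after_stream() are hypothetical names for
illustration, time is counted in arbitrary ticks, and each job is assumed
to complete one tick after it was submitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for a delayed work item: pending flag plus deadline. */
struct delayed_work_model {
	bool pending;
	long expires;	/* absolute time in arbitrary ticks */
};

/*
 * Models schedule_delayed_work(): if the work is already pending it is
 * left untouched and false is returned, so the old deadline stays.
 */
static bool model_schedule_delayed_work(struct delayed_work_model *w,
					long now, long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/*
 * Models mod_delayed_work(): the deadline is always reset relative to
 * now; the return value says whether the work was already pending.
 */
static bool model_mod_delayed_work(struct delayed_work_model *w,
				   long now, long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}

/*
 * Submit one job per tick. Returns the timer deadline in force once the
 * last job has been submitted.
 */
long deadline_after_stream(int njobs, long timeout, bool rearm_on_completion)
{
	struct delayed_work_model tdr = { .pending = false, .expires = 0 };
	long now;

	assert(njobs >= 0 && timeout > 0);

	for (now = 0; now < njobs; now++) {
		/* New job arrives: try to start the timer (no-op if pending). */
		model_schedule_delayed_work(&tdr, now, timeout);

		/* The previous job completes while the new one is pending. */
		if (rearm_on_completion && now > 0)
			model_mod_delayed_work(&tdr, now, timeout);
	}
	return tdr.expires;
}
```

With 5 jobs and a timeout of 10 ticks, the variant without the re-arm
still expires at tick 10 although the only job left was submitted at tick
4; with the re-arm on completion the deadline moves to tick 14.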
> If you guys do not have concerns I can submit this patch for review, but again, let's focus on the bailing-out bad job handling as our priority; we are very close to our goal. Let me know which race issue concerns you and we can address it.
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Friday, August 20, 2021 10:07 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>
> Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>
>
> On 2021-08-20 3:12 a.m., Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>> @Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
>>    
>> Do you have any concerns about the kthread_park() approach?
>>
>> Theoretically speaking, sched_main should run mutually exclusively with
>> job_timeout since they both touch jobs, and stopping the scheduler during
>> job_timeout won't impact performance since in that scenario there was
>> already something wrong/stuck on that ring/scheduler.
>
> Regarding the last paragraph, and specifically the claim that there was already something wrong if the TO handler starts execution: I'm not sure about this and I wonder if we have a potential bug here. When we start the timeout timer in drm_sched_job_begin we do it for each new incoming job. In a constant rapid stream of jobs each new job coming in will try to start the timer, but most of the time this operation just bails out as there is already a pending timer from one of the previous jobs, which cancels out any new ones [1]. So when the TO handler does eventually execute, it's not because something is wrong but simply because the TO has expired. If in this case the pending list is not empty, a false TDR will be triggered. I think long ago we used a TO handler per job and not per scheduler; this would solve this problem but hurt the serialization issue we are trying to solve. So I'm not sure what to do.
>
> [1] -
> https://elixir.bootlin.com/linux/v5.14-rc1/source/kernel/workqueue.c#L1665
>
> Andrey
>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Liu, Monk
>> Sent: Thursday, August 19, 2021 6:26 PM
>> To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>
>> Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen
>> <JingWen.Chen2@amd.com>; Maling list - DRI developers
>> <dri-devel@lists.freedesktop.org>; amd-gfx list
>> <amd-gfx@lists.freedesktop.org>; Koenig, Christian
>> <Christian.Koenig@amd.com>
>> Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>>
>> [AMD Official Use Only]
>>
>> Hi Daniel
>>
>>>> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
>> Yeah, we had this thought in mind as well.
>>
>> Our second approach is to call kthread_park() in the job_timedout() routine so that the "bad" job is guaranteed to be used without the scheduler touching or freeing it. Please check this sample patch as well:
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a2a9536..50a49cb 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>           sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>    
>>           /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>> +       kthread_park(sched->thread);
>>           spin_lock(&sched->job_list_lock);
>>           job = list_first_entry_or_null(&sched->pending_list,
>>                                          struct drm_sched_job, list);
>>    
>>           if (job) {
>> -               /*
>> -                * Remove the bad job so it cannot be freed by concurrent
>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>> -                * is parked at which point it's safe.
>> -                */
>> -               list_del_init(&job->list);
>>                   spin_unlock(&sched->job_list_lock);
>>    
>>                   status = job->sched->ops->timedout_job(job);
>> @@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>           } else {
>>                   spin_unlock(&sched->job_list_lock);
>>           }
>> +       kthread_unpark(sched->thread);
>>    
>>           if (status != DRM_GPU_SCHED_STAT_ENODEV) {
>>                   spin_lock(&sched->job_list_lock);
>> @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>           kthread_park(sched->thread);
>>    
>>           /*
>> -        * Reinsert back the bad job here - now it's safe as
>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>> -        * bad job at this point - we parked (waited for) any in progress
>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>> -        * now until the scheduler thread is unparked.
>> -        */
>> -       if (bad && bad->sched == sched)
>> -               /*
>> -                * Add at the head of the queue to reflect it was the earliest
>> -                * job extracted.
>> -                */
>> -               list_add(&bad->list, &sched->pending_list);
>> -
>> -       /*
>>            * Iterate the job list from later to  earlier one and either deactive
>>            * their HW callbacks or remove them from pending list if they already
>>            * signaled.
>>
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Daniel Vetter <daniel@ffwll.ch>
>> Sent: Thursday, August 19, 2021 5:31 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>> Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher
>> <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling
>> list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list
>> <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig,
>> Christian <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
>>
>> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>>
>>>>>>>> + dri-devel
>>>>>>>>
>>>>>>>> Since scheduler is a shared component, please add dri-devel on
>>>>>>>> all scheduler patches.
>>>>>>>>
>>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>>>>> [Why]
>>>>>>>>> for bailing job, this commit will delete it from pending list
>>>>>>>>> thus the bailing job will never have a chance to be resubmitted
>>>>>>>>> even in advance tdr mode.
>>>>>>>>>
>>>>>>>>> [How]
>>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race
>>>>>>>>> condition that this commit tries to work around is completely
>>>>>>>>> solved.So revert this commit.
>>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>>> v2:
>>>>>>>>> add dma_fence_get/put() around timedout_job to avoid concurrent
>>>>>>>>> delete during processing timedout_job
>>>>>>>>>
>>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>>> ---
>>>>>>>>>       drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>>       1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>       {
>>>>>>>>>              struct drm_gpu_scheduler *sched;
>>>>>>>>>              struct drm_sched_job *job;
>>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>>              enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>>
>>>>>>>>>              sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>
>>>>>>>>>              if (job) {
>>>>>>>>>                      /*
>>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>>> +                * processing timedout_job
>>>>>>>>>                       */
>>>>>>>>> -               list_del_init(&job->list);
>>>>>>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>>>>>> While this is true for amdgpu, it has no meaning for other
>>>>>>> drivers for whom we haven't done the refactoring of embedding HW
>>>>>>> fence (parent) into the job structure.
>>>>>>> In fact thinking
>>>>>>> about it, unless you do the HW fence embedding for all the
>>>>>>> drivers using the scheduler you cannot revert this patch or you
>>>>>>> will just break them.
>>>>>> btw, why did you do that embedding? I do still have my patches
>>>>>> with dma_fence annotations floating around, but my idea at least
>>>>>> was to fix that issue with a mempool, not with embedding. What
>>>>>> was the motivation for embedding the hw fence?
>>>>>> -Daniel
>>>>> The motivation was twofold. First, avoid memory allocation during job
>>>>> submission (HW fence allocation) because, as Christian explained,
>>>>> this leads to deadlock with mm code during evictions due to memory
>>>>> pressure (Christian can clarify if I messed up
>>>> Yeah that's the exact same thing I've chased with my dma_fence
>>>> annotations, but thus far there's been little to no interest in getting it
>>>> sorted. I think it'd be good to have some cross-driver agreement on
>>>> how this should be solved before someone just charges ahead ...
>>>>
>>>>> this explanation). The second is precisely to revert this patch because
>>>>> while it solved the issue described in the patch it created another
>>>>> with drivers who bailed out early during TDR handling for various
>>>>> reasons and the job would just leak because it was already removed
>>>>> from the pending list.
>>>> Can't we reinsert it before we restart the scheduler thread? It
>>>> might need a separate list for that due to the lockless queue
>>>> tricks. Or am I thinking about the wrong kind of "we lost the job"?
>>>> -Daniel
>>> If you look at the original patch it would reinsert it even earlier -
>>> right after stopping the SW scheduler thread, and even then it was
>>> too late for some drivers as they would decide to return from
>>> their TDR handler even before that. It is solvable but in an ugly way
>>> as far as I see: you need to require each driver in its code to put
>>> the job back in the list if it returns before reaching the place
>>> where the scheduler framework does it. Kind of spaghetti code, it seems to me.
>> Hm yeah I didn't realize this all happens before we stop the scheduler thread.
>>
>> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
>>
>> I'm also still not understanding what exactly you guys have done. Can someone please dig out the amdgpu patches that motivate all this? Maybe that's clearer. A full explanation would still be good since I've only started in scheduler stuff.
>>
>> Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing" thing.
>>
>> Or am I completely guessing in the wrong direction?
>> -Daniel
>>
>>> Andrey
>>>
>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>>                      spin_unlock(&sched->job_list_lock);
>>>>>>>>>
>>>>>>>>>                      status = job->sched->ops->timedout_job(job);
>>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>                              job->sched->ops->free_job(job);
>>>>>>>>>                              sched->free_guilty = false;
>>>>>>>>>                      }
>>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>>              } else {
>>>>>>>>>                      spin_unlock(&sched->job_list_lock);
>>>>>>>>>              }
>>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>
>>>>>>>>>              kthread_park(sched->thread);
>>>>>>>>>
>>>>>>>>> -       /*
>>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>>> -        */
>>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>>> -               /*
>>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>>> -                * job extracted.
>>>>>>>>> -                */
>>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>>> -
>>>>>>>>>              /*
>>>>>>>>>               * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>>               * their HW callbacks or remove them from pending list if they already
>>>>>>>>> --
>>>>>>>>> 2.25.1
>>>>>>>>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-20  7:20                   ` Christian König
  2021-08-20  8:09                     ` Jingwen Chen
@ 2021-08-26  8:59                     ` Daniel Vetter
  1 sibling, 0 replies; 30+ messages in thread
From: Daniel Vetter @ 2021-08-26  8:59 UTC (permalink / raw)
  To: Christian König
  Cc: Liu, Monk, Daniel Vetter, Grodzovsky, Andrey, Alex Deucher, Chen,
	JingWen, Maling list - DRI developers, amd-gfx list

On Fri, Aug 20, 2021 at 09:20:42AM +0200, Christian König wrote:
> No, that perfectly works for me.
> 
> The problem we used to have with this approach was that we potentially have
> multiple timeouts at the same time.
> 
> But when we serialize the timeout handling by using a single workqueue as
> suggested by Daniel now as well then that isn't an issue any more.

Sorry I got massively buried in everything, catching up. Iirc there's a
special function for parking schedulers (which panfrost now uses to handle
its cross-engine reset), would be good to use that.

And yeah if your reset code is potentially spanning across engines I think
you need a single workqueue to make sure stuff doesn't go boom. Tbh might
be best to check out what panfrost has done and ask panfrost folks for an
ack on your approach.
-Daniel

> 
> Regards,
> Christian.
> 
> Am 20.08.21 um 09:12 schrieb Liu, Monk:
> > [AMD Official Use Only]
> > 
> > @Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
> > Do you have any concerns about the kthread_park() approach?
> > 
> > Theoretically speaking, sched_main should run mutually exclusively with job_timeout since they both touch jobs, and stopping the scheduler during job_timeout won't impact performance since in that scenario
> > there was already something wrong/stuck on that ring/scheduler.
> > 
> > Thanks
> > 
> > ------------------------------------------
> > Monk Liu | Cloud-GPU Core team
> > ------------------------------------------
> > 
> > -----Original Message-----
> > From: Liu, Monk
> > Sent: Thursday, August 19, 2021 6:26 PM
> > To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> > Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Koenig, Christian <Christian.Koenig@amd.com>
> > Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
> > 
> > [AMD Official Use Only]
> > 
> > Hi Daniel
> > 
> > > > Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> > Yeah, we had this thought in mind as well.
> > 
> > Our second approach is to call kthread_park() in the job_timedout() routine so that the "bad" job is guaranteed to be used without the scheduler touching or freeing it. Please check this sample patch as well:
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index a2a9536..50a49cb 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >          /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> > +       kthread_park(sched->thread);
> >          spin_lock(&sched->job_list_lock);
> >          job = list_first_entry_or_null(&sched->pending_list,
> >                                         struct drm_sched_job, list);
> >          if (job) {
> > -               /*
> > -                * Remove the bad job so it cannot be freed by concurrent
> > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > -                * is parked at which point it's safe.
> > -                */
> > -               list_del_init(&job->list);
> >                  spin_unlock(&sched->job_list_lock);
> >                  status = job->sched->ops->timedout_job(job);
> > @@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >          } else {
> >                  spin_unlock(&sched->job_list_lock);
> >          }
> > +       kthread_unpark(sched->thread);
> >          if (status != DRM_GPU_SCHED_STAT_ENODEV) {
> >                  spin_lock(&sched->job_list_lock);
> > @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >          kthread_park(sched->thread);
> >          /*
> > -        * Reinsert back the bad job here - now it's safe as
> > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > -        * bad job at this point - we parked (waited for) any in progress
> > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > -        * now until the scheduler thread is unparked.
> > -        */
> > -       if (bad && bad->sched == sched)
> > -               /*
> > -                * Add at the head of the queue to reflect it was the earliest
> > -                * job extracted.
> > -                */
> > -               list_add(&bad->list, &sched->pending_list);
> > -
> > -       /*
> >           * Iterate the job list from later to  earlier one and either deactive
> >           * their HW callbacks or remove them from pending list if they already
> >           * signaled.
> > 
> > 
> > Thanks
> > 
> > ------------------------------------------
> > Monk Liu | Cloud-GPU Core team
> > ------------------------------------------
> > 
> > -----Original Message-----
> > From: Daniel Vetter <daniel@ffwll.ch>
> > Sent: Thursday, August 19, 2021 5:31 PM
> > To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> > Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
> > 
> > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > > 
> > > > > > > > + dri-devel
> > > > > > > > 
> > > > > > > > Since scheduler is a shared component, please add dri-devel
> > > > > > > > on all scheduler patches.
> > > > > > > > 
> > > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > > > [Why]
> > > > > > > > > for bailing job, this commit will delete it from pending
> > > > > > > > > list thus the bailing job will never have a chance to be
> > > > > > > > > resubmitted even in advance tdr mode.
> > > > > > > > > 
> > > > > > > > > [How]
> > > > > > > > > after embeded hw_fence into amdgpu_job is done, the race
> > > > > > > > > condition that this commit tries to work around is
> > > > > > > > > completely solved.So revert this commit.
> > > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > > v2:
> > > > > > > > > add dma_fence_get/put() around timedout_job to avoid
> > > > > > > > > concurrent delete during processing timedout_job
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > > > ---
> > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > > >      1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > >      {
> > > > > > > > > >             struct drm_gpu_scheduler *sched;
> > > > > > > > > >             struct drm_sched_job *job;
> > > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > > >             enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > > 
> > > > > > > > > >             sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > 
> > > > > > > > >             if (job) {
> > > > > > > > >                     /*
> > > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > > +                * processing timedout_job
> > > > > > > > >                      */
> > > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > +               fence =
> > > > > > > > > + dma_fence_get(job->s_fence->parent);
> > > > > > > While this is true for amdgpu, it has no meaning for other
> > > > > > > drivers for whom we haven't done the refactoring of embedding
> > > > > > > HW fence (parent) into the job structure.
> > > > > > > In fact thinking
> > > > > > > about it, unless you do the HW fence embedding for all the
> > > > > > > drivers using the scheduler you cannot revert this patch or
> > > > > > > you will just break them.
> > > > > > > btw, why did you do that embedding? I do still have my patches
> > > > > > > with dma_fence annotations floating around, but my idea at least
> > > > > > > was to fix that issue with a mempool, not with embedding. What
> > > > > > > was the motivation for embedding the hw fence?
> > > > > > -Daniel
> > > > > > The motivation was twofold. First, avoid memory allocation during job
> > > > > > submission (HW fence allocation) because, as Christian explained,
> > > > > > this leads to deadlock with mm code during evictions due to memory
> > > > > > pressure (Christian can clarify if I messed up
> > > > > Yeah that's the exact same thing I've chased with my dma_fence
> > > > > annotations, but thus far there's been little to no interest in getting it
> > > > > sorted. I think it'd be good to have some cross-driver agreement on
> > > > > how this should be solved before someone just charges ahead ...
> > > > 
> > > > > > this explanation). The second is precisely to revert this patch because
> > > > > > while it solved the issue described in the patch it created
> > > > > > another with drivers who bailed out early during TDR handling for
> > > > > > various reasons and the job would just leak because it was already
> > > > > > removed from the pending list.
> > > > Can't we reinsert it before we restart the scheduler thread? It
> > > > might need a separate list for that due to the lockless queue
> > > > tricks. Or am I thinking about the wrong kind of "we lost the job"?
> > > > > -Daniel
> > > 
> > > > If you look at the original patch it would reinsert it even earlier -
> > > > right after stopping the SW scheduler thread, and even then it was too
> > > > late for some drivers as they would decide to return from their
> > > > TDR handler even before that. It is solvable but in an ugly way as far
> > > > as I see: you need to require each driver in its code to put the job
> > > > back in the list if it returns before reaching the place where the
> > > > scheduler framework does it. Kind of spaghetti code, it seems to me.
> > > Hm yeah I didn't realize this all happens before we stop the scheduler thread.
> > > 
> > > Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> > > 
> > > I'm also still not understanding what exactly you guys have done. Can someone please dig out the amdgpu patches that motivate all this? Maybe that's clearer. A full explanation would still be good since I've only started in scheduler stuff.
> > > 
> > > Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing" thing.
> > > 
> > > Or am I completely guessing in the wrong direction?
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > > > Andrey
> > > > > 
> > > > > 
> > > > > > > Andrey
> > > > > > > 
> > > > > > > 
> > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > > 
> > > > > > > > >                     status =
> > > > > > > > > job->sched->ops->timedout_job(job);
> > > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > >                             job->sched->ops->free_job(job);
> > > > > > > > >                             sched->free_guilty = false;
> > > > > > > > >                     }
> > > > > > > > > +               dma_fence_put(fence);
> > > > > > > > >             } else {
> > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > >             }
> > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct
> > > > > > > > > drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > > 
> > > > > > > > >             kthread_park(sched->thread);
> > > > > > > > > 
> > > > > > > > > -       /*
> > > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > > -        */
> > > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > > -               /*
> > > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > > -                * job extracted.
> > > > > > > > > -                */
> > > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > > -
> > > > > > > > >             /*
> > > > > > > > >              * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > >              * their HW callbacks or remove them from
> > > > > > > > > pending list if they already
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-19 15:25               ` Andrey Grodzovsky
@ 2021-08-26  9:04                 ` Daniel Vetter
  2021-08-31 13:11                   ` Daniel Vetter
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Vetter @ 2021-08-26  9:04 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Daniel Vetter, Alex Deucher, Jingwen Chen,
	Maling list - DRI developers, amd-gfx list, monk.liu,
	Christian Koenig

On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > > 
> > > > > > > > + dri-devel
> > > > > > > > 
> > > > > > > > Since scheduler is a shared component, please add dri-devel on all
> > > > > > > > scheduler patches.
> > > > > > > > 
> > > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > > > [Why]
> > > > > > > > > for bailing job, this commit will delete it from pending list thus the
> > > > > > > > > bailing job will never have a chance to be resubmitted even in advance
> > > > > > > > > tdr mode.
> > > > > > > > > 
> > > > > > > > > [How]
> > > > > > > > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > > > > > > > this commit tries to work around is completely solved.So revert this
> > > > > > > > > commit.
> > > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > > v2:
> > > > > > > > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > > > > > > > during processing timedout_job
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > > > ---
> > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > > >      1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > >      {
> > > > > > > > >             struct drm_gpu_scheduler *sched;
> > > > > > > > >             struct drm_sched_job *job;
> > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > >             enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > 
> > > > > > > > >             sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > 
> > > > > > > > >             if (job) {
> > > > > > > > >                     /*
> > > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > > +                * processing timedout_job
> > > > > > > > >                      */
> > > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > > > > While this is true for amdgpu, it has no meaning for other drivers for whom
> > > > > > > we haven't
> > > > > > > done the refactoring of embedding HW fence (parent) into the job structure.
> > > > > > > In fact thinking
> > > > > > > about it, unless you do the HW fence embedding for all the drivers using the
> > > > > > > scheduler you cannot
> > > > > > > revert this patch or you will just break them.
> > > > > > btw, why did you do that embedding? I do still have my patches with
> > > > > > dma_fence annotations floating around, but my idea at least was to fix
> > > > > > that issue with a mempool, not with embeddeding. What was the motivation
> > > > > > for embedding the wh fence?
> > > > > > -Daniel
> > > > > The motivation was 2 fold, avoid memory allocation during jobs submissions
> > > > > (HW fence allocation) because as Christian explained this leads to deadlock
> > > > > with
> > > > > mm code during evictions due to memory pressure (Christian can clarify if I
> > > > > messed
> > > > Yeah that's the exact same thing I've chased with my dma_fence
> > > > annotations, but thus far zero to none interested in getting it sorted. I
> > > > think it'd be good to have some cross-driver agreement on how this should
> > > > be solved before someone just charges ahead ...
> > > > 
> > > > > this explanation). Second is to exactly revert this patch because while it
> > > > > solved the issue
> > > > > described in the patch it created another with drivers who baildc out early
> > > > > during TDR handling
> > > > > for various reason and the job would just leak because it was already
> > > > > removed form pending list.
> > > > Can't we reinsert it before we restart the scheduler thread? It might need
> > > > a separate list for that due to the lockless queue tricks. Or am I
> > > > thinking about the wrong kind of "we lost the job"?
> > > > -Danile
> > > 
> > > If you look at the original patch it would reinsert it even earlier - right
> > > after stopping the  SW scheduler thread, and even then it was to late for
> > > some drivers as they would decide to return back from their TDR handler even
> > > before that. It is solvable but in an ugly way as far as I see, you need to
> > > require each driver in his code to put the job back in the list if they do
> > > it before reaching the place where scheduler framework does it. Kind of
> > > spaghetti code seems to me.
> > Hm yeah I didn't realize this all happens before we stop the scheduler
> > thread.
> > 
> > Why can't we stop the scheduler thread first, so that there's guaranteed
> > no race? I've recently had a lot of discussions with panfrost folks about
> > their reset that spawns across engines, and without stopping the scheduler
> > thread first before you touch anything it's just plain impossible.
> 
> 
> Talked with Christian on that, for each TDR we actually stop all the
> schedulers for all the rings and not only the hanged ring since
> ASIC reset will impact all the rings anyway. So we cannot allow
> other timeout handlers for other rings run in parallel to ours
> as they will stop/restart the threads we just stopped and rely
> on them being stopped. So it's all done with device wide lock
> inside the amdgpu tTDR handler. Only inside the locked
> section then we may stop/restart the scheduler threads.
> Christian also mentioned that you proposed at some point
> to serialize all TDR handling into single threading for all rings - this
> seems
> like something that could be used - we then don't need any
> locking against TDR handlers from other rings and then we may
> stop the scheduler thread as first step
> 
> 
> > 
> > I'm also still not understanding what exactly you guys have done,
> > can someone please dig out the the amdgpu patches that motivate all this
> > maybe that's clearer? A full explanation would still be good since I've
> > only started in scheduler stuff.
> 
> 
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141

Uh, it would have been really good if this was discussed a bit wider
beforehand. Now we have rather diverging approaches to this. Also would be
really good to resurrect the dma_fence annotations too.

Can you guys pls spend a bit of time on this? Shouldn't be too hard to type
up RFC conversion patches for the other drivers.


> > Another thing I recently pondered for tdr races looking at i915 code is
> > whether the tdr should first block the completion fence for that job. My
> > motivation is to have a race-free error capture (if the completion races
> > then we might start evicting memory and everything goes boom), but maybe
> > that helps here too. Some kind of atomic "block this fence from
> > completing thing.
> > 
> > Or I'm I completely guessing in the wrong direction?
> 
> 
> I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410

Ah yes, this works because drm/sched has a hw fence that is separate from
the logical job fence.
-Daniel
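
For readers following along, here is a rough userspace sketch of why the
split between the hw fence and the logical job fence makes this work. It is
not kernel code: every sketch_* name is hypothetical, and the helper only
mimics the documented behaviour of dma_fence_remove_callback() (it succeeds
only if the fence has not signaled yet). Once the callback is detached, a
late hw completion can no longer mark the job's logical fence as finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_hw_fence;
typedef void (*sketch_cb_t)(struct sketch_hw_fence *hw, void *data);

/* Hypothetical stand-in for the hw fence (s_fence->parent). */
struct sketch_hw_fence {
	bool signaled;
	sketch_cb_t cb;		/* completion callback installed by the scheduler */
	void *cb_data;
};

/* Hypothetical stand-in for the logical "finished" fence of the job. */
struct sketch_job_fence {
	bool finished;
};

/* Callback: hw completion propagates to the logical job fence. */
static void sketch_job_done(struct sketch_hw_fence *hw, void *data)
{
	struct sketch_job_fence *jf = data;

	(void)hw;
	jf->finished = true;
}

/*
 * Detach the callback. Returns true if it was removed before the hw
 * fence signaled, false if it was too late (or nothing was installed).
 */
static bool sketch_remove_callback(struct sketch_hw_fence *hw)
{
	if (hw->signaled || !hw->cb)
		return false;
	hw->cb = NULL;
	hw->cb_data = NULL;
	return true;
}

/* Signal the hw fence and run the callback if one is still attached. */
static void sketch_hw_signal(struct sketch_hw_fence *hw)
{
	if (hw->signaled)
		return;
	hw->signaled = true;
	if (hw->cb)
		hw->cb(hw, hw->cb_data);
}
```

With a single combined fence there would be nothing to detach, which is
why the separation matters for blocking completion during reset.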

> 
> Andrey
> 
> 
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > > > Andrey
> > > > > 
> > > > > 
> > > > > > > Andrey
> > > > > > > 
> > > > > > > 
> > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > > 
> > > > > > > > >                     status = job->sched->ops->timedout_job(job);
> > > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > >                             job->sched->ops->free_job(job);
> > > > > > > > >                             sched->free_guilty = false;
> > > > > > > > >                     }
> > > > > > > > > +               dma_fence_put(fence);
> > > > > > > > >             } else {
> > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > >             }
> > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > > 
> > > > > > > > >             kthread_park(sched->thread);
> > > > > > > > > 
> > > > > > > > > -       /*
> > > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > > -        */
> > > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > > -               /*
> > > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > > -                * job extracted.
> > > > > > > > > -                */
> > > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > > -
> > > > > > > > >             /*
> > > > > > > > >              * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > >              * their HW callbacks or remove them from pending list if they already
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-26  9:04                 ` Daniel Vetter
@ 2021-08-31 13:11                   ` Daniel Vetter
  2021-08-31 18:24                     ` Andrey Grodzovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Vetter @ 2021-08-31 13:11 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Daniel Vetter, Alex Deucher, Jingwen Chen,
	Maling list - DRI developers, amd-gfx list, monk.liu,
	Christian Koenig

On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
> On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
> > 
> > On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
> > > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > > > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > > > 
> > > > > > > > > + dri-devel
> > > > > > > > > 
> > > > > > > > > Since scheduler is a shared component, please add dri-devel on all
> > > > > > > > > scheduler patches.
> > > > > > > > > 
> > > > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > > > > [Why]
> > > > > > > > > > for bailing job, this commit will delete it from pending list thus the
> > > > > > > > > > bailing job will never have a chance to be resubmitted even in advance
> > > > > > > > > > tdr mode.
> > > > > > > > > > 
> > > > > > > > > > [How]
> > > > > > > > > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > > > > > > > > this commit tries to work around is completely solved.So revert this
> > > > > > > > > > commit.
> > > > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > > > v2:
> > > > > > > > > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > > > > > > > > during processing timedout_job
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > > > > ---
> > > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > > > >      1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > >      {
> > > > > > > > > >             struct drm_gpu_scheduler *sched;
> > > > > > > > > >             struct drm_sched_job *job;
> > > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > > >             enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > > 
> > > > > > > > > >             sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > 
> > > > > > > > > >             if (job) {
> > > > > > > > > >                     /*
> > > > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > > > +                * processing timedout_job
> > > > > > > > > >                      */
> > > > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > > > > > While this is true for amdgpu, it has no meaning for other drivers for whom
> > > > > > > > we haven't
> > > > > > > > done the refactoring of embedding HW fence (parent) into the job structure.
> > > > > > > > In fact thinking
> > > > > > > > about it, unless you do the HW fence embedding for all the drivers using the
> > > > > > > > scheduler you cannot
> > > > > > > > revert this patch or you will just break them.
> > > > > > > btw, why did you do that embedding? I do still have my patches with
> > > > > > > dma_fence annotations floating around, but my idea at least was to fix
> > > > > > > that issue with a mempool, not with embeddeding. What was the motivation
> > > > > > > for embedding the wh fence?
> > > > > > > -Daniel
> > > > > > The motivation was 2 fold, avoid memory allocation during jobs submissions
> > > > > > (HW fence allocation) because as Christian explained this leads to deadlock
> > > > > > with
> > > > > > mm code during evictions due to memory pressure (Christian can clarify if I
> > > > > > messed
> > > > > Yeah that's the exact same thing I've chased with my dma_fence
> > > > > annotations, but thus far zero to none interested in getting it sorted. I
> > > > > think it'd be good to have some cross-driver agreement on how this should
> > > > > be solved before someone just charges ahead ...
> > > > > 
> > > > > > this explanation). Second is to exactly revert this patch because while it
> > > > > > solved the issue
> > > > > > described in the patch it created another with drivers who baildc out early
> > > > > > during TDR handling
> > > > > > for various reason and the job would just leak because it was already
> > > > > > removed form pending list.
> > > > > Can't we reinsert it before we restart the scheduler thread? It might need
> > > > > a separate list for that due to the lockless queue tricks. Or am I
> > > > > thinking about the wrong kind of "we lost the job"?
> > > > > -Danile
> > > > 
> > > > If you look at the original patch it would reinsert it even earlier - right
> > > > after stopping the  SW scheduler thread, and even then it was to late for
> > > > some drivers as they would decide to return back from their TDR handler even
> > > > before that. It is solvable but in an ugly way as far as I see, you need to
> > > > require each driver in his code to put the job back in the list if they do
> > > > it before reaching the place where scheduler framework does it. Kind of
> > > > spaghetti code seems to me.
> > > Hm yeah I didn't realize this all happens before we stop the scheduler
> > > thread.
> > > 
> > > Why can't we stop the scheduler thread first, so that there's guaranteed
> > > no race? I've recently had a lot of discussions with panfrost folks about
> > > their reset that spawns across engines, and without stopping the scheduler
> > > thread first before you touch anything it's just plain impossible.
> > 
> > 
> > Talked with Christian on that, for each TDR we actually stop all the
> > schedulers for all the rings and not only the hanged ring since
> > ASIC reset will impact all the rings anyway. So we cannot allow
> > other timeout handlers for other rings run in parallel to ours
> > as they will stop/restart the threads we just stopped and rely
> > on them being stopped. So it's all done with device wide lock
> > inside the amdgpu tTDR handler. Only inside the locked
> > section then we may stop/restart the scheduler threads.
> > Christian also mentioned that you proposed at some point
> > to serialize all TDR handling into single threading for all rings - this
> > seems
> > like something that could be used - we then don't need any
> > locking against TDR handlers from other rings and then we may
> > stop the scheduler thread as first step
> > 
> > 
> > > 
> > > I'm also still not understanding what exactly you guys have done,
> > > can someone please dig out the the amdgpu patches that motivate all this
> > > maybe that's clearer? A full explanation would still be good since I've
> > > only started in scheduler stuff.
> > 
> > 
> > https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141
> 
> Uh, it would have been really good if this was discussed a bit wider
> beforehand. Now we have rather diverging approaches to this. Also would be
> really good to resurrect the dma_fence annotations too.
> 
> Can you guys pls spend a bit of time on this? Shouldn't be to hard to type
> up rfc conversion patches for the other drivers.

Ping for this. Currently the hw fence is returned from the ->run_job
callback, and that's not great design.

If we embed it, then I think it should exist from drm_sched_job_arm() at
the latest. Maybe not yet initialized, but at least allocated. So the right
thing to do here is to have the hw fence as a pointer in struct
drm_sched_job, and to check in drm_sched_job_arm() that it's at least
allocated.
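
To make the proposal concrete, a minimal userspace sketch, under loud
assumptions: the sketch_* types and sketch_job_arm() are hypothetical
stand-ins, not the real drm_sched_job or drm_sched_job_arm(), and returning
an error code from the arm step is an illustration choice rather than the
existing interface. It only shows the shape of the check: the job carries a
pointer to the hw fence, and arming refuses to proceed unless the driver
has already allocated it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hw fence; not struct dma_fence. */
struct sketch_fence {
	bool initialized;	/* may still be false when the job is armed */
};

/* Hypothetical stand-in for the scheduler job. */
struct sketch_job {
	struct sketch_fence *hw_fence;	/* storage provided by the driver */
	bool armed;
};

/*
 * Mimics the proposed check: arming fails unless storage for the hw
 * fence already exists. Initialization can still happen later.
 */
static int sketch_job_arm(struct sketch_job *job)
{
	if (!job->hw_fence)
		return -EINVAL;
	job->armed = true;
	return 0;
}
```

The point of the check is that no allocation is left for the submission
path, where it could deadlock against memory reclaim.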

Otherwise we're just diverging across drivers and tempting them to do the
wrong thing with the current ->run_job callback interface.

Can you guys look into this?
-Daniel

> > > Another thing I recently pondered for tdr races looking at i915 code is
> > > whether the tdr should first block the completion fence for that job. My
> > > motivation is to have a race-free error capture (if the completion races
> > > then we might start evicting memory and everything goes boom), but maybe
> > > that helps here too. Some kind of atomic "block this fence from
> > > completing thing.
> > > 
> > > Or I'm I completely guessing in the wrong direction?
> > 
> > 
> > I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410
> 
> Ah yes this works becase drm/sched has separate hw fence from the logical
> job fence.
> -Daniel
> 
> > 
> > Andrey
> > 
> > 
> > > -Daniel
> > > 
> > > > Andrey
> > > > 
> > > > 
> > > > > > Andrey
> > > > > > 
> > > > > > 
> > > > > > > > Andrey
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > > > 
> > > > > > > > > >                     status = job->sched->ops->timedout_job(job);
> > > > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > >                             job->sched->ops->free_job(job);
> > > > > > > > > >                             sched->free_guilty = false;
> > > > > > > > > >                     }
> > > > > > > > > > +               dma_fence_put(fence);
> > > > > > > > > >             } else {
> > > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > > >             }
> > > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > > > 
> > > > > > > > > >             kthread_park(sched->thread);
> > > > > > > > > > 
> > > > > > > > > > -       /*
> > > > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > > > -        */
> > > > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > > > -               /*
> > > > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > > > -                * job extracted.
> > > > > > > > > > -                */
> > > > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > > > -
> > > > > > > > > >             /*
> > > > > > > > > >              * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > > >              * their HW callbacks or remove them from pending list if they already
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-31 13:11                   ` Daniel Vetter
@ 2021-08-31 18:24                     ` Andrey Grodzovsky
  2021-09-02 14:28                       ` Daniel Vetter
  0 siblings, 1 reply; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-08-31 18:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Deucher, Jingwen Chen, Maling list - DRI developers,
	amd-gfx list, monk.liu, Christian Koenig


On 2021-08-31 9:11 a.m., Daniel Vetter wrote:
> On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
>> On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
>>> On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
>>>> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>>>>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>>>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>>>>
>>>>>>>>>> + dri-devel
>>>>>>>>>>
>>>>>>>>>> Since scheduler is a shared component, please add dri-devel on all
>>>>>>>>>> scheduler patches.
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>>>>>>> [Why]
>>>>>>>>>>> for bailing job, this commit will delete it from pending list thus the
>>>>>>>>>>> bailing job will never have a chance to be resubmitted even in advance
>>>>>>>>>>> tdr mode.
>>>>>>>>>>>
>>>>>>>>>>> [How]
>>>>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race condition that
>>>>>>>>>>> this commit tries to work around is completely solved.So revert this
>>>>>>>>>>> commit.
>>>>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>>>>> v2:
>>>>>>>>>>> add dma_fence_get/put() around timedout_job to avoid concurrent delete
>>>>>>>>>>> during processing timedout_job
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>>>>       1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>       {
>>>>>>>>>>>              struct drm_gpu_scheduler *sched;
>>>>>>>>>>>              struct drm_sched_job *job;
>>>>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>>>>              enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>>>>
>>>>>>>>>>>              sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>
>>>>>>>>>>>              if (job) {
>>>>>>>>>>>                      /*
>>>>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>>>>> +                * processing timedout_job
>>>>>>>>>>>                       */
>>>>>>>>>>> -               list_del_init(&job->list);
>>>>>>>>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>>>>>>>> While this is true for amdgpu, it has no meaning for other drivers for whom
>>>>>>>>> we haven't
>>>>>>>>> done the refactoring of embedding HW fence (parent) into the job structure.
>>>>>>>>> In fact thinking
>>>>>>>>> about it, unless you do the HW fence embedding for all the drivers using the
>>>>>>>>> scheduler you cannot
>>>>>>>>> revert this patch or you will just break them.
>>>>>>>> btw, why did you do that embedding? I do still have my patches with
>>>>>>>> dma_fence annotations floating around, but my idea at least was to fix
>>>>>>>> that issue with a mempool, not with embeddeding. What was the motivation
>>>>>>>> for embedding the wh fence?
>>>>>>>> -Daniel
>>>>>>> The motivation was 2 fold, avoid memory allocation during jobs submissions
>>>>>>> (HW fence allocation) because as Christian explained this leads to deadlock
>>>>>>> with
>>>>>>> mm code during evictions due to memory pressure (Christian can clarify if I
>>>>>>> messed
>>>>>> Yeah that's the exact same thing I've chased with my dma_fence
>>>>>> annotations, but thus far zero to none interested in getting it sorted. I
>>>>>> think it'd be good to have some cross-driver agreement on how this should
>>>>>> be solved before someone just charges ahead ...
>>>>>>
>>>>>>> this explanation). Second is to exactly revert this patch because while it
>>>>>>> solved the issue
>>>>>>> described in the patch it created another with drivers who baildc out early
>>>>>>> during TDR handling
>>>>>>> for various reason and the job would just leak because it was already
>>>>>>> removed form pending list.
>>>>>> Can't we reinsert it before we restart the scheduler thread? It might need
>>>>>> a separate list for that due to the lockless queue tricks. Or am I
>>>>>> thinking about the wrong kind of "we lost the job"?
>>>>>> -Danile
>>>>> If you look at the original patch it would reinsert it even earlier - right
>>>>> after stopping the  SW scheduler thread, and even then it was to late for
>>>>> some drivers as they would decide to return back from their TDR handler even
>>>>> before that. It is solvable but in an ugly way as far as I see, you need to
>>>>> require each driver in his code to put the job back in the list if they do
>>>>> it before reaching the place where scheduler framework does it. Kind of
>>>>> spaghetti code seems to me.
>>>> Hm yeah I didn't realize this all happens before we stop the scheduler
>>>> thread.
>>>>
>>>> Why can't we stop the scheduler thread first, so that there's guaranteed
>>>> no race? I've recently had a lot of discussions with panfrost folks about
>>>> their reset that spawns across engines, and without stopping the scheduler
>>>> thread first before you touch anything it's just plain impossible.
>>>
>>> Talked with Christian on that, for each TDR we actually stop all the
>>> schedulers for all the rings and not only the hanged ring since
>>> ASIC reset will impact all the rings anyway. So we cannot allow
>>> other timeout handlers for other rings run in parallel to ours
>>> as they will stop/restart the threads we just stopped and rely
>>> on them being stopped. So it's all done with device wide lock
>>> inside the amdgpu tTDR handler. Only inside the locked
>>> section then we may stop/restart the scheduler threads.
>>> Christian also mentioned that you proposed at some point
>>> to serialize all TDR handling into single threading for all rings - this
>>> seems
>>> like something that could be used - we then don't need any
>>> locking against TDR handlers from other rings and then we may
>>> stop the scheduler thread as first step
>>>
>>>
>>>> I'm also still not understanding what exactly you guys have done,
>>>> can someone please dig out the the amdgpu patches that motivate all this
>>>> maybe that's clearer? A full explanation would still be good since I've
>>>> only started in scheduler stuff.
>>>
>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141
>> Uh, it would have been really good if this was discussed a bit wider
>> beforehand. Now we have rather diverging approaches to this. Also would be
>> really good to resurrect the dma_fence annotations too.
>>
>> Can you guys pls spend a bit of time on this? Shouldn't be too hard to type
>> up RFC conversion patches for the other drivers.
> Ping for this. Currently the hw fence is returned from the ->run_job
> callback, and that's not great design.


What problem do you see there?


>
> If we embed it, then I think it should start existing latest from
> drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So
> the right thing to do here is to have the hw fence as a pointer in
> struct drm_sched_job. And check in drm_sched_job_arm() that it's at least
> allocated.


Why do we need to allocate the HW fence if it's embedded within the job struct?


>
> Otherwise we're just diverging across drivers and tempting them to do the
> wrong thing with the current ->run_job callback interface.


Maybe we should switch from embedding it in the driver-level job struct, as
it is now, to embedding it in drm_sched_job, and just leave the fence
initialization to driver-specific code?

Andrey


>
> Can you guys look into this?
> -Daniel
>
>>>> Another thing I recently pondered for tdr races looking at i915 code is
>>>> whether the tdr should first block the completion fence for that job. My
>>>> motivation is to have a race-free error capture (if the completion races
>>>> then we might start evicting memory and everything goes boom), but maybe
>>>> that helps here too. Some kind of atomic "block this fence from
>>>> completing" thing.
>>>>
>>>> Or am I completely guessing in the wrong direction?
>>>
>>> I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410
>> Ah yes this works because drm/sched has separate hw fence from the logical
>> job fence.
>> -Daniel
>>
>>> Andrey
>>>
>>>
>>>> -Daniel
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>                      spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>
>>>>>>>>>>>                      status = job->sched->ops->timedout_job(job);
>>>>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>                              job->sched->ops->free_job(job);
>>>>>>>>>>>                              sched->free_guilty = false;
>>>>>>>>>>>                      }
>>>>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>>>>              } else {
>>>>>>>>>>>                      spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>              }
>>>>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>
>>>>>>>>>>>              kthread_park(sched->thread);
>>>>>>>>>>>
>>>>>>>>>>> -       /*
>>>>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>>>>> -        */
>>>>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>>>>> -               /*
>>>>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>>>>> -                * job extracted.
>>>>>>>>>>> -                */
>>>>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>>>>> -
>>>>>>>>>>>              /*
>>>>>>>>>>>               * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>>>>               * their HW callbacks or remove them from pending list if they already
>>>>>>>>>>> --
>>>>>>>>>>> 2.25.1
>>>>>>>>>>>
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch/


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-31 18:24                     ` Andrey Grodzovsky
@ 2021-09-02 14:28                       ` Daniel Vetter
  2021-09-02 15:36                         ` Andrey Grodzovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Vetter @ 2021-09-02 14:28 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Daniel Vetter, Alex Deucher, Jingwen Chen,
	Maling list - DRI developers, amd-gfx list, monk.liu,
	Christian Koenig

On Tue, Aug 31, 2021 at 02:24:52PM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-31 9:11 a.m., Daniel Vetter wrote:
> > On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
> > > > On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
> > > > > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > > > > > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > > > > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > > > > > 
> > > > > > > > > > > + dri-devel
> > > > > > > > > > > 
> > > > > > > > > > > Since scheduler is a shared component, please add dri-devel on all
> > > > > > > > > > > scheduler patches.
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > > > > > > [Why]
> > > > > > > > > > > > for bailing job, this commit will delete it from pending list thus the
> > > > > > > > > > > > bailing job will never have a chance to be resubmitted even in advance
> > > > > > > > > > > > tdr mode.
> > > > > > > > > > > > 
> > > > > > > > > > > > [How]
> > > > > > > > > > > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > > > > > > > > > > this commit tries to work around is completely solved.So revert this
> > > > > > > > > > > > commit.
> > > > > > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > > > > > v2:
> > > > > > > > > > > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > > > > > > > > > > during processing timedout_job
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >       drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > > > > > >       1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > > >       {
> > > > > > > > > > > >              struct drm_gpu_scheduler *sched;
> > > > > > > > > > > >              struct drm_sched_job *job;
> > > > > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > > > > >              enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > > > > 
> > > > > > > > > > > >              sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > > > 
> > > > > > > > > > > >              if (job) {
> > > > > > > > > > > >                      /*
> > > > > > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > > > > > +                * processing timedout_job
> > > > > > > > > > > >                       */
> > > > > > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > > > > > > > While this is true for amdgpu, it has no meaning for other drivers for
> > > > > > > > > > which we haven't done the refactoring of embedding the HW fence (parent)
> > > > > > > > > > into the job structure. In fact, thinking about it, unless you do the HW
> > > > > > > > > > fence embedding for all the drivers using the scheduler you cannot
> > > > > > > > > > revert this patch or you will just break them.
> > > > > > > > > btw, why did you do that embedding? I do still have my patches with
> > > > > > > > > dma_fence annotations floating around, but my idea at least was to fix
> > > > > > > > > that issue with a mempool, not with embedding. What was the motivation
> > > > > > > > > for embedding the hw fence?
> > > > > > > > > -Daniel
> > > > > > > > The motivation was twofold: avoid memory allocation during job submission
> > > > > > > > (HW fence allocation) because, as Christian explained, this leads to deadlock
> > > > > > > > with mm code during evictions due to memory pressure (Christian can clarify
> > > > > > > > if I messed up
> > > > > > > Yeah that's the exact same thing I've chased with my dma_fence
> > > > > > > annotations, but thus far zero to none interested in getting it sorted. I
> > > > > > > think it'd be good to have some cross-driver agreement on how this should
> > > > > > > be solved before someone just charges ahead ...
> > > > > > > 
> > > > > > > > this explanation). Second is exactly to revert this patch, because while it
> > > > > > > > solved the issue described in the patch it created another with drivers
> > > > > > > > that bailed out early during TDR handling for various reasons, and the job
> > > > > > > > would just leak because it was already removed from the pending list.
> > > > > > > Can't we reinsert it before we restart the scheduler thread? It might need
> > > > > > > a separate list for that due to the lockless queue tricks. Or am I
> > > > > > > thinking about the wrong kind of "we lost the job"?
> > > > > > > -Daniel
> > > > > > If you look at the original patch it would reinsert it even earlier - right
> > > > > > after stopping the SW scheduler thread, and even then it was too late for
> > > > > > some drivers as they would decide to return from their TDR handler even
> > > > > > before that. It is solvable but in an ugly way as far as I see: you need to
> > > > > > require each driver to put the job back in the list in its own code if it
> > > > > > returns before reaching the place where the scheduler framework does it.
> > > > > > Kind of spaghetti code, it seems to me.
> > > > > Hm yeah I didn't realize this all happens before we stop the scheduler
> > > > > thread.
> > > > > 
> > > > > Why can't we stop the scheduler thread first, so that there's guaranteed
> > > > > no race? I've recently had a lot of discussions with panfrost folks about
> > > > > their reset that spans across engines, and without stopping the scheduler
> > > > > thread first before you touch anything it's just plain impossible.
> > > > 
> > > > Talked with Christian on that, for each TDR we actually stop all the
> > > > schedulers for all the rings and not only the hung ring, since
> > > > ASIC reset will impact all the rings anyway. So we cannot allow
> > > > other timeout handlers for other rings to run in parallel to ours
> > > > as they will stop/restart the threads we just stopped and rely
> > > > on them being stopped. So it's all done with a device-wide lock
> > > > inside the amdgpu TDR handler. Only inside the locked
> > > > section may we stop/restart the scheduler threads.
> > > > Christian also mentioned that you proposed at some point
> > > > to serialize all TDR handling into a single thread for all rings - this
> > > > seems like something that could be used - we then don't need any
> > > > locking against TDR handlers from other rings and then we may
> > > > stop the scheduler thread as the first step.
> > > > 
> > > > 
> > > > > I'm also still not understanding what exactly you guys have done,
> > > > > can someone please dig out the amdgpu patches that motivate all this,
> > > > > maybe that's clearer? A full explanation would still be good since I've
> > > > > only started in scheduler stuff.
> > > > 
> > > > https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141
> > > Uh, it would have been really good if this was discussed a bit wider
> > > beforehand. Now we have rather diverging approaches to this. Also would be
> > > really good to resurrect the dma_fence annotations too.
> > > 
> > > Can you guys pls spend a bit of time on this? Shouldn't be too hard to type
> > > up RFC conversion patches for the other drivers.
> > Ping for this. Currently the hw fence is returned from the ->run_job
> > callback, and that's not great design.
> 
> 
> What problem do you see there?

For one, all other drivers work like that, and it's not great to be
inconsistent. And it allows that inconsistent/wrong pattern to continue.

Second, I'm not even sure you can embed the hw fence, because there's this
job restarting going on, which at least thus far allocated a new hw fence.
So this needs consideration.

> > If we embed it, then I think it should start existing latest from
> > drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So
> > the right thing to do here is to have the hw fence as a pointer in
> > struct drm_sched_job. And check in drm_sched_job_arm() that it's at least
> > allocated.
> 
> 
> Why do we need to allocate the HW fence if it's embedded within the job struct?

The hw fence is a refcounted struct, and the drm_sched_job is a different
struct. And we didn't have a dri-devel discussion about whether it's
correct to conflate these two lifetimes; amdgpu folks simply hacked
something together.
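
As a rough illustration of the proposal quoted above (a hw fence pointer in
the scheduler job, checked when the job is armed), here is a minimal
userspace sketch. Everything in it is hypothetical: toy_fence, toy_sched_job
and toy_sched_job_arm are stand-ins, not the real dma_fence, drm_sched_job
or drm_sched_job_arm(), and the toy returns an error code purely so the
check can be exercised; the real code may handle a missing fence differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted hw fence. */
struct toy_fence {
	int refcount;
};

/* Hypothetical stand-in for the scheduler job: it only points at the fence,
 * so the two objects keep separate lifetimes. */
struct toy_sched_job {
	struct toy_fence *hw_fence; /* must be allocated before arming */
	int armed;
};

/* Refuse to arm a job whose hw fence has not been allocated yet. */
static int toy_sched_job_arm(struct toy_sched_job *job)
{
	if (!job || !job->hw_fence)
		return -EINVAL;

	job->armed = 1;
	return 0;
}
```

With this shape, a driver that forgets to allocate the fence is caught at
arm time rather than later in its ->run_job callback.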

> > Otherwise we're just diverging across drivers and tempting them to do the
> > wrong thing with the current ->run_job callback interface.
> 
> 
> Maybe we should switch from embedding it in the driver-level job struct, as
> it is now, to embedding it in drm_sched_job, and just leave the fence
> initialization to driver-specific code?

Maybe? Like I've not been involved in these discussions on the amd side at
all, I'm just noticing that we now have a rather inconsistently used
interface across drivers. Which is no good.
-Daniel

> 
> Andrey
> 
> 
> > 
> > Can you guys look into this?
> > -Daniel
> > 
> > > > > Another thing I recently pondered for tdr races looking at i915 code is
> > > > > whether the tdr should first block the completion fence for that job. My
> > > > > motivation is to have a race-free error capture (if the completion races
> > > > > then we might start evicting memory and everything goes boom), but maybe
> > > > > that helps here too. Some kind of atomic "block this fence from
> > > > > completing" thing.
> > > > > 
> > > > > Or am I completely guessing in the wrong direction?
> > > > 
> > > > I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410
> > > Ah yes this works because drm/sched has separate hw fence from the logical
> > > job fence.
> > > -Daniel
> > > 
> > > > Andrey
> > > > 
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > > Andrey
> > > > > > 
> > > > > > 
> > > > > > > > Andrey
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > Andrey
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > >                      spin_unlock(&sched->job_list_lock);
> > > > > > > > > > > > 
> > > > > > > > > > > >                      status = job->sched->ops->timedout_job(job);
> > > > > > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > > >                              job->sched->ops->free_job(job);
> > > > > > > > > > > >                              sched->free_guilty = false;
> > > > > > > > > > > >                      }
> > > > > > > > > > > > +               dma_fence_put(fence);
> > > > > > > > > > > >              } else {
> > > > > > > > > > > >                      spin_unlock(&sched->job_list_lock);
> > > > > > > > > > > >              }
> > > > > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > > > > > 
> > > > > > > > > > > >              kthread_park(sched->thread);
> > > > > > > > > > > > 
> > > > > > > > > > > > -       /*
> > > > > > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > > > > > -        */
> > > > > > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > > > > > -               /*
> > > > > > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > > > > > -                * job extracted.
> > > > > > > > > > > > -                */
> > > > > > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > > > > > -
> > > > > > > > > > > >              /*
> > > > > > > > > > > >               * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > > > > >               * their HW callbacks or remove them from pending list if they already
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch/

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-09-02 14:28                       ` Daniel Vetter
@ 2021-09-02 15:36                         ` Andrey Grodzovsky
  2021-09-07  8:47                           ` Daniel Vetter
  0 siblings, 1 reply; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-09-02 15:36 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Deucher, Jingwen Chen, Maling list - DRI developers,
	amd-gfx list, monk.liu, Christian Koenig


On 2021-09-02 10:28 a.m., Daniel Vetter wrote:
> On Tue, Aug 31, 2021 at 02:24:52PM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-31 9:11 a.m., Daniel Vetter wrote:
>>> On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
>>>> On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
>>>>> On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
>>>>>> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>>>>>>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>>>>>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>>>>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>>>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>>>>>>
>>>>>>>>>>>> + dri-devel
>>>>>>>>>>>>
>>>>>>>>>>>> Since scheduler is a shared component, please add dri-devel on all
>>>>>>>>>>>> scheduler patches.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
>>>>>>>>>>>>> [Why]
>>>>>>>>>>>>> for bailing job, this commit will delete it from pending list thus the
>>>>>>>>>>>>> bailing job will never have a chance to be resubmitted even in advance
>>>>>>>>>>>>> tdr mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [How]
>>>>>>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race condition that
>>>>>>>>>>>>> this commit tries to work around is completely solved.So revert this
>>>>>>>>>>>>> commit.
>>>>>>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>>>>>>> v2:
>>>>>>>>>>>>> add dma_fence_get/put() around timedout_job to avoid concurrent delete
>>>>>>>>>>>>> during processing timedout_job
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>        drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>>>>>>        1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>>>        {
>>>>>>>>>>>>>               struct drm_gpu_scheduler *sched;
>>>>>>>>>>>>>               struct drm_sched_job *job;
>>>>>>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>>>>>>               enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>>>>>>
>>>>>>>>>>>>>               sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>>>>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>>>
>>>>>>>>>>>>>               if (job) {
>>>>>>>>>>>>>                       /*
>>>>>>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>>>>>>> +                * processing timedout_job
>>>>>>>>>>>>>                        */
>>>>>>>>>>>>> -               list_del_init(&job->list);
>>>>>>>>>>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>>>>>>>>>> While this is true for amdgpu, it has no meaning for other drivers for
>>>>>>>>>>> which we haven't done the refactoring of embedding the HW fence (parent)
>>>>>>>>>>> into the job structure. In fact, thinking about it, unless you do the HW
>>>>>>>>>>> fence embedding for all the drivers using the scheduler you cannot
>>>>>>>>>>> revert this patch or you will just break them.
>>>>>>>>>> btw, why did you do that embedding? I do still have my patches with
>>>>>>>>>> dma_fence annotations floating around, but my idea at least was to fix
>>>>>>>>>> that issue with a mempool, not with embedding. What was the motivation
>>>>>>>>>> for embedding the hw fence?
>>>>>>>>>> -Daniel
>>>>>>>>> The motivation was twofold: avoid memory allocation during job submission
>>>>>>>>> (HW fence allocation) because, as Christian explained, this leads to deadlock
>>>>>>>>> with mm code during evictions due to memory pressure (Christian can clarify
>>>>>>>>> if I messed up
>>>>>>>> Yeah that's the exact same thing I've chased with my dma_fence
>>>>>>>> annotations, but thus far zero to none interested in getting it sorted. I
>>>>>>>> think it'd be good to have some cross-driver agreement on how this should
>>>>>>>> be solved before someone just charges ahead ...
>>>>>>>>
>>>>>>>>> this explanation). Second is exactly to revert this patch, because while it
>>>>>>>>> solved the issue described in the patch it created another with drivers
>>>>>>>>> that bailed out early during TDR handling for various reasons, and the job
>>>>>>>>> would just leak because it was already removed from the pending list.
>>>>>>>> Can't we reinsert it before we restart the scheduler thread? It might need
>>>>>>>> a separate list for that due to the lockless queue tricks. Or am I
>>>>>>>> thinking about the wrong kind of "we lost the job"?
>>>>>>>> -Daniel
>>>>>>> If you look at the original patch it would reinsert it even earlier - right
>>>>>>> after stopping the SW scheduler thread, and even then it was too late for
>>>>>>> some drivers as they would decide to return from their TDR handler even
>>>>>>> before that. It is solvable but in an ugly way as far as I see: you need to
>>>>>>> require each driver to put the job back in the list in its own code if it
>>>>>>> returns before reaching the place where the scheduler framework does it.
>>>>>>> Kind of spaghetti code, it seems to me.
>>>>>> Hm yeah I didn't realize this all happens before we stop the scheduler
>>>>>> thread.
>>>>>>
>>>>>> Why can't we stop the scheduler thread first, so that there's guaranteed
>>>>>> no race? I've recently had a lot of discussions with panfrost folks about
>>>>>> their reset that spans across engines, and without stopping the scheduler
>>>>>> thread first before you touch anything it's just plain impossible.
>>>>> Talked with Christian on that, for each TDR we actually stop all the
>>>>> schedulers for all the rings and not only the hung ring, since
>>>>> ASIC reset will impact all the rings anyway. So we cannot allow
>>>>> other timeout handlers for other rings to run in parallel to ours
>>>>> as they will stop/restart the threads we just stopped and rely
>>>>> on them being stopped. So it's all done with a device-wide lock
>>>>> inside the amdgpu TDR handler. Only inside the locked
>>>>> section may we stop/restart the scheduler threads.
>>>>> Christian also mentioned that you proposed at some point
>>>>> to serialize all TDR handling into a single thread for all rings - this
>>>>> seems like something that could be used - we then don't need any
>>>>> locking against TDR handlers from other rings and then we may
>>>>> stop the scheduler thread as the first step.
>>>>>
>>>>>
>>>>>> I'm also still not understanding what exactly you guys have done,
>>>>>> can someone please dig out the amdgpu patches that motivate all this,
>>>>>> maybe that's clearer? A full explanation would still be good since I've
>>>>>> only started in scheduler stuff.
>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141
>>>> Uh, it would have been really good if this was discussed a bit wider
>>>> beforehand. Now we have rather diverging approaches to this. Also would be
>>>> really good to resurrect the dma_fence annotations too.
>>>>
>>>> Can you guys pls spend a bit of time on this? Shouldn't be too hard to type
>>>> up RFC conversion patches for the other drivers.
>>> Ping for this. Currently the hw fence is returned from the ->run_job
>>> callback, and that's not great design.
>>
>> What problem do you see there?
> For one, all other drivers work like that, and it's not great to be
> inconsistent. And it allows that inconsistent/wrong pattern to continue.
>
> Second, I'm not even sure you can embed the hw fence, because there's this
> job restarting going on, which at least thus far allocated a new hw fence.
> So this needs consideration.


There is a solution to this, at least at the amdgpu level, see here:
https://www.spinics.net/lists/amd-gfx/msg66614.html
We would reset the embedded fence seqno for this purpose (see amdgpu_fence_emit).
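
A minimal userspace sketch of that idea, assuming the job keeps one embedded
fence and a resubmission only assigns it a fresh sequence number instead of
allocating a new fence. The names toy_ring, toy_fence, toy_job and
toy_fence_emit are hypothetical and only loosely modelled on the description
above; this is not the code from the linked patch or from amdgpu_fence_emit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ring that hands out increasing sequence numbers. */
struct toy_ring {
	unsigned int last_seqno;
};

/* Hypothetical embedded hw fence. */
struct toy_fence {
	unsigned int seqno;
	bool signaled;
};

/* Hypothetical job that embeds its fence instead of pointing at one. */
struct toy_job {
	struct toy_fence hw_fence;
};

/* Emit (or re-emit) the job's fence: the same embedded object is reused,
 * only its sequence number and signaled state are reset. No allocation. */
static struct toy_fence *toy_fence_emit(struct toy_ring *ring,
					struct toy_job *job)
{
	struct toy_fence *fence = &job->hw_fence;

	fence->seqno = ++ring->last_seqno;
	fence->signaled = false;
	return fence;
}
```

Calling toy_fence_emit() a second time for the same job returns the same
fence address with a newer seqno, which is the property a job restart needs.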


>
>>> If we embed it, then I think it should start existing latest from
>>> drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So
>>> the right thing to do here is to have the hw fence as a pointer in
>>> struct drm_sched_job. And check in drm_sched_job_arm() that it's at least
>>> allocated.
>>
>> Why do we need to allocate the HW fence if it's embedded within the job struct?
> The hw fence is a refcounted struct, and the drm_sched_job is a different
> struct. And we didn't have a dri-devel discussion about whether it's
> correct to conflate these two lifetimes; amdgpu folks simply hacked
> something together.


Obviously scheduler-level changes must be discussed at the dri-devel forum
level. What happened here, as Monk already mentioned, is that we had an
internal discussion about how to fix the problem in the header of this
thread: avoiding access to a freed job from the TDR handler without the
current hack of removing the job from the pending list and inserting it
back. It's there that we came up (I think Christian first mentioned this)
with the idea of embedding the HW fence into the amdgpu job, both to avoid
memory allocations and because this allows using the HW fence's
refcounting as a solution to the above problem, by simply grabbing a
reference to the next fence in the pending list as a first step in the TDR
handler. What we didn't take into account at the time is that this change
indeed cannot be limited to the amdgpu level; we noticed this much later,
during final code reviews.
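
To make the refcounting argument concrete, here is a single-threaded
userspace sketch, assuming the job memory is released only when the last
reference to its embedded fence is dropped. All names (toy_fence, toy_job,
toy_fence_get, toy_fence_put, toy_timedout) are hypothetical stand-ins, the
counter is a plain int rather than an atomic kref, and the "concurrent"
cleanup is simulated by an inline call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted fence; not thread-safe, illustration only. */
struct toy_fence {
	int refcount;
};

/* Hypothetical job with the hw fence embedded in it. */
struct toy_job {
	int id;
	struct toy_fence hw_fence;
	bool *freed_flag; /* lets the caller observe when the job is freed */
};

static struct toy_job *toy_job_create(int id, bool *freed_flag)
{
	struct toy_job *job = calloc(1, sizeof(*job));

	if (!job)
		return NULL;
	job->id = id;
	job->hw_fence.refcount = 1; /* reference held by the pending list */
	job->freed_flag = freed_flag;
	*freed_flag = false;
	return job;
}

static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
	if (f)
		f->refcount++;
	return f;
}

/* Dropping the last fence reference frees the job that embeds the fence. */
static void toy_fence_put(struct toy_fence *f)
{
	struct toy_job *job;

	if (!f || --f->refcount > 0)
		return;
	job = (struct toy_job *)((char *)f - offsetof(struct toy_job, hw_fence));
	*job->freed_flag = true;
	free(job);
}

/* Timeout handler: pin the fence first, so a cleanup path that drops the
 * pending-list reference cannot free the job while we still use it. */
static int toy_timedout(struct toy_job *job)
{
	struct toy_fence *fence = toy_fence_get(&job->hw_fence);
	int id;

	toy_fence_put(&job->hw_fence); /* simulated concurrent cleanup */
	id = job->id;                  /* still valid: we hold a reference */
	toy_fence_put(fence);          /* last reference, job is freed here */
	return id;
}
```

Without the initial toy_fence_get(), the simulated cleanup would free the
job and the read of job->id would be a use-after-free, which is the race
the pending-list removal hack was working around.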

Andrey


>
>>> Otherwise we're just diverging across drivers and tempting them to do the
>>> wrong thing with the current ->run_job callback interface.
>>
>> Maybe we should switch from embedding it in the driver-level job struct, as
>> it is now, to embedding it in drm_sched_job, and just leave the fence
>> initialization to driver-specific code?
> Maybe? Like I've not been involved in these discussions on the amd side at
> all, I'm just noticing that we now have a rather inconsistently used
> interface across drivers. Which is no good.
> -Daniel
>
>> Andrey
>>
>>
>>> Can you guys look into this?
>>> -Daniel
>>>
>>>>>> Another thing I recently pondered for tdr races looking at i915 code is
>>>>>> whether the tdr should first block the completion fence for that job. My
>>>>>> motivation is to have a race-free error capture (if the completion races
>>>>>> then we might start evicting memory and everything goes boom), but maybe
>>>>>> that helps here too. Some kind of atomic "block this fence from
>>>>>> completing" thing.
>>>>>>
>>>>>> Or am I completely guessing in the wrong direction?
>>>>> I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410
>>>> Ah yes this works because drm/sched has separate hw fence from the logical
>>>> job fence.
>>>> -Daniel
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> -Daniel
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>                       spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>>>
>>>>>>>>>>>>>                       status = job->sched->ops->timedout_job(job);
>>>>>>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>>>                               job->sched->ops->free_job(job);
>>>>>>>>>>>>>                               sched->free_guilty = false;
>>>>>>>>>>>>>                       }
>>>>>>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>>>>>>               } else {
>>>>>>>>>>>>>                       spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>>>               }
>>>>>>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>
>>>>>>>>>>>>>               kthread_park(sched->thread);
>>>>>>>>>>>>>
>>>>>>>>>>>>> -       /*
>>>>>>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>>>>>>> -        */
>>>>>>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>>>>>>> -               /*
>>>>>>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>>>>>>> -                * job extracted.
>>>>>>>>>>>>> -                */
>>>>>>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>               /*
>>>>>>>>>>>>>                * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>>>>>>                * their HW callbacks or remove them from pending list if they already
>>>>>>>>>>>>> --
>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>
>>>> -- 
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> http://blog.ffwll.ch/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-09-02 15:36                         ` Andrey Grodzovsky
@ 2021-09-07  8:47                           ` Daniel Vetter
  2021-09-07  9:25                             ` Christian König
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Vetter @ 2021-09-07  8:47 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Daniel Vetter, Alex Deucher, Jingwen Chen,
	Maling list - DRI developers, amd-gfx list, monk.liu,
	Christian Koenig

On Thu, Sep 02, 2021 at 11:36:34AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-09-02 10:28 a.m., Daniel Vetter wrote:
> > On Tue, Aug 31, 2021 at 02:24:52PM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-31 9:11 a.m., Daniel Vetter wrote:
> > > > On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
> > > > > On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
> > > > > > On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
> > > > > > > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > > > > > > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > > > > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > + dri-devel
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Since scheduler is a shared component, please add dri-devel on all
> > > > > > > > > > > > > scheduler patches.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > > > > > > > > [Why]
> > > > > > > > > > > > > > for bailing job, this commit will delete it from pending list thus the
> > > > > > > > > > > > > > bailing job will never have a chance to be resubmitted even in advance
> > > > > > > > > > > > > > tdr mode.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > [How]
> > > > > > > > > > > > > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > > > > > > > > > > > > this commit tries to work around is completely solved.So revert this
> > > > > > > > > > > > > > commit.
> > > > > > > > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > > > > > > > v2:
> > > > > > > > > > > > > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > > > > > > > > > > > > during processing timedout_job
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >        drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > > > > > > > >        1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > > > > >        {
> > > > > > > > > > > > > >               struct drm_gpu_scheduler *sched;
> > > > > > > > > > > > > >               struct drm_sched_job *job;
> > > > > > > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > > > > > > >               enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >               sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > > > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >               if (job) {
> > > > > > > > > > > > > >                       /*
> > > > > > > > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > > > > > > > +                * processing timedout_job
> > > > > > > > > > > > > >                        */
> > > > > > > > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > > > > > > > > > While this is true for amdgpu, it has no meaning for other drivers for whom
> > > > > > > > > > > > we haven't
> > > > > > > > > > > > done the refactoring of embedding HW fence (parent) into the job structure.
> > > > > > > > > > > > In fact thinking
> > > > > > > > > > > > about it, unless you do the HW fence embedding for all the drivers using the
> > > > > > > > > > > > scheduler you cannot
> > > > > > > > > > > > revert this patch or you will just break them.
> > > > > > > > > > > btw, why did you do that embedding? I do still have my patches with
> > > > > > > > > > > dma_fence annotations floating around, but my idea at least was to fix
> > > > > > > > > > > > that issue with a mempool, not with embedding. What was the motivation
> > > > > > > > > > > > for embedding the hw fence?
> > > > > > > > > > > -Daniel
> > > > > > > > > > The motivation was 2 fold, avoid memory allocation during jobs submissions
> > > > > > > > > > (HW fence allocation) because as Christian explained this leads to deadlock
> > > > > > > > > > with
> > > > > > > > > > mm code during evictions due to memory pressure (Christian can clarify if I
> > > > > > > > > > messed
> > > > > > > > > Yeah that's the exact same thing I've chased with my dma_fence
> > > > > > > > > annotations, but thus far zero to none interested in getting it sorted. I
> > > > > > > > > think it'd be good to have some cross-driver agreement on how this should
> > > > > > > > > be solved before someone just charges ahead ...
> > > > > > > > > 
> > > > > > > > > > this explanation). Second is to exactly revert this patch because while it
> > > > > > > > > > solved the issue
> > > > > > > > > > > described in the patch it created another with drivers who bailed out early
> > > > > > > > > > > during TDR handling
> > > > > > > > > > > for various reasons and the job would just leak because it was already
> > > > > > > > > > > removed from the pending list.
> > > > > > > > > Can't we reinsert it before we restart the scheduler thread? It might need
> > > > > > > > > a separate list for that due to the lockless queue tricks. Or am I
> > > > > > > > > thinking about the wrong kind of "we lost the job"?
> > > > > > > > > > -Daniel
> > > > > > > > If you look at the original patch it would reinsert it even earlier - right
> > > > > > > > > after stopping the SW scheduler thread, and even then it was too late for
> > > > > > > > some drivers as they would decide to return back from their TDR handler even
> > > > > > > > before that. It is solvable but in an ugly way as far as I see, you need to
> > > > > > > > require each driver in his code to put the job back in the list if they do
> > > > > > > > it before reaching the place where scheduler framework does it. Kind of
> > > > > > > > spaghetti code seems to me.
> > > > > > > Hm yeah I didn't realize this all happens before we stop the scheduler
> > > > > > > thread.
> > > > > > > 
> > > > > > > Why can't we stop the scheduler thread first, so that there's guaranteed
> > > > > > > no race? I've recently had a lot of discussions with panfrost folks about
> > > > > > > their reset that spawns across engines, and without stopping the scheduler
> > > > > > > thread first before you touch anything it's just plain impossible.
> > > > > > Talked with Christian on that, for each TDR we actually stop all the
> > > > > > schedulers for all the rings and not only the hanged ring since
> > > > > > ASIC reset will impact all the rings anyway. So we cannot allow
> > > > > > other timeout handlers for other rings run in parallel to ours
> > > > > > as they will stop/restart the threads we just stopped and rely
> > > > > > on them being stopped. So it's all done with device wide lock
> > > > > > > inside the amdgpu TDR handler. Only inside the locked
> > > > > > section then we may stop/restart the scheduler threads.
> > > > > > Christian also mentioned that you proposed at some point
> > > > > > to serialize all TDR handling into single threading for all rings - this
> > > > > > seems
> > > > > > like something that could be used - we then don't need any
> > > > > > locking against TDR handlers from other rings and then we may
> > > > > > stop the scheduler thread as first step
> > > > > > 
> > > > > > 
> > > > > > > I'm also still not understanding what exactly you guys have done,
> > > > > > > > can someone please dig out the amdgpu patches that motivate all this
> > > > > > > maybe that's clearer? A full explanation would still be good since I've
> > > > > > > only started in scheduler stuff.
> > > > > > > https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141
> > > > > Uh, it would have been really good if this was discussed a bit wider
> > > > > beforehand. Now we have rather diverging approaches to this. Also would be
> > > > > really good to resurrect the dma_fence annotations too.
> > > > > 
> > > > > > Can you guys pls spend a bit of time on this? Shouldn't be too hard to type
> > > > > up rfc conversion patches for the other drivers.
> > > > Ping for this. Currently the hw fence is returned from the ->run_job
> > > > callback, and that's not great design.
> > > 
> > > What's the problem you see there ?
> > For one, all other drivers work like that, and it's not great to be
> > inconsistent. And it allows that inconsistent/wrong pattern to continue.
> > 
> > Second I'm not even sure you can embed the hw fence, because there's this
> > job restarting going on. Which at least thus far allocated a new hw fence.
> > So this needs considerations.
> 
> 
> There is a solution to this at least at the amdgpu level, see here -
> https://www.spinics.net/lists/amd-gfx/msg66614.html So we would
> reset the embedded fence seqno for this purpose (see amdgpu_fence_emit).

I think stuff like this really should be lifted into standard behaviour. I
have no idea whether this is doable across the board in all drivers, and
having incompatible solutions here without understanding the constraints
across drivers is no good at all.

> > > > If we embed it, then I think it should start existing latest from
> > > > drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So
> > > > the right thing to do here is to have the hw fence as a pointer in
> > > > struct drm_sched_job. And check in drm_sched_job_arm() that it's at least
> > > > allocated.
> > > 
> > > Why we need to allocate the HW fence if it's embedded within a job struct ?
> > the hw fence is a refcounted struct, and the drm_sched_job is a different
> > struct. And we didn't have a dri-devel discussion about whether it's
> > correct to conflate these two lifetimes, amdgpu folks simply hacked
> > something together.
> 
> 
> Obviously, scheduler-level changes must be discussed on dri-devel.
> What happened here, as Monk already mentioned, is that we had an internal
> discussion about how to fix the problem in the header of this thread:
> avoiding access to a freed job from the TDR handler without the current
> hack of removing the job from the pending list and inserting it back.
> That is where we came up (I think Christian first mentioned this) with the
> idea of embedding the HW fence into the amdgpu job, both to avoid memory
> allocations and because it allows using the HW fence's refcounting as a
> solution to the above problem, by simply grabbing a reference to the next
> fence in the pending list as a first step in the TDR handler. What we
> didn't take into account at the time is that this change cannot be limited
> to the amdgpu level; we noticed this only much later, during final code
> reviews.

Not sure where this fell through the cracks, but imo at least changing where
the hw fence is allocated is a very fundamental change, and at the latest
then you should have discussed this on dri-devel.

But even the tdr races would probably have been good to start on
dri-devel. Now it looks like Monk & team have lost 6 months for nothing.
-Daniel


> 
> Andrey
> 
> 
> > 
> > > > Otherwise we're just diverging across drivers and tempting them to do the
> > > > wrong thing with the current ->run_job callback interface.
> > > 
> > > Maybe we should switch from embedding in driver level job struct as it's now
> > > to drm_sched_job and just leave the fence initialization to driver specific
> > > code ?
> > Maybe? Like I've not been involved in these discussions on the amd side at
> > all, I'm just noticing that we do have a now rather inconsistently used
> > interface across drivers. Which is no good.
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > > Can you guys look into this?
> > > > -Daniel
> > > > 
> > > > > > > Another thing I recently pondered for tdr races looking at i915 code is
> > > > > > > whether the tdr should first block the completion fence for that job. My
> > > > > > > motivation is to have a race-free error capture (if the completion races
> > > > > > > then we might start evicting memory and everything goes boom), but maybe
> > > > > > > > that helps here too. Some kind of atomic "block this fence from
> > > > > > > > completing" thing.
> > > > > > > > 
> > > > > > > > Or am I completely guessing in the wrong direction?
> > > > > > > I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410
> > > > > > Ah yes this works because drm/sched has separate hw fence from the logical
> > > > > job fence.
> > > > > -Daniel
> > > > > 
> > > > > > Andrey
> > > > > > 
> > > > > > 
> > > > > > > -Daniel
> > > > > > > 
> > > > > > > > Andrey
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > Andrey
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > > Andrey
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > >                       spin_unlock(&sched->job_list_lock);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >                       status = job->sched->ops->timedout_job(job);
> > > > > > > > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > > > > >                               job->sched->ops->free_job(job);
> > > > > > > > > > > > > >                               sched->free_guilty = false;
> > > > > > > > > > > > > >                       }
> > > > > > > > > > > > > > +               dma_fence_put(fence);
> > > > > > > > > > > > > >               } else {
> > > > > > > > > > > > > >                       spin_unlock(&sched->job_list_lock);
> > > > > > > > > > > > > >               }
> > > > > > > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >               kthread_park(sched->thread);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -       /*
> > > > > > > > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > > > > > > > -        */
> > > > > > > > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > > > > > > > -               /*
> > > > > > > > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > > > > > > > -                * job extracted.
> > > > > > > > > > > > > > -                */
> > > > > > > > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > > > > > > > -
> > > > > > > > > > > > > >               /*
> > > > > > > > > > > > > >                * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > > > > > > >                * their HW callbacks or remove them from pending list if they already
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > 
> > > > > -- 
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch/

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-09-07  8:47                           ` Daniel Vetter
@ 2021-09-07  9:25                             ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2021-09-07  9:25 UTC (permalink / raw)
  To: Daniel Vetter, Andrey Grodzovsky
  Cc: Alex Deucher, Jingwen Chen, Maling list - DRI developers,
	amd-gfx list, monk.liu

Am 07.09.21 um 10:47 schrieb Daniel Vetter:
> [SNIP]
>>>>> If we embed it, then I think it should start existing latest from
>>>>> drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So
>>>>> the right thing to do here is to have the hw fence as a pointer in
>>>>> struct drm_sched_job. And check in drm_sched_job_arm() that it's at least
>>>>> allocated.
>>>> Why we need to allocate the HW fence if it's embedded within a job struct ?
>>> the hw fence is a refcounted struct, and the drm_sched_job is a different
>>> struct. And we didn't have a dri-devel discussion about whether it's
>>> correct to conflate these two lifetimes, amdgpu folks simply hacked
>>> something together.
>>
>> Obviously, scheduler-level changes must be discussed on dri-devel.
>> What happened here, as Monk already mentioned, is that we had an internal
>> discussion about how to fix the problem in the header of this thread:
>> avoiding access to a freed job from the TDR handler without the current
>> hack of removing the job from the pending list and inserting it back.
>> That is where we came up (I think Christian first mentioned this) with the
>> idea of embedding the HW fence into the amdgpu job, both to avoid memory
>> allocations and because it allows using the HW fence's refcounting as a
>> solution to the above problem, by simply grabbing a reference to the next
>> fence in the pending list as a first step in the TDR handler. What we
>> didn't take into account at the time is that this change cannot be limited
>> to the amdgpu level; we noticed this only much later, during final code
>> reviews.
> Not sure where this fell through cracks, but imo at least changing where
> the hw fence is allocated is a very fundamental change, and latest then
> you should have discussed this on dri-devel.

I'm the one who kicked this off in April, and I made a nice internal
presentation to explain what the problem is etc. So the idea of
embedding the hardware fence into the job came from me.

But during the presentation I also noted that we need to sync up with a
guy named Daniel Vetter, because it was his patch set which surfaced this
issue by annotating fence completion prerequisites in lockdep.

> But even the tdr races would probably have been good to start on
> dri-devel. Now it looks like Monk&team have lost 6 months for nothing.

Well, to make it clear: I noted during the presentation in April that
this needs to be discussed with you, I also told the first guy
working on this that it needs to be discussed on dri-devel instead of
internally, and I'm pretty sure that I repeated this a couple more
times after it moved to somebody else. And IIRC Andrey also noted pretty
early on that we should not discuss this internally.

So if people are not listening, it is not a surprise that they spend time
on stuff which isn't upstreamable like this.

Christian.

> -Daniel
>
>
>> Andrey
>>
>>
>>>>> Otherwise we're just diverging across drivers and tempting them to do the
>>>>> wrong thing with the current ->run_job callback interface.
>>>> Maybe we should switch from embedding in driver level job struct as it's now
>>>> to drm_sched_job and just leave the fence initialization to driver specific
>>>> code ?
>>> Maybe? Like I've not been involved in these discussions on the amd side at
>>> all, I'm just noticing that we do have a now rather inconsistently used
>>> interface across drivers. Which is no good.
>>> -Daniel
>>>
>>>> Andrey
>>>>
>>>>
>>>>> Can you guys look into this?
>>>>> -Daniel
>>>>>
>>>>>>>> Another thing I recently pondered for tdr races looking at i915 code is
>>>>>>>> whether the tdr should first block the completion fence for that job. My
>>>>>>>> motivation is to have a race-free error capture (if the completion races
>>>>>>>> then we might start evicting memory and everything goes boom), but maybe
>>>>>>>> that helps here too. Some kind of atomic "block this fence from
>>>>>>>> completing" thing.
>>>>>>>>
>>>>>>>> Or am I completely guessing in the wrong direction?
>>>>>>> I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410
>>>>>> Ah yes this works because drm/sched has separate hw fence from the logical
>>>>>> job fence.
>>>>>> -Daniel
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>> -Daniel
>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                        spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                        status = job->sched->ops->timedout_job(job);
>>>>>>>>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>>>>>                                job->sched->ops->free_job(job);
>>>>>>>>>>>>>>>                                sched->free_guilty = false;
>>>>>>>>>>>>>>>                        }
>>>>>>>>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>>>>>>>>                } else {
>>>>>>>>>>>>>>>                        spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                kthread_park(sched->thread);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -       /*
>>>>>>>>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>>>>>>>>> -        */
>>>>>>>>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>>>>>>>>> -               /*
>>>>>>>>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>>>>>>>>> -                * job extracted.
>>>>>>>>>>>>>>> -                */
>>>>>>>>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>                /*
>>>>>>>>>>>>>>>                 * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>>>>>>>>                 * their HW callbacks or remove them from pending list if they already
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>
>>>>>> -- 
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> http://blog.ffwll.ch/


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
  2021-08-20  7:12                 ` Liu, Monk
@ 2023-06-08 16:40                     ` Lucas Stach
  2021-08-20 14:07                   ` Andrey Grodzovsky
  2023-06-08 16:40                     ` Lucas Stach
  2 siblings, 0 replies; 30+ messages in thread
From: Lucas Stach @ 2023-06-08 16:40 UTC (permalink / raw)
  To: Liu, Monk, Daniel Vetter, Koenig, Christian, Luben Tuikov
  Cc: Chen, JingWen, amd-gfx list, Maling list - DRI developers

Hi all,

almost 2 years later I stumbled across this exact issue still being
present in the scheduler: if the driver bails out of the timeout
handling before calling drm_sched_stop(), the timed-out job will be
leaked and the TDR timer will potentially not be restarted, as the job
isn't put back in the pending_list.
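
The leak can be illustrated with a small userspace model. This is a sketch
under stated assumptions, not the scheduler code: all model_* names are
hypothetical, the pending list is a plain singly linked list, and the only
behaviour taken from the patch quoted in this thread is that the timeout
handler removes the job (list_del_init()) and drm_sched_stop() is the place
that adds it back (list_add()).

```c
/* Userspace model only: none of the model_* names are kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_job {
	struct model_job *next;
};

struct model_sched {
	struct model_job *pending;	/* head is the earliest submitted job */
};

static void model_push_head(struct model_sched *s, struct model_job *j)
{
	j->next = s->pending;
	s->pending = j;
}

static struct model_job *model_pop_head(struct model_sched *s)
{
	struct model_job *j = s->pending;

	if (j) {
		s->pending = j->next;
		j->next = NULL;
	}
	return j;
}

static size_t model_pending_count(const struct model_sched *s)
{
	size_t n = 0;

	for (const struct model_job *j = s->pending; j; j = j->next)
		n++;
	return n;
}

/* Stand-in for drm_sched_stop(): the only place the bad job is put back. */
static void model_sched_stop(struct model_sched *s, struct model_job *bad)
{
	if (bad)
		model_push_head(s, bad);
}

/* Stand-in for a driver's timedout_job callback. */
static void model_driver_timedout(struct model_sched *s, struct model_job *bad,
				  bool bail_early)
{
	if (bail_early)
		return;			/* returns before stopping the scheduler */
	model_sched_stop(s, bad);
}

/* Models the timeout work: remove the head job, then call into the driver. */
static size_t model_handle_timeout(struct model_sched *s, bool bail_early)
{
	struct model_job *bad = model_pop_head(s);

	if (bad)
		model_driver_timedout(s, bad, bail_early);
	return model_pending_count(s);	/* jobs still tracked afterwards */
}
```

With two pending jobs, a handler run that reaches model_sched_stop() leaves
both on the list, while a run that bails early leaves only one: the removed
job is no longer reachable from the scheduler.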

How do we solve this? Apply the below suggestion?

Regards,
Lucas

Am Freitag, dem 20.08.2021 um 07:12 +0000 schrieb Liu, Monk:
> [AMD Official Use Only]
> 
> @Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
>  
> Do you have any concern on the kthread_park() approach ?
> 
> Theoretically speaking, sched_main should run mutually exclusively with job_timeout, since they both touch jobs, and stopping the scheduler during job_timeout won't impact performance, since in that scenario
> there is already something wrong/stuck on that ring/scheduler.
> 
> Thanks 
> 
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
> 
> -----Original Message-----
> From: Liu, Monk 
> Sent: Thursday, August 19, 2021 6:26 PM
> To: Daniel Vetter <daniel@ffwll.ch>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
> 
> [AMD Official Use Only]
> 
> Hi Daniel
> 
> > > Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> 
> Yeah, we had this thought in mind as well.
> 
> Our second approach is to call kthread_park() in the job_timedout() routine so that the "bad" job is guaranteed to be used without the scheduler touching or freeing it. Please check this sample patch as well:
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..50a49cb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
>         sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>  
>         /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> +       kthread_park(sched->thread);
>         spin_lock(&sched->job_list_lock);
>         job = list_first_entry_or_null(&sched->pending_list,
>                                        struct drm_sched_job, list);
>  
>         if (job) {
> -               /*
> -                * Remove the bad job so it cannot be freed by concurrent
> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -                * is parked at which point it's safe.
> -                */
> -               list_del_init(&job->list);
>                 spin_unlock(&sched->job_list_lock);
>  
>                 status = job->sched->ops->timedout_job(job);
> @@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>         } else {
>                 spin_unlock(&sched->job_list_lock);
>         }
> +       kthread_unpark(sched->thread);
>  
>         if (status != DRM_GPU_SCHED_STAT_ENODEV) {
>                 spin_lock(&sched->job_list_lock);
> @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>         kthread_park(sched->thread);
>  
>         /*
> -        * Reinsert back the bad job here - now it's safe as
> -        * drm_sched_get_cleanup_job cannot race against us and release the
> -        * bad job at this point - we parked (waited for) any in progress
> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -        * now until the scheduler thread is unparked.
> -        */
> -       if (bad && bad->sched == sched)
> -               /*
> -                * Add at the head of the queue to reflect it was the earliest
> -                * job extracted.
> -                */
> -               list_add(&bad->list, &sched->pending_list);
> -
> -       /*
>          * Iterate the job list from later to  earlier one and either deactive
>          * their HW callbacks or remove them from pending list if they already
>          * signaled.
> 
> 
> Thanks 
> 
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
> 
> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Thursday, August 19, 2021 5:31 PM
> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Alex Deucher <alexdeucher@gmail.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."
> 
> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > 
> > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > 
> > > > > > > + dri-devel
> > > > > > > 
> > > > > > > Since scheduler is a shared component, please add dri-devel 
> > > > > > > on all scheduler patches.
> > > > > > > 
> > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@amd.com> wrote:
> > > > > > > > [Why]
> > > > > > > > for bailing job, this commit will delete it from pending 
> > > > > > > > list thus the bailing job will never have a chance to be 
> > > > > > > > resubmitted even in advance tdr mode.
> > > > > > > > 
> > > > > > > > [How]
> > > > > > > > after embeded hw_fence into amdgpu_job is done, the race 
> > > > > > > > condition that this commit tries to work around is 
> > > > > > > > completely solved.So revert this commit.
> > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > v2:
> > > > > > > > add dma_fence_get/put() around timedout_job to avoid 
> > > > > > > > concurrent delete during processing timedout_job
> > > > > > > > 
> > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> > > > > > > > ---
> > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > >     1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > >     {
> > > > > > > >            struct drm_gpu_scheduler *sched;
> > > > > > > >            struct drm_sched_job *job;
> > > > > > > > +       struct dma_fence *fence;
> > > > > > > > >            enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > 
> > > > > > > > >            sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > 
> > > > > > > >            if (job) {
> > > > > > > >                    /*
> > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > +                * processing timedout_job
> > > > > > > >                     */
> > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > > > While this is true for amdgpu, it has no meaning for other 
> > > > > > drivers for whom we haven't done the refactoring of embedding 
> > > > > > HW fence (parent) into the job structure.
> > > > > > In fact thinking
> > > > > > about it, unless you do the HW fence embedding for all the 
> > > > > > drivers using the scheduler you cannot revert this patch or 
> > > > > > you will just break them.
> > > > > > btw, why did you do that embedding? I do still have my patches 
> > > > > > with dma_fence annotations floating around, but my idea at least 
> > > > > > was to fix that issue with a mempool, not with embedding. What 
> > > > > > was the motivation for embedding the hw fence?
> > > > > -Daniel
> > > > 
> > > > The motivation was 2 fold, avoid memory allocation during jobs 
> > > > submissions (HW fence allocation) because as Christian explained 
> > > > this leads to deadlock with mm code during evictions due to memory 
> > > > pressure (Christian can clarify if I messed
> > > Yeah that's the exact same thing I've chased with my dma_fence 
> > > annotations, but thus far zero to none interested in getting it 
> > > sorted. I think it'd be good to have some cross-driver agreement on 
> > > how this should be solved before someone just charges ahead ...
> > > 
> > > > > this explanation). Second is to exactly revert this patch because 
> > > > > while it solved the issue described in the patch it created 
> > > > > another with drivers who bailed out early during TDR handling for 
> > > > > various reasons and the job would just leak because it was already 
> > > > > removed from the pending list.
> > > Can't we reinsert it before we restart the scheduler thread? It 
> > > might need a separate list for that due to the lockless queue 
> > > tricks. Or am I thinking about the wrong kind of "we lost the job"?
> > > > -Daniel
> > 
> > 
> > If you look at the original patch it would reinsert it even earlier - 
> > > right after stopping the SW scheduler thread, and even then it was too 
> > > late for some drivers as they would decide to return back from their 
> > TDR handler even before that. It is solvable but in an ugly way as far 
> > as I see, you need to require each driver in his code to put the job 
> > back in the list if they do it before reaching the place where 
> > scheduler framework does it. Kind of spaghetti code seems to me.
> 
> Hm yeah I didn't realize this all happens before we stop the scheduler thread.
> 
> Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spans across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
> 
> I'm also still not understanding what exactly you guys have done, can someone please dig out the amdgpu patches that motivate all this, maybe that's clearer? A full explanation would still be good since I've only started in scheduler stuff.
> 
> Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing" thing.
> 
> Or am I completely guessing in the wrong direction?
> -Daniel
> 
> > 
> > Andrey
> > 
> > 
> > > 
> > > > Andrey
> > > > 
> > > > 
> > > > > 
> > > > > > Andrey
> > > > > > 
> > > > > > 
> > > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > > > 
> > > > > > > > >                    status = job->sched->ops->timedout_job(job);
> > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > >                            job->sched->ops->free_job(job);
> > > > > > > >                            sched->free_guilty = false;
> > > > > > > >                    }
> > > > > > > > +               dma_fence_put(fence);
> > > > > > > >            } else {
> > > > > > > >                    spin_unlock(&sched->job_list_lock);
> > > > > > > >            }
> > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > 
> > > > > > > >            kthread_park(sched->thread);
> > > > > > > > 
> > > > > > > > -       /*
> > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > -        */
> > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > -               /*
> > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > -                * job extracted.
> > > > > > > > -                */
> > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > -
> > > > > > > >            /*
> > > > > > > >             * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > >             * their HW callbacks or remove them from pending list if they already
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > > 
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/



Thread overview: 30+ messages
2021-08-18 11:21 [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job." Jingwen Chen
2021-08-18 14:02 ` Alex Deucher
2021-08-18 14:26   ` Andrey Grodzovsky
2021-08-18 14:32     ` Daniel Vetter
2021-08-18 14:36       ` Andrey Grodzovsky
2021-08-18 14:42         ` Daniel Vetter
2021-08-18 14:51           ` Andrey Grodzovsky
2021-08-19  9:30             ` Daniel Vetter
2021-08-19 10:25               ` Liu, Monk
2021-08-20  7:12                 ` Liu, Monk
2021-08-20  7:20                   ` Christian König
2021-08-20  8:09                     ` Jingwen Chen
2021-08-20 13:49                       ` Andrey Grodzovsky
2021-08-26  8:59                     ` Daniel Vetter
2021-08-20 14:07                   ` Andrey Grodzovsky
2021-08-24  7:24                     ` Liu, Monk
2021-08-24 14:23                       ` Andrey Grodzovsky
2023-06-08 16:40                   ` Lucas Stach
2023-06-08 16:40                     ` Lucas Stach
2021-08-19 15:25               ` Andrey Grodzovsky
2021-08-26  9:04                 ` Daniel Vetter
2021-08-31 13:11                   ` Daniel Vetter
2021-08-31 18:24                     ` Andrey Grodzovsky
2021-09-02 14:28                       ` Daniel Vetter
2021-09-02 15:36                         ` Andrey Grodzovsky
2021-09-07  8:47                           ` Daniel Vetter
2021-09-07  9:25                             ` Christian König
2021-08-19  3:01           ` Liu, Monk
2021-08-19  9:24             ` Daniel Vetter
2021-08-18 14:29   ` Daniel Vetter