dri-devel.lists.freedesktop.org archive mirror
* drm scheduler redesign causes deadlocks [extended repost]
@ 2023-11-21  9:00 Bert Karwatzki
  2023-11-21  9:22 ` Jani Nikula
  2023-11-22 23:02 ` Luben Tuikov
  0 siblings, 2 replies; 8+ messages in thread
From: Bert Karwatzki @ 2023-11-21  9:00 UTC (permalink / raw)
  To: dri-devel
  Cc: Matthew Brost, Tvrtko Ursulin, Luben Tuikov, Danilo Krummrich,
	Bert Karwatzki, Christian König

Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15 laptop)
suffers from random deadlocks which can occur after 30-180 min of usage. These
deadlocks can be actively provoked by creating high system load (usually by
compiling a kernel with make -j NRCPUS) and then opening instances of libreoffice
--writer until the system GUI locks up (the mouse cursor can still be moved but
the screen is frozen). In this state, ssh'ing into the machine is still possible,
and at least sometimes log messages about hung tasks appear in /var/log/kern.log.

More info can be found here:
https://gitlab.freedesktop.org/drm/amd/-/issues/2994

Using the method described above to trigger the bug, I bisected the problem in
the linux-next and drm-misc trees to commit f3123c2590005.
As this simple patch fixes the problem,

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index 044a8c4875ba..25b97db1b623 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
                      struct drm_sched_entity *entity)
 {
-       if (drm_sched_entity_is_ready(entity))
-               if (drm_sched_can_queue(sched, entity))
-                       drm_sched_run_job_queue(sched);
+       if (drm_sched_can_queue(sched, entity))
+               drm_sched_run_job_queue(sched);
 }
 
 /**

the cause might lie in the entity->dependency branch of drm_sched_entity_is_ready()
(some kind of circular dependency ...).
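
For reference, drm_sched_entity_is_ready() in the kernels discussed here is
essentially the following (quoted from memory, so details may differ slightly):

bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
{
	if (spsc_queue_peek(&entity->job_queue) == NULL)
		return false;

	if (READ_ONCE(entity->dependency))
		return false;

	return true;
}

The second check is the entity->dependency branch referred to above.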

To see whether the change to drm_sched_wakeup is the actual cause of the problem
or whether the problem was caused by the redesign of the drm scheduler in
linux-next-20231115+, I created the following patch for linux-6.6.0:

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index a42763e1429d..dc2abd299aeb 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
 container_of(cb, struct drm_sched_entity, cb);

 drm_sched_entity_clear_dep(f, cb);
- drm_sched_wakeup_if_can_queue(entity->rq->sched);
+ drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
 }

 /**
@@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job
*sched_job)
 if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
 drm_sched_rq_update_fifo(entity, submit_ts);

- drm_sched_wakeup_if_can_queue(entity->rq->sched);
+ drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
 }
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index 5a3a622fc672..bbe06403b33d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler
*sched)
  *
  * Wake up the scheduler if we can queue jobs.
  */
-void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
+void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
drm_sched_entity *entity)
 {
- if (drm_sched_can_queue(sched))
- wake_up_interruptible(&sched->wake_up_worker);
+ if(drm_sched_entity_is_ready(entity))
+ if (drm_sched_can_queue(sched))
+ wake_up_interruptible(&sched->wake_up_worker);
 }

 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index ac65f0626cfc..6cfe3d193e69 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity
*entity,
                                    unsigned int num_sched_list);

 void drm_sched_job_cleanup(struct drm_sched_job *job);
-void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
+void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
drm_sched_entity *entity);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job
*bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);

This brings the extra check to the old scheduler and has so far not caused any
trouble (using the same stress test described above), so chances are that the
error is somewhere else in the redesigned scheduler.


Bert Karwatzki


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: drm scheduler redesign causes deadlocks [extended repost]
  2023-11-21  9:00 drm scheduler redesign causes deadlocks [extended repost] Bert Karwatzki
@ 2023-11-21  9:22 ` Jani Nikula
  2023-11-21  9:45   ` Bert Karwatzki
  2023-11-22 23:02 ` Luben Tuikov
  1 sibling, 1 reply; 8+ messages in thread
From: Jani Nikula @ 2023-11-21  9:22 UTC (permalink / raw)
  To: Bert Karwatzki, dri-devel
  Cc: Matthew Brost, Luben Tuikov, Tvrtko Ursulin, Danilo Krummrich,
	Bert Karwatzki, Christian König

On Tue, 21 Nov 2023, Bert Karwatzki <spasswolf@web.de> wrote:
> As this simple patch fixes the problem

Please use git send-email to send patches. Evolution botched up the
whitespace here.

BR,
Jani.


-- 
Jani Nikula, Intel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: drm scheduler redesign causes deadlocks [extended repost]
  2023-11-21  9:22 ` Jani Nikula
@ 2023-11-21  9:45   ` Bert Karwatzki
  0 siblings, 0 replies; 8+ messages in thread
From: Bert Karwatzki @ 2023-11-21  9:45 UTC (permalink / raw)
  To: Jani Nikula, dri-devel
  Cc: Matthew Brost, Luben Tuikov, Danilo Krummrich,
	Christian König, Tvrtko Ursulin

On Tuesday, 21.11.2023 at 11:22 +0200, Jani Nikula wrote:
> On Tue, 21 Nov 2023, Bert Karwatzki <spasswolf@web.de> wrote:
> > As this simple patch fixes the problem
>
> Please use git send-email to send patches. Evolution botched up the
> whitespace here.
>
> BR,
> Jani.
>
>

Noted (actually not the first time Evolution has done this). But I'm not
resending the patches here because they are not meant as actual fixes; they only
serve to illustrate the problem, the cause of which is not yet known.

Bert Karwatzki

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: drm scheduler redesign causes deadlocks [extended repost]
  2023-11-21  9:00 drm scheduler redesign causes deadlocks [extended repost] Bert Karwatzki
  2023-11-21  9:22 ` Jani Nikula
@ 2023-11-22 23:02 ` Luben Tuikov
  2023-11-24  9:38   ` Bert Karwatzki
  1 sibling, 1 reply; 8+ messages in thread
From: Luben Tuikov @ 2023-11-22 23:02 UTC (permalink / raw)
  To: Bert Karwatzki, dri-devel
  Cc: Matthew Brost, Danilo Krummrich, Christian König, Tvrtko Ursulin


On 2023-11-21 04:00, Bert Karwatzki wrote:
> Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15 laptop)
> suffers from random deadlocks which can occur after 30-180 min of usage. These
> deadlocks can be actively provoked by creating high system load (usually by
> compiling a kernel with make -j NRCPUS) and then opening instances of libreoffice
> --writer until the system GUI locks up (the mouse cursor can still be moved but
> the screen is frozen). In this state, ssh'ing into the machine is still possible,
> and at least sometimes log messages about hung tasks appear in /var/log/kern.log.
> 
> More info can be found here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> 
> Using the method described above to trigger the bug, I bisected the problem in
> the linux-next and drm-misc trees to commit f3123c2590005.
> As this simple patch fixes the problem,
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..25b97db1b623 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>                       struct drm_sched_entity *entity)
>  {
> -       if (drm_sched_entity_is_ready(entity))
> -               if (drm_sched_can_queue(sched, entity))
> -                       drm_sched_run_job_queue(sched);
> +       if (drm_sched_can_queue(sched, entity))
> +               drm_sched_run_job_queue(sched);
>  }
>  
>  /**
> 
> the cause might lie in the entity->dependency branch of drm_sched_entity_is_ready()
> (some kind of circular dependency ...).
> 
> To see whether the change to drm_sched_wakeup is the actual cause of the problem
> or whether the problem was caused by the redesign of the drm scheduler in
> linux-next-20231115+, I created the following patch for linux-6.6.0:
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index a42763e1429d..dc2abd299aeb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  container_of(cb, struct drm_sched_entity, cb);
> 
>  drm_sched_entity_clear_dep(f, cb);
> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
> 
>  /**
> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job
> *sched_job)
>  if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  drm_sched_rq_update_fifo(entity, submit_ts);
> 
> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a3a622fc672..bbe06403b33d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler
> *sched)
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
> drm_sched_entity *entity)
>  {
> - if (drm_sched_can_queue(sched))
> - wake_up_interruptible(&sched->wake_up_worker);
> + if(drm_sched_entity_is_ready(entity))
> + if (drm_sched_can_queue(sched))
> + wake_up_interruptible(&sched->wake_up_worker);
>  }
> 
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index ac65f0626cfc..6cfe3d193e69 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity
> *entity,
>                                     unsigned int num_sched_list);
> 
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
> drm_sched_entity *entity);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job
> *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> 
> This brings the extra check to the old scheduler and has so far not caused any
> trouble (using the same stress test described above), so chances are that the
> error is somewhere else in the redesigned scheduler.
> 
> 
> Bert Karwatzki

Hi Bert,

Thanks for looking into this.

As an afterthought, removing the "entity_is_ready()" qualifier from the wake-up makes
the scheduling more opportunistic, and I agree that that is the more correct approach.
Commit f3123c2590005 basically brought the code as close as possible to the way it
functioned before the move to run-queues.

Would you like to create a patch for this?
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: drm scheduler redesign causes deadlocks [extended repost]
  2023-11-22 23:02 ` Luben Tuikov
@ 2023-11-24  9:38   ` Bert Karwatzki
  2023-11-25 20:03     ` Luben Tuikov
  0 siblings, 1 reply; 8+ messages in thread
From: Bert Karwatzki @ 2023-11-24  9:38 UTC (permalink / raw)
  To: Luben Tuikov, dri-devel
  Cc: Matthew Brost, Danilo Krummrich, Christian König, Tvrtko Ursulin

On Wednesday, 22.11.2023 at 18:02 -0500, Luben Tuikov wrote:
> On 2023-11-21 04:00, Bert Karwatzki wrote:
> > Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15
> > laptop) suffers from random deadlocks which can occur after 30-180 min of
> > usage. These deadlocks can be actively provoked by creating high system load
> > (usually by compiling a kernel with make -j NRCPUS) and then opening instances
> > of libreoffice --writer until the system GUI locks up (the mouse cursor can
> > still be moved but the screen is frozen). In this state, ssh'ing into the
> > machine is still possible, and at least sometimes log messages about hung
> > tasks appear in /var/log/kern.log.
> >
> > More info can be found here:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> >
> > Using the method described above to trigger the bug, I bisected the problem
> > in the linux-next and drm-misc trees to commit f3123c2590005.
> > As this simple patch fixes the problem,
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index 044a8c4875ba..25b97db1b623 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >                       struct drm_sched_entity *entity)
> >  {
> > -       if (drm_sched_entity_is_ready(entity))
> > -               if (drm_sched_can_queue(sched, entity))
> > -                       drm_sched_run_job_queue(sched);
> > +       if (drm_sched_can_queue(sched, entity))
> > +               drm_sched_run_job_queue(sched);
> >  }
> >  
> >  /**
> >
> > the cause might lie in the entity->dependency branch of
> > drm_sched_entity_is_ready() (some kind of circular dependency ...).
> >
> > To see whether the change to drm_sched_wakeup is the actual cause of the
> > problem or whether the problem was caused by the redesign of the drm scheduler
> > in linux-next-20231115+, I created the following patch for linux-6.6.0:
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > b/drivers/gpu/drm/scheduler/sched_entity.c
> > index a42763e1429d..dc2abd299aeb 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
> >  container_of(cb, struct drm_sched_entity, cb);
> >
> >  drm_sched_entity_clear_dep(f, cb);
> > - drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  }
> >
> >  /**
> > @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job
> > *sched_job)
> >  if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> >  drm_sched_rq_update_fifo(entity, submit_ts);
> >
> > - drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  }
> >  }
> >  EXPORT_SYMBOL(drm_sched_entity_push_job);
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index 5a3a622fc672..bbe06403b33d 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct
> > drm_gpu_scheduler
> > *sched)
> >   *
> >   * Wake up the scheduler if we can queue jobs.
> >   */
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
> > drm_sched_entity *entity)
> >  {
> > - if (drm_sched_can_queue(sched))
> > - wake_up_interruptible(&sched->wake_up_worker);
> > + if(drm_sched_entity_is_ready(entity))
> > + if (drm_sched_can_queue(sched))
> > + wake_up_interruptible(&sched->wake_up_worker);
> >  }
> >
> >  /**
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index ac65f0626cfc..6cfe3d193e69 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct
> > drm_sched_entity
> > *entity,
> >                                     unsigned int num_sched_list);
> >
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
> > drm_sched_entity *entity);
> >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job
> > *bad);
> >  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> >
> > This brings the extra check to the old scheduler and has so far not caused
> > any trouble (using the same stress test described above), so chances are that
> > the error is somewhere else in the redesigned scheduler.
> >
> >
> > Bert Karwatzki
>
> Hi Bert,
>
> Thanks for looking into this.
>
> As an afterthought, removing the "entity_is_ready()" qualifier from the wake-up
> makes the scheduling more opportunistic, and I agree that that is the more
> correct approach. Commit f3123c2590005 basically brought the code as close as
> possible to the way it functioned before the move to run-queues.
>
> Would you like to create a patch for this?

Should I create the patch (is a simple revert of f3123c.. plus an explanatory
commit message enough?) for the drm-misc tree?

Also, I did a little more experimentation: I applied f3123c25900 on top
of commit a6149f0393699 and on top of commit 35963cf2cd25e to see where the
problems start. The result is that f3.. + a6.. leads to lock-ups while f3.. + 35..
seems to be fine, so the problem starts with the move from kthread to work queue.

Bert Karwatzki

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: drm scheduler redesign causes deadlocks [extended repost]
  2023-11-24  9:38   ` Bert Karwatzki
@ 2023-11-25 20:03     ` Luben Tuikov
  2023-11-27 16:09       ` [PATCH] drm/sched: Partial revert of "Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()" Bert Karwatzki
  0 siblings, 1 reply; 8+ messages in thread
From: Luben Tuikov @ 2023-11-25 20:03 UTC (permalink / raw)
  To: Bert Karwatzki, dri-devel
  Cc: Matthew Brost, Danilo Krummrich, Christian König, Tvrtko Ursulin


On 2023-11-24 04:38, Bert Karwatzki wrote:
> On Wednesday, 22.11.2023 at 18:02 -0500, Luben Tuikov wrote:
>> On 2023-11-21 04:00, Bert Karwatzki wrote:
>>> Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15
>>> laptop) suffers from random deadlocks which can occur after 30-180 min of
>>> usage. These deadlocks can be actively provoked by creating high system load
>>> (usually by compiling a kernel with make -j NRCPUS) and then opening instances
>>> of libreoffice --writer until the system GUI locks up (the mouse cursor can
>>> still be moved but the screen is frozen). In this state, ssh'ing into the
>>> machine is still possible, and at least sometimes log messages about hung
>>> tasks appear in /var/log/kern.log.
>>>
>>> More info can be found here:
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
>>>
>>> Using the method described above to trigger the bug, I bisected the problem in
>>> the linux-next and drm-misc trees to commit f3123c2590005.
>>> As this simple patch fixes the problem,
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 044a8c4875ba..25b97db1b623 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>>>                       struct drm_sched_entity *entity)
>>>  {
>>> -       if (drm_sched_entity_is_ready(entity))
>>> -               if (drm_sched_can_queue(sched, entity))
>>> -                       drm_sched_run_job_queue(sched);
>>> +       if (drm_sched_can_queue(sched, entity))
>>> +               drm_sched_run_job_queue(sched);
>>>  }
>>>  
>>>  /**
>>>
>>> the cause might lie in the entity->dependency branch of drm_sched_entity_is_ready()
>>> (some kind of circular dependency ...).
>>>
>>> To see whether the change to drm_sched_wakeup is the actual cause of the
>>> problem or whether the problem was caused by the redesign of the drm scheduler
>>> in linux-next-20231115+, I created the following patch for linux-6.6.0:
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index a42763e1429d..dc2abd299aeb 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>>  container_of(cb, struct drm_sched_entity, cb);
>>>
>>>  drm_sched_entity_clear_dep(f, cb);
>>> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  }
>>>
>>>  /**
>>> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job
>>> *sched_job)
>>>  if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>  drm_sched_rq_update_fifo(entity, submit_ts);
>>>
>>> - drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> + drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  }
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_entity_push_job);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 5a3a622fc672..bbe06403b33d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct
>>> drm_gpu_scheduler
>>> *sched)
>>>   *
>>>   * Wake up the scheduler if we can queue jobs.
>>>   */
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
>>> drm_sched_entity *entity)
>>>  {
>>> - if (drm_sched_can_queue(sched))
>>> - wake_up_interruptible(&sched->wake_up_worker);
>>> + if(drm_sched_entity_is_ready(entity))
>>> + if (drm_sched_can_queue(sched))
>>> + wake_up_interruptible(&sched->wake_up_worker);
>>>  }
>>>
>>>  /**
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index ac65f0626cfc..6cfe3d193e69 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct
>>> drm_sched_entity
>>> *entity,
>>>                                     unsigned int num_sched_list);
>>>
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct
>>> drm_sched_entity *entity);
>>>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job
>>> *bad);
>>>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>>>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>>
>>> This brings the extra check to the old scheduler and has so far not caused
>>> any trouble (using the same stress test described above), so chances are that
>>> the error is somewhere else in the redesigned scheduler.
>>>
>>>
>>> Bert Karwatzki
>>
>> Hi Bert,
>>
>> Thanks for looking into this.
>>
>> As an afterthought, removing the "entity_is_ready()" qualifier from the wake-up
>> makes the scheduling more opportunistic, and I agree that that is the more
>> correct approach. Commit f3123c2590005 basically brought the code as close as
>> possible to the way it functioned before the move to run-queues.
>>
>> Would you like to create a patch for this?
> 
> Should I create the patch (is a simple revert of f3123c.. plus an explanatory
> commit message enough?) for the drm-misc tree?

Yes, that'd be awesome!

Please CC all the people in this email, and, if you could, send it --in-reply-to=
the last message in this thread, keeping all the CCs. An easy way to do this is to
just copy the "git send-email" line from lore.kernel.org. Thanks!

> Also, I did a little more experimentation: I applied f3123c25900 on top
> of commit a6149f0393699 and on top of commit 35963cf2cd25e to see where the
> problems start. The result is that f3.. + a6.. leads to lock-ups while f3.. + 35..
> seems to be fine, so the problem starts with the move from kthread to work queue.

Outstanding! Thank you Bert for doing all this testing.

So while the conclusion might be that a6 is the problem, it really is not.
Scheduling (CPU/GPU) is a fickle thing, and care and deep thought need to be
exercised when implementing it--it really is not a trivial thing.

See the thread responses starting with this email:
https://lore.kernel.org/all/d2bf144f-e388-4cb1-bc18-12efad4f677b@linux.intel.com/

Yes, to avoid lock-ups we should revert f3.

However, I think that we should go further and have drm_sched_wakeup() call
drm_sched_run_job_queue() unconditionally. drm_sched_run_job_work() does do its
own checks whether to continue or dequeue, and traditionally it's been done like this
to avoid races.

Could you also test with,

void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
		      struct drm_sched_entity *entity)
{
	drm_sched_run_job_queue(sched);
}

IOW, traditionally, a scheduler wake-up is uni-directional, i.e.,
	free up a resource,
	wake up the thread waiting to use it.
The resource could be, for instance, an item added to a queue, atomically, and so on.
What is important, however, is that there is no decision making in this process.
The modified version of drm_sched_wakeup() above achieves this one-direction.
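
A minimal sketch of that pattern, with purely illustrative names (nothing from
the drm tree), looks like this:

static void producer_add(struct producer *p, struct item *it)
{
	spin_lock(&p->lock);
	list_add_tail(&it->node, &p->queue);	/* publish the resource */
	spin_unlock(&p->lock);
	queue_work(p->wq, &p->work);		/* wake the worker, no decision making here */
}

static void consumer_work(struct work_struct *w)
{
	struct producer *p = container_of(w, struct producer, work);
	struct item *it;

	/* all decision making happens here, together with the dequeue */
	while ((it = dequeue_one(p)) != NULL)
		handle(it);
}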

In our wake-up, we check whether to schedule work, using the same checks the work
item does after it has been scheduled and is running. The former could lead to
problems. I did mention in the linked thread above that the worst we get is an
extra wake-up and an extra check in a string of schedules, so the overhead would be
1/#(runs of drm_sched_run_job_work), i.e. the longer the string of schedules, the
smaller the overhead.
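
For reference, the work item on the drm-misc-next side already bails out on its
own when there is nothing to run; from memory it is roughly the following (heavily
elided, see sched_main.c for the real thing):

static void drm_sched_run_job_work(struct work_struct *w)
{
	struct drm_gpu_scheduler *sched =
		container_of(w, struct drm_gpu_scheduler, work_run_job);
	struct drm_sched_entity *entity;

	entity = drm_sched_select_entity(sched);
	if (!entity)
		return;	/* nothing ready; a spurious wake-up costs only this check */

	/* ... pop the job from the entity, run it, requeue the work item ... */
}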

Thanks again!
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] drm/sched: Partial revert of "Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()"
  2023-11-25 20:03     ` Luben Tuikov
@ 2023-11-27 16:09       ` Bert Karwatzki
  2023-11-28 19:19         ` Luben Tuikov
  0 siblings, 1 reply; 8+ messages in thread
From: Bert Karwatzki @ 2023-11-27 16:09 UTC (permalink / raw)
  To: ltuikov89
  Cc: matthew.brost, tvrtko.ursulin, dri-devel, dakr, spasswolf,
	christian.koenig

Commit f3123c2590005c, in combination with the use of work queues by the GPU
scheduler, leads to random lock-ups of the GUI.

This is a partial revert of commit f3123c2590005c since drm_sched_wakeup() still
needs its entity argument to pass it to drm_sched_can_queue().

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2994
Link: https://lists.freedesktop.org/archives/dri-devel/2023-November/431606.html
Fixes: f3123c2590005c ("drm/sched: Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()")

Signed-off-by: Bert Karwatzki <spasswolf@web.de>
---
 drivers/gpu/drm/scheduler/sched_main.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 682aebe96db7..550492a7a031 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
 		      struct drm_sched_entity *entity)
 {
-	if (drm_sched_entity_is_ready(entity))
-		if (drm_sched_can_queue(sched, entity))
-			drm_sched_run_job_queue(sched);
+	if (drm_sched_can_queue(sched, entity))
+		drm_sched_run_job_queue(sched);
 }

 /**
--
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/sched: Partial revert of "Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()"
  2023-11-27 16:09       ` [PATCH] drm/sched: Partial revert of "Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()" Bert Karwatzki
@ 2023-11-28 19:19         ` Luben Tuikov
  0 siblings, 0 replies; 8+ messages in thread
From: Luben Tuikov @ 2023-11-28 19:19 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: matthew.brost, dakr, christian.koenig, dri-devel, tvrtko.ursulin


On 2023-11-27 11:09, Bert Karwatzki wrote:
> Commit f3123c2590005c, in combination with the use of work queues by the GPU
> scheduler, leads to random lock-ups of the GUI.
> 
> This is a partial revert of commit f3123c2590005c since drm_sched_wakeup() still
> needs its entity argument to pass it to drm_sched_can_queue().
> 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> Link: https://lists.freedesktop.org/archives/dri-devel/2023-November/431606.html
> Fixes: f3123c2590005c ("drm/sched: Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()")
> 
> Signed-off-by: Bert Karwatzki <spasswolf@web.de>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 682aebe96db7..550492a7a031 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_entity_is_ready(entity))
> -		if (drm_sched_can_queue(sched, entity))
> -			drm_sched_run_job_queue(sched);
> +	if (drm_sched_can_queue(sched, entity))
> +		drm_sched_run_job_queue(sched);
>  }
> 
>  /**
> --
> 2.43.0
> 

Reviewed-by: Luben Tuikov <ltuikov89@gmail.com>

Pushed to drm-misc-next.

Thanks!
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-11-28 19:19 UTC | newest]

Thread overview: 8+ messages
-- links below jump to the message on this page --
2023-11-21  9:00 drm scheduler redesign causes deadlocks [extended repost] Bert Karwatzki
2023-11-21  9:22 ` Jani Nikula
2023-11-21  9:45   ` Bert Karwatzki
2023-11-22 23:02 ` Luben Tuikov
2023-11-24  9:38   ` Bert Karwatzki
2023-11-25 20:03     ` Luben Tuikov
2023-11-27 16:09       ` [PATCH] drm/sched: Partial revert of "Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()" Bert Karwatzki
2023-11-28 19:19         ` Luben Tuikov
