dri-devel.lists.freedesktop.org archive mirror
* [PATCH] drm/sched: Drain all entities in DRM sched run job worker
@ 2024-01-24 21:08 Matthew Brost
  2024-01-25  9:24 ` Vlastimil Babka
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Matthew Brost @ 2024-01-24 21:08 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: Matthew Brost, ltuikov89, Thorsten Leemhuis, Mario Limonciello,
	daniel, Mikhail Gavrilov, airlied, christian.koenig,
	Vlastimil Babka

All entities must be drained in the DRM scheduler run job worker to
avoid the following case. An entity found that is ready, no job found
ready on entity, and run job worker goes idle with other entities + jobs
ready. Draining all ready entities (i.e. loop over all ready entities)
in the run job worker ensures all jobs that are ready will be scheduled.

Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 550492a7a031..85f082396d42 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
 	struct drm_sched_entity *entity;
 	struct dma_fence *fence;
 	struct drm_sched_fence *s_fence;
-	struct drm_sched_job *sched_job;
+	struct drm_sched_job *sched_job = NULL;
 	int r;
 
 	if (READ_ONCE(sched->pause_submit))
 		return;
 
-	entity = drm_sched_select_entity(sched);
+	/* Find entity with a ready job */
+	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
+		sched_job = drm_sched_entity_pop_job(entity);
+		if (!sched_job)
+			complete_all(&entity->entity_idle);
+	}
 	if (!entity)
-		return;
-
-	sched_job = drm_sched_entity_pop_job(entity);
-	if (!sched_job) {
-		complete_all(&entity->entity_idle);
 		return;	/* No more work */
-	}
 
 	s_fence = sched_job->s_fence;
 
-- 
2.34.1



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-24 21:08 [PATCH] drm/sched: Drain all entities in DRM sched run job worker Matthew Brost
@ 2024-01-25  9:24 ` Vlastimil Babka
  2024-01-25 17:30   ` Matthew Brost
  2024-01-25 15:12 ` Christian König
  2024-01-29  4:08 ` Luben Tuikov
  2 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2024-01-25  9:24 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: ltuikov89, Thorsten Leemhuis, Mario Limonciello, daniel,
	Mikhail Gavrilov, airlied, christian.koenig

On 1/24/24 22:08, Matthew Brost wrote:
> All entities must be drained in the DRM scheduler run job worker to
> avoid the following case. An entity found that is ready, no job found
> ready on entity, and run job worker goes idle with other entities + jobs
> ready. Draining all ready entities (i.e. loop over all ready entities)
> in the run job worker ensures all job that are ready will be scheduled.
> 
> Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> Reported-by: Vlastimil Babka <vbabka@suse.cz>

Can change to Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 550492a7a031..85f082396d42 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  	struct drm_sched_entity *entity;
>  	struct dma_fence *fence;
>  	struct drm_sched_fence *s_fence;
> -	struct drm_sched_job *sched_job;
> +	struct drm_sched_job *sched_job = NULL;
>  	int r;
>  
>  	if (READ_ONCE(sched->pause_submit))
>  		return;
>  
> -	entity = drm_sched_select_entity(sched);
> +	/* Find entity with a ready job */
> +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> +		sched_job = drm_sched_entity_pop_job(entity);
> +		if (!sched_job)
> +			complete_all(&entity->entity_idle);
> +	}
>  	if (!entity)
> -		return;
> -
> -	sched_job = drm_sched_entity_pop_job(entity);
> -	if (!sched_job) {
> -		complete_all(&entity->entity_idle);
>  		return;	/* No more work */
> -	}
>  
>  	s_fence = sched_job->s_fence;
>  



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-24 21:08 [PATCH] drm/sched: Drain all entities in DRM sched run job worker Matthew Brost
  2024-01-25  9:24 ` Vlastimil Babka
@ 2024-01-25 15:12 ` Christian König
  2024-01-25 17:30   ` Matthew Brost
  2024-01-29  4:08 ` Luben Tuikov
  2 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2024-01-25 15:12 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: ltuikov89, Thorsten Leemhuis, Mario Limonciello, daniel,
	Mikhail Gavrilov, airlied, Vlastimil Babka



On 24.01.24 at 22:08, Matthew Brost wrote:
> All entities must be drained in the DRM scheduler run job worker to
> avoid the following case. An entity found that is ready, no job found
> ready on entity, and run job worker goes idle with other entities + jobs
> ready. Draining all ready entities (i.e. loop over all ready entities)
> in the run job worker ensures all job that are ready will be scheduled.

That doesn't make sense. drm_sched_select_entity() only returns entities 
which are "ready", i.e. have a job to run.

If that's not the case any more then you have broken something else.

Regards,
Christian.

>
> Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>   1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 550492a7a031..85f082396d42 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>   	struct drm_sched_entity *entity;
>   	struct dma_fence *fence;
>   	struct drm_sched_fence *s_fence;
> -	struct drm_sched_job *sched_job;
> +	struct drm_sched_job *sched_job = NULL;
>   	int r;
>   
>   	if (READ_ONCE(sched->pause_submit))
>   		return;
>   
> -	entity = drm_sched_select_entity(sched);
> +	/* Find entity with a ready job */
> +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> +		sched_job = drm_sched_entity_pop_job(entity);
> +		if (!sched_job)
> +			complete_all(&entity->entity_idle);
> +	}
>   	if (!entity)
> -		return;
> -
> -	sched_job = drm_sched_entity_pop_job(entity);
> -	if (!sched_job) {
> -		complete_all(&entity->entity_idle);
>   		return;	/* No more work */
> -	}
>   
>   	s_fence = sched_job->s_fence;
>   



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-25 15:12 ` Christian König
@ 2024-01-25 17:30   ` Matthew Brost
  2024-01-26 10:32     ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Brost @ 2024-01-25 17:30 UTC (permalink / raw)
  To: Christian König
  Cc: ltuikov89, dri-devel, Thorsten Leemhuis, Mario Limonciello,
	daniel, Mikhail Gavrilov, airlied, intel-xe, Vlastimil Babka

On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
> 
> 
> Am 24.01.24 um 22:08 schrieb Matthew Brost:
> > All entities must be drained in the DRM scheduler run job worker to
> > avoid the following case. An entity found that is ready, no job found
> > ready on entity, and run job worker goes idle with other entities + jobs
> > ready. Draining all ready entities (i.e. loop over all ready entities)
> > in the run job worker ensures all job that are ready will be scheduled.
> 
> That doesn't make sense. drm_sched_select_entity() only returns entities
> which are "ready", e.g. have a job to run.
> 

That is what I thought too, hence my original design but it is not
exactly true. Let me explain.

drm_sched_select_entity() returns an entity with a non-empty spsc queue
(job in queue) and no *current* waiting dependencies [1]. Dependencies for
an entity can be added when drm_sched_entity_pop_job() is called [2][3]
returning a NULL job. Thus we can get into a scenario where 2 entities
A and B both have jobs and no current dependencies. A's job is waiting on
B's job, entity A gets selected first, a dependency gets installed in
drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
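
To make the race concrete, here is a rough sketch of the interleaving
against the pre-patch single-pass worker (illustrative only, not the
actual scheduler code; the comments mark the hypothetical A/B ordering):

	/* pre-patch drm_sched_run_job_work(): one pass, then return */
	entity = drm_sched_select_entity(sched);	/* picks A: job queued,
							 * no dependency yet */
	sched_job = drm_sched_entity_pop_job(entity);	/* A's job depends on B's
							 * still-queued job; the
							 * dependency is installed
							 * here and NULL is
							 * returned */
	if (!sched_job) {
		complete_all(&entity->entity_idle);
		return;		/* worker idles; B's ready job is never run,
				 * so A's dependency never signals -> deadlock */
	}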

The proper solution is to loop over all ready entities until one with a
job is found via drm_sched_entity_pop_job() and then requeue the run
job worker. Or loop over all entities until drm_sched_select_entity()
returns NULL and then let the run job worker go idle. This is what the
old threaded design did too [4]. Hope this clears everything up.
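
For comparison, a rough sketch of how the old kthread loop [4] behaved
(heavily simplified, not the exact code):

	/* old drm_sched_main() kthread */
	while (!kthread_should_stop()) {
		wait_event_interruptible(sched->wake_up_worker, ...);

		entity = drm_sched_select_entity(sched);
		if (!entity)
			continue;

		sched_job = drm_sched_entity_pop_job(entity);
		if (!sched_job) {
			complete_all(&entity->entity_idle);
			continue;	/* go around again and pick the
					 * next ready entity rather than
					 * sleeping */
		}
		/* ... run the job ... */
	}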

Matt

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
[2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
[3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
[4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011

> If that's not the case any more then you have broken something else.
> 
> Regards,
> Christian.
> 
> > 
> > Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> > Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> > Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> > Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> > Reported-by: Vlastimil Babka <vbabka@suse.cz>
> > Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
> >   1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 550492a7a031..85f082396d42 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >   	struct drm_sched_entity *entity;
> >   	struct dma_fence *fence;
> >   	struct drm_sched_fence *s_fence;
> > -	struct drm_sched_job *sched_job;
> > +	struct drm_sched_job *sched_job = NULL;
> >   	int r;
> >   	if (READ_ONCE(sched->pause_submit))
> >   		return;
> > -	entity = drm_sched_select_entity(sched);
> > +	/* Find entity with a ready job */
> > +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> > +		sched_job = drm_sched_entity_pop_job(entity);
> > +		if (!sched_job)
> > +			complete_all(&entity->entity_idle);
> > +	}
> >   	if (!entity)
> > -		return;
> > -
> > -	sched_job = drm_sched_entity_pop_job(entity);
> > -	if (!sched_job) {
> > -		complete_all(&entity->entity_idle);
> >   		return;	/* No more work */
> > -	}
> >   	s_fence = sched_job->s_fence;
> 


* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-25  9:24 ` Vlastimil Babka
@ 2024-01-25 17:30   ` Matthew Brost
  2024-01-26  2:45     ` Dave Airlie
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Brost @ 2024-01-25 17:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: ltuikov89, dri-devel, intel-xe, Thorsten Leemhuis,
	Mario Limonciello, daniel, Mikhail Gavrilov, airlied,
	christian.koenig

On Thu, Jan 25, 2024 at 10:24:24AM +0100, Vlastimil Babka wrote:
> On 1/24/24 22:08, Matthew Brost wrote:
> > All entities must be drained in the DRM scheduler run job worker to
> > avoid the following case. An entity found that is ready, no job found
> > ready on entity, and run job worker goes idle with other entities + jobs
> > ready. Draining all ready entities (i.e. loop over all ready entities)
> > in the run job worker ensures all job that are ready will be scheduled.
> > 
> > Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> > Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> > Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> > Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> > Reported-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Can change to Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
> 

+1, got it.

Matt

> Thanks!
> 
> > Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 550492a7a031..85f082396d42 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >  	struct drm_sched_entity *entity;
> >  	struct dma_fence *fence;
> >  	struct drm_sched_fence *s_fence;
> > -	struct drm_sched_job *sched_job;
> > +	struct drm_sched_job *sched_job = NULL;
> >  	int r;
> >  
> >  	if (READ_ONCE(sched->pause_submit))
> >  		return;
> >  
> > -	entity = drm_sched_select_entity(sched);
> > +	/* Find entity with a ready job */
> > +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> > +		sched_job = drm_sched_entity_pop_job(entity);
> > +		if (!sched_job)
> > +			complete_all(&entity->entity_idle);
> > +	}
> >  	if (!entity)
> > -		return;
> > -
> > -	sched_job = drm_sched_entity_pop_job(entity);
> > -	if (!sched_job) {
> > -		complete_all(&entity->entity_idle);
> >  		return;	/* No more work */
> > -	}
> >  
> >  	s_fence = sched_job->s_fence;
> >  
> 


* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-25 17:30   ` Matthew Brost
@ 2024-01-26  2:45     ` Dave Airlie
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Airlie @ 2024-01-26  2:45 UTC (permalink / raw)
  To: Matthew Brost
  Cc: ltuikov89, dri-devel, intel-xe, Thorsten Leemhuis,
	Mario Limonciello, daniel, Mikhail Gavrilov, christian.koenig,
	Vlastimil Babka

 Just FYI I'm pulling this into drm-fixes straight as is, since it
fixes the regression and avoids the revert; however, please keep
discussing until we are sure things are right, and we can deal with
any fixes in a follow-up patch.

Dave.

On Fri, 26 Jan 2024 at 03:32, Matthew Brost <matthew.brost@intel.com> wrote:
>
> On Thu, Jan 25, 2024 at 10:24:24AM +0100, Vlastimil Babka wrote:
> > On 1/24/24 22:08, Matthew Brost wrote:
> > > All entities must be drained in the DRM scheduler run job worker to
> > > avoid the following case. An entity found that is ready, no job found
> > > ready on entity, and run job worker goes idle with other entities + jobs
> > > ready. Draining all ready entities (i.e. loop over all ready entities)
> > > in the run job worker ensures all job that are ready will be scheduled.
> > >
> > > Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> > > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> > > Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> > > Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
> > > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> > > Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> > > Reported-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > Can change to Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
> >
>
> +1, got it.
>
> Matt
>
> > Thanks!
> >
> > > Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
> > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 550492a7a031..85f082396d42 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
> > >     struct drm_sched_entity *entity;
> > >     struct dma_fence *fence;
> > >     struct drm_sched_fence *s_fence;
> > > -   struct drm_sched_job *sched_job;
> > > +   struct drm_sched_job *sched_job = NULL;
> > >     int r;
> > >
> > >     if (READ_ONCE(sched->pause_submit))
> > >             return;
> > >
> > > -   entity = drm_sched_select_entity(sched);
> > > +   /* Find entity with a ready job */
> > > +   while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> > > +           sched_job = drm_sched_entity_pop_job(entity);
> > > +           if (!sched_job)
> > > +                   complete_all(&entity->entity_idle);
> > > +   }
> > >     if (!entity)
> > > -           return;
> > > -
> > > -   sched_job = drm_sched_entity_pop_job(entity);
> > > -   if (!sched_job) {
> > > -           complete_all(&entity->entity_idle);
> > >             return; /* No more work */
> > > -   }
> > >
> > >     s_fence = sched_job->s_fence;
> > >
> >


* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-25 17:30   ` Matthew Brost
@ 2024-01-26 10:32     ` Christian König
  2024-01-26 16:29       ` Matthew Brost
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2024-01-26 10:32 UTC (permalink / raw)
  To: Matthew Brost
  Cc: ltuikov89, dri-devel, Thorsten Leemhuis, Mario Limonciello,
	daniel, Mikhail Gavrilov, airlied, intel-xe, Vlastimil Babka

On 25.01.24 at 18:30, Matthew Brost wrote:
> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>
>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>>> All entities must be drained in the DRM scheduler run job worker to
>>> avoid the following case. An entity found that is ready, no job found
>>> ready on entity, and run job worker goes idle with other entities + jobs
>>> ready. Draining all ready entities (i.e. loop over all ready entities)
>>> in the run job worker ensures all job that are ready will be scheduled.
>> That doesn't make sense. drm_sched_select_entity() only returns entities
>> which are "ready", e.g. have a job to run.
>>
> That is what I thought too, hence my original design but it is not
> exactly true. Let me explain.
>
> drm_sched_select_entity() returns an entity with a non-empty spsc queue
> (job in queue) and no *current* waiting dependecies [1]. Dependecies for
> an entity can be added when drm_sched_entity_pop_job() is called [2][3]
> returning a NULL job. Thus we can get into a scenario where 2 entities
> A and B both have jobs and no current dependecies. A's job is waiting
> B's job, entity A gets selected first, a dependecy gets installed in
> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.

And here is the real problem: the run work doesn't go idle at that moment.

drm_sched_run_job_work() should restart itself until there is either no 
more space in the ring buffer or it can't find a ready entity any more.

At least that was the original design when that was all still driven by 
a kthread.

It may well be that we messed this up when switching from the kthread 
to a work item.

Regards,
Christian.

>
> The proper solution is to loop over all ready entities until one with a
> job is found via drm_sched_entity_pop_job() and then requeue the run
> job worker. Or loop over all entities until drm_sched_select_entity()
> returns NULL and then let the run job worker go idle. This is what the
> old threaded design did too [4]. Hope this clears everything up.
>
> Matt
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
> [2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
> [3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
> [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011
>
>> If that's not the case any more then you have broken something else.
>>
>> Regards,
>> Christian.
>>
>>> Cc: Thorsten Leemhuis <regressions@leemhuis.info>
>>> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>>> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
>>> Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
>>> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
>>> Reported-by: Vlastimil Babka <vbabka@suse.cz>
>>> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>>>    1 file changed, 7 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 550492a7a031..85f082396d42 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>    	struct drm_sched_entity *entity;
>>>    	struct dma_fence *fence;
>>>    	struct drm_sched_fence *s_fence;
>>> -	struct drm_sched_job *sched_job;
>>> +	struct drm_sched_job *sched_job = NULL;
>>>    	int r;
>>>    	if (READ_ONCE(sched->pause_submit))
>>>    		return;
>>> -	entity = drm_sched_select_entity(sched);
>>> +	/* Find entity with a ready job */
>>> +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
>>> +		sched_job = drm_sched_entity_pop_job(entity);
>>> +		if (!sched_job)
>>> +			complete_all(&entity->entity_idle);
>>> +	}
>>>    	if (!entity)
>>> -		return;
>>> -
>>> -	sched_job = drm_sched_entity_pop_job(entity);
>>> -	if (!sched_job) {
>>> -		complete_all(&entity->entity_idle);
>>>    		return;	/* No more work */
>>> -	}
>>>    	s_fence = sched_job->s_fence;



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-26 10:32     ` Christian König
@ 2024-01-26 16:29       ` Matthew Brost
  2024-01-29  5:29         ` Luben Tuikov
  2024-01-29  7:44         ` Christian König
  0 siblings, 2 replies; 14+ messages in thread
From: Matthew Brost @ 2024-01-26 16:29 UTC (permalink / raw)
  To: Christian König
  Cc: ltuikov89, dri-devel, Thorsten Leemhuis, Mario Limonciello,
	daniel, Mikhail Gavrilov, airlied, intel-xe, Vlastimil Babka

On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
> Am 25.01.24 um 18:30 schrieb Matthew Brost:
> > On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
> > > 
> > > Am 24.01.24 um 22:08 schrieb Matthew Brost:
> > > > All entities must be drained in the DRM scheduler run job worker to
> > > > avoid the following case. An entity found that is ready, no job found
> > > > ready on entity, and run job worker goes idle with other entities + jobs
> > > > ready. Draining all ready entities (i.e. loop over all ready entities)
> > > > in the run job worker ensures all job that are ready will be scheduled.
> > > That doesn't make sense. drm_sched_select_entity() only returns entities
> > > which are "ready", e.g. have a job to run.
> > > 
> > That is what I thought too, hence my original design but it is not
> > exactly true. Let me explain.
> > 
> > drm_sched_select_entity() returns an entity with a non-empty spsc queue
> > (job in queue) and no *current* waiting dependecies [1]. Dependecies for
> > an entity can be added when drm_sched_entity_pop_job() is called [2][3]
> > returning a NULL job. Thus we can get into a scenario where 2 entities
> > A and B both have jobs and no current dependecies. A's job is waiting
> > B's job, entity A gets selected first, a dependecy gets installed in
> > drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
> 
> And here is the real problem. run work doesn't goes idle in that moment.
> 
> drm_sched_run_job_work() should restarts itself until there is either no
> more space in the ring buffer or it can't find a ready entity any more.
> 
> At least that was the original design when that was all still driven by a
> kthread.
> 
> It can perfectly be that we messed this up when switching from kthread to a
> work item.
> 

Right, that's what this patch does - the run worker does not go idle until
no ready entities are found. That was incorrect in the original patch
and is fixed here. Do you have any issues with this fix? It has been tested
3 times and clearly fixes the issue.

Matt

> Regards,
> Christian.
> 
> > 
> > The proper solution is to loop over all ready entities until one with a
> > job is found via drm_sched_entity_pop_job() and then requeue the run
> > job worker. Or loop over all entities until drm_sched_select_entity()
> > returns NULL and then let the run job worker go idle. This is what the
> > old threaded design did too [4]. Hope this clears everything up.
> > 
> > Matt
> > 
> > [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
> > [2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
> > [3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
> > [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011
> > 
> > > If that's not the case any more then you have broken something else.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> > > > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> > > > Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> > > > Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
> > > > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> > > > Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> > > > Reported-by: Vlastimil Babka <vbabka@suse.cz>
> > > > Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
> > > >    1 file changed, 7 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index 550492a7a031..85f082396d42 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
> > > >    	struct drm_sched_entity *entity;
> > > >    	struct dma_fence *fence;
> > > >    	struct drm_sched_fence *s_fence;
> > > > -	struct drm_sched_job *sched_job;
> > > > +	struct drm_sched_job *sched_job = NULL;
> > > >    	int r;
> > > >    	if (READ_ONCE(sched->pause_submit))
> > > >    		return;
> > > > -	entity = drm_sched_select_entity(sched);
> > > > +	/* Find entity with a ready job */
> > > > +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> > > > +		sched_job = drm_sched_entity_pop_job(entity);
> > > > +		if (!sched_job)
> > > > +			complete_all(&entity->entity_idle);
> > > > +	}
> > > >    	if (!entity)
> > > > -		return;
> > > > -
> > > > -	sched_job = drm_sched_entity_pop_job(entity);
> > > > -	if (!sched_job) {
> > > > -		complete_all(&entity->entity_idle);
> > > >    		return;	/* No more work */
> > > > -	}
> > > >    	s_fence = sched_job->s_fence;
> 


* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-24 21:08 [PATCH] drm/sched: Drain all entities in DRM sched run job worker Matthew Brost
  2024-01-25  9:24 ` Vlastimil Babka
  2024-01-25 15:12 ` Christian König
@ 2024-01-29  4:08 ` Luben Tuikov
  2 siblings, 0 replies; 14+ messages in thread
From: Luben Tuikov @ 2024-01-29  4:08 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: Thorsten Leemhuis, Mario Limonciello, daniel, Mikhail Gavrilov,
	airlied, christian.koenig, Vlastimil Babka



On 2024-01-24 16:08, Matthew Brost wrote:
> All entities must be drained in the DRM scheduler run job worker to
> avoid the following case. An entity found that is ready, no job found
> ready on entity, and run job worker goes idle with other entities + jobs
> ready. Draining all ready entities (i.e. loop over all ready entities)
> in the run job worker ensures all job that are ready will be scheduled.
> 
> Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Hi Matthew,

Thanks for working on this and sending the patch.

Could we add a fixes-tag to the tag list,

Fixes: f7fe64ad0f22 ("drm/sched: Split free_job into own work item")

This really drives the point home, as shown here,
https://gitlab.freedesktop.org/drm/amd/-/issues/3124
which is mentioned in a Closes tag--thanks!
-- 
Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 550492a7a031..85f082396d42 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  	struct drm_sched_entity *entity;
>  	struct dma_fence *fence;
>  	struct drm_sched_fence *s_fence;
> -	struct drm_sched_job *sched_job;
> +	struct drm_sched_job *sched_job = NULL;
>  	int r;
>  
>  	if (READ_ONCE(sched->pause_submit))
>  		return;
>  
> -	entity = drm_sched_select_entity(sched);
> +	/* Find entity with a ready job */
> +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> +		sched_job = drm_sched_entity_pop_job(entity);
> +		if (!sched_job)
> +			complete_all(&entity->entity_idle);
> +	}
>  	if (!entity)
> -		return;
> -
> -	sched_job = drm_sched_entity_pop_job(entity);
> -	if (!sched_job) {
> -		complete_all(&entity->entity_idle);
>  		return;	/* No more work */
> -	}
>  
>  	s_fence = sched_job->s_fence;
>  



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-26 16:29       ` Matthew Brost
@ 2024-01-29  5:29         ` Luben Tuikov
  2024-01-29  7:44         ` Christian König
  1 sibling, 0 replies; 14+ messages in thread
From: Luben Tuikov @ 2024-01-29  5:29 UTC (permalink / raw)
  To: Matthew Brost, Christian König
  Cc: dri-devel, Thorsten Leemhuis, Mario Limonciello, daniel,
	Mikhail Gavrilov, airlied, intel-xe, Vlastimil Babka



On 2024-01-26 11:29, Matthew Brost wrote:
> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>>
>>>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>>>>> All entities must be drained in the DRM scheduler run job worker to

Hi Matt,

Thanks for the patch. Under close review, let's use "checked" instead of "drained",
to read as follows,

All entities must be checked in the DRM scheduler run job worker to ...

>>>>> avoid the following case. An entity found that is ready, no job found

Continue with the example given by using a colon, as follows,

... avoid the following case: an entity is found which is ready, yet
no job is returned for that entity when calling drm_sched_entity_pop_job(entity).
This causes the job worker to go idle. The correct behaviour is to loop
over all ready entities, until drm_sched_entity_pop_job(entity) returns non-NULL,
or there are no more ready entities.

>>>>> ready on entity, and run job worker goes idle with other entities + jobs
>>>>> ready. Draining all ready entities (i.e. loop over all ready entities)

You see here how "drain" isn't clear enough, and you clarify in parentheses
that we in fact "loop over all ready entities". So, perhaps let's not use the
verb "drain" and simply use the sentence in the paragraph I've redacted above.

Also, let's please not use "drain" in the title, as it is confusing and makes me
think of capacitors, transistors, or buckets with water and Archimedes screws and siphons,
and instead say,

[PATCH]: drm/sched: Really find a ready entity and job in DRM sched run-job worker

Which makes it a really simple and accessible description. :-)

>>>>> in the run job worker ensures all job that are ready will be scheduled.
>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>> which are "ready", e.g. have a job to run.
>>>>
>>> That is what I thought too, hence my original design but it is not
>>> exactly true. Let me explain.
>>>
>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>> (job in queue) and no *current* waiting dependecies [1]. Dependecies for
>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3]
>>> returning a NULL job. Thus we can get into a scenario where 2 entities
>>> A and B both have jobs and no current dependecies. A's job is waiting
>>> B's job, entity A gets selected first, a dependecy gets installed in
>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
>>
>> And here is the real problem. run work doesn't goes idle in that moment.
>>
>> drm_sched_run_job_work() should restarts itself until there is either no
>> more space in the ring buffer or it can't find a ready entity any more.
>>
>> At least that was the original design when that was all still driven by a
>> kthread.
>>
>> It can perfectly be that we messed this up when switching from kthread to a
>> work item.
>>
> 
> Right, that what this patch does - the run worker does not go idle until
> no ready entities are found. That was incorrect in the original patch
> and fixed here. Do you have any issues with this fix? It has been tested
> 3x times and clearly fixes the issue.

Thanks for following up with Christian.

I agree, the fix makes sense and achieves the original intention as described
by Christian. Also, thanks to all who tested it. Good work, thanks!

With the above changes to the patch title and text addressed, this patch would be then,

Reviewed-by: Luben Tuikov <ltuikov89@gmail.com>

-- 
Regards,
Luben

 
> 
> Matt
> 
>> Regards,
>> Christian.
>>
>>>
>>> The proper solution is to loop over all ready entities until one with a
>>> job is found via drm_sched_entity_pop_job() and then requeue the run
>>> job worker. Or loop over all entities until drm_sched_select_entity()
>>> returns NULL and then let the run job worker go idle. This is what the
>>> old threaded design did too [4]. Hope this clears everything up.
>>>
>>> Matt
>>>
>>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
>>> [2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
>>> [3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
>>> [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011
>>>
>>>> If that's not the case any more then you have broken something else.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Cc: Thorsten Leemhuis <regressions@leemhuis.info>
>>>>> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>>>>> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
>>>>> Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
>>>>> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
>>>>> Reported-by: Vlastimil Babka <vbabka@suse.cz>
>>>>> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>>>>>    1 file changed, 7 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 550492a7a031..85f082396d42 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>>>    	struct drm_sched_entity *entity;
>>>>>    	struct dma_fence *fence;
>>>>>    	struct drm_sched_fence *s_fence;
>>>>> -	struct drm_sched_job *sched_job;
>>>>> +	struct drm_sched_job *sched_job = NULL;
>>>>>    	int r;
>>>>>    	if (READ_ONCE(sched->pause_submit))
>>>>>    		return;
>>>>> -	entity = drm_sched_select_entity(sched);
>>>>> +	/* Find entity with a ready job */
>>>>> +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
>>>>> +		sched_job = drm_sched_entity_pop_job(entity);
>>>>> +		if (!sched_job)
>>>>> +			complete_all(&entity->entity_idle);
>>>>> +	}
>>>>>    	if (!entity)
>>>>> -		return;
>>>>> -
>>>>> -	sched_job = drm_sched_entity_pop_job(entity);
>>>>> -	if (!sched_job) {
>>>>> -		complete_all(&entity->entity_idle);
>>>>>    		return;	/* No more work */
>>>>> -	}
>>>>>    	s_fence = sched_job->s_fence;
>>



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-26 16:29       ` Matthew Brost
  2024-01-29  5:29         ` Luben Tuikov
@ 2024-01-29  7:44         ` Christian König
  2024-01-29  7:49           ` Vlastimil Babka
  2024-01-29 17:10           ` Luben Tuikov
  1 sibling, 2 replies; 14+ messages in thread
From: Christian König @ 2024-01-29  7:44 UTC (permalink / raw)
  To: Matthew Brost
  Cc: ltuikov89, dri-devel, Thorsten Leemhuis, Mario Limonciello,
	daniel, Mikhail Gavrilov, airlied, intel-xe, Vlastimil Babka

On 26.01.24 at 17:29, Matthew Brost wrote:
> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>>>>> All entities must be drained in the DRM scheduler run job worker to
>>>>> avoid the following case. An entity found that is ready, no job found
>>>>> ready on entity, and run job worker goes idle with other entities + jobs
>>>>> ready. Draining all ready entities (i.e. loop over all ready entities)
>>>>> in the run job worker ensures all job that are ready will be scheduled.
>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>> which are "ready", e.g. have a job to run.
>>>>
>>> That is what I thought too, hence my original design but it is not
>>> exactly true. Let me explain.
>>>
>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>> (job in queue) and no *current* waiting dependecies [1]. Dependecies for
>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3]
>>> returning a NULL job. Thus we can get into a scenario where 2 entities
>>> A and B both have jobs and no current dependecies. A's job is waiting
>>> B's job, entity A gets selected first, a dependecy gets installed in
>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
>> And here is the real problem. run work doesn't goes idle in that moment.
>>
>> drm_sched_run_job_work() should restarts itself until there is either no
>> more space in the ring buffer or it can't find a ready entity any more.
>>
>> At least that was the original design when that was all still driven by a
>> kthread.
>>
>> It can perfectly be that we messed this up when switching from kthread to a
>> work item.
>>
> Right, that what this patch does - the run worker does not go idle until
> no ready entities are found. That was incorrect in the original patch
> and fixed here. Do you have any issues with this fix? It has been tested
> 3x times and clearly fixes the issue.

Ah! Yes in this case that patch here is a little bit ugly as well.

The original idea was that run_job restarts so that we are able to pause 
the submission thread without searching for an entity to submit more.

I strongly suggest replacing the while loop with a call to 
drm_sched_run_job_queue() so that, when the entity can't provide a job, we 
just restart the queuing work.
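
E.g. something like this (untested sketch on top of the pre-loop code,
using the existing drm_sched_run_job_queue() helper):

	entity = drm_sched_select_entity(sched);
	if (!entity)
		return;

	sched_job = drm_sched_entity_pop_job(entity);
	if (!sched_job) {
		complete_all(&entity->entity_idle);
		drm_sched_run_job_queue(sched);	/* re-queue ourselves instead
						 * of looping here */
		return;
	}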

Regards,
Christian.

>   
>
> Matt
>
>> Regards,
>> Christian.
>>
>>> The proper solution is to loop over all ready entities until one with a
>>> job is found via drm_sched_entity_pop_job() and then requeue the run
>>> job worker. Or loop over all entities until drm_sched_select_entity()
>>> returns NULL and then let the run job worker go idle. This is what the
>>> old threaded design did too [4]. Hope this clears everything up.
>>>
>>> Matt
>>>
>>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
>>> [2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
>>> [3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
>>> [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011
>>>
>>>> If that's not the case any more then you have broken something else.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Cc: Thorsten Leemhuis <regressions@leemhuis.info>
>>>>> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>>>>> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
>>>>> Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
>>>>> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
>>>>> Reported-by: Vlastimil Babka <vbabka@suse.cz>
>>>>> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>>>>>     1 file changed, 7 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 550492a7a031..85f082396d42 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>>>     	struct drm_sched_entity *entity;
>>>>>     	struct dma_fence *fence;
>>>>>     	struct drm_sched_fence *s_fence;
>>>>> -	struct drm_sched_job *sched_job;
>>>>> +	struct drm_sched_job *sched_job = NULL;
>>>>>     	int r;
>>>>>     	if (READ_ONCE(sched->pause_submit))
>>>>>     		return;
>>>>> -	entity = drm_sched_select_entity(sched);
>>>>> +	/* Find entity with a ready job */
>>>>> +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
>>>>> +		sched_job = drm_sched_entity_pop_job(entity);
>>>>> +		if (!sched_job)
>>>>> +			complete_all(&entity->entity_idle);
>>>>> +	}
>>>>>     	if (!entity)
>>>>> -		return;
>>>>> -
>>>>> -	sched_job = drm_sched_entity_pop_job(entity);
>>>>> -	if (!sched_job) {
>>>>> -		complete_all(&entity->entity_idle);
>>>>>     		return;	/* No more work */
>>>>> -	}
>>>>>     	s_fence = sched_job->s_fence;



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-29  7:44         ` Christian König
@ 2024-01-29  7:49           ` Vlastimil Babka
  2024-01-29 17:10           ` Luben Tuikov
  1 sibling, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2024-01-29  7:49 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: ltuikov89, dri-devel, Thorsten Leemhuis, Mario Limonciello,
	daniel, Mikhail Gavrilov, airlied, intel-xe

On 1/29/24 08:44, Christian König wrote:
> Am 26.01.24 um 17:29 schrieb Matthew Brost:
>> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
>>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>>>>>> All entities must be drained in the DRM scheduler run job worker to
>>>>>> avoid the following case. An entity found that is ready, no job found
>>>>>> ready on entity, and run job worker goes idle with other entities + jobs
>>>>>> ready. Draining all ready entities (i.e. loop over all ready entities)
>>>>>> in the run job worker ensures all job that are ready will be scheduled.
>>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>>> which are "ready", e.g. have a job to run.
>>>>>
>>>> That is what I thought too, hence my original design but it is not
>>>> exactly true. Let me explain.
>>>>
>>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>>> (job in queue) and no *current* waiting dependecies [1]. Dependecies for
>>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3]
>>>> returning a NULL job. Thus we can get into a scenario where 2 entities
>>>> A and B both have jobs and no current dependecies. A's job is waiting
>>>> B's job, entity A gets selected first, a dependecy gets installed in
>>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
>>> And here is the real problem. run work doesn't goes idle in that moment.
>>>
>>> drm_sched_run_job_work() should restarts itself until there is either no
>>> more space in the ring buffer or it can't find a ready entity any more.
>>>
>>> At least that was the original design when that was all still driven by a
>>> kthread.
>>>
>>> It can perfectly be that we messed this up when switching from kthread to a
>>> work item.
>>>
>> Right, that what this patch does - the run worker does not go idle until
>> no ready entities are found. That was incorrect in the original patch
>> and fixed here. Do you have any issues with this fix? It has been tested
>> 3x times and clearly fixes the issue.
> 
> Ah! Yes in this case that patch here is a little bit ugly as well.
> 
> The original idea was that run_job restarts so that we are able to pause 
> the submission thread without searching for an entity to submit more.
> 
> I strongly suggest to replace the while loop with a call to 
> drm_sched_run_job_queue() so that when the entity can't provide a job we 
> just restart the queuing work.

Note it's already included in rc2, so any changes need to be a followup fix.
If these are important, then please make sure they get to rc3 :)



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-29  7:44         ` Christian König
  2024-01-29  7:49           ` Vlastimil Babka
@ 2024-01-29 17:10           ` Luben Tuikov
  2024-01-29 18:31             ` Matthew Brost
  1 sibling, 1 reply; 14+ messages in thread
From: Luben Tuikov @ 2024-01-29 17:10 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: dri-devel, Thorsten Leemhuis, Mario Limonciello, daniel,
	Mikhail Gavrilov, airlied, intel-xe, Vlastimil Babka



On 2024-01-29 02:44, Christian König wrote:
> Am 26.01.24 um 17:29 schrieb Matthew Brost:
>> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
>>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>>>>>> All entities must be drained in the DRM scheduler run job worker to
>>>>>> avoid the following case. An entity found that is ready, no job found
>>>>>> ready on entity, and run job worker goes idle with other entities + jobs
>>>>>> ready. Draining all ready entities (i.e. loop over all ready entities)
>>>>>> in the run job worker ensures all job that are ready will be scheduled.
>>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>>> which are "ready", e.g. have a job to run.
>>>>>
>>>> That is what I thought too, hence my original design but it is not
>>>> exactly true. Let me explain.
>>>>
>>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>>> (job in queue) and no *current* waiting dependecies [1]. Dependecies for
>>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3]
>>>> returning a NULL job. Thus we can get into a scenario where 2 entities
>>>> A and B both have jobs and no current dependecies. A's job is waiting
>>>> B's job, entity A gets selected first, a dependecy gets installed in
>>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
>>> And here is the real problem. run work doesn't goes idle in that moment.
>>>
>>> drm_sched_run_job_work() should restarts itself until there is either no
>>> more space in the ring buffer or it can't find a ready entity any more.
>>>
>>> At least that was the original design when that was all still driven by a
>>> kthread.
>>>
>>> It can perfectly be that we messed this up when switching from kthread to a
>>> work item.
>>>
>> Right, that what this patch does - the run worker does not go idle until
>> no ready entities are found. That was incorrect in the original patch
>> and fixed here. Do you have any issues with this fix? It has been tested
>> 3x times and clearly fixes the issue.
> 
> Ah! Yes in this case that patch here is a little bit ugly as well.
> 
> The original idea was that run_job restarts so that we are able to pause 
> the submission thread without searching for an entity to submit more.
> 
> I strongly suggest to replace the while loop with a call to 
> drm_sched_run_job_queue() so that when the entity can't provide a job we 
> just restart the queuing work.

I agree with Christian. This more closely preserves the original design
of the GPU schedulers, so we should go with that.
-- 
Regards,
Luben



* Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
  2024-01-29 17:10           ` Luben Tuikov
@ 2024-01-29 18:31             ` Matthew Brost
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Brost @ 2024-01-29 18:31 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: dri-devel, intel-xe, Thorsten Leemhuis, Mario Limonciello,
	daniel, Mikhail Gavrilov, airlied, Christian König,
	Vlastimil Babka

On Mon, Jan 29, 2024 at 12:10:52PM -0500, Luben Tuikov wrote:
> On 2024-01-29 02:44, Christian König wrote:
> > Am 26.01.24 um 17:29 schrieb Matthew Brost:
> >> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
> >>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
> >>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
> >>>>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
> >>>>>> All entities must be drained in the DRM scheduler run job worker to
> >>>>>> avoid the following case. An entity found that is ready, no job found
> >>>>>> ready on entity, and run job worker goes idle with other entities + jobs
> >>>>>> ready. Draining all ready entities (i.e. loop over all ready entities)
> >>>>>> in the run job worker ensures all job that are ready will be scheduled.
> >>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
> >>>>> which are "ready", e.g. have a job to run.
> >>>>>
> >>>> That is what I thought too, hence my original design but it is not
> >>>> exactly true. Let me explain.
> >>>>
> >>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
> >>>> (job in queue) and no *current* waiting dependecies [1]. Dependecies for
> >>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3]
> >>>> returning a NULL job. Thus we can get into a scenario where 2 entities
> >>>> A and B both have jobs and no current dependecies. A's job is waiting
> >>>> B's job, entity A gets selected first, a dependecy gets installed in
> >>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
> >>> And here is the real problem. run work doesn't goes idle in that moment.
> >>>
> >>> drm_sched_run_job_work() should restarts itself until there is either no
> >>> more space in the ring buffer or it can't find a ready entity any more.
> >>>
> >>> At least that was the original design when that was all still driven by a
> >>> kthread.
> >>>
> >>> It can perfectly be that we messed this up when switching from kthread to a
> >>> work item.
> >>>
> >> Right, that what this patch does - the run worker does not go idle until
> >> no ready entities are found. That was incorrect in the original patch
> >> and fixed here. Do you have any issues with this fix? It has been tested
> >> 3x times and clearly fixes the issue.
> > 
> > Ah! Yes in this case that patch here is a little bit ugly as well.
> > 
> > The original idea was that run_job restarts so that we are able to pause 
> > the submission thread without searching for an entity to submit more.
> > 
> > I strongly suggest to replace the while loop with a call to 
> > drm_sched_run_job_queue() so that when the entity can't provide a job we 
> > just restart the queuing work.
> 
> I agree with Christian. This more closely preserves the original design
> of the GPU schedulers, so we should go with that.
> -- 
> Regards,
> Luben

As this patch is already in rc2, I will post a patch shortly replacing
the loop with a re-queuing design.

Thanks,
Matt


Thread overview: 14+ messages
2024-01-24 21:08 [PATCH] drm/sched: Drain all entities in DRM sched run job worker Matthew Brost
2024-01-25  9:24 ` Vlastimil Babka
2024-01-25 17:30   ` Matthew Brost
2024-01-26  2:45     ` Dave Airlie
2024-01-25 15:12 ` Christian König
2024-01-25 17:30   ` Matthew Brost
2024-01-26 10:32     ` Christian König
2024-01-26 16:29       ` Matthew Brost
2024-01-29  5:29         ` Luben Tuikov
2024-01-29  7:44         ` Christian König
2024-01-29  7:49           ` Vlastimil Babka
2024-01-29 17:10           ` Luben Tuikov
2024-01-29 18:31             ` Matthew Brost
2024-01-29  4:08 ` Luben Tuikov
