* [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions
@ 2018-08-03 14:29 Lucas Stach
  2018-08-03 16:55 ` Christian König
       [not found] ` <20180803142947.30724-1-l.stach-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
  0 siblings, 2 replies; 6+ messages in thread
From: Lucas Stach @ 2018-08-03 14:29 UTC (permalink / raw)
  To: Christian König
  Cc: amd-gfx, patchwork-lst, dri-devel, kernel, Nayan Deshmukh

drm_sched_job_finish() is a work item scheduled for each finished job on
an unbound system workqueue. This means the workers can execute out of order
with regard to the real hardware job completions.

If this happens queueing a timeout worker for the first job on the ring
mirror list is wrong, as this may be a job which has already finished
executing. Fix this by reorganizing the code to always queue the worker
for the next job on the list, if this job hasn't finished yet. This is
robust against a potential reordering of the finish workers.

Also move out the timeout worker cancelling, so that we don't need to
take the job list lock twice. As a small optimization list_del is used
to remove the job from the ring mirror list, as there is no need to
reinit the list head in the job we are about to free.

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
---
v2: - properly handle last job in the ring
    - check correct fence for completion
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 44d480768dfe..574875e2c206 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
 						   finish_work);
 	struct drm_gpu_scheduler *sched = s_job->sched;
 
-	/* remove job from ring_mirror_list */
-	spin_lock(&sched->job_list_lock);
-	list_del_init(&s_job->node);
-	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
-		struct drm_sched_job *next;
-
-		spin_unlock(&sched->job_list_lock);
+	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
 		cancel_delayed_work_sync(&s_job->work_tdr);
-		spin_lock(&sched->job_list_lock);
 
-		/* queue TDR for next job */
-		next = list_first_entry_or_null(&sched->ring_mirror_list,
-						struct drm_sched_job, node);
+	spin_lock(&sched->job_list_lock);
+	/* queue TDR for next job */
+	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
+	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
+		struct drm_sched_job *next = list_next_entry(s_job, node);
 
-		if (next)
+		if (!dma_fence_is_signaled(&next->s_fence->finished))
 			schedule_delayed_work(&next->work_tdr, sched->timeout);
 	}
+	/* remove job from ring_mirror_list */
+	list_del(&s_job->node);
 	spin_unlock(&sched->job_list_lock);
+
 	dma_fence_put(&s_job->s_fence->finished);
 	sched->ops->free_job(s_job);
 }
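
For readability, here is roughly how the whole function reads with this patch
applied (reconstructed from the hunk above; the container_of() line comes from
context above the hunk, and the inline comments are added here for
illustration, they are not part of the patch):

static void drm_sched_job_finish(struct work_struct *work)
{
	struct drm_sched_job *s_job = container_of(work, struct drm_sched_job,
						   finish_work);
	struct drm_gpu_scheduler *sched = s_job->sched;

	/* stop our own timeout worker before looking at the list */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
		cancel_delayed_work_sync(&s_job->work_tdr);

	spin_lock(&sched->job_list_lock);
	/* queue TDR for next job */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
		struct drm_sched_job *next = list_next_entry(s_job, node);

		/* only arm the timeout if the next job hasn't finished yet */
		if (!dma_fence_is_signaled(&next->s_fence->finished))
			schedule_delayed_work(&next->work_tdr, sched->timeout);
	}
	/* remove job from ring_mirror_list */
	list_del(&s_job->node);
	spin_unlock(&sched->job_list_lock);

	dma_fence_put(&s_job->s_fence->finished);
	sched->ops->free_job(s_job);
}
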
-- 
2.18.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions
       [not found] ` <20180803142947.30724-1-l.stach-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
@ 2018-08-03 16:55   ` Christian König
       [not found]     ` <b9fb5404-81aa-16f1-ca1c-dfd064ac4d2a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Christian König @ 2018-08-03 16:55 UTC (permalink / raw)
  To: Lucas Stach, Christian König
  Cc: dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	patchwork-lst-bIcnvbaLZ9MEGnE8C9+IrQ, Eric Anholt,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	kernel-bIcnvbaLZ9MEGnE8C9+IrQ, Nayan Deshmukh

On 03.08.2018 16:29, Lucas Stach wrote:
> drm_sched_job_finish() is a work item scheduled for each finished job on
> an unbound system workqueue. This means the workers can execute out of order
> with regard to the real hardware job completions.
>
> If this happens queueing a timeout worker for the first job on the ring
> mirror list is wrong, as this may be a job which has already finished
> executing. Fix this by reorganizing the code to always queue the worker
> for the next job on the list, if this job hasn't finished yet. This is
> robust against a potential reordering of the finish workers.
>
> Also move out the timeout worker cancelling, so that we don't need to
> take the job list lock twice. As a small optimization list_del is used
> to remove the job from the ring mirror list, as there is no need to
> reinit the list head in the job we are about to free.
>
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> ---
> v2: - properly handle last job in the ring
>      - check correct fence for completion
> ---
>   drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
>   1 file changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 44d480768dfe..574875e2c206 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
>   						   finish_work);
>   	struct drm_gpu_scheduler *sched = s_job->sched;
>   
> -	/* remove job from ring_mirror_list */
> -	spin_lock(&sched->job_list_lock);
> -	list_del_init(&s_job->node);
> -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
> -		struct drm_sched_job *next;
> -
> -		spin_unlock(&sched->job_list_lock);
> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
>   		cancel_delayed_work_sync(&s_job->work_tdr);

That is unfortunately still racy here.

Between canceling the job and removing it from the list someone could 
actually start the timer (in theory) :)

Cancel it, remove it from the list and cancel it again.

BTW: You could completely drop the "if (sched->timeout != 
MAX_SCHEDULE_TIMEOUT)" here because canceling is harmless as long as the 
structure is initialized.
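
Concretely, something like this (just a sketch, not tested), combining both
points:

	/* canceling is harmless even if the work was never armed */
	cancel_delayed_work_sync(&s_job->work_tdr);

	spin_lock(&sched->job_list_lock);
	/* queue TDR for next job */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
		struct drm_sched_job *next = list_next_entry(s_job, node);

		if (!dma_fence_is_signaled(&next->s_fence->finished))
			schedule_delayed_work(&next->work_tdr, sched->timeout);
	}
	/* remove job from ring_mirror_list */
	list_del(&s_job->node);
	spin_unlock(&sched->job_list_lock);

	/* in case someone armed our timeout again in the meantime */
	cancel_delayed_work_sync(&s_job->work_tdr);
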

Christian.

> -		spin_lock(&sched->job_list_lock);
>   
> -		/* queue TDR for next job */
> -		next = list_first_entry_or_null(&sched->ring_mirror_list,
> -						struct drm_sched_job, node);
> +	spin_lock(&sched->job_list_lock);
> +	/* queue TDR for next job */
> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> +	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
> +		struct drm_sched_job *next = list_next_entry(s_job, node);
>   
> -		if (next)
> +		if (!dma_fence_is_signaled(&next->s_fence->finished))
>   			schedule_delayed_work(&next->work_tdr, sched->timeout);
>   	}
> +	/* remove job from ring_mirror_list */
> +	list_del(&s_job->node);
>   	spin_unlock(&sched->job_list_lock);
> +
>   	dma_fence_put(&s_job->s_fence->finished);
>   	sched->ops->free_job(s_job);
>   }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions
  2018-08-03 14:29 [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions Lucas Stach
@ 2018-08-03 16:55 ` Christian König
       [not found] ` <20180803142947.30724-1-l.stach-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
  1 sibling, 0 replies; 6+ messages in thread
From: Christian König @ 2018-08-03 16:55 UTC (permalink / raw)
  To: Lucas Stach, Christian König
  Cc: dri-devel, patchwork-lst, amd-gfx, kernel, Nayan Deshmukh

On 03.08.2018 16:29, Lucas Stach wrote:
> drm_sched_job_finish() is a work item scheduled for each finished job on
> an unbound system workqueue. This means the workers can execute out of order
> with regard to the real hardware job completions.
>
> If this happens queueing a timeout worker for the first job on the ring
> mirror list is wrong, as this may be a job which has already finished
> executing. Fix this by reorganizing the code to always queue the worker
> for the next job on the list, if this job hasn't finished yet. This is
> robust against a potential reordering of the finish workers.
>
> Also move out the timeout worker cancelling, so that we don't need to
> take the job list lock twice. As a small optimization list_del is used
> to remove the job from the ring mirror list, as there is no need to
> reinit the list head in the job we are about to free.
>
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> ---
> v2: - properly handle last job in the ring
>      - check correct fence for completion
> ---
>   drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
>   1 file changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 44d480768dfe..574875e2c206 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
>   						   finish_work);
>   	struct drm_gpu_scheduler *sched = s_job->sched;
>   
> -	/* remove job from ring_mirror_list */
> -	spin_lock(&sched->job_list_lock);
> -	list_del_init(&s_job->node);
> -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
> -		struct drm_sched_job *next;
> -
> -		spin_unlock(&sched->job_list_lock);
> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
>   		cancel_delayed_work_sync(&s_job->work_tdr);

That is unfortunately still racy here.

Between canceling the job and removing it from the list someone could 
actually start the timer (in theory) :)

Cancel it, remove it from the list and cancel it again.

BTW: You could completely drop the "if (sched->timeout != 
MAX_SCHEDULE_TIMEOUT)" here because canceling is harmless as long as the 
structure is initialized.

Christian.

> -		spin_lock(&sched->job_list_lock);
>   
> -		/* queue TDR for next job */
> -		next = list_first_entry_or_null(&sched->ring_mirror_list,
> -						struct drm_sched_job, node);
> +	spin_lock(&sched->job_list_lock);
> +	/* queue TDR for next job */
> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> +	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
> +		struct drm_sched_job *next = list_next_entry(s_job, node);
>   
> -		if (next)
> +		if (!dma_fence_is_signaled(&next->s_fence->finished))
>   			schedule_delayed_work(&next->work_tdr, sched->timeout);
>   	}
> +	/* remove job from ring_mirror_list */
> +	list_del(&s_job->node);
>   	spin_unlock(&sched->job_list_lock);
> +
>   	dma_fence_put(&s_job->s_fence->finished);
>   	sched->ops->free_job(s_job);
>   }

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions
       [not found]     ` <b9fb5404-81aa-16f1-ca1c-dfd064ac4d2a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-08-03 17:31       ` Lucas Stach
       [not found]         ` <1533317473.20186.35.camel-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Lucas Stach @ 2018-08-03 17:31 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo
  Cc: dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	patchwork-lst-bIcnvbaLZ9MEGnE8C9+IrQ, Eric Anholt,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	kernel-bIcnvbaLZ9MEGnE8C9+IrQ, Nayan Deshmukh

On Monday, 06.08.2018, 14:57 +0200, Christian König wrote:
> On 03.08.2018 16:29, Lucas Stach wrote:
> > drm_sched_job_finish() is a work item scheduled for each finished job on
> > an unbound system workqueue. This means the workers can execute out of order
> > with regard to the real hardware job completions.
> > 
> > If this happens queueing a timeout worker for the first job on the ring
> > mirror list is wrong, as this may be a job which has already finished
> > executing. Fix this by reorganizing the code to always queue the worker
> > for the next job on the list, if this job hasn't finished yet. This is
> > robust against a potential reordering of the finish workers.
> > 
> > Also move out the timeout worker cancelling, so that we don't need to
> > take the job list lock twice. As a small optimization list_del is used
> > to remove the job from the ring mirror list, as there is no need to
> > reinit the list head in the job we are about to free.
> > 
> > Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> > ---
> > v2: - properly handle last job in the ring
> >      - check correct fence for completion
> > ---
> >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
> >   1 file changed, 10 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > index 44d480768dfe..574875e2c206 100644
> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > @@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
> >   						   finish_work);
> >   	struct drm_gpu_scheduler *sched = s_job->sched;
> >   
> > -	/* remove job from ring_mirror_list */
> > -	spin_lock(&sched->job_list_lock);
> > -	list_del_init(&s_job->node);
> > -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
> > -		struct drm_sched_job *next;
> > -
> > -		spin_unlock(&sched->job_list_lock);
> > +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
> >   		cancel_delayed_work_sync(&s_job->work_tdr);
> 
> That is unfortunately still racy here.
> 
> Between canceling the job and removing it from the list someone could 
> actually start the timer (in theory) :)
> 
> Cancel it, remove it from the list and cancel it again.

I don't see how. If we end up in this worker the finished fence of the
job is already certainly signaled (as this is what triggers queueing of
the worker). So even if some other worker manages to find this job as
the next job in the list, the dma_fence_is_signaled check should
prevent the timeout worker from getting scheduled again.
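
Concretely, the guard I'm relying on is this bit of the v2 patch (quoted
further below), with a comment added here to spell out the invariant:

	/*
	 * The finish worker for a job only runs once that job's finished
	 * fence has signaled, so a concurrent finish worker that picks this
	 * job up as its "next" entry sees a signaled fence here and never
	 * arms the timeout.
	 */
	if (!dma_fence_is_signaled(&next->s_fence->finished))
		schedule_delayed_work(&next->work_tdr, sched->timeout);
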

> BTW: You could completely drop the "if (sched->timeout != 
> MAX_SCHEDULE_TIMEOUT)" here because canceling is harmless as long as the 
> structure is initialized.

Right.

Regards,
Lucas

> Christian.
> 
> > -		spin_lock(&sched->job_list_lock);
> >   
> > -		/* queue TDR for next job */
> > -		next = list_first_entry_or_null(&sched->ring_mirror_list,
> > -						struct drm_sched_job, node);
> > +	spin_lock(&sched->job_list_lock);
> > +	/* queue TDR for next job */
> > +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > +	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
> > +		struct drm_sched_job *next = list_next_entry(s_job, node);
> >   
> > -		if (next)
> > +		if (!dma_fence_is_signaled(&next->s_fence->finished))
> >   			schedule_delayed_work(&next->work_tdr, sched->timeout);
> >   	}
> > +	/* remove job from ring_mirror_list */
> > +	list_del(&s_job->node);
> >   	spin_unlock(&sched->job_list_lock);
> > +
> >   	dma_fence_put(&s_job->s_fence->finished);
> >   	sched->ops->free_job(s_job);
> >   }
> 
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions
       [not found]         ` <1533317473.20186.35.camel-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
@ 2018-08-06  8:12           ` Christian König
       [not found]             ` <a3d9837d-c44f-34cd-2e51-8ccd2ca98f54-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Christian König @ 2018-08-06  8:12 UTC (permalink / raw)
  To: Lucas Stach, christian.koenig-5C7GfCeVMHo
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	patchwork-lst-bIcnvbaLZ9MEGnE8C9+IrQ, Eric Anholt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	kernel-bIcnvbaLZ9MEGnE8C9+IrQ, Nayan Deshmukh

On 03.08.2018 19:31, Lucas Stach wrote:
> On Monday, 06.08.2018, 14:57 +0200, Christian König wrote:
>> On 03.08.2018 16:29, Lucas Stach wrote:
>>> drm_sched_job_finish() is a work item scheduled for each finished job on
>>> an unbound system workqueue. This means the workers can execute out of order
>>> with regard to the real hardware job completions.
>>>
>>> If this happens queueing a timeout worker for the first job on the ring
>>> mirror list is wrong, as this may be a job which has already finished
>>> executing. Fix this by reorganizing the code to always queue the worker
>>> for the next job on the list, if this job hasn't finished yet. This is
>>> robust against a potential reordering of the finish workers.
>>>
>>> Also move out the timeout worker cancelling, so that we don't need to
>>> take the job list lock twice. As a small optimization list_del is used
>>> to remove the job from the ring mirror list, as there is no need to
>>> reinit the list head in the job we are about to free.
>>>
>>> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
>>> ---
>>> v2: - properly handle last job in the ring
>>>       - check correct fence for completion
>>> ---
>>>    drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
>>>    1 file changed, 10 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> index 44d480768dfe..574875e2c206 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> @@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
>>>    						   finish_work);
>>>    	struct drm_gpu_scheduler *sched = s_job->sched;
>>>    
>>> -	/* remove job from ring_mirror_list */
>>> -	spin_lock(&sched->job_list_lock);
>>> -	list_del_init(&s_job->node);
>>> -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
>>> -		struct drm_sched_job *next;
>>> -
>>> -		spin_unlock(&sched->job_list_lock);
>>> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
>>>    		cancel_delayed_work_sync(&s_job->work_tdr);
>> That is unfortunately still racy here.
>>
>> Between canceling the job and removing it from the list someone could
>> actually start the timer (in theory) :)
>>
>> Cancel it, remove it from the list and cancel it again.
> I don't see how. If we end up in this worker the finished fence of the
> job is already certainly signaled (as this is what triggers queueing of
> the worker). So even if some other worker manages to find this job as
> the next job in the list, the dma_fence_is_signaled check should
> prevent the timeout worker from getting scheduled again.

Well that makes sense, but is a bit hard to understand.

Anyway, please remove the extra "if" check. With that done the patch is 
Reviewed-by: Christian König <christian.koenig@amd.com>.

Thanks,
Christian.


>
>> BTW: You could completely drop the "if (sched->timeout !=
>> MAX_SCHEDULE_TIMEOUT)" here because canceling is harmless as long as the
>> structure is initialized.
> Right.
>
> Regards,
> Lucas
>
>> Christian.
>>
>>> -		spin_lock(&sched->job_list_lock);
>>>    
>>> -		/* queue TDR for next job */
>>> -		next = list_first_entry_or_null(&sched->ring_mirror_list,
>>> -						struct drm_sched_job, node);
>>> +	spin_lock(&sched->job_list_lock);
>>> +	/* queue TDR for next job */
>>> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>> +	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
>>> +		struct drm_sched_job *next = list_next_entry(s_job, node);
>>>    
>>> -		if (next)
>>> +		if (!dma_fence_is_signaled(&next->s_fence->finished))
>>>    			schedule_delayed_work(&next->work_tdr, sched->timeout);
>>>    	}
>>> +	/* remove job from ring_mirror_list */
>>> +	list_del(&s_job->node);
>>>    	spin_unlock(&sched->job_list_lock);
>>> +
>>>    	dma_fence_put(&s_job->s_fence->finished);
>>>    	sched->ops->free_job(s_job);
>>>    }
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions
       [not found]             ` <a3d9837d-c44f-34cd-2e51-8ccd2ca98f54-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-08-06 13:18               ` Lucas Stach
  0 siblings, 0 replies; 6+ messages in thread
From: Lucas Stach @ 2018-08-06 13:18 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo
  Cc: dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	patchwork-lst-bIcnvbaLZ9MEGnE8C9+IrQ, Eric Anholt,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	kernel-bIcnvbaLZ9MEGnE8C9+IrQ, Nayan Deshmukh

On Monday, 06.08.2018, 10:12 +0200, Christian König wrote:
> On 03.08.2018 19:31, Lucas Stach wrote:
> > On Monday, 06.08.2018, 14:57 +0200, Christian König wrote:
> > > On 03.08.2018 16:29, Lucas Stach wrote:
> > > > drm_sched_job_finish() is a work item scheduled for each finished job on
> > > > an unbound system workqueue. This means the workers can execute out of order
> > > > with regard to the real hardware job completions.
> > > > 
> > > > If this happens queueing a timeout worker for the first job on the ring
> > > > mirror list is wrong, as this may be a job which has already finished
> > > > executing. Fix this by reorganizing the code to always queue the worker
> > > > for the next job on the list, if this job hasn't finished yet. This is
> > > > robust against a potential reordering of the finish workers.
> > > > 
> > > > Also move out the timeout worker cancelling, so that we don't need to
> > > > take the job list lock twice. As a small optimization list_del is used
> > > > to remove the job from the ring mirror list, as there is no need to
> > > > reinit the list head in the job we are about to free.
> > > > 
> > > > Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> > > > ---
> > > > v2: - properly handle last job in the ring
> > > >       - check correct fence for completion
> > > > ---
> > > >    drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
> > > >    1 file changed, 10 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > > > index 44d480768dfe..574875e2c206 100644
> > > > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > > > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > > > @@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
> > > >    						   finish_work);
> > > >    	struct drm_gpu_scheduler *sched = s_job->sched;
> > > >    
> > > > -	/* remove job from ring_mirror_list */
> > > > -	spin_lock(&sched->job_list_lock);
> > > > -	list_del_init(&s_job->node);
> > > > -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
> > > > -		struct drm_sched_job *next;
> > > > -
> > > > -		spin_unlock(&sched->job_list_lock);
> > > > +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
> > > >    		cancel_delayed_work_sync(&s_job->work_tdr);
> > > 
> > > That is unfortunately still racy here.
> > > 
> > > Between canceling the job and removing it from the list someone could
> > > actually start the timer (in theory) :)
> > > 
> > > Cancel it, remove it from the list and cancel it again.
> > 
> > I don't see how. If we end up in this worker the finished fence of the
> > job is already certainly signaled (as this is what triggers queueing of
> > the worker). So even if some other worker manages to find this job as
> > the next job in the list, the dma_fence_is_signaled check should
> > prevent the timeout worker from getting scheduled again.
> 
> Well that makes sense, but is a bit hard to understand.

I agree, all the possible parallelism and possible re-ordering makes
this seemingly simple part of the scheduler code a bit mind-breaking.
I've added a comment in v3 to capture the line of thought for future
reference.
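
Roughly along these lines (the exact wording in v3 may differ):

	/*
	 * Canceling the timeout without removing the job from the ring
	 * mirror list is safe: this worker only runs once the job's
	 * finished fence has signaled, so even if another finish worker
	 * picks this job as its "next" entry, the fence-signaled check
	 * keeps it from arming the timeout again.
	 */
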

Regards,
Lucas

> Anyway, please remove the extra "if" check. With that done the patch is 
> > Reviewed-by: Christian König <christian.koenig@amd.com>.
> 
> Thanks,
> Christian.
> 
> 
> > 
> > > BTW: You could completely drop the "if (sched->timeout !=
> > > MAX_SCHEDULE_TIMEOUT)" here because canceling is harmless as long as the
> > > structure is initialized.
> > 
> > Right.
> > 
> > Regards,
> > Lucas
> > 
> > > Christian.
> > > 
> > > > -		spin_lock(&sched->job_list_lock);
> > > >    
> > > > -		/* queue TDR for next job */
> > > > -		next = list_first_entry_or_null(&sched->ring_mirror_list,
> > > > -						struct drm_sched_job, node);
> > > > +	spin_lock(&sched->job_list_lock);
> > > > +	/* queue TDR for next job */
> > > > +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > > > +	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
> > > > +		struct drm_sched_job *next = list_next_entry(s_job, node);
> > > >    
> > > > -		if (next)
> > > > +		if (!dma_fence_is_signaled(&next->s_fence->finished))
> > > >    			schedule_delayed_work(&next->work_tdr, sched->timeout);
> > > >    	}
> > > > +	/* remove job from ring_mirror_list */
> > > > +	list_del(&s_job->node);
> > > >    	spin_unlock(&sched->job_list_lock);
> > > > +
> > > >    	dma_fence_put(&s_job->s_fence->finished);
> > > >    	sched->ops->free_job(s_job);
> > > >    }
> > 
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


end of thread, other threads:[~2018-08-06 13:18 UTC | newest]

Thread overview: 6+ messages
2018-08-03 14:29 [PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions Lucas Stach
2018-08-03 16:55 ` Christian König
     [not found] ` <20180803142947.30724-1-l.stach-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
2018-08-03 16:55   ` Christian König
     [not found]     ` <b9fb5404-81aa-16f1-ca1c-dfd064ac4d2a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-08-03 17:31       ` Lucas Stach
     [not found]         ` <1533317473.20186.35.camel-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
2018-08-06  8:12           ` Christian König
     [not found]             ` <a3d9837d-c44f-34cd-2e51-8ccd2ca98f54-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-08-06 13:18               ` Lucas Stach
