* [PATCH v2] drm/scheduler: Fix hang when sched_entity released
@ 2021-02-17 21:59 ` Andrey Grodzovsky
0 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-17 21:59 UTC (permalink / raw)
To: dri-devel, amd-gfx, ckoenig.leichtzumerken
Problem: If the scheduler is already stopped by the time the sched_entity
is released and the entity's job_queue is not empty, I encountered
a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
never becomes true.
Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wake up all those processes stuck in sched_entity flushing,
as the scheduler main thread which would wake them up is stopped by now.
v2:
Reverse the order of drm_sched_rq_remove_entity and marking
s_entity as stopped, to prevent reinsertion back into the rq due
to a race.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 908b0b5..c6b7947 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
*/
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
+ int i;
+ struct drm_sched_entity *s_entity;
if (sched->thread)
kthread_stop(sched->thread);
+ /* Detach all sched_entities from this scheduler once it's stopped */
+ for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+ struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+ if (!rq)
+ continue;
+
+ /* Loop this way because rq->lock is taken in drm_sched_rq_remove_entity */
+ spin_lock(&rq->lock);
+ while ((s_entity = list_first_entry_or_null(&rq->entities,
+ struct drm_sched_entity,
+ list))) {
+ spin_unlock(&rq->lock);
+
+ /* Prevent reinsertion and remove */
+ spin_lock(&s_entity->rq_lock);
+ s_entity->stopped = true;
+ drm_sched_rq_remove_entity(rq, s_entity);
+ spin_unlock(&s_entity->rq_lock);
+
+ spin_lock(&rq->lock);
+ }
+ spin_unlock(&rq->lock);
+
+ }
+
+ /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
+ wake_up_all(&sched->job_scheduled);
+
/* Confirm no work left behind accessing device structures */
cancel_delayed_work_sync(&sched->work_tdr);
--
2.7.4
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-17 21:59 ` Andrey Grodzovsky
@ 2021-02-18 8:07 ` Christian König
From: Christian König @ 2021-02-18 8:07 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
> Problem: If scheduler is already stopped by the time sched_entity
> is released and entity's job_queue not empty I encountred
> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
> never becomes false.
>
> Fix: In drm_sched_fini detach all sched_entities from the
> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
> Also wakeup all those processes stuck in sched_entity flushing
> as the scheduler main thread which wakes them up is stopped by now.
>
> v2:
> Reverse order of drm_sched_rq_remove_entity and marking
> s_entity as stopped to prevent reinserion back to rq due
> to race.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 908b0b5..c6b7947 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
> */
> void drm_sched_fini(struct drm_gpu_scheduler *sched)
> {
> + int i;
> + struct drm_sched_entity *s_entity;
BTW: Please order that so that i is declared last.
> if (sched->thread)
> kthread_stop(sched->thread);
>
> + /* Detach all sched_entites from this scheduler once it's stopped */
> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> + struct drm_sched_rq *rq = &sched->sched_rq[i];
> +
> + if (!rq)
> + continue;
> +
> + /* Loop this way because rq->lock is taken in drm_sched_rq_remove_entity */
> + spin_lock(&rq->lock);
> + while ((s_entity = list_first_entry_or_null(&rq->entities,
> + struct drm_sched_entity,
> + list))) {
> + spin_unlock(&rq->lock);
> +
> + /* Prevent reinsertion and remove */
> + spin_lock(&s_entity->rq_lock);
> + s_entity->stopped = true;
> + drm_sched_rq_remove_entity(rq, s_entity);
> + spin_unlock(&s_entity->rq_lock);
Well this spin_unlock/lock dance here doesn't look correct at all now.
Christian.
> +
> + spin_lock(&rq->lock);
> + }
> + spin_unlock(&rq->lock);
> +
> + }
> +
> + /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
> + wake_up_all(&sched->job_scheduled);
> +
> /* Confirm no work left behind accessing device structures */
> cancel_delayed_work_sync(&sched->work_tdr);
>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-18 8:07 ` Christian König
@ 2021-02-18 15:05 ` Andrey Grodzovsky
From: Andrey Grodzovsky @ 2021-02-18 15:05 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
On 2/18/21 3:07 AM, Christian König wrote:
>
>
> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>> Problem: If scheduler is already stopped by the time sched_entity
>> is released and entity's job_queue not empty I encountred
>> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
>> never becomes false.
>>
>> Fix: In drm_sched_fini detach all sched_entities from the
>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>> Also wakeup all those processes stuck in sched_entity flushing
>> as the scheduler main thread which wakes them up is stopped by now.
>>
>> v2:
>> Reverse order of drm_sched_rq_remove_entity and marking
>> s_entity as stopped to prevent reinserion back to rq due
>> to race.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
>> 1 file changed, 31 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 908b0b5..c6b7947 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>> */
>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>> {
>> + int i;
>> + struct drm_sched_entity *s_entity;
>
> BTW: Please order that so that i is declared last.
>
>> if (sched->thread)
>> kthread_stop(sched->thread);
>> + /* Detach all sched_entites from this scheduler once it's stopped */
>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>> +
>> + if (!rq)
>> + continue;
>> +
>> + /* Loop this way because rq->lock is taken in
>> drm_sched_rq_remove_entity */
>> + spin_lock(&rq->lock);
>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>> + struct drm_sched_entity,
>> + list))) {
>> + spin_unlock(&rq->lock);
>> +
>> + /* Prevent reinsertion and remove */
>> + spin_lock(&s_entity->rq_lock);
>> + s_entity->stopped = true;
>> + drm_sched_rq_remove_entity(rq, s_entity);
>> + spin_unlock(&s_entity->rq_lock);
>
> Well this spin_unlock/lock dance here doesn't look correct at all now.
>
> Christian.
In what way? It's in the same order as at the other call sites (see
drm_sched_entity_push_job and drm_sched_entity_flush).
If I just locked rq->lock and did list_for_each_entry_safe while manually
deleting entity->list instead of calling drm_sched_rq_remove_entity,
this still would not be possible, as the order of lock acquisition between
s_entity->rq_lock and rq->lock would be reversed compared to the call
sites mentioned above.
Andrey
>
>> +
>> + spin_lock(&rq->lock);
>> + }
>> + spin_unlock(&rq->lock);
>> +
>> + }
>> +
>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>> + wake_up_all(&sched->job_scheduled);
>> +
>> /* Confirm no work left behind accessing device structures */
>> cancel_delayed_work_sync(&sched->work_tdr);
>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-18 15:05 ` Andrey Grodzovsky
@ 2021-02-18 15:15 ` Christian König
From: Christian König @ 2021-02-18 15:15 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>
> On 2/18/21 3:07 AM, Christian König wrote:
>>
>>
>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>> Problem: If scheduler is already stopped by the time sched_entity
>>> is released and entity's job_queue not empty I encountred
>>> a hang in drm_sched_entity_flush. This is because
>>> drm_sched_entity_is_idle
>>> never becomes false.
>>>
>>> Fix: In drm_sched_fini detach all sched_entities from the
>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>> Also wakeup all those processes stuck in sched_entity flushing
>>> as the scheduler main thread which wakes them up is stopped by now.
>>>
>>> v2:
>>> Reverse order of drm_sched_rq_remove_entity and marking
>>> s_entity as stopped to prevent reinserion back to rq due
>>> to race.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>> +++++++++++++++++++++++++++++++
>>> 1 file changed, 31 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 908b0b5..c6b7947 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>> */
>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>> {
>>> + int i;
>>> + struct drm_sched_entity *s_entity;
>>
>> BTW: Please order that so that i is declared last.
>>
>>> if (sched->thread)
>>> kthread_stop(sched->thread);
>>> + /* Detach all sched_entites from this scheduler once it's
>>> stopped */
>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>> +
>>> + if (!rq)
>>> + continue;
>>> +
>>> + /* Loop this way because rq->lock is taken in
>>> drm_sched_rq_remove_entity */
>>> + spin_lock(&rq->lock);
>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>> + struct drm_sched_entity,
>>> + list))) {
>>> + spin_unlock(&rq->lock);
>>> +
>>> + /* Prevent reinsertion and remove */
>>> + spin_lock(&s_entity->rq_lock);
>>> + s_entity->stopped = true;
>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>> + spin_unlock(&s_entity->rq_lock);
>>
>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>
>> Christian.
>
>
> In what way ? It's in the same same order as in other call sites (see
> drm_sched_entity_push_job and drm_sched_entity_flush).
> If i just locked rq->lock and did list_for_each_entry_safe while
> manually deleting entity->list instead of calling
> drm_sched_rq_remove_entity this still would not be possible as the
> order of lock acquisition between s_entity->rq_lock
> and rq->lock would be reverse compared to the call sites mentioned above.
Ah, now I understand. You need this because drm_sched_rq_remove_entity()
will grab the rq lock again!
The problem now is: what prevents the entity from being destroyed while
you remove it?
Christian.
>
> Andrey
>
>
>
>>
>>> +
>>> + spin_lock(&rq->lock);
>>> + }
>>> + spin_unlock(&rq->lock);
>>> +
>>> + }
>>> +
>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this
>>> scheduler */
>>> + wake_up_all(&sched->job_scheduled);
>>> +
>>> /* Confirm no work left behind accessing device structures */
>>> cancel_delayed_work_sync(&sched->work_tdr);
>>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-18 15:15 ` Christian König
@ 2021-02-18 16:41 ` Andrey Grodzovsky
From: Andrey Grodzovsky @ 2021-02-18 16:41 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
On 2/18/21 10:15 AM, Christian König wrote:
> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>
>> On 2/18/21 3:07 AM, Christian König wrote:
>>>
>>>
>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>> is released and entity's job_queue not empty I encountred
>>>> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
>>>> never becomes false.
>>>>
>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>
>>>> v2:
>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>> s_entity as stopped to prevent reinserion back to rq due
>>>> to race.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
>>>> 1 file changed, 31 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 908b0b5..c6b7947 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>> */
>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>> {
>>>> + int i;
>>>> + struct drm_sched_entity *s_entity;
>>>
>>> BTW: Please order that so that i is declared last.
>>>
>>>> if (sched->thread)
>>>> kthread_stop(sched->thread);
>>>> + /* Detach all sched_entites from this scheduler once it's stopped */
>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN;
>>>> i--) {
>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>> +
>>>> + if (!rq)
>>>> + continue;
>>>> +
>>>> + /* Loop this way because rq->lock is taken in
>>>> drm_sched_rq_remove_entity */
>>>> + spin_lock(&rq->lock);
>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>> + struct drm_sched_entity,
>>>> + list))) {
>>>> + spin_unlock(&rq->lock);
>>>> +
>>>> + /* Prevent reinsertion and remove */
>>>> + spin_lock(&s_entity->rq_lock);
>>>> + s_entity->stopped = true;
>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>> + spin_unlock(&s_entity->rq_lock);
>>>
>>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>>
>>> Christian.
>>
>>
>> In what way ? It's in the same same order as in other call sites (see
>> drm_sched_entity_push_job and drm_sched_entity_flush).
>> If i just locked rq->lock and did list_for_each_entry_safe while manually
>> deleting entity->list instead of calling
>> drm_sched_rq_remove_entity this still would not be possible as the order of
>> lock acquisition between s_entity->rq_lock
>> and rq->lock would be reverse compared to the call sites mentioned above.
>
> Ah, now I understand. You need this because drm_sched_rq_remove_entity() will
> grab the rq lock again!
>
> Problem is now what prevents the entity from being destroyed while you remove it?
>
> Christian.
Right, well, since (unfortunately) sched_entity is part of amdgpu_ctx_entity,
and amdgpu_ctx is refcounted, the problem here is that we don't increment
amdgpu_ctx.refcount when assigning a sched_entity to a new rq (e.g. before
drm_sched_rq_add_entity) and don't decrement it before removing. We do it
for amdgpu_cs_parser.entity, for example (in amdgpu_cs_parser_init and
amdgpu_cs_parser_fini, by calling amdgpu_ctx_get and amdgpu_ctx_put). But
this seems a bit tricky due to all the drm_sched_entity_select_rq logic.
Another, kind of band-aid, fix would probably be just locking
amdgpu_ctx_mgr.lock around drm_sched_fini when finalizing the fence driver,
and around the idr iteration in amdgpu_ctx_mgr_fini (which should be lock
protected anyway, as I see from other idr usages in the code). This should
prevent the use-after-free.
Andrey
>
>>
>> Andrey
>>
>>
>>
>>>
>>>> +
>>>> + spin_lock(&rq->lock);
>>>> + }
>>>> + spin_unlock(&rq->lock);
>>>> +
>>>> + }
>>>> +
>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>>>> + wake_up_all(&sched->job_scheduled);
>>>> +
>>>> /* Confirm no work left behind accessing device structures */
>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>
>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
@ 2021-02-18 16:41 ` Andrey Grodzovsky
0 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-18 16:41 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
On 2/18/21 10:15 AM, Christian König wrote:
> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>
>> On 2/18/21 3:07 AM, Christian König wrote:
>>>
>>>
>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>> is released and entity's job_queue not empty I encountred
>>>> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
>>>> never becomes false.
>>>>
>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>
>>>> v2:
>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>> s_entity as stopped to prevent reinserion back to rq due
>>>> to race.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
>>>> 1 file changed, 31 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 908b0b5..c6b7947 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>> */
>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>> {
>>>> + int i;
>>>> + struct drm_sched_entity *s_entity;
>>>
>>> BTW: Please order that so that i is declared last.
>>>
>>>> if (sched->thread)
>>>> kthread_stop(sched->thread);
>>>> + /* Detach all sched_entities from this scheduler once it's stopped */
>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN;
>>>> i--) {
>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>> +
>>>> + if (!rq)
>>>> + continue;
>>>> +
>>>> + /* Loop this way because rq->lock is taken in
>>>> drm_sched_rq_remove_entity */
>>>> + spin_lock(&rq->lock);
>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>> + struct drm_sched_entity,
>>>> + list))) {
>>>> + spin_unlock(&rq->lock);
>>>> +
>>>> + /* Prevent reinsertion and remove */
>>>> + spin_lock(&s_entity->rq_lock);
>>>> + s_entity->stopped = true;
>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>> + spin_unlock(&s_entity->rq_lock);
>>>
>>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>>
>>> Christian.
>>
>>
>> In what way? It's in the same order as at the other call sites (see
>> drm_sched_entity_push_job and drm_sched_entity_flush).
>> If I just locked rq->lock and did list_for_each_entry_safe, manually
>> deleting entity->list instead of calling drm_sched_rq_remove_entity,
>> that still would not work, as the order of lock acquisition between
>> s_entity->rq_lock and rq->lock would be reversed compared to the call
>> sites mentioned above.
>
> Ah, now I understand. You need this because drm_sched_rq_remove_entity() will
> grab the rq lock again!
>
> Problem is now what prevents the entity from being destroyed while you remove it?
>
> Christian.
Right, well, since (unfortunately) sched_entity is embedded in amdgpu_ctx_entity,
and amdgpu_ctx_entity is refcounted, the problem here is that we don't
increment amdgpu_ctx.refcount when assigning the sched_entity to a new rq
(e.g. before drm_sched_rq_add_entity), and don't decrement it before removing.
We do it for amdgpu_cs_parser.entity, for example (in amdgpu_cs_parser_init
and amdgpu_cs_parser_fini, by calling amdgpu_ctx_get and amdgpu_ctx_put).
But this seems a bit tricky due to all the drm_sched_entity_select_rq logic.

Another, more of a band-aid fix, would probably be just locking
amdgpu_ctx_mgr.lock around drm_sched_fini when finalizing the fence driver,
and around the idr iteration in amdgpu_ctx_mgr_fini (which should be lock
protected anyway, judging from other idr usages in the code). This should
prevent the use-after-free.
Andrey
>
>>
>> Andrey
>>
>>
>>
>>>
>>>> +
>>>> + spin_lock(&rq->lock);
>>>> + }
>>>> + spin_unlock(&rq->lock);
>>>> +
>>>> + }
>>>> +
>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>>>> + wake_up_all(&sched->job_scheduled);
>>>> +
>>>> /* Confirm no work left behind accessing device structures */
>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-18 16:41 ` Andrey Grodzovsky
@ 2021-02-19 19:17 ` Andrey Grodzovsky
1 sibling, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-19 19:17 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
Ping
Andrey
On 2/18/21 11:41 AM, Andrey Grodzovsky wrote:
>
> On 2/18/21 10:15 AM, Christian König wrote:
>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>
>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>
>>>>
>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>> Problem: If the scheduler is already stopped by the time the sched_entity
>>>>> is released and the entity's job_queue is not empty, I encountered
>>>>> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
>>>>> never becomes true.
>>>>>
>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>> Also wake up all processes stuck in sched_entity flushing, since the
>>>>> scheduler main thread that would normally wake them is stopped by now.
>>>>>
>>>>> v2:
>>>>> Reverse the order of drm_sched_rq_remove_entity and marking
>>>>> s_entity as stopped, to prevent reinsertion back into the rq due
>>>>> to a race.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
>>>>> 1 file changed, 31 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 908b0b5..c6b7947 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>> */
>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>> {
>>>>> + int i;
>>>>> + struct drm_sched_entity *s_entity;
>>>>
>>>> BTW: Please order that so that i is declared last.
>>>>
>>>>> if (sched->thread)
>>>>> kthread_stop(sched->thread);
>>>>> + /* Detach all sched_entities from this scheduler once it's stopped */
>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN;
>>>>> i--) {
>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>> +
>>>>> + if (!rq)
>>>>> + continue;
>>>>> +
>>>>> + /* Loop this way because rq->lock is taken in
>>>>> drm_sched_rq_remove_entity */
>>>>> + spin_lock(&rq->lock);
>>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>>> + struct drm_sched_entity,
>>>>> + list))) {
>>>>> + spin_unlock(&rq->lock);
>>>>> +
>>>>> + /* Prevent reinsertion and remove */
>>>>> + spin_lock(&s_entity->rq_lock);
>>>>> + s_entity->stopped = true;
>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>
>>>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>>>
>>>> Christian.
>>>
>>>
>>> In what way? It's in the same order as at the other call sites (see
>>> drm_sched_entity_push_job and drm_sched_entity_flush).
>>> If I just locked rq->lock and did list_for_each_entry_safe while manually
>>> deleting entity->list instead of calling
>>> drm_sched_rq_remove_entity this still would not be possible as the order of
>>> lock acquisition between s_entity->rq_lock
>>> and rq->lock would be reverse compared to the call sites mentioned above.
>>
>> Ah, now I understand. You need this because drm_sched_rq_remove_entity() will
>> grab the rq lock again!
>>
>> Problem is now what prevents the entity from being destroyed while you remove
>> it?
>>
>> Christian.
>
> Right, well, since (unfortunately) sched_entity is part of amdgpu_ctx_entity
> and amdgpu_ctx_entity is refcounted
> there is a problem here that we don't increment amdgpu_ctx.refcount when
> assigning sched_entity
> to new rq (e.g. before drm_sched_rq_add_entity) and not decrement before
> removing. We do it for
> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
> amdgpu_cs_parser_fini by
> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit tricky due to
> all the drm_sched_entity_select_rq
> logic.
>
> Another, kind of a band aid fix, would probably be just locking
> amdgpu_ctx_mgr.lock around drm_sched_fini
> when finalizing the fence driver and around idr iteration in
> amdgpu_ctx_mgr_fini (which should be lock protected
> anyway as I see from other idr usages in the code) ... This should prevent
> this use after free.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>
>>>>
>>>>> +
>>>>> + spin_lock(&rq->lock);
>>>>> + }
>>>>> + spin_unlock(&rq->lock);
>>>>> +
>>>>> + }
>>>>> +
>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>>>>> + wake_up_all(&sched->job_scheduled);
>>>>> +
>>>>> /* Confirm no work left behind accessing device structures */
>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>
>>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-18 16:41 ` Andrey Grodzovsky
@ 2021-02-20 8:38 ` Christian König
1 sibling, 0 replies; 30+ messages in thread
From: Christian König @ 2021-02-20 8:38 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>
> On 2/18/21 10:15 AM, Christian König wrote:
>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>
>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>
>>>>
>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>> Problem: If the scheduler is already stopped by the time the sched_entity
>>>>> is released and the entity's job_queue is not empty, I encountered
>>>>> a hang in drm_sched_entity_flush. This is because
>>>>> drm_sched_entity_is_idle
>>>>> never becomes true.
>>>>>
>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>> Also wake up all processes stuck in sched_entity flushing, since the
>>>>> scheduler main thread that would normally wake them is stopped by now.
>>>>>
>>>>> v2:
>>>>> Reverse the order of drm_sched_rq_remove_entity and marking
>>>>> s_entity as stopped, to prevent reinsertion back into the rq due
>>>>> to a race.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>> +++++++++++++++++++++++++++++++
>>>>> 1 file changed, 31 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 908b0b5..c6b7947 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>> */
>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>> {
>>>>> + int i;
>>>>> + struct drm_sched_entity *s_entity;
>>>>
>>>> BTW: Please order that so that i is declared last.
>>>>
>>>>> if (sched->thread)
>>>>> kthread_stop(sched->thread);
>>>>> + /* Detach all sched_entities from this scheduler once it's
>>>>> stopped */
>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>> +
>>>>> + if (!rq)
>>>>> + continue;
>>>>> +
>>>>> + /* Loop this way because rq->lock is taken in
>>>>> drm_sched_rq_remove_entity */
>>>>> + spin_lock(&rq->lock);
>>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>>> + struct drm_sched_entity,
>>>>> + list))) {
>>>>> + spin_unlock(&rq->lock);
>>>>> +
>>>>> + /* Prevent reinsertion and remove */
>>>>> + spin_lock(&s_entity->rq_lock);
>>>>> + s_entity->stopped = true;
>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>
>>>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>>>
>>>> Christian.
>>>
>>>
>>> In what way? It's in the same order as at the other call sites
>>> (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>> If I just locked rq->lock and did list_for_each_entry_safe while
>>> manually deleting entity->list instead of calling
>>> drm_sched_rq_remove_entity this still would not be possible as the
>>> order of lock acquisition between s_entity->rq_lock
>>> and rq->lock would be reverse compared to the call sites mentioned
>>> above.
>>
>> Ah, now I understand. You need this because
>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>
>> Problem is now what prevents the entity from being destroyed while
>> you remove it?
>>
>> Christian.
>
> Right, well, since (unfortunately) sched_entity is part of
> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
> there is a problem here that we don't increment amdgpu_ctx.refcount
> when assigning sched_entity
> to new rq (e.g. before drm_sched_rq_add_entity) and not decrement
> before removing. We do it for
> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
> amdgpu_cs_parser_fini by
> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
> tricky due to all the drm_sched_entity_select_rq
> logic.
>
> Another, kind of a band aid fix, would probably be just locking
> amdgpu_ctx_mgr.lock around drm_sched_fini
> when finalizing the fence driver and around idr iteration in
> amdgpu_ctx_mgr_fini (which should be lock protected
> anyway as I see from other idr usages in the code) ... This should
> prevent this use after free.
Puh, that's rather complicated as well. Ok let's look at it from the
other side for a moment.
Why do we have to remove the entities from the rq in the first place?
Wouldn't it be sufficient to just set all of them to stopped?
Christian.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>
>>>>
>>>>> +
>>>>> + spin_lock(&rq->lock);
>>>>> + }
>>>>> + spin_unlock(&rq->lock);
>>>>> +
>>>>> + }
>>>>> +
>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this
>>>>> scheduler */
>>>>> + wake_up_all(&sched->job_scheduled);
>>>>> +
>>>>> /* Confirm no work left behind accessing device structures */
>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>
>>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-20 8:38 ` Christian König
@ 2021-02-20 12:12 ` Andrey Grodzovsky
1 sibling, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-20 12:12 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
On 2/20/21 3:38 AM, Christian König wrote:
>
>
> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>
>> On 2/18/21 10:15 AM, Christian König wrote:
>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>
>>>>>
>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>> Problem: If the scheduler is already stopped by the time the sched_entity
>>>>>> is released and the entity's job_queue is not empty, I encountered
>>>>>> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
>>>>>> never becomes true.
>>>>>>
>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>>> Also wake up all processes stuck in sched_entity flushing, since the
>>>>>> scheduler main thread that would normally wake them is stopped by now.
>>>>>>
>>>>>> v2:
>>>>>> Reverse the order of drm_sched_rq_remove_entity and marking
>>>>>> s_entity as stopped, to prevent reinsertion back into the rq due
>>>>>> to a race.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>> +++++++++++++++++++++++++++++++
>>>>>> 1 file changed, 31 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 908b0b5..c6b7947 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>> */
>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>> {
>>>>>> + int i;
>>>>>> + struct drm_sched_entity *s_entity;
>>>>>
>>>>> BTW: Please order that so that i is declared last.
>>>>>
>>>>>> if (sched->thread)
>>>>>> kthread_stop(sched->thread);
>>>>>> + /* Detach all sched_entities from this scheduler once it's stopped */
>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN;
>>>>>> i--) {
>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>> +
>>>>>> + if (!rq)
>>>>>> + continue;
>>>>>> +
>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>> drm_sched_rq_remove_entity */
>>>>>> + spin_lock(&rq->lock);
>>>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>>>> + struct drm_sched_entity,
>>>>>> + list))) {
>>>>>> + spin_unlock(&rq->lock);
>>>>>> +
>>>>>> + /* Prevent reinsertion and remove */
>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>> + s_entity->stopped = true;
>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>
>>>>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>>>>
>>>>> Christian.
>>>>
>>>>
>>>> In what way? It's in the same order as at the other call sites (see
>>>> drm_sched_entity_push_job and drm_sched_entity_flush).
>>>> If I just locked rq->lock and did list_for_each_entry_safe while manually
>>>> deleting entity->list instead of calling
>>>> drm_sched_rq_remove_entity this still would not be possible as the order of
>>>> lock acquisition between s_entity->rq_lock
>>>> and rq->lock would be reverse compared to the call sites mentioned above.
>>>
>>> Ah, now I understand. You need this because drm_sched_rq_remove_entity()
>>> will grab the rq lock again!
>>>
>>> Problem is now what prevents the entity from being destroyed while you
>>> remove it?
>>>
>>> Christian.
>>
>> Right, well, since (unfortunately) sched_entity is part of amdgpu_ctx_entity
>> and amdgpu_ctx_entity is refcounted
>> there is a problem here that we don't increment amdgpu_ctx.refcount when
>> assigning sched_entity
>> to new rq (e.g. before drm_sched_rq_add_entity) and not decrement before
>> removing. We do it for
>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>> amdgpu_cs_parser_fini by
>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit tricky due
>> to all the drm_sched_entity_select_rq
>> logic.
>>
>> Another, kind of a band aid fix, would probably be just locking
>> amdgpu_ctx_mgr.lock around drm_sched_fini
>> when finalizing the fence driver and around idr iteration in
>> amdgpu_ctx_mgr_fini (which should be lock protected
>> anyway as I see from other idr usages in the code) ... This should prevent
>> this use after free.
>
> Puh, that's rather complicated as well. Ok let's look at it from the other
> side for a moment.
>
> Why do we have to remove the entities from the rq in the first place?
>
> Wouldn't it be sufficient to just set all of them to stopped?
>
> Christian.
And adding it as another condition in drm_sched_entity_is_idle?
Andrey
>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>
>>>>>
>>>>>> +
>>>>>> + spin_lock(&rq->lock);
>>>>>> + }
>>>>>> + spin_unlock(&rq->lock);
>>>>>> +
>>>>>> + }
>>>>>> +
>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this
>>>>>> scheduler */
>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>> +
>>>>>> /* Confirm no work left behind accessing device structures */
>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>
>>>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
@ 2021-02-20 12:12 ` Andrey Grodzovsky
0 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-20 12:12 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
On 2/20/21 3:38 AM, Christian König wrote:
>
>
> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>
>> On 2/18/21 10:15 AM, Christian König wrote:
>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>
>>>>>
>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>> is released and entity's job_queue not empty I encountred
>>>>>> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
>>>>>> never becomes false.
>>>>>>
>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>>>
>>>>>> v2:
>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>> s_entity as stopped to prevent reinserion back to rq due
>>>>>> to race.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>> +++++++++++++++++++++++++++++++
>>>>>> 1 file changed, 31 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 908b0b5..c6b7947 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>> */
>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>> {
>>>>>> + int i;
>>>>>> + struct drm_sched_entity *s_entity;
>>>>>
>>>>> BTW: Please order that so that i is declared last.
>>>>>
>>>>>> if (sched->thread)
>>>>>> kthread_stop(sched->thread);
>>>>>> + /* Detach all sched_entites from this scheduler once it's stopped */
>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN;
>>>>>> i--) {
>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>> +
>>>>>> + if (!rq)
>>>>>> + continue;
>>>>>> +
>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>> drm_sched_rq_remove_entity */
>>>>>> + spin_lock(&rq->lock);
>>>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>>>> + struct drm_sched_entity,
>>>>>> + list))) {
>>>>>> + spin_unlock(&rq->lock);
>>>>>> +
>>>>>> + /* Prevent reinsertion and remove */
>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>> + s_entity->stopped = true;
>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>
>>>>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>>>>
>>>>> Christian.
>>>>
>>>>
>>>> In what way? It's in the same order as at the other call sites (see
>>>> drm_sched_entity_push_job and drm_sched_entity_flush).
>>>> If I just locked rq->lock and did list_for_each_entry_safe while manually
>>>> deleting entity->list instead of calling
>>>> drm_sched_rq_remove_entity, this still would not be possible, as the order of
>>>> lock acquisition between s_entity->rq_lock
>>>> and rq->lock would be reversed compared to the call sites mentioned above.
>>>
>>> Ah, now I understand. You need this because drm_sched_rq_remove_entity()
>>> will grab the rq lock again!
>>>
>>> Problem is now what prevents the entity from being destroyed while you
>>> remove it?
>>>
>>> Christian.
>>
>> Right, well, since (unfortunately) sched_entity is part of amdgpu_ctx_entity
>> and amdgpu_ctx_entity is refcounted,
>> there is a problem here: we don't increment amdgpu_ctx.refcount when
>> assigning a sched_entity
>> to a new rq (e.g. before drm_sched_rq_add_entity) and don't decrement it
>> before removing. We do it for
>> amdgpu_cs_parser.entity, for example (in amdgpu_cs_parser_init and
>> amdgpu_cs_parser_fini, by
>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit tricky due
>> to all the drm_sched_entity_select_rq
>> logic.
>>
>> Another, kind of band-aid, fix would probably be just locking
>> amdgpu_ctx_mgr.lock around drm_sched_fini
>> when finalizing the fence driver, and around the idr iteration in
>> amdgpu_ctx_mgr_fini (which should be lock-protected
>> anyway, as I see from other idr usages in the code)... This should prevent
>> this use-after-free.
>
> Puh, that's rather complicated as well. Ok let's look at it from the other
> side for a moment.
>
> Why do we have to remove the entities from the rq in the first place?
>
> Wouldn't it be sufficient to just set all of them to stopped?
>
> Christian.
And adding it as another condition in drm_sched_entity_is_idle?
Andrey
>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>
>>>>>
>>>>>> +
>>>>>> + spin_lock(&rq->lock);
>>>>>> + }
>>>>>> + spin_unlock(&rq->lock);
>>>>>> +
>>>>>> + }
>>>>>> +
>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this
>>>>>> scheduler */
>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>> +
>>>>>> /* Confirm no work left behind accessing device structures */
>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>
>>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-20 12:12 ` Andrey Grodzovsky
@ 2021-02-22 13:35 ` Andrey Grodzovsky
-1 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-22 13:35 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
Ping
Andrey
On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>
> On 2/20/21 3:38 AM, Christian König wrote:
>>
>>
>> On 18.02.21 at 17:41, Andrey Grodzovsky wrote:
>>>
>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>> On 18.02.21 at 16:05, Andrey Grodzovsky wrote:
>>>>>
>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>
>>>>>>
>>>>>> On 17.02.21 at 22:59, Andrey Grodzovsky wrote:
>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>> is released and entity's job_queue not empty I encountred
>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>> drm_sched_entity_is_idle
>>>>>>> never becomes false.
>>>>>>>
>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>>>>
>>>>>>> v2:
>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>> s_entity as stopped to prevent reinserion back to rq due
>>>>>>> to race.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>> */
>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>> {
>>>>>>> + int i;
>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>
>>>>>> BTW: Please order that so that i is declared last.
>>>>>>
>>>>>>> if (sched->thread)
>>>>>>> kthread_stop(sched->thread);
>>>>>>> + /* Detach all sched_entites from this scheduler once it's
>>>>>>> stopped */
>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>> +
>>>>>>> + if (!rq)
>>>>>>> + continue;
>>>>>>> +
>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>> drm_sched_rq_remove_entity */
>>>>>>> + spin_lock(&rq->lock);
>>>>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>>>>> + struct drm_sched_entity,
>>>>>>> + list))) {
>>>>>>> + spin_unlock(&rq->lock);
>>>>>>> +
>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>> + s_entity->stopped = true;
>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>
>>>>>> Well this spin_unlock/lock dance here doesn't look correct at all
>>>>>> now.
>>>>>>
>>>>>> Christian.
>>>>>
>>>>>
>>>>> In what way ? It's in the same same order as in other call sites
>>>>> (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>> If i just locked rq->lock and did list_for_each_entry_safe while
>>>>> manually deleting entity->list instead of calling
>>>>> drm_sched_rq_remove_entity this still would not be possible as the
>>>>> order of lock acquisition between s_entity->rq_lock
>>>>> and rq->lock would be reverse compared to the call sites mentioned
>>>>> above.
>>>>
>>>> Ah, now I understand. You need this because
>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>
>>>> Problem is now what prevents the entity from being destroyed while
>>>> you remove it?
>>>>
>>>> Christian.
>>>
>>> Right, well, since (unfortunately) sched_entity is part of
>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>> there is a problem here that we don't increment amdgpu_ctx.refcount
>>> when assigning sched_entity
>>> to new rq (e.g. before drm_sched_rq_add_entity) and not decrement
>>> before removing. We do it for
>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>>> amdgpu_cs_parser_fini by
>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>> tricky due to all the drm_sched_entity_select_rq
>>> logic.
>>>
>>> Another, kind of a band aid fix, would probably be just locking
>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>> when finalizing the fence driver and around idr iteration in
>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>> anyway as I see from other idr usages in the code) ... This should
>>> prevent this use after free.
>>
>> Puh, that's rather complicated as well. Ok let's look at it from the
>> other side for a moment.
>>
>> Why do we have to remove the entities from the rq in the first place?
>>
>> Wouldn't it be sufficient to just set all of them to stopped?
>>
>> Christian.
>
>
> And adding it as another condition in drm_sched_entity_is_idle ?
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>> +
>>>>>>> + spin_lock(&rq->lock);
>>>>>>> + }
>>>>>>> + spin_unlock(&rq->lock);
>>>>>>> +
>>>>>>> + }
>>>>>>> +
>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this
>>>>>>> scheduler */
>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>> +
>>>>>>> /* Confirm no work left behind accessing device structures */
>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>
>>>>
>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-20 12:12 ` Andrey Grodzovsky
@ 2021-02-24 15:13 ` Andrey Grodzovsky
-1 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-24 15:13 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
Ping
Andrey
On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>
> On 2/20/21 3:38 AM, Christian König wrote:
>>
>>
>> On 18.02.21 at 17:41, Andrey Grodzovsky wrote:
>>>
>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>> On 18.02.21 at 16:05, Andrey Grodzovsky wrote:
>>>>>
>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>
>>>>>>
>>>>>> On 17.02.21 at 22:59, Andrey Grodzovsky wrote:
>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>> is released and entity's job_queue not empty I encountred
>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>> drm_sched_entity_is_idle
>>>>>>> never becomes false.
>>>>>>>
>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>>>>
>>>>>>> v2:
>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>> s_entity as stopped to prevent reinserion back to rq due
>>>>>>> to race.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>> */
>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>> {
>>>>>>> + int i;
>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>
>>>>>> BTW: Please order that so that i is declared last.
>>>>>>
>>>>>>> if (sched->thread)
>>>>>>> kthread_stop(sched->thread);
>>>>>>> + /* Detach all sched_entites from this scheduler once it's
>>>>>>> stopped */
>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>> +
>>>>>>> + if (!rq)
>>>>>>> + continue;
>>>>>>> +
>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>> drm_sched_rq_remove_entity */
>>>>>>> + spin_lock(&rq->lock);
>>>>>>> + while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>>>>> + struct drm_sched_entity,
>>>>>>> + list))) {
>>>>>>> + spin_unlock(&rq->lock);
>>>>>>> +
>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>> + s_entity->stopped = true;
>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>
>>>>>> Well this spin_unlock/lock dance here doesn't look correct at all
>>>>>> now.
>>>>>>
>>>>>> Christian.
>>>>>
>>>>>
>>>>> In what way ? It's in the same same order as in other call sites
>>>>> (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>> If i just locked rq->lock and did list_for_each_entry_safe while
>>>>> manually deleting entity->list instead of calling
>>>>> drm_sched_rq_remove_entity this still would not be possible as the
>>>>> order of lock acquisition between s_entity->rq_lock
>>>>> and rq->lock would be reverse compared to the call sites mentioned
>>>>> above.
>>>>
>>>> Ah, now I understand. You need this because
>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>
>>>> Problem is now what prevents the entity from being destroyed while
>>>> you remove it?
>>>>
>>>> Christian.
>>>
>>> Right, well, since (unfortunately) sched_entity is part of
>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>> there is a problem here that we don't increment amdgpu_ctx.refcount
>>> when assigning sched_entity
>>> to new rq (e.g. before drm_sched_rq_add_entity) and not decrement
>>> before removing. We do it for
>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>>> amdgpu_cs_parser_fini by
>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>> tricky due to all the drm_sched_entity_select_rq
>>> logic.
>>>
>>> Another, kind of a band aid fix, would probably be just locking
>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>> when finalizing the fence driver and around idr iteration in
>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>> anyway as I see from other idr usages in the code) ... This should
>>> prevent this use after free.
>>
>> Puh, that's rather complicated as well. Ok let's look at it from the
>> other side for a moment.
>>
>> Why do we have to remove the entities from the rq in the first place?
>>
>> Wouldn't it be sufficient to just set all of them to stopped?
>>
>> Christian.
>
>
> And adding it as another condition in drm_sched_entity_is_idle ?
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>> +
>>>>>>> + spin_lock(&rq->lock);
>>>>>>> + }
>>>>>>> + spin_unlock(&rq->lock);
>>>>>>> +
>>>>>>> + }
>>>>>>> +
>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for this
>>>>>>> scheduler */
>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>> +
>>>>>>> /* Confirm no work left behind accessing device structures */
>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>
>>>>
>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-24 15:13 ` Andrey Grodzovsky
@ 2021-02-25 7:53 ` Christian König
-1 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2021-02-25 7:53 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
On 24.02.21 at 16:13, Andrey Grodzovsky wrote:
> Ping
Sorry, I've been on vacation this week.
>
> Andrey
>
> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>
>> On 2/20/21 3:38 AM, Christian König wrote:
>>>
>>>
>>> On 18.02.21 at 17:41, Andrey Grodzovsky wrote:
>>>>
>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>> On 18.02.21 at 16:05, Andrey Grodzovsky wrote:
>>>>>>
>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 17.02.21 at 22:59, Andrey Grodzovsky wrote:
>>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>>> is released and entity's job_queue not empty I encountred
>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>> drm_sched_entity_is_idle
>>>>>>>> never becomes false.
>>>>>>>>
>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>> as the scheduler main thread which wakes them up is stopped by
>>>>>>>> now.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>> s_entity as stopped to prevent reinserion back to rq due
>>>>>>>> to race.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>> */
>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>> {
>>>>>>>> + int i;
>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>
>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>
>>>>>>>> if (sched->thread)
>>>>>>>> kthread_stop(sched->thread);
>>>>>>>> + /* Detach all sched_entites from this scheduler once
>>>>>>>> it's stopped */
>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>> +
>>>>>>>> + if (!rq)
>>>>>>>> + continue;
>>>>>>>> +
>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>> + while ((s_entity =
>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>> + struct drm_sched_entity,
>>>>>>>> + list))) {
>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>> +
>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>> + s_entity->stopped = true;
>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>
>>>>>>> Well this spin_unlock/lock dance here doesn't look correct at
>>>>>>> all now.
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>>
>>>>>> In what way ? It's in the same same order as in other call sites
>>>>>> (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>>> If i just locked rq->lock and did list_for_each_entry_safe while
>>>>>> manually deleting entity->list instead of calling
>>>>>> drm_sched_rq_remove_entity this still would not be possible as
>>>>>> the order of lock acquisition between s_entity->rq_lock
>>>>>> and rq->lock would be reverse compared to the call sites
>>>>>> mentioned above.
>>>>>
>>>>> Ah, now I understand. You need this because
>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>
>>>>> Problem is now what prevents the entity from being destroyed while
>>>>> you remove it?
>>>>>
>>>>> Christian.
>>>>
>>>> Right, well, since (unfortunately) sched_entity is part of
>>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>>> there is a problem here that we don't increment amdgpu_ctx.refcount
>>>> when assigning sched_entity
>>>> to new rq (e.g. before drm_sched_rq_add_entity) and don't decrement
>>>> it before removing. We do it for
>>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>>>> amdgpu_cs_parser_fini by
>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>>> tricky due to all the drm_sched_entity_select_rq
>>>> logic.
>>>>
>>>> Another, kind of a band aid fix, would probably be just locking
>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>> when finalizing the fence driver and around idr iteration in
>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>> anyway as I see from other idr usages in the code) ... This should
>>>> prevent this use after free.
>>>
>>> Puh, that's rather complicated as well. Ok let's look at it from the
>>> other side for a moment.
>>>
>>> Why do we have to remove the entities from the rq in the first place?
>>>
>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>
>>> Christian.
>>
>>
>> And adding it as another condition in drm_sched_entity_is_idle ?
Yes, I think that should work.
Christian.
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>> + }
>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>> +
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>> this scheduler */
>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>> +
>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>> structures */
>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>
>>>>>
>>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
@ 2021-02-25 7:53 ` Christian König
From: Christian König @ 2021-02-25 7:53 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
> Ping
Sorry, I've been on vacation this week.
>
> Andrey
>
> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>
>> On 2/20/21 3:38 AM, Christian König wrote:
>>>
>>>
>>> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>
>>>>>>>
>>>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>>> is released and entity's job_queue not empty I encountered
>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>> drm_sched_entity_is_idle
>>>>>>>> never becomes true.
>>>>>>>>
>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>> as the scheduler main thread which wakes them up is stopped by
>>>>>>>> now.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>> to race.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>> */
>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>> {
>>>>>>>> + int i;
>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>
>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>
>>>>>>>> if (sched->thread)
>>>>>>>> kthread_stop(sched->thread);
>>>>>>>> + /* Detach all sched_entities from this scheduler once
>>>>>>>> it's stopped */
>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>> +
>>>>>>>> + if (!rq)
>>>>>>>> + continue;
>>>>>>>> +
>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>> + while ((s_entity =
>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>> + struct drm_sched_entity,
>>>>>>>> + list))) {
>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>> +
>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>> + s_entity->stopped = true;
>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>
>>>>>>> Well this spin_unlock/lock dance here doesn't look correct at
>>>>>>> all now.
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>>
>>>>>> In what way ? It's in the same order as in other call sites
>>>>>> (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>>> If I just locked rq->lock and did list_for_each_entry_safe while
>>>>>> manually deleting entity->list instead of calling
>>>>>> drm_sched_rq_remove_entity this still would not be possible as
>>>>>> the order of lock acquisition between s_entity->rq_lock
>>>>>> and rq->lock would be reverse compared to the call sites
>>>>>> mentioned above.
>>>>>
>>>>> Ah, now I understand. You need this because
>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>
>>>>> Problem is now what prevents the entity from being destroyed while
>>>>> you remove it?
>>>>>
>>>>> Christian.
>>>>
>>>> Right, well, since (unfortunately) sched_entity is part of
>>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>>> there is a problem here that we don't increment amdgpu_ctx.refcount
>>>> when assigning sched_entity
>>>> to new rq (e.g. before drm_sched_rq_add_entity) and don't decrement
>>>> it before removing. We do it for
>>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>>>> amdgpu_cs_parser_fini by
>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>>> tricky due to all the drm_sched_entity_select_rq
>>>> logic.
>>>>
>>>> Another, kind of a band aid fix, would probably be just locking
>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>> when finalizing the fence driver and around idr iteration in
>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>> anyway as I see from other idr usages in the code) ... This should
>>>> prevent this use after free.
>>>
>>> Puh, that's rather complicated as well. Ok let's look at it from the
>>> other side for a moment.
>>>
>>> Why do we have to remove the entities from the rq in the first place?
>>>
>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>
>>> Christian.
>>
>>
>> And adding it as another condition in drm_sched_entity_is_idle ?
Yes, I think that should work.
Christian.
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>> + }
>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>> +
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>> this scheduler */
>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>> +
>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>> structures */
>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>
>>>>>
>>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-25 7:53 ` Christian König
@ 2021-02-25 16:03 ` Andrey Grodzovsky
From: Andrey Grodzovsky @ 2021-02-25 16:03 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
On 2021-02-25 2:53 a.m., Christian König wrote:
> Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
>> Ping
>
> Sorry, I've been on vacation this week.
>
>>
>> Andrey
>>
>> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>>
>>> On 2/20/21 3:38 AM, Christian König wrote:
>>>>
>>>>
>>>> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>>>> is released and entity's job_queue not empty I encountered
>>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>> never becomes true.
>>>>>>>>>
>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>>> as the scheduler main thread which wakes them up is stopped by
>>>>>>>>> now.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>>> to race.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>> */
>>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>> {
>>>>>>>>> + int i;
>>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>>
>>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>>
>>>>>>>>> if (sched->thread)
>>>>>>>>> kthread_stop(sched->thread);
>>>>>>>>> + /* Detach all sched_entities from this scheduler once
>>>>>>>>> it's stopped */
>>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>> +
>>>>>>>>> + if (!rq)
>>>>>>>>> + continue;
>>>>>>>>> +
>>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>> + while ((s_entity =
>>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>>> + struct drm_sched_entity,
>>>>>>>>> + list))) {
>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>> +
>>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>>> + s_entity->stopped = true;
>>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>>
>>>>>>>> Well this spin_unlock/lock dance here doesn't look correct at
>>>>>>>> all now.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>
>>>>>>>
>>>>>>> In what way ? It's in the same order as in other call sites
>>>>>>> (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>>>> If I just locked rq->lock and did list_for_each_entry_safe while
>>>>>>> manually deleting entity->list instead of calling
>>>>>>> drm_sched_rq_remove_entity this still would not be possible as
>>>>>>> the order of lock acquisition between s_entity->rq_lock
>>>>>>> and rq->lock would be reverse compared to the call sites
>>>>>>> mentioned above.
>>>>>>
>>>>>> Ah, now I understand. You need this because
>>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>>
>>>>>> Problem is now what prevents the entity from being destroyed
>>>>>> while you remove it?
>>>>>>
>>>>>> Christian.
>>>>>
>>>>> Right, well, since (unfortunately) sched_entity is part of
>>>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>>>> there is a problem here that we don't increment
>>>>> amdgpu_ctx.refcount when assigning sched_entity
>>>>> to new rq (e.g. before drm_sched_rq_add_entity) and don't decrement
>>>>> it before removing. We do it for
>>>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>>>>> amdgpu_cs_parser_fini by
>>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>>>> tricky due to all the drm_sched_entity_select_rq
>>>>> logic.
>>>>>
>>>>> Another, kind of a band aid fix, would probably be just locking
>>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>>> when finalizing the fence driver and around idr iteration in
>>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>>> anyway as I see from other idr usages in the code) ... This should
>>>>> prevent this use after free.
>>>>
>>>> Puh, that's rather complicated as well. Ok let's look at it from
>>>> the other side for a moment.
>>>>
>>>> Why do we have to remove the entities from the rq in the first place?
>>>>
>>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>>
>>>> Christian.
>>>
>>>
>>> And adding it as another condition in drm_sched_entity_is_idle ?
>
> Yes, I think that should work.
In this case a reverse locking order is created: in
drm_sched_entity_push_job and drm_sched_entity_flush,
entity->rq_lock is locked first and rq->lock second. In
drm_sched_fini, on the other hand, I need to lock rq->lock first to
iterate rq->entities and then lock s_entity->rq_lock for each entity to
modify s_entity->stopped. I guess we could change s_entity->stopped to
an atomic ?
Andrey
>
> Christian.
>
>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>> + }
>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>> +
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>>> this scheduler */
>>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>>> +
>>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>>> structures */
>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>
>>>>>>
>>>>
>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-25 16:03 ` Andrey Grodzovsky
@ 2021-02-25 18:42 ` Christian König
From: Christian König @ 2021-02-25 18:42 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 25.02.21 um 17:03 schrieb Andrey Grodzovsky:
>
> On 2021-02-25 2:53 a.m., Christian König wrote:
>> Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
>>> Ping
>>
>> Sorry, I've been on vacation this week.
>>
>>>
>>> Andrey
>>>
>>> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>>>
>>>> On 2/20/21 3:38 AM, Christian König wrote:
>>>>>
>>>>>
>>>>> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>>>>>
>>>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>>>>>> Problem: If scheduler is already stopped by the time
>>>>>>>>>> sched_entity
>>>>>>>>>> is released and entity's job_queue not empty I encountered
>>>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>>> never becomes true.
>>>>>>>>>>
>>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>>>> as the scheduler main thread which wakes them up is stopped
>>>>>>>>>> by now.
>>>>>>>>>>
>>>>>>>>>> v2:
>>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>>>> to race.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>> */
>>>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>> {
>>>>>>>>>> + int i;
>>>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>>>
>>>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>>>
>>>>>>>>>> if (sched->thread)
>>>>>>>>>> kthread_stop(sched->thread);
>>>>>>>>>> + /* Detach all sched_entities from this scheduler once
>>>>>>>>>> it's stopped */
>>>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>>> +
>>>>>>>>>> + if (!rq)
>>>>>>>>>> + continue;
>>>>>>>>>> +
>>>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>> + while ((s_entity =
>>>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>>>> + struct drm_sched_entity,
>>>>>>>>>> + list))) {
>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>> +
>>>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>>>> + s_entity->stopped = true;
>>>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>>>
>>>>>>>>> Well this spin_unlock/lock dance here doesn't look correct at
>>>>>>>>> all now.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>
>>>>>>>> In what way ? It's in the same order as in other call
>>>>>>>> sites (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>>>>> If I just locked rq->lock and did list_for_each_entry_safe
>>>>>>>> while manually deleting entity->list instead of calling
>>>>>>>> drm_sched_rq_remove_entity this still would not be possible as
>>>>>>>> the order of lock acquisition between s_entity->rq_lock
>>>>>>>> and rq->lock would be reverse compared to the call sites
>>>>>>>> mentioned above.
>>>>>>>
>>>>>>> Ah, now I understand. You need this because
>>>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>>>
>>>>>>> Problem is now what prevents the entity from being destroyed
>>>>>>> while you remove it?
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>> Right, well, since (unfortunately) sched_entity is part of
>>>>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>>>>> there is a problem here that we don't increment
>>>>>> amdgpu_ctx.refcount when assigning sched_entity
>>>>>> to new rq (e.g. before drm_sched_rq_add_entity) and don't decrement
>>>>>> it before removing. We do it for
>>>>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>>>>>> amdgpu_cs_parser_fini by
>>>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>>>>> tricky due to all the drm_sched_entity_select_rq
>>>>>> logic.
>>>>>>
>>>>>> Another, kind of a band aid fix, would probably be just locking
>>>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>>>> when finalizing the fence driver and around idr iteration in
>>>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>>>> anyway as I see from other idr usages in the code) ... This
>>>>>> should prevent this use after free.
>>>>>
>>>>> Puh, that's rather complicated as well. Ok let's look at it from
>>>>> the other side for a moment.
>>>>>
>>>>> Why do we have to remove the entities from the rq in the first place?
>>>>>
>>>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>>>
>>>>> Christian.
>>>>
>>>>
>>>> And adding it as another condition in drm_sched_entity_is_idle ?
>>
>> Yes, I think that should work.
>
>
> In this case a reverse locking order is created: in
> drm_sched_entity_push_job and drm_sched_entity_flush,
> entity->rq_lock is locked first and rq->lock second. In
> drm_sched_fini, on the other hand, I need to lock rq->lock first to
> iterate rq->entities and then lock s_entity->rq_lock for each entity
> to modify s_entity->stopped. I guess we could change s_entity->stopped
> to an atomic ?
Good point. But I think the memory barrier inserted by
wake_up_all(&sched->job_scheduled) should be sufficient.
If I see this correctly, we have the entity->rq_lock mainly to protect
concurrent changes of entity->rq.
But when two CPUs both set entity->stopped to true at the same time we
don't really care, as long as drm_sched_entity_is_idle() sees it.
Regards,
Christian.
>
> Andrey
>
>
>>
>> Christian.
>>
>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>> + }
>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>> +
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>>>> this scheduler */
>>>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>>>> +
>>>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>>>> structures */
>>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>>
>>>>>>>
>>>>>
>>
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
@ 2021-02-25 18:42 ` Christian König
0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2021-02-25 18:42 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 25.02.21 um 17:03 schrieb Andrey Grodzovsky:
>
> On 2021-02-25 2:53 a.m., Christian König wrote:
>> Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
>>> Ping
>>
>> Sorry, I've been on vacation this week.
>>
>>>
>>> Andrey
>>>
>>> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>>>
>>>> On 2/20/21 3:38 AM, Christian König wrote:
>>>>>
>>>>>
>>>>> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>>>>>
>>>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>>>>>> Problem: If scheduler is already stopped by the time
>>>>>>>>>> sched_entity
>>>>>>>>>> is released and entity's job_queue not empty I encountred
>>>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>>> never becomes false.
>>>>>>>>>>
>>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>>>> as the scheduler main thread which wakes them up is stopped
>>>>>>>>>> by now.
>>>>>>>>>>
>>>>>>>>>> v2:
>>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>>> s_entity as stopped to prevent reinserion back to rq due
>>>>>>>>>> to race.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>> */
>>>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>> {
>>>>>>>>>> + int i;
>>>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>>>
>>>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>>>
>>>>>>>>>> if (sched->thread)
>>>>>>>>>> kthread_stop(sched->thread);
>>>>>>>>>> + /* Detach all sched_entites from this scheduler once
>>>>>>>>>> it's stopped */
>>>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>>> +
>>>>>>>>>> + if (!rq)
>>>>>>>>>> + continue;
>>>>>>>>>> +
>>>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>> + while ((s_entity =
>>>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>>>> + struct drm_sched_entity,
>>>>>>>>>> + list))) {
>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>> +
>>>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>>>> + s_entity->stopped = true;
>>>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>>>
>>>>>>>>> Well this spin_unlock/lock dance here doesn't look correct at
>>>>>>>>> all now.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>
>>>>>>>> In what way? It's in the same order as in the other call
>>>>>>>> sites (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>>>>> If I just locked rq->lock and did list_for_each_entry_safe
>>>>>>>> while manually deleting entity->list instead of calling
>>>>>>>> drm_sched_rq_remove_entity this still would not be possible as
>>>>>>>> the order of lock acquisition between s_entity->rq_lock
>>>>>>>> and rq->lock would be reverse compared to the call sites
>>>>>>>> mentioned above.
>>>>>>>
>>>>>>> Ah, now I understand. You need this because
>>>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>>>
>>>>>>> Problem is now what prevents the entity from being destroyed
>>>>>>> while you remove it?
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>> Right, well, since (unfortunately) sched_entity is part of
>>>>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>>>>> there is a problem here that we don't increment
>>>>>> amdgpu_ctx.refcount when assigning sched_entity
>>>>>> to new rq (e.g. before drm_sched_rq_add_entity) and don't decrement
>>>>>> it before removing. We do it for
>>>>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and
>>>>>> amdgpu_cs_parser_fini by
>>>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>>>>> tricky due to all the drm_sched_entity_select_rq
>>>>>> logic.
>>>>>>
>>>>>> Another, kind of a band aid fix, would probably be just locking
>>>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>>>> when finalizing the fence driver and around idr iteration in
>>>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>>>> anyway as I see from other idr usages in the code) ... This
>>>>>> should prevent this use after free.
>>>>>
>>>>> Puh, that's rather complicated as well. Ok let's look at it from
>>>>> the other side for a moment.
>>>>>
>>>>> Why do we have to remove the entities from the rq in the first place?
>>>>>
>>>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>>>
>>>>> Christian.
>>>>
>>>>
>>>> And adding it as another condition in drm_sched_entity_is_idle ?
>>
>> Yes, I think that should work.
>
>
> In this case a reverse locking order is created: in
> drm_sched_entity_push_job and drm_sched_entity_flush,
> entity->rq_lock is locked first and rq->lock second. In
> drm_sched_fini, on the other hand, I need to lock rq->lock first to
> iterate rq->entities and then lock s_entity->rq_lock for each entity
> to modify s_entity->stopped. I guess we could change s_entity->stopped
> to atomic?
Good point. But I think the memory barrier inserted by
wake_up_all(&sched->job_scheduled); should be sufficient.
If I see this correctly we have the entity->rq_lock mainly to protect
concurrent changes of entity->rq.
But when two CPUs both set entity->stopped to true at the same time we
don't really care about it as long as drm_sched_entity_is_idle() sees it.
Regards,
Christian.
>
> Andrey
>
>
>>
>> Christian.
>>
>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>> + }
>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>> +
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>>>> this scheduler */
>>>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>>>> +
>>>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>>>> structures */
>>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>>
>>>>>>>
>>>>>
>>
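Christian's suggestion in this exchange — leave the entities on their run queues and only flip entity->stopped — sidesteps the lock-order inversion entirely, since the fini path would then only need rq->lock. A rough standalone sketch of that idea; the structs and names (entity, sched_rq, sched_fini_mark_stopped, run_demo) are toy stand-ins invented here, not the real drm_sched API:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the drm_sched structures; names are invented. */
struct entity {
	pthread_mutex_t rq_lock;   /* entity->rq_lock: NOT taken below */
	bool stopped;
	struct entity *next;       /* rq->entities list, flattened */
};

struct sched_rq {
	pthread_mutex_t lock;      /* rq->lock */
	struct entity *entities;
};

/* Sketch of the suggested fini variant: keep every entity on the rq and
 * only mark it stopped.  The flag flips false -> true exactly once and
 * entity->rq_lock is never needed, so there is no inverted lock order
 * against the push_job()/flush() paths. */
static void sched_fini_mark_stopped(struct sched_rq *rq)
{
	pthread_mutex_lock(&rq->lock);
	for (struct entity *e = rq->entities; e; e = e->next)
		e->stopped = true;
	pthread_mutex_unlock(&rq->lock);
	/* here the real code would wake_up_all(&sched->job_scheduled);
	 * the barrier in the wakeup publishes the flag to the waiters */
}

int run_demo(void)
{
	struct entity a = { PTHREAD_MUTEX_INITIALIZER, false, NULL };
	struct entity b = { PTHREAD_MUTEX_INITIALIZER, false, &a };
	struct sched_rq rq = { PTHREAD_MUTEX_INITIALIZER, &b };

	sched_fini_mark_stopped(&rq);
	return (a.stopped && b.stopped) ? 0 : 1;
}
```

Built with -pthread, run_demo() returns 0 once every listed entity is marked stopped; in the kernel, the subsequent wake_up_all() supplies the barrier that makes the flag visible to waiters.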
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-25 18:42 ` Christian König
@ 2021-02-25 21:27 ` Andrey Grodzovsky
-1 siblings, 0 replies; 30+ messages in thread
From: Andrey Grodzovsky @ 2021-02-25 21:27 UTC (permalink / raw)
To: Christian König, dri-devel, amd-gfx
On 2021-02-25 1:42 p.m., Christian König wrote:
>
>
> Am 25.02.21 um 17:03 schrieb Andrey Grodzovsky:
>>
>> On 2021-02-25 2:53 a.m., Christian König wrote:
>>> Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
>>>> Ping
>>>
>>> Sorry, I've been on vacation this week.
>>>
>>>>
>>>> Andrey
>>>>
>>>> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>>>>
>>>>> On 2/20/21 3:38 AM, Christian König wrote:
>>>>>>
>>>>>>
>>>>>> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>>>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>>>>>>
>>>>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>>>>>>> Problem: If scheduler is already stopped by the time
>>>>>>>>>>> sched_entity
>>>>>>>>>>> is released and the entity's job_queue is not empty, I encountered
>>>>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>>>> never becomes true.
>>>>>>>>>>>
>>>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>>>> Also wake up all those processes stuck in sched_entity flushing,
>>>>>>>>>>> as the scheduler main thread which wakes them up is stopped
>>>>>>>>>>> by now.
>>>>>>>>>>>
>>>>>>>>>>> v2:
>>>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>>>>> to race.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>>> */
>>>>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>>> {
>>>>>>>>>>> + int i;
>>>>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>>>>
>>>>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>>>>
>>>>>>>>>>> if (sched->thread)
>>>>>>>>>>> kthread_stop(sched->thread);
>>>>>>>>>>> + /* Detach all sched_entities from this scheduler once
>>>>>>>>>>> it's stopped */
>>>>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>>>> +
>>>>>>>>>>> + if (!rq)
>>>>>>>>>>> + continue;
>>>>>>>>>>> +
>>>>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>>> + while ((s_entity =
>>>>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>>>>> + struct drm_sched_entity,
>>>>>>>>>>> + list))) {
>>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>>> +
>>>>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>>>>> + s_entity->stopped = true;
>>>>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>>>>
>>>>>>>>>> Well this spin_unlock/lock dance here doesn't look correct at
>>>>>>>>>> all now.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In what way? It's in the same order as in the other call
>>>>>>>>> sites (see drm_sched_entity_push_job and drm_sched_entity_flush).
>>>>>>>>> If I just locked rq->lock and did list_for_each_entry_safe
>>>>>>>>> while manually deleting entity->list instead of calling
>>>>>>>>> drm_sched_rq_remove_entity this still would not be possible as
>>>>>>>>> the order of lock acquisition between s_entity->rq_lock
>>>>>>>>> and rq->lock would be reverse compared to the call sites
>>>>>>>>> mentioned above.
>>>>>>>>
>>>>>>>> Ah, now I understand. You need this because
>>>>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>>>>
>>>>>>>> Problem is now what prevents the entity from being destroyed
>>>>>>>> while you remove it?
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>
>>>>>>> Right, well, since (unfortunately) sched_entity is part of
>>>>>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>>>>>> there is a problem here that we don't increment
>>>>>>> amdgpu_ctx.refcount when assigning sched_entity
>>>>>>> to new rq (e.g. before drm_sched_rq_add_entity) and don't
>>>>>>> decrement it before removing. We do it for
>>>>>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init
>>>>>>> and amdgpu_cs_parser_fini by
>>>>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit
>>>>>>> tricky due to all the drm_sched_entity_select_rq
>>>>>>> logic.
>>>>>>>
>>>>>>> Another, kind of a band aid fix, would probably be just locking
>>>>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>>>>> when finalizing the fence driver and around idr iteration in
>>>>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>>>>> anyway as I see from other idr usages in the code) ... This
>>>>>>> should prevent this use after free.
>>>>>>
>>>>>> Puh, that's rather complicated as well. Ok let's look at it from
>>>>>> the other side for a moment.
>>>>>>
>>>>>> Why do we have to remove the entities from the rq in the first
>>>>>> place?
>>>>>>
>>>>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>>>>
>>>>>> Christian.
>>>>>
>>>>>
>>>>> And adding it as another condition in drm_sched_entity_is_idle ?
>>>
>>> Yes, I think that should work.
>>
>>
>> In this case a reverse locking order is created: in
>> drm_sched_entity_push_job and drm_sched_entity_flush,
>> entity->rq_lock is locked first and rq->lock second. In
>> drm_sched_fini, on the other hand, I need to lock rq->lock first to
>> iterate rq->entities and then lock s_entity->rq_lock for each entity
>> to modify s_entity->stopped. I guess we could change
>> s_entity->stopped to atomic?
>
> Good point. But I think the memory barrier inserted by
> wake_up_all(&sched->job_scheduled); should be sufficient.
>
> If I see this correctly we have the entity->rq_lock mainly to protect
> concurrent changes of entity->rq.
>
> But when two CPUs both set entity->stopped to true at the same time we
> don't really care about it as long as drm_sched_entity_is_idle() sees it.
>
> Regards,
> Christian.
I was thinking more about the integrity of reading/writing entity->stopped
from different threads. I guess since it's a bool it's guaranteed to be
atomic from a HW perspective anyway?
Will send V3 soon.
Andrey
>
>>
>> Andrey
>>
>>
>>>
>>> Christian.
>>>
>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>>> + }
>>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>>> +
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>>>>> this scheduler */
>>>>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>>>>> +
>>>>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>>>>> structures */
>>>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>
>
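The unlock/lock dance debated in this thread — dropping rq->lock before a helper that re-takes it, then re-acquiring it to look at the next head entry — can be modeled outside the kernel like this (a toy list and mutex, not the real list.h/drm_sched code; all names here are invented):

```c
#include <pthread.h>
#include <stddef.h>

/* Toy model of the v2 drain loop: the helper takes rq->lock itself (as
 * drm_sched_rq_remove_entity does), so the caller must drop the lock
 * before calling it and re-take it to read the new list head. */

struct entity { struct entity *next; };

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entity *entities;	/* stand-in for rq->entities */

static void rq_remove_entity(struct entity *e)
{
	pthread_mutex_lock(&rq_lock);	/* helper grabs the lock itself */
	entities = e->next;		/* e is always the current head here */
	pthread_mutex_unlock(&rq_lock);
}

static int drain(void)
{
	struct entity *e;
	int removed = 0;

	pthread_mutex_lock(&rq_lock);
	while ((e = entities) != NULL) {
		pthread_mutex_unlock(&rq_lock);	/* drop before the helper */
		rq_remove_entity(e);
		removed++;
		pthread_mutex_lock(&rq_lock);	/* re-take, re-read the head */
	}
	pthread_mutex_unlock(&rq_lock);
	return removed;
}

int run_demo(void)
{
	struct entity c = { NULL }, b = { &c }, a = { &b };

	entities = &a;
	return drain();		/* drains all 3 entities */
}
```

run_demo() drains a three-entity list and returns 3; the loop always re-reads the list head after re-locking rather than using a _safe iterator, precisely because the removal helper takes the lock itself.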
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
2021-02-25 21:27 ` Andrey Grodzovsky
@ 2021-02-26 8:01 ` Christian König
-1 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2021-02-26 8:01 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 25.02.21 um 22:27 schrieb Andrey Grodzovsky:
>
> On 2021-02-25 1:42 p.m., Christian König wrote:
>>
>>
>> Am 25.02.21 um 17:03 schrieb Andrey Grodzovsky:
>>>
>>> On 2021-02-25 2:53 a.m., Christian König wrote:
>>>> Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
>>>>> Ping
>>>>
>>>> Sorry, I've been on vacation this week.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>>>>>
>>>>>> On 2/20/21 3:38 AM, Christian König wrote:
>>>>>>>
>>>>>>>
>>>>>>> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>>>>>>>
>>>>>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>>>>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>>>>>>>
>>>>>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Problem: If scheduler is already stopped by the time
>>>>>>>>>>>> sched_entity
>>>>>>>>>>>> is released and the entity's job_queue is not empty, I encountered
>>>>>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>>>>> never becomes true.
>>>>>>>>>>>>
>>>>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>>>>> Also wake up all those processes stuck in sched_entity flushing,
>>>>>>>>>>>> as the scheduler main thread which wakes them up is stopped
>>>>>>>>>>>> by now.
>>>>>>>>>>>>
>>>>>>>>>>>> v2:
>>>>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>>>>>> to race.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>>>> */
>>>>>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>> {
>>>>>>>>>>>> + int i;
>>>>>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>>>>>
>>>>>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>>>>>
>>>>>>>>>>>> if (sched->thread)
>>>>>>>>>>>> kthread_stop(sched->thread);
>>>>>>>>>>>> + /* Detach all sched_entities from this scheduler once
>>>>>>>>>>>> it's stopped */
>>>>>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (!rq)
>>>>>>>>>>>> + continue;
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>>>> + while ((s_entity =
>>>>>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>>>>>> + struct drm_sched_entity,
>>>>>>>>>>>> + list))) {
>>>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>>>>>> + s_entity->stopped = true;
>>>>>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>>>>>
>>>>>>>>>>> Well this spin_unlock/lock dance here doesn't look correct
>>>>>>>>>>> at all now.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In what way? It's in the same order as in the other call
>>>>>>>>>> sites (see drm_sched_entity_push_job and
>>>>>>>>>> drm_sched_entity_flush).
>>>>>>>>>> If I just locked rq->lock and did list_for_each_entry_safe
>>>>>>>>>> while manually deleting entity->list instead of calling
>>>>>>>>>> drm_sched_rq_remove_entity this still would not be possible
>>>>>>>>>> as the order of lock acquisition between s_entity->rq_lock
>>>>>>>>>> and rq->lock would be reverse compared to the call sites
>>>>>>>>>> mentioned above.
>>>>>>>>>
>>>>>>>>> Ah, now I understand. You need this because
>>>>>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>>>>>
>>>>>>>>> Problem is now what prevents the entity from being destroyed
>>>>>>>>> while you remove it?
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Right, well, since (unfortunately) sched_entity is part of
>>>>>>>> amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
>>>>>>>> there is a problem here that we don't increment
>>>>>>>> amdgpu_ctx.refcount when assigning sched_entity
>>>>>>>> to new rq (e.g. before drm_sched_rq_add_entity) and don't
>>>>>>>> decrement it before removing. We do it for
>>>>>>>> amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init
>>>>>>>> and amdgpu_cs_parser_fini by
>>>>>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a
>>>>>>>> bit tricky due to all the drm_sched_entity_select_rq
>>>>>>>> logic.
>>>>>>>>
>>>>>>>> Another, kind of a band aid fix, would probably be just locking
>>>>>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>>>>>> when finalizing the fence driver and around idr iteration in
>>>>>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>>>>>> anyway as I see from other idr usages in the code) ... This
>>>>>>>> should prevent this use after free.
>>>>>>>
>>>>>>> Puh, that's rather complicated as well. Ok let's look at it from
>>>>>>> the other side for a moment.
>>>>>>>
>>>>>>> Why do we have to remove the entities from the rq in the first
>>>>>>> place?
>>>>>>>
>>>>>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>>
>>>>>> And adding it as another condition in drm_sched_entity_is_idle ?
>>>>
>>>> Yes, I think that should work.
>>>
>>>
>>> In this case a reverse locking order is created: in
>>> drm_sched_entity_push_job and drm_sched_entity_flush,
>>> entity->rq_lock is locked first and rq->lock second. In
>>> drm_sched_fini, on the other hand, I need to lock rq->lock first to
>>> iterate rq->entities and then lock s_entity->rq_lock for each entity
>>> to modify s_entity->stopped. I guess we could change
>>> s_entity->stopped to atomic?
>>
>> Good point. But I think the memory barrier inserted by
>> wake_up_all(&sched->job_scheduled); should be sufficient.
>>
>> If I see this correctly we have the entity->rq_lock mainly to protect
>> concurrent changes of entity->rq.
>>
>> But when two CPUs both set entity->stopped to true at the same time
>> we don't really care about it as long as drm_sched_entity_is_idle() sees
>> it.
>>
>> Regards,
>> Christian.
>
>
> I was thinking more about the integrity of reading/writing entity->stopped
> from different threads. I guess since it's a bool it's guaranteed to be
> atomic from a HW perspective anyway?
More or less yes. The key point is that we only change it from false ->
true and never the other way around. All that you need then for other
CPUs to see the value is a barrier.
Christian.
> Will send V3 soon.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>>>> + }
>>>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>>>> +
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>>>>>> this scheduler */
>>>>>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>>>>>> +
>>>>>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>>>>>> structures */
>>>>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>
>>
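Christian's point that a false-to-true-only bool plus the barrier in wake_up_all() is enough can be illustrated with a small pthread model of drm_sched_entity_flush() (a hypothetical miniature, not kernel code; the condition variable stands in for the job_scheduled wait queue, and all names are invented):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Miniature of the flush path: a flusher sleeps on "job_scheduled" until
 * the entity is idle.  With the scheduler thread already stopped, only
 * the fini path's stopped = true + wake_up_all() equivalent releases it. */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t job_scheduled = PTHREAD_COND_INITIALIZER;
static bool stopped;		/* entity->stopped: flips false -> true once */
static int job_queue = 3;	/* pretend jobs are still queued */

static bool entity_is_idle(void)
{
	/* proposed extra condition: idle when drained OR marked stopped */
	return job_queue == 0 || stopped;
}

static void *flusher(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!entity_is_idle())
		pthread_cond_wait(&job_scheduled, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int run_demo(void)
{
	pthread_t t;

	pthread_create(&t, NULL, flusher, NULL);

	pthread_mutex_lock(&lock);
	stopped = true;				/* the fini-path part */
	pthread_cond_broadcast(&job_scheduled);	/* the wake_up_all() part */
	pthread_mutex_unlock(&lock);

	pthread_join(t, NULL);	/* returns instead of hanging forever */
	return stopped ? 0 : 1;
}
```

The flusher blocks until entity_is_idle() holds; setting stopped under the lock and broadcasting releases it, mirroring how the fix unblocks processes stuck in sched_entity flushing once the scheduler thread is gone.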
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
@ 2021-02-26 8:01 ` Christian König
0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2021-02-26 8:01 UTC (permalink / raw)
To: Andrey Grodzovsky, dri-devel, amd-gfx
Am 25.02.21 um 22:27 schrieb Andrey Grodzovsky:
>
> On 2021-02-25 1:42 p.m., Christian König wrote:
>>
>>
>> Am 25.02.21 um 17:03 schrieb Andrey Grodzovsky:
>>>
>>> On 2021-02-25 2:53 a.m., Christian König wrote:
>>>> Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:
>>>>> Ping
>>>>
>>>> Sorry, I've been on vacation this week.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:
>>>>>>
>>>>>> On 2/20/21 3:38 AM, Christian König wrote:
>>>>>>>
>>>>>>>
>>>>>>> Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:
>>>>>>>>
>>>>>>>> On 2/18/21 10:15 AM, Christian König wrote:
>>>>>>>>> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>>>>>>>>>
>>>>>>>>>> On 2/18/21 3:07 AM, Christian König wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Problem: If scheduler is already stopped by the time
>>>>>>>>>>>> sched_entity
>>>>>>>>>>>> is released and entity's job_queue not empty I encountred
>>>>>>>>>>>> a hang in drm_sched_entity_flush. This is because
>>>>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>>>>> never becomes false.
>>>>>>>>>>>>
>>>>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>>>>> scheduler's run queues. This will satisfy
>>>>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>>>>>> as the scheduler main thread which wakes them up is stopped
>>>>>>>>>>>> by now.
>>>>>>>>>>>>
>>>>>>>>>>>> v2:
>>>>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>>>>> s_entity as stopped to prevent reinserion back to rq due
>>>>>>>>>>>> to race.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 31
>>>>>>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>>>>>>> 1 file changed, 31 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> index 908b0b5..c6b7947 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>>>> */
>>>>>>>>>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>> {
>>>>>>>>>>>> + int i;
>>>>>>>>>>>> + struct drm_sched_entity *s_entity;
>>>>>>>>>>>
>>>>>>>>>>> BTW: Please order that so that i is declared last.
>>>>>>>>>>>
>>>>>>>>>>>> if (sched->thread)
>>>>>>>>>>>> kthread_stop(sched->thread);
>>>>>>>>>>>> + /* Detach all sched_entites from this scheduler once
>>>>>>>>>>>> it's stopped */
>>>>>>>>>>>> + for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>>>>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>>>> + struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (!rq)
>>>>>>>>>>>> + continue;
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* Loop this way because rq->lock is taken in
>>>>>>>>>>>> drm_sched_rq_remove_entity */
>>>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>>>> + while ((s_entity =
>>>>>>>>>>>> list_first_entry_or_null(&rq->entities,
>>>>>>>>>>>> + struct drm_sched_entity,
>>>>>>>>>>>> + list))) {
>>>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* Prevent reinsertion and remove */
>>>>>>>>>>>> + spin_lock(&s_entity->rq_lock);
>>>>>>>>>>>> + s_entity->stopped = true;
>>>>>>>>>>>> + drm_sched_rq_remove_entity(rq, s_entity);
>>>>>>>>>>>> + spin_unlock(&s_entity->rq_lock);
>>>>>>>>>>>
>>>>>>>>>>> Well this spin_unlock/lock dance here doesn't look correct
>>>>>>>>>>> at all now.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In what way? It's in the same order as at the other call
>>>>>>>>>> sites (see drm_sched_entity_push_job and
>>>>>>>>>> drm_sched_entity_flush).
>>>>>>>>>> If I just locked rq->lock and did list_for_each_entry_safe
>>>>>>>>>> while manually deleting entity->list instead of calling
>>>>>>>>>> drm_sched_rq_remove_entity, this still would not work,
>>>>>>>>>> as the order of lock acquisition between s_entity->rq_lock
>>>>>>>>>> and rq->lock would be reversed compared to the call sites
>>>>>>>>>> mentioned above.
>>>>>>>>>
>>>>>>>>> Ah, now I understand. You need this because
>>>>>>>>> drm_sched_rq_remove_entity() will grab the rq lock again!
>>>>>>>>>
>>>>>>>>> Problem is now what prevents the entity from being destroyed
>>>>>>>>> while you remove it?
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Right, well, since (unfortunately) sched_entity is part of
>>>>>>>> amdgpu_ctx_entity, and amdgpu_ctx_entity is refcounted,
>>>>>>>> there is a problem here in that we don't increment
>>>>>>>> amdgpu_ctx.refcount when assigning a sched_entity
>>>>>>>> to a new rq (e.g. before drm_sched_rq_add_entity) and don't
>>>>>>>> decrement it before removing. We do it for
>>>>>>>> amdgpu_cs_parser.entity, for example (in amdgpu_cs_parser_init
>>>>>>>> and amdgpu_cs_parser_fini, by
>>>>>>>> calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a
>>>>>>>> bit tricky due to all the drm_sched_entity_select_rq
>>>>>>>> logic.
>>>>>>>>
>>>>>>>> Another, kind of band-aid, fix would probably be just locking
>>>>>>>> amdgpu_ctx_mgr.lock around drm_sched_fini
>>>>>>>> when finalizing the fence driver and around the idr iteration in
>>>>>>>> amdgpu_ctx_mgr_fini (which should be lock protected
>>>>>>>> anyway, as I see from other idr usages in the code)... This
>>>>>>>> should prevent this use-after-free.
>>>>>>>
>>>>>>> Puh, that's rather complicated as well. Ok let's look at it from
>>>>>>> the other side for a moment.
>>>>>>>
>>>>>>> Why do we have to remove the entities from the rq in the first
>>>>>>> place?
>>>>>>>
>>>>>>> Wouldn't it be sufficient to just set all of them to stopped?
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>>
>>>>>> And adding it as another condition in drm_sched_entity_is_idle ?
>>>>
>>>> Yes, I think that should work.
>>>
>>>
>>> In this case a reverse locking order is created: in
>>> drm_sched_entity_push_job and drm_sched_entity_flush,
>>> entity->rq_lock is locked first and rq->lock second. In
>>> drm_sched_fini, on the other hand, I need to lock rq->lock first to
>>> iterate rq->entities and then lock s_entity->rq_lock for each entity
>>> to modify s_entity->stopped. I guess we could change
>>> s_entity->stopped to an atomic?
>>
>> Good point. But I think the memory barrier inserted by
>> wake_up_all(&sched->job_scheduled); should be sufficient.
>>
>> If I see this correctly we have the entity->rq_lock mainly to protect
>> concurrent changes of entity->rq.
>>
>> But when two CPUs both set entity->stopped to true at the same time
>> we don't really care about it as long drm_sched_entity_is_idle() sees
>> it.
>>
>> Regards,
>> Christian.
>
>
> I was more thinking about the integrity of reading/writing
> entity->stopped from different threads. I guess since it's a bool it's
> guaranteed to be atomic from a HW perspective anyway?
More or less yes. The key point is that we only change it from false ->
true and never the other way around. All that you need then for other
CPUs to see the value is a barrier.
Christian.
> Will send V3 soon.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> + spin_lock(&rq->lock);
>>>>>>>>>>>> + }
>>>>>>>>>>>> + spin_unlock(&rq->lock);
>>>>>>>>>>>> +
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* Wakeup everyone stuck in drm_sched_entity_flush for
>>>>>>>>>>>> this scheduler */
>>>>>>>>>>>> + wake_up_all(&sched->job_scheduled);
>>>>>>>>>>>> +
>>>>>>>>>>>> /* Confirm no work left behind accessing device
>>>>>>>>>>>> structures */
>>>>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Thread overview: 30+ messages
2021-02-17 21:59 [PATCH v2] drm/scheduler: Fix hang when sched_entity released Andrey Grodzovsky
2021-02-18 8:07 ` Christian König
2021-02-18 15:05 ` Andrey Grodzovsky
2021-02-18 15:15 ` Christian König
2021-02-18 16:41 ` Andrey Grodzovsky
2021-02-19 19:17 ` Andrey Grodzovsky
2021-02-20 8:38 ` Christian König
2021-02-20 12:12 ` Andrey Grodzovsky
2021-02-22 13:35 ` Andrey Grodzovsky
2021-02-24 15:13 ` Andrey Grodzovsky
2021-02-25 7:53 ` Christian König
2021-02-25 16:03 ` Andrey Grodzovsky
2021-02-25 18:42 ` Christian König
2021-02-25 21:27 ` Andrey Grodzovsky
2021-02-26 8:01 ` Christian König