From: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released
Date: Thu, 18 Feb 2021 11:41:59 -0500	[thread overview]
Message-ID: <74c4a9e1-f1e0-03e5-3c99-755f3cf1b60f@amd.com> (raw)
In-Reply-To: <abcb8ea9-fcb4-a781-bf87-d12f3910e484@gmail.com>


On 2/18/21 10:15 AM, Christian König wrote:
> Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:
>>
>> On 2/18/21 3:07 AM, Christian König wrote:
>>>
>>>
>>> Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
>>>> Problem: If the scheduler is already stopped by the time sched_entity
>>>> is released and the entity's job_queue is not empty, I encountered
>>>> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
>>>> never becomes true.
>>>>
>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>
>>>> v2:
>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>> to race.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
>>>>   1 file changed, 31 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 908b0b5..c6b7947 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>    */
>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>   {
>>>> +    int i;
>>>> +    struct drm_sched_entity *s_entity;
>>>
>>> BTW: Please order that so that i is declared last.
>>>
>>>>       if (sched->thread)
>>>>           kthread_stop(sched->thread);
>>>>
>>>> +    /* Detach all sched_entities from this scheduler once it's stopped */
>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>> +
>>>> +        if (!rq)
>>>> +            continue;
>>>> +
>>>> +        /* Loop this way because rq->lock is taken in drm_sched_rq_remove_entity */
>>>> +        spin_lock(&rq->lock);
>>>> +        while ((s_entity = list_first_entry_or_null(&rq->entities,
>>>> +                                struct drm_sched_entity,
>>>> +                                list))) {
>>>> +            spin_unlock(&rq->lock);
>>>> +
>>>> +            /* Prevent reinsertion and remove */
>>>> +            spin_lock(&s_entity->rq_lock);
>>>> +            s_entity->stopped = true;
>>>> +            drm_sched_rq_remove_entity(rq, s_entity);
>>>> +            spin_unlock(&s_entity->rq_lock);
>>>
>>> Well this spin_unlock/lock dance here doesn't look correct at all now.
>>>
>>> Christian.
>>
>>
>> In what way? It's in the same order as in other call sites (see
>> drm_sched_entity_push_job and drm_sched_entity_flush).
>> If I just locked rq->lock and did list_for_each_entry_safe while manually
>> deleting entity->list instead of calling drm_sched_rq_remove_entity, this
>> still would not be possible, as the order of lock acquisition between
>> s_entity->rq_lock and rq->lock would be reversed compared to the call
>> sites mentioned above.
>
> Ah, now I understand. You need this because drm_sched_rq_remove_entity() will 
> grab the rq lock again!
>
> Problem is now what prevents the entity from being destroyed while you remove it?
>
> Christian.
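
To make the locking discussion above concrete, here is a minimal userspace
sketch of the drain loop, assuming pthread mutexes in place of kernel
spinlocks. The types struct runqueue and struct entity and the helpers
rq_add_entity, rq_remove_entity and rq_detach_all are hypothetical stand-ins
and not the scheduler's real API. It only shows why the list lock is dropped
before the per-entity lock is taken; it does not address the lifetime
question raised above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for drm_sched_entity and drm_sched_rq. */
struct entity {
	pthread_mutex_t rq_lock;	/* plays the role of s_entity->rq_lock */
	bool stopped;
	bool attached;
	struct entity *next;
};

struct runqueue {
	pthread_mutex_t lock;		/* plays the role of rq->lock */
	struct entity *head;		/* attached entities */
};

static void rq_add_entity(struct runqueue *rq, struct entity *e)
{
	pthread_mutex_lock(&rq->lock);
	e->next = rq->head;
	rq->head = e;
	e->attached = true;
	pthread_mutex_unlock(&rq->lock);
}

/* Takes rq->lock itself, as drm_sched_rq_remove_entity is said to do. */
static void rq_remove_entity(struct runqueue *rq, struct entity *e)
{
	struct entity **pp;

	pthread_mutex_lock(&rq->lock);
	for (pp = &rq->head; *pp; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			e->next = NULL;
			e->attached = false;
			break;
		}
	}
	pthread_mutex_unlock(&rq->lock);
}

/*
 * Detach every entity. Lock order is entity lock first, then list lock,
 * so the list lock must be dropped before touching the entity.
 * Returns the number of loop iterations (entities visited).
 */
static int rq_detach_all(struct runqueue *rq)
{
	struct entity *e;
	int n = 0;

	pthread_mutex_lock(&rq->lock);
	while ((e = rq->head) != NULL) {
		pthread_mutex_unlock(&rq->lock);

		/* Window: nothing here keeps 'e' alive; see the question above. */
		pthread_mutex_lock(&e->rq_lock);
		e->stopped = true;		/* prevent reinsertion */
		rq_remove_entity(rq, e);	/* re-takes rq->lock */
		pthread_mutex_unlock(&e->rq_lock);
		n++;

		pthread_mutex_lock(&rq->lock);
	}
	pthread_mutex_unlock(&rq->lock);
	return n;
}
```

Holding rq->lock across the whole loop would either self-deadlock inside
rq_remove_entity or force the reversed lock order described above.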

Right. Well, since (unfortunately) sched_entity is part of amdgpu_ctx_entity,
and amdgpu_ctx_entity is refcounted, the problem here is that we don't
increment amdgpu_ctx.refcount when assigning a sched_entity to a new rq
(e.g. before drm_sched_rq_add_entity) and don't decrement it before removing.
We do this for amdgpu_cs_parser.entity, for example (in amdgpu_cs_parser_init
and amdgpu_cs_parser_fini, by calling amdgpu_ctx_get and amdgpu_ctx_put).
But doing the same here seems a bit tricky due to all the
drm_sched_entity_select_rq logic.
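
As a rough illustration of the refcounting idea, here is a userspace sketch
assuming C11 atomics. struct ctx, ctx_create, ctx_get, ctx_put, rq_attach and
rq_detach are hypothetical names chosen for the example; they are not the
amdgpu functions. The freed_flag pointer exists only so the example can
observe the release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted context that embeds an entity. */
struct ctx {
	atomic_int refcount;
	bool *freed_flag;	/* set on release, for observation only */
};

static struct ctx *ctx_create(bool *freed_flag)
{
	struct ctx *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	atomic_init(&c->refcount, 1);	/* creator holds the first reference */
	c->freed_flag = freed_flag;
	return c;
}

static void ctx_get(struct ctx *c)
{
	atomic_fetch_add(&c->refcount, 1);
}

static void ctx_put(struct ctx *c)
{
	/* fetch_sub returns the old value; 1 means this was the last ref. */
	if (atomic_fetch_sub(&c->refcount, 1) == 1) {
		*c->freed_flag = true;
		free(c);
	}
}

/* Attaching to a run queue pins the context; detaching unpins it. */
static void rq_attach(struct ctx *c)
{
	ctx_get(c);
}

static void rq_detach(struct ctx *c)
{
	ctx_put(c);
}
```

With this pairing, a context whose owner has already dropped its reference
stays alive until the run queue side calls rq_detach.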

Another fix, kind of a band-aid, would probably be to just take
amdgpu_ctx_mgr.lock around drm_sched_fini when finalizing the fence driver,
and around the idr iteration in amdgpu_ctx_mgr_fini (which should be lock
protected anyway, as I see from other idr usages in the code). This should
prevent this use-after-free.
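
A sketch of that band-aid, again in userspace C with a pthread mutex and
under stated assumptions: struct ctx_mgr, its fixed slots array (standing in
for the idr), and the helpers mgr_init, mgr_add, sched_fini_locked and
mgr_fini are all hypothetical. The point is only that both teardown paths
serialize on the same manager lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CTX 8

struct ctx {
	bool on_rq;		/* entity still attached to a run queue */
};

/* Hypothetical manager; the array stands in for the idr. */
struct ctx_mgr {
	pthread_mutex_t lock;
	struct ctx *slots[MAX_CTX];
};

static void mgr_init(struct ctx_mgr *m)
{
	pthread_mutex_init(&m->lock, NULL);
	for (size_t i = 0; i < MAX_CTX; i++)
		m->slots[i] = NULL;
}

/* Returns the slot index used, or -1 if full or out of memory. */
static int mgr_add(struct ctx_mgr *m)
{
	int ret = -1;

	pthread_mutex_lock(&m->lock);
	for (int i = 0; i < MAX_CTX; i++) {
		if (m->slots[i])
			continue;
		m->slots[i] = malloc(sizeof(struct ctx));
		if (m->slots[i]) {
			m->slots[i]->on_rq = true;
			ret = i;
		}
		break;
	}
	pthread_mutex_unlock(&m->lock);
	return ret;
}

/* Scheduler teardown path: detach entities while holding the manager lock. */
static int sched_fini_locked(struct ctx_mgr *m)
{
	int n = 0;

	pthread_mutex_lock(&m->lock);
	for (int i = 0; i < MAX_CTX; i++) {
		if (m->slots[i] && m->slots[i]->on_rq) {
			m->slots[i]->on_rq = false;
			n++;
		}
	}
	pthread_mutex_unlock(&m->lock);
	return n;
}

/* Manager teardown path: free contexts under the same lock. */
static int mgr_fini(struct ctx_mgr *m)
{
	int n = 0;

	pthread_mutex_lock(&m->lock);
	for (int i = 0; i < MAX_CTX; i++) {
		if (m->slots[i]) {
			free(m->slots[i]);
			m->slots[i] = NULL;
			n++;
		}
	}
	pthread_mutex_unlock(&m->lock);
	return n;
}
```

Because both paths take the same lock, one cannot free a context while the
other is still walking it.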

Andrey


>
>>
>> Andrey
>>
>>
>>
>>>
>>>> +
>>>> +            spin_lock(&rq->lock);
>>>> +        }
>>>> +        spin_unlock(&rq->lock);
>>>> +
>>>> +    }
>>>> +
>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>>>> +    wake_up_all(&sched->job_scheduled);
>>>> +
>>>>       /* Confirm no work left behind accessing device structures */
>>>>       cancel_delayed_work_sync(&sched->work_tdr);
>>>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


