From: "Christian König" <christian.koenig@amd.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	Jingwen Chen <Jingwen.Chen2@amd.com>,
	amd-gfx@lists.freedesktop.org
Cc: monk.liu@amd.com
Subject: Re: [PATCH] Revert "drm/scheduler: Avoid accessing freed bad job."
Date: Tue, 17 Aug 2021 15:43:58 +0200	[thread overview]
Message-ID: <2126dab8-3484-7fd6-b99e-b94e830fd50b@amd.com> (raw)
In-Reply-To: <e5db89e8-22b9-2539-7a53-4a10b751ed88@amd.com>



On 17.08.21 at 15:37, Andrey Grodzovsky wrote:
> On 2021-08-17 12:28 a.m., Jingwen Chen wrote:
>> [Why]
>> For a bailing job, this commit deletes it from the pending list, and
>> thus the bailing job never gets a chance to be resubmitted, even in
>> advanced TDR mode.
>>
>> [How]
>> After the hw_fence is embedded into amdgpu_job, the race condition
>> that this commit worked around is completely solved. So revert this
>> commit.
>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
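
For context: with the hw_fence embedded into amdgpu_job, the job's
lifetime becomes tied to the fence refcount, roughly like this
(untested sketch; the names approximate the embedded hw_fence series,
not the exact amdgpu code):

    struct amdgpu_job {
        struct drm_sched_job base;
        struct dma_fence hw_fence;      /* embedded, not a pointer */
        /* ... */
    };

    static void amdgpu_job_fence_release(struct dma_fence *f)
    {
        struct amdgpu_job *job = container_of(f, struct amdgpu_job,
                                              hw_fence);

        /* The job memory lives exactly as long as the fence refcount,
         * so anyone holding a fence reference also pins the job.
         * dma_fence memory must be freed RCU-delayed. */
        kfree_rcu(job, hw_fence.rcu);
    }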
>
>
> Can you please elaborate on how this solves the race?
> As far as I can see, with this patch reverted, in drm_sched_job_timedout
> you get a pointer to the next job to process in the timed-out handler.
> Immediately after that, this job actually finishes and its fence is
> signaled, which in turn triggers drm_sched_get_cleanup_job. That
> fetches this job and returns it to drm_sched_main, which in turn calls
> the free_job callback->...->amdgpu_fence_free, which frees the job from
> the HW dma_fence release callback. After that you proceed with a freed
> job in the timed-out handler.
>
> If you could take the HW fence reference in drm_sched_job_timedout
> before starting processing, then yes, I think it would work.

Yes, that's precisely what I had in mind as well, and it seems to be
missing from this patch.
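
Something like this in drm_sched_job_timedout (untested sketch; it
assumes the hw_fence is embedded in the job, so that the fence
reference also pins the job memory):

    job = list_first_entry_or_null(&sched->pending_list,
                       struct drm_sched_job, list);

    if (job) {
        /*
         * Hold a reference on the HW fence so that a concurrent
         * drm_sched_get_cleanup_job()/free_job() cannot release the
         * job while the timeout handler still dereferences it.
         */
        struct dma_fence *fence = dma_fence_get(job->s_fence->parent);

        status = job->sched->ops->timedout_job(job);

        /* Drop our reference once we are done with the job. */
        dma_fence_put(fence);
    }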

Regards,
Christian.

>
> Andrey
>
>
>>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 27 --------------------------
>>   1 file changed, 27 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a2a953693b45..31d1176d939f 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -317,21 +317,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>       enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>
>>       sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>> -
>> -    /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>> -    spin_lock(&sched->job_list_lock);
>>       job = list_first_entry_or_null(&sched->pending_list,
>>                          struct drm_sched_job, list);
>>
>>       if (job) {
>> -        /*
>> -         * Remove the bad job so it cannot be freed by concurrent
>> -         * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>> -         * is parked at which point it's safe.
>> -         */
>> -        list_del_init(&job->list);
>> -        spin_unlock(&sched->job_list_lock);
>> -
>>           status = job->sched->ops->timedout_job(job);
>>
>>           /*
>> @@ -342,8 +331,6 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>               job->sched->ops->free_job(job);
>>               sched->free_guilty = false;
>>           }
>> -    } else {
>> -        spin_unlock(&sched->job_list_lock);
>>       }
>>
>>       if (status != DRM_GPU_SCHED_STAT_ENODEV) {
>> @@ -392,20 +379,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>
>>       kthread_park(sched->thread);
>>
>> -    /*
>> -     * Reinsert back the bad job here - now it's safe as
>> -     * drm_sched_get_cleanup_job cannot race against us and release the
>> -     * bad job at this point - we parked (waited for) any in progress
>> -     * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>> -     * now until the scheduler thread is unparked.
>> -     */
>> -    if (bad && bad->sched == sched)
>> -        /*
>> -         * Add at the head of the queue to reflect it was the earliest
>> -         * job extracted.
>> -         */
>> -        list_add(&bad->list, &sched->pending_list);
>> -
>>       /*
>>        * Iterate the job list from later to  earlier one and either deactive
>>        * their HW callbacks or remove them from pending list if they already

