From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Cc: jingwen.chen2@amd.com, monk.liu@amd.com, yiqing.yao@amd.com
Subject: Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
Date: Wed, 22 Jun 2022 13:19:17 -0400	[thread overview]
Message-ID: <76e95c1d-83a0-2cd4-518b-7b769e963d13@amd.com> (raw)
In-Reply-To: <b011246b-fe55-0385-a64b-585c0baa2863@amd.com>


On 2022-06-22 03:17, Christian König wrote:
> Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
>>
>> On 2022-06-21 03:28, Christian König wrote:
>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>> Align refcount behaviour for amdgpu_job embedded HW fence with
>>>> classic pointer style HW fences by increasing refcount each
>>>> time emit is called so amdgpu code doesn't need to make workarounds
>>>> using amdgpu_job.job_run_counter to keep the HW fence refcount 
>>>> balanced.
>>>
>>> Could we now also remove job_run_counter?
>>>
>>> Christian.
>>
>>
>> I am afraid not. The job counter is still needed: the refcount on the
>> embedded fence can never be allowed to drop to zero, because that
>> would free the job itself before the end of its life cycle. In
>> amdgpu_fence_emit we have to differentiate between the first-ever
>> call, where we initialize the embedded fence's refcount from scratch
>> with kref_init, and repeated calls, where the refcount is already > 0
>> and we just take an extra reference to align the behaviour with the
>> pointer-style fences in other drivers.
>
> Well, what we should probably rather do is move the init out of emit 
> instead.
>
> The only downside I can see is that the sequence number isn't known at 
> initial init and so needs to be zero or something like that.
>
> Regards,
> Christian.


Not sure how that helps; the problem is not there but in amdgpu_job_run. 
For an embedded fence, a resubmitted job from the pending list causes 
amdgpu_job_run to be called twice, or even three times with the 
guilty-job recheck sequence. I am supposed to call dma_fence_init on the 
embedded HW fence only on the first call, while on the second and third 
calls I only update the seqno and take a reference. How can I 
differentiate between the first and subsequent calls without 
job_run_counter?

Andrey


>
>>
>> I guess we could assume that the embedded fence is all zeroes before 
>> the first dma_fence_init, provided the job itself was allocated with 
>> kzalloc, and then use dma_fence_ops == NULL (or maybe seqno == 0) as 
>> a hint for whether this is the first call, but that's a risky 
>> assumption in my opinion.
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Also, since in the previous patch we resumed setting 
>>>> s_fence->parent to NULL in drm_sched_stop, switch to directly 
>>>> checking whether job->hw_fence is signaled to short-circuit the 
>>>> reset if it has already signaled.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 
>>>> ++++++++++++++++------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>>>>   4 files changed, 25 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> index 513c57f839d8..447bd92c4856 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct 
>>>> amdgpu_device *adev,
>>>>           goto err_ib_sched;
>>>>       }
>>>>   +    /* Drop the initial kref_init count (see drm_sched_main as 
>>>> example) */
>>>> +    dma_fence_put(f);
>>>>       ret = dma_fence_wait(f, false);
>>>>     err_ib_sched:
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index c99541685804..f9718119834f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -5009,16 +5009,28 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>             /* clear job's guilty and depend the folowing step to 
>>>> decide the real one */
>>>>           drm_sched_reset_karma(s_job);
>>>> -        /* for the real bad job, it will be resubmitted twice, 
>>>> adding a dma_fence_get
>>>> -         * to make sure fence is balanced */
>>>> -        dma_fence_get(s_job->s_fence->parent);
>>>>           drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>   +        if (!s_job->s_fence->parent) {
>>>> +            DRM_WARN("Failed to get a HW fence for job!");
>>>> +            continue;
>>>> +        }
>>>> +
>>>>           ret = dma_fence_wait_timeout(s_job->s_fence->parent, 
>>>> false, ring->sched.timeout);
>>>>           if (ret == 0) { /* timeout */
>>>>               DRM_ERROR("Found the real bad job! ring:%s, 
>>>> job_id:%llx\n",
>>>>                           ring->sched.name, s_job->id);
>>>>   +
>>>> +            /* Clear this failed job from fence array */
>>>> +            amdgpu_fence_driver_clear_job_fences(ring);
>>>> +
>>>> +            /* Since the job won't signal and we go for
>>>> +             * another resubmit drop this parent pointer
>>>> +             */
>>>> +            dma_fence_put(s_job->s_fence->parent);
>>>> +            s_job->s_fence->parent = NULL;
>>>> +
>>>>               /* set guilty */
>>>>               drm_sched_increase_karma(s_job);
>>>>   retry:
>>>> @@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>             /* got the hw fence, signal finished fence */
>>>>           atomic_dec(ring->sched.score);
>>>> -        dma_fence_put(s_job->s_fence->parent);
>>>> dma_fence_get(&s_job->s_fence->finished);
>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>> dma_fence_put(&s_job->s_fence->finished);
>>>> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct 
>>>> amdgpu_device *adev,
>>>>        *
>>>>        * job->base holds a reference to parent fence
>>>>        */
>>>> -    if (job && job->base.s_fence->parent &&
>>>> - dma_fence_is_signaled(job->base.s_fence->parent)) {
>>>> +    if (job && (job->hw_fence.ops != NULL) &&
>>>> +        dma_fence_is_signaled(&job->hw_fence)) {
>>>>           job_signaled = true;
>>>>           dev_info(adev->dev, "Guilty job already signaled, 
>>>> skipping HW reset");
>>>>           goto skip_hw_reset;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index d6d54ba4c185..9bd4e18212fc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>> *ring, struct dma_fence **f, struct amd
>>>>       if (job && job->job_run_counter) {
>>>>           /* reinit seq for resubmitted jobs */
>>>>           fence->seqno = seq;
>>>> +        /* To be in line with external fence creation and other 
>>>> drivers */
>>>> +        dma_fence_get(fence);
>>>>       } else {
>>>> -        if (job)
>>>> +        if (job) {
>>>>               dma_fence_init(fence, &amdgpu_job_fence_ops,
>>>>                          &ring->fence_drv.lock,
>>>>                          adev->fence_context + ring->idx, seq);
>>>> +            /* Against remove in amdgpu_job_{free, free_cb} */
>>>> +            dma_fence_get(fence);
>>>> +        }
>>>>           else
>>>>               dma_fence_init(fence, &amdgpu_fence_ops,
>>>>                          &ring->fence_drv.lock,
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 58568fdde2d0..638e1d600258 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -267,10 +267,6 @@ static struct dma_fence *amdgpu_job_run(struct 
>>>> drm_sched_job *sched_job)
>>>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>>       }
>>>>   -    if (!job->job_run_counter)
>>>> -        dma_fence_get(fence);
>>>> -    else if (finished->error < 0)
>>>> -        dma_fence_put(&job->hw_fence);
>>>>       job->job_run_counter++;
>>>>       amdgpu_job_free_resources(job);
>>>
>
