From: "Chen, Guchun" <Guchun.Chen@amd.com>
To: "Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Alex Deucher" <alexdeucher@gmail.com>,
	"Mike Lothian" <mike@fireburn.co.uk>,
	"Koenig,  Christian" <Christian.Koenig@amd.com>
Cc: amd-gfx list <amd-gfx@lists.freedesktop.org>,
	"Gao, Likun" <Likun.Gao@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>
Subject: RE: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)
Date: Fri, 27 Aug 2021 10:42:40 +0000
Message-ID: <DM5PR12MB2469B54B364C8F84681623CFF1C89@DM5PR12MB2469.namprd12.prod.outlook.com>
In-Reply-To: <2e3f376e-b88b-867e-2dec-06bbe0029d7b@amd.com>

[Public]

Hi Andrey and Christian,

I just sent out a new patch to address this; I am not sure if I understood your point correctly. Please review.

The patch stops the scheduler in fence_hw_fini and starts it again in fence_hw_init.
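
Roughly, the idea looks like the sketch below. This is only my illustration of the approach, not the patch itself: it assumes we can simply park/unpark the per-ring scheduler kthread via ring->sched.thread, and it reuses the existing hw_fini loop (ring->no_scheduler, AMDGPU_MAX_RINGS, the force_completion fallback) from the current code. The real patch may differ in details such as interrupt handling, so please check the patch itself.

#include <linux/kthread.h>

#include "amdgpu.h"

void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
{
	int i, r;

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->fence_drv.initialized)
			continue;

		/* Park the scheduler kthread first, so no new job can be
		 * pushed to the hardware while we tear it down.
		 * drm_sched_fini() itself stays in sw_fini.
		 */
		if (!ring->no_scheduler && ring->sched.thread)
			kthread_park(ring->sched.thread);

		/* You can't wait for HW to signal if it's gone */
		if (!drm_dev_is_unplugged(&adev->ddev))
			r = amdgpu_fence_wait_empty(ring);
		else
			r = -ENODEV;
		/* No need to trigger a GPU reset as we are unloading */
		if (r)
			amdgpu_fence_driver_force_completion(ring);
	}
}

void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
{
	int i;

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->fence_drv.initialized)
			continue;

		/* HW is usable again, let the scheduler submit once more */
		if (!ring->no_scheduler && ring->sched.thread)
			kthread_unpark(ring->sched.thread);
	}
}

For s3 this simply pairs up: hw_fini parks the schedulers on suspend, hw_init unparks them on resume, and drm_sched_fini only ever runs from sw_fini on driver unload.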

Regards,
Guchun

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Monday, August 23, 2021 10:42 PM
To: Christian König <ckoenig.leichtzumerken@gmail.com>; Chen, Guchun <Guchun.Chen@amd.com>; Alex Deucher <alexdeucher@gmail.com>; Mike Lothian <mike@fireburn.co.uk>; Koenig, Christian <Christian.Koenig@amd.com>
Cc: amd-gfx list <amd-gfx@lists.freedesktop.org>; Gao, Likun <Likun.Gao@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)


On 2021-08-23 2:50 a.m., Christian König wrote:
> Good morning, guys,
>
> Andrey has a rather valid concern here, but I think we need to
> approach this from a higher-level view.
>
> When hw_fini is called we should make sure that the scheduler can't
> submit any more work to the hardware, because the HW is finalized and
> not expected to respond any more.
>
> As far as I can see, the cleanest approach would be to stop the
> scheduler in hw_fini and fully clean it up in sw_fini. That would also
> fit quite nicely with how GPU reset is supposed to work, I think.
>
> The problem is that this is currently done outside of the fence code,
> at least for the reset case, so until we restructure that we need to
> stick with what we have.
>
> Andrey, do you think it would be a problem if we stop the scheduler
> manually in the hot plug case as well?


As long as it's 'parked' inside HW fini - meaning the thread submitting to HW is done - I think it should cover hot unplug as well.
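
To spell out why parking is sufficient (a hypothetical helper, purely for illustration - in real code this would live inside amdgpu_fence_driver_hw_fini itself): kthread_park() does not return until the scheduler's main thread has actually reached its parking point, so once a loop like the one below completes no scheduler thread can still be inside run_job() doing MMIO on the removed device.

#include <linux/kthread.h>

#include "amdgpu.h"

/* Hypothetical helper for illustration only; not actual amdgpu code. */
static void amdgpu_park_all_schedulers(struct amdgpu_device *adev)
{
	int i;

	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || ring->no_scheduler || !ring->sched.thread)
			continue;

		/* Blocks until the drm_sched kthread has parked */
		kthread_park(ring->sched.thread);
	}
}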

Andrey


>
> Thanks,
> Christian.
>
> On 23.08.21 at 08:36, Chen, Guchun wrote:
>> [Public]
>>
>> Hi Andrey,
>>
>> Thanks for your notice. The reason for moving drm_sched_fini to
>> sw_fini is that it is a SW behavior and part of SW shutdown, so
>> hw_fini should not touch it. But if the race is real - the scheduler
>> on the ring keeps submitting jobs, so the ring never becomes empty -
>> then we may still need to call drm_sched_fini in hw_fini to stop job
>> submission first.
>>
>> @Koenig, Christian what's your opinion?
>>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: Alex Deucher <alexdeucher@gmail.com>
>> Sent: Friday, August 20, 2021 2:13 AM
>> To: Mike Lothian <mike@fireburn.co.uk>
>> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Guchun 
>> <Guchun.Chen@amd.com>; amd-gfx list <amd-gfx@lists.freedesktop.org>; 
>> Gao, Likun <Likun.Gao@amd.com>; Koenig, Christian 
>> <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; 
>> Deucher, Alexander <Alexander.Deucher@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver 
>> fini in s3 test (v2)
>>
>> Please go ahead.  Thanks!
>>
>> Alex
>>
>> On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian <mike@fireburn.co.uk>
>> wrote:
>>> Hi
>>>
>>> Do I need to open a new bug report for this?
>>>
>>> Cheers
>>>
>>> Mike
>>>
>>> On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky 
>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>> On 2021-08-02 1:16 a.m., Guchun Chen wrote:
>>>>> In amdgpu_fence_driver_hw_fini, there is no need to call
>>>>> drm_sched_fini to stop the scheduler in the s3 test; otherwise,
>>>>> fence related failures show up after resume. To fix this, and for a
>>>>> cleaner tear down, move drm_sched_fini from fence_hw_fini to
>>>>> fence_sw_fini, as it is part of driver shutdown and should never be
>>>>> called from hw_fini.
>>>>>
>>>>> v2: rename amdgpu_fence_driver_init to 
>>>>> amdgpu_fence_driver_sw_init, to keep sw_init and sw_fini paired.
>>>>>
>>>>> Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
>>>>> Suggested-by: Christian König <christian.koenig@amd.com>
>>>>> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++++++-----
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
>>>>>    3 files changed, 11 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index b1d2dc39e8be..9e53ff851496 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>>>
>>>>>    fence_driver_init:
>>>>>        /* Fence driver */
>>>>> -     r = amdgpu_fence_driver_init(adev);
>>>>> +     r = amdgpu_fence_driver_sw_init(adev);
>>>>>        if (r) {
>>>>> -             dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
>>>>> +             dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
>>>>>                amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);
>>>>>                goto failed;
>>>>>        }
>>>>> @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>>>>>        }
>>>>>        amdgpu_fence_driver_hw_init(adev);
>>>>>
>>>>> -
>>>>>        r = amdgpu_device_ip_late_init(adev);
>>>>>        if (r)
>>>>>                return r;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 49c5c7331c53..7495911516c2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>    }
>>>>>
>>>>>    /**
>>>>> - * amdgpu_fence_driver_init - init the fence driver
>>>>> + * amdgpu_fence_driver_sw_init - init the fence driver
>>>>>     * for all possible rings.
>>>>>     *
>>>>>     * @adev: amdgpu device pointer
>>>>> @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>     * amdgpu_fence_driver_start_ring().
>>>>>     * Returns 0 for success.
>>>>>     */
>>>>> -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
>>>>> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
>>>>>    {
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>>    /**
>>>>> - * amdgpu_fence_driver_fini - tear down the fence driver
>>>>> + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>>>>>     * for all possible rings.
>>>>>     *
>>>>>     * @adev: amdgpu device pointer
>>>>> @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>>>>>
>>>>>                if (!ring || !ring->fence_drv.initialized)
>>>>>                        continue;
>>>>> -             if (!ring->no_scheduler)
>>>>> -                     drm_sched_fini(&ring->sched);
>>>>> +
>>>>>                /* You can't wait for HW to signal if it's gone */
>>>>>                if (!drm_dev_is_unplugged(&adev->ddev))
>>>>>                        r = amdgpu_fence_wait_empty(ring);
>>>>
>>>> Sorry for the late notice, I missed this patch. Moving drm_sched_fini
>>>> past amdgpu_fence_wait_empty creates a race: even after you have
>>>> waited for all fences on the ring to signal, the SW scheduler will
>>>> keep submitting new jobs to the ring, so the ring won't stay empty.
>>>>
>>>> For hot device removal we also want to prevent any access to the HW
>>>> past PCI removal, in order not to do any MMIO accesses inside the
>>>> physical MMIO range that no longer belongs to this device after its
>>>> removal by the PCI core. Stopping all the schedulers prevents any
>>>> MMIO accesses done during job submission, and that is why
>>>> drm_sched_fini was done as part of amdgpu_fence_driver_hw_fini and
>>>> not amdgpu_fence_driver_sw_fini.
>>>>
>>>> Andrey
>>>>
>>>>> @@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>>>                if (!ring || !ring->fence_drv.initialized)
>>>>>                        continue;
>>>>>
>>>>> +             if (!ring->no_scheduler)
>>>>> +                     drm_sched_fini(&ring->sched);
>>>>> +
>>>>>                for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>>>>>                        dma_fence_put(ring->fence_drv.fences[j]);
>>>>>                kfree(ring->fence_drv.fences);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index 27adffa7658d..9c11ced4312c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -106,7 +106,6 @@ struct amdgpu_fence_driver {
>>>>>        struct dma_fence                **fences;
>>>>>    };
>>>>>
>>>>> -int amdgpu_fence_driver_init(struct amdgpu_device *adev);
>>>>>    void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
>>>>>
>>>>>    int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>> @@ -115,9 +114,10 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>    int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
>>>>>                                   struct amdgpu_irq_src *irq_src,
>>>>>                                   unsigned irq_type);
>>>>> +void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
>>>>>    void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev);
>>>>> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev);
>>>>>    void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev);
>>>>> -void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
>>>>>    int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **fence,
>>>>>                      unsigned flags);
>>>>>    int amdgpu_fence_emit_polling(struct amdgpu_ring *ring, uint32_t *s,
>

Thread overview: 17+ messages
2021-08-02  5:16 [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2) Guchun Chen
2021-08-02  6:56 ` Christian König
2021-08-02  8:23   ` Chen, Guchun
2021-08-02 13:35     ` Alex Deucher
2021-08-02 16:19       ` Mike Lothian
2021-08-03  1:56       ` Chen, Guchun
2021-08-18  2:08         ` Mike Lothian
2021-08-18  2:12           ` Mike Lothian
2021-08-18  2:23             ` Chen, Guchun
2021-08-18  8:13               ` Mike Lothian
2021-08-18  5:26 ` Andrey Grodzovsky
2021-08-19 12:04   ` Mike Lothian
2021-08-19 18:13     ` Alex Deucher
2021-08-23  6:36       ` Chen, Guchun
2021-08-23  6:50         ` Christian König
2021-08-23 14:41           ` Andrey Grodzovsky
2021-08-27 10:42             ` Chen, Guchun [this message]
