dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: "Liu, Shaoyun" <Shaoyun.Liu@amd.com>
To: "Koenig, Christian" <Christian.Koenig@amd.com>,
	"Liu, Monk" <Monk.Liu@amd.com>,
	"Chen, JingWen" <JingWen.Chen2@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"Deng, Emily" <Emily.Deng@amd.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Chen, Horace" <Horace.Chen@amd.com>
Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
Date: Tue, 4 Jan 2022 17:13:05 +0000	[thread overview]
Message-ID: <CH0PR12MB5372A0240BEAFB02406899E9F44A9@CH0PR12MB5372.namprd12.prod.outlook.com> (raw)
In-Reply-To: <23bebf13-c622-7c61-af88-0e0970b90389@amd.com>

[AMD Official Use Only]

I mostly agree with the sequences Christian  described .  Just one  thing might need to  discuss here.  For FLR notified from host,  in new sequenceas described  , driver need to reply the  READY_TO_RESET in the  workitem  from a reset  work queue which means inside flr_work, driver can not directly reply to host but need to queue another workqueue . For current  code ,  the flr_work for sriov itself is a work queue queued from ISR .  I think we should try to response to the host driver as soon as possible.  Queue another workqueue  inside  the workqueue  doesn't sounds efficient to me.  
Anyway, what we need is a working  solution for our project.  So if we need to change the sequence, we  need to make sure it's been tested first and won't break the functionality before the code is landed in the branch . 

Regards
Shaoyun.liu


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Tuesday, January 4, 2022 6:36 AM
To: Liu, Monk <Monk.Liu@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>
Cc: daniel@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>

Then we have a major design issue in the SRIOV protocol and really need to question this.

How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?

As far as I can see the procedure should be:
1. We detect that a reset is necessary, either because of a fault a timeout or signal from hypervisor.
2. For each of those potential reset sources a work item is send to the single workqueue.
3. One of those work items execute first and prepares the reset.
4. We either do the reset our self or notify the hypervisor that we are ready for the reset.
5. Cleanup after the reset, eventually resubmit jobs etc..
6. Cancel work items which might have been scheduled from other reset sources.

It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow this general steps we will always have a race between the different components.

Regards,
Christian.

Am 04.01.22 um 11:49 schrieb Liu, Monk:
> [AMD Official Use Only]
>
>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
> is about to start or was already executed, but host will do FLR anyway 
> without waiting for guest too long
>
>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
> It makes the code to crash ... how could it be a fix ?
>
> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>
> Thanks
> -------------------------------------------------------------------
> Monk Liu | Cloud GPU & Virtualization Solution | AMD
> -------------------------------------------------------------------
> we are hiring software manager for CVS core team
> -------------------------------------------------------------------
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Tuesday, January 4, 2022 6:19 PM
> To: Chen, JingWen <JingWen.Chen2@amd.com>; Christian König 
> <ckoenig.leichtzumerken@gmail.com>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Liu, 
> Monk <Monk.Liu@amd.com>; dri-devel@lists.freedesktop.org; 
> amd-gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; 
> Chen, JingWen <JingWen.Chen2@amd.com>
> Cc: daniel@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
> protection for SRIOV
>
> Hi Jingwen,
>
> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>
> Could be that the reset sequence is questionable in general, but I doubt so at least for now.
>
> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>
> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>
> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>> Hi Christian,
>> I'm not sure what do you mean by "we need to change SRIOV not the driver".
>>
>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>
>>   From my point of view, we can directly use amdgpu_device_lock_adev 
>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced.
>> But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space.
>>
>> Best Regards,
>> Jingwen Chen
>>
>> On 2022/1/3 下午6:17, Christian König wrote:
>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>
>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>
>>> Christian.
>>>
>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>> Sure, I guess i can drop this patch then.
>>>>
>>>> Andrey
>>>>
>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.
>>>>>
>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>>>>>>
>>>>>> Best wishes
>>>>>> Emily Deng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Liu, Monk <Monk.Liu@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, 
>>>>>>> Andrey <Andrey.Grodzovsky@amd.com>; 
>>>>>>> dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org; 
>>>>>>> Chen, Horace <Horace.Chen@amd.com>; Chen, JingWen 
>>>>>>> <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>>>>>>> Cc: daniel@ffwll.ch
>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>> reset protection for SRIOV
>>>>>>>
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>
>>>>>>> Please take a review on Andrey's patch
>>>>>>>
>>>>>>> Thanks
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>> ----------------------------------------------------------------
>>>>>>> -
>>>>>>> --
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri- 
>>>>>>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace 
>>>>>>> <Horace.Chen@amd.com>
>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>> reset protection for SRIOV
>>>>>>>
>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>> Since now flr work is serialized against  GPU resets there is 
>>>>>>>> no need for this.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Acked-by: Christian König <christian.koenig@amd.com>
>>>>>>>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>      2 files changed, 22 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>> received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR 
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) || diff 
>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          struct amdgpu_device *adev = container_of(virt, struct
>>>>>>> amdgpu_device, virt);
>>>>>>>>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>
>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>> received,
>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>> -     * the VF FLR.
>>>>>>>> -     */
>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>> -        return;
>>>>>>>> -
>>>>>>>>          amdgpu_virt_fini_data_exchange(adev);
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>
>>>>>>>>          xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 
>>>>>>>> 0, 0);
>>>>>>>>
>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>> work_struct *work)
>>>>>>>>          } while (timeout > 1);
>>>>>>>>
>>>>>>>>      flr_done:
>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>> -
>>>>>>>>          /* Trigger recovery for world switch failure if no TDR 
>>>>>>>> */
>>>>>>>>          if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>              && (!amdgpu_device_has_job_running(adev) ||

  parent reply	other threads:[~2022-01-04 17:13 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-22 22:04 [RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky
2021-12-22 22:04 ` [RFC v2 1/8] drm/amdgpu: Introduce reset domain Andrey Grodzovsky
2021-12-22 22:05 ` [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-23  8:39   ` Christian König
2021-12-22 22:05 ` [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-23  8:40   ` Christian König
2021-12-22 22:05 ` [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-23  8:41   ` Christian König
2022-01-05  9:54   ` Lazar, Lijo
2022-01-05 12:31     ` Christian König
2022-01-05 13:11       ` Lazar, Lijo
2022-01-05 13:15         ` Christian König
2022-01-05 13:26           ` Lazar, Lijo
2022-01-05 13:41             ` Christian König
2022-01-05 18:11       ` Andrey Grodzovsky
2022-01-17 19:14         ` Andrey Grodzovsky
2022-01-17 19:17           ` Christian König
2022-01-17 19:21             ` Andrey Grodzovsky
2022-01-26 15:52               ` Andrey Grodzovsky
2022-01-28 16:57                 ` Grodzovsky, Andrey
2022-02-07  2:41                   ` JingWen Chen
2022-02-07  3:08                     ` Grodzovsky, Andrey
2021-12-22 22:13 ` [RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 6/8] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-22 22:14   ` [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Andrey Grodzovsky
2021-12-23  8:42     ` Christian König
2021-12-23 10:14       ` Liu, Monk
2021-12-24  8:58         ` Deng, Emily
2021-12-24  9:57           ` JingWen Chen
2021-12-30 18:45             ` Andrey Grodzovsky
2022-01-03 10:17               ` Christian König
2022-01-04  9:07                 ` JingWen Chen
2022-01-04 10:18                   ` Christian König
2022-01-04 10:49                     ` Liu, Monk
2022-01-04 11:36                       ` Christian König
2022-01-04 16:56                         ` Andrey Grodzovsky
2022-01-05  7:34                           ` JingWen Chen
2022-01-05  7:59                             ` Christian König
2022-01-05 18:24                               ` Andrey Grodzovsky
2022-01-06  4:59                                 ` JingWen Chen
2022-01-06  5:18                                   ` JingWen Chen
2022-01-06  9:13                                     ` Christian König
2022-01-06 19:13                                     ` Andrey Grodzovsky
2022-01-07  3:57                                       ` JingWen Chen
2022-01-07  5:46                                         ` JingWen Chen
2022-01-07 16:02                                           ` Andrey Grodzovsky
2022-01-12  6:28                                             ` JingWen Chen
2022-01-04 17:13                         ` Liu, Shaoyun [this message]
2022-01-04 20:54                           ` Andrey Grodzovsky
2022-01-05  0:01                             ` Liu, Shaoyun
2022-01-05  7:25                         ` JingWen Chen
2021-12-30 18:39           ` Andrey Grodzovsky
2021-12-23 18:07     ` Liu, Shaoyun
2021-12-23 18:29   ` [RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CH0PR12MB5372A0240BEAFB02406899E9F44A9@CH0PR12MB5372.namprd12.prod.outlook.com \
    --to=shaoyun.liu@amd.com \
    --cc=Andrey.Grodzovsky@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=Emily.Deng@amd.com \
    --cc=Horace.Chen@amd.com \
    --cc=JingWen.Chen2@amd.com \
    --cc=Monk.Liu@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=dri-devel@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).