All of lore.kernel.org
 help / color / mirror / Atom feed
From: JingWen Chen <jingwech@amd.com>
To: "Deng, Emily" <Emily.Deng@amd.com>,
	"Liu, Monk" <Monk.Liu@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Chen, Horace" <Horace.Chen@amd.com>,
	"Chen,  JingWen" <JingWen.Chen2@amd.com>
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
Date: Fri, 24 Dec 2021 17:57:06 +0800	[thread overview]
Message-ID: <cc8e296d-360d-9a2f-85cd-f47d55581e99@amd.com> (raw)
In-Reply-To: <PH0PR12MB5417F12B403B8181D5CD03988F7F9@PH0PR12MB5417.namprd12.prod.outlook.com>

I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.

On 2021/12/24 下午4:58, Deng, Emily wrote:
> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Liu, Monk <Monk.Liu@amd.com>
>> Sent: Thursday, December 23, 2021 6:14 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> [AMD Official Use Only]
>>
>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>
>> Please take a review on Andrey's patch
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, December 23, 2021 4:42 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>> <Horace.Chen@amd.com>
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>> Since now flr work is serialized against  GPU resets there is no need
>>> for this.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Acked-by: Christian König <christian.koenig@amd.com>
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>   2 files changed, 22 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> index 487cd654b69e..7d59a66e3988 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>   	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>   	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>   	} while (timeout > 1);
>>>
>>>   flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>   		&& (!amdgpu_device_has_job_running(adev) || diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> index e3869067a31d..f82c066c8e8d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>   	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>   	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>   	} while (timeout > 1);
>>>
>>>   flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>   		&& (!amdgpu_device_has_job_running(adev) ||

WARNING: multiple messages have this Message-ID (diff)
From: JingWen Chen <jingwech@amd.com>
To: "Deng, Emily" <Emily.Deng@amd.com>,
	"Liu, Monk" <Monk.Liu@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Chen, Horace" <Horace.Chen@amd.com>,
	"Chen,  JingWen" <JingWen.Chen2@amd.com>
Cc: "daniel@ffwll.ch" <daniel@ffwll.ch>
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
Date: Fri, 24 Dec 2021 17:57:06 +0800	[thread overview]
Message-ID: <cc8e296d-360d-9a2f-85cd-f47d55581e99@amd.com> (raw)
In-Reply-To: <PH0PR12MB5417F12B403B8181D5CD03988F7F9@PH0PR12MB5417.namprd12.prod.outlook.com>

I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, guest side thread may not know this and still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to very bad result.

On 2021/12/24 下午4:58, Deng, Emily wrote:
> These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: Liu, Monk <Monk.Liu@amd.com>
>> Sent: Thursday, December 23, 2021 6:14 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; dri-devel@lists.freedesktop.org; amd-
>> gfx@lists.freedesktop.org; Chen, Horace <Horace.Chen@amd.com>; Chen,
>> JingWen <JingWen.Chen2@amd.com>; Deng, Emily <Emily.Deng@amd.com>
>> Cc: daniel@ffwll.ch
>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> [AMD Official Use Only]
>>
>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>
>> Please take a review on Andrey's patch
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, December 23, 2021 4:42 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; dri-
>> devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>> Cc: daniel@ffwll.ch; Liu, Monk <Monk.Liu@amd.com>; Chen, Horace
>> <Horace.Chen@amd.com>
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>> Since now flr work is serialized against  GPU resets there is no need
>>> for this.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Acked-by: Christian König <christian.koenig@amd.com>
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>   2 files changed, 22 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> index 487cd654b69e..7d59a66e3988 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>   	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>   	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>>   	} while (timeout > 1);
>>>
>>>   flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>   		&& (!amdgpu_device_has_job_running(adev) || diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> index e3869067a31d..f82c066c8e8d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>   	struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>> -	 * the VF FLR.
>>> -	 */
>>> -	if (!down_write_trylock(&adev->reset_sem))
>>> -		return;
>>> -
>>>   	amdgpu_virt_fini_data_exchange(adev);
>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>
>>>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>>   	} while (timeout > 1);
>>>
>>>   flr_done:
>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>> -	up_write(&adev->reset_sem);
>>> -
>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>   		&& (!amdgpu_device_has_job_running(adev) ||

  reply	other threads:[~2021-12-24  9:57 UTC|newest]

Thread overview: 103+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-22 22:04 [RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky
2021-12-22 22:04 ` Andrey Grodzovsky
2021-12-22 22:04 ` [RFC v2 1/8] drm/amdgpu: Introduce reset domain Andrey Grodzovsky
2021-12-22 22:04   ` Andrey Grodzovsky
2021-12-22 22:05 ` [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-22 22:05   ` Andrey Grodzovsky
2021-12-23  8:39   ` Christian König
2021-12-23  8:39     ` Christian König
2021-12-22 22:05 ` [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-22 22:05   ` Andrey Grodzovsky
2021-12-23  8:40   ` Christian König
2021-12-23  8:40     ` Christian König
2021-12-22 22:05 ` [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-22 22:05   ` Andrey Grodzovsky
2021-12-23  8:41   ` Christian König
2021-12-23  8:41     ` Christian König
2022-01-05  9:54   ` Lazar, Lijo
2022-01-05  9:54     ` Lazar, Lijo
2022-01-05 12:31     ` Christian König
2022-01-05 12:31       ` Christian König
2022-01-05 13:11       ` Lazar, Lijo
2022-01-05 13:11         ` Lazar, Lijo
2022-01-05 13:15         ` Christian König
2022-01-05 13:15           ` Christian König
2022-01-05 13:26           ` Lazar, Lijo
2022-01-05 13:26             ` Lazar, Lijo
2022-01-05 13:41             ` Christian König
2022-01-05 13:41               ` Christian König
2022-01-05 18:11       ` Andrey Grodzovsky
2022-01-05 18:11         ` Andrey Grodzovsky
2022-01-17 19:14         ` Andrey Grodzovsky
2022-01-17 19:17           ` Christian König
2022-01-17 19:21             ` Andrey Grodzovsky
2022-01-26 15:52               ` Andrey Grodzovsky
2022-01-28 16:57                 ` Grodzovsky, Andrey
2022-02-07  2:41                   ` JingWen Chen
2022-02-07  3:08                     ` Grodzovsky, Andrey
2021-12-22 22:13 ` [RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-22 22:13   ` Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 6/8] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-22 22:13     ` Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-22 22:13     ` Andrey Grodzovsky
2021-12-22 22:14   ` [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Andrey Grodzovsky
2021-12-22 22:14     ` Andrey Grodzovsky
2021-12-23  8:42     ` Christian König
2021-12-23  8:42       ` Christian König
2021-12-23 10:14       ` Liu, Monk
2021-12-23 10:14         ` Liu, Monk
2021-12-24  8:58         ` Deng, Emily
2021-12-24  8:58           ` Deng, Emily
2021-12-24  9:57           ` JingWen Chen [this message]
2021-12-24  9:57             ` JingWen Chen
2021-12-30 18:45             ` Andrey Grodzovsky
2021-12-30 18:45               ` Andrey Grodzovsky
2022-01-03 10:17               ` Christian König
2022-01-03 10:17                 ` Christian König
2022-01-04  9:07                 ` JingWen Chen
2022-01-04  9:07                   ` JingWen Chen
2022-01-04 10:18                   ` Christian König
2022-01-04 10:18                     ` Christian König
2022-01-04 10:49                     ` Liu, Monk
2022-01-04 10:49                       ` Liu, Monk
2022-01-04 11:36                       ` Christian König
2022-01-04 11:36                         ` Christian König
2022-01-04 16:56                         ` Andrey Grodzovsky
2022-01-04 16:56                           ` Andrey Grodzovsky
2022-01-05  7:34                           ` JingWen Chen
2022-01-05  7:34                             ` JingWen Chen
2022-01-05  7:59                             ` Christian König
2022-01-05  7:59                               ` Christian König
2022-01-05 18:24                               ` Andrey Grodzovsky
2022-01-05 18:24                                 ` Andrey Grodzovsky
2022-01-06  4:59                                 ` JingWen Chen
2022-01-06  4:59                                   ` JingWen Chen
2022-01-06  5:18                                   ` JingWen Chen
2022-01-06  5:18                                     ` JingWen Chen
2022-01-06  9:13                                     ` Christian König
2022-01-06  9:13                                       ` Christian König
2022-01-06 19:13                                     ` Andrey Grodzovsky
2022-01-06 19:13                                       ` Andrey Grodzovsky
2022-01-07  3:57                                       ` JingWen Chen
2022-01-07  3:57                                         ` JingWen Chen
2022-01-07  5:46                                         ` JingWen Chen
2022-01-07  5:46                                           ` JingWen Chen
2022-01-07 16:02                                           ` Andrey Grodzovsky
2022-01-07 16:02                                             ` Andrey Grodzovsky
2022-01-12  6:28                                             ` JingWen Chen
2022-01-12  6:28                                               ` JingWen Chen
2022-01-04 17:13                         ` Liu, Shaoyun
2022-01-04 17:13                           ` Liu, Shaoyun
2022-01-04 20:54                           ` Andrey Grodzovsky
2022-01-04 20:54                             ` Andrey Grodzovsky
2022-01-05  0:01                             ` Liu, Shaoyun
2022-01-05  0:01                               ` Liu, Shaoyun
2022-01-05  7:25                         ` JingWen Chen
2022-01-05  7:25                           ` JingWen Chen
2021-12-30 18:39           ` Andrey Grodzovsky
2021-12-30 18:39             ` Andrey Grodzovsky
2021-12-23 18:07     ` Liu, Shaoyun
2021-12-23 18:07       ` Liu, Shaoyun
2021-12-23 18:29   ` [RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-23 18:29     ` Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cc8e296d-360d-9a2f-85cd-f47d55581e99@amd.com \
    --to=jingwech@amd.com \
    --cc=Andrey.Grodzovsky@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=Emily.Deng@amd.com \
    --cc=Horace.Chen@amd.com \
    --cc=JingWen.Chen2@amd.com \
    --cc=Monk.Liu@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=dri-devel@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.