All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV of SRIOV
@ 2019-12-17 10:19 Monk Liu
  2019-12-17 10:19 ` [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR " Monk Liu
  2019-12-17 10:31 ` [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV " Deng, Emily
  0 siblings, 2 replies; 16+ messages in thread
From: Monk Liu @ 2019-12-17 10:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: Monk Liu

issues:
gpu_recover() is re-entered by the mailbox interrupt
handler mxgpu_nv.c

fix:
we need to bypass the gpu_recover() invoke in mailbox
interrupt as long as the timeout is not infinite (thus the TDR
will be triggered automatically after time out, no need to invoke
gpu_recover() through mailbox interrupt.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 0d8767e..1c3a7d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -269,7 +269,11 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery for world switch failure if no TDR */
-	if (amdgpu_device_should_recover_gpu(adev))
+	if (amdgpu_device_should_recover_gpu(adev)
+		&& (adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
+		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
+		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
 		amdgpu_device_gpu_recover(adev, NULL);
 }
 
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-17 10:19 [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV of SRIOV Monk Liu
@ 2019-12-17 10:19 ` Monk Liu
  2019-12-17 10:38   ` Deng, Emily
  2019-12-17 15:37   ` shaoyunl
  2019-12-17 10:31 ` [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV " Deng, Emily
  1 sibling, 2 replies; 16+ messages in thread
From: Monk Liu @ 2019-12-17 10:19 UTC (permalink / raw)
  To: amd-gfx; +Cc: Monk Liu

issues:
MEC is ruined by the amdkfd_pre_reset after VF FLR done

fix:
amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF FLR,
the correct sequence is do amdkfd_pre_reset before VF FLR but there is
a limitation to block this sequence:
if we do pre_reset() before VF FLR, it would go KIQ way to do register
access and stuck there, because KIQ probably won't work by that time
(e.g. you already made GFX hang)

so the best way right now is to simply remove it.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 605cef6..ae962b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
-	amdgpu_amdkfd_pre_reset(adev);
-
 	/* Resume IP prior to SMC */
 	r = amdgpu_device_ip_reinit_early_sriov(adev);
 	if (r)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV of SRIOV
  2019-12-17 10:19 [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV of SRIOV Monk Liu
  2019-12-17 10:19 ` [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR " Monk Liu
@ 2019-12-17 10:31 ` Deng, Emily
  1 sibling, 0 replies; 16+ messages in thread
From: Deng, Emily @ 2019-12-17 10:31 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx; +Cc: Liu, Monk

[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Emily Deng <Emily.Deng@amd.com>

>-----Original Message-----
>From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Monk Liu
>Sent: Tuesday, December 17, 2019 6:20 PM
>To: amd-gfx@lists.freedesktop.org
>Cc: Liu, Monk <Monk.Liu@amd.com>
>Subject: [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV of SRIOV
>
>issues:
>gpu_recover() is re-entered by the mailbox interrupt handler mxgpu_nv.c
>
>fix:
>we need to bypass the gpu_recover() invoke in mailbox interrupt as long as the
>timeout is not infinite (thus the TDR will be triggered automatically after time
>out, no need to invoke
>gpu_recover() through mailbox interrupt.
>
>Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>index 0d8767e..1c3a7d4 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>@@ -269,7 +269,11 @@ static void xgpu_nv_mailbox_flr_work(struct
>work_struct *work)
> 	}
>
> 	/* Trigger recovery for world switch failure if no TDR */
>-	if (amdgpu_device_should_recover_gpu(adev))
>+	if (amdgpu_device_should_recover_gpu(adev)
>+		&& (adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
>+		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
>+		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
>+		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
> 		amdgpu_device_gpu_recover(adev, NULL);  }
>
>--
>2.7.4
>
>_______________________________________________
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre
>edesktop.org%2Fmailman%2Flistinfo%2Famd-
>gfx&amp;data=02%7C01%7CEmily.Deng%40amd.com%7C029ef88677e744f2ad
>8f08d782dab68c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63
>7121748276776005&amp;sdata=IiRwMTw6DQW8sh8Y7SkZ2PehohwnH6gSqkt
>t64a73UU%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-17 10:19 ` [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR " Monk Liu
@ 2019-12-17 10:38   ` Deng, Emily
  2019-12-17 15:37   ` shaoyunl
  1 sibling, 0 replies; 16+ messages in thread
From: Deng, Emily @ 2019-12-17 10:38 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx; +Cc: Liu, Monk

[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Emily Deng <Emily.Deng@amd.com>

>-----Original Message-----
>From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Monk Liu
>Sent: Tuesday, December 17, 2019 6:20 PM
>To: amd-gfx@lists.freedesktop.org
>Cc: Liu, Monk <Monk.Liu@amd.com>
>Subject: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
>
>issues:
>MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
>fix:
>amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF FLR, the
>correct sequence is do amdkfd_pre_reset before VF FLR but there is a limitation
>to block this sequence:
>if we do pre_reset() before VF FLR, it would go KIQ way to do register access and
>stuck there, because KIQ probably won't work by that time (e.g. you already
>made GFX hang)
>
>so the best way right now is to simply remove it.
>
>Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
> 1 file changed, 2 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 605cef6..ae962b9 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> 	if (r)
> 		return r;
>
>-	amdgpu_amdkfd_pre_reset(adev);
>-
> 	/* Resume IP prior to SMC */
> 	r = amdgpu_device_ip_reinit_early_sriov(adev);
> 	if (r)
>--
>2.7.4
>
>_______________________________________________
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre
>edesktop.org%2Fmailman%2Flistinfo%2Famd-
>gfx&amp;data=02%7C01%7CEmily.Deng%40amd.com%7C74408803b49e4f328
>f7708d782daba6c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6
>37121748318124859&amp;sdata=4YbyHwEEGxVLEhuOg%2Frc%2FxdhFRwrdm
>FuZ4vpHx%2FApAE%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-17 10:19 ` [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR " Monk Liu
  2019-12-17 10:38   ` Deng, Emily
@ 2019-12-17 15:37   ` shaoyunl
  2019-12-17 21:14     ` Felix Kuehling
  2019-12-19  3:49     ` Liu, Monk
  1 sibling, 2 replies; 16+ messages in thread
From: shaoyunl @ 2019-12-17 15:37 UTC (permalink / raw)
  To: amd-gfx

I think amdkfd side depends on this call to stop the user queue, without this call, the user queue can submit to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF FLR,
> the correct sequence is do amdkfd_pre_reset before VF FLR but there is
> a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> -	amdgpu_amdkfd_pre_reset(adev);
> -
>   	/* Resume IP prior to SMC */
>   	r = amdgpu_device_ip_reinit_early_sriov(adev);
>   	if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-17 15:37   ` shaoyunl
@ 2019-12-17 21:14     ` Felix Kuehling
  2019-12-19  3:49     ` Liu, Monk
  1 sibling, 0 replies; 16+ messages in thread
From: Felix Kuehling @ 2019-12-17 21:14 UTC (permalink / raw)
  To: amd-gfx

I agree. Removing the call to pre-reset probably breaks GPU reset for KFD.

We call the KFD suspend function in pre-reset, which uses the HIQ to 
stop any user mode queues still running. If that is not possible because 
the HIQ is hanging, it should fail with a timeout. There may be 
something we can do if we know that the HIQ is hanging, so we only 
update the KFD-internal queue state without actually sending anything to 
the HIQ.

Regards,
   Felix

On 2019-12-17 10:37, shaoyunl wrote:
> I think amdkfd side depends on this call to stop the user queue, 
> without this call, the user queue can submit to HW during the reset 
> which could cause hang again ...
> Do we know the root cause why this function would ruin MEC ? From the 
> logic, I think this function should be called before FLR since we need 
> to disable the user queue submission first.
> I remembered the function should use hiq to communicate with HW , 
> shouldn't use kiq to access HW registerm,  has this been changed ?
>
>
> Regards
> shaoyun.liu
>
>
> On 2019-12-17 5:19 a.m., Monk Liu wrote:
>> issues:
>> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>>
>> fix:
>> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF FLR,
>> the correct sequence is do amdkfd_pre_reset before VF FLR but there is
>> a limitation to block this sequence:
>> if we do pre_reset() before VF FLR, it would go KIQ way to do register
>> access and stuck there, because KIQ probably won't work by that time
>> (e.g. you already made GFX hang)
>>
>> so the best way right now is to simply remove it.
>>
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 605cef6..ae962b9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct 
>> amdgpu_device *adev,
>>       if (r)
>>           return r;
>>   -    amdgpu_amdkfd_pre_reset(adev);
>> -
>>       /* Resume IP prior to SMC */
>>       r = amdgpu_device_ip_reinit_early_sriov(adev);
>>       if (r)
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cfelix.kuehling%40amd.com%7Cbd097404ba8b4e7f9d9308d7830717fe%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938908876710&amp;sdata=bNGTZtFLiQ46UwjCa5u8hXG1KUtK%2Fs98g7rBmBtTaPs%3D&amp;reserved=0 
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-17 15:37   ` shaoyunl
  2019-12-17 21:14     ` Felix Kuehling
@ 2019-12-19  3:49     ` Liu, Monk
  2019-12-19  3:52       ` Liu, Monk
  1 sibling, 1 reply; 16+ messages in thread
From: Liu, Monk @ 2019-12-19  3:49 UTC (permalink / raw)
  To: Liu, Shaoyun, amd-gfx

Hi Shaoyun

>>> Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
Right now I don't know which detail step lead to KIQ ring test fail, I totally agree with you that this func should be called before VF FLR, but we cannot do it and the reason is described in 
The comment:

> if we do pre_reset() before VF FLR, it would go KIQ way to do register 
> access and stuck there, because KIQ probably won't work by that time 
> (e.g. you already made GFX hang)


>>> I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?
Tis function use WREG32/RREG32 to do register access, like all other functions in KMD,  and WREG32/RREG32 will let KIQ to do the register access
If we are under dynamic SRIOV  mode (means we are SRIOV VF and isn't under full exclusive mode)

You see that if you call this func before EVENT_5 (event 5 triggers VF FLR) then it will run under dynamic mode and KIQ will handle the register access, which is not an option 
Since ME/MEC probably already hang ( if we are testing quark on gfx/compute rings)

Do you have a good suggestion ?

thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I think amdkfd side depends on this call to stop the user queue, without this call, the user queue can submit to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF 
> FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but 
> there is a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ way to do register 
> access and stuck there, because KIQ probably won't work by that time 
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> -	amdgpu_amdkfd_pre_reset(adev);
> -
>   	/* Resume IP prior to SMC */
>   	r = amdgpu_device_ip_reinit_early_sriov(adev);
>   	if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&amp;sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19  3:49     ` Liu, Monk
@ 2019-12-19  3:52       ` Liu, Monk
  2019-12-19  4:29         ` Liu, Shaoyun
  0 siblings, 1 reply; 16+ messages in thread
From: Liu, Monk @ 2019-12-19  3:52 UTC (permalink / raw)
  To: Liu, Shaoyun, amd-gfx

Oh, by the way

>>> Do we know the root cause why this function would ruin MEC ?

Only we call this function right after VF FLR will ruin MEC and lead to following KIQ ring test fail , and on bare-metal it is called before gpu rest , so that's why on bare-metal we don't have this issue 

But the reason we cannot call it before VF FLR on SRIOV case was already stated in this thread 

Thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Liu, Monk 
Sent: Thursday, December 19, 2019 11:49 AM
To: shaoyunl <shaoyun.liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Hi Shaoyun

>>> Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
Right now I don't know which detail step lead to KIQ ring test fail, I totally agree with you that this func should be called before VF FLR, but we cannot do it and the reason is described in The comment:

> if we do pre_reset() before VF FLR, it would go KIQ way to do register 
> access and stuck there, because KIQ probably won't work by that time 
> (e.g. you already made GFX hang)


>>> I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?
Tis function use WREG32/RREG32 to do register access, like all other functions in KMD,  and WREG32/RREG32 will let KIQ to do the register access If we are under dynamic SRIOV  mode (means we are SRIOV VF and isn't under full exclusive mode)

You see that if you call this func before EVENT_5 (event 5 triggers VF FLR) then it will run under dynamic mode and KIQ will handle the register access, which is not an option Since ME/MEC probably already hang ( if we are testing quark on gfx/compute rings)

Do you have a good suggestion ?

thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I think amdkfd side depends on this call to stop the user queue, without this call, the user queue can submit to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF 
> FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but 
> there is a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ way to do register 
> access and stuck there, because KIQ probably won't work by that time 
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> -	amdgpu_amdkfd_pre_reset(adev);
> -
>   	/* Resume IP prior to SMC */
>   	r = amdgpu_device_ip_reinit_early_sriov(adev);
>   	if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&amp;sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19  3:52       ` Liu, Monk
@ 2019-12-19  4:29         ` Liu, Shaoyun
  2019-12-19  6:13           ` Liu, Monk
  0 siblings, 1 reply; 16+ messages in thread
From: Liu, Shaoyun @ 2019-12-19  4:29 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 5343 bytes --]

I don't remember any register access by amdkfd_pre_reset call,   please let me know if this assumption is wrong .
This function will use hiq to access CP, in case CP already hang, we might not able to get the response from hw and will got a timeout. I think kfd internal should handle this. Felix already have some comments on that.
I would like to check why we need a special sequences for sriov on this pre_reset. If possible, make it the same as bare metal mode sounds better solution.

Regards
Shaoyun.liu
________________________________
From: Liu, Monk <Monk.Liu@amd.com>
Sent: December 18, 2019 10:52:47 PM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Oh, by the way

>>> Do we know the root cause why this function would ruin MEC ?

Only we call this function right after VF FLR will ruin MEC and lead to following KIQ ring test fail , and on bare-metal it is called before gpu rest , so that's why on bare-metal we don't have this issue

But the reason we cannot call it before VF FLR on SRIOV case was already stated in this thread

Thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Liu, Monk
Sent: Thursday, December 19, 2019 11:49 AM
To: shaoyunl <shaoyun.liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Hi Shaoyun

>>> Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
Right now I don't know which detail step lead to KIQ ring test fail, I totally agree with you that this func should be called before VF FLR, but we cannot do it and the reason is described in The comment:

> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)


>>> I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?
Tis function use WREG32/RREG32 to do register access, like all other functions in KMD,  and WREG32/RREG32 will let KIQ to do the register access If we are under dynamic SRIOV  mode (means we are SRIOV VF and isn't under full exclusive mode)

You see that if you call this func before EVENT_5 (event 5 triggers VF FLR) then it will run under dynamic mode and KIQ will handle the register access, which is not an option Since ME/MEC probably already hang ( if we are testing quark on gfx/compute rings)

Do you have a good suggestion ?

thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I think amdkfd side depends on this call to stop the user queue, without this call, the user queue can submit to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
> FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
> there is a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>        if (r)
>                return r;
>
> -     amdgpu_amdkfd_pre_reset(adev);
> -
>        /* Resume IP prior to SMC */
>        r = amdgpu_device_ip_reinit_early_sriov(adev);
>        if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&amp;sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&amp;reserved=0

[-- Attachment #1.2: Type: text/html, Size: 7060 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19  4:29         ` Liu, Shaoyun
@ 2019-12-19  6:13           ` Liu, Monk
  2019-12-19 14:51             ` [PATCwH " Liu, Shaoyun
  0 siblings, 1 reply; 16+ messages in thread
From: Liu, Monk @ 2019-12-19  6:13 UTC (permalink / raw)
  To: Liu, Shaoyun, amd-gfx


[-- Attachment #1.1.1: Type: text/plain, Size: 8531 bytes --]

>>> I would like to check why we need a special sequences for sriov on this pre_reset. If possible, make it the same as bare metal mode sounds better solution.

Because before VF FLR calling function would lead to register access through KIQ,  which will not complete because KIQ/GFX already hang by that time

>>> I don't remember any register access by amdkfd_pre_reset call,   please let me know if this assumption is wrong .

Please check "void pm_uninit(struct packet_manager *pm)" which is invoked inside of amdkfd_pre_reset() :

It will call uninitialized() in kfd_kernel_queue.c file

And then go to the path of "kq->mqd_mgr->destroy_mqd(...)"

And finally it calls "static int kgd_hqd_destroy(...)" in amdgpu_amdkfd_gfx_v10.c


539 {
540     struct amdgpu_device *adev = get_amdgpu_device(kgd);
541     enum hqd_dequeue_request_type type;
542     unsigned long end_jiffies;
543     uint32_t temp;
544     struct v10_compute_mqd *m = get_mqd(mqd);
545
546 #if 0
547     unsigned long flags;
548     int retry;
549 #endif
550
551     acquire_queue(kgd, pipe_id, queue_id); //this introduce register access via KIQ
552
553     if (m->cp_hqd_vmid == 0)
554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); //this introduce register access via KIQ
555
556     switch (reset_type) {
557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
558         type = DRAIN_PIPE;
559         break;
560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
561         type = RESET_WAVES;
562         break;
563     default:
564         type = DRAIN_PIPE;
565         break;
566     }
624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), type); //this introduce register access via KIQ
625
626     end_jiffies = (utimeout * HZ / 1000) + jiffies;
627     while (true) {
628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); //this introduce register access via KIQ
629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
630             break;
631         if (time_after(jiffies, end_jiffies)) {
632             pr_err("cp queue preemption time out.\n");
633             release_queue(kgd);
634             return -ETIME;
635         }
636         usleep_range(500, 1000);
637     }
638
639     release_queue(kgd);
640     return 0;

If we use the sequence from bare-metal, all above highlighted register access will not work because KIQ/GFX already died by that time which means the amdkfd_pre_reset() is actually  not working as expected.

_____________________________________
Monk Liu|GPU Virtualization Team |AMD
[sig-cloud-gpu]

From: Liu, Shaoyun <Shaoyun.Liu@amd.com>
Sent: Thursday, December 19, 2019 12:30 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I don't remember any register access by amdkfd_pre_reset call,   please let me know if this assumption is wrong .
This function will use hiq to access CP, in case CP already hang, we might not able to get the response from hw and will got a timeout. I think kfd internal should handle this. Felix already have some comments on that.
I would like to check why we need a special sequences for sriov on this pre_reset. If possible, make it the same as bare metal mode sounds better solution.

Regards
Shaoyun.liu
________________________________
From: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>
Sent: December 18, 2019 10:52:47 PM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com<mailto:Shaoyun.Liu@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Oh, by the way

>>> Do we know the root cause why this function would ruin MEC ?

Only we call this function right after VF FLR will ruin MEC and lead to following KIQ ring test fail , and on bare-metal it is called before gpu rest , so that's why on bare-metal we don't have this issue

But the reason we cannot call it before VF FLR on SRIOV case was already stated in this thread

Thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Liu, Monk
Sent: Thursday, December 19, 2019 11:49 AM
To: shaoyunl <shaoyun.liu@amd.com<mailto:shaoyun.liu@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Hi Shaoyun

>>> Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
Right now I don't know which detail step lead to KIQ ring test fail, I totally agree with you that this func should be called before VF FLR, but we cannot do it and the reason is described in The comment:

> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)


>>> I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?
Tis function use WREG32/RREG32 to do register access, like all other functions in KMD,  and WREG32/RREG32 will let KIQ to do the register access If we are under dynamic SRIOV  mode (means we are SRIOV VF and isn't under full exclusive mode)

You see that if you call this func before EVENT_5 (event 5 triggers VF FLR) then it will run under dynamic mode and KIQ will handle the register access, which is not an option Since ME/MEC probably already hang ( if we are testing quark on gfx/compute rings)

Do you have a good suggestion ?

thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I think amdkfd side depends on this call to stop the user queue, without this call, the user queue can submit to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
> FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
> there is a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>        if (r)
>                return r;
>
> -     amdgpu_amdkfd_pre_reset(adev);
> -
>        /* Resume IP prior to SMC */
>        r = amdgpu_device_ip_reinit_early_sriov(adev);
>        if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&amp;sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&amp;reserved=0

[-- Attachment #1.1.2: Type: text/html, Size: 17124 bytes --]

[-- Attachment #1.2: image001.png --]
[-- Type: image/png, Size: 12243 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19  6:13           ` Liu, Monk
@ 2019-12-19 14:51             ` Liu, Shaoyun
  2019-12-19 16:59               ` shaoyunl
  0 siblings, 1 reply; 16+ messages in thread
From: Liu, Shaoyun @ 2019-12-19 14:51 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx


[-- Attachment #1.1.1: Type: text/plain, Size: 9535 bytes --]

I see, thanks for the detail information.
Normally when CP is hang, the hiq access to unmap the queue will failed before driver call to the hqd_destroy. I think driver should add the code to check the return value  and directly finish the pre_reset in this case . If the hiq does not hang but kiq hang. We can use the same logic in hqd_destroy function,  return in first access failure instead go further.  With this change we probably can move the pre_reset function back to normal .
Felix, do you have any concerns or comments for the change.

Regards
Shaoyun.liu
________________________________
From: Liu, Monk <Monk.Liu@amd.com>
Sent: December 19, 2019 1:13:24 AM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV


>>> I would like to check why we need a special sequences for sriov on this pre_reset. If possible, make it the same as bare metal mode sounds better solution.



Because before VF FLR calling function would lead to register access through KIQ,  which will not complete because KIQ/GFX already hang by that time



>>> I don't remember any register access by amdkfd_pre_reset call,   please let me know if this assumption is wrong .



Please check “void pm_uninit(struct packet_manager *pm)” which is invoked inside of amdkfd_pre_reset() :



It will call uninitialized() in kfd_kernel_queue.c file



And then go to the path of “kq->mqd_mgr->destroy_mqd(…)”



And finally it calls “static int kgd_hqd_destroy(…)” in amdgpu_amdkfd_gfx_v10.c





539 {

540     struct amdgpu_device *adev = get_amdgpu_device(kgd);

541     enum hqd_dequeue_request_type type;

542     unsigned long end_jiffies;

543     uint32_t temp;

544     struct v10_compute_mqd *m = get_mqd(mqd);

545

546 #if 0

547     unsigned long flags;

548     int retry;

549 #endif

550

551     acquire_queue(kgd, pipe_id, queue_id); //this introduce register access via KIQ

552

553     if (m->cp_hqd_vmid == 0)

554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); //this introduce register access via KIQ

555

556     switch (reset_type) {

557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:

558         type = DRAIN_PIPE;

559         break;

560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:

561         type = RESET_WAVES;

562         break;

563     default:

564         type = DRAIN_PIPE;

565         break;

566     }

624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), type); //this introduce register access via KIQ

625

626     end_jiffies = (utimeout * HZ / 1000) + jiffies;

627     while (true) {

628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); //this introduce register access via KIQ

629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))

630             break;

631         if (time_after(jiffies, end_jiffies)) {

632             pr_err("cp queue preemption time out.\n");

633             release_queue(kgd);

634             return -ETIME;

635         }

636         usleep_range(500, 1000);

637     }

638

639     release_queue(kgd);

640     return 0;



If we use the sequence from bare-metal, all above highlighted register access will not work because KIQ/GFX already died by that time which means the amdkfd_pre_reset() is actually  not working as expected.



_____________________________________

Monk Liu|GPU Virtualization Team |AMD

[sig-cloud-gpu]



From: Liu, Shaoyun <Shaoyun.Liu@amd.com>
Sent: Thursday, December 19, 2019 12:30 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV



I don't remember any register access by amdkfd_pre_reset call,   please let me know if this assumption is wrong .
This function will use hiq to access CP, in case CP already hang, we might not able to get the response from hw and will got a timeout. I think kfd internal should handle this. Felix already have some comments on that.
I would like to check why we need a special sequences for sriov on this pre_reset. If possible, make it the same as bare metal mode sounds better solution.

Regards
Shaoyun.liu

________________________________

From: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>
Sent: December 18, 2019 10:52:47 PM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com<mailto:Shaoyun.Liu@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV



Oh, by the way

>>> Do we know the root cause why this function would ruin MEC ?

Only we call this function right after VF FLR will ruin MEC and lead to following KIQ ring test fail , and on bare-metal it is called before gpu rest , so that's why on bare-metal we don't have this issue

But the reason we cannot call it before VF FLR on SRIOV case was already stated in this thread

Thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Liu, Monk
Sent: Thursday, December 19, 2019 11:49 AM
To: shaoyunl <shaoyun.liu@amd.com<mailto:shaoyun.liu@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Hi Shaoyun

>>> Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
Right now I don't know which detail step lead to KIQ ring test fail, I totally agree with you that this func should be called before VF FLR, but we cannot do it and the reason is described in The comment:

> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)


>>> I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?
Tis function use WREG32/RREG32 to do register access, like all other functions in KMD,  and WREG32/RREG32 will let KIQ to do the register access If we are under dynamic SRIOV  mode (means we are SRIOV VF and isn't under full exclusive mode)

You see that if you call this func before EVENT_5 (event 5 triggers VF FLR) then it will run under dynamic mode and KIQ will handle the register access, which is not an option Since ME/MEC probably already hang ( if we are testing quark on gfx/compute rings)

Do you have a good suggestion ?

thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I think amdkfd side depends on this call to stop the user queue, without this call, the user queue can submit to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
> FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
> there is a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>        if (r)
>                return r;
>
> -     amdgpu_amdkfd_pre_reset(adev);
> -
>        /* Resume IP prior to SMC */
>        r = amdgpu_device_ip_reinit_early_sriov(adev);
>        if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&amp;sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&amp;reserved=0

[-- Attachment #1.1.2: Type: text/html, Size: 16369 bytes --]

[-- Attachment #1.2: image001.png --]
[-- Type: image/png, Size: 12243 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19 14:51             ` [PATCwH " Liu, Shaoyun
@ 2019-12-19 16:59               ` shaoyunl
  2019-12-19 22:44                 ` Felix Kuehling
  2019-12-20  2:46                 ` Liu, Monk
  0 siblings, 2 replies; 16+ messages in thread
From: shaoyunl @ 2019-12-19 16:59 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 11157 bytes --]

After check the code , in KFD side , should be simple just add the check 
in stop_cpsch code . For kiq, there is no return for WREG32 , so no easy 
way to check the return value . Maybe we can add kiq_status in struct 
amdgpu_kiq  to indicate the kiq is hang or not ,  in hdq_destroy 
function check this  kiq_status after acquire_queue , finish the destroy 
function is kiq is hang for SRIOV only .

Any comments ?


shaoyun.liu


On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:

> I see, thanks for the detail information.
> Normally when CP is hang, the hiq access to unmap the queue will 
> failed before driver call to the hqd_destroy. I think driver should 
> add the code to check the return value  and directly finish the 
> pre_reset in this case . If the hiq does not hang but kiq hang. We can 
> use the same logic in hqd_destroy function,  return in first access 
> failure instead go further.  With this change we probably can move the 
> pre_reset function back to normal .
> Felix, do you have any concerns or comments for the change.
>
> Regards
> Shaoyun.liu
> ------------------------------------------------------------------------
> *From:* Liu, Monk <Monk.Liu@amd.com>
> *Sent:* December 19, 2019 1:13:24 AM
> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com>; 
> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
> of SRIOV
>
> >>> I would like to check why we need a special sequences for sriov on 
> this pre_reset. If possible, make it the same as bare metal mode 
> sounds better solution.
>
> Because before VF FLR calling function would lead to register access 
> through KIQ,  which will not complete because KIQ/GFX already hang by 
> that time
>
> >>> I don't remember any register access by amdkfd_pre_reset call,   
> please let me know if this assumption is wrong .
>
> Please check “void pm_uninit(struct packet_manager *pm)” which is 
> invoked inside of amdkfd_pre_reset() :
>
> It will call uninitialized() in kfd_kernel_queue.c file
>
> And then go to the path of “kq->mqd_mgr->destroy_mqd(…)”
>
> And finally it calls “static int kgd_hqd_destroy(…)” in 
> amdgpu_amdkfd_gfx_v10.c
>
> 539 {
>
> 540     struct amdgpu_device *adev = get_amdgpu_device(kgd);
>
> 541     enum hqd_dequeue_request_type type;
>
> 542     unsigned long end_jiffies;
>
> 543     uint32_t temp;
>
> 544     struct v10_compute_mqd *m = get_mqd(mqd);
>
> 545
>
> 546 #if 0
>
> 547     unsigned long flags;
>
> 548     int retry;
>
> 549 #endif
>
> 550
>
> 551     acquire_queue(kgd, pipe_id, queue_id); //this introduce 
> register access via KIQ
>
> 552
>
> 553     if (m->cp_hqd_vmid == 0)
>
> 554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); 
> //this introduce register access via KIQ
>
> 555
>
> 556     switch (reset_type) {
>
> 557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
>
> 558         type = DRAIN_PIPE;
>
> 559         break;
>
> 560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
>
> 561         type = RESET_WAVES;
>
> 562         break;
>
> 563     default:
>
> 564         type = DRAIN_PIPE;
>
> 565         break;
>
> 566     }
>
> 624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), 
> type); //this introduce register access via KIQ
>
> 625
>
> 626     end_jiffies = (utimeout * HZ / 1000) + jiffies;
>
> 627     while (true) {
>
> 628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); 
> //this introduce register access via KIQ
>
> 629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
>
> 630             break;
>
> 631         if (time_after(jiffies, end_jiffies)) {
>
> 632             pr_err("cp queue preemption time out.\n");
>
> 633             release_queue(kgd);
>
> 634             return -ETIME;
>
> 635         }
>
> 636         usleep_range(500, 1000);
>
> 637     }
>
> 638
>
> 639     release_queue(kgd);
>
> 640     return 0;
>
> If we use the sequence from bare-metal, all above highlighted register 
> access will not work because KIQ/GFX already died by that time which 
> means the amdkfd_pre_reset() is actually  not working as expected.
>
> _____________________________________
>
> Monk Liu|GPU Virtualization Team |AMD
>
> sig-cloud-gpu
>
> *From:* Liu, Shaoyun <Shaoyun.Liu@amd.com>
> *Sent:* Thursday, December 19, 2019 12:30 PM
> *To:* Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> *Subject:* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
> of SRIOV
>
> I don't remember any register access by amdkfd_pre_reset call,   
> please let me know if this assumption is wrong .
> This function will use hiq to access CP, in case CP already hang, we 
> might not able to get the response from hw and will got a timeout. I 
> think kfd internal should handle this. Felix already have some 
> comments on that.
> I would like to check why we need a special sequences for sriov on 
> this pre_reset. If possible, make it the same as bare metal mode 
> sounds better solution.
>
> Regards
> Shaoyun.liu
>
> ------------------------------------------------------------------------
>
> *From:*Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
> *Sent:* December 18, 2019 10:52:47 PM
> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com <mailto:Shaoyun.Liu@amd.com>>; 
> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org> 
> <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
> of SRIOV
>
> Oh, by the way
>
> >>> Do we know the root cause why this function would ruin MEC ?
>
> Only we call this function right after VF FLR will ruin MEC and lead 
> to following KIQ ring test fail , and on bare-metal it is called 
> before gpu rest , so that's why on bare-metal we don't have this issue
>
> But the reason we cannot call it before VF FLR on SRIOV case was 
> already stated in this thread
>
> Thanks
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Thursday, December 19, 2019 11:49 AM
> To: shaoyunl <shaoyun.liu@amd.com <mailto:shaoyun.liu@amd.com>>; 
> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
> Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of 
> SRIOV
>
> Hi Shaoyun
>
> >>> Do we know the root cause why this function would ruin MEC ? From 
> the logic, I think this function should be called before FLR since we 
> need to disable the user queue submission first.
> Right now I don't know which detail step lead to KIQ ring test fail, I 
> totally agree with you that this func should be called before VF FLR, 
> but we cannot do it and the reason is described in The comment:
>
> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
> > access and stuck there, because KIQ probably won't work by that time
> > (e.g. you already made GFX hang)
>
>
> >>> I remembered the function should use hiq to communicate with HW , 
> shouldn't use kiq to access HW registerm,  has this been changed ?
> Tis function use WREG32/RREG32 to do register access, like all other 
> functions in KMD,  and WREG32/RREG32 will let KIQ to do the register 
> access If we are under dynamic SRIOV  mode (means we are SRIOV VF and 
> isn't under full exclusive mode)
>
> You see that if you call this func before EVENT_5 (event 5 triggers VF 
> FLR) then it will run under dynamic mode and KIQ will handle the 
> register access, which is not an option Since ME/MEC probably already 
> hang ( if we are testing quark on gfx/compute rings)
>
> Do you have a good suggestion ?
>
> thanks
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org 
> <mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of shaoyunl
> Sent: Tuesday, December 17, 2019 11:38 PM
> To: amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
> Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of 
> SRIOV
>
> I think amdkfd side depends on this call to stop the user queue, 
> without this call, the user queue can submit to HW during the reset 
> which could cause hang again ...
> Do we know the root cause why this function would ruin MEC ? From the 
> logic, I think this function should be called before FLR since we need 
> to disable the user queue submission first.
> I remembered the function should use hiq to communicate with HW , 
> shouldn't use kiq to access HW registerm,  has this been changed ?
>
>
> Regards
> shaoyun.liu
>
>
> On 2019-12-17 5:19 a.m., Monk Liu wrote:
> > issues:
> > MEC is ruined by the amdkfd_pre_reset after VF FLR done
> >
> > fix:
> > amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
> > FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
> > there is a limitation to block this sequence:
> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
> > access and stuck there, because KIQ probably won't work by that time
> > (e.g. you already made GFX hang)
> >
> > so the best way right now is to simply remove it.
> >
> > Signed-off-by: Monk Liu <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
> >   1 file changed, 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 605cef6..ae962b9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct 
> amdgpu_device *adev,
> >        if (r)
> >                return r;
> >
> > -     amdgpu_amdkfd_pre_reset(adev);
> > -
> >        /* Resume IP prior to SMC */
> >        r = amdgpu_device_ip_reinit_early_sriov(adev);
> >        if (r)
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&amp;sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&amp;reserved=0 
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639048992279&sdata=38z3sISWX26bZPplKeHvD0xIPCRbPAW%2BgKv2cXqetXc%3D&reserved=0>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639049012267&amp;sdata=se3rrEVIDZa677riVu5MAf95y%2BxndiDw5BULScsxFBc%3D&amp;reserved=0

[-- Attachment #1.2.1: Type: text/html, Size: 23232 bytes --]

[-- Attachment #1.2.2: image001.png --]
[-- Type: image/png, Size: 12243 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19 16:59               ` shaoyunl
@ 2019-12-19 22:44                 ` Felix Kuehling
  2019-12-19 23:25                   ` shaoyunl
  2019-12-20  2:46                 ` Liu, Monk
  1 sibling, 1 reply; 16+ messages in thread
From: Felix Kuehling @ 2019-12-19 22:44 UTC (permalink / raw)
  To: shaoyunl, Liu, Monk, amd-gfx

I'm thinking, if we know we're preparing for a GPU reset, maybe we 
shouldn't even try to suspend processes and stop the HIQ. 
kfd_suspend_all_processes, stop_cpsch and other functions up that call 
chain up to kgd2kfd_suspend could have a parameter (bool pre_reset) that 
would update the driver state but not touch the hardware. That avoids 
unnecessary timeouts on things that aren't expected to complete anyway.

Regards,
   Felix


On 2019-12-19 11:59 a.m., shaoyunl wrote:
>
> After check the code , in KFD side , should be simple just add the 
> check in stop_cpsch code . For kiq, there is no return for WREG32 , so 
> no easy way to check the return value . Maybe we can add kiq_status in 
> struct amdgpu_kiq  to indicate the kiq is hang or not ,  in 
> hdq_destroy function check this  kiq_status after acquire_queue , 
> finish the destroy function is kiq is hang for SRIOV only .
>
> Any comments ?
>
>
> shaoyun.liu
>
>
> On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:
>
>> I see, thanks for the detail information.
>> Normally when CP is hang, the hiq access to unmap the queue will 
>> failed before driver call to the hqd_destroy. I think driver should 
>> add the code to check the return value  and directly finish the 
>> pre_reset in this case . If the hiq does not hang but kiq hang. We 
>> can use the same logic in hqd_destroy function, return in first 
>> access failure instead go further.  With this change we probably can 
>> move the pre_reset function back to normal .
>> Felix, do you have any concerns or comments for the change.
>>
>> Regards
>> Shaoyun.liu
>> ------------------------------------------------------------------------
>> *From:* Liu, Monk <Monk.Liu@amd.com>
>> *Sent:* December 19, 2019 1:13:24 AM
>> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com>; 
>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>> of SRIOV
>>
>> >>> I would like to check why we need a special sequences for sriov 
>> on this pre_reset. If possible, make it the same as bare metal mode 
>> sounds better solution.
>>
>> Because before VF FLR calling function would lead to register access 
>> through KIQ,  which will not complete because KIQ/GFX already hang by 
>> that time
>>
>> >>> I don't remember any register access by amdkfd_pre_reset call,   
>> please let me know if this assumption is wrong .
>>
>> Please check “void pm_uninit(struct packet_manager *pm)” which is 
>> invoked inside of amdkfd_pre_reset() :
>>
>> It will call uninitialized() in kfd_kernel_queue.c file
>>
>> And then go to the path of “kq->mqd_mgr->destroy_mqd(…)”
>>
>> And finally it calls “static int kgd_hqd_destroy(…)” in 
>> amdgpu_amdkfd_gfx_v10.c
>>
>> 539 {
>>
>> 540     struct amdgpu_device *adev = get_amdgpu_device(kgd);
>>
>> 541     enum hqd_dequeue_request_type type;
>>
>> 542     unsigned long end_jiffies;
>>
>> 543     uint32_t temp;
>>
>> 544     struct v10_compute_mqd *m = get_mqd(mqd);
>>
>> 545
>>
>> 546 #if 0
>>
>> 547     unsigned long flags;
>>
>> 548     int retry;
>>
>> 549 #endif
>>
>> 550
>>
>> 551     acquire_queue(kgd, pipe_id, queue_id); //this introduce 
>> register access via KIQ
>>
>> 552
>>
>> 553     if (m->cp_hqd_vmid == 0)
>>
>> 554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); 
>> //this introduce register access via KIQ
>>
>> 555
>>
>> 556     switch (reset_type) {
>>
>> 557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
>>
>> 558         type = DRAIN_PIPE;
>>
>> 559         break;
>>
>> 560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
>>
>> 561         type = RESET_WAVES;
>>
>> 562         break;
>>
>> 563     default:
>>
>> 564         type = DRAIN_PIPE;
>>
>> 565         break;
>>
>> 566     }
>>
>> 624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), 
>> type); //this introduce register access via KIQ
>>
>> 625
>>
>> 626     end_jiffies = (utimeout * HZ / 1000) + jiffies;
>>
>> 627     while (true) {
>>
>> 628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); 
>> //this introduce register access via KIQ
>>
>> 629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
>>
>> 630             break;
>>
>> 631         if (time_after(jiffies, end_jiffies)) {
>>
>> 632             pr_err("cp queue preemption time out.\n");
>>
>> 633             release_queue(kgd);
>>
>> 634             return -ETIME;
>>
>> 635         }
>>
>> 636         usleep_range(500, 1000);
>>
>> 637     }
>>
>> 638
>>
>> 639     release_queue(kgd);
>>
>> 640     return 0;
>>
>> If we use the sequence from bare-metal, all above highlighted 
>> register access will not work because KIQ/GFX already died by that 
>> time which means the amdkfd_pre_reset() is actually  not working as 
>> expected.
>>
>> _____________________________________
>>
>> Monk Liu|GPU Virtualization Team |AMD
>>
>> sig-cloud-gpu
>>
>> *From:* Liu, Shaoyun <Shaoyun.Liu@amd.com>
>> *Sent:* Thursday, December 19, 2019 12:30 PM
>> *To:* Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> *Subject:* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>> of SRIOV
>>
>> I don't remember any register access by amdkfd_pre_reset call,   
>> please let me know if this assumption is wrong .
>> This function will use hiq to access CP, in case CP already hang, we 
>> might not able to get the response from hw and will got a timeout. I 
>> think kfd internal should handle this. Felix already have some 
>> comments on that.
>> I would like to check why we need a special sequences for sriov on 
>> this pre_reset. If possible, make it the same as bare metal mode 
>> sounds better solution.
>>
>> Regards
>> Shaoyun.liu
>>
>> ------------------------------------------------------------------------
>>
>> *From:*Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
>> *Sent:* December 18, 2019 10:52:47 PM
>> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com 
>> <mailto:Shaoyun.Liu@amd.com>>; amd-gfx@lists.freedesktop.org 
>> <mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org 
>> <mailto:amd-gfx@lists.freedesktop.org>>
>> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>> of SRIOV
>>
>> Oh, by the way
>>
>> >>> Do we know the root cause why this function would ruin MEC ?
>>
>> Only we call this function right after VF FLR will ruin MEC and lead 
>> to following KIQ ring test fail , and on bare-metal it is called 
>> before gpu rest , so that's why on bare-metal we don't have this issue
>>
>> But the reason we cannot call it before VF FLR on SRIOV case was 
>> already stated in this thread
>>
>> Thanks
>> _____________________________________
>> Monk Liu|GPU Virtualization Team |AMD
>>
>>
>> -----Original Message-----
>> From: Liu, Monk
>> Sent: Thursday, December 19, 2019 11:49 AM
>> To: shaoyunl <shaoyun.liu@amd.com <mailto:shaoyun.liu@amd.com>>; 
>> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>> Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of 
>> SRIOV
>>
>> Hi Shaoyun
>>
>> >>> Do we know the root cause why this function would ruin MEC ? From 
>> the logic, I think this function should be called before FLR since we 
>> need to disable the user queue submission first.
>> Right now I don't know which detail step lead to KIQ ring test fail, 
>> I totally agree with you that this func should be called before VF 
>> FLR, but we cannot do it and the reason is described in The comment:
>>
>> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
>> > access and stuck there, because KIQ probably won't work by that time
>> > (e.g. you already made GFX hang)
>>
>>
>> >>> I remembered the function should use hiq to communicate with HW , 
>> shouldn't use kiq to access HW registerm,  has this been changed ?
>> Tis function use WREG32/RREG32 to do register access, like all other 
>> functions in KMD,  and WREG32/RREG32 will let KIQ to do the register 
>> access If we are under dynamic SRIOV  mode (means we are SRIOV VF and 
>> isn't under full exclusive mode)
>>
>> You see that if you call this func before EVENT_5 (event 5 triggers 
>> VF FLR) then it will run under dynamic mode and KIQ will handle the 
>> register access, which is not an option Since ME/MEC probably already 
>> hang ( if we are testing quark on gfx/compute rings)
>>
>> Do you have a good suggestion ?
>>
>> thanks
>> _____________________________________
>> Monk Liu|GPU Virtualization Team |AMD
>>
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org 
>> <mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of shaoyunl
>> Sent: Tuesday, December 17, 2019 11:38 PM
>> To: amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of 
>> SRIOV
>>
>> I think amdkfd side depends on this call to stop the user queue, 
>> without this call, the user queue can submit to HW during the reset 
>> which could cause hang again ...
>> Do we know the root cause why this function would ruin MEC ? From the 
>> logic, I think this function should be called before FLR since we 
>> need to disable the user queue submission first.
>> I remembered the function should use hiq to communicate with HW , 
>> shouldn't use kiq to access HW registerm,  has this been changed ?
>>
>>
>> Regards
>> shaoyun.liu
>>
>>
>> On 2019-12-17 5:19 a.m., Monk Liu wrote:
>> > issues:
>> > MEC is ruined by the amdkfd_pre_reset after VF FLR done
>> >
>> > fix:
>> > amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
>> > FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
>> > there is a limitation to block this sequence:
>> > if we do pre_reset() before VF FLR, it would go KIQ way to do register
>> > access and stuck there, because KIQ probably won't work by that time
>> > (e.g. you already made GFX hang)
>> >
>> > so the best way right now is to simply remove it.
>> >
>> > Signed-off-by: Monk Liu <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
>> > ---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>> >   1 file changed, 2 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > index 605cef6..ae962b9 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct 
>> amdgpu_device *adev,
>> >        if (r)
>> >                return r;
>> >
>> > -     amdgpu_amdkfd_pre_reset(adev);
>> > -
>> >        /* Resume IP prior to SMC */
>> >        r = amdgpu_device_ip_reinit_early_sriov(adev);
>> >        if (r)
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19 22:44                 ` Felix Kuehling
@ 2019-12-19 23:25                   ` shaoyunl
  2019-12-20  0:27                     ` Felix Kuehling
  0 siblings, 1 reply; 16+ messages in thread
From: shaoyunl @ 2019-12-19 23:25 UTC (permalink / raw)
  To: Felix Kuehling, Liu, Monk, amd-gfx

How we prevent the  user queue from submitting on the  following FLR  if 
we didn't unmap the  user queues . It's possible that CP still not hang 
when other part HW get hang and  need a reset .

Om, but probably it's ok since after FLR , all the hqd will be reset to 
unmapped by default by HW  and existing user queue need to be re-created 
anyway ...

I think i'm ok with your proposal . Can KFD team prepare the change ?


Regards

shaoyun.liu



On 2019-12-19 5:44 p.m., Felix Kuehling wrote:
> I'm thinking, if we know we're preparing for a GPU reset, maybe we 
> shouldn't even try to suspend processes and stop the HIQ. 
> kfd_suspend_all_processes, stop_cpsch and other functions up that call 
> chain up to kgd2kfd_suspend could have a parameter (bool pre_reset) 
> that would update the driver state but not touch the hardware. That 
> avoids unnecessary timeouts on things that aren't expected to complete 
> anyway.
>
> Regards,
>   Felix
>
>
> On 2019-12-19 11:59 a.m., shaoyunl wrote:
>>
>> After check the code , in KFD side , should be simple just add the 
>> check in stop_cpsch code . For kiq, there is no return for WREG32 , 
>> so no easy way to check the return value . Maybe we can add 
>> kiq_status in struct amdgpu_kiq  to indicate the kiq is hang or not 
>> ,  in hdq_destroy function check this  kiq_status after acquire_queue 
>> , finish the destroy function is kiq is hang for SRIOV only .
>>
>> Any comments ?
>>
>>
>> shaoyun.liu
>>
>>
>> On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:
>>
>>> I see, thanks for the detail information.
>>> Normally when CP is hang, the hiq access to unmap the queue will 
>>> failed before driver call to the hqd_destroy. I think driver should 
>>> add the code to check the return value  and directly finish the 
>>> pre_reset in this case . If the hiq does not hang but kiq hang. We 
>>> can use the same logic in hqd_destroy function, return in first 
>>> access failure instead go further.  With this change we probably can 
>>> move the pre_reset function back to normal .
>>> Felix, do you have any concerns or comments for the change.
>>>
>>> Regards
>>> Shaoyun.liu
>>> ------------------------------------------------------------------------ 
>>>
>>> *From:* Liu, Monk <Monk.Liu@amd.com>
>>> *Sent:* December 19, 2019 1:13:24 AM
>>> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com>; 
>>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>>> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>>> of SRIOV
>>>
>>> >>> I would like to check why we need a special sequences for sriov 
>>> on this pre_reset. If possible, make it the same as bare metal mode 
>>> sounds better solution.
>>>
>>> Because before VF FLR calling function would lead to register access 
>>> through KIQ,  which will not complete because KIQ/GFX already hang 
>>> by that time
>>>
>>> >>> I don't remember any register access by amdkfd_pre_reset call,   
>>> please let me know if this assumption is wrong .
>>>
>>> Please check “void pm_uninit(struct packet_manager *pm)” which is 
>>> invoked inside of amdkfd_pre_reset() :
>>>
>>> It will call uninitialized() in kfd_kernel_queue.c file
>>>
>>> And then go to the path of “kq->mqd_mgr->destroy_mqd(…)”
>>>
>>> And finally it calls “static int kgd_hqd_destroy(…)” in 
>>> amdgpu_amdkfd_gfx_v10.c
>>>
>>> 539 {
>>>
>>> 540     struct amdgpu_device *adev = get_amdgpu_device(kgd);
>>>
>>> 541     enum hqd_dequeue_request_type type;
>>>
>>> 542     unsigned long end_jiffies;
>>>
>>> 543     uint32_t temp;
>>>
>>> 544     struct v10_compute_mqd *m = get_mqd(mqd);
>>>
>>> 545
>>>
>>> 546 #if 0
>>>
>>> 547     unsigned long flags;
>>>
>>> 548     int retry;
>>>
>>> 549 #endif
>>>
>>> 550
>>>
>>> 551     acquire_queue(kgd, pipe_id, queue_id); //this introduce 
>>> register access via KIQ
>>>
>>> 552
>>>
>>> 553     if (m->cp_hqd_vmid == 0)
>>>
>>> 554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); 
>>> //this introduce register access via KIQ
>>>
>>> 555
>>>
>>> 556     switch (reset_type) {
>>>
>>> 557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
>>>
>>> 558         type = DRAIN_PIPE;
>>>
>>> 559         break;
>>>
>>> 560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
>>>
>>> 561         type = RESET_WAVES;
>>>
>>> 562         break;
>>>
>>> 563     default:
>>>
>>> 564         type = DRAIN_PIPE;
>>>
>>> 565         break;
>>>
>>> 566     }
>>>
>>> 624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), 
>>> type); //this introduce register access via KIQ
>>>
>>> 625
>>>
>>> 626     end_jiffies = (utimeout * HZ / 1000) + jiffies;
>>>
>>> 627     while (true) {
>>>
>>> 628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); 
>>> //this introduce register access via KIQ
>>>
>>> 629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
>>>
>>> 630             break;
>>>
>>> 631         if (time_after(jiffies, end_jiffies)) {
>>>
>>> 632             pr_err("cp queue preemption time out.\n");
>>>
>>> 633             release_queue(kgd);
>>>
>>> 634             return -ETIME;
>>>
>>> 635         }
>>>
>>> 636         usleep_range(500, 1000);
>>>
>>> 637     }
>>>
>>> 638
>>>
>>> 639     release_queue(kgd);
>>>
>>> 640     return 0;
>>>
>>> If we use the sequence from bare-metal, all above highlighted 
>>> register access will not work because KIQ/GFX already died by that 
>>> time which means the amdkfd_pre_reset() is actually  not working as 
>>> expected.
>>>
>>> _____________________________________
>>>
>>> Monk Liu|GPU Virtualization Team |AMD
>>>
>>> sig-cloud-gpu
>>>
>>> *From:* Liu, Shaoyun <Shaoyun.Liu@amd.com>
>>> *Sent:* Thursday, December 19, 2019 12:30 PM
>>> *To:* Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> *Subject:* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>>> of SRIOV
>>>
>>> I don't remember any register access by amdkfd_pre_reset call,   
>>> please let me know if this assumption is wrong .
>>> This function will use hiq to access CP, in case CP already hang, we 
>>> might not able to get the response from hw and will got a timeout. I 
>>> think kfd internal should handle this. Felix already have some 
>>> comments on that.
>>> I would like to check why we need a special sequences for sriov on 
>>> this pre_reset. If possible, make it the same as bare metal mode 
>>> sounds better solution.
>>>
>>> Regards
>>> Shaoyun.liu
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>> *From:*Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
>>> *Sent:* December 18, 2019 10:52:47 PM
>>> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com 
>>> <mailto:Shaoyun.Liu@amd.com>>; amd-gfx@lists.freedesktop.org 
>>> <mailto:amd-gfx@lists.freedesktop.org> 
>>> <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
>>> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>>> of SRIOV
>>>
>>> Oh, by the way
>>>
>>> >>> Do we know the root cause why this function would ruin MEC ?
>>>
>>> Only we call this function right after VF FLR will ruin MEC and lead 
>>> to following KIQ ring test fail , and on bare-metal it is called 
>>> before gpu rest , so that's why on bare-metal we don't have this issue
>>>
>>> But the reason we cannot call it before VF FLR on SRIOV case was 
>>> already stated in this thread
>>>
>>> Thanks
>>> _____________________________________
>>> Monk Liu|GPU Virtualization Team |AMD
>>>
>>>
>>> -----Original Message-----
>>> From: Liu, Monk
>>> Sent: Thursday, December 19, 2019 11:49 AM
>>> To: shaoyunl <shaoyun.liu@amd.com <mailto:shaoyun.liu@amd.com>>; 
>>> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>>> Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>>> of SRIOV
>>>
>>> Hi Shaoyun
>>>
>>> >>> Do we know the root cause why this function would ruin MEC ? 
>>> From the logic, I think this function should be called before FLR 
>>> since we need to disable the user queue submission first.
>>> Right now I don't know which detail step lead to KIQ ring test fail, 
>>> I totally agree with you that this func should be called before VF 
>>> FLR, but we cannot do it and the reason is described in The comment:
>>>
>>> > if we do pre_reset() before VF FLR, it would go KIQ way to do 
>>> register
>>> > access and stuck there, because KIQ probably won't work by that time
>>> > (e.g. you already made GFX hang)
>>>
>>>
>>> >>> I remembered the function should use hiq to communicate with HW 
>>> , shouldn't use kiq to access HW registerm,  has this been changed ?
>>> Tis function use WREG32/RREG32 to do register access, like all other 
>>> functions in KMD,  and WREG32/RREG32 will let KIQ to do the register 
>>> access If we are under dynamic SRIOV  mode (means we are SRIOV VF 
>>> and isn't under full exclusive mode)
>>>
>>> You see that if you call this func before EVENT_5 (event 5 triggers 
>>> VF FLR) then it will run under dynamic mode and KIQ will handle the 
>>> register access, which is not an option Since ME/MEC probably 
>>> already hang ( if we are testing quark on gfx/compute rings)
>>>
>>> Do you have a good suggestion ?
>>>
>>> thanks
>>> _____________________________________
>>> Monk Liu|GPU Virtualization Team |AMD
>>>
>>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org 
>>> <mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of shaoyunl
>>> Sent: Tuesday, December 17, 2019 11:38 PM
>>> To: amd-gfx@lists.freedesktop.org 
>>> <mailto:amd-gfx@lists.freedesktop.org>
>>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>>> of SRIOV
>>>
>>> I think amdkfd side depends on this call to stop the user queue, 
>>> without this call, the user queue can submit to HW during the reset 
>>> which could cause hang again ...
>>> Do we know the root cause why this function would ruin MEC ? From 
>>> the logic, I think this function should be called before FLR since 
>>> we need to disable the user queue submission first.
>>> I remembered the function should use hiq to communicate with HW , 
>>> shouldn't use kiq to access HW registerm,  has this been changed ?
>>>
>>>
>>> Regards
>>> shaoyun.liu
>>>
>>>
>>> On 2019-12-17 5:19 a.m., Monk Liu wrote:
>>> > issues:
>>> > MEC is ruined by the amdkfd_pre_reset after VF FLR done
>>> >
>>> > fix:
>>> > amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
>>> > FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
>>> > there is a limitation to block this sequence:
>>> > if we do pre_reset() before VF FLR, it would go KIQ way to do 
>>> register
>>> > access and stuck there, because KIQ probably won't work by that time
>>> > (e.g. you already made GFX hang)
>>> >
>>> > so the best way right now is to simply remove it.
>>> >
>>> > Signed-off-by: Monk Liu <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
>>> > ---
>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>>> >   1 file changed, 2 deletions(-)
>>> >
>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> > index 605cef6..ae962b9 100644
>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> > @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct 
>>> amdgpu_device *adev,
>>> >        if (r)
>>> >                return r;
>>> >
>>> > -     amdgpu_amdkfd_pre_reset(adev);
>>> > -
>>> >        /* Resume IP prior to SMC */
>>> >        r = amdgpu_device_ip_reinit_early_sriov(adev);
>>> >        if (r)
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19 23:25                   ` shaoyunl
@ 2019-12-20  0:27                     ` Felix Kuehling
  0 siblings, 0 replies; 16+ messages in thread
From: Felix Kuehling @ 2019-12-20  0:27 UTC (permalink / raw)
  To: shaoyunl, Liu, Monk, amd-gfx

I can prepare a patch, but I can't test it thoroughly. It would need to 
be tested on bare-metal and SRIOV.

Regards,
   Felix

On 2019-12-19 18:25, shaoyunl wrote:
> How we prevent the  user queue from submitting on the  following FLR  
> if we didn't unmap the  user queues . It's possible that CP still not 
> hang when other part HW get hang and  need a reset .
>
> Om, but probably it's ok since after FLR , all the hqd will be reset 
> to unmapped by default by HW  and existing user queue need to be 
> re-created anyway ...
>
> I think i'm ok with your proposal . Can KFD team prepare the change ?
>
>
> Regards
>
> shaoyun.liu
>
>
>
> On 2019-12-19 5:44 p.m., Felix Kuehling wrote:
>> I'm thinking, if we know we're preparing for a GPU reset, maybe we 
>> shouldn't even try to suspend processes and stop the HIQ. 
>> kfd_suspend_all_processes, stop_cpsch and other functions up that 
>> call chain up to kgd2kfd_suspend could have a parameter (bool 
>> pre_reset) that would update the driver state but not touch the 
>> hardware. That avoids unnecessary timeouts on things that aren't 
>> expected to complete anyway.
>>
>> Regards,
>>   Felix
>>
>>
>> On 2019-12-19 11:59 a.m., shaoyunl wrote:
>>>
>>> After check the code , in KFD side , should be simple just add the 
>>> check in stop_cpsch code . For kiq, there is no return for WREG32 , 
>>> so no easy way to check the return value . Maybe we can add 
>>> kiq_status in struct amdgpu_kiq  to indicate the kiq is hang or not 
>>> ,  in hdq_destroy function check this kiq_status after acquire_queue 
>>> , finish the destroy function is kiq is hang for SRIOV only .
>>>
>>> Any comments ?
>>>
>>>
>>> shaoyun.liu
>>>
>>>
>>> On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:
>>>
>>>> I see, thanks for the detail information.
>>>> Normally when CP is hang, the hiq access to unmap the queue will 
>>>> failed before driver call to the hqd_destroy. I think driver should 
>>>> add the code to check the return value  and directly finish the 
>>>> pre_reset in this case . If the hiq does not hang but kiq hang. We 
>>>> can use the same logic in hqd_destroy function, return in first 
>>>> access failure instead go further.  With this change we probably 
>>>> can move the pre_reset function back to normal .
>>>> Felix, do you have any concerns or comments for the change.
>>>>
>>>> Regards
>>>> Shaoyun.liu
>>>> ------------------------------------------------------------------------ 
>>>>
>>>> *From:* Liu, Monk <Monk.Liu@amd.com>
>>>> *Sent:* December 19, 2019 1:13:24 AM
>>>> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com>; 
>>>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>>>> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in 
>>>> TDR of SRIOV
>>>>
>>>> >>> I would like to check why we need a special sequences for sriov 
>>>> on this pre_reset. If possible, make it the same as bare metal mode 
>>>> sounds better solution.
>>>>
>>>> Because before VF FLR calling function would lead to register 
>>>> access through KIQ,  which will not complete because KIQ/GFX 
>>>> already hang by that time
>>>>
>>>> >>> I don't remember any register access by amdkfd_pre_reset 
>>>> call,   please let me know if this assumption is wrong .
>>>>
>>>> Please check “void pm_uninit(struct packet_manager *pm)” which is 
>>>> invoked inside of amdkfd_pre_reset() :
>>>>
>>>> It will call uninitialized() in kfd_kernel_queue.c file
>>>>
>>>> And then go to the path of “kq->mqd_mgr->destroy_mqd(…)”
>>>>
>>>> And finally it calls “static int kgd_hqd_destroy(…)” in 
>>>> amdgpu_amdkfd_gfx_v10.c
>>>>
>>>> 539 {
>>>>
>>>> 540     struct amdgpu_device *adev = get_amdgpu_device(kgd);
>>>>
>>>> 541     enum hqd_dequeue_request_type type;
>>>>
>>>> 542     unsigned long end_jiffies;
>>>>
>>>> 543     uint32_t temp;
>>>>
>>>> 544     struct v10_compute_mqd *m = get_mqd(mqd);
>>>>
>>>> 545
>>>>
>>>> 546 #if 0
>>>>
>>>> 547     unsigned long flags;
>>>>
>>>> 548     int retry;
>>>>
>>>> 549 #endif
>>>>
>>>> 550
>>>>
>>>> 551     acquire_queue(kgd, pipe_id, queue_id); //this introduce 
>>>> register access via KIQ
>>>>
>>>> 552
>>>>
>>>> 553     if (m->cp_hqd_vmid == 0)
>>>>
>>>> 554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 
>>>> 0); //this introduce register access via KIQ
>>>>
>>>> 555
>>>>
>>>> 556     switch (reset_type) {
>>>>
>>>> 557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
>>>>
>>>> 558         type = DRAIN_PIPE;
>>>>
>>>> 559         break;
>>>>
>>>> 560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
>>>>
>>>> 561         type = RESET_WAVES;
>>>>
>>>> 562         break;
>>>>
>>>> 563     default:
>>>>
>>>> 564         type = DRAIN_PIPE;
>>>>
>>>> 565         break;
>>>>
>>>> 566     }
>>>>
>>>> 624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), 
>>>> type); //this introduce register access via KIQ
>>>>
>>>> 625
>>>>
>>>> 626     end_jiffies = (utimeout * HZ / 1000) + jiffies;
>>>>
>>>> 627     while (true) {
>>>>
>>>> 628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, 
>>>> mmCP_HQD_ACTIVE)); //this introduce register access via KIQ
>>>>
>>>> 629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
>>>>
>>>> 630             break;
>>>>
>>>> 631         if (time_after(jiffies, end_jiffies)) {
>>>>
>>>> 632             pr_err("cp queue preemption time out.\n");
>>>>
>>>> 633             release_queue(kgd);
>>>>
>>>> 634             return -ETIME;
>>>>
>>>> 635         }
>>>>
>>>> 636         usleep_range(500, 1000);
>>>>
>>>> 637     }
>>>>
>>>> 638
>>>>
>>>> 639     release_queue(kgd);
>>>>
>>>> 640     return 0;
>>>>
>>>> If we use the sequence from bare-metal, all above highlighted 
>>>> register access will not work because KIQ/GFX already died by that 
>>>> time which means the amdkfd_pre_reset() is actually  not working as 
>>>> expected.
>>>>
>>>> _____________________________________
>>>>
>>>> Monk Liu|GPU Virtualization Team |AMD
>>>>
>>>> sig-cloud-gpu
>>>>
>>>> *From:* Liu, Shaoyun <Shaoyun.Liu@amd.com>
>>>> *Sent:* Thursday, December 19, 2019 12:30 PM
>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> *Subject:* Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in 
>>>> TDR of SRIOV
>>>>
>>>> I don't remember any register access by amdkfd_pre_reset call,   
>>>> please let me know if this assumption is wrong .
>>>> This function will use hiq to access CP, in case CP already hang, 
>>>> we might not able to get the response from hw and will got a 
>>>> timeout. I think kfd internal should handle this. Felix already 
>>>> have some comments on that.
>>>> I would like to check why we need a special sequences for sriov on 
>>>> this pre_reset. If possible, make it the same as bare metal mode 
>>>> sounds better solution.
>>>>
>>>> Regards
>>>> Shaoyun.liu
>>>>
>>>> ------------------------------------------------------------------------ 
>>>>
>>>>
>>>> *From:*Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
>>>> *Sent:* December 18, 2019 10:52:47 PM
>>>> *To:* Liu, Shaoyun <Shaoyun.Liu@amd.com 
>>>> <mailto:Shaoyun.Liu@amd.com>>; amd-gfx@lists.freedesktop.org 
>>>> <mailto:amd-gfx@lists.freedesktop.org> 
>>>> <amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>>
>>>> *Subject:* RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in 
>>>> TDR of SRIOV
>>>>
>>>> Oh, by the way
>>>>
>>>> >>> Do we know the root cause why this function would ruin MEC ?
>>>>
>>>> Only we call this function right after VF FLR will ruin MEC and 
>>>> lead to following KIQ ring test fail , and on bare-metal it is 
>>>> called before gpu rest , so that's why on bare-metal we don't have 
>>>> this issue
>>>>
>>>> But the reason we cannot call it before VF FLR on SRIOV case was 
>>>> already stated in this thread
>>>>
>>>> Thanks
>>>> _____________________________________
>>>> Monk Liu|GPU Virtualization Team |AMD
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Liu, Monk
>>>> Sent: Thursday, December 19, 2019 11:49 AM
>>>> To: shaoyunl <shaoyun.liu@amd.com <mailto:shaoyun.liu@amd.com>>; 
>>>> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>>>> Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>>>> of SRIOV
>>>>
>>>> Hi Shaoyun
>>>>
>>>> >>> Do we know the root cause why this function would ruin MEC ? 
>>>> From the logic, I think this function should be called before FLR 
>>>> since we need to disable the user queue submission first.
>>>> Right now I don't know which detail step lead to KIQ ring test 
>>>> fail, I totally agree with you that this func should be called 
>>>> before VF FLR, but we cannot do it and the reason is described in 
>>>> The comment:
>>>>
>>>> > if we do pre_reset() before VF FLR, it would go KIQ way to do 
>>>> register
>>>> > access and stuck there, because KIQ probably won't work by that time
>>>> > (e.g. you already made GFX hang)
>>>>
>>>>
>>>> >>> I remembered the function should use hiq to communicate with HW 
>>>> , shouldn't use kiq to access HW registerm,  has this been changed ?
>>>> Tis function use WREG32/RREG32 to do register access, like all 
>>>> other functions in KMD,  and WREG32/RREG32 will let KIQ to do the 
>>>> register access If we are under dynamic SRIOV mode (means we are 
>>>> SRIOV VF and isn't under full exclusive mode)
>>>>
>>>> You see that if you call this func before EVENT_5 (event 5 triggers 
>>>> VF FLR) then it will run under dynamic mode and KIQ will handle the 
>>>> register access, which is not an option Since ME/MEC probably 
>>>> already hang ( if we are testing quark on gfx/compute rings)
>>>>
>>>> Do you have a good suggestion ?
>>>>
>>>> thanks
>>>> _____________________________________
>>>> Monk Liu|GPU Virtualization Team |AMD
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org 
>>>> <mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of shaoyunl
>>>> Sent: Tuesday, December 17, 2019 11:38 PM
>>>> To: amd-gfx@lists.freedesktop.org 
>>>> <mailto:amd-gfx@lists.freedesktop.org>
>>>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR 
>>>> of SRIOV
>>>>
>>>> I think amdkfd side depends on this call to stop the user queue, 
>>>> without this call, the user queue can submit to HW during the reset 
>>>> which could cause hang again ...
>>>> Do we know the root cause why this function would ruin MEC ? From 
>>>> the logic, I think this function should be called before FLR since 
>>>> we need to disable the user queue submission first.
>>>> I remembered the function should use hiq to communicate with HW , 
>>>> shouldn't use kiq to access HW registerm,  has this been changed ?
>>>>
>>>>
>>>> Regards
>>>> shaoyun.liu
>>>>
>>>>
>>>> On 2019-12-17 5:19 a.m., Monk Liu wrote:
>>>> > issues:
>>>> > MEC is ruined by the amdkfd_pre_reset after VF FLR done
>>>> >
>>>> > fix:
>>>> > amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
>>>> > FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
>>>> > there is a limitation to block this sequence:
>>>> > if we do pre_reset() before VF FLR, it would go KIQ way to do 
>>>> register
>>>> > access and stuck there, because KIQ probably won't work by that time
>>>> > (e.g. you already made GFX hang)
>>>> >
>>>> > so the best way right now is to simply remove it.
>>>> >
>>>> > Signed-off-by: Monk Liu <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>
>>>> > ---
>>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>>>> >   1 file changed, 2 deletions(-)
>>>> >
>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > index 605cef6..ae962b9 100644
>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct 
>>>> amdgpu_device *adev,
>>>> >        if (r)
>>>> >                return r;
>>>> >
>>>> > -     amdgpu_amdkfd_pre_reset(adev);
>>>> > -
>>>> >        /* Resume IP prior to SMC */
>>>> >        r = amdgpu_device_ip_reinit_early_sriov(adev);
>>>> >        if (r)
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
  2019-12-19 16:59               ` shaoyunl
  2019-12-19 22:44                 ` Felix Kuehling
@ 2019-12-20  2:46                 ` Liu, Monk
  1 sibling, 0 replies; 16+ messages in thread
From: Liu, Monk @ 2019-12-20  2:46 UTC (permalink / raw)
  To: Liu, Shaoyun, amd-gfx


[-- Attachment #1.1.1: Type: text/plain, Size: 11532 bytes --]

>>>. For kiq, there is no return for WREG3

We can make amdgpu_virt_kiq_wreg() return a value if really needed, e.g.: return if this write success

_____________________________________
Monk Liu|GPU Virtualization Team |AMD
[sig-cloud-gpu]

From: Liu, Shaoyun <Shaoyun.Liu@amd.com>
Sent: Friday, December 20, 2019 12:59 AM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCwH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV


After check the code , in KFD side , should be simple just add the check in stop_cpsch code . For kiq, there is no return for WREG32 , so no easy way to check the return value . Maybe we can add kiq_status in struct amdgpu_kiq  to indicate the kiq is hang or not ,  in hdq_destroy function check this  kiq_status after acquire_queue , finish the destroy function is kiq is hang for SRIOV only .

Any comments ?



shaoyun.liu



On 2019-12-19 9:51 a.m., Liu, Shaoyun wrote:
I see, thanks for the detail information.
Normally when CP is hang, the hiq access to unmap the queue will failed before driver call to the hqd_destroy. I think driver should add the code to check the return value  and directly finish the pre_reset in this case . If the hiq does not hang but kiq hang. We can use the same logic in hqd_destroy function,  return in first access failure instead go further.  With this change we probably can move the pre_reset function back to normal .
Felix, do you have any concerns or comments for the change.

Regards
Shaoyun.liu
________________________________
From: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>
Sent: December 19, 2019 1:13:24 AM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com><mailto:Shaoyun.Liu@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org><mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV


>>> I would like to check why we need a special sequences for sriov on this pre_reset. If possible, make it the same as bare metal mode sounds better solution.



Because before VF FLR calling function would lead to register access through KIQ,  which will not complete because KIQ/GFX already hang by that time



>>> I don't remember any register access by amdkfd_pre_reset call,   please let me know if this assumption is wrong .



Please check "void pm_uninit(struct packet_manager *pm)" which is invoked inside of amdkfd_pre_reset() :



It will call uninitialized() in kfd_kernel_queue.c file



And then go to the path of "kq->mqd_mgr->destroy_mqd(...)"



And finally it calls "static int kgd_hqd_destroy(...)" in amdgpu_amdkfd_gfx_v10.c





539 {

540     struct amdgpu_device *adev = get_amdgpu_device(kgd);

541     enum hqd_dequeue_request_type type;

542     unsigned long end_jiffies;

543     uint32_t temp;

544     struct v10_compute_mqd *m = get_mqd(mqd);

545

546 #if 0

547     unsigned long flags;

548     int retry;

549 #endif

550

551     acquire_queue(kgd, pipe_id, queue_id); //this introduce register access via KIQ

552

553     if (m->cp_hqd_vmid == 0)

554         WREG32_FIELD15(GC, 0, RLC_CP_SCHEDULERS, scheduler1, 0); //this introduce register access via KIQ

555

556     switch (reset_type) {

557     case KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:

558         type = DRAIN_PIPE;

559         break;

560     case KFD_PREEMPT_TYPE_WAVEFRONT_RESET:

561         type = RESET_WAVES;

562         break;

563     default:

564         type = DRAIN_PIPE;

565         break;

566     }

624     WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), type); //this introduce register access via KIQ

625

626     end_jiffies = (utimeout * HZ / 1000) + jiffies;

627     while (true) {

628         temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE)); //this introduce register access via KIQ

629         if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))

630             break;

631         if (time_after(jiffies, end_jiffies)) {

632             pr_err("cp queue preemption time out.\n");

633             release_queue(kgd);

634             return -ETIME;

635         }

636         usleep_range(500, 1000);

637     }

638

639     release_queue(kgd);

640     return 0;



If we use the sequence from bare-metal, all above highlighted register access will not work because KIQ/GFX already died by that time which means the amdkfd_pre_reset() is actually  not working as expected.



_____________________________________

Monk Liu|GPU Virtualization Team |AMD

[sig-cloud-gpu]



From: Liu, Shaoyun <Shaoyun.Liu@amd.com><mailto:Shaoyun.Liu@amd.com>
Sent: Thursday, December 19, 2019 12:30 PM
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV



I don't remember any register access by amdkfd_pre_reset call,   please let me know if this assumption is wrong .
This function will use hiq to access CP, in case CP already hang, we might not able to get the response from hw and will got a timeout. I think kfd internal should handle this. Felix already have some comments on that.
I would like to check why we need a special sequences for sriov on this pre_reset. If possible, make it the same as bare metal mode sounds better solution.

Regards
Shaoyun.liu

________________________________

From: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>
Sent: December 18, 2019 10:52:47 PM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com<mailto:Shaoyun.Liu@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV



Oh, by the way

>>> Do we know the root cause why this function would ruin MEC ?

Only we call this function right after VF FLR will ruin MEC and lead to following KIQ ring test fail , and on bare-metal it is called before gpu rest , so that's why on bare-metal we don't have this issue

But the reason we cannot call it before VF FLR on SRIOV case was already stated in this thread

Thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Liu, Monk
Sent: Thursday, December 19, 2019 11:49 AM
To: shaoyunl <shaoyun.liu@amd.com<mailto:shaoyun.liu@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Hi Shaoyun

>>> Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
Right now I don't know which detail step lead to KIQ ring test fail, I totally agree with you that this func should be called before VF FLR, but we cannot do it and the reason is described in The comment:

> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)


>>> I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?
Tis function use WREG32/RREG32 to do register access, like all other functions in KMD,  and WREG32/RREG32 will let KIQ to do the register access If we are under dynamic SRIOV  mode (means we are SRIOV VF and isn't under full exclusive mode)

You see that if you call this func before EVENT_5 (event 5 triggers VF FLR) then it will run under dynamic mode and KIQ will handle the register access, which is not an option Since ME/MEC probably already hang ( if we are testing quark on gfx/compute rings)

Do you have a good suggestion ?

thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I think amdkfd side depends on this call to stop the user queue, without this call, the user queue can submit to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin MEC ? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
I remembered the function should use hiq to communicate with HW , shouldn't use kiq to access HW registerm,  has this been changed ?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF
> FLR, the correct sequence is do amdkfd_pre_reset before VF FLR but
> there is a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ way to do register
> access and stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>        if (r)
>                return r;
>
> -     amdgpu_amdkfd_pre_reset(adev);
> -
>        /* Resume IP prior to SMC */
>        r = amdgpu_device_ip_reinit_early_sriov(adev);
>        if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7Cee9c811452634fc2739808d7830718f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637121938885721447&amp;sdata=FiqkgiUX8k5rD%2F%2FiJQU2cF1MGExO8yXEzYOoBtpdfYU%3D&amp;reserved=0<https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639048992279&sdata=38z3sISWX26bZPplKeHvD0xIPCRbPAW%2BgKv2cXqetXc%3D&reserved=0>



_______________________________________________

amd-gfx mailing list

amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CShaoyun.Liu%40amd.com%7Cff429b9d30b24af8955508d78492e8bb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637123639049012267&amp;sdata=se3rrEVIDZa677riVu5MAf95y%2BxndiDw5BULScsxFBc%3D&amp;reserved=0

[-- Attachment #1.1.2: Type: text/html, Size: 23297 bytes --]

[-- Attachment #1.2: image001.png --]
[-- Type: image/png, Size: 12243 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-12-20  2:46 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-17 10:19 [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV of SRIOV Monk Liu
2019-12-17 10:19 ` [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR " Monk Liu
2019-12-17 10:38   ` Deng, Emily
2019-12-17 15:37   ` shaoyunl
2019-12-17 21:14     ` Felix Kuehling
2019-12-19  3:49     ` Liu, Monk
2019-12-19  3:52       ` Liu, Monk
2019-12-19  4:29         ` Liu, Shaoyun
2019-12-19  6:13           ` Liu, Monk
2019-12-19 14:51             ` [PATCwH " Liu, Shaoyun
2019-12-19 16:59               ` shaoyunl
2019-12-19 22:44                 ` Felix Kuehling
2019-12-19 23:25                   ` shaoyunl
2019-12-20  0:27                     ` Felix Kuehling
2019-12-20  2:46                 ` Liu, Monk
2019-12-17 10:31 ` [PATCH 1/2] drm/amdgpu: fix double gpu_recovery for NV " Deng, Emily

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.