* [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
@ 2021-07-28  6:36 Changfeng
  2021-07-28 13:21 ` Alex Deucher
  2021-07-28 14:21 ` Felix Kuehling
  0 siblings, 2 replies; 5+ messages in thread
From: Changfeng @ 2021-07-28  6:36 UTC (permalink / raw)
  To: amd-gfx, Felix.Kuehling, Ray.Huang, Yifan1.Zhang; +Cc: changzhu

From: Changfeng <Changfeng.Zhu@amd.com>

No issues were found with noretry=1 except two SVM migrate failures.
In contrast, most SVM test cases fail with noretry=0.
The two SVM migrate failures also occur with noretry=0, so set the
default to noretry=1 for raven for now to fix most of the SVM failures.

Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc
Signed-off-by: Changfeng <Changfeng.Zhu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 09edfb64cce0..d7f69dbd48e6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -606,19 +606,20 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
 		 * noretry = 0 will cause kfd page fault tests fail
 		 * for some ASICs, so set default to 1 for these ASICs.
 		 */
+	case CHIP_RAVEN:
+		/*
+		 * TODO: noretry = 1 fixes most SVM issues on Raven.
+		 * However, two kfd migrate tests still fail with
+		 * noretry = 1. These two migrate failures on Raven
+		 * still need to be root-caused.
+		 */
 		if (amdgpu_noretry == -1)
 			gmc->noretry = 1;
 		else
 			gmc->noretry = amdgpu_noretry;
 		break;
-	case CHIP_RAVEN:
 	default:
-		/* Raven currently has issues with noretry
-		 * regardless of what we decide for other
-		 * asics, we should leave raven with
-		 * noretry = 0 until we root cause the
-		 * issues.
-		 *
+		/*
 		 * default this to 0 for now, but we may want
 		 * to change this in the future for certain
 		 * GPUs as it can increase performance in
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
  2021-07-28  6:36 [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven Changfeng
@ 2021-07-28 13:21 ` Alex Deucher
  2021-07-28 14:21 ` Felix Kuehling
  1 sibling, 0 replies; 5+ messages in thread
From: Alex Deucher @ 2021-07-28 13:21 UTC (permalink / raw)
  To: Changfeng; +Cc: Yifan Zhang, Kuehling, Felix, Huang Rui, amd-gfx list

On Wed, Jul 28, 2021 at 2:36 AM Changfeng <Changfeng.Zhu@amd.com> wrote:
>
> From: changzhu <Changfeng.Zhu@amd.com>
>
> From: Changfeng <Changfeng.Zhu@amd.com>
>
> It can't find any issues with noretry=1 except two SVM migrate issues.
> Oppositely, it will cause most SVM cases fail with noretry=0.
> The two SVM migrate issues also happen with noretry=0. So it can set
> default noretry=1 for raven firstly to fix most SVM fails.
>
> Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc
> Signed-off-by: Changfeng <Changfeng.Zhu@amd.com>

I would suggest testing this on a wide variety of raven systems,
including some OEM ones if possible.  Last time we did this it caused
tons of stability issues with raven systems.

Alex




* Re: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
  2021-07-28  6:36 [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven Changfeng
  2021-07-28 13:21 ` Alex Deucher
@ 2021-07-28 14:21 ` Felix Kuehling
  2021-08-05  8:51   ` Zhu, Changfeng
  1 sibling, 1 reply; 5+ messages in thread
From: Felix Kuehling @ 2021-07-28 14:21 UTC (permalink / raw)
  To: Changfeng, amd-gfx, Ray.Huang, Yifan1.Zhang

Doesn't this break IOMMUv2? Applications that run using IOMMUv2 for
system memory access depend on correct retry handling in the SQ.
Therefore noretry must be 0 on Raven.

I believe the reason that SVM has trouble with retry enabled is that
IOMMUv2 is catching the page faults, so the driver never gets to handle
the page fault interrupts. That breaks page-fault based migration in the
SVM code. I think the better solution is to disable SVM on APUs where
IOMMUv2 is enabled.

Alternatively, we could give up on IOMMUv2 entirely and always rely on
SVM to provide that functionality. But that requires more changes in the
amdgpu_vm code.

Regards,
  Felix
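
The alternative suggested above (leave retry enabled on Raven and instead
turn SVM off wherever IOMMUv2 is in use) can be sketched as a small
standalone C model. This is only an illustration under stated assumptions:
struct gpu_caps, resolve_noretry() and svm_allowed() are hypothetical names,
not amdgpu or KFD interfaces, and the real driver would have to derive these
flags from its own state.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the device state (not a kernel struct). */
struct gpu_caps {
	bool is_apu;          /* e.g. Raven */
	bool iommuv2_enabled; /* system memory access goes through IOMMUv2 */
};

/*
 * Effective noretry value: an explicit module parameter (0 or 1) wins;
 * -1 means "auto", which falls back to the per-ASIC default.
 */
static int resolve_noretry(int noretry_param, int asic_default)
{
	return noretry_param == -1 ? asic_default : noretry_param;
}

/*
 * Assumed policy from the discussion: with IOMMUv2 enabled on an APU the
 * IOMMU catches the page faults, so the driver never sees them and
 * page-fault based SVM migration cannot work. Refuse SVM in that case.
 */
static bool svm_allowed(const struct gpu_caps *caps)
{
	return !(caps->is_apu && caps->iommuv2_enabled);
}
```

With the Raven default left at 0, resolve_noretry(-1, 0) yields 0, so retry
stays on for IOMMUv2, while svm_allowed() returns false for an APU with
IOMMUv2 enabled and true otherwise.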


On 2021-07-28 at 2:36 a.m., Changfeng wrote:
> From: changzhu <Changfeng.Zhu@amd.com>
>
> From: Changfeng <Changfeng.Zhu@amd.com>
>
> It can't find any issues with noretry=1 except two SVM migrate issues.
> Oppositely, it will cause most SVM cases fail with noretry=0.
> The two SVM migrate issues also happen with noretry=0. So it can set
> default noretry=1 for raven firstly to fix most SVM fails.
>
> Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc
> Signed-off-by: Changfeng <Changfeng.Zhu@amd.com>


* RE: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
  2021-07-28 14:21 ` Felix Kuehling
@ 2021-08-05  8:51   ` Zhu, Changfeng
  2021-08-05 15:10     ` Felix Kuehling
  0 siblings, 1 reply; 5+ messages in thread
From: Zhu, Changfeng @ 2021-08-05  8:51 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx, Huang, Ray, Zhang, Yifan

Hi Felix,

Can we set noretry=1 for the dGPU path (ignore_crat=1), which doesn't go through the IOMMUv2 path on Raven, as below:
> +	case CHIP_RAVEN:
> +		/*
> +		 * TODO: noretry = 1 fixes most SVM issues on Raven.
> +		 * However, two kfd migrate tests still fail with
> +		 * noretry = 1. These two migrate failures on Raven
> +		 * still need to be root-caused.
> +		 */
>  		if (amdgpu_noretry == -1) {
>  			if (ignore_crat)
>  				gmc->noretry = 1;
>  			else
>  				gmc->noretry = 0;
>  		} else {
>  			gmc->noretry = amdgpu_noretry;
>  		}
>  		break;

BR,
Changfeng.



* Re: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
  2021-08-05  8:51   ` Zhu, Changfeng
@ 2021-08-05 15:10     ` Felix Kuehling
  0 siblings, 0 replies; 5+ messages in thread
From: Felix Kuehling @ 2021-08-05 15:10 UTC (permalink / raw)
  To: Zhu, Changfeng, amd-gfx, Huang, Ray, Zhang, Yifan


On 2021-08-05 at 4:51 a.m., Zhu, Changfeng wrote:
> Hi Felix,
>
> Can we set noretry=1 for the dGPU path (ignore_crat=1), which doesn't go through the IOMMUv2 path on Raven, as below:

There are other possible reasons than ignore_crat for Raven to work in
dGPU mode (broken CRAT, disabled IOMMU). However, those are not known
until later in the initialization.

Regards,
  Felix
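
The ordering problem described above can be illustrated with a minimal
standalone C sketch. It assumes three hypothetical flags (ignore_crat,
crat_broken, iommu_disabled) and hypothetical helper functions; none of
these are real amdgpu or KFD interfaces. The point is only that a decision
made from ignore_crat alone can disagree with the mode the device actually
ends up in once the later information is available.

```c
#include <stdbool.h>

/* Hypothetical model of what is known at different points of init. */
struct init_state {
	bool ignore_crat;    /* module parameter, known early */
	bool crat_broken;    /* assumed to be discovered later */
	bool iommu_disabled; /* assumed to be discovered later */
};

/* Early guess: only the module parameter is available. */
static bool early_guess_dgpu_mode(const struct init_state *s)
{
	return s->ignore_crat;
}

/* Final answer: any of the three conditions puts the APU in dGPU mode. */
static bool final_dgpu_mode(const struct init_state *s)
{
	return s->ignore_crat || s->crat_broken || s->iommu_disabled;
}

/* noretry default following the proposal: 1 in dGPU mode, otherwise 0. */
static int noretry_default(bool dgpu_mode)
{
	return dgpu_mode ? 1 : 0;
}
```

For a system where ignore_crat is unset but the IOMMU is disabled, the early
guess selects noretry = 0 while the final mode would call for noretry = 1,
which is the mismatch being pointed out.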


>> +	case CHIP_RAVEN:
>> +		/*
>> +		 * TODO: noretry = 1 fixes most SVM issues on Raven.
>> +		 * However, two kfd migrate tests still fail with
>> +		 * noretry = 1. These two migrate failures on Raven
>> +		 * still need to be root-caused.
>> +		 */
>>  		if (amdgpu_noretry == -1) {
>>  			if (ignore_crat)
>>  				gmc->noretry = 1;
>>  			else
>>  				gmc->noretry = 0;
>>  		} else {
>>  			gmc->noretry = amdgpu_noretry;
>>  		}
>>  		break;
