All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs
       [not found] <20201015033557.3046-1-Jack.Gui@amd.com>
@ 2020-10-15 14:30 ` Felix Kuehling
  2020-11-29  0:38   ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 6+ messages in thread
From: Felix Kuehling @ 2020-10-15 14:30 UTC (permalink / raw)
  To: Chengming Gui, amd-gfx, Alexander.Deucher
  Cc: Tao.Zhou1, Ray.Huang, Guchun.Chen, Hawking.Zhang

Am 2020-10-14 um 11:35 p.m. schrieb Chengming Gui:
> noretry = 0 cause some dGPU's kfd page fault tests fail,
> so set noretry to 1 for these special ASICs:
> vega20/navi10/navi14/ARCTURUS
>
> v2: merge raven and default case due to the same setting
> v3: remove ARCTURUS
>
> Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
> Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872

Acked-by: Felix Kuhling <Felix.Kuehling@amd.com>


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 36604d751d62..f26eb4e54b12 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -425,20 +425,27 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>  	struct amdgpu_gmc *gmc = &adev->gmc;
>  
>  	switch (adev->asic_type) {
> -	case CHIP_RAVEN:
> -		/* Raven currently has issues with noretry
> -		 * regardless of what we decide for other
> -		 * asics, we should leave raven with
> -		 * noretry = 0 until we root cause the
> -		 * issues.
> +	case CHIP_VEGA20:
> +	case CHIP_NAVI10:
> +	case CHIP_NAVI14:
> +		/*
> +		 * noretry = 0 will cause kfd page fault tests fail
> +		 * for some ASICs, so set default to 1 for these ASICs.
>  		 */
>  		if (amdgpu_noretry == -1)
> -			gmc->noretry = 0;
> +			gmc->noretry = 1;
>  		else
>  			gmc->noretry = amdgpu_noretry;
>  		break;
> +	case CHIP_RAVEN:
>  	default:
> -		/* default this to 0 for now, but we may want
> +		/* Raven currently has issues with noretry
> +		 * regardless of what we decide for other
> +		 * asics, we should leave raven with
> +		 * noretry = 0 until we root cause the
> +		 * issues.
> +		 *
> +		 * default this to 0 for now, but we may want
>  		 * to change this in the future for certain
>  		 * GPUs as it can increase performance in
>  		 * certain cases.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs
  2020-10-15 14:30 ` [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs Felix Kuehling
@ 2020-11-29  0:38   ` Bas Nieuwenhuizen
  2020-11-29  7:22     ` Chen, Guchun
  0 siblings, 1 reply; 6+ messages in thread
From: Bas Nieuwenhuizen @ 2020-11-29  0:38 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Chengming Gui, Guchun.Chen, Tao Zhou, amd-gfx mailing list,
	Ray.Huang, Alex Deucher, Hawking Zhang

Can we revert this patch to fix
https://gitlab.freedesktop.org/drm/amd/-/issues/1374 ?

On Thu, Oct 15, 2020 at 4:30 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> Am 2020-10-14 um 11:35 p.m. schrieb Chengming Gui:
> > noretry = 0 cause some dGPU's kfd page fault tests fail,
> > so set noretry to 1 for these special ASICs:
> > vega20/navi10/navi14/ARCTURUS
> >
> > v2: merge raven and default case due to the same setting
> > v3: remove ARCTURUS
> >
> > Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
> > Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872
>
> Acked-by: Felix Kuhling <Felix.Kuehling@amd.com>
>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 23 +++++++++++++++--------
> >  1 file changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index 36604d751d62..f26eb4e54b12 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -425,20 +425,27 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
> >       struct amdgpu_gmc *gmc = &adev->gmc;
> >
> >       switch (adev->asic_type) {
> > -     case CHIP_RAVEN:
> > -             /* Raven currently has issues with noretry
> > -              * regardless of what we decide for other
> > -              * asics, we should leave raven with
> > -              * noretry = 0 until we root cause the
> > -              * issues.
> > +     case CHIP_VEGA20:
> > +     case CHIP_NAVI10:
> > +     case CHIP_NAVI14:
> > +             /*
> > +              * noretry = 0 will cause kfd page fault tests fail
> > +              * for some ASICs, so set default to 1 for these ASICs.
> >                */
> >               if (amdgpu_noretry == -1)
> > -                     gmc->noretry = 0;
> > +                     gmc->noretry = 1;
> >               else
> >                       gmc->noretry = amdgpu_noretry;
> >               break;
> > +     case CHIP_RAVEN:
> >       default:
> > -             /* default this to 0 for now, but we may want
> > +             /* Raven currently has issues with noretry
> > +              * regardless of what we decide for other
> > +              * asics, we should leave raven with
> > +              * noretry = 0 until we root cause the
> > +              * issues.
> > +              *
> > +              * default this to 0 for now, but we may want
> >                * to change this in the future for certain
> >                * GPUs as it can increase performance in
> >                * certain cases.
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs
  2020-11-29  0:38   ` Bas Nieuwenhuizen
@ 2020-11-29  7:22     ` Chen, Guchun
  2020-11-30 17:22       ` Deucher, Alexander
  0 siblings, 1 reply; 6+ messages in thread
From: Chen, Guchun @ 2020-11-29  7:22 UTC (permalink / raw)
  To: Bas Nieuwenhuizen, Kuehling, Felix
  Cc: Gui, Jack, Zhou1, Tao, amd-gfx mailing list, Huang, Ray, Deucher,
	Alexander, Zhang, Hawking

[AMD Public Use]

Hi Bas Nieuwenhuizen,

I don't think direct revert is one right approach, though it's able to fix your problem.  noretry=0 will cause other test failure on several ASICs.

Regards,
Guchun

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Bas Nieuwenhuizen
Sent: Sunday, November 29, 2020 8:38 AM
To: Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Gui, Jack <Jack.Gui@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx mailing list <amd-gfx@lists.freedesktop.org>; Huang, Ray <Ray.Huang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: Re: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs

Can we revert this patch to fix
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1374&amp;data=04%7C01%7Cguchun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Jxa2V1TuszoBKtF%2FPbIA3YwOrXHgLreBY%2FXej1HTZ4k%3D&amp;reserved=0 ?

On Thu, Oct 15, 2020 at 4:30 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> Am 2020-10-14 um 11:35 p.m. schrieb Chengming Gui:
> > noretry = 0 cause some dGPU's kfd page fault tests fail, so set 
> > noretry to 1 for these special ASICs:
> > vega20/navi10/navi14/ARCTURUS
> >
> > v2: merge raven and default case due to the same setting
> > v3: remove ARCTURUS
> >
> > Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
> > Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872
>
> Acked-by: Felix Kuhling <Felix.Kuehling@amd.com>
>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 23 
> > +++++++++++++++--------
> >  1 file changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index 36604d751d62..f26eb4e54b12 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -425,20 +425,27 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
> >       struct amdgpu_gmc *gmc = &adev->gmc;
> >
> >       switch (adev->asic_type) {
> > -     case CHIP_RAVEN:
> > -             /* Raven currently has issues with noretry
> > -              * regardless of what we decide for other
> > -              * asics, we should leave raven with
> > -              * noretry = 0 until we root cause the
> > -              * issues.
> > +     case CHIP_VEGA20:
> > +     case CHIP_NAVI10:
> > +     case CHIP_NAVI14:
> > +             /*
> > +              * noretry = 0 will cause kfd page fault tests fail
> > +              * for some ASICs, so set default to 1 for these ASICs.
> >                */
> >               if (amdgpu_noretry == -1)
> > -                     gmc->noretry = 0;
> > +                     gmc->noretry = 1;
> >               else
> >                       gmc->noretry = amdgpu_noretry;
> >               break;
> > +     case CHIP_RAVEN:
> >       default:
> > -             /* default this to 0 for now, but we may want
> > +             /* Raven currently has issues with noretry
> > +              * regardless of what we decide for other
> > +              * asics, we should leave raven with
> > +              * noretry = 0 until we root cause the
> > +              * issues.
> > +              *
> > +              * default this to 0 for now, but we may want
> >                * to change this in the future for certain
> >                * GPUs as it can increase performance in
> >                * certain cases.
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cgu
> chun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884
> e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb
> 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%
> 7C1000&amp;sdata=VFqegGwPCj10q3Y5BdZsVq2a%2B4Tb358mYVDaNkA9zLU%3D&amp;
> reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cguchun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=VFqegGwPCj10q3Y5BdZsVq2a%2B4Tb358mYVDaNkA9zLU%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs
  2020-11-29  7:22     ` Chen, Guchun
@ 2020-11-30 17:22       ` Deucher, Alexander
  2020-11-30 17:35         ` Felix Kuehling
  0 siblings, 1 reply; 6+ messages in thread
From: Deucher, Alexander @ 2020-11-30 17:22 UTC (permalink / raw)
  To: Chen, Guchun, Bas Nieuwenhuizen, Kuehling, Felix
  Cc: Zhou1, Tao, Huang, Ray, Gui, Jack, amd-gfx mailing list, Zhang, Hawking


[-- Attachment #1.1: Type: text/plain, Size: 5720 bytes --]

[AMD Public Use]

We need to figure out what the root cause is then.  If we can't figure it out soon, we should revert the change for navi1x and continue to debug it until we can find the root cause and we can safely re-enable it.

Alex
________________________________
From: Chen, Guchun <Guchun.Chen@amd.com>
Sent: Sunday, November 29, 2020 2:22 AM
To: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>; Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Gui, Jack <Jack.Gui@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx mailing list <amd-gfx@lists.freedesktop.org>; Huang, Ray <Ray.Huang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: RE: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs

[AMD Public Use]

Hi Bas Nieuwenhuizen,

I don't think direct revert is one right approach, though it's able to fix your problem.  noretry=0 will cause other test failure on several ASICs.

Regards,
Guchun

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Bas Nieuwenhuizen
Sent: Sunday, November 29, 2020 8:38 AM
To: Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Gui, Jack <Jack.Gui@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx mailing list <amd-gfx@lists.freedesktop.org>; Huang, Ray <Ray.Huang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: Re: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs

Can we revert this patch to fix
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1374&amp;data=04%7C01%7Cguchun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Jxa2V1TuszoBKtF%2FPbIA3YwOrXHgLreBY%2FXej1HTZ4k%3D&amp;reserved=0 ?

On Thu, Oct 15, 2020 at 4:30 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> Am 2020-10-14 um 11:35 p.m. schrieb Chengming Gui:
> > noretry = 0 cause some dGPU's kfd page fault tests fail, so set
> > noretry to 1 for these special ASICs:
> > vega20/navi10/navi14/ARCTURUS
> >
> > v2: merge raven and default case due to the same setting
> > v3: remove ARCTURUS
> >
> > Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
> > Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872
>
> Acked-by: Felix Kuhling <Felix.Kuehling@amd.com>
>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 23
> > +++++++++++++++--------
> >  1 file changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index 36604d751d62..f26eb4e54b12 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -425,20 +425,27 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
> >       struct amdgpu_gmc *gmc = &adev->gmc;
> >
> >       switch (adev->asic_type) {
> > -     case CHIP_RAVEN:
> > -             /* Raven currently has issues with noretry
> > -              * regardless of what we decide for other
> > -              * asics, we should leave raven with
> > -              * noretry = 0 until we root cause the
> > -              * issues.
> > +     case CHIP_VEGA20:
> > +     case CHIP_NAVI10:
> > +     case CHIP_NAVI14:
> > +             /*
> > +              * noretry = 0 will cause kfd page fault tests fail
> > +              * for some ASICs, so set default to 1 for these ASICs.
> >                */
> >               if (amdgpu_noretry == -1)
> > -                     gmc->noretry = 0;
> > +                     gmc->noretry = 1;
> >               else
> >                       gmc->noretry = amdgpu_noretry;
> >               break;
> > +     case CHIP_RAVEN:
> >       default:
> > -             /* default this to 0 for now, but we may want
> > +             /* Raven currently has issues with noretry
> > +              * regardless of what we decide for other
> > +              * asics, we should leave raven with
> > +              * noretry = 0 until we root cause the
> > +              * issues.
> > +              *
> > +              * default this to 0 for now, but we may want
> >                * to change this in the future for certain
> >                * GPUs as it can increase performance in
> >                * certain cases.
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cgu
> chun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884
> e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb
> 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%
> 7C1000&amp;sdata=VFqegGwPCj10q3Y5BdZsVq2a%2B4Tb358mYVDaNkA9zLU%3D&amp;
> reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cguchun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=VFqegGwPCj10q3Y5BdZsVq2a%2B4Tb358mYVDaNkA9zLU%3D&amp;reserved=0

[-- Attachment #1.2: Type: text/html, Size: 10773 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs
  2020-11-30 17:22       ` Deucher, Alexander
@ 2020-11-30 17:35         ` Felix Kuehling
  2020-11-30 19:31           ` Felix Kuehling
  0 siblings, 1 reply; 6+ messages in thread
From: Felix Kuehling @ 2020-11-30 17:35 UTC (permalink / raw)
  To: Deucher, Alexander, Chen, Guchun, Bas Nieuwenhuizen
  Cc: Zhou1, Tao, Huang, Ray, Gui, Jack, amd-gfx mailing list, Zhang, Hawking

Like I stated elsewhere, I would recommend noretry=0 for Navi and later
GPUs because there is no performance advantage from disabling retry on
those GPUs.


Regards,
  Felix


Am 2020-11-30 um 12:22 p.m. schrieb Deucher, Alexander:
>
> [AMD Public Use]
>
>
> We need to figure out what the root cause is then.  If we can't figure
> it out soon, we should revert the change for navi1x and continue to
> debug it until we can find the root cause and we can safely re-enable it.
>
> Alex
> ------------------------------------------------------------------------
> *From:* Chen, Guchun <Guchun.Chen@amd.com>
> *Sent:* Sunday, November 29, 2020 2:22 AM
> *To:* Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>; Kuehling, Felix
> <Felix.Kuehling@amd.com>
> *Cc:* Gui, Jack <Jack.Gui@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>;
> amd-gfx mailing list <amd-gfx@lists.freedesktop.org>; Huang, Ray
> <Ray.Huang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>;
> Zhang, Hawking <Hawking.Zhang@amd.com>
> *Subject:* RE: [PATCH v3] drm/amd/amdgpu: set the default value of
> noretry to 1 for some dGPUs
>  
> [AMD Public Use]
>
> Hi Bas Nieuwenhuizen,
>
> I don't think direct revert is one right approach, though it's able to
> fix your problem.  noretry=0 will cause other test failure on several
> ASICs.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Bas
> Nieuwenhuizen
> Sent: Sunday, November 29, 2020 8:38 AM
> To: Kuehling, Felix <Felix.Kuehling@amd.com>
> Cc: Gui, Jack <Jack.Gui@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>;
> Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx mailing list
> <amd-gfx@lists.freedesktop.org>; Huang, Ray <Ray.Huang@amd.com>;
> Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking
> <Hawking.Zhang@amd.com>
> Subject: Re: [PATCH v3] drm/amd/amdgpu: set the default value of
> noretry to 1 for some dGPUs
>
> Can we revert this patch to fix
> https://gitlab.freedesktop.org/drm/amd/-/issues/1374 ?
>
> On Thu, Oct 15, 2020 at 4:30 PM Felix Kuehling
> <felix.kuehling@amd.com> wrote:
> >
> > Am 2020-10-14 um 11:35 p.m. schrieb Chengming Gui:
> > > noretry = 0 cause some dGPU's kfd page fault tests fail, so set
> > > noretry to 1 for these special ASICs:
> > > vega20/navi10/navi14/ARCTURUS
> > >
> > > v2: merge raven and default case due to the same setting
> > > v3: remove ARCTURUS
> > >
> > > Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
> > > Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872
> >
> > Acked-by: Felix Kuhling <Felix.Kuehling@amd.com>
> >
> >
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 23
> > > +++++++++++++++--------
> > >  1 file changed, 15 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > > index 36604d751d62..f26eb4e54b12 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > > @@ -425,20 +425,27 @@ void amdgpu_gmc_noretry_set(struct
> amdgpu_device *adev)
> > >       struct amdgpu_gmc *gmc = &adev->gmc;
> > >
> > >       switch (adev->asic_type) {
> > > -     case CHIP_RAVEN:
> > > -             /* Raven currently has issues with noretry
> > > -              * regardless of what we decide for other
> > > -              * asics, we should leave raven with
> > > -              * noretry = 0 until we root cause the
> > > -              * issues.
> > > +     case CHIP_VEGA20:
> > > +     case CHIP_NAVI10:
> > > +     case CHIP_NAVI14:
> > > +             /*
> > > +              * noretry = 0 will cause kfd page fault tests fail
> > > +              * for some ASICs, so set default to 1 for these ASICs.
> > >                */
> > >               if (amdgpu_noretry == -1)
> > > -                     gmc->noretry = 0;
> > > +                     gmc->noretry = 1;
> > >               else
> > >                       gmc->noretry = amdgpu_noretry;
> > >               break;
> > > +     case CHIP_RAVEN:
> > >       default:
> > > -             /* default this to 0 for now, but we may want
> > > +             /* Raven currently has issues with noretry
> > > +              * regardless of what we decide for other
> > > +              * asics, we should leave raven with
> > > +              * noretry = 0 until we root cause the
> > > +              * issues.
> > > +              *
> > > +              * default this to 0 for now, but we may want
> > >                * to change this in the future for certain
> > >                * GPUs as it can increase performance in
> > >                * certain cases.
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://list/ <https://list>
> > s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cgu
> > chun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884
> > e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb
> > 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%
> > 7C1000&amp;sdata=VFqegGwPCj10q3Y5BdZsVq2a%2B4Tb358mYVDaNkA9zLU%3D&amp;
> > reserved=0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs
  2020-11-30 17:35         ` Felix Kuehling
@ 2020-11-30 19:31           ` Felix Kuehling
  0 siblings, 0 replies; 6+ messages in thread
From: Felix Kuehling @ 2020-11-30 19:31 UTC (permalink / raw)
  To: Deucher, Alexander, Chen, Guchun, Bas Nieuwenhuizen
  Cc: Zhou1, Tao, Huang, Ray, Gui, Jack, amd-gfx mailing list, Zhang, Hawking

Another related thought: I think the reason some chips had failing VM
fault tests with noretry=0 was due to a dependency on IH rerouting of
retry faults. This dependency has been fixed by Christian recently:

commit 849c62248ee84c1e304a9ce2f673c79e23f29bf9
Author: Christian K?nig <christian.koenig@amd.com>
Date:   Sat Oct 31 18:39:54 2020 +0100

    drm/amdgpu: enabled software IH ring for Vega

    Seems like we won't get the hardware IH1/2 rings on Vega20 working.

    Signed-off-by: Christian K?nig <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

 drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 7 +++++++
 1 file changed, 7 insertions(+)

commit 198237744d85c4a23914de56d78fba0acf5a2803
Author: Christian K?nig <christian.koenig@amd.com>
Date:   Tue Nov 3 14:22:50 2020 +0100

    drm/amdgpu: enabled software IH ring for Navi

    Felix pointed out that we need this for Navi as well.

    Signed-off-by: Christian K?nig <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

 drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 7 +++++++
 1 file changed, 7 insertions(+)

So it should now be safe to enable retry faults on most chips. Only on
GFXv9 there can be a performance advantage to disabling retry.

Regards,
  Felix

Am 2020-11-30 um 12:35 p.m. schrieb Felix Kuehling:
> Like I stated elsewhere, I would recommend noretry=0 for Navi and later
> GPUs because there is no performance advantage from disabling retry on
> those GPUs.
>
>
> Regards,
>   Felix
>
>
> Am 2020-11-30 um 12:22 p.m. schrieb Deucher, Alexander:
>> [AMD Public Use]
>>
>>
>> We need to figure out what the root cause is then.  If we can't figure
>> it out soon, we should revert the change for navi1x and continue to
>> debug it until we can find the root cause and we can safely re-enable it.
>>
>> Alex
>> ------------------------------------------------------------------------
>> *From:* Chen, Guchun <Guchun.Chen@amd.com>
>> *Sent:* Sunday, November 29, 2020 2:22 AM
>> *To:* Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>; Kuehling, Felix
>> <Felix.Kuehling@amd.com>
>> *Cc:* Gui, Jack <Jack.Gui@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>;
>> amd-gfx mailing list <amd-gfx@lists.freedesktop.org>; Huang, Ray
>> <Ray.Huang@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>;
>> Zhang, Hawking <Hawking.Zhang@amd.com>
>> *Subject:* RE: [PATCH v3] drm/amd/amdgpu: set the default value of
>> noretry to 1 for some dGPUs
>>  
>> [AMD Public Use]
>>
>> Hi Bas Nieuwenhuizen,
>>
>> I don't think direct revert is one right approach, though it's able to
>> fix your problem.  noretry=0 will cause other test failure on several
>> ASICs.
>>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Bas
>> Nieuwenhuizen
>> Sent: Sunday, November 29, 2020 8:38 AM
>> To: Kuehling, Felix <Felix.Kuehling@amd.com>
>> Cc: Gui, Jack <Jack.Gui@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>;
>> Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx mailing list
>> <amd-gfx@lists.freedesktop.org>; Huang, Ray <Ray.Huang@amd.com>;
>> Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking
>> <Hawking.Zhang@amd.com>
>> Subject: Re: [PATCH v3] drm/amd/amdgpu: set the default value of
>> noretry to 1 for some dGPUs
>>
>> Can we revert this patch to fix
>> https://gitlab.freedesktop.org/drm/amd/-/issues/1374 ?
>>
>> On Thu, Oct 15, 2020 at 4:30 PM Felix Kuehling
>> <felix.kuehling@amd.com> wrote:
>>> Am 2020-10-14 um 11:35 p.m. schrieb Chengming Gui:
>>>> noretry = 0 cause some dGPU's kfd page fault tests fail, so set
>>>> noretry to 1 for these special ASICs:
>>>> vega20/navi10/navi14/ARCTURUS
>>>>
>>>> v2: merge raven and default case due to the same setting
>>>> v3: remove ARCTURUS
>>>>
>>>> Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
>>>> Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872
>>> Acked-by: Felix Kuhling <Felix.Kuehling@amd.com>
>>>
>>>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 23
>>>> +++++++++++++++--------
>>>>   1 file changed, 15 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> index 36604d751d62..f26eb4e54b12 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> @@ -425,20 +425,27 @@ void amdgpu_gmc_noretry_set(struct
>> amdgpu_device *adev)
>>>>        struct amdgpu_gmc *gmc = &adev->gmc;
>>>>
>>>>        switch (adev->asic_type) {
>>>> -     case CHIP_RAVEN:
>>>> -             /* Raven currently has issues with noretry
>>>> -              * regardless of what we decide for other
>>>> -              * asics, we should leave raven with
>>>> -              * noretry = 0 until we root cause the
>>>> -              * issues.
>>>> +     case CHIP_VEGA20:
>>>> +     case CHIP_NAVI10:
>>>> +     case CHIP_NAVI14:
>>>> +             /*
>>>> +              * noretry = 0 will cause kfd page fault tests fail
>>>> +              * for some ASICs, so set default to 1 for these ASICs.
>>>>                 */
>>>>                if (amdgpu_noretry == -1)
>>>> -                     gmc->noretry = 0;
>>>> +                     gmc->noretry = 1;
>>>>                else
>>>>                        gmc->noretry = amdgpu_noretry;
>>>>                break;
>>>> +     case CHIP_RAVEN:
>>>>        default:
>>>> -             /* default this to 0 for now, but we may want
>>>> +             /* Raven currently has issues with noretry
>>>> +              * regardless of what we decide for other
>>>> +              * asics, we should leave raven with
>>>> +              * noretry = 0 until we root cause the
>>>> +              * issues.
>>>> +              *
>>>> +              * default this to 0 for now, but we may want
>>>>                 * to change this in the future for certain
>>>>                 * GPUs as it can increase performance in
>>>>                 * certain cases.
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://list/ <https://list>
>>> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cgu
>>> chun.chen%40amd.com%7C6d626e2a3bae4877024f08d893ff15db%7C3dd8961fe4884
>>> e608e11a82d994e183d%7C0%7C0%7C637422071085800476%7CUnknown%7CTWFpbGZsb
>>> 3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%
>>> 7C1000&amp;sdata=VFqegGwPCj10q3Y5BdZsVq2a%2B4Tb358mYVDaNkA9zLU%3D&amp;
>>> reserved=0
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-11-30 19:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201015033557.3046-1-Jack.Gui@amd.com>
2020-10-15 14:30 ` [PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs Felix Kuehling
2020-11-29  0:38   ` Bas Nieuwenhuizen
2020-11-29  7:22     ` Chen, Guchun
2020-11-30 17:22       ` Deucher, Alexander
2020-11-30 17:35         ` Felix Kuehling
2020-11-30 19:31           ` Felix Kuehling

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.