RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case

From: "Ma, Le" <Le.Ma-5C7GfCeVMHo@public.gmane.org>
To: "Zhang, Hawking" <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Cc: "Deucher,
	Alexander" <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>,
	"Zhou1, Tao" <Tao.Zhou1-5C7GfCeVMHo@public.gmane.org>,
	"Li, Dennis" <Dennis.Li-5C7GfCeVMHo@public.gmane.org>,
	"Chen, Guchun" <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case
Date: Wed, 27 Nov 2019 12:35:57 +0000
Message-ID: <MN2PR12MB4285E37DF7D44270D5CAD0E5F6440@MN2PR12MB4285.namprd12.prod.outlook.com>
In-Reply-To: <DM5PR12MB141825CB772FEEF1FD013EDBFC440-2J9CzHegvk81aAVlcVN8UQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>


I agree that we should drop the amdgpu_ras_enable=2 condition. My only concern is that, besides fatal_error, another outcome is possible: atombios_init may time out on XGMI after a BACO reset (I am not sure whether PSP mode1 reset causes this as well).



Assuming there is no amdgpu_ras_enable=2 check and PMFW > 40.52, the use cases, as I understand them, are:

  1.  sGPU without RAS:
     *   new: baco
     *   old: baco
  2.  sGPU with RAS:
     *   new: baco
     *   old: psp mode1 chain reset and legacy fatal_error handling
  3.  XGMI with RAS:
     *   new: baco
     *   old: psp mode1 chain reset and legacy fatal_error handling
  4.  XGMI without RAS:
     *   new: baco
     *   old: psp mode1 chain reset



That is to say, all use cases take the BACO path when PMFW > 40.52.
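The four cases above can be written down as a small decision function. This is only a sketch of my reading of the list, not driver code: the enum, struct and function names are hypothetical, the version packing (major in bits 23:16, minor in bits 15:8) is an assumption inferred from the 0x283400 constant in the patch, and the "old" column is assumed to be what still applies when PMFW is not newer than 40.52.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only; these are not amdgpu driver enums. */
enum reset_method { RESET_BACO, RESET_PSP_MODE1_CHAIN };

struct reset_plan {
	enum reset_method method;
	bool legacy_fatal_error; /* legacy fatal_error handling goes with the reset */
};

/* Assumed packing: major in bits 23:16, minor in bits 15:8. */
#define PMFW_VERSION(major, minor) \
	(((uint32_t)(major) << 16) | ((uint32_t)(minor) << 8))

/* Under that assumption, 40.52 matches the constant used in the patch. */
static_assert(PMFW_VERSION(40, 52) == 0x283400, "40.52 should pack to 0x283400");

static struct reset_plan pick_reset(bool xgmi, bool ras, uint32_t pmfw)
{
	struct reset_plan plan = { RESET_BACO, false };

	/*
	 * "new" column: PMFW > 40.52 means BACO in all four cases.
	 * sGPU without RAS uses BACO in the "old" column as well.
	 */
	if (pmfw > PMFW_VERSION(40, 52) || (!xgmi && !ras))
		return plan;

	/* "old" column for the remaining three cases. */
	plan.method = RESET_PSP_MODE1_CHAIN;
	plan.legacy_fatal_error = ras;
	return plan;
}
```

With this shape, pick_reset(true, true, PMFW_VERSION(40, 53)) yields BACO, while the same call with PMFW_VERSION(40, 52) yields the mode1 chain reset plus legacy fatal_error handling.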



Regards,

Ma Le



-----Original Message-----
From: Zhang, Hawking <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>
Sent: Wednesday, November 27, 2019 7:28 PM
To: Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Chen, Guchun <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>; Zhou1, Tao <Tao.Zhou1-5C7GfCeVMHo@public.gmane.org>; Li, Dennis <Dennis.Li-5C7GfCeVMHo@public.gmane.org>; Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>; Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case



[AMD Public Use]



After thinking about it a bit, I think we can rely solely on the PMFW version to decide between RAS recovery and legacy fatal_error handling on the platforms that support RAS. Leveraging amdgpu_ras_enable as a temporary solution does not seem necessary. Even if BACO RAS recovery is not stable, the result is the same as with legacy fatal_error handling: the user has to reboot the node manually.



So the new soc reset use cases are:

  *   XGMI (without RAS): use PSP mode1 based chain reset
  *   RAS enabled (with PMFW 40.52 and onwards): use BACO based RAS recovery
  *   RAS enabled (with PMFW prior to 40.52): use legacy fatal_error handling
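The three cases listed above could be sketched roughly as follows. The enum and function names are hypothetical and only mirror the wording of the list; the PMFW_40_52 constant assumes the same encoding as the 0x283400 value in the patch; and the configuration not mentioned in the list (sGPU without RAS) is deliberately left as "unspecified" rather than guessed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the paths named in the list; not driver identifiers. */
enum soc_reset_path {
	PATH_PSP_MODE1_CHAIN_RESET,
	PATH_BACO_RAS_RECOVERY,
	PATH_LEGACY_FATAL_ERROR,
	PATH_UNSPECIFIED, /* configuration the list does not mention */
};

/* Assumed encoding of PMFW 40.52, taken from the constant in the patch. */
#define PMFW_40_52 0x283400u

static_assert((PMFW_40_52 >> 16) == 40 && ((PMFW_40_52 >> 8) & 0xff) == 52,
	      "constant should decode to 40.52");

static enum soc_reset_path soc_reset_path(bool xgmi, bool ras_enabled,
					  uint32_t pmfw)
{
	/* RAS enabled: the PMFW version alone selects the path. */
	if (ras_enabled)
		return pmfw >= PMFW_40_52 ? PATH_BACO_RAS_RECOVERY
					  : PATH_LEGACY_FATAL_ERROR;

	/* XGMI without RAS: PSP mode1 based chain reset. */
	if (xgmi)
		return PATH_PSP_MODE1_CHAIN_RESET;

	return PATH_UNSPECIFIED;
}
```

Note that "40.52 and onwards" is modelled here as >=, whereas the patch as posted compares with <= 0x283400, which would put 40.52 itself on the legacy side.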

Anything else?



Regards,

Hawking

-----Original Message-----

From: Le Ma <le.ma-5C7GfCeVMHo@public.gmane.org>
Sent: Wednesday, November 27, 2019 5:15 PM
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Zhang, Hawking <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>; Chen, Guchun <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>; Zhou1, Tao <Tao.Zhou1-5C7GfCeVMHo@public.gmane.org>; Li, Dennis <Dennis.Li-5C7GfCeVMHo@public.gmane.org>; Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>; Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>
Subject: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case



Avoid changing the default reset behavior for production cards by checking that amdgpu_ras_enable equals 2. Also, only new enough SMU ucode can support BACO for the XGMI/RAS case.



Change-Id: I07c3e6862be03e068745c73db8ea71f428ecba6b
Signed-off-by: Le Ma <le.ma-5C7GfCeVMHo@public.gmane.org>
---
 drivers/gpu/drm/amd/amdgpu/soc15.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 951327f..6202333 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -577,7 +577,9 @@ soc15_asic_reset_method(struct amdgpu_device *adev)
 			struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);
 			struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
-			if (hive || (ras && ras->supported))
+			if ((hive || (ras && ras->supported)) &&
+			    (amdgpu_ras_enable != 2 ||
+			    adev->pm.fw_version <= 0x283400))
 				baco_reset = false;
 		}
 		break;
--
2.7.4
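
To make the effect of the patched condition easier to check, here is a standalone sketch that isolates it from the driver. The struct and function names are hypothetical stand-ins for the driver state referenced in the hunk, and the sketch assumes the SMU has already reported BACO capability (baco_reset starts out true).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for driver state; not the driver's own types. */
struct reset_inputs {
	bool in_xgmi_hive;      /* stands in for: hive != NULL */
	bool ras_supported;     /* stands in for: ras && ras->supported */
	int ras_enable;         /* stands in for: amdgpu_ras_enable */
	uint32_t pm_fw_version; /* stands in for: adev->pm.fw_version */
};

/* Threshold constant from the hunk; decodes to 40.52 if packed as major/minor. */
#define PMFW_THRESHOLD 0x283400u

static_assert((PMFW_THRESHOLD >> 16) == 40 && ((PMFW_THRESHOLD >> 8) & 0xff) == 52,
	      "threshold should decode to 40.52");

static bool baco_reset_allowed(const struct reset_inputs *in)
{
	bool baco_reset = true; /* assumption: BACO capability already reported */

	/* Same boolean structure as the condition added by the hunk. */
	if ((in->in_xgmi_hive || in->ras_supported) &&
	    (in->ras_enable != 2 || in->pm_fw_version <= PMFW_THRESHOLD))
		baco_reset = false;

	return baco_reset;
}
```

In this form, an XGMI or RAS configuration keeps BACO only when ras_enable is 2 and the firmware version is strictly above the threshold; a single GPU without RAS is unaffected by either check.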

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx