I agree with your suggestion that we drop the amdgpu_ras_enable=2 condition. My only concern is that, besides fatal_error, another failure may occur: atombios_init may time out on xgmi with baco (I am not sure whether psp mode1 reset causes this as well).

Assuming there is no amdgpu_ras_enable=2 check, if PMFW > 40.52, the use cases as I understand them are:

1. sGPU without RAS:
   * new: baco
   * old: baco
2. sGPU with RAS:
   * new: baco
   * old: psp mode1 chain reset and legacy fatal_error handling
3. XGMI with RAS:
   * new: baco
   * old: psp mode1 chain reset and legacy fatal_error handling
4. XGMI without RAS:
   * new: baco
   * old: psp mode1 chain reset

That is to say, all use cases take the baco path when PMFW > 40.52.

Regards,
Ma Le

-----Original Message-----
From: Zhang, Hawking
Sent: Wednesday, November 27, 2019 7:28 PM
To: Ma, Le; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Chen, Guchun; Zhou1, Tao; Li, Dennis; Deucher, Alexander; Ma, Le
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case

[AMD Public Use]

After thinking about it a bit, I think we can just rely on the PMFW version to decide between RAS recovery and legacy fatal_error handling for the platforms that support RAS. Leveraging amdgpu_ras_enable as a temporary solution seems unnecessary? Even if baco RAS recovery is not stable, the result is the same as with legacy fatal_error handling: the user has to reboot the node manually.

So the new soc reset use cases are:

XGMI (without RAS): use PSP mode1 based chain reset,
RAS enabled (with PMFW 40.52 and onwards): use BACO based RAS recovery,
RAS enabled (with PMFW prior to 40.52): use legacy fatal_error handling.

Anything else?

Regards,
Hawking

-----Original Message-----
From: Le Ma
Sent: November 27, 2019 17:15
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Zhang, Hawking; Chen, Guchun; Zhou1, Tao; Li, Dennis; Deucher, Alexander; Ma, Le
Subject: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case

Avoid changing the default reset behavior for production cards by checking that amdgpu_ras_enable equals 2. Also, only new enough smu ucode can support baco for the xgmi/ras case.

Change-Id: I07c3e6862be03e068745c73db8ea71f428ecba6b
Signed-off-by: Le Ma
---
 drivers/gpu/drm/amd/amdgpu/soc15.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 951327f..6202333 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -577,7 +577,9 @@ soc15_asic_reset_method(struct amdgpu_device *adev)
 			struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);
 			struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
-			if (hive || (ras && ras->supported))
+			if ((hive || (ras && ras->supported)) &&
+			    (amdgpu_ras_enable != 2 ||
+			    adev->pm.fw_version <= 0x283400))
 				baco_reset = false;
 		}
 		break;
--
2.7.4
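
For reference, the condition discussed in this thread, with the amdgpu_ras_enable check dropped, might look roughly like the standalone sketch below. It is not the actual soc15.c code: struct reset_inputs and use_baco_reset() are hypothetical names standing in for the hive, ras and adev->pm.fw_version lookups in the patch, the initial baco_reset = true assumes BACO is otherwise available, and reading 0x283400 as PMFW 40.52.0 is an assumption about the version encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold taken from the patch; interpreting it as 40.52.0
 * (one byte each for major, minor, debug) is an assumption. */
#define PMFW_BACO_THRESHOLD 0x283400u

/* Hypothetical stand-in for the state the patch reads from the
 * xgmi hive, the RAS context and adev->pm.fw_version. */
struct reset_inputs {
	bool in_xgmi_hive;     /* device belongs to an XGMI hive */
	bool ras_supported;    /* RAS context exists and is supported */
	uint32_t pmfw_version; /* SMU/PMFW firmware version */
};

/* Returns true when BACO reset would be chosen. Same condition as the
 * patch, minus the amdgpu_ras_enable != 2 term. */
bool use_baco_reset(const struct reset_inputs *in)
{
	bool baco_reset = true; /* assumption: BACO is otherwise available */

	if ((in->in_xgmi_hive || in->ras_supported) &&
	    in->pmfw_version <= PMFW_BACO_THRESHOLD)
		baco_reset = false;

	return baco_reset;
}
```

With this condition, an sGPU without RAS keeps baco regardless of firmware version, while XGMI or RAS configurations switch to baco only once the firmware version is above the threshold.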