From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Ma, Le"
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case
Date: Wed, 27 Nov 2019 12:35:57 +0000
References: <1574846129-4826-1-git-send-email-le.ma@amd.com> <1574846129-4826-5-git-send-email-le.ma@amd.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============1598236071=="
Content-Language: en-US
List-Id: Discussion list for AMD gfx
Errors-To: amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Sender: "amd-gfx"
To: "Zhang, Hawking", "amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
Cc: "Deucher, Alexander", "Zhou1, Tao", "Li, Dennis", "Chen, Guchun"

--===============1598236071==
Content-Language: en-US
Content-Type: multipart/alternative; boundary="_000_MN2PR12MB4285E37DF7D44270D5CAD0E5F6440MN2PR12MB4285namp_"

--_000_MN2PR12MB4285E37DF7D44270D5CAD0E5F6440MN2PR12MB4285namp_
Content-Type: text/html; charset="iso-2022-jp"
Content-Transfer-Encoding: quoted-printable

I agree with your suggestion that we drop the amdgpu_ras_enable=2 condition. My only concern is that, besides fatal_error, BACO on XGMI may also result in an atombios_init timeout (I am not sure whether PSP mode1 reset causes this as well).

 

Assuming the amdgpu_ras_enable=2 check is dropped and PMFW > 40.52, the use cases as I understand them are:

  1. sGPU without RAS:
    • new: baco
    • old: baco
  2. sGPU with RAS:
    • new: baco
    • old: psp mode1 chain reset and legacy fatal_error handling
  3. XGMI with RAS: baco
    • new: baco
    • old: psp mode1 chain reset and legacy fatal_error handling
  4. XGMI without RAS: baco
    • new: baco
    • old: psp mode1 chain reset

 

That is to say, all use cases take the BACO path when PMFW > 40.52.
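To make the "PMFW > 40.52" threshold concrete, here is a minimal sketch of how the version word could be split into fields. It assumes the word is packed as `(major << 16) | (minor << 8) | debug`, which is consistent with the patch constant `0x283400` reading as 40.52.0 (0x28 = 40, 0x34 = 52). The helper names are hypothetical and are not part of the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold used in the patch under discussion: 0x28 = 40, 0x34 = 52. */
#define PMFW_BACO_RAS_THRESHOLD 0x283400u

/* Hypothetical helpers; the major:minor:debug layout is an assumption. */
static inline uint32_t pmfw_major(uint32_t v) { return (v >> 16) & 0xffffu; }
static inline uint32_t pmfw_minor(uint32_t v) { return (v >> 8) & 0xffu; }
static inline uint32_t pmfw_debug(uint32_t v) { return v & 0xffu; }

/*
 * True when the firmware is strictly newer than the threshold, i.e. the
 * negation of the patch's "fw_version <= 0x283400" test.
 */
static inline bool pmfw_newer_than_threshold(uint32_t fw_version)
{
	return fw_version > PMFW_BACO_RAS_THRESHOLD;
}
```

Under this reading, version 40.52.0 itself does not pass the check, while any later debug or minor revision does.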

 

Regards,

Ma Le

 

-----Original Message-----
From: Zhang, Hawking <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>
Sent: Wednesday, November 27, 2019 7:28 PM
To: Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Chen, Guchun <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>; Zhou1, Tao <Tao.Zhou1@amd.com>; Li, Dennis <Dennis.Li-5C7GfCeVMHo@public.gmane.org>; Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>; Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case

 

[AMD Public Use]

 

After thinking about it a bit, I think we can rely on the PMFW version alone to decide between RAS recovery and legacy fatal_error handling on platforms that support RAS. Leveraging amdgpu_ras_enable as a temporary solution seems unnecessary? Even if BACO RAS recovery is not stable, the result is the same as with legacy fatal_error handling: the user has to reboot the node manually.

 

So the new soc reset use cases are:

  • XGMI (without RAS): use PSP mode1 based chain reset
  • RAS enabled (with PMFW 40.52 and onwards): use BACO based RAS recovery
  • RAS enabled (with PMFW prior to 40.52): use legacy fatal_error handling
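The three cases listed above can be sketched as a small decision function. This is only an illustration under stated assumptions: the enum, the function name and the boolean inputs are hypothetical, "40.52 and onwards" is read as `>= 0x283400`, and any combination the list does not mention is reported as not covered rather than guessed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the outcomes named in the list above. */
enum soc_reset_policy {
	RESET_PSP_MODE1_CHAIN,    /* XGMI without RAS */
	RESET_BACO_RAS_RECOVERY,  /* RAS enabled, PMFW 40.52 and onwards */
	RESET_LEGACY_FATAL_ERROR, /* RAS enabled, PMFW prior to 40.52 */
	RESET_NOT_COVERED         /* combination not mentioned in the list */
};

/* Assumed encoding of PMFW 40.52.0 (0x28 = 40, 0x34 = 52). */
#define PMFW_40_52 0x283400u

static enum soc_reset_policy
pick_reset_policy(bool in_xgmi_hive, bool ras_enabled, uint32_t pmfw_version)
{
	if (ras_enabled)
		return pmfw_version >= PMFW_40_52 ? RESET_BACO_RAS_RECOVERY
						  : RESET_LEGACY_FATAL_ERROR;
	if (in_xgmi_hive)
		return RESET_PSP_MODE1_CHAIN;

	return RESET_NOT_COVERED;
}
```

Note that the inclusive reading used here differs from the patch's `fw_version <= 0x283400` test, which treats 40.52.0 itself as too old; the thread does not settle which boundary is intended.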

Anything else?

 

Regards,

Hawking

-----Original Message-----
From: Le Ma <le.ma-5C7GfCeVMHo@public.gmane.org>
Sent: November 27, 2019 17:15
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Zhang, Hawking <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>; Chen, Guchun <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>; Zhou1, Tao <Tao.Zhou1-5C7GfCeVMHo@public.gmane.org>; Li, Dennis <Dennis.Li-5C7GfCeVMHo@public.gmane.org>; Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>; Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>
Subject: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case

 

Avoid changing the default reset behavior on production cards by checking that amdgpu_ras_enable equals 2. Also, only a new enough SMU ucode can support BACO for the XGMI/RAS case.

 

Change-Id: I07c3e6862be03e068745c73db8ea71f428ecba6b
Signed-off-by: Le Ma <le.ma-5C7GfCeVMHo@public.gmane.org>
---
 drivers/gpu/drm/amd/amdgpu/soc15.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 951327f..6202333 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -577,7 +577,9 @@ soc15_asic_reset_method(struct amdgpu_device *adev)
 			struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);
 			struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);

-			if (hive || (ras && ras->supported))
+			if ((hive || (ras && ras->supported)) &&
+			    (amdgpu_ras_enable != 2 ||
+			    adev->pm.fw_version <= 0x283400))
 				baco_reset = false;
 		}
 		break;
--
2.7.4
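The condition added by the patch can be exercised outside the driver with a small stand-in. This is a sketch, not driver code: `struct reset_inputs` and `baco_reset_after_patch` are hypothetical names, the fields only mimic the values the real check reads, and `baco_reset` is assumed to be true on entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the driver state the patched condition reads. */
struct reset_inputs {
	bool in_xgmi_hive;      /* stands in for a non-NULL hive pointer */
	bool ras_supported;     /* stands in for ras && ras->supported */
	int ras_enable;         /* stands in for the amdgpu_ras_enable parameter */
	uint32_t pm_fw_version; /* stands in for adev->pm.fw_version */
};

/*
 * Returns whether BACO reset stays selected after the patched check,
 * assuming baco_reset was true before the check ran.
 */
static bool baco_reset_after_patch(const struct reset_inputs *in)
{
	bool baco_reset = true;

	if ((in->in_xgmi_hive || in->ras_supported) &&
	    (in->ras_enable != 2 || in->pm_fw_version <= 0x283400))
		baco_reset = false;

	return baco_reset;
}
```

With this condition, a device that is neither in an XGMI hive nor RAS-capable keeps BACO regardless of the other inputs, while an XGMI or RAS device keeps BACO only when `ras_enable` is 2 and the firmware version is above `0x283400`.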

--_000_MN2PR12MB4285E37DF7D44270D5CAD0E5F6440MN2PR12MB4285namp_--

--===============1598236071==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: inline

X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KYW1kLWdmeCBt
YWlsaW5nIGxpc3QKYW1kLWdmeEBsaXN0cy5mcmVlZGVza3RvcC5vcmcKaHR0cHM6Ly9saXN0cy5m
cmVlZGVza3RvcC5vcmcvbWFpbG1hbi9saXN0aW5mby9hbWQtZ2Z4
--===============1598236071==--