From: Andrey Grodzovsky
To: "Ma, Le", "amd-gfx@lists.freedesktop.org", "Zhou1, Tao", "Deucher, Alexander", "Li, Dennis", "Zhang, Hawking"
Cc: "Chen, Guchun"
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI
Date: Mon, 9 Dec 2019 17:00:32 -0500
Message-ID: <1f271be0-4b91-d612-b289-67eacea62652@amd.com>
References: <1574846129-4826-1-git-send-email-le.ma@amd.com> <1574846129-4826-6-git-send-email-le.ma@amd.com> <157d7671-803c-4f6e-f77c-9738f32905e3@amd.com> <5b505116-17aa-383d-5cdf-246663a1f4f9@amd.com> <2c4dd3f3-e2ce-9843-312b-1e5c05a51521@amd.com> <0cf9f58a-3ce4-2a9c-cb1a-db3cb13760b9@amd.com>
List-Id: Discussion list for AMD gfx

I reproduced the issue on my side - I consistently observe amdgpu: [powerplay] Failed to send message 0x58, response 0x0 - Baco exit failure. Do you know what the strict time interval is within which all the BACO enter/exit messages need to be sent to all the nodes in the hive?

Andrey

On 12/9/19 6:34 AM, Ma, Le wrote:

[AMD Official Use Only - Internal Distribution Only]


Hi Andrey,

I tried your patches on my 2P XGMI platform. BACO works most of the time, but I randomly got the following error:

[ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, response 0x0

This error usually means there is a sync issue in the XGMI BACO case. Feel free to debug your patches on my XGMI platform.

Regards,

Ma Le

From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Sent: Saturday, December 7, 2019 5:51 AM
To: Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org; Zhou1, Tao <Tao.Zhou1@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Cc: Chen, Guchun <Guchun.Chen@amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI

Hey Ma, attached is a solution - it's only compile-tested, as I still can't get my XGMI setup to work (with the bridge connected, only one device is visible to the system while the other is not). Please try it on your system if you have a chance.

Andrey

On 12/4/19 10:14 PM, Ma, Le wrote:

AFAIK it's enough for even a single node in the hive to fail to enter the BACO state on time to fail the entire hive reset procedure, no?

[Le]: Yeah, agreed. I've been thinking that making all nodes enter BACO simultaneously can reduce the risk of a node failing to enter/exit BACO. For example, in an XGMI hive with 8 nodes, the total time for 8 nodes to enter/exit BACO on 8 CPUs is less than the time for 8 nodes to enter BACO serially and exit BACO serially on one CPU with yield capability. This interval is usually strict for the BACO feature itself. Anyway, we need more loop testing later on whichever method we choose.
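
As a rough illustration of the timing argument above (not taken from the patch or from amdgpu), the following back-of-envelope sketch compares the two windows. It assumes a hypothetical per-node message latency, assumes all concurrent workers start at the same instant, and ignores scheduling delay and contention; the function names and the 100 us figure are made up for the example.

```python
# Back-of-envelope model of the BACO enter/exit window for an XGMI hive.
# All names and numbers are hypothetical; nothing here comes from amdgpu.

def serial_window_us(per_node_us):
    """One CPU messages the nodes one after another: latencies add up."""
    return sum(per_node_us)


def concurrent_window_us(per_node_us):
    """One CPU per node, all started together: the slowest node sets the window."""
    return max(per_node_us)


# Assumed message latency in microseconds for each node of an 8-node hive.
latencies = [100] * 8

print(serial_window_us(latencies))      # 800
print(concurrent_window_us(latencies))  # 100
```

Under these assumptions the serial window grows linearly with the number of nodes, while the concurrent window stays at the slowest single node, which is why a strict interval is easier to meet in the concurrent case.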

Anyway - I see our discussion is blocking your entire patch set - I think you can go ahead and commit it your way (I think you got an RB from Hawking), and I will then look into whether I can implement my method; if it works, I will just revert your patch.

[Le]: OK, fine.

Andrey
