From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A38E9C433F5
	for <dpdk-dev@archiver.kernel.org>; Tue, 11 Oct 2022 14:48:44 +0000 (UTC)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id C0C5A42DE1;
	Tue, 11 Oct 2022 16:48:43 +0200 (CEST)
Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187])
 by mails.dpdk.org (Postfix) with ESMTP id E3B9E42B7D
 for <dev@dpdk.org>; Tue, 11 Oct 2022 16:48:40 +0200 (CEST)
Received: from dggpeml500024.china.huawei.com (unknown [172.30.72.53])
 by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4Mmz7S1LN8zlVvP;
 Tue, 11 Oct 2022 22:44:04 +0800 (CST)
Received: from [10.82.180.49] (10.82.180.49) by dggpeml500024.china.huawei.com
 (7.185.36.10) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.31; Tue, 11 Oct
 2022 22:48:37 +0800
Message-ID: <fcad90db-443f-bc22-16ae-30112c61cc9f@huawei.com>
Date: Tue, 11 Oct 2022 22:48:37 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.2.1
Subject: Re: [PATCH v11 2/5] ethdev: support proactive error handling mode
To: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>, <thomas@monjalon.net>,
 <ferruh.yigit@xilinx.com>, <ferruh.yigit@amd.com>
CC: <dev@dpdk.org>, <kalesh-anakkur.purayil@broadcom.com>,
 <somnath.kotur@broadcom.com>, <ajit.khaparde@broadcom.com>, <mdr@ashroe.eu>
References: <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
 <20221009091009.38978-1-fengchengwen@huawei.com>
 <20221009091009.38978-3-fengchengwen@huawei.com>
 <15fd413f-5759-2509-3d4c-35c3a2e5b2b8@oktetlabs.ru>
Content-Language: en-US
From: fengchengwen <fengchengwen@huawei.com>
In-Reply-To: <15fd413f-5759-2509-3d4c-35c3a2e5b2b8@oktetlabs.ru>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
X-Originating-IP: [10.82.180.49]
X-ClientProxiedBy: dggpeml100017.china.huawei.com (7.185.36.161) To
 dggpeml500024.china.huawei.com (7.185.36.10)
X-CFilter-Loop: Reflected
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

Hi Andrew,

On 2022/10/10 16:47, Andrew Rybchenko wrote:
> On 10/9/22 12:10, Chengwen Feng wrote:
>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>
>> Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
>> to recover from the errors. In this process, the PMD sets the data path
>> pointers to dummy functions (which will prevent the crash), and also
>> make sure the control path operations failed with retcode -EBUSY.
>
> Could you explain why passive mode is not good. Why is
> proactive better? What are the benefits? IMHO, it would
> be simpler to have just one error recovery mode.


I don't think either mode is inherently good or bad. To a large extent,
the choice is determined by the hardware and software design of the NIC.
Take the hns3 driver as an example:

During error recovery, multiple handshakes are required between the
driver and the firmware, and each handshake is subject to a timeout.

If passive mode is chosen, the application may not have registered the
callback (we found that, among many DPDK-based open-source projects, only
ovs-dpdk registers for the reset event), so the recovery will fail.
Furthermore, even if the callback is registered, the recovery process
involves multiple handshakes which may take too long to complete. Imagine
multiple ports reporting a reset at the same time. (This can happen:
consider a PF reset with multiple VFs under that PF.) In this case many
VFs report the event, but the event callbacks are executed sequentially
(because there is only one interrupt thread). As a result, the later VFs
cannot be processed in time, and their reset may fail.
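
To make the timing argument concrete, here is a minimal sketch (not hns3
code) that models reset callbacks running one after another on a single
thread. The handshake timeout, the per-callback cost and the function
name are all made-up values chosen only for illustration:

```c
/* Hypothetical numbers, for illustration only (not hns3 values). */
#define HANDSHAKE_TIMEOUT_MS 100 /* assumed firmware handshake deadline */
#define CALLBACK_COST_MS      30 /* assumed time one reset callback takes */

/*
 * Count how many of n_ports miss the deadline when their callbacks are
 * executed sequentially by a single thread: port i (0-based) only
 * finishes after (i + 1) * CALLBACK_COST_MS.
 */
static int count_late_ports(int n_ports)
{
	int late = 0;

	for (int i = 0; i < n_ports; i++) {
		int finish_ms = (i + 1) * CALLBACK_COST_MS;

		if (finish_ms > HANDSHAKE_TIMEOUT_MS)
			late++;
	}
	return late;
}
```

With these assumed numbers, 3 VFs reporting together all finish in time,
but with 8 VFs the last 5 finish after the deadline, even though each
individual callback is well within it.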


In conclusion, proactive mode is a viable way to handle such errors in
engineering practice.


>>
>> The above error handling mode is known as
>> RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).
>>
>> In some service scenarios, application needs to be aware of the event
>> to determine whether to migrate services. So three events were
>> introduced:
>>
>> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
>> detected an error and the recovery is being started. Upon receiving the
>> event, the application should not invoke any control path APIs until
>> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>
>> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
>> it recovers successful from the error, the PMD already re-configures the
>> port, and the effect is the same as that of the restart operation.
>>
>> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> recovers failed from the error, the port should not usable anymore. The
>> application should close the port.
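
On the application side, the handling of these three events could be
sketched roughly as below. This is only an illustration under stated
assumptions: the enum and function names are hypothetical local
stand-ins, not ethdev API, and a real application would react to the
RTE_ETH_EVENT_* values proposed in this patch from its event callback.

```c
#include <stdbool.h>

/* Local stand-ins for the three events described above (hypothetical). */
enum recovery_event {
	EV_ERR_RECOVERING,
	EV_RECOVERY_SUCCESS,
	EV_RECOVERY_FAILED,
};

/* Hypothetical per-port state kept by the application. */
enum port_state {
	PORT_OK,         /* control path APIs may be called */
	PORT_RECOVERING, /* PMD is recovering: hold off control path calls */
	PORT_DEAD,       /* recovery failed: the port should be closed */
};

/* Map an incoming event to the next per-port state. */
static enum port_state
on_recovery_event(enum port_state cur, enum recovery_event ev)
{
	switch (ev) {
	case EV_ERR_RECOVERING:
		return PORT_RECOVERING;
	case EV_RECOVERY_SUCCESS:
		return PORT_OK;
	case EV_RECOVERY_FAILED:
		return PORT_DEAD;
	}
	return cur; /* unknown event: keep the current state */
}

/* The application checks this before invoking any control path API. */
static bool control_path_allowed(enum port_state s)
{
	return s == PORT_OK;
}
```

The point of keeping an explicit state is that the application only
needs one check before each control path call, instead of handling
-EBUSY at every call site.
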
>>
>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>
> The code itself LGTM. I just want to understand why we need it.
> It should be proved in the description.
>