From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 11 Oct 2022 22:48:37 +0800
Subject: Re: [PATCH v11 2/5] ethdev: support proactive error handling mode
From: fengchengwen
To: Andrew Rybchenko
References: <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
 <20221009091009.38978-1-fengchengwen@huawei.com>
 <20221009091009.38978-3-fengchengwen@huawei.com>
 <15fd413f-5759-2509-3d4c-35c3a2e5b2b8@oktetlabs.ru>
In-Reply-To: <15fd413f-5759-2509-3d4c-35c3a2e5b2b8@oktetlabs.ru>
List-Id: DPDK patches and discussions

Hi Andrew,

On 2022/10/10 16:47, Andrew Rybchenko wrote:
> On 10/9/22 12:10, Chengwen Feng wrote:
>> From: Kalesh AP
>>
>> Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
>> to recover from the errors. In this process, the PMD sets the data path
>> pointers to dummy functions (which will prevent the crash), and also
>> make sure the control path operations failed with retcode -EBUSY.
>
> Could you explain why passive mode is not good. Why is
> proactive better? What are the benefits? IMHO, it would
> be simpler to have just one error recovery mode.

I don't think either mode is inherently good or bad. To a large extent,
the choice is determined by the hardware and software design of the
network card chip.

Take the hns3 driver as an example. During error recovery, multiple
handshakes are required between the driver and the firmware, and each
handshake is subject to a timeout.

If passive mode is chosen, the application may not register the callback
(we found that, among many DPDK-based open-source projects, only ovs-dpdk
registers for the reset event), so the recovery will fail.

Furthermore, even if the callback is registered, the recovery process
involves multiple handshakes and may take too long to complete. Imagine
multiple ports reporting a reset at the same time. (This can happen:
consider a PF reset with multiple VFs under that PF.) In this case, many
VFs report the event, but the event callbacks are executed sequentially
(because there is only one interrupt thread). As a result, the later VFs
cannot be processed in time, and their reset may fail.

In conclusion, proactive mode is a practical error handling method that
has proven useful in engineering practice.

>>
>> The above error handling mode is known as
>> RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).
>>
>> In some service scenarios, application needs to be aware of the event
>> to determine whether to migrate services. So three events were
>> introduced:
>>
>> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
>> detected an error and the recovery is being started. Upon receiving the
>> event, the application should not invoke any control path APIs until
>> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>
>> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
>> it recovers successful from the error, the PMD already re-configures the
>> port, and the effect is the same as that of the restart operation.
>>
>> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> recovers failed from the error, the port should not usable anymore. The
>> application should close the port.
>>
>> Signed-off-by: Kalesh AP
>> Signed-off-by: Somnath Kotur
>> Signed-off-by: Chengwen Feng
>> Reviewed-by: Ajit Khaparde
>
> The code itself LGTM. I just want to understand why we need it.
> It should be proved in the description.
>
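
To make the application-side handling of the three events concrete, here
is a minimal sketch of the state tracking an application might do. It
deliberately does not use the real ethdev headers: the sketch_* enums,
struct sketch_port and the three functions are hypothetical stand-ins
that only mirror the semantics described in the commit message. In a real
application the body of sketch_port_on_event() would sit in a callback
registered with rte_eth_dev_callback_register(), and treating a failed
port as terminal is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for RTE_ETH_EVENT_ERR_RECOVERING,
 * RTE_ETH_EVENT_RECOVERY_SUCCESS and RTE_ETH_EVENT_RECOVERY_FAILED. */
enum sketch_event {
    SKETCH_EVENT_ERR_RECOVERING,
    SKETCH_EVENT_RECOVERY_SUCCESS,
    SKETCH_EVENT_RECOVERY_FAILED,
};

/* Application-side view of one port (hypothetical). */
enum sketch_port_state {
    SKETCH_PORT_USABLE,     /* control path APIs may be called */
    SKETCH_PORT_RECOVERING, /* PMD is recovering, control path is off limits */
    SKETCH_PORT_FAILED,     /* recovery failed, port should be closed */
};

struct sketch_port {
    enum sketch_port_state state;
};

/* Update the application's view of the port when an event arrives. */
void sketch_port_on_event(struct sketch_port *port, enum sketch_event ev)
{
    assert(port != NULL);

    /* Assumption of this sketch: a failed port never becomes usable again. */
    if (port->state == SKETCH_PORT_FAILED)
        return;

    switch (ev) {
    case SKETCH_EVENT_ERR_RECOVERING:
        port->state = SKETCH_PORT_RECOVERING;
        break;
    case SKETCH_EVENT_RECOVERY_SUCCESS:
        /* The PMD has already re-configured the port. */
        port->state = SKETCH_PORT_USABLE;
        break;
    case SKETCH_EVENT_RECOVERY_FAILED:
        port->state = SKETCH_PORT_FAILED;
        break;
    }
}

/* True if the application may invoke control path APIs on the port. */
bool sketch_port_control_allowed(const struct sketch_port *port)
{
    assert(port != NULL);
    return port->state == SKETCH_PORT_USABLE;
}

/* True if the application should close the port. */
bool sketch_port_must_close(const struct sketch_port *port)
{
    assert(port != NULL);
    return port->state == SKETCH_PORT_FAILED;
}
```

The application would check sketch_port_control_allowed() before each
control path call, and migrate services or close the port once
sketch_port_must_close() returns true.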