Date: Thu, 13 Oct 2022 11:58:16 +0300
Subject: Re: [PATCH v12 2/5] ethdev: support proactive error handling mode
To: Chengwen Feng, thomas@monjalon.net, ferruh.yigit@xilinx.com, ferruh.yigit@amd.com
Cc: dev@dpdk.org, kalesh-anakkur.purayil@broadcom.com, somnath.kotur@broadcom.com, ajit.khaparde@broadcom.com, mdr@ashroe.eu
References: <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
 <20221012034555.9781-1-fengchengwen@huawei.com>
 <20221012034555.9781-3-fengchengwen@huawei.com>
From: Andrew Rybchenko
Organization: OKTET Labs
In-Reply-To: <20221012034555.9781-3-fengchengwen@huawei.com>
List-Id: DPDK patches and discussions

On 10/12/22 06:45, Chengwen Feng wrote:
> From: Kalesh AP
>
> Some PMDs (e.g. hns3) could detect hardware or firmware errors, one
> error recovery mode is to report RTE_ETH_EVENT_INTR_RESET event, and
> wait for application invoke rte_eth_dev_reset() to recover the port,
> however, this mode has the following weaknesses:
>
> 1) Due to different hardware and software design, some NIC port recovery
> process requires multiple handshakes with the firmware and PF (when the
> port is VF). It takes a long time to complete the entire operation for
> one port, If multiple ports (for example, multiple VFs of a PF) are
> reset at the same time, other VFs may fail to be reset. (Because the
> reset processing is serial, the previous VFs must be processed before
> the subsequent VFs).
>
> 2) The impact on the application layer is great, and it should stop
> working queues, stop calling Rx and Tx functions, and then call
> rte_eth_dev_reset(), and re-setup all again.
>
> This patch introduces proactive error handling mode, the PMD will try
> to recover from the errors itself. In this process, the PMD sets the
> data path pointers to dummy functions (which will prevent the crash),
> and also make sure the control path operations failed with retcode
> -EBUSY.
>
> Because the PMD recovers automatically, the application can only sense
> that the data flow is disconnected for a while and the control API
> returns an error in this period.
>
> In order to sense the error happening/recovering, three events were
> introduced:
>
> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
> detected an error and the recovery is being started. Upon receiving the
> event, the application should not invoke any control path APIs until
> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
>
> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
> it recovers successful from the error, the PMD already re-configures the
> port, and the effect is the same as that of the restart operation.
>
> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
> recovers failed from the error, the port should not usable anymore. The
> application should close the port.
>
> Signed-off-by: Kalesh AP
> Signed-off-by: Somnath Kotur
> Signed-off-by: Chengwen Feng
> Reviewed-by: Ajit Khaparde

With a few nits below,

Acked-by: Andrew Rybchenko

[snip]

> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index 9d081b1cba..73941a74bd 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -627,3 +627,41 @@ by application.
>  The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>  the application to handle reset event. It is duty of application to
>  handle all synchronization before it calls rte_eth_dev_reset().
> +
> +The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
> +
> +Proactive Error Handling Mode
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +If PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, it means once detect
> +hardware or firmware errors, the PMD will try to recover from the errors. In
> +this process, the PMD sets the data path pointers to dummy functions (which
> +will prevent the crash), and also make sure the control path operations failed
> +with retcode -EBUSY.
> +
> +Also in this process, from the perspective of application, services are
> +affected. For example, the Rx/Tx bust APIs cannot receive and send packets,

bust -> burst

> +and the control plane API return failure.

I think we need to highlight here that the key advantage of proactive
error recovery is that it requires nothing from PMD by default. The
recovery simply happens.

> +
> +In some service scenarios, application needs to be aware of the event to
> +determine whether to migrate services. So three events were introduced:
> +
> +* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it detected
> +  an error and the recovery is being started. Upon receiving the event, the
> +  application should not invoke any control path APIs until receiving
> +  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
> +
> +* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that it
> +  recovers successful from the error, the PMD already re-configures the port,
> +  and the effect is the same as that of the restart operation.
> +
> +* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
> +  recovers failed from the error, the port should not usable anymore. the
> +  application should close the port.
> +
> +.. note::
> +   * Before the PMD reports the recovery result, the PMD may report the
> +     ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
> +     may occur during the recovery.
> +   * The error handling mode supported by the PMD can be reported through
> +     the ``rte_eth_dev_info_get`` API.
> diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
> + * - LRO configuration
> + * - LSC configuration
> + * - MTU
> + * - Mac address (default and those supplied by MAC address array)
> + * - Promiscuous and allmulticast mode
> + * - PTP configuration
> + * - Queue (Rx/Tx) settings
> + * - Queue statistics mappings
> + * - RSS configuration by rte_eth_dev_rss_xxx() family
> + * - Rx checksum configuration
> + * - Rx interrupt settings
> + * - Traffic management configuration
> + * - VLAN configuration (including filtering, tpid, strip, pvid)
> + * - VMDq configuration
> + * b) the following configuration maybe retained or not depending on the
> + * device capabilities:
> + * - flow rules
> + * @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
> + * - shared flow objects
> + * @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
> + * c) the other configuration will not be stored and will need to be
> + * re-configured.
> + */
> + RTE_ETH_EVENT_RECOVERY_SUCCESS,
> + /** Port recovers failed from the error.
> + * It means that the port should not usable anymore. The application
> + * should close the port.
> + */
> + RTE_ETH_EVENT_RECOVERY_FAILED,
>  RTE_ETH_EVENT_MAX /**< max value of this enum */
>  };

[snip]
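
For illustration only, the event contract described in the quoted
documentation could be tracked on the application side roughly as below.
This is a minimal, self-contained sketch under stated assumptions: the
enum recovery_event is a local stand-in for the three proposed events
(it is not the rte_ethdev.h definition), and the port_state type and
both helper functions are hypothetical names that are not part of the
patch. Treating a failed recovery as a sticky state is also an
assumption, based on the statement that the port should be closed.

```c
#include <stdbool.h>

/* Local stand-ins for the three events named in the quoted patch.
 * A real application would use enum rte_eth_event_type instead. */
enum recovery_event {
	EVENT_ERR_RECOVERING,
	EVENT_RECOVERY_SUCCESS,
	EVENT_RECOVERY_FAILED,
};

/* Hypothetical per-port state kept by the application. */
enum port_state {
	PORT_ACTIVE,     /* control path calls are allowed */
	PORT_RECOVERING, /* PMD is recovering: hold off control path calls */
	PORT_DEAD,       /* recovery failed: the port should be closed */
};

/* Compute the next state for a port when an event arrives. */
static enum port_state
port_state_on_event(enum port_state cur, enum recovery_event ev)
{
	/* Assumption: once recovery has failed, only closing the port helps. */
	if (cur == PORT_DEAD)
		return PORT_DEAD;

	switch (ev) {
	case EVENT_ERR_RECOVERING:
		/* May arrive more than once before a result is reported. */
		return PORT_RECOVERING;
	case EVENT_RECOVERY_SUCCESS:
		return PORT_ACTIVE;
	case EVENT_RECOVERY_FAILED:
		return PORT_DEAD;
	}
	return cur;
}

/* Control path APIs should only be invoked while the port is active. */
static bool
port_control_path_allowed(enum port_state s)
{
	return s == PORT_ACTIVE;
}
```

In a real application the state update would presumably run from an
event callback registered per port, and the state would need to be
shared safely with the threads that issue control path calls.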