From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
To: Chengwen Feng <fengchengwen@huawei.com>,
	thomas@monjalon.net, ferruh.yigit@xilinx.com,
	ferruh.yigit@amd.com
Cc: dev@dpdk.org, kalesh-anakkur.purayil@broadcom.com,
	somnath.kotur@broadcom.com, ajit.khaparde@broadcom.com,
	mdr@ashroe.eu
Subject: Re: [PATCH v12 2/5] ethdev: support proactive error handling mode
Date: Thu, 13 Oct 2022 11:58:16 +0300
Message-ID: <f2c881de-32c8-f402-8b73-24d70865bca1@oktetlabs.ru>
In-Reply-To: <20221012034555.9781-3-fengchengwen@huawei.com>

On 10/12/22 06:45, Chengwen Feng wrote:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Some PMDs (e.g. hns3) could detect hardware or firmware errors, one
> error recovery mode is to report RTE_ETH_EVENT_INTR_RESET event, and
> wait for application invoke rte_eth_dev_reset() to recover the port,
> however, this mode has the following weaknesses:
> 
> 1) Due to different hardware and software design, some NIC port recovery
> process requires multiple handshakes with the firmware and PF (when the
> port is VF). It takes a long time to complete the entire operation for
> one port, If multiple ports (for example, multiple VFs of a PF) are
> reset at the same time, other VFs may fail to be reset. (Because the
> reset processing is serial, the previous VFs must be processed before
> the subsequent VFs).
> 
> 2) The impact on the application layer is great, and it should stop
> working queues, stop calling Rx and Tx functions, and then call
> rte_eth_dev_reset(), and re-setup all again.
> 
> This patch introduces proactive error handling mode, the PMD will try
> to recover from the errors itself. In this process, the PMD sets the
> data path pointers to dummy functions (which will prevent the crash),
> and also make sure the control path operations failed with retcode
> -EBUSY.
> 
> Because the PMD recovers automatically, the application can only sense
> that the data flow is disconnected for a while and the control API
> returns an error in this period.
> 
> In order to sense the error happening/recovering, three events were
> introduced:
> 
> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
> detected an error and the recovery is being started. Upon receiving the
> event, the application should not invoke any control path APIs until
> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
> 
> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
> it recovers successful from the error, the PMD already re-configures the
> port, and the effect is the same as that of the restart operation.
> 
> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
> recovers failed from the error, the port should not usable anymore. The
> application should close the port.
> 
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

With a few nits below,

Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

[snip]

> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index 9d081b1cba..73941a74bd 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -627,3 +627,41 @@ by application.
>   The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>   the application to handle reset event. It is duty of application to
>   handle all synchronization before it calls rte_eth_dev_reset().
> +
> +The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
> +
> +Proactive Error Handling Mode
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +If PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, it means once detect
> +hardware or firmware errors, the PMD will try to recover from the errors. In
> +this process, the PMD sets the data path pointers to dummy functions (which
> +will prevent the crash), and also make sure the control path operations failed
> +with retcode -EBUSY.
> +
> +Also in this process, from the perspective of application, services are
> +affected. For example, the Rx/Tx bust APIs cannot receive and send packets,

bust -> burst

> +and the control plane API return failure.

I think we need to highlight here that the key advantage of
proactive error recovery is that it requires nothing from PMD by
default. The recovery simply happens.
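
To make the "simply happens" point concrete, here is a minimal,
self-contained C sketch of the mechanism the quoted text describes:
the data path pointer is swapped to a dummy function during recovery
and control path calls return -EBUSY. All names (mock_port,
mock_recovery_start, mock_set_mtu and so on) are hypothetical
stand-ins for illustration, not ethdev or PMD code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a burst function pointer; not the ethdev type. */
typedef uint16_t (*rx_burst_fn)(void *queue, void **pkts, uint16_t nb_pkts);

/* Hypothetical port object holding the data path pointer. */
struct mock_port {
	rx_burst_fn rx_burst; /* called by the application on the data path */
	int recovering;       /* non-zero while the driver is recovering */
};

/* Normal receive path: pretend every requested packet arrived. */
static uint16_t real_rx_burst(void *queue, void **pkts, uint16_t nb_pkts)
{
	(void)queue;
	(void)pkts;
	return nb_pkts;
}

/* Dummy receive path: touches no hardware and reports no packets. */
static uint16_t dummy_rx_burst(void *queue, void **pkts, uint16_t nb_pkts)
{
	(void)queue;
	(void)pkts;
	(void)nb_pkts;
	return 0;
}

static void mock_port_init(struct mock_port *p)
{
	p->rx_burst = real_rx_burst;
	p->recovering = 0;
}

/* Driver side: an error was detected, so park the data path. */
static void mock_recovery_start(struct mock_port *p)
{
	p->recovering = 1;
	p->rx_burst = dummy_rx_burst;
}

/* Driver side: recovery finished, so restore the real data path. */
static void mock_recovery_done(struct mock_port *p)
{
	assert(p->recovering); /* only valid after mock_recovery_start() */
	p->rx_burst = real_rx_burst;
	p->recovering = 0;
}

/* Control path operation: refused with -EBUSY while recovering. */
static int mock_set_mtu(struct mock_port *p, uint16_t mtu)
{
	(void)mtu;
	return p->recovering ? -EBUSY : 0;
}
```

In this sketch an application that keeps calling p->rx_burst() during
recovery just sees zero packets, and sees traffic again once the
pointer is restored, without taking any action itself.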

> +
> +In some service scenarios, application needs to be aware of the event to
> +determine whether to migrate services. So three events were introduced:
> +
> +* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it detected
> +  an error and the recovery is being started. Upon receiving the event, the
> +  application should not invoke any control path APIs until receiving
> +  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
> +
> +* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that it
> +  recovers successful from the error, the PMD already re-configures the port,
> +  and the effect is the same as that of the restart operation.
> +
> +* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
> +  recovers failed from the error, the port should not usable anymore. the
> +  application should close the port.
> +
> +.. note::
> +        * Before the PMD reports the recovery result, the PMD may report the
> +          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
> +          may occur during the recovery.
> +        * The error handling mode supported by the PMD can be reported through
> +          the ``rte_eth_dev_info_get`` API.
> diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
> +	 *    - LRO configuration
> +	 *    - LSC configuration
> +	 *    - MTU
> +	 *    - Mac address (default and those supplied by MAC address array)
> +	 *    - Promiscuous and allmulticast mode
> +	 *    - PTP configuration
> +	 *    - Queue (Rx/Tx) settings
> +	 *    - Queue statistics mappings
> +	 *    - RSS configuration by rte_eth_dev_rss_xxx() family
> +	 *    - Rx checksum configuration
> +	 *    - Rx interrupt settings
> +	 *    - Traffic management configuration
> +	 *    - VLAN configuration (including filtering, tpid, strip, pvid)
> +	 *    - VMDq configuration
> +	 * b) the following configuration maybe retained or not depending on the
> +	 *    device capabilities:
> +	 *    - flow rules
> +	 *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
> +	 *    - shared flow objects
> +	 *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
> +	 * c) the other configuration will not be stored and will need to be
> +	 *    re-configured.
> +	 */
> +	RTE_ETH_EVENT_RECOVERY_SUCCESS,
> +	/** Port recovers failed from the error.
> +	 * It means that the port should not usable anymore. The application
> +	 * should close the port.
> +	 */
> +	RTE_ETH_EVENT_RECOVERY_FAILED,
>   	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>   };

[snip]
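
For an application that does want to follow the three events, the
rules in the quoted documentation can be sketched as a small state
machine. This is an illustration under assumptions, not code from the
patch: the mock_event values are local stand-ins for the
RTE_ETH_EVENT_* values (the real ones come from rte_ethdev.h), and the
port_state type and helper names are hypothetical. Treating a failed
port as final until it is closed is also an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the three recovery events (hypothetical names). */
enum mock_event {
	MOCK_EVENT_ERR_RECOVERING,
	MOCK_EVENT_RECOVERY_SUCCESS,
	MOCK_EVENT_RECOVERY_FAILED,
};

/* Application-side view of one port (hypothetical). */
enum port_state { PORT_RUNNING, PORT_RECOVERING, PORT_DEAD };

/* Compute the next state for an event delivered to the application. */
static enum port_state port_on_event(enum port_state cur, enum mock_event ev)
{
	if (cur == PORT_DEAD)
		return PORT_DEAD; /* assumption: failure is final until close */

	switch (ev) {
	case MOCK_EVENT_ERR_RECOVERING:
		/* May arrive again before the result is reported. */
		return PORT_RECOVERING;
	case MOCK_EVENT_RECOVERY_SUCCESS:
		return PORT_RUNNING;
	case MOCK_EVENT_RECOVERY_FAILED:
		return PORT_DEAD;
	default:
		assert(!"unknown event");
		return cur;
	}
}

/* Control path APIs should only be invoked while the port is running. */
static bool port_control_allowed(enum port_state s)
{
	return s == PORT_RUNNING;
}

/* After a failed recovery the application should close the port. */
static bool port_should_close(enum port_state s)
{
	return s == PORT_DEAD;
}
```

A real application would presumably update such a state from its
ethdev event callback and check port_control_allowed() before issuing
control path calls.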


