* [PATCH v9 0/5] support error handling mode
       [not found] <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
@ 2022-09-22  7:41 ` Chengwen Feng
  2022-09-22  7:41   ` [PATCH v9 1/5] ethdev: support get port " Chengwen Feng
                     ` (4 more replies)
  2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
                   ` (3 subsequent siblings)
  4 siblings, 5 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-09-22  7:41 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patchset introduces the concept of an error handling mode. The
supported modes are as follows:
1) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

2) PROACTIVE: proactive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_ERR_RECOVERING event, performs
the recovery internally, and finally reports the recovery result event.
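
As a rough illustration of how an application might act on the two modes,
the sketch below maps a mode to the set of events worth subscribing to.
The enum names and values are local stand-ins (assumptions), not the real
definitions from rte_ethdev.h, and event_bit()/events_for_mode() are
hypothetical helpers that exist only in this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; a real application would use the
 * definitions from <rte_ethdev.h> instead. */
enum err_handle_mode {
	MODE_NONE,
	MODE_PASSIVE,
	MODE_PROACTIVE,
};

enum event_type {
	EV_INTR_RESET,
	EV_ERR_RECOVERING,
	EV_RECOVERY_SUCCESS,
	EV_RECOVERY_FAILED,
	EV_MAX,
};

/* One bit per event, in the style of the testpmd event masks. */
uint32_t
event_bit(enum event_type ev)
{
	assert(ev < EV_MAX); /* keeps the shift count below 32 */
	return UINT32_C(1) << ev;
}

/* Events an application would want to receive for a given mode. */
uint32_t
events_for_mode(enum err_handle_mode mode)
{
	switch (mode) {
	case MODE_PASSIVE:
		/* The application must react and reset the port itself. */
		return event_bit(EV_INTR_RESET);
	case MODE_PROACTIVE:
		/* The PMD recovers on its own and reports progress. */
		return event_bit(EV_ERR_RECOVERING) |
		       event_bit(EV_RECOVERY_SUCCESS) |
		       event_bit(EV_RECOVERY_FAILED);
	default:
		/* No error handling mode is reported by the port. */
		return 0;
	}
}
```

In a real application the mode would be read from the dev_info structure
filled in by rte_eth_dev_info_get(), and each selected event would be
registered with rte_eth_dev_callback_register().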

Chengwen Feng (2):
  ethdev: support get port error handling mode
  net/hns3: support proactive error handling mode

Kalesh AP (3):
  ethdev: support proactive error handling mode
  app/testpmd: support error handling mode event
  net/bnxt: support proactive error handling mode

---
v9: Introduce error handling mode concept.
    Addressed comments from Thomas and Ray. 
v8: Addressed comments from Thomas and Ferruh.
    Also introduced RECOVER_FAIL event.
    Add hns3 driver patch.
v7: Addressed comments from Thomas and Andrew.
v6: Addressed comments from Asaf Penso.
    1. Updated 20.11 release notes with the new events added.
    2. Updated testpmd parse_event_printing_config function.
v5: Addressed comments from Ophir Munk.
    1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
    2. Fixed testpmd logs.
    3. Documented the new recovery events.
v4: Addressed comments from Thomas Monjalon
    1. Added doxygen comments about new events.
v3: Fixed a typo in commit log.
v2: Added a new event RTE_ETH_EVENT_RESET instead of using the
    RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

 app/test-pmd/config.c                   |  4 ++
 app/test-pmd/parameters.c               | 10 ++++-
 app/test-pmd/testpmd.c                  |  8 +++-
 doc/guides/prog_guide/poll_mode_drv.rst | 39 +++++++++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 ++++++
 drivers/net/bnxt/bnxt_cpr.c             |  4 ++
 drivers/net/bnxt/bnxt_ethdev.c          | 13 ++++++-
 drivers/net/e1000/igb_ethdev.c          |  2 +
 drivers/net/ena/ena_ethdev.c            |  2 +
 drivers/net/hns3/hns3_common.c          |  2 +
 drivers/net/hns3/hns3_intr.c            | 24 ++++++++++++
 drivers/net/iavf/iavf_ethdev.c          |  2 +
 drivers/net/ixgbe/ixgbe_ethdev.c        |  2 +
 drivers/net/txgbe/txgbe_ethdev_vf.c     |  2 +
 lib/ethdev/rte_ethdev.h                 | 52 ++++++++++++++++++++++++-
 15 files changed, 173 insertions(+), 5 deletions(-)

-- 
2.17.1



* [PATCH v9 1/5] ethdev: support get port error handling mode
  2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
@ 2022-09-22  7:41   ` Chengwen Feng
  2022-10-03 17:35     ` Ferruh Yigit
  2022-09-22  7:41   ` [PATCH v9 2/5] ethdev: support proactive " Chengwen Feng
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Chengwen Feng @ 2022-09-22  7:41 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds support for getting a port's error handling mode through
the rte_eth_dev_info_get() API.

Currently, the defined modes include:
1) NONE: no error handling mode is supported by this port.
2) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 app/test-pmd/config.c               |  2 ++
 drivers/net/e1000/igb_ethdev.c      |  2 ++
 drivers/net/ena/ena_ethdev.c        |  2 ++
 drivers/net/iavf/iavf_ethdev.c      |  2 ++
 drivers/net/ixgbe/ixgbe_ethdev.c    |  2 ++
 drivers/net/txgbe/txgbe_ethdev_vf.c |  2 ++
 lib/ethdev/rte_ethdev.h             | 19 ++++++++++++++++++-
 7 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 86054455d2..0c10c663e9 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -922,6 +922,8 @@ port_infos_display(portid_t port_id)
 			printf("Switch Rx domain: %u\n",
 			       dev_info.switch_info.rx_domain);
 	}
+	if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE)
+		printf("Device error handling mode: passive\n");
 }
 
 void
diff --git a/drivers/net/e1000/igb_ethdev.c b/drivers/net/e1000/igb_ethdev.c
index a9c18b27e8..dea69c9db1 100644
--- a/drivers/net/e1000/igb_ethdev.c
+++ b/drivers/net/e1000/igb_ethdev.c
@@ -2341,6 +2341,8 @@ eth_igbvf_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c
index 3e88bcda6c..efcb163027 100644
--- a/drivers/net/ena/ena_ethdev.c
+++ b/drivers/net/ena/ena_ethdev.c
@@ -2482,6 +2482,8 @@ static int ena_infos_get(struct rte_eth_dev *dev,
 	dev_info->default_rxportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 	dev_info->default_txportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 652f0d00a5..b2ef2dc366 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1178,6 +1178,8 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 		.nb_align = IAVF_ALIGN_RING_DESC,
 	};
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index f31bbb7895..7b68b171e6 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -4056,6 +4056,8 @@ ixgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index f52cd8bc19..3b1f7c913b 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -521,6 +521,8 @@ txgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..930b0a2fff 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1848,6 +1848,19 @@ enum rte_eth_representor_type {
 	RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical Function. */
 };
 
+/**
+ * Ethernet device error handling mode.
+ */
+enum rte_eth_err_handle_mode {
+	/** No error handling modes are supported. */
+	RTE_ETH_ERROR_HANDLE_MODE_NONE,
+	/** Passive error handling, after the PMD detect that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET event, and
+	 * application invoke @see rte_eth_dev_reset to recover the port.
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+};
+
 /**
  * A structure used to retrieve the contextual information of
  * an Ethernet device, such as the controlling driver of the
@@ -1908,8 +1921,12 @@ struct rte_eth_dev_info {
 	 * embedded managed interconnect/switch.
 	 */
 	struct rte_eth_switch_info switch_info;
+	/** Supported error handling mode. @see enum rte_eth_err_handle_mode */
+	uint8_t err_handle_mode;
 
-	uint64_t reserved_64s[2]; /**< Reserved for future fields */
+	uint8_t reserved_8;       /**< Reserved for future fields  */
+	uint16_t reserved_16s[3]; /**< Reserved for future fields  */
+	uint64_t reserved_64;     /**< Reserved for future fields */
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
 };
 
-- 
2.17.1



* [PATCH v9 2/5] ethdev: support proactive error handling mode
  2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
  2022-09-22  7:41   ` [PATCH v9 1/5] ethdev: support get port " Chengwen Feng
@ 2022-09-22  7:41   ` Chengwen Feng
  2022-10-03 17:35     ` Ferruh Yigit
  2022-09-22  7:41   ` [PATCH v9 3/5] app/testpmd: support error handling mode event Chengwen Feng
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Chengwen Feng @ 2022-09-22  7:41 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Some PMDs (e.g. hns3) can detect hardware or firmware errors and try
to recover from them. During recovery, the PMD sets the data path
pointers to dummy functions (which prevents a crash), and also
makes sure that control path operations fail with return code -EBUSY.

The above error handling mode is known as
RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).

In some service scenarios, the application needs to be aware of these
events to determine whether to migrate services. So three events are
introduced:

1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that the
PMD detected an error and recovery is starting. Upon receiving the
event, the application should not invoke any control path APIs until it
receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
RTE_ETH_EVENT_RECOVERY_FAILED event.

2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
the port recovered successfully from the error. The PMD has already
re-configured the port to the state prior to the error.

3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that
recovery from the error failed and the port is no longer usable. The
application should close the port.
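
The event sequence described above can be pictured as a small state
machine on the application side. This is only a sketch under assumptions:
the enum names are local stand-ins for the RTE_ETH_EVENT_* values, and
port_on_event() and control_path_allowed() are hypothetical helpers, not
part of this series or of ethdev.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the three recovery events. */
enum recovery_event {
	EV_ERR_RECOVERING,
	EV_RECOVERY_SUCCESS,
	EV_RECOVERY_FAILED,
};

/* Application-side view of a port. */
enum port_state {
	PORT_RUNNING,    /* normal operation */
	PORT_RECOVERING, /* PMD is recovering; keep off the control path */
	PORT_FAILED,     /* recovery failed; the port should be closed */
};

/* Compute the next state for a received recovery event. */
enum port_state
port_on_event(enum port_state cur, enum recovery_event ev)
{
	assert(ev <= EV_RECOVERY_FAILED); /* only the three events above */

	if (cur == PORT_FAILED)
		return PORT_FAILED; /* unusable until the port is closed */

	switch (ev) {
	case EV_ERR_RECOVERING:
		/* May arrive more than once before a result is reported. */
		return PORT_RECOVERING;
	case EV_RECOVERY_SUCCESS:
		return PORT_RUNNING;
	case EV_RECOVERY_FAILED:
		return PORT_FAILED;
	}
	return cur;
}

/* Control path APIs should only be invoked while the port is running. */
bool
control_path_allowed(enum port_state state)
{
	return state == PORT_RUNNING;
}
```

An event callback would call port_on_event() for the affected port, and
the control path code would check control_path_allowed() before invoking
operations such as configure or stop.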

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/config.c                   |  2 ++
 doc/guides/prog_guide/poll_mode_drv.rst | 39 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 ++++++++
 lib/ethdev/rte_ethdev.h                 | 33 +++++++++++++++++++++
 4 files changed, 86 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 0c10c663e9..b716d2a15f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -924,6 +924,8 @@ port_infos_display(portid_t port_id)
 	}
 	if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE)
 		printf("Device error handling mode: passive\n");
+	else if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE)
+		printf("Device error handling mode: proactive\n");
 }
 
 void
diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..232dc459b0 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,42 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
+
+Proactive Error Handling Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, then once it detects
+hardware or firmware errors, the PMD will try to recover from them. During
+recovery, the PMD sets the data path pointers to dummy functions (which
+prevents a crash), and also makes sure the control path operations fail
+with return code -EBUSY.
+
+During recovery, from the perspective of the application, services are
+affected. For example, the Rx/Tx burst APIs cannot receive or send packets,
+and the control plane APIs return failure.
+
+In some service scenarios, the application needs to be aware of these events
+to determine whether to migrate services. So three events are introduced:
+
+* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that the PMD
+  detected an error and recovery is starting. Upon receiving the event, the
+  application should not invoke any control path APIs until it receives the
+  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
+
+
+* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that the
+  port recovered successfully from the error. The PMD has already
+  re-configured the port to the state prior to the error.
+
+* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that
+  recovery from the error failed and the port is no longer usable. The
+  application should close the port.
+
+.. note::
+        * Before the PMD reports the recovery result, the PMD may report the
+          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
+          may occur during the recovery.
+        * The error handling mode supported by the PMD can be reported through
+          the ``rte_eth_dev_info_get`` API.
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 8c021cf050..fc85e5fa87 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -55,6 +55,18 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added proactive error handling mode for ethdev.**
+
+  Added proactive error handling mode for ethdev, and three events were
+  introduced:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` for the PMD to report
+    that the port recovered successfully from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_FAILED`` for the PMD to report
+    that the port failed to recover from an error.
+
 
 Removed Items
 -------------
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 930b0a2fff..d3e81b98a7 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1859,6 +1859,12 @@ enum rte_eth_err_handle_mode {
 	 * application invoke @see rte_eth_dev_reset to recover the port.
 	 */
 	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+	/** Proactive error handling: after the PMD detects that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_ERR_RECOVERING event,
+	 * does the recovery internally, and finally reports the recovery
+	 * result event (@see RTE_ETH_EVENT_RECOVERY_*).
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE,
 };
 
 /**
@@ -3944,6 +3950,33 @@ enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error.
+	 * If the PMD supports proactive error recovery, it should trigger this
+	 * event to notify the application that it detected an error and that
+	 * recovery is starting. Upon receiving the event, the application
+	 * should not invoke any control path APIs (such as
+	 * rte_eth_dev_configure/rte_eth_dev_stop...) until it receives the
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
+	 * event.
+	 * The PMD will set the data path pointers to dummy functions, and
+	 * restore them to non-dummy functions before it reports the
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS event. It means that the application
+	 * cannot send or receive any packets during this period.
+	 * @note Before the PMD reports the recovery result, the PMD may report
+	 * the RTE_ETH_EVENT_ERR_RECOVERING event again, because a larger error
+	 * may occur during the recovery.
+	 */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovered successfully from the error.
+	 * The PMD has already re-configured the port to the state prior to
+	 * the error.
+	 */
+	RTE_ETH_EVENT_RECOVERY_SUCCESS,
+	/** Port failed to recover from the error.
+	 * It means that the port is no longer usable. The application
+	 * should close the port.
+	 */
+	RTE_ETH_EVENT_RECOVERY_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.17.1



* [PATCH v9 3/5] app/testpmd: support error handling mode event
  2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
  2022-09-22  7:41   ` [PATCH v9 1/5] ethdev: support get port " Chengwen Feng
  2022-09-22  7:41   ` [PATCH v9 2/5] ethdev: support proactive " Chengwen Feng
@ 2022-09-22  7:41   ` Chengwen Feng
  2022-09-22  7:41   ` [PATCH v9 4/5] net/hns3: support proactive error handling mode Chengwen Feng
  2022-09-22  7:41   ` [PATCH v9 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-09-22  7:41 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for printing and masking the error handling mode
events in testpmd.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/parameters.c | 10 ++++++++--
 app/test-pmd/testpmd.c    |  8 +++++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index e3c9757f3f..06fdad2644 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -178,9 +178,9 @@ usage(char* progname)
 	printf("  --no-rmv-interrupt: disable device removal interrupt.\n");
 	printf("  --bitrate-stats=N: set the logical core N to perform "
 		"bit-rate calculation.\n");
-	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "enable print of designated event or all of them.\n");
-	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "disable print of designated event or all of them.\n");
 	printf("  --flow-isolate-all: "
 	       "requests flow API isolated mode on all ports at initialization time.\n");
@@ -464,6 +464,12 @@ parse_event_printing_config(const char *optarg, int enable)
 		mask = UINT32_C(1) << RTE_ETH_EVENT_DESTROY;
 	else if (!strcmp(optarg, "flow_aged"))
 		mask = UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED;
+	else if (!strcmp(optarg, "err_recovering"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING;
+	else if (!strcmp(optarg, "recovery_success"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS;
+	else if (!strcmp(optarg, "recovery_failed"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED;
 	else if (!strcmp(optarg, "all"))
 		mask = ~UINT32_C(0);
 	else {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index addcbcac85..7ef53020fe 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -425,6 +425,9 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
 	[RTE_ETH_EVENT_RX_AVAIL_THRESH] = "RxQ available descriptors threshold reached",
+	[RTE_ETH_EVENT_ERR_RECOVERING] = "error recovering",
+	[RTE_ETH_EVENT_RECOVERY_SUCCESS] = "error recovery successful",
+	[RTE_ETH_EVENT_RECOVERY_FAILED] = "error recovery failed",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -439,7 +442,10 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.17.1



* [PATCH v9 4/5] net/hns3: support proactive error handling mode
  2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
                     ` (2 preceding siblings ...)
  2022-09-22  7:41   ` [PATCH v9 3/5] app/testpmd: support error handling mode event Chengwen Feng
@ 2022-09-22  7:41   ` Chengwen Feng
  2022-09-22  7:41   ` [PATCH v9 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-09-22  7:41 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds support for proactive error handling mode to the hns3 PMD.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/hns3/hns3_common.c |  2 ++
 drivers/net/hns3/hns3_intr.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/drivers/net/hns3/hns3_common.c b/drivers/net/hns3/hns3_common.c
index 424205356e..624360b601 100644
--- a/drivers/net/hns3/hns3_common.c
+++ b/drivers/net/hns3/hns3_common.c
@@ -149,6 +149,8 @@ hns3_dev_infos_get(struct rte_eth_dev *eth_dev, struct rte_eth_dev_info *info)
 		info->max_mac_addrs = HNS3_VF_UC_MACADDR_NUM;
 	}
 
+	info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/hns3/hns3_intr.c b/drivers/net/hns3/hns3_intr.c
index 3ca2e1e338..65bcbfb0a1 100644
--- a/drivers/net/hns3/hns3_intr.c
+++ b/drivers/net/hns3/hns3_intr.c
@@ -1486,6 +1486,27 @@ static const struct hns3_hw_err_type hns3_hw_error_type[] = {
 	}
 };
 
+static void
+hns3_report_reset_begin(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_ERR_RECOVERING, NULL);
+}
+
+static void
+hns3_report_reset_success(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_SUCCESS, NULL);
+}
+
+static void
+hns3_report_reset_failed(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
+}
+
 static int
 hns3_config_ncsi_hw_err_int(struct hns3_adapter *hns, bool en)
 {
@@ -2645,6 +2666,7 @@ hns3_reset_pre(struct hns3_adapter *hns)
 	if (hw->reset.stage == RESET_STAGE_NONE) {
 		__atomic_store_n(&hns->hw.reset.resetting, 1, __ATOMIC_RELAXED);
 		hw->reset.stage = RESET_STAGE_DOWN;
+		hns3_report_reset_begin(hw);
 		ret = hw->reset.ops->stop_service(hns);
 		hns3_clock_gettime(&tv);
 		if (ret) {
@@ -2754,6 +2776,7 @@ hns3_reset_post(struct hns3_adapter *hns)
 			  hns3_clock_calctime_ms(&tv_delta),
 			  tv.tv_sec, tv.tv_usec);
 		hw->reset.level = HNS3_NONE_RESET;
+		hns3_report_reset_success(hw);
 	}
 	return 0;
 
@@ -2799,6 +2822,7 @@ hns3_reset_fail_handle(struct hns3_adapter *hns)
 		  hns3_clock_calctime_ms(&tv_delta),
 		  tv.tv_sec, tv.tv_usec);
 	hw->reset.level = HNS3_NONE_RESET;
+	hns3_report_reset_failed(hw);
 }
 
 /*
-- 
2.17.1



* [PATCH v9 5/5] net/bnxt: support proactive error handling mode
  2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
                     ` (3 preceding siblings ...)
  2022-09-22  7:41   ` [PATCH v9 4/5] net/hns3: support proactive error handling mode Chengwen Feng
@ 2022-09-22  7:41   ` Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-09-22  7:41 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for proactive error handling mode to the bnxt PMD.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/bnxt/bnxt_cpr.c    |  4 ++++
 drivers/net/bnxt/bnxt_ethdev.c | 13 ++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 99af0f9e87..5bb376d4d5 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -180,6 +180,10 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			return;
 		}
 
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_ERR_RECOVERING,
+					     NULL);
+
 		pthread_mutex_lock(&bp->err_recovery_lock);
 		event_data = data1;
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index e275d3a53f..3da0302b1b 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -1063,6 +1063,8 @@ static int bnxt_dev_info_get_op(struct rte_eth_dev *eth_dev,
 	dev_info->vmdq_pool_base = 0;
 	dev_info->vmdq_queue_base = 0;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
@@ -4381,13 +4383,18 @@ static void bnxt_dev_recover(void *arg)
 	PMD_DRV_LOG(INFO, "Port: %u Recovered from FW reset\n",
 		    bp->eth_dev->data->port_id);
 	pthread_mutex_unlock(&bp->err_recovery_lock);
-
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_SUCCESS,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop(bp->eth_dev);
 err:
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_FAILED,
+				     NULL);
 	if (bp->eth_dev->data->dev_conf.intr_conf.rmv)
 		rte_eth_dev_callback_process(bp->eth_dev,
 					     RTE_ETH_EVENT_INTR_RMV,
@@ -4559,6 +4566,10 @@ static void bnxt_check_fw_health(void *arg)
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
 
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_ERR_RECOVERING,
+				     NULL);
+
 	if (bnxt_is_primary_func(bp))
 		wait_msec = info->primary_func_wait_period;
 	else
-- 
2.17.1



* Re: [PATCH v9 1/5] ethdev: support get port error handling mode
  2022-09-22  7:41   ` [PATCH v9 1/5] ethdev: support get port " Chengwen Feng
@ 2022-10-03 17:35     ` Ferruh Yigit
  2022-10-05  1:56       ` fengchengwen
  0 siblings, 1 reply; 41+ messages in thread
From: Ferruh Yigit @ 2022-10-03 17:35 UTC (permalink / raw)
  To: Chengwen Feng, thomas
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

On 9/22/2022 8:41 AM, Chengwen Feng wrote:
> This patch support gets port's error handling mode by
> rte_eth_dev_info_get() API.
> 
> Currently, the defined modes include:
> 1) NONE: it means no error handling modes are supported by this port.
> 2) PASSIVE: passive error handling, after the PMD detect that a reset
> is required, the PMD reports RTE_ETH_EVENT_INTR_RESET event, and
> application invoke rte_eth_dev_reset() to recover the port.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

<...>

> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index de9e970d4d..930b0a2fff 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1848,6 +1848,19 @@ enum rte_eth_representor_type {
>   	RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical Function. */
>   };
>   
> +/**
> + * Ethernet device error handling mode.

This needs to be marked experimental, if we decide to keep it.

> + */
> +enum rte_eth_err_handle_mode {
> +	/** No error handling modes are supported. */
> +	RTE_ETH_ERROR_HANDLE_MODE_NONE,
> +	/** Passive error handling, after the PMD detect that a reset is
> +	 * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET event, and
> +	 * application invoke @see rte_eth_dev_reset to recover the port.
> +	 */
> +	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,

Hi Chengwen,

Is the intention of the 'PASSIVE' / 'PROACTIVE' mode to let the application
decide which events to register for? Like some kind of capability?

If mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE
	register RTE_ETH_EVENT_INTR_RESET

if mode == RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE
	register ERR_RECOVERING | RECOVERY_SUCCESS | RECOVERY_FAILED

Can't a PMD support both?
Or does the application really need to know this? What happens if it
registers for all events and implements the related actions for them?


> +};
> +
>   /**
>    * A structure used to retrieve the contextual information of
>    * an Ethernet device, such as the controlling driver of the
> @@ -1908,8 +1921,12 @@ struct rte_eth_dev_info {
>   	 * embedded managed interconnect/switch.
>   	 */
>   	struct rte_eth_switch_info switch_info;
> +	/** Supported error handling mode. @see enum rte_eth_err_handle_mode */
> +	uint8_t err_handle_mode;
>   

I guess 'uint8_t' is used to save space, but an 'enum' is usually an
integer (although, as far as I remember, the compiler can select a smaller
type when the values fit), so I am concerned about whether this causes any
warning. If it does not, I agree with using the smaller type, since we know
the number of possible handling modes is limited and small.

> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	uint8_t reserved_8;       /**< Reserved for future fields  */
> +	uint16_t reserved_16s[3]; /**< Reserved for future fields  */
> +	uint64_t reserved_64;     /**< Reserved for future fields */
>   	void *reserved_ptrs[2];   /**< Reserved for future fields */
>   };
>   
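
The layout change quoted above can be sanity-checked in isolation. The
sketch below reproduces only the tail of the structure, before and after
the patch, and compares size and member offsets. The struct and function
names are hypothetical, the rest of rte_eth_dev_info is omitted, and the
result is expected to hold on common ABIs rather than being guaranteed
everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for enum rte_eth_err_handle_mode. */
enum err_handle_mode {
	MODE_NONE,
	MODE_PASSIVE,
	MODE_PROACTIVE,
};

/* Tail of the info structure before the patch. */
struct tail_before {
	uint64_t reserved_64s[2];
	void *reserved_ptrs[2];
};

/* Tail after the patch: the first reserved 64-bit slot is split up. */
struct tail_after {
	uint8_t err_handle_mode;
	uint8_t reserved_8;
	uint16_t reserved_16s[3];
	uint64_t reserved_64;
	void *reserved_ptrs[2];
};

/* Returns 1 when size and the offset of the pointer array are unchanged. */
int
tail_layout_preserved(void)
{
	return sizeof(struct tail_after) == sizeof(struct tail_before) &&
	       offsetof(struct tail_after, reserved_ptrs) ==
	       offsetof(struct tail_before, reserved_ptrs);
}

/* Store an enum value in the 8-bit field; the cast makes the narrowing
 * explicit, which avoids conversion warnings on strict compiler settings. */
uint8_t
mode_to_u8(enum err_handle_mode mode)
{
	assert((unsigned int)mode <= UINT8_MAX);
	return (uint8_t)mode;
}
```

Comparing the 8-bit field against an enum constant promotes both operands
to int, so the equality checks used in the drivers and testpmd remain
well defined.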



* Re: [PATCH v9 2/5] ethdev: support proactive error handling mode
  2022-09-22  7:41   ` [PATCH v9 2/5] ethdev: support proactive " Chengwen Feng
@ 2022-10-03 17:35     ` Ferruh Yigit
  0 siblings, 0 replies; 41+ messages in thread
From: Ferruh Yigit @ 2022-10-03 17:35 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

On 9/22/2022 8:41 AM, Chengwen Feng wrote:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Some PMDs (e.g. hns3) can detect hardware or firmware errors and try
> to recover from them. During recovery, the PMD sets the data path
> pointers to dummy functions (which prevents a crash), and also
> makes sure the control path operations fail with return code -EBUSY.
>
> The above error handling mode is known as
> RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).
>
> In some service scenarios, the application needs to be aware of these
> events to determine whether to migrate services. So three events are
> introduced:
>
> 1) RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD
> detected an error and recovery is starting. Upon receiving the
> event, the application should not invoke any control path APIs until it
> receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
>
> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the PMD
> recovered successfully from the error. The PMD has already re-configured
> the port to the state prior to the error.
>
> 3) RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the PMD
> failed to recover from the error. The port should not be used anymore, and
> the application should close it.
> 

I think two separate events, 'RECOVERY_SUCCESS' and 'RECOVERY_FAILED', are
better than the previous 'RECOVERED' event.

'RECOVERY_FAILED' is clear,
but for the 'RECOVERY_SUCCESS' case, can we try to define more precisely what
the application should do?
For example, should the application assume that nothing changed in the device
configuration, flow rules etc., or at the other extreme, should it assume that
all configuration is lost?

> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

<...>



* Re: [PATCH v9 1/5] ethdev: support get port error handling mode
  2022-10-03 17:35     ` Ferruh Yigit
@ 2022-10-05  1:56       ` fengchengwen
  0 siblings, 0 replies; 41+ messages in thread
From: fengchengwen @ 2022-10-05  1:56 UTC (permalink / raw)
  To: Ferruh Yigit, thomas
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

Hi Ferruh,

On 2022/10/4 1:35, Ferruh Yigit wrote:
> On 9/22/2022 8:41 AM, Chengwen Feng wrote:
>> This patch adds support for getting a port's error handling mode through
>> the rte_eth_dev_info_get() API.
>>
>> Currently, the defined modes are:
>> 1) NONE: no error handling mode is supported by this port.
>> 2) PASSIVE: passive error handling. After the PMD detects that a reset
>> is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
>> application invokes rte_eth_dev_reset() to recover the port.
>>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>
> <...>
>
>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>> index de9e970d4d..930b0a2fff 100644
>> --- a/lib/ethdev/rte_ethdev.h
>> +++ b/lib/ethdev/rte_ethdev.h
>> @@ -1848,6 +1848,19 @@ enum rte_eth_representor_type {
>>       RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical 
>> Function. */
>>   };
>>   +/**
>> + * Ethernet device error handling mode.
>
> Needs to be experimental, if decides to keep.


Will fix in v10.


>
>> + */
>> +enum rte_eth_err_handle_mode {
>> +    /** No error handling modes are supported. */
>> +    RTE_ETH_ERROR_HANDLE_MODE_NONE,
>> +    /** Passive error handling, after the PMD detect that a reset is
>> +     * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET 
>> event, and
>> +     * application invoke @see rte_eth_dev_reset to recover the port.
>> +     */
>> +    RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
>
> Hi Chengwen,
>
> Is the intention of 'PASSIVE' / 'PROACTIVE' mode to let application 
> decide which event to register? Like some kind of capability?
>
> If mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE
>     register RTE_ETH_EVENT_INTR_RESET
>
> if mode == RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE
>     register ERR_RECOVERING | RECOVERY_SUCCESS | RECOVERY_FAILED
>

It is mainly meant to standardize the two error handling modes so that they
are not confused with each other.

Conceptually, the two reset modes are kept separate, which makes them easier
to understand.


> Can't a PMD support both?
Currently, I find that no PMD supports both.

If a new PMD supports both modes, this can be extended easily. For example,
the field can be defined as a bitmask, and the existing values would not
change, because PASSIVE corresponds to 1 (1<<0) and PROACTIVE corresponds
to 2 (1<<1).
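
To make the bitmask idea above concrete, here is a minimal sketch in C. It is
only an illustration: the SKETCH_* names and sketch_mode_supported() are
hypothetical and not part of this series, which defines a plain enum whose
values are NONE = 0, PASSIVE = 1 and PROACTIVE = 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical bitmask variant of the error handling modes.
 * The numeric values of PASSIVE and PROACTIVE match the plain enum
 * proposed in the patch, so existing users would not see a change.
 */
#define SKETCH_ERR_HANDLE_MODE_NONE      0u
#define SKETCH_ERR_HANDLE_MODE_PASSIVE   (1u << 0) /* == 1 */
#define SKETCH_ERR_HANDLE_MODE_PROACTIVE (1u << 1) /* == 2 */

/* Return true if the given mode bit is set in the advertised mask. */
static bool
sketch_mode_supported(uint8_t err_handle_mode, uint8_t mode_bit)
{
	return (err_handle_mode & mode_bit) != 0;
}
```

With such a layout, a PMD that supported both modes could report
PASSIVE | PROACTIVE (3) in the uint8_t field, and a PMD supporting only one
mode would report the same value as it does with the enum today.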


>
> Or is application really needs to know this, what happens if it 
> register all events and implements related actions for it?


To keep things simple, the application could register all events and do what
the framework requires for each of them.
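
As a rough illustration of "register all events and act on each", the sketch
below models the per-port bookkeeping an application might keep. It is a
stand-alone mock-up, not DPDK code: the sketch_* names are hypothetical, and a
real application would use the event values from rte_ethdev.h in its
registered callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the event types discussed in this series. */
enum sketch_event {
	SKETCH_EVENT_INTR_RESET,       /* passive mode: reset requested */
	SKETCH_EVENT_ERR_RECOVERING,   /* proactive mode: recovery started */
	SKETCH_EVENT_RECOVERY_SUCCESS, /* proactive mode: recovery done */
	SKETCH_EVENT_RECOVERY_FAILED,  /* proactive mode: recovery failed */
};

/* Hypothetical per-port state kept by the application. */
struct sketch_port_state {
	bool recovering;  /* true: do not call control path APIs */
	bool needs_reset; /* true: application must reset the port */
	bool needs_close; /* true: application should close the port */
};

/* Update the port state for one event; the main loop acts on the flags. */
static void
sketch_handle_event(struct sketch_port_state *st, enum sketch_event ev)
{
	switch (ev) {
	case SKETCH_EVENT_INTR_RESET:
		st->needs_reset = true;
		break;
	case SKETCH_EVENT_ERR_RECOVERING:
		/* May be reported again before the result arrives. */
		st->recovering = true;
		break;
	case SKETCH_EVENT_RECOVERY_SUCCESS:
		st->recovering = false;
		break;
	case SKETCH_EVENT_RECOVERY_FAILED:
		st->recovering = false;
		st->needs_close = true;
		break;
	}
}
```

The handler only records flags, so the same code works regardless of which
mode the PMD reports; the actual reset or close would be done later from the
application's own context.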


>
>
>> +};
>> +
>>   /**
>>    * A structure used to retrieve the contextual information of
>>    * an Ethernet device, such as the controlling driver of the
>> @@ -1908,8 +1921,12 @@ struct rte_eth_dev_info {
>>        * embedded managed interconnect/switch.
>>        */
>>       struct rte_eth_switch_info switch_info;
>> +    /** Supported error handling mode. @see enum 
>> rte_eth_err_handle_mode */
>> +    uint8_t err_handle_mode;
>
> I guess 'uint8_t' is used to save space, but an 'enum' is usually an integer
> (although, as far as I remember, the compiler can select a smaller type if
> the values fit), so I am concerned whether this causes any warning. If not,
> I agree to use the smaller type, since we know the number of handling modes
> is limited and small.


Yes, uint8_t is used to save space. The size of an enum here would depend on
the compiler, so I think it's OK to use a deterministic type.

As for warnings, I have not seen any such conversion warning yet.
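
For reference, the space argument can be checked with a small stand-alone
sketch. The two structs below are hypothetical and reproduce only the tail of
rte_eth_dev_info before and after the patch (all other fields are omitted), so
this shows the arithmetic of the reserved area rather than the real structure
layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tail of the structure before the patch: 16 reserved bytes + pointers. */
struct sketch_tail_old {
	uint64_t reserved_64s[2];
	void *reserved_ptrs[2];
};

/* Tail after the patch: 1 + 1 + 6 + 8 = 16 bytes + pointers. */
struct sketch_tail_new {
	uint8_t err_handle_mode;
	uint8_t reserved_8;
	uint16_t reserved_16s[3];
	uint64_t reserved_64;
	void *reserved_ptrs[2];
};

/* The new field is carved out of the reserved area, so the size is kept. */
static_assert(sizeof(struct sketch_tail_new) ==
	      sizeof(struct sketch_tail_old),
	      "reserved area must keep its size");
static_assert(offsetof(struct sketch_tail_new, reserved_ptrs) ==
	      offsetof(struct sketch_tail_old, reserved_ptrs),
	      "reserved_ptrs must not move");
```

Using uint8_t keeps this arithmetic independent of how a given compiler sizes
the enum type.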


>
>> -    uint64_t reserved_64s[2]; /**< Reserved for future fields */
>> +    uint8_t reserved_8;       /**< Reserved for future fields  */
>> +    uint16_t reserved_16s[3]; /**< Reserved for future fields  */
>> +    uint64_t reserved_64;     /**< Reserved for future fields */
>>       void *reserved_ptrs[2];   /**< Reserved for future fields */
>>   };
>


* [PATCH v10 0/5] support error handling mode
       [not found] <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
  2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
@ 2022-10-09  7:53 ` Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 1/5] ethdev: support get port " Chengwen Feng
                     ` (4 more replies)
  2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
                   ` (2 subsequent siblings)
  4 siblings, 5 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  7:53 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patchset introduces the concept of an error handling mode. The supported
modes are as follows:
1) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

2) PROACTIVE: proactive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_ERR_RECOVERING event, performs the
recovery internally, and finally reports the recovery result event.

Chengwen Feng (2):
  ethdev: support get port error handling mode
  net/hns3: support proactive error handling mode

Kalesh AP (3):
  ethdev: support proactive error handling mode
  app/testpmd: support error handling mode event
  net/bnxt: support proactive error handling mode

---
v10: Described the recovery success scenario more accurately to
     address comments from Ferruh.
v9: Introduce error handling mode concept.
    Addressed comments from Thomas and Ray.
v8: Addressed comments from Thomas and Ferruh.
    Also introduced RECOVER_FAIL event.
    Add hns3 driver patch.
v7: Addressed comments from Thomas and Andrew.
v6: Addressed comments from Asaf Penso.
    1. Updated 20.11 release notes with the new events added.
    2. updated testpmd parse_event_printing_config function.
v5: Addressed comments from Ophir Munk.
    1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
    2. Fixed testpmd logs.
    3. Documented the new recovery events.
v4: Addressed comments from Thomas Monjalon
    1. Added doxygen comments about new events.
V3: Fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
    RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

 app/test-pmd/config.c                   |  4 ++
 app/test-pmd/parameters.c               | 10 +++-
 app/test-pmd/testpmd.c                  |  8 ++-
 doc/guides/prog_guide/poll_mode_drv.rst | 38 ++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 ++++
 drivers/net/bnxt/bnxt_cpr.c             |  4 ++
 drivers/net/bnxt/bnxt_ethdev.c          | 13 +++-
 drivers/net/e1000/igb_ethdev.c          |  2 +
 drivers/net/ena/ena_ethdev.c            |  2 +
 drivers/net/hns3/hns3_common.c          |  2 +
 drivers/net/hns3/hns3_intr.c            | 24 ++++++++
 drivers/net/iavf/iavf_ethdev.c          |  2 +
 drivers/net/ixgbe/ixgbe_ethdev.c        |  2 +
 drivers/net/txgbe/txgbe_ethdev_vf.c     |  2 +
 lib/ethdev/rte_ethdev.h                 | 79 ++++++++++++++++++++++++-
 15 files changed, 199 insertions(+), 5 deletions(-)

-- 
2.17.1



* [PATCH v10 1/5] ethdev: support get port error handling mode
  2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
@ 2022-10-09  7:53   ` Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 2/5] ethdev: support proactive " Chengwen Feng
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  7:53 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds support for getting a port's error handling mode through
the rte_eth_dev_info_get() API.

Currently, the defined modes are:
1) NONE: no error handling mode is supported by this port.
2) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 app/test-pmd/config.c               |  2 ++
 drivers/net/e1000/igb_ethdev.c      |  2 ++
 drivers/net/ena/ena_ethdev.c        |  2 ++
 drivers/net/iavf/iavf_ethdev.c      |  2 ++
 drivers/net/ixgbe/ixgbe_ethdev.c    |  2 ++
 drivers/net/txgbe/txgbe_ethdev_vf.c |  2 ++
 lib/ethdev/rte_ethdev.h             | 20 +++++++++++++++++++-
 7 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 841e8efe78..bd7f2ba257 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -921,6 +921,8 @@ port_infos_display(portid_t port_id)
 			printf("Switch Rx domain: %u\n",
 			       dev_info.switch_info.rx_domain);
 	}
+	if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE)
+		printf("Device error handling mode: passive\n");
 }
 
 void
diff --git a/drivers/net/e1000/igb_ethdev.c b/drivers/net/e1000/igb_ethdev.c
index d6bcc5bf58..8858f975f8 100644
--- a/drivers/net/e1000/igb_ethdev.c
+++ b/drivers/net/e1000/igb_ethdev.c
@@ -2341,6 +2341,8 @@ eth_igbvf_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c
index 3e88bcda6c..efcb163027 100644
--- a/drivers/net/ena/ena_ethdev.c
+++ b/drivers/net/ena/ena_ethdev.c
@@ -2482,6 +2482,8 @@ static int ena_infos_get(struct rte_eth_dev *dev,
 	dev_info->default_rxportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 	dev_info->default_txportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 782be82c7f..b1958e0474 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1179,6 +1179,8 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 		.nb_align = IAVF_ALIGN_RING_DESC,
 	};
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index bf70ee041d..fd06ddbe35 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -4056,6 +4056,8 @@ ixgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index f52cd8bc19..3b1f7c913b 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -521,6 +521,8 @@ txgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index e8d1e1c658..bf34d68d48 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1629,6 +1629,20 @@ enum rte_eth_representor_type {
 	RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical Function. */
 };
 
+/**
+ * Ethernet device error handling mode.
+ */
+__rte_experimental
+enum rte_eth_err_handle_mode {
+	/** No error handling modes are supported. */
+	RTE_ETH_ERROR_HANDLE_MODE_NONE,
+	/** Passive error handling, after the PMD detect that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET event, and
+	 * application invoke @see rte_eth_dev_reset to recover the port.
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+};
+
 /**
  * A structure used to retrieve the contextual information of
  * an Ethernet device, such as the controlling driver of the
@@ -1689,8 +1703,12 @@ struct rte_eth_dev_info {
 	 * embedded managed interconnect/switch.
 	 */
 	struct rte_eth_switch_info switch_info;
+	/** Supported error handling mode. @see enum rte_eth_err_handle_mode */
+	uint8_t err_handle_mode;
 
-	uint64_t reserved_64s[2]; /**< Reserved for future fields */
+	uint8_t reserved_8;       /**< Reserved for future fields  */
+	uint16_t reserved_16s[3]; /**< Reserved for future fields  */
+	uint64_t reserved_64;     /**< Reserved for future fields */
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
 };
 
-- 
2.17.1



* [PATCH v10 2/5] ethdev: support proactive error handling mode
  2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 1/5] ethdev: support get port " Chengwen Feng
@ 2022-10-09  7:53   ` Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 3/5] app/testpmd: support error handling mode event Chengwen Feng
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  7:53 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Some PMDs (e.g. hns3) can detect hardware or firmware errors and try
to recover from them. During recovery, the PMD sets the data path
pointers to dummy functions (which prevents a crash), and also
makes sure the control path operations fail with return code -EBUSY.

The above error handling mode is known as
RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).

In some service scenarios, the application needs to be aware of these
events to determine whether to migrate services. So three events are
introduced:

1) RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD
detected an error and recovery is starting. Upon receiving the
event, the application should not invoke any control path APIs until it
receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
RTE_ETH_EVENT_RECOVERY_FAILED event.

2) RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the PMD
recovered successfully from the error. The PMD has already re-configured
the port, and the effect is the same as that of a restart operation.

3) RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the PMD
failed to recover from the error. The port should not be used anymore, and
the application should close it.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/config.c                   |  2 +
 doc/guides/prog_guide/poll_mode_drv.rst | 38 ++++++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 +++++
 lib/ethdev/rte_ethdev.h                 | 59 +++++++++++++++++++++++++
 4 files changed, 111 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index bd7f2ba257..e53f2177b6 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -923,6 +923,8 @@ port_infos_display(portid_t port_id)
 	}
 	if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE)
 		printf("Device error handling mode: passive\n");
+	else if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE)
+		printf("Device error handling mode: proactive\n");
 }
 
 void
diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..73941a74bd 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,41 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
+
+Proactive Error Handling Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, then once it detects
+hardware or firmware errors, the PMD will try to recover from them. During
+recovery, the PMD sets the data path pointers to dummy functions (which
+prevents a crash), and also makes sure the control path operations fail
+with return code -EBUSY.
+
+During recovery, services are affected from the application's point of view.
+For example, the Rx/Tx burst APIs cannot receive or send packets,
+and the control plane APIs return failure.
+
+In some service scenarios, the application needs to be aware of these events
+to determine whether to migrate services. So three events are introduced:
+
+* RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD detected
+  an error and recovery is starting. Upon receiving the event, the
+  application should not invoke any control path APIs until it receives the
+  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
+
+* RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the PMD
+  recovered successfully from the error. The PMD has already re-configured
+  the port, and the effect is the same as that of a restart operation.
+
+* RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the PMD
+  failed to recover from the error. The port should not be used anymore, and
+  the application should close it.
+
+.. note::
+        * Before the PMD reports the recovery result, the PMD may report the
+          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
+          may occur during the recovery.
+        * The error handling mode supported by the PMD can be reported through
+          the ``rte_eth_dev_info_get`` API.
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 10168ffece..b749f51405 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -111,6 +111,18 @@ New Features
   Added new flow action which allows application to re-route packets
   directly to the kernel without software involvement.
 
+* **Added proactive error handling mode for ethdev.**
+
+  Added proactive error handling mode for ethdev, and three events were
+  introduced:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` for the PMD to report
+    that the port recovered successfully from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_FAILED`` for the PMD to report
+    that the port failed to recover from an error.
+
 * **Updated AF_XDP driver.**
 
   * Made compatible with libbpf v0.8.0 (when used with libxdp).
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index bf34d68d48..6a62e594ce 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1641,6 +1641,12 @@ enum rte_eth_err_handle_mode {
 	 * application invoke @see rte_eth_dev_reset to recover the port.
 	 */
 	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+	/** Proactive error handling: after the PMD detects that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_ERR_RECOVERING event,
+	 * does the recovery internally, and finally reports the recovery
+	 * result event (@see RTE_ETH_EVENT_RECOVERY_*).
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE,
 };
 
 /**
@@ -3822,6 +3828,59 @@ enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error.
+	 * If a PMD supports proactive error recovery, it should trigger this
+	 * event to notify the application that it detected an error and that
+	 * recovery is starting. Upon receiving the event, the application
+	 * should not invoke any control path APIs (such as
+	 * rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
+	 * event.
+	 * The PMD will set the data path pointers to dummy functions, and
+	 * re-set the data path pointers to non-dummy functions before it
+	 * reports RTE_ETH_EVENT_RECOVERY_SUCCESS. This means the application
+	 * cannot send or receive any packets during this period.
+	 * @note Before the PMD reports the recovery result, the PMD may report
+	 * the RTE_ETH_EVENT_ERR_RECOVERING event again, because a larger error
+	 * may occur during the recovery.
+	 */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovered successfully from the error.
+	 * The PMD has already re-configured the port, and the effect is the
+	 * same as that of a restart operation.
+	 * a) the following configuration will be retained: (alphabetically)
+	 *    - DCB configuration
+	 *    - FEC configuration
+	 *    - Flow control configuration
+	 *    - LRO configuration
+	 *    - LSC configuration
+	 *    - MTU
+	 *    - Mac address (default and those supplied by MAC address array)
+	 *    - Promiscuous and allmulticast mode
+	 *    - PTP configuration
+	 *    - Queue (Rx/Tx) settings
+	 *    - Queue statistics mappings
+	 *    - RSS configuration by rte_eth_dev_rss_xxx() family
+	 *    - Rx checksum configuration
+	 *    - Rx interrupt settings
+	 *    - Traffic management configuration
+	 *    - VLAN configuration (including filtering, tpid, strip, pvid)
+	 *    - VMDq configuration
+	 * b) the following configuration may or may not be retained, depending
+	 *    on the device capabilities:
+	 *    - flow rules
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
+	 *    - shared flow objects
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
+	 * c) the other configuration will not be stored and will need to be
+	 *    re-configured.
+	 */
+	RTE_ETH_EVENT_RECOVERY_SUCCESS,
+	/** Port failed to recover from the error.
+	 * It means that the port should not be used anymore. The application
+	 * should close the port.
+	 */
+	RTE_ETH_EVENT_RECOVERY_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.17.1



* [PATCH v10 3/5] app/testpmd: support error handling mode event
  2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 1/5] ethdev: support get port " Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 2/5] ethdev: support proactive " Chengwen Feng
@ 2022-10-09  7:53   ` Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 4/5] net/hns3: support proactive error handling mode Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  7:53 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for processing the error handling mode events.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/parameters.c | 10 ++++++++--
 app/test-pmd/testpmd.c    |  8 +++++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 14752f9571..c49681708e 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -166,9 +166,9 @@ usage(char* progname)
 	printf("  --no-rmv-interrupt: disable device removal interrupt.\n");
 	printf("  --bitrate-stats=N: set the logical core N to perform "
 		"bit-rate calculation.\n");
-	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "enable print of designated event or all of them.\n");
-	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "disable print of designated event or all of them.\n");
 	printf("  --flow-isolate-all: "
 	       "requests flow API isolated mode on all ports at initialization time.\n");
@@ -452,6 +452,12 @@ parse_event_printing_config(const char *optarg, int enable)
 		mask = UINT32_C(1) << RTE_ETH_EVENT_DESTROY;
 	else if (!strcmp(optarg, "flow_aged"))
 		mask = UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED;
+	else if (!strcmp(optarg, "err_recovering"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING;
+	else if (!strcmp(optarg, "recovery_success"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS;
+	else if (!strcmp(optarg, "recovery_failed"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED;
 	else if (!strcmp(optarg, "all"))
 		mask = ~UINT32_C(0);
 	else {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index bb1c901742..7104e61a2c 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -425,6 +425,9 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
 	[RTE_ETH_EVENT_RX_AVAIL_THRESH] = "RxQ available descriptors threshold reached",
+	[RTE_ETH_EVENT_ERR_RECOVERING] = "error recovering",
+	[RTE_ETH_EVENT_RECOVERY_SUCCESS] = "error recovery successful",
+	[RTE_ETH_EVENT_RECOVERY_FAILED] = "error recovery failed",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -439,7 +442,10 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.17.1



* [PATCH v10 4/5] net/hns3: support proactive error handling mode
  2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
                     ` (2 preceding siblings ...)
  2022-10-09  7:53   ` [PATCH v10 3/5] app/testpmd: support error handling mode event Chengwen Feng
@ 2022-10-09  7:53   ` Chengwen Feng
  2022-10-09  7:53   ` [PATCH v10 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  7:53 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch supports proactive error handling mode.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/hns3/hns3_common.c |  2 ++
 drivers/net/hns3/hns3_intr.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/drivers/net/hns3/hns3_common.c b/drivers/net/hns3/hns3_common.c
index 14291193cb..7adc6a4972 100644
--- a/drivers/net/hns3/hns3_common.c
+++ b/drivers/net/hns3/hns3_common.c
@@ -149,6 +149,8 @@ hns3_dev_infos_get(struct rte_eth_dev *eth_dev, struct rte_eth_dev_info *info)
 		info->max_mac_addrs = HNS3_VF_UC_MACADDR_NUM;
 	}
 
+	info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/hns3/hns3_intr.c b/drivers/net/hns3/hns3_intr.c
index 57679254ee..44a1119415 100644
--- a/drivers/net/hns3/hns3_intr.c
+++ b/drivers/net/hns3/hns3_intr.c
@@ -1480,6 +1480,27 @@ static const struct hns3_hw_err_type hns3_hw_error_type[] = {
 	}
 };
 
+static void
+hns3_report_reset_begin(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_ERR_RECOVERING, NULL);
+}
+
+static void
+hns3_report_reset_success(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_SUCCESS, NULL);
+}
+
+static void
+hns3_report_reset_failed(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
+}
+
 static int
 hns3_config_ncsi_hw_err_int(struct hns3_adapter *hns, bool en)
 {
@@ -2642,6 +2663,7 @@ hns3_reset_pre(struct hns3_adapter *hns)
 	if (hw->reset.stage == RESET_STAGE_NONE) {
 		__atomic_store_n(&hns->hw.reset.resetting, 1, __ATOMIC_RELAXED);
 		hw->reset.stage = RESET_STAGE_DOWN;
+		hns3_report_reset_begin(hw);
 		ret = hw->reset.ops->stop_service(hns);
 		hns3_clock_gettime(&tv);
 		if (ret) {
@@ -2751,6 +2773,7 @@ hns3_reset_post(struct hns3_adapter *hns)
 			  hns3_clock_calctime_ms(&tv_delta),
 			  tv.tv_sec, tv.tv_usec);
 		hw->reset.level = HNS3_NONE_RESET;
+		hns3_report_reset_success(hw);
 	}
 	return 0;
 
@@ -2796,6 +2819,7 @@ hns3_reset_fail_handle(struct hns3_adapter *hns)
 		  hns3_clock_calctime_ms(&tv_delta),
 		  tv.tv_sec, tv.tv_usec);
 	hw->reset.level = HNS3_NONE_RESET;
+	hns3_report_reset_failed(hw);
 }
 
 /*
-- 
2.17.1



* [PATCH v10 5/5] net/bnxt: support proactive error handling mode
  2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
                     ` (3 preceding siblings ...)
  2022-10-09  7:53   ` [PATCH v10 4/5] net/hns3: support proactive error handling mode Chengwen Feng
@ 2022-10-09  7:53   ` Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  7:53 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch supports proactive error handling mode.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/bnxt/bnxt_cpr.c    |  4 ++++
 drivers/net/bnxt/bnxt_ethdev.c | 13 ++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 99af0f9e87..5bb376d4d5 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -180,6 +180,10 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			return;
 		}
 
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_ERR_RECOVERING,
+					     NULL);
+
 		pthread_mutex_lock(&bp->err_recovery_lock);
 		event_data = data1;
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index 3dfe9efc09..b3de490d36 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -1063,6 +1063,8 @@ static int bnxt_dev_info_get_op(struct rte_eth_dev *eth_dev,
 	dev_info->vmdq_pool_base = 0;
 	dev_info->vmdq_queue_base = 0;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
@@ -4382,13 +4384,18 @@ static void bnxt_dev_recover(void *arg)
 	PMD_DRV_LOG(INFO, "Port: %u Recovered from FW reset\n",
 		    bp->eth_dev->data->port_id);
 	pthread_mutex_unlock(&bp->err_recovery_lock);
-
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_SUCCESS,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop(bp->eth_dev);
 err:
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_FAILED,
+				     NULL);
 	if (bp->eth_dev->data->dev_conf.intr_conf.rmv)
 		rte_eth_dev_callback_process(bp->eth_dev,
 					     RTE_ETH_EVENT_INTR_RMV,
@@ -4560,6 +4567,10 @@ static void bnxt_check_fw_health(void *arg)
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
 
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_ERR_RECOVERING,
+				     NULL);
+
 	if (bnxt_is_primary_func(bp))
 		wait_msec = info->primary_func_wait_period;
 	else
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v11 0/5] support error handling mode
       [not found] <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
  2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
  2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
@ 2022-10-09  9:10 ` Chengwen Feng
  2022-10-09  9:10   ` [PATCH v11 1/5] ethdev: support get port " Chengwen Feng
                     ` (4 more replies)
  2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
  4 siblings, 5 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  9:10 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patchset introduces the concept of an error handling mode. The
supported modes are as follows:
1) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

2) PROACTIVE: proactive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_ERR_RECOVERING event, performs
the recovery internally, and finally reports the recovery result event.
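
As a rough illustration of how the two modes differ for an application,
the sketch below maps a mode to the party expected to drive the recovery.
It is self-contained and does not include any DPDK header: every sketch_*
and SKETCH_* name is a hypothetical stand-in, and a real application would
read err_handle_mode from rte_eth_dev_info_get() and compare it against
the RTE_ETH_ERROR_HANDLE_MODE_* values instead.

```c
/*
 * Local stand-ins for the modes named in this series (assumption: a real
 * application uses enum rte_eth_err_handle_mode from rte_ethdev.h).
 */
enum sketch_err_handle_mode {
	SKETCH_MODE_NONE,
	SKETCH_MODE_PASSIVE,
	SKETCH_MODE_PROACTIVE,
};

/* Hypothetical classification of who performs the recovery. */
enum sketch_recovery_owner {
	SKETCH_OWNER_NOBODY,      /* no error handling mode is reported */
	SKETCH_OWNER_APPLICATION, /* app reacts to INTR_RESET and resets the port */
	SKETCH_OWNER_PMD,         /* PMD recovers internally and reports the result */
};

/* Map the reported mode to the party expected to drive the recovery. */
static enum sketch_recovery_owner
sketch_who_recovers(enum sketch_err_handle_mode mode)
{
	switch (mode) {
	case SKETCH_MODE_PASSIVE:
		return SKETCH_OWNER_APPLICATION;
	case SKETCH_MODE_PROACTIVE:
		return SKETCH_OWNER_PMD;
	default:
		return SKETCH_OWNER_NOBODY;
	}
}
```

In passive mode the application keeps its existing reset handler; in
proactive mode it would only need to listen for the recovery events.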

Chengwen Feng (2):
  ethdev: support get port error handling mode
  net/hns3: support proactive error handling mode

Kalesh AP (3):
  ethdev: support proactive error handling mode
  app/testpmd: support error handling mode event
  net/bnxt: support proactive error handling mode

---
v11: Fix clang-static failure due to wrong experimental placement.
v10: Accurately describe the recovery success scenario to address
     comments from Ferruh.
v9: Introduce error handling mode concept.
    Addressed comments from Thomas and Ray.
v8: Addressed comments from Thomas and Ferruh.
    Also introduced RECOVER_FAIL event.
    Add hns3 driver patch.
v7: Addressed comments from Thomas and Andrew.
v6: Addressed comments from Asaf Penso.
    1. Updated 20.11 release notes with the new events added.
    2. updated testpmd parse_event_printing_config function.
v5: Addressed comments from Ophir Munk.
    1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
    2. Fixed testpmd logs.
    3. Documented the new recovery events.
v4: Addressed comments from Thomas Monjalon
    1. Added doxygen comments about new events.
V3: Fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
    RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

 app/test-pmd/config.c                   |  4 ++
 app/test-pmd/parameters.c               | 10 ++-
 app/test-pmd/testpmd.c                  |  8 ++-
 doc/guides/prog_guide/poll_mode_drv.rst | 38 ++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 ++++
 drivers/net/bnxt/bnxt_cpr.c             |  4 ++
 drivers/net/bnxt/bnxt_ethdev.c          | 13 +++-
 drivers/net/e1000/igb_ethdev.c          |  2 +
 drivers/net/ena/ena_ethdev.c            |  2 +
 drivers/net/hns3/hns3_common.c          |  2 +
 drivers/net/hns3/hns3_intr.c            | 24 ++++++++
 drivers/net/iavf/iavf_ethdev.c          |  2 +
 drivers/net/ixgbe/ixgbe_ethdev.c        |  2 +
 drivers/net/txgbe/txgbe_ethdev_vf.c     |  2 +
 lib/ethdev/rte_ethdev.h                 | 81 ++++++++++++++++++++++++-
 15 files changed, 201 insertions(+), 5 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v11 1/5] ethdev: support get port error handling mode
  2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
@ 2022-10-09  9:10   ` Chengwen Feng
  2022-10-10  8:38     ` Andrew Rybchenko
  2022-10-10  8:44     ` Andrew Rybchenko
  2022-10-09  9:10   ` [PATCH v11 2/5] ethdev: support proactive " Chengwen Feng
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  9:10 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds support for getting a port's error handling mode through
the rte_eth_dev_info_get() API.

Currently, the defined modes include:
1) NONE: no error handling mode is supported by this port.
2) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 app/test-pmd/config.c               |  2 ++
 drivers/net/e1000/igb_ethdev.c      |  2 ++
 drivers/net/ena/ena_ethdev.c        |  2 ++
 drivers/net/iavf/iavf_ethdev.c      |  2 ++
 drivers/net/ixgbe/ixgbe_ethdev.c    |  2 ++
 drivers/net/txgbe/txgbe_ethdev_vf.c |  2 ++
 lib/ethdev/rte_ethdev.h             | 22 +++++++++++++++++++++-
 7 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 841e8efe78..bd7f2ba257 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -921,6 +921,8 @@ port_infos_display(portid_t port_id)
 			printf("Switch Rx domain: %u\n",
 			       dev_info.switch_info.rx_domain);
 	}
+	if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE)
+		printf("Device error handling mode: passive\n");
 }
 
 void
diff --git a/drivers/net/e1000/igb_ethdev.c b/drivers/net/e1000/igb_ethdev.c
index d6bcc5bf58..8858f975f8 100644
--- a/drivers/net/e1000/igb_ethdev.c
+++ b/drivers/net/e1000/igb_ethdev.c
@@ -2341,6 +2341,8 @@ eth_igbvf_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c
index 3e88bcda6c..efcb163027 100644
--- a/drivers/net/ena/ena_ethdev.c
+++ b/drivers/net/ena/ena_ethdev.c
@@ -2482,6 +2482,8 @@ static int ena_infos_get(struct rte_eth_dev *dev,
 	dev_info->default_rxportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 	dev_info->default_txportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 782be82c7f..b1958e0474 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1179,6 +1179,8 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 		.nb_align = IAVF_ALIGN_RING_DESC,
 	};
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index bf70ee041d..fd06ddbe35 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -4056,6 +4056,8 @@ ixgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index f52cd8bc19..3b1f7c913b 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -521,6 +521,8 @@ txgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index e8d1e1c658..3443bf20e1 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1629,6 +1629,22 @@ enum rte_eth_representor_type {
 	RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical Function. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this enumeration may change without prior notice.
+ *
+ * Ethernet device error handling mode.
+ */
+enum rte_eth_err_handle_mode {
+	/** No error handling modes are supported. */
+	RTE_ETH_ERROR_HANDLE_MODE_NONE,
+	/** Passive error handling, after the PMD detects that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET event, and
+	 * application invoke @see rte_eth_dev_reset to recover the port.
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+};
+
 /**
  * A structure used to retrieve the contextual information of
  * an Ethernet device, such as the controlling driver of the
@@ -1689,8 +1705,12 @@ struct rte_eth_dev_info {
 	 * embedded managed interconnect/switch.
 	 */
 	struct rte_eth_switch_info switch_info;
+	/** Supported error handling mode. @see enum rte_eth_err_handle_mode */
+	uint8_t err_handle_mode;
 
-	uint64_t reserved_64s[2]; /**< Reserved for future fields */
+	uint8_t reserved_8;       /**< Reserved for future fields  */
+	uint16_t reserved_16s[3]; /**< Reserved for future fields  */
+	uint64_t reserved_64;     /**< Reserved for future fields */
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v11 2/5] ethdev: support proactive error handling mode
  2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
  2022-10-09  9:10   ` [PATCH v11 1/5] ethdev: support get port " Chengwen Feng
@ 2022-10-09  9:10   ` Chengwen Feng
  2022-10-10  8:47     ` Andrew Rybchenko
  2022-10-09  9:10   ` [PATCH v11 3/5] app/testpmd: support error handling mode event Chengwen Feng
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  9:10 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Some PMDs (e.g. hns3) can detect hardware or firmware errors and try
to recover from them. During this process, the PMD sets the data path
pointers to dummy functions (which prevents a crash) and makes sure
that control path operations fail with return code -EBUSY.

The above error handling mode is known as
RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).

In some service scenarios, the application needs to be aware of the
event to determine whether to migrate services. So three events are
introduced:

1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that the
PMD detected an error and recovery is being started. Upon receiving the
event, the application should not invoke any control path APIs until it
receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
RTE_ETH_EVENT_RECOVERY_FAILED event.

2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
the port recovered successfully from the error. The PMD has already
re-configured the port, and the effect is the same as that of a restart
operation.

3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that
the port failed to recover from the error. The port should not be used
anymore, and the application should close it.
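
The event sequence described above can be sketched as a small state
machine on the application side. This is only an illustration under
stated assumptions: it does not use any DPDK header, all sketch_* and
SKETCH_* names are hypothetical, and treating a failed port as a
terminal state is a choice made for this sketch. A real application
would update such a state from a callback registered with
rte_eth_dev_callback_register().

```c
#include <stdbool.h>

/* Stand-ins for the three recovery events introduced by this patch. */
enum sketch_event {
	SKETCH_EVENT_ERR_RECOVERING,
	SKETCH_EVENT_RECOVERY_SUCCESS,
	SKETCH_EVENT_RECOVERY_FAILED,
};

/* Hypothetical per-port state tracked by the application. */
enum sketch_port_state {
	SKETCH_PORT_RUNNING,
	SKETCH_PORT_RECOVERING,
	SKETCH_PORT_DEAD,
};

/* Advance the port state when one of the recovery events arrives. */
static enum sketch_port_state
sketch_on_event(enum sketch_port_state cur, enum sketch_event ev)
{
	/* Assumption of this sketch: a failed port never comes back. */
	if (cur == SKETCH_PORT_DEAD)
		return cur;

	switch (ev) {
	case SKETCH_EVENT_ERR_RECOVERING:
		/* May be reported more than once before the result event. */
		return SKETCH_PORT_RECOVERING;
	case SKETCH_EVENT_RECOVERY_SUCCESS:
		return SKETCH_PORT_RUNNING;
	case SKETCH_EVENT_RECOVERY_FAILED:
		return SKETCH_PORT_DEAD;
	}
	return cur;
}

/*
 * Configuration-type control path calls are only expected to work while
 * the port is running; a dead port is only expected to be closed.
 */
static bool
sketch_may_configure(enum sketch_port_state state)
{
	return state == SKETCH_PORT_RUNNING;
}
```

Gating control path calls on such a state keeps the application from
issuing calls that would otherwise fail with -EBUSY during the recovery.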

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/config.c                   |  2 +
 doc/guides/prog_guide/poll_mode_drv.rst | 38 ++++++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 +++++
 lib/ethdev/rte_ethdev.h                 | 59 +++++++++++++++++++++++++
 4 files changed, 111 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index bd7f2ba257..e53f2177b6 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -923,6 +923,8 @@ port_infos_display(portid_t port_id)
 	}
 	if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE)
 		printf("Device error handling mode: passive\n");
+	else if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE)
+		printf("Device error handling mode: proactive\n");
 }
 
 void
diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..73941a74bd 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,41 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
+
+Proactive Error Handling Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, then once it detects
+hardware or firmware errors, the PMD will try to recover from them. During
+this process, the PMD sets the data path pointers to dummy functions (which
+prevents a crash) and makes sure that control path operations fail with
+return code ``-EBUSY``.
+
+During this process, services are affected from the application's point of
+view. For example, the Rx/Tx burst APIs cannot receive or send packets,
+and the control path APIs return failure.
+
+In some service scenarios, the application needs to be aware of the event to
+determine whether to migrate services. So three events are introduced:
+
+* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that the PMD
+  detected an error and recovery is being started. Upon receiving the event,
+  the application should not invoke any control path APIs until it receives
+  the RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
+
+* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that the port
+  recovered successfully from the error. The PMD has already re-configured the
+  port, and the effect is the same as that of a restart operation.
+
+* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that the port
+  failed to recover from the error. The port should not be used anymore, and
+  the application should close it.
+
+.. note::
+        * Before the PMD reports the recovery result, the PMD may report the
+          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
+          may occur during the recovery.
+        * The error handling mode supported by the PMD can be reported through
+          the ``rte_eth_dev_info_get`` API.
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 10168ffece..b749f51405 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -111,6 +111,18 @@ New Features
   Added new flow action which allows application to re-route packets
   directly to the kernel without software involvement.
 
+* **Added proactive error handling mode for ethdev.**
+
+  Added proactive error handling mode for ethdev, and three events were
+  introduced:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` for the PMD to report
+    that the port recovered successfully from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_FAILED`` for the PMD to report
+    that the port failed to recover from an error.
+
 * **Updated AF_XDP driver.**
 
   * Made compatible with libbpf v0.8.0 (when used with libxdp).
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 3443bf20e1..ccdd6c8655 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1643,6 +1643,12 @@ enum rte_eth_err_handle_mode {
 	 * application invoke @see rte_eth_dev_reset to recover the port.
 	 */
 	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+	/** Proactive error handling: after the PMD detects that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_ERR_RECOVERING event,
+	 * performs recovery internally, and finally reports the recovery
+	 * result event (@see RTE_ETH_EVENT_RECOVERY_*).
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE,
 };
 
 /**
@@ -3824,6 +3830,59 @@ enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error.
+	 * If PMD supports proactive error recovery, it should trigger this
+	 * event to notify application that it detected an error and the
+	 * recovery is being started. Upon receiving the event, the application
+	 * should not invoke any control path APIs (such as
+	 * rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
+	 * event.
+	 * The PMD will set the data path pointers to dummy functions, and
+	 * re-set the data path pointers to non-dummy functions before it
+	 * reports the RTE_ETH_EVENT_RECOVERY_SUCCESS event. It means that the
+	 * application cannot send or receive any packets during this period.
+	 * @note Before the PMD reports the recovery result, the PMD may report
+	 * the RTE_ETH_EVENT_ERR_RECOVERING event again, because a larger error
+	 * may occur during the recovery.
+	 */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovered successfully from the error.
+	 * The PMD has already re-configured the port, and the effect is the
+	 * same as that of a restart operation.
+	 * a) the following configuration will be retained: (alphabetically)
+	 *    - DCB configuration
+	 *    - FEC configuration
+	 *    - Flow control configuration
+	 *    - LRO configuration
+	 *    - LSC configuration
+	 *    - MTU
+	 *    - MAC address (default and those supplied by MAC address array)
+	 *    - Promiscuous and allmulticast mode
+	 *    - PTP configuration
+	 *    - Queue (Rx/Tx) settings
+	 *    - Queue statistics mappings
+	 *    - RSS configuration by rte_eth_dev_rss_xxx() family
+	 *    - Rx checksum configuration
+	 *    - Rx interrupt settings
+	 *    - Traffic management configuration
+	 *    - VLAN configuration (including filtering, tpid, strip, pvid)
+	 *    - VMDq configuration
+	 * b) the following configuration may be retained or not depending on
+	 *    the device capabilities:
+	 *    - flow rules
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
+	 *    - shared flow objects
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
+	 * c) the other configuration will not be stored and will need to be
+	 *    re-configured.
+	 */
+	RTE_ETH_EVENT_RECOVERY_SUCCESS,
+	/** Port failed to recover from the error.
+	 * It means that the port should not be used anymore. The application
+	 * should close the port.
+	 */
+	RTE_ETH_EVENT_RECOVERY_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v11 3/5] app/testpmd: support error handling mode event
  2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
  2022-10-09  9:10   ` [PATCH v11 1/5] ethdev: support get port " Chengwen Feng
  2022-10-09  9:10   ` [PATCH v11 2/5] ethdev: support proactive " Chengwen Feng
@ 2022-10-09  9:10   ` Chengwen Feng
  2022-10-09  9:10   ` [PATCH v11 4/5] net/hns3: support proactive error handling mode Chengwen Feng
  2022-10-09  9:10   ` [PATCH v11 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  9:10 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for processing error handling mode events.
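
For readers unfamiliar with the --print-event/--mask-event options, the
shape of the parsing extended by this patch can be sketched as below.
This is a simplified, self-contained sketch and not the testpmd code:
the sketch_* names are hypothetical, only the three new event names plus
"all" are handled, and the bit positions are arbitrary, whereas testpmd
shifts by the enum rte_eth_event_type values.

```c
#include <stdint.h>
#include <string.h>

/* Arbitrary bit positions used only by this sketch. */
enum sketch_event_bit {
	SKETCH_EVENT_ERR_RECOVERING,
	SKETCH_EVENT_RECOVERY_SUCCESS,
	SKETCH_EVENT_RECOVERY_FAILED,
};

/*
 * Translate an event name into a bit and either set it (print the event)
 * or clear it (mask the event) in *print_mask.
 * Returns 0 on success and -1 for an unknown event name.
 */
static int
sketch_update_print_mask(uint32_t *print_mask, const char *name, int enable)
{
	uint32_t mask;

	if (!strcmp(name, "err_recovering"))
		mask = UINT32_C(1) << SKETCH_EVENT_ERR_RECOVERING;
	else if (!strcmp(name, "recovery_success"))
		mask = UINT32_C(1) << SKETCH_EVENT_RECOVERY_SUCCESS;
	else if (!strcmp(name, "recovery_failed"))
		mask = UINT32_C(1) << SKETCH_EVENT_RECOVERY_FAILED;
	else if (!strcmp(name, "all"))
		mask = ~UINT32_C(0);
	else
		return -1;

	if (enable)
		*print_mask |= mask;
	else
		*print_mask &= ~mask;
	return 0;
}
```

With this shape, masking "recovery_failed" clears a single bit while
leaving the other recovery events printable.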

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/parameters.c | 10 ++++++++--
 app/test-pmd/testpmd.c    |  8 +++++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 14752f9571..c49681708e 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -166,9 +166,9 @@ usage(char* progname)
 	printf("  --no-rmv-interrupt: disable device removal interrupt.\n");
 	printf("  --bitrate-stats=N: set the logical core N to perform "
 		"bit-rate calculation.\n");
-	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "enable print of designated event or all of them.\n");
-	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "disable print of designated event or all of them.\n");
 	printf("  --flow-isolate-all: "
 	       "requests flow API isolated mode on all ports at initialization time.\n");
@@ -452,6 +452,12 @@ parse_event_printing_config(const char *optarg, int enable)
 		mask = UINT32_C(1) << RTE_ETH_EVENT_DESTROY;
 	else if (!strcmp(optarg, "flow_aged"))
 		mask = UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED;
+	else if (!strcmp(optarg, "err_recovering"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING;
+	else if (!strcmp(optarg, "recovery_success"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS;
+	else if (!strcmp(optarg, "recovery_failed"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED;
 	else if (!strcmp(optarg, "all"))
 		mask = ~UINT32_C(0);
 	else {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index bb1c901742..7104e61a2c 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -425,6 +425,9 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
 	[RTE_ETH_EVENT_RX_AVAIL_THRESH] = "RxQ available descriptors threshold reached",
+	[RTE_ETH_EVENT_ERR_RECOVERING] = "error recovering",
+	[RTE_ETH_EVENT_RECOVERY_SUCCESS] = "error recovery successful",
+	[RTE_ETH_EVENT_RECOVERY_FAILED] = "error recovery failed",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -439,7 +442,10 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v11 4/5] net/hns3: support proactive error handling mode
  2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
                     ` (2 preceding siblings ...)
  2022-10-09  9:10   ` [PATCH v11 3/5] app/testpmd: support error handling mode event Chengwen Feng
@ 2022-10-09  9:10   ` Chengwen Feng
  2022-10-09 11:05     ` Dongdong Liu
  2022-10-09  9:10   ` [PATCH v11 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 1 reply; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  9:10 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch supports proactive error handling mode.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/hns3/hns3_common.c |  2 ++
 drivers/net/hns3/hns3_intr.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/drivers/net/hns3/hns3_common.c b/drivers/net/hns3/hns3_common.c
index 14291193cb..7adc6a4972 100644
--- a/drivers/net/hns3/hns3_common.c
+++ b/drivers/net/hns3/hns3_common.c
@@ -149,6 +149,8 @@ hns3_dev_infos_get(struct rte_eth_dev *eth_dev, struct rte_eth_dev_info *info)
 		info->max_mac_addrs = HNS3_VF_UC_MACADDR_NUM;
 	}
 
+	info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/hns3/hns3_intr.c b/drivers/net/hns3/hns3_intr.c
index 57679254ee..44a1119415 100644
--- a/drivers/net/hns3/hns3_intr.c
+++ b/drivers/net/hns3/hns3_intr.c
@@ -1480,6 +1480,27 @@ static const struct hns3_hw_err_type hns3_hw_error_type[] = {
 	}
 };
 
+static void
+hns3_report_reset_begin(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_ERR_RECOVERING, NULL);
+}
+
+static void
+hns3_report_reset_success(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_SUCCESS, NULL);
+}
+
+static void
+hns3_report_reset_failed(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
+}
+
 static int
 hns3_config_ncsi_hw_err_int(struct hns3_adapter *hns, bool en)
 {
@@ -2642,6 +2663,7 @@ hns3_reset_pre(struct hns3_adapter *hns)
 	if (hw->reset.stage == RESET_STAGE_NONE) {
 		__atomic_store_n(&hns->hw.reset.resetting, 1, __ATOMIC_RELAXED);
 		hw->reset.stage = RESET_STAGE_DOWN;
+		hns3_report_reset_begin(hw);
 		ret = hw->reset.ops->stop_service(hns);
 		hns3_clock_gettime(&tv);
 		if (ret) {
@@ -2751,6 +2773,7 @@ hns3_reset_post(struct hns3_adapter *hns)
 			  hns3_clock_calctime_ms(&tv_delta),
 			  tv.tv_sec, tv.tv_usec);
 		hw->reset.level = HNS3_NONE_RESET;
+		hns3_report_reset_success(hw);
 	}
 	return 0;
 
@@ -2796,6 +2819,7 @@ hns3_reset_fail_handle(struct hns3_adapter *hns)
 		  hns3_clock_calctime_ms(&tv_delta),
 		  tv.tv_sec, tv.tv_usec);
 	hw->reset.level = HNS3_NONE_RESET;
+	hns3_report_reset_failed(hw);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v11 5/5] net/bnxt: support proactive error handling mode
  2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
                     ` (3 preceding siblings ...)
  2022-10-09  9:10   ` [PATCH v11 4/5] net/hns3: support proactive error handling mode Chengwen Feng
@ 2022-10-09  9:10   ` Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-09  9:10 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch supports proactive error handling mode.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/bnxt/bnxt_cpr.c    |  4 ++++
 drivers/net/bnxt/bnxt_ethdev.c | 13 ++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 99af0f9e87..5bb376d4d5 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -180,6 +180,10 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			return;
 		}
 
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_ERR_RECOVERING,
+					     NULL);
+
 		pthread_mutex_lock(&bp->err_recovery_lock);
 		event_data = data1;
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index 3dfe9efc09..b3de490d36 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -1063,6 +1063,8 @@ static int bnxt_dev_info_get_op(struct rte_eth_dev *eth_dev,
 	dev_info->vmdq_pool_base = 0;
 	dev_info->vmdq_queue_base = 0;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
@@ -4382,13 +4384,18 @@ static void bnxt_dev_recover(void *arg)
 	PMD_DRV_LOG(INFO, "Port: %u Recovered from FW reset\n",
 		    bp->eth_dev->data->port_id);
 	pthread_mutex_unlock(&bp->err_recovery_lock);
-
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_SUCCESS,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop(bp->eth_dev);
 err:
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_FAILED,
+				     NULL);
 	if (bp->eth_dev->data->dev_conf.intr_conf.rmv)
 		rte_eth_dev_callback_process(bp->eth_dev,
 					     RTE_ETH_EVENT_INTR_RMV,
@@ -4560,6 +4567,10 @@ static void bnxt_check_fw_health(void *arg)
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
 
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_ERR_RECOVERING,
+				     NULL);
+
 	if (bnxt_is_primary_func(bp))
 		wait_msec = info->primary_func_wait_period;
 	else
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v11 4/5] net/hns3: support proactive error handling mode
  2022-10-09  9:10   ` [PATCH v11 4/5] net/hns3: support proactive error handling mode Chengwen Feng
@ 2022-10-09 11:05     ` Dongdong Liu
  0 siblings, 0 replies; 41+ messages in thread
From: Dongdong Liu @ 2022-10-09 11:05 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko


On 2022/10/9 17:10, Chengwen Feng wrote:
> This patch supports proactive error handling mode.
>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

Acked-by: Dongdong Liu <liudongdong3@huawei.com>

> ---
>  drivers/net/hns3/hns3_common.c |  2 ++
>  drivers/net/hns3/hns3_intr.c   | 24 ++++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
>
> diff --git a/drivers/net/hns3/hns3_common.c b/drivers/net/hns3/hns3_common.c
> index 14291193cb..7adc6a4972 100644
> --- a/drivers/net/hns3/hns3_common.c
> +++ b/drivers/net/hns3/hns3_common.c
> @@ -149,6 +149,8 @@ hns3_dev_infos_get(struct rte_eth_dev *eth_dev, struct rte_eth_dev_info *info)
>  		info->max_mac_addrs = HNS3_VF_UC_MACADDR_NUM;
>  	}
>
> +	info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
> +
>  	return 0;
>  }
>
> diff --git a/drivers/net/hns3/hns3_intr.c b/drivers/net/hns3/hns3_intr.c
> index 57679254ee..44a1119415 100644
> --- a/drivers/net/hns3/hns3_intr.c
> +++ b/drivers/net/hns3/hns3_intr.c
> @@ -1480,6 +1480,27 @@ static const struct hns3_hw_err_type hns3_hw_error_type[] = {
>  	}
>  };
>
> +static void
> +hns3_report_reset_begin(struct hns3_hw *hw)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
> +	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_ERR_RECOVERING, NULL);
> +}
> +
> +static void
> +hns3_report_reset_success(struct hns3_hw *hw)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
> +	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_SUCCESS, NULL);
> +}
> +
> +static void
> +hns3_report_reset_failed(struct hns3_hw *hw)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
> +	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
> +}
> +
>  static int
>  hns3_config_ncsi_hw_err_int(struct hns3_adapter *hns, bool en)
>  {
> @@ -2642,6 +2663,7 @@ hns3_reset_pre(struct hns3_adapter *hns)
>  	if (hw->reset.stage == RESET_STAGE_NONE) {
>  		__atomic_store_n(&hns->hw.reset.resetting, 1, __ATOMIC_RELAXED);
>  		hw->reset.stage = RESET_STAGE_DOWN;
> +		hns3_report_reset_begin(hw);
>  		ret = hw->reset.ops->stop_service(hns);
>  		hns3_clock_gettime(&tv);
>  		if (ret) {
> @@ -2751,6 +2773,7 @@ hns3_reset_post(struct hns3_adapter *hns)
>  			  hns3_clock_calctime_ms(&tv_delta),
>  			  tv.tv_sec, tv.tv_usec);
>  		hw->reset.level = HNS3_NONE_RESET;
> +		hns3_report_reset_success(hw);
>  	}
>  	return 0;
>
> @@ -2796,6 +2819,7 @@ hns3_reset_fail_handle(struct hns3_adapter *hns)
>  		  hns3_clock_calctime_ms(&tv_delta),
>  		  tv.tv_sec, tv.tv_usec);
>  	hw->reset.level = HNS3_NONE_RESET;
> +	hns3_report_reset_failed(hw);
>  }
>
>  /*
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v11 1/5] ethdev: support get port error handling mode
  2022-10-09  9:10   ` [PATCH v11 1/5] ethdev: support get port " Chengwen Feng
@ 2022-10-10  8:38     ` Andrew Rybchenko
  2022-10-10  8:44     ` Andrew Rybchenko
  1 sibling, 0 replies; 41+ messages in thread
From: Andrew Rybchenko @ 2022-10-10  8:38 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Stephen Hemminger

On 10/9/22 12:10, Chengwen Feng wrote:
> This patch support gets port's error handling mode by
> rte_eth_dev_info_get() API.

Just: "Add error handling mode to device info."

> 
> Currently, the defined modes include:
> 1) NONE: it means no error handling modes are supported by this port.
> 2) PASSIVE: passive error handling, after the PMD detect that a reset
> is required, the PMD reports RTE_ETH_EVENT_INTR_RESET event, and
> application invoke rte_eth_dev_reset() to recover the port.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

With review notes applied (may be except usage of reserved
fields):
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index e8d1e1c658..3443bf20e1 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1629,6 +1629,22 @@ enum rte_eth_representor_type {
>   	RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical Function. */
>   };
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this enumeration may change without prior notice.
> + *
> + * Ethernet device error handling mode.
> + */
> +enum rte_eth_err_handle_mode {
> +	/** No error handling modes are supported. */
> +	RTE_ETH_ERROR_HANDLE_MODE_NONE,
> +	/** Passive error handling, after the PMD detect that a reset is
> +	 * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET event, and
> +	 * application invoke @see rte_eth_dev_reset to recover the port.
> +	 */
> +	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
> +};
> +
>   /**
>    * A structure used to retrieve the contextual information of
>    * an Ethernet device, such as the controlling driver of the
> @@ -1689,8 +1705,12 @@ struct rte_eth_dev_info {
>   	 * embedded managed interconnect/switch.
>   	 */
>   	struct rte_eth_switch_info switch_info;
> +	/** Supported error handling mode. @see enum rte_eth_err_handle_mode */
> +	uint8_t err_handle_mode;

IMHO, it must be
     enum rte_eth_err_handle_mode err_handle_mode;
Yes, it takes a bit more space, but it is a control path and
code clearness is more important here than few extra bytes.

>   
> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	uint8_t reserved_8;       /**< Reserved for future fields  */
> +	uint16_t reserved_16s[3]; /**< Reserved for future fields  */
> +	uint64_t reserved_64;     /**< Reserved for future fields */

As far as I know it is done as per Stephen review notes, but
I'm not really sure why it is a right way in ABI breaking
release. I'd not touch it and just add a new field.
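
A rough standalone sketch of the size trade-off between the two layouts
discussed here. The struct and enum names are stand-ins that model only
the tail of struct rte_eth_dev_info, not the real definition, and the
sizes assume a typical ABI:

```c
#include <stdint.h>

/* Tail of the struct before the patch (stand-in, not the real struct). */
struct tail_before {
	uint64_t reserved_64s[2];
};

/* v11 proposal: carve the new field out of the first reserved word. */
struct tail_v11 {
	uint8_t err_handle_mode;
	uint8_t reserved_8;
	uint16_t reserved_16s[3];
	uint64_t reserved_64;
};

/* Stand-in for enum rte_eth_err_handle_mode. */
enum err_handle_mode_sketch {
	ERR_HANDLE_MODE_SKETCH_NONE,
	ERR_HANDLE_MODE_SKETCH_PASSIVE,
};

/* Alternative: add an enum field and leave the reserved fields alone. */
struct tail_enum {
	enum err_handle_mode_sketch err_handle_mode;
	uint64_t reserved_64s[2];
};

/* Splitting the reserved word keeps the size unchanged. */
_Static_assert(sizeof(struct tail_v11) == sizeof(struct tail_before),
	       "v11 layout should not grow the struct");
/* Adding a separate enum field grows the struct by a few bytes. */
_Static_assert(sizeof(struct tail_enum) > sizeof(struct tail_before),
	       "enum field adds space");
```

The second layout costs a few extra bytes but keeps the field typed,
which is the trade-off argued for above.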

>   	void *reserved_ptrs[2];   /**< Reserved for future fields */
>   };
>   


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v11 1/5] ethdev: support get port error handling mode
  2022-10-09  9:10   ` [PATCH v11 1/5] ethdev: support get port " Chengwen Feng
  2022-10-10  8:38     ` Andrew Rybchenko
@ 2022-10-10  8:44     ` Andrew Rybchenko
  1 sibling, 0 replies; 41+ messages in thread
From: Andrew Rybchenko @ 2022-10-10  8:44 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr

On 10/9/22 12:10, Chengwen Feng wrote:
> This patch support gets port's error handling mode by
> rte_eth_dev_info_get() API.
> 
> Currently, the defined modes include:
> 1) NONE: it means no error handling modes are supported by this port.
> 2) PASSIVE: passive error handling, after the PMD detect that a reset
> is required, the PMD reports RTE_ETH_EVENT_INTR_RESET event, and
> application invoke rte_eth_dev_reset() to recover the port.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

In fact, one more point below.

> ---
>   app/test-pmd/config.c               |  2 ++
>   drivers/net/e1000/igb_ethdev.c      |  2 ++
>   drivers/net/ena/ena_ethdev.c        |  2 ++
>   drivers/net/iavf/iavf_ethdev.c      |  2 ++
>   drivers/net/ixgbe/ixgbe_ethdev.c    |  2 ++
>   drivers/net/txgbe/txgbe_ethdev_vf.c |  2 ++
>   lib/ethdev/rte_ethdev.h             | 22 +++++++++++++++++++++-
>   7 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> index 841e8efe78..bd7f2ba257 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -921,6 +921,8 @@ port_infos_display(portid_t port_id)
>   			printf("Switch Rx domain: %u\n",
>   			       dev_info.switch_info.rx_domain);
>   	}
> +	if (dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE)
> +		printf("Device error handling mode: passive\n");

It should be done using switch/case instead of if/elseif.
Also I'd say that none should be handled as well.
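
A minimal standalone sketch of the switch/case form, with "none" handled
too. The enum is a local stand-in mirroring the one added by this patch
so that the snippet builds without DPDK headers, and
err_handle_mode_name() is a hypothetical helper, not existing testpmd
code:

```c
#include <stdio.h>

/* Local stand-in for the enum added to rte_ethdev.h by this patch. */
enum rte_eth_err_handle_mode {
	RTE_ETH_ERROR_HANDLE_MODE_NONE,
	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
};

/* Hypothetical helper: map a mode value to a printable name. */
static const char *
err_handle_mode_name(int mode)
{
	switch (mode) {
	case RTE_ETH_ERROR_HANDLE_MODE_NONE:
		return "none";
	case RTE_ETH_ERROR_HANDLE_MODE_PASSIVE:
		return "passive";
	default:
		/* Values this code does not know about. */
		return "unknown";
	}
}

/* Print the line in the same style as the other port info output. */
static void
print_err_handle_mode(int mode)
{
	printf("Device error handling mode: %s\n",
	       err_handle_mode_name(mode));
}
```

A new mode then only needs one more case label, and an unexpected value
still prints something sensible.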

>   }
>   
>   void


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v11 2/5] ethdev: support proactive error handling mode
  2022-10-09  9:10   ` [PATCH v11 2/5] ethdev: support proactive " Chengwen Feng
@ 2022-10-10  8:47     ` Andrew Rybchenko
  2022-10-11 14:48       ` fengchengwen
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Rybchenko @ 2022-10-10  8:47 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr

On 10/9/22 12:10, Chengwen Feng wrote:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
> to recover from the errors. In this process, the PMD sets the data path
> pointers to dummy functions (which will prevent the crash), and also
> make sure the control path operations failed with retcode -EBUSY.

Could you explain why passive mode is not good? Why is
proactive better? What are the benefits? IMHO, it would
be simpler to have just one error recovery mode.
> 
> The above error handling mode is known as
> RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).
> 
> In some service scenarios, application needs to be aware of the event
> to determine whether to migrate services. So three events were
> introduced:
> 
> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
> detected an error and the recovery is being started. Upon receiving the
> event, the application should not invoke any control path APIs until
> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
> 
> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
> it recovers successful from the error, the PMD already re-configures the
> port, and the effect is the same as that of the restart operation.
> 
> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
> recovers failed from the error, the port should not usable anymore. The
> application should close the port.
> 
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

The code itself LGTM. I just want to understand why we need it.
It should be proved in the description.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v11 2/5] ethdev: support proactive error handling mode
  2022-10-10  8:47     ` Andrew Rybchenko
@ 2022-10-11 14:48       ` fengchengwen
  0 siblings, 0 replies; 41+ messages in thread
From: fengchengwen @ 2022-10-11 14:48 UTC (permalink / raw)
  To: Andrew Rybchenko, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr

Hi Andrew,

On 2022/10/10 16:47, Andrew Rybchenko wrote:
> On 10/9/22 12:10, Chengwen Feng wrote:
>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>
>> Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
>> to recover from the errors. In this process, the PMD sets the data path
>> pointers to dummy functions (which will prevent the crash), and also
>> make sure the control path operations failed with retcode -EBUSY.
>
> Could you explain why passive mode is not good. Why is
> proactive better? What are the benefits? IMHO, it would
> be simpler to have just one error recovery mode.


I don't think either mode is simply good or bad. To a large extent, the
choice is determined by the hardware and software design of the NIC.
Take the hns3 driver as an example:

During error recovery, multiple handshakes are required between the
driver and the firmware, and these handshakes are subject to timeouts.

If passive mode is chosen, the application may not register the callback
(we found that, among many DPDK-based open-source projects, only
ovs-dpdk registers for the reset event), so the recovery will fail.
Furthermore, even if the callback is registered, the recovery involves
multiple handshakes and may take too long to complete. Imagine multiple
ports reporting a reset at the same time. (This can happen: consider a
PF reset that affects multiple VFs under that PF.) In this case, many
VFs report the event, but the event callbacks are executed sequentially
(because there is only one interrupt thread). As a result, the later VFs
cannot be processed in time, and their reset may fail.

In conclusion, proactive mode is a practical troubleshooting method in
engineering practice.
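
To make the serial-processing argument concrete, here is a small sketch
under a simplified model that is an assumption, not hns3 behaviour: all
ports start waiting at the same moment, one thread resets them back to
back, and each reset takes the same time. ports_reset_in_time() and its
parameters are hypothetical names:

```c
#include <stdint.h>

/*
 * Hypothetical model: a single interrupt thread runs the reset callbacks
 * one after another. Port i (0-based) finishes at (i + 1) * per_port_ms.
 * Returns how many ports finish within timeout_ms.
 */
static unsigned int
ports_reset_in_time(unsigned int nb_ports, uint32_t per_port_ms,
		    uint32_t timeout_ms)
{
	uint64_t elapsed_ms = 0;
	unsigned int done = 0;
	unsigned int i;

	for (i = 0; i < nb_ports; i++) {
		elapsed_ms += per_port_ms;
		if (elapsed_ms > timeout_ms)
			break; /* this port and all later ones time out */
		done++;
	}
	return done;
}
```

With made-up numbers, 8 VFs at 300 ms each against a 1000 ms limit give
ports_reset_in_time(8, 300, 1000) == 3, so five VFs would miss the
deadline in this model.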


>>
>> The above error handling mode is known as
>> RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).
>>
>> In some service scenarios, application needs to be aware of the event
>> to determine whether to migrate services. So three events were
>> introduced:
>>
>> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
>> detected an error and the recovery is being started. Upon receiving the
>> event, the application should not invoke any control path APIs until
>> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>
>> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
>> it recovers successful from the error, the PMD already re-configures the
>> port, and the effect is the same as that of the restart operation.
>>
>> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> recovers failed from the error, the port should not usable anymore. The
>> application should close the port.
>>
>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>
> The code itself LGTM. I just want to understand why we need it.
> It should be proved in the description.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v12 0/5] support error handling mode
       [not found] <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
                   ` (2 preceding siblings ...)
  2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
@ 2022-10-12  3:45 ` Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 1/5] ethdev: add error handling mode to device info Chengwen Feng
                     ` (4 more replies)
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
  4 siblings, 5 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-12  3:45 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patchset introduces the concept of an error handling mode. The
supported modes are as follows:

1) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

2) PROACTIVE: proactive error handling. After the PMD detects that a
reset is required, it reports the RTE_ETH_EVENT_ERR_RECOVERING event,
performs the recovery internally, and finally reports the recovery
result event.

Chengwen Feng (2):
  ethdev: add error handling mode to device info
  net/hns3: support proactive error handling mode

Kalesh AP (3):
  ethdev: support proactive error handling mode
  app/testpmd: support error handling mode event
  net/bnxt: support proactive error handling mode

---
v12: Address comments from Andrew.
v11: Fix clang-static failure due to wrong experimental placement.
v10: Accurately describe the recovery success scenario, addressing
     comments from Ferruh.
v9: Introduce error handling mode concept.
    Addressed comments from Thomas and Ray.
v8: Addressed comments from Thomas and Ferruh.
    Also introduced RECOVER_FAIL event.
    Add hns3 driver patch.
v7: Addressed comments from Thomas and Andrew.
v6: Addressed comments from Asaf Penso.
    1. Updated 20.11 release notes with the new events added.
    2. updated testpmd parse_event_printing_config function.
v5: Addressed comments from Ophir Munk.
    1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
    2. Fixed testpmd logs.
    3. Documented the new recovery events.
v4: Addressed comments from Thomas Monjalon
    1. Added doxygen comments about new events.
V3: Fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
    RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

 app/test-pmd/config.c                   | 15 +++++
 app/test-pmd/parameters.c               | 10 +++-
 app/test-pmd/testpmd.c                  |  8 ++-
 doc/guides/prog_guide/poll_mode_drv.rst | 38 ++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 ++++
 drivers/net/bnxt/bnxt_cpr.c             |  4 ++
 drivers/net/bnxt/bnxt_ethdev.c          | 13 ++++-
 drivers/net/e1000/igb_ethdev.c          |  2 +
 drivers/net/ena/ena_ethdev.c            |  2 +
 drivers/net/hns3/hns3_common.c          |  2 +
 drivers/net/hns3/hns3_intr.c            | 24 ++++++++
 drivers/net/iavf/iavf_ethdev.c          |  2 +
 drivers/net/ixgbe/ixgbe_ethdev.c        |  2 +
 drivers/net/txgbe/txgbe_ethdev_vf.c     |  2 +
 lib/ethdev/rte_ethdev.h                 | 77 +++++++++++++++++++++++++
 15 files changed, 209 insertions(+), 4 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v12 1/5] ethdev: add error handling mode to device info
  2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
@ 2022-10-12  3:45   ` Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 2/5] ethdev: support proactive error handling mode Chengwen Feng
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-12  3:45 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds the error handling mode to device info. Currently, the
defined error handling modes include:

1) NONE: no error handling mode is supported by this port.

2) PASSIVE: passive error handling. After the PMD detects that a reset
is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test-pmd/config.c               | 12 ++++++++++++
 drivers/net/e1000/igb_ethdev.c      |  2 ++
 drivers/net/ena/ena_ethdev.c        |  2 ++
 drivers/net/iavf/iavf_ethdev.c      |  2 ++
 drivers/net/ixgbe/ixgbe_ethdev.c    |  2 ++
 drivers/net/txgbe/txgbe_ethdev_vf.c |  2 ++
 lib/ethdev/rte_ethdev.h             | 18 ++++++++++++++++++
 7 files changed, 40 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index dec16a9049..4cddcd0bf7 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -921,6 +921,18 @@ port_infos_display(portid_t port_id)
 			printf("Switch Rx domain: %u\n",
 			       dev_info.switch_info.rx_domain);
 	}
+	printf("Device error handling mode: ");
+	switch (dev_info.err_handle_mode) {
+	case RTE_ETH_ERROR_HANDLE_MODE_NONE:
+		printf("none\n");
+		break;
+	case RTE_ETH_ERROR_HANDLE_MODE_PASSIVE:
+		printf("passive\n");
+		break;
+	default:
+		printf("unknown\n");
+		break;
+	}
 }
 
 void
diff --git a/drivers/net/e1000/igb_ethdev.c b/drivers/net/e1000/igb_ethdev.c
index d6bcc5bf58..8858f975f8 100644
--- a/drivers/net/e1000/igb_ethdev.c
+++ b/drivers/net/e1000/igb_ethdev.c
@@ -2341,6 +2341,8 @@ eth_igbvf_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c
index 3e88bcda6c..efcb163027 100644
--- a/drivers/net/ena/ena_ethdev.c
+++ b/drivers/net/ena/ena_ethdev.c
@@ -2482,6 +2482,8 @@ static int ena_infos_get(struct rte_eth_dev *dev,
 	dev_info->default_rxportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 	dev_info->default_txportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 782be82c7f..b1958e0474 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1179,6 +1179,8 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 		.nb_align = IAVF_ALIGN_RING_DESC,
 	};
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index bf70ee041d..fd06ddbe35 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -4056,6 +4056,8 @@ ixgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index f52cd8bc19..3b1f7c913b 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -521,6 +521,8 @@ txgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d43a638aff..5de8e13866 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1686,6 +1686,22 @@ enum rte_eth_representor_type {
 	RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical Function. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this enumeration may change without prior notice.
+ *
+ * Ethernet device error handling mode.
+ */
+enum rte_eth_err_handle_mode {
+	/** No error handling modes are supported. */
+	RTE_ETH_ERROR_HANDLE_MODE_NONE,
+	/** Passive error handling, after the PMD detect that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET event, and
+	 * application invoke @see rte_eth_dev_reset to recover the port.
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+};
+
 /**
  * A structure used to retrieve the contextual information of
  * an Ethernet device, such as the controlling driver of the
@@ -1753,6 +1769,8 @@ struct rte_eth_dev_info {
 	 * embedded managed interconnect/switch.
 	 */
 	struct rte_eth_switch_info switch_info;
+	/** Supported error handling mode. */
+	enum rte_eth_err_handle_mode err_handle_mode;
 
 	uint64_t reserved_64s[2]; /**< Reserved for future fields */
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v12 2/5] ethdev: support proactive error handling mode
  2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 1/5] ethdev: add error handling mode to device info Chengwen Feng
@ 2022-10-12  3:45   ` Chengwen Feng
  2022-10-13  8:58     ` Andrew Rybchenko
  2022-10-12  3:45   ` [PATCH v12 3/5] app/testpmd: support error handling mode event Chengwen Feng
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Chengwen Feng @ 2022-10-12  3:45 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Some PMDs (e.g. hns3) can detect hardware or firmware errors. One error
recovery mode is to report the RTE_ETH_EVENT_INTR_RESET event and wait
for the application to invoke rte_eth_dev_reset() to recover the port.
However, this mode has the following weaknesses:

1) Due to differences in hardware and software design, the recovery
process of some NIC ports requires multiple handshakes with the firmware
and the PF (when the port is a VF). It takes a long time to complete the
entire operation for one port. If multiple ports (for example, multiple
VFs of a PF) are reset at the same time, some VFs may fail to be reset,
because the reset processing is serial: the earlier VFs must be
processed before the later ones.

2) The impact on the application layer is large: it must stop the
working queues, stop calling the Rx and Tx functions, call
rte_eth_dev_reset(), and then set everything up again.

This patch introduces proactive error handling mode, in which the PMD
tries to recover from the errors itself. During this process, the PMD
sets the data path pointers to dummy functions (which prevents a crash)
and makes sure that control path operations fail with return code
-EBUSY.

Because the PMD recovers automatically, the application only sees that
the data flow is interrupted for a while and that the control APIs
return an error during this period.

To let the application sense the error and the recovery, three events
are introduced:

1) RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD
detected an error and that recovery is starting. Upon receiving the
event, the application should not invoke any control path APIs until it
receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
RTE_ETH_EVENT_RECOVERY_FAILED event.

2) RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the
port recovered successfully from the error. The PMD has already
re-configured the port, and the effect is the same as that of a restart
operation.

3) RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the
port failed to recover from the error. The port should not be used
anymore, and the application should close it.
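
For illustration only, a minimal sketch of how an application might
track these three events. The event enum below is a local stand-in so
that the snippet builds without DPDK headers, and port_state,
port_state_next() and control_path_allowed() are hypothetical names. In
a real application the transition would presumably be driven from a
callback registered with rte_eth_dev_callback_register():

```c
#include <stdbool.h>

/* Local stand-in for the three events added to enum rte_eth_event_type. */
enum rte_eth_event_type {
	RTE_ETH_EVENT_ERR_RECOVERING,
	RTE_ETH_EVENT_RECOVERY_SUCCESS,
	RTE_ETH_EVENT_RECOVERY_FAILED,
};

/* Hypothetical per-port state kept by the application. */
enum port_state {
	PORT_RUNNING,    /* control path may be used */
	PORT_RECOVERING, /* PMD is recovering, keep off the control path */
	PORT_FAILED,     /* recovery failed, the port should be closed */
};

/* Compute the next state from the current state and the event. */
static enum port_state
port_state_next(enum port_state cur, enum rte_eth_event_type event)
{
	switch (event) {
	case RTE_ETH_EVENT_ERR_RECOVERING:
		/* May arrive again while recovery is already in progress. */
		return PORT_RECOVERING;
	case RTE_ETH_EVENT_RECOVERY_SUCCESS:
		return PORT_RUNNING;
	case RTE_ETH_EVENT_RECOVERY_FAILED:
		return PORT_FAILED;
	default:
		return cur;
	}
}

/* Control path APIs should only be invoked while the port is running. */
static bool
control_path_allowed(enum port_state state)
{
	return state == PORT_RUNNING;
}
```

The application would check control_path_allowed() before calling
control path APIs, and close the port once the state becomes
PORT_FAILED.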

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/config.c                   |  3 ++
 doc/guides/prog_guide/poll_mode_drv.rst | 38 ++++++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 +++++
 lib/ethdev/rte_ethdev.h                 | 59 +++++++++++++++++++++++++
 4 files changed, 112 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 4cddcd0bf7..0f7dbd698f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -929,6 +929,9 @@ port_infos_display(portid_t port_id)
 	case RTE_ETH_ERROR_HANDLE_MODE_PASSIVE:
 		printf("passive\n");
 		break;
+	case RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE:
+		printf("proactive\n");
+		break;
 	default:
 		printf("unknown\n");
 		break;
diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..73941a74bd 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,41 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
+
+Proactive Error Handling Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, then once it
+detects a hardware or firmware error, the PMD tries to recover from it. In
+this process, the PMD sets the data path pointers to dummy functions (which
+prevents a crash), and also makes sure that control path operations fail
+with return code -EBUSY.
+
+During this process, from the perspective of the application, services are
+affected. For example, the Rx/Tx burst APIs cannot receive and send packets,
+and the control plane APIs return failure.
+
+In some service scenarios, the application needs to be aware of the event to
+determine whether to migrate services. So three events are introduced:
+
+* RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD detected
+  an error and that recovery is starting. Upon receiving the event, the
+  application should not invoke any control path APIs until it receives the
+  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
+
+* RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the port
+  recovered successfully from the error. The PMD has already re-configured
+  the port, and the effect is the same as that of a restart operation.
+
+* RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the port
+  failed to recover from the error. The port should not be used anymore, and
+  the application should close it.
+
+.. note::
+        * Before the PMD reports the recovery result, the PMD may report the
+          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
+          may occur during the recovery.
+        * The error handling mode supported by the PMD can be reported through
+          the ``rte_eth_dev_info_get`` API.
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 2da8bc9661..a3700bbb34 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -124,6 +124,18 @@ New Features
   Added new flow action which allows application to re-route packets
   directly to the kernel without software involvement.
 
+* **Added proactive error handling mode for ethdev.**
+
+  Added proactive error handling mode for ethdev, and three events were
+  introduced:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` for the PMD to report
+    that the port recovered successfully from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_FAILED`` for the PMD to report
+    that the port failed to recover from an error.
+
 * **Updated AF_XDP driver.**
 
   * Made compatible with libbpf v0.8.0 (when used with libxdp).
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 5de8e13866..46ecc9a0fe 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1700,6 +1700,12 @@ enum rte_eth_err_handle_mode {
 	 * application invoke @see rte_eth_dev_reset to recover the port.
 	 */
 	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+	/** Proactive error handling, after the PMD detect that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_ERR_RECOVERING event,
+	 * and do recovery internally, finally, reports the recovery result
+	 * event (@see RTE_ETH_EVENT_RECOVERY_*).
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE,
 };
 
 /**
@@ -3886,6 +3892,59 @@ enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error.
+	 * If PMD supports proactive error recovery, it should trigger this
+	 * event to notify application that it detected an error and the
+	 * recovery is being started. Upon receiving the event, the application
+	 * should not invoke any control path APIs (such as
+	 * rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
+	 * event.
+	 * The PMD will set the data path pointers to dummy functions, and
+	 * re-set the data path pointers to non-dummy functions before reporting
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS event. It means that the application
+	 * cannot send or receive any packets during this period.
+	 * @note Before the PMD reports the recovery result, the PMD may report
+	 * the RTE_ETH_EVENT_ERR_RECOVERING event again, because a larger error
+	 * may occur during the recovery.
+	 */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovered successfully from the error.
+	 * The PMD already re-configures the port, and the effect is the same as
+	 * that of the restart operation.
+	 * a) the following operation will be retained: (alphabetically)
+	 *    - DCB configuration
+	 *    - FEC configuration
+	 *    - Flow control configuration
+	 *    - LRO configuration
+	 *    - LSC configuration
+	 *    - MTU
+	 *    - Mac address (default and those supplied by MAC address array)
+	 *    - Promiscuous and allmulticast mode
+	 *    - PTP configuration
+	 *    - Queue (Rx/Tx) settings
+	 *    - Queue statistics mappings
+	 *    - RSS configuration by rte_eth_dev_rss_xxx() family
+	 *    - Rx checksum configuration
+	 *    - Rx interrupt settings
+	 *    - Traffic management configuration
+	 *    - VLAN configuration (including filtering, tpid, strip, pvid)
+	 *    - VMDq configuration
+	 * b) the following configuration may be retained or not depending on the
+	 *    device capabilities:
+	 *    - flow rules
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
+	 *    - shared flow objects
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
+	 * c) the other configuration will not be stored and will need to be
+	 *    re-configured.
+	 */
+	RTE_ETH_EVENT_RECOVERY_SUCCESS,
+	/** Port failed to recover from the error.
+	 * It means that the port is not usable anymore. The application
+	 * should close the port.
+	 */
+	RTE_ETH_EVENT_RECOVERY_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.17.1
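
For readers following the data path part of the event description above,
here is a minimal sketch of the dummy-function swap, written in plain C
with no DPDK dependency. All names in it (struct port_ops, real_burst,
dummy_burst, port_enter_recovery, port_leave_recovery) are hypothetical
and are not taken from the patch or from any PMD. A real driver would
also have to synchronize with lcores that are polling concurrently,
which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; not the DPDK rte_mbuf. */
struct pkt;

/* Burst function type: returns the number of packets actually handled. */
typedef uint16_t (*burst_fn)(void *queue, struct pkt **pkts, uint16_t n);

/* Hypothetical per-port data path table. */
struct port_ops {
	burst_fn rx_burst;
	burst_fn tx_burst;
};

/* Stand-in for a working burst function: pretends all n packets were handled. */
static uint16_t
real_burst(void *queue, struct pkt **pkts, uint16_t n)
{
	(void)queue;
	(void)pkts;
	return n;
}

/* Dummy burst: touches no hardware state and reports zero packets. */
static uint16_t
dummy_burst(void *queue, struct pkt **pkts, uint16_t n)
{
	(void)queue;
	(void)pkts;
	(void)n;
	return 0;
}

/* When recovery starts: save the real functions and install the dummies. */
static void
port_enter_recovery(struct port_ops *ops, struct port_ops *saved)
{
	assert(ops != NULL && saved != NULL);
	*saved = *ops;
	ops->rx_burst = dummy_burst;
	ops->tx_burst = dummy_burst;
}

/* Just before success is reported: restore the real functions. */
static void
port_leave_recovery(struct port_ops *ops, const struct port_ops *saved)
{
	assert(ops != NULL && saved != NULL);
	*ops = *saved;
}
```

Between port_enter_recovery() and port_leave_recovery() every burst call
returns 0, which is how the application observes "cannot send or receive
any packets during this period" without crashing.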



* [PATCH v12 3/5] app/testpmd: support error handling mode event
  2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 1/5] ethdev: add error handling mode to device info Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 2/5] ethdev: support proactive error handling mode Chengwen Feng
@ 2022-10-12  3:45   ` Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 4/5] net/hns3: support proactive error handling mode Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-12  3:45 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for processing the error handling mode events
in testpmd.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/parameters.c | 10 ++++++++--
 app/test-pmd/testpmd.c    |  8 +++++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index ff760460ec..b56383dc4a 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,9 +167,9 @@ usage(char* progname)
 	printf("  --no-rmv-interrupt: disable device removal interrupt.\n");
 	printf("  --bitrate-stats=N: set the logical core N to perform "
 		"bit-rate calculation.\n");
-	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "enable print of designated event or all of them.\n");
-	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "disable print of designated event or all of them.\n");
 	printf("  --flow-isolate-all: "
 	       "requests flow API isolated mode on all ports at initialization time.\n");
@@ -453,6 +453,12 @@ parse_event_printing_config(const char *optarg, int enable)
 		mask = UINT32_C(1) << RTE_ETH_EVENT_DESTROY;
 	else if (!strcmp(optarg, "flow_aged"))
 		mask = UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED;
+	else if (!strcmp(optarg, "err_recovering"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING;
+	else if (!strcmp(optarg, "recovery_success"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS;
+	else if (!strcmp(optarg, "recovery_failed"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED;
 	else if (!strcmp(optarg, "all"))
 		mask = ~UINT32_C(0);
 	else {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 5b0f0838dc..f4f1888446 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -426,6 +426,9 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
 	[RTE_ETH_EVENT_RX_AVAIL_THRESH] = "RxQ available descriptors threshold reached",
+	[RTE_ETH_EVENT_ERR_RECOVERING] = "error recovering",
+	[RTE_ETH_EVENT_RECOVERY_SUCCESS] = "error recovery successful",
+	[RTE_ETH_EVENT_RECOVERY_FAILED] = "error recovery failed",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -440,7 +443,10 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.17.1
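
The name-to-mask mapping extended by this patch can be sketched in a
table-driven form as follows. This is a standalone illustration, not the
testpmd code: the demo_* identifiers and the enum values are
hypothetical, and the enable/disable handling of the mask is an
assumption about how the option is applied. As in the patch, each event
owns one bit of a 32-bit mask, so the scheme only works while there are
at most 32 event types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical event indices; real values come from enum rte_eth_event_type. */
enum demo_event {
	DEMO_EVENT_ERR_RECOVERING,
	DEMO_EVENT_RECOVERY_SUCCESS,
	DEMO_EVENT_RECOVERY_FAILED,
	DEMO_EVENT_MAX
};

/* Option strings, using the same spelling as the new testpmd options. */
static const char *const demo_event_names[DEMO_EVENT_MAX] = {
	[DEMO_EVENT_ERR_RECOVERING] = "err_recovering",
	[DEMO_EVENT_RECOVERY_SUCCESS] = "recovery_success",
	[DEMO_EVENT_RECOVERY_FAILED] = "recovery_failed",
};

/*
 * Set (enable != 0) or clear (enable == 0) the bit of the named event in
 * *print_mask. "all" selects every bit. Returns 0 on success and -1 if the
 * name is not recognized, in which case *print_mask is left untouched.
 */
static int
demo_parse_event(const char *name, int enable, uint32_t *print_mask)
{
	uint32_t mask = 0;
	int i;

	assert(name != NULL && print_mask != NULL);

	if (strcmp(name, "all") == 0) {
		mask = ~UINT32_C(0);
	} else {
		for (i = 0; i < DEMO_EVENT_MAX; i++) {
			if (strcmp(name, demo_event_names[i]) == 0) {
				mask = UINT32_C(1) << i;
				break;
			}
		}
		if (mask == 0)
			return -1;
	}

	if (enable)
		*print_mask |= mask;
	else
		*print_mask &= ~mask;
	return 0;
}
```

A table keeps the option strings and the enum in one place, at the cost
of a loop instead of the if/else chain used in parameters.c.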



* [PATCH v12 4/5] net/hns3: support proactive error handling mode
  2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
                     ` (2 preceding siblings ...)
  2022-10-12  3:45   ` [PATCH v12 3/5] app/testpmd: support error handling mode event Chengwen Feng
@ 2022-10-12  3:45   ` Chengwen Feng
  2022-10-12  3:45   ` [PATCH v12 5/5] net/bnxt: " Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-12  3:45 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds support for proactive error handling mode to the hns3
PMD.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Dongdong Liu <liudongdong3@huawei.com>
---
 drivers/net/hns3/hns3_common.c |  2 ++
 drivers/net/hns3/hns3_intr.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/drivers/net/hns3/hns3_common.c b/drivers/net/hns3/hns3_common.c
index 14291193cb..7adc6a4972 100644
--- a/drivers/net/hns3/hns3_common.c
+++ b/drivers/net/hns3/hns3_common.c
@@ -149,6 +149,8 @@ hns3_dev_infos_get(struct rte_eth_dev *eth_dev, struct rte_eth_dev_info *info)
 		info->max_mac_addrs = HNS3_VF_UC_MACADDR_NUM;
 	}
 
+	info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/hns3/hns3_intr.c b/drivers/net/hns3/hns3_intr.c
index 57679254ee..44a1119415 100644
--- a/drivers/net/hns3/hns3_intr.c
+++ b/drivers/net/hns3/hns3_intr.c
@@ -1480,6 +1480,27 @@ static const struct hns3_hw_err_type hns3_hw_error_type[] = {
 	}
 };
 
+static void
+hns3_report_reset_begin(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_ERR_RECOVERING, NULL);
+}
+
+static void
+hns3_report_reset_success(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_SUCCESS, NULL);
+}
+
+static void
+hns3_report_reset_failed(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
+}
+
 static int
 hns3_config_ncsi_hw_err_int(struct hns3_adapter *hns, bool en)
 {
@@ -2642,6 +2663,7 @@ hns3_reset_pre(struct hns3_adapter *hns)
 	if (hw->reset.stage == RESET_STAGE_NONE) {
 		__atomic_store_n(&hns->hw.reset.resetting, 1, __ATOMIC_RELAXED);
 		hw->reset.stage = RESET_STAGE_DOWN;
+		hns3_report_reset_begin(hw);
 		ret = hw->reset.ops->stop_service(hns);
 		hns3_clock_gettime(&tv);
 		if (ret) {
@@ -2751,6 +2773,7 @@ hns3_reset_post(struct hns3_adapter *hns)
 			  hns3_clock_calctime_ms(&tv_delta),
 			  tv.tv_sec, tv.tv_usec);
 		hw->reset.level = HNS3_NONE_RESET;
+		hns3_report_reset_success(hw);
 	}
 	return 0;
 
@@ -2796,6 +2819,7 @@ hns3_reset_fail_handle(struct hns3_adapter *hns)
 		  hns3_clock_calctime_ms(&tv_delta),
 		  tv.tv_sec, tv.tv_usec);
 	hw->reset.level = HNS3_NONE_RESET;
+	hns3_report_reset_failed(hw);
 }
 
 /*
-- 
2.17.1
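
The series states that control path operations fail with -EBUSY while
recovery is in progress, and the hunk above shows hns3 setting
hw->reset.resetting when the reset begins. One plausible way to tie the
two together is a flag check at the top of each control path operation.
The sketch below illustrates that idea with C11 atomics; struct demo_hw
and the demo_* functions are hypothetical and are not the hns3
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device state; not the hns3 structures. */
struct demo_hw {
	atomic_int resetting; /* non-zero while recovery is in progress */
	uint16_t mtu;
};

/* Mark the start of recovery (where ERR_RECOVERING would be reported). */
static void
demo_reset_begin(struct demo_hw *hw)
{
	atomic_store_explicit(&hw->resetting, 1, memory_order_relaxed);
}

/* Mark the end of recovery (where the result event would be reported). */
static void
demo_reset_end(struct demo_hw *hw)
{
	atomic_store_explicit(&hw->resetting, 0, memory_order_relaxed);
}

/* Example control path operation: refuses to run during recovery. */
static int
demo_set_mtu(struct demo_hw *hw, uint16_t mtu)
{
	assert(hw != NULL);
	if (atomic_load_explicit(&hw->resetting, memory_order_relaxed) != 0)
		return -EBUSY;
	hw->mtu = mtu;
	return 0;
}
```

An application that receives -EBUSY from a control call can then wait
for the recovery result event and retry, instead of treating the error
as fatal.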



* [PATCH v12 5/5] net/bnxt: support proactive error handling mode
  2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
                     ` (3 preceding siblings ...)
  2022-10-12  3:45   ` [PATCH v12 4/5] net/hns3: support proactive error handling mode Chengwen Feng
@ 2022-10-12  3:45   ` Chengwen Feng
  4 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-12  3:45 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for proactive error handling mode to the bnxt
PMD.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/bnxt/bnxt_cpr.c    |  4 ++++
 drivers/net/bnxt/bnxt_ethdev.c | 13 ++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 99af0f9e87..5bb376d4d5 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -180,6 +180,10 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			return;
 		}
 
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_ERR_RECOVERING,
+					     NULL);
+
 		pthread_mutex_lock(&bp->err_recovery_lock);
 		event_data = data1;
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index 3dfe9efc09..b3de490d36 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -1063,6 +1063,8 @@ static int bnxt_dev_info_get_op(struct rte_eth_dev *eth_dev,
 	dev_info->vmdq_pool_base = 0;
 	dev_info->vmdq_queue_base = 0;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
@@ -4382,13 +4384,18 @@ static void bnxt_dev_recover(void *arg)
 	PMD_DRV_LOG(INFO, "Port: %u Recovered from FW reset\n",
 		    bp->eth_dev->data->port_id);
 	pthread_mutex_unlock(&bp->err_recovery_lock);
-
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_SUCCESS,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop(bp->eth_dev);
 err:
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_FAILED,
+				     NULL);
 	if (bp->eth_dev->data->dev_conf.intr_conf.rmv)
 		rte_eth_dev_callback_process(bp->eth_dev,
 					     RTE_ETH_EVENT_INTR_RMV,
@@ -4560,6 +4567,10 @@ static void bnxt_check_fw_health(void *arg)
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
 
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_ERR_RECOVERING,
+				     NULL);
+
 	if (bnxt_is_primary_func(bp))
 		wait_msec = info->primary_func_wait_period;
 	else
-- 
2.17.1
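
On the application side, the three events map naturally onto a small
per-port state machine. The sketch below is one possible shape for it,
assuming the application keeps its own state per port. The demo_* types
and functions are hypothetical. In a real application the handler would
presumably be driven from a callback registered with
rte_eth_dev_callback_register(), and the port would be closed after a
failed recovery. Note that ERR_RECOVERING may arrive more than once
before a result is reported, so the handler must tolerate repeats.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirrors of the three new ethdev events. */
enum demo_event {
	DEMO_EVENT_ERR_RECOVERING,
	DEMO_EVENT_RECOVERY_SUCCESS,
	DEMO_EVENT_RECOVERY_FAILED
};

/* Application-side view of a port. */
enum demo_port_state {
	DEMO_PORT_RUNNING,
	DEMO_PORT_RECOVERING,
	DEMO_PORT_DEAD
};

struct demo_port {
	enum demo_port_state state;
	unsigned int recovering_events; /* count of ERR_RECOVERING seen */
};

/* Update the application's view of the port when an event arrives. */
static void
demo_on_event(struct demo_port *port, enum demo_event event)
{
	assert(port != NULL);

	switch (event) {
	case DEMO_EVENT_ERR_RECOVERING:
		/* May repeat during one recovery; a dead port stays dead. */
		if (port->state != DEMO_PORT_DEAD)
			port->state = DEMO_PORT_RECOVERING;
		port->recovering_events++;
		break;
	case DEMO_EVENT_RECOVERY_SUCCESS:
		if (port->state == DEMO_PORT_RECOVERING)
			port->state = DEMO_PORT_RUNNING;
		break;
	case DEMO_EVENT_RECOVERY_FAILED:
		/* The port is not usable anymore; it should be closed. */
		port->state = DEMO_PORT_DEAD;
		break;
	}
}

/* Control path APIs should only be invoked while the port is running. */
static bool
demo_control_path_allowed(const struct demo_port *port)
{
	return port->state == DEMO_PORT_RUNNING;
}
```

Checking demo_control_path_allowed() before each control call keeps the
application from invoking control path APIs between ERR_RECOVERING and
the recovery result.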



* Re: [PATCH v12 2/5] ethdev: support proactive error handling mode
  2022-10-12  3:45   ` [PATCH v12 2/5] ethdev: support proactive error handling mode Chengwen Feng
@ 2022-10-13  8:58     ` Andrew Rybchenko
  2022-10-13 12:50       ` fengchengwen
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Rybchenko @ 2022-10-13  8:58 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr

On 10/12/22 06:45, Chengwen Feng wrote:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Some PMDs (e.g. hns3) could detect hardware or firmware errors, one
> error recovery mode is to report RTE_ETH_EVENT_INTR_RESET event, and
> wait for application invoke rte_eth_dev_reset() to recover the port,
> however, this mode has the following weaknesses:
> 
> 1) Due to different hardware and software design, some NIC port recovery
> process requires multiple handshakes with the firmware and PF (when the
> port is VF). It takes a long time to complete the entire operation for
> one port, If multiple ports (for example, multiple VFs of a PF) are
> reset at the same time, other VFs may fail to be reset. (Because the
> reset processing is serial, the previous VFs must be processed before
> the subsequent VFs).
> 
> 2) The impact on the application layer is great, and it should stop
> working queues, stop calling Rx and Tx functions, and then call
> rte_eth_dev_reset(), and re-setup all again.
> 
> This patch introduces proactive error handling mode, the PMD will try
> to recover from the errors itself. In this process, the PMD sets the
> data path pointers to dummy functions (which will prevent the crash),
> and also make sure the control path operations failed with retcode
> -EBUSY.
> 
> Because the PMD recovers automatically, the application can only sense
> that the data flow is disconnected for a while and the control API
> returns an error in this period.
> 
> In order to sense the error happening/recovering, three events were
> introduced:
> 
> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
> detected an error and the recovery is being started. Upon receiving the
> event, the application should not invoke any control path APIs until
> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
> 
> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
> it recovers successful from the error, the PMD already re-configures the
> port, and the effect is the same as that of the restart operation.
> 
> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
> recovers failed from the error, the port should not usable anymore. The
> application should close the port.
> 
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

With a few nits below,

Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

[snip]

> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index 9d081b1cba..73941a74bd 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -627,3 +627,41 @@ by application.
>   The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>   the application to handle reset event. It is duty of application to
>   handle all synchronization before it calls rte_eth_dev_reset().
> +
> +The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
> +
> +Proactive Error Handling Mode
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +If PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, it means once detect
> +hardware or firmware errors, the PMD will try to recover from the errors. In
> +this process, the PMD sets the data path pointers to dummy functions (which
> +will prevent the crash), and also make sure the control path operations failed
> +with retcode -EBUSY.
> +
> +Also in this process, from the perspective of application, services are
> +affected. For example, the Rx/Tx bust APIs cannot receive and send packets,

bust -> burst

> +and the control plane API return failure.

I think we need to highlight here that the key advantage of
proactive error recovery is that it requires nothing from PMD by
default. The recovery simply happens.

> +
> +In some service scenarios, application needs to be aware of the event to
> +determine whether to migrate services. So three events were introduced:
> +
> +* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it detected
> +  an error and the recovery is being started. Upon receiving the event, the
> +  application should not invoke any control path APIs until receiving
> +  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
> +
> +* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that it
> +  recovers successful from the error, the PMD already re-configures the port,
> +  and the effect is the same as that of the restart operation.
> +
> +* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
> +  recovers failed from the error, the port should not usable anymore. the
> +  application should close the port.
> +
> +.. note::
> +        * Before the PMD reports the recovery result, the PMD may report the
> +          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
> +          may occur during the recovery.
> +        * The error handling mode supported by the PMD can be reported through
> +          the ``rte_eth_dev_info_get`` API.
> diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
> +	 *    - LRO configuration
> +	 *    - LSC configuration
> +	 *    - MTU
> +	 *    - Mac address (default and those supplied by MAC address array)
> +	 *    - Promiscuous and allmulticast mode
> +	 *    - PTP configuration
> +	 *    - Queue (Rx/Tx) settings
> +	 *    - Queue statistics mappings
> +	 *    - RSS configuration by rte_eth_dev_rss_xxx() family
> +	 *    - Rx checksum configuration
> +	 *    - Rx interrupt settings
> +	 *    - Traffic management configuration
> +	 *    - VLAN configuration (including filtering, tpid, strip, pvid)
> +	 *    - VMDq configuration
> +	 * b) the following configuration maybe retained or not depending on the
> +	 *    device capabilities:
> +	 *    - flow rules
> +	 *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
> +	 *    - shared flow objects
> +	 *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
> +	 * c) the other configuration will not be stored and will need to be
> +	 *    re-configured.
> +	 */
> +	RTE_ETH_EVENT_RECOVERY_SUCCESS,
> +	/** Port recovers failed from the error.
> +	 * It means that the port should not usable anymore. The application
> +	 * should close the port.
> +	 */
> +	RTE_ETH_EVENT_RECOVERY_FAILED,
>   	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>   };

[snip]




* [PATCH v13 0/5] support error handling mode
       [not found] <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
                   ` (3 preceding siblings ...)
  2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
@ 2022-10-13 12:42 ` Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 1/5] ethdev: add error handling mode to device info Chengwen Feng
                     ` (5 more replies)
  4 siblings, 6 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-13 12:42 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patchset introduces the error handling mode concept. The supported
modes are as follows:

1) PASSIVE: passive error handling. After the PMD detects that a reset
is required, the PMD reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

2) PROACTIVE: proactive error handling. After the PMD detects that a
reset is required, the PMD reports the RTE_ETH_EVENT_ERR_RECOVERING
event, performs recovery internally, and finally reports the recovery
result event.

Chengwen Feng (2):
  ethdev: add error handling mode to device info
  net/hns3: support proactive error handling mode

Kalesh AP (3):
  ethdev: support proactive error handling mode
  app/testpmd: support error handling mode event
  net/bnxt: support proactive error handling mode

---
v13: Address comments from Andrew (rework part of rst).
v12: Address comments from Andrew.
v11: Fix clang-static failure due to wrong experimental placement.
v10: Accurately describe the recovery success scenario to address
     comments from Ferruh.
v9: Introduce error handling mode concept.
    Addressed comments from Thomas and Ray.
v8: Addressed comments from Thomas and Ferruh.
    Also introduced RECOVER_FAIL event.
    Add hns3 driver patch.
v7: Addressed comments from Thomas and Andrew.
v6: Addressed comments from Asaf Penso.
    1. Updated 20.11 release notes with the new events added.
    2. updated testpmd parse_event_printing_config function.
v5: Addressed comments from Ophir Munk.
    1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
    2. Fixed testpmd logs.
    3. Documented the new recovery events.
v4: Addressed comments from Thomas Monjalon
    1. Added doxygen comments about new events.
V3: Fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
    RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

 app/test-pmd/config.c                   | 15 +++++
 app/test-pmd/parameters.c               | 10 +++-
 app/test-pmd/testpmd.c                  |  8 ++-
 doc/guides/prog_guide/poll_mode_drv.rst | 41 +++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 ++++
 drivers/net/bnxt/bnxt_cpr.c             |  4 ++
 drivers/net/bnxt/bnxt_ethdev.c          | 13 ++++-
 drivers/net/e1000/igb_ethdev.c          |  2 +
 drivers/net/ena/ena_ethdev.c            |  2 +
 drivers/net/hns3/hns3_common.c          |  2 +
 drivers/net/hns3/hns3_intr.c            | 24 ++++++++
 drivers/net/iavf/iavf_ethdev.c          |  2 +
 drivers/net/ixgbe/ixgbe_ethdev.c        |  2 +
 drivers/net/txgbe/txgbe_ethdev_vf.c     |  2 +
 lib/ethdev/rte_ethdev.h                 | 77 +++++++++++++++++++++++++
 15 files changed, 212 insertions(+), 4 deletions(-)

-- 
2.17.1



* [PATCH v13 1/5] ethdev: add error handling mode to device info
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
@ 2022-10-13 12:42   ` Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 2/5] ethdev: support proactive error handling mode Chengwen Feng
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-13 12:42 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds the error handling mode to device info. Currently, the
defined error handling modes include:

1) NONE: no error handling modes are supported by this port.

2) PASSIVE: passive error handling. After the PMD detects that a reset
is required, the PMD reports the RTE_ETH_EVENT_INTR_RESET event, and the
application invokes rte_eth_dev_reset() to recover the port.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test-pmd/config.c               | 12 ++++++++++++
 drivers/net/e1000/igb_ethdev.c      |  2 ++
 drivers/net/ena/ena_ethdev.c        |  2 ++
 drivers/net/iavf/iavf_ethdev.c      |  2 ++
 drivers/net/ixgbe/ixgbe_ethdev.c    |  2 ++
 drivers/net/txgbe/txgbe_ethdev_vf.c |  2 ++
 lib/ethdev/rte_ethdev.h             | 18 ++++++++++++++++++
 7 files changed, 40 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index dec16a9049..4cddcd0bf7 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -921,6 +921,18 @@ port_infos_display(portid_t port_id)
 			printf("Switch Rx domain: %u\n",
 			       dev_info.switch_info.rx_domain);
 	}
+	printf("Device error handling mode: ");
+	switch (dev_info.err_handle_mode) {
+	case RTE_ETH_ERROR_HANDLE_MODE_NONE:
+		printf("none\n");
+		break;
+	case RTE_ETH_ERROR_HANDLE_MODE_PASSIVE:
+		printf("passive\n");
+		break;
+	default:
+		printf("unknown\n");
+		break;
+	}
 }
 
 void
diff --git a/drivers/net/e1000/igb_ethdev.c b/drivers/net/e1000/igb_ethdev.c
index d6bcc5bf58..8858f975f8 100644
--- a/drivers/net/e1000/igb_ethdev.c
+++ b/drivers/net/e1000/igb_ethdev.c
@@ -2341,6 +2341,8 @@ eth_igbvf_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c
index 3e88bcda6c..efcb163027 100644
--- a/drivers/net/ena/ena_ethdev.c
+++ b/drivers/net/ena/ena_ethdev.c
@@ -2482,6 +2482,8 @@ static int ena_infos_get(struct rte_eth_dev *dev,
 	dev_info->default_rxportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 	dev_info->default_txportconf.ring_size = ENA_DEFAULT_RING_SIZE;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 782be82c7f..b1958e0474 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1179,6 +1179,8 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 		.nb_align = IAVF_ALIGN_RING_DESC,
 	};
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index bf70ee041d..fd06ddbe35 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -4056,6 +4056,8 @@ ixgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index f52cd8bc19..3b1f7c913b 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -521,6 +521,8 @@ txgbevf_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->rx_desc_lim = rx_desc_lim;
 	dev_info->tx_desc_lim = tx_desc_lim;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PASSIVE;
+
 	return 0;
 }
 
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d43a638aff..5de8e13866 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1686,6 +1686,22 @@ enum rte_eth_representor_type {
 	RTE_ETH_REPRESENTOR_PF,   /**< representor of Physical Function. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this enumeration may change without prior notice.
+ *
+ * Ethernet device error handling mode.
+ */
+enum rte_eth_err_handle_mode {
+	/** No error handling modes are supported. */
+	RTE_ETH_ERROR_HANDLE_MODE_NONE,
+	/** Passive error handling: after the PMD detects that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_INTR_RESET event, and
+	 * the application invokes @see rte_eth_dev_reset to recover the port.
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+};
+
 /**
  * A structure used to retrieve the contextual information of
  * an Ethernet device, such as the controlling driver of the
@@ -1753,6 +1769,8 @@ struct rte_eth_dev_info {
 	 * embedded managed interconnect/switch.
 	 */
 	struct rte_eth_switch_info switch_info;
+	/** Supported error handling mode. */
+	enum rte_eth_err_handle_mode err_handle_mode;
 
 	uint64_t reserved_64s[2]; /**< Reserved for future fields */
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
-- 
2.17.1
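
An application can use the reported mode to decide who drives recovery.
The sketch below shows that decision in isolation. The demo_* enum is a
hypothetical mirror of enum rte_eth_err_handle_mode (it also includes
the PROACTIVE value, which only arrives with the next patch in the
series), and demo_app_must_call_reset() is an illustrative helper, not
an ethdev API. In a real application the mode would be read from the
err_handle_mode field filled in by rte_eth_dev_info_get().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirror of enum rte_eth_err_handle_mode. */
enum demo_err_handle_mode {
	DEMO_MODE_NONE,
	DEMO_MODE_PASSIVE,
	DEMO_MODE_PROACTIVE
};

/* Human-readable name, in the spirit of the testpmd port info output. */
static const char *
demo_mode_name(enum demo_err_handle_mode mode)
{
	switch (mode) {
	case DEMO_MODE_NONE:
		return "none";
	case DEMO_MODE_PASSIVE:
		return "passive";
	case DEMO_MODE_PROACTIVE:
		return "proactive";
	default:
		return "unknown";
	}
}

/*
 * In passive mode the application is expected to reset the port itself
 * after the reset event. In proactive mode the PMD recovers on its own,
 * and with no error handling mode there is no recovery to drive.
 */
static bool
demo_app_must_call_reset(enum demo_err_handle_mode mode)
{
	return mode == DEMO_MODE_PASSIVE;
}
```

Keeping the decision in one helper means that a mode added later only
requires a change in one place.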



* [PATCH v13 2/5] ethdev: support proactive error handling mode
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 1/5] ethdev: add error handling mode to device info Chengwen Feng
@ 2022-10-13 12:42   ` Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 3/5] app/testpmd: support error handling mode event Chengwen Feng
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-13 12:42 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Some PMDs (e.g. hns3) can detect hardware or firmware errors. One
error recovery mode is to report the RTE_ETH_EVENT_INTR_RESET event and
wait for the application to invoke rte_eth_dev_reset() to recover the
port. However, this mode has the following weaknesses:

1) Due to differences in hardware and software design, the recovery
process of some NIC ports requires multiple handshakes with the firmware
and the PF (when the port is a VF). It takes a long time to complete the
entire operation for one port. If multiple ports (for example, multiple
VFs of a PF) are reset at the same time, other VFs may fail to be reset,
because the reset processing is serial: the previous VFs must be
processed before the subsequent VFs.

2) The impact on the application layer is great: it must stop the
working queues, stop calling the Rx and Tx functions, call
rte_eth_dev_reset(), and then set everything up again.

This patch introduces proactive error handling mode, in which the PMD
tries to recover from the errors itself. During this process, the PMD
sets the data path pointers to dummy functions (which prevents a crash)
and makes sure that control path operations fail with return code
-EBUSY.

Because the PMD recovers automatically, the application only senses
that the data flow is disconnected for a while and that the control API
returns an error during this period.

In order to sense the error happening/recovering, three events are
introduced:

1) RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD
detected an error and that recovery is starting. Upon receiving the
event, the application should not invoke any control path APIs until it
receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
RTE_ETH_EVENT_RECOVERY_FAILED event.

2) RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the
port recovered successfully from the error. The PMD has already
re-configured the port, and the effect is the same as that of a restart
operation.

3) RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the
port failed to recover from the error. The port is not usable anymore,
and the application should close it.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test-pmd/config.c                   |  3 ++
 doc/guides/prog_guide/poll_mode_drv.rst | 41 +++++++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 +++++
 lib/ethdev/rte_ethdev.h                 | 59 +++++++++++++++++++++++++
 4 files changed, 115 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 4cddcd0bf7..0f7dbd698f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -929,6 +929,9 @@ port_infos_display(portid_t port_id)
 	case RTE_ETH_ERROR_HANDLE_MODE_PASSIVE:
 		printf("passive\n");
 		break;
+	case RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE:
+		printf("proactive\n");
+		break;
 	default:
 		printf("unknown\n");
 		break;
diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..7a9c43d1cb 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,44 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
+
+Proactive Error Handling Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``. Unlike PASSIVE
+mode, where the application invokes recovery, in PROACTIVE mode the PMD recovers
+from the error automatically, and only a small amount of work is required from
+the application.
+
+During error detection and automatic recovery, the PMD sets the data path
+pointers to dummy functions (which prevents a crash), and also makes sure that
+control path operations fail with return code -EBUSY.
+
+Because the PMD recovers automatically, the application only notices that the
+data flow is interrupted for a while and that control APIs return an error
+during this period.
+
+To let the application detect the error and its recovery, as well as restore
+some additional configuration, three events were introduced:
+
+* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that the PMD
+  detected an error and recovery is starting. Upon receiving the event, the
+  application should not invoke any control path APIs until it receives the
+  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
+
+* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that the PMD
+  recovered successfully from the error. The PMD has already re-configured the
+  port, and the effect is the same as that of a restart operation.
+
+* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that the PMD
+  failed to recover from the error. The port should not be used anymore, and
+  the application should close the port.
+
+.. note::
+        * Before the PMD reports the recovery result, the PMD may report the
+          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
+          may occur during the recovery.
+        * The error handling mode supported by the PMD can be reported through
+          the ``rte_eth_dev_info_get`` API.
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 2da8bc9661..a3700bbb34 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -124,6 +124,18 @@ New Features
   Added new flow action which allows application to re-route packets
   directly to the kernel without software involvement.
 
+* **Added proactive error handling mode for ethdev.**
+
+  Added proactive error handling mode for ethdev, and three events were
+  introduced:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` for the PMD to report
+    that the port recovered successfully from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_FAILED`` for the PMD to report
+    that the port failed to recover from an error.
+
 * **Updated AF_XDP driver.**
 
   * Made compatible with libbpf v0.8.0 (when used with libxdp).
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 5de8e13866..46ecc9a0fe 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1700,6 +1700,12 @@ enum rte_eth_err_handle_mode {
 	 * application invoke @see rte_eth_dev_reset to recover the port.
 	 */
 	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+	/** Proactive error handling: after the PMD detects that a reset is
+	 * required, the PMD reports @see RTE_ETH_EVENT_ERR_RECOVERING event,
+	 * does the recovery internally, and finally reports the recovery result
+	 * event (@see RTE_ETH_EVENT_RECOVERY_*).
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE,
 };
 
 /**
@@ -3886,6 +3892,59 @@ enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error.
+	 * If PMD supports proactive error recovery, it should trigger this
+	 * event to notify application that it detected an error and the
+	 * recovery is being started. Upon receiving the event, the application
+	 * should not invoke any control path APIs (such as
+	 * rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
+	 * event.
+	 * The PMD will set the data path pointers to dummy functions, and
+	 * re-set them to non-dummy functions before reporting the
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS event. It means that the application
+	 * cannot send or receive any packets during this period.
+	 * @note Before the PMD reports the recovery result, the PMD may report
+	 * the RTE_ETH_EVENT_ERR_RECOVERING event again, because a larger error
+	 * may occur during the recovery.
+	 */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovered successfully from the error.
+	 * The PMD has already re-configured the port, and the effect is the
+	 * same as that of a restart operation.
+	 * a) the following configuration will be retained: (alphabetically)
+	 *    - DCB configuration
+	 *    - FEC configuration
+	 *    - Flow control configuration
+	 *    - LRO configuration
+	 *    - LSC configuration
+	 *    - MTU
+	 *    - MAC address (default and those supplied by MAC address array)
+	 *    - Promiscuous and allmulticast mode
+	 *    - PTP configuration
+	 *    - Queue (Rx/Tx) settings
+	 *    - Queue statistics mappings
+	 *    - RSS configuration by rte_eth_dev_rss_xxx() family
+	 *    - Rx checksum configuration
+	 *    - Rx interrupt settings
+	 *    - Traffic management configuration
+	 *    - VLAN configuration (including filtering, tpid, strip, pvid)
+	 *    - VMDq configuration
+	 * b) the following configuration may or may not be retained, depending
+	 *    on the device capabilities:
+	 *    - flow rules
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
+	 *    - shared flow objects
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
+	 * c) the other configuration will not be stored and will need to be
+	 *    re-configured.
+	 */
+	RTE_ETH_EVENT_RECOVERY_SUCCESS,
+	/** Port failed to recover from the error.
+	 * It means that the port should not be used anymore. The application
+	 * should close the port.
+	 */
+	RTE_ETH_EVENT_RECOVERY_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.17.1



* [PATCH v13 3/5] app/testpmd: support error handling mode event
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 1/5] ethdev: add error handling mode to device info Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 2/5] ethdev: support proactive error handling mode Chengwen Feng
@ 2022-10-13 12:42   ` Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 4/5] net/hns3: support proactive error handling mode Chengwen Feng
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-13 12:42 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support in testpmd for processing the error handling
mode events.
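
The --print-event and --mask-event options extended by this patch keep
one bit per event type in a 32-bit mask. A minimal sketch of that
bookkeeping follows. It is an illustration under assumptions, not the
testpmd code: the sketch_* names and the enum values are stand-ins so
that it builds without DPDK, and only the three new option names plus
"all" are handled.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in event numbering (assumption): the real values come from
 * enum rte_eth_event_type in rte_ethdev.h and are different. */
enum sketch_event {
	SKETCH_EVENT_ERR_RECOVERING,
	SKETCH_EVENT_RECOVERY_SUCCESS,
	SKETCH_EVENT_RECOVERY_FAILED,
};

/* One bit per event; a set bit means "print this event". */
uint32_t sketch_print_mask;

/* Map an option value to its bit, then set or clear it in the mask.
 * Returns 0 on success and -1 for an unknown event name. */
int
sketch_parse_event(const char *name, int enable)
{
	uint32_t mask;

	if (!strcmp(name, "err_recovering"))
		mask = UINT32_C(1) << SKETCH_EVENT_ERR_RECOVERING;
	else if (!strcmp(name, "recovery_success"))
		mask = UINT32_C(1) << SKETCH_EVENT_RECOVERY_SUCCESS;
	else if (!strcmp(name, "recovery_failed"))
		mask = UINT32_C(1) << SKETCH_EVENT_RECOVERY_FAILED;
	else if (!strcmp(name, "all"))
		mask = ~UINT32_C(0);
	else
		return -1;

	if (enable)
		sketch_print_mask |= mask;  /* --print-event */
	else
		sketch_print_mask &= ~mask; /* --mask-event */
	return 0;
}

/* Non-zero when the given event should be printed. */
int
sketch_should_print(enum sketch_event event)
{
	return (sketch_print_mask & (UINT32_C(1) << event)) != 0;
}
```

Because each event owns a single bit, printing of the failure event can
be masked while the other two recovery events stay visible.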

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/parameters.c | 10 ++++++++--
 app/test-pmd/testpmd.c    |  8 +++++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index ff760460ec..b56383dc4a 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,9 +167,9 @@ usage(char* progname)
 	printf("  --no-rmv-interrupt: disable device removal interrupt.\n");
 	printf("  --bitrate-stats=N: set the logical core N to perform "
 		"bit-rate calculation.\n");
-	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "enable print of designated event or all of them.\n");
-	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovery_success|recovery_failed|all>: "
 	       "disable print of designated event or all of them.\n");
 	printf("  --flow-isolate-all: "
 	       "requests flow API isolated mode on all ports at initialization time.\n");
@@ -453,6 +453,12 @@ parse_event_printing_config(const char *optarg, int enable)
 		mask = UINT32_C(1) << RTE_ETH_EVENT_DESTROY;
 	else if (!strcmp(optarg, "flow_aged"))
 		mask = UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED;
+	else if (!strcmp(optarg, "err_recovering"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING;
+	else if (!strcmp(optarg, "recovery_success"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS;
+	else if (!strcmp(optarg, "recovery_failed"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED;
 	else if (!strcmp(optarg, "all"))
 		mask = ~UINT32_C(0);
 	else {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 5b0f0838dc..f4f1888446 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -426,6 +426,9 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
 	[RTE_ETH_EVENT_RX_AVAIL_THRESH] = "RxQ available descriptors threshold reached",
+	[RTE_ETH_EVENT_ERR_RECOVERING] = "error recovering",
+	[RTE_ETH_EVENT_RECOVERY_SUCCESS] = "error recovery successful",
+	[RTE_ETH_EVENT_RECOVERY_FAILED] = "error recovery failed",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -440,7 +443,10 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_SUCCESS) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERY_FAILED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.17.1



* [PATCH v13 4/5] net/hns3: support proactive error handling mode
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
                     ` (2 preceding siblings ...)
  2022-10-13 12:42   ` [PATCH v13 3/5] app/testpmd: support error handling mode event Chengwen Feng
@ 2022-10-13 12:42   ` Chengwen Feng
  2022-10-13 12:42   ` [PATCH v13 5/5] net/bnxt: " Chengwen Feng
  2022-10-17  7:42   ` [PATCH v13 0/5] support " Andrew Rybchenko
  5 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-13 12:42 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

This patch adds support for proactive error handling mode to the hns3
PMD.
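
The reporting order in proactive mode can be sketched as follows: one
ERR_RECOVERING event when the reset starts, then exactly one result
event. This is a simplified illustration, not the hns3 code: the drv_*
names are hypothetical, drv_notify() stands in for
rte_eth_dev_callback_process(), and the whole reset is collapsed into a
single function, whereas the driver spreads it over several reset
stages.

```c
#include <stddef.h>

/* Stand-in event values (assumption); the real ones are the
 * RTE_ETH_EVENT_* enumerators from rte_ethdev.h. */
enum drv_event {
	DRV_EVENT_ERR_RECOVERING,
	DRV_EVENT_RECOVERY_SUCCESS,
	DRV_EVENT_RECOVERY_FAILED,
};

/* Small log of reported events, so the order can be inspected. */
#define DRV_LOG_MAX 8
enum drv_event drv_log[DRV_LOG_MAX];
size_t drv_log_len;

/* Stand-in for rte_eth_dev_callback_process(): record the event. */
void
drv_notify(enum drv_event event)
{
	if (drv_log_len < DRV_LOG_MAX)
		drv_log[drv_log_len++] = event;
}

/* Hypothetical hardware reset routines: one succeeds, one fails. */
int drv_hw_reset_ok(void) { return 0; }
int drv_hw_reset_broken(void) { return -1; }

/* Report the start of recovery, run the reset, report the result. */
int
drv_recover(int (*hw_reset)(void))
{
	int ret;

	drv_notify(DRV_EVENT_ERR_RECOVERING);
	ret = hw_reset();
	drv_notify(ret == 0 ? DRV_EVENT_RECOVERY_SUCCESS :
			      DRV_EVENT_RECOVERY_FAILED);
	return ret;
}
```

Calling drv_recover(drv_hw_reset_ok) leaves ERR_RECOVERING followed by
RECOVERY_SUCCESS in drv_log, which is the sequence an application
callback would observe.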

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Dongdong Liu <liudongdong3@huawei.com>
---
 drivers/net/hns3/hns3_common.c |  2 ++
 drivers/net/hns3/hns3_intr.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/drivers/net/hns3/hns3_common.c b/drivers/net/hns3/hns3_common.c
index 14291193cb..7adc6a4972 100644
--- a/drivers/net/hns3/hns3_common.c
+++ b/drivers/net/hns3/hns3_common.c
@@ -149,6 +149,8 @@ hns3_dev_infos_get(struct rte_eth_dev *eth_dev, struct rte_eth_dev_info *info)
 		info->max_mac_addrs = HNS3_VF_UC_MACADDR_NUM;
 	}
 
+	info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
diff --git a/drivers/net/hns3/hns3_intr.c b/drivers/net/hns3/hns3_intr.c
index 57679254ee..44a1119415 100644
--- a/drivers/net/hns3/hns3_intr.c
+++ b/drivers/net/hns3/hns3_intr.c
@@ -1480,6 +1480,27 @@ static const struct hns3_hw_err_type hns3_hw_error_type[] = {
 	}
 };
 
+static void
+hns3_report_reset_begin(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_ERR_RECOVERING, NULL);
+}
+
+static void
+hns3_report_reset_success(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_SUCCESS, NULL);
+}
+
+static void
+hns3_report_reset_failed(struct hns3_hw *hw)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[hw->data->port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
+}
+
 static int
 hns3_config_ncsi_hw_err_int(struct hns3_adapter *hns, bool en)
 {
@@ -2642,6 +2663,7 @@ hns3_reset_pre(struct hns3_adapter *hns)
 	if (hw->reset.stage == RESET_STAGE_NONE) {
 		__atomic_store_n(&hns->hw.reset.resetting, 1, __ATOMIC_RELAXED);
 		hw->reset.stage = RESET_STAGE_DOWN;
+		hns3_report_reset_begin(hw);
 		ret = hw->reset.ops->stop_service(hns);
 		hns3_clock_gettime(&tv);
 		if (ret) {
@@ -2751,6 +2773,7 @@ hns3_reset_post(struct hns3_adapter *hns)
 			  hns3_clock_calctime_ms(&tv_delta),
 			  tv.tv_sec, tv.tv_usec);
 		hw->reset.level = HNS3_NONE_RESET;
+		hns3_report_reset_success(hw);
 	}
 	return 0;
 
@@ -2796,6 +2819,7 @@ hns3_reset_fail_handle(struct hns3_adapter *hns)
 		  hns3_clock_calctime_ms(&tv_delta),
 		  tv.tv_sec, tv.tv_usec);
 	hw->reset.level = HNS3_NONE_RESET;
+	hns3_report_reset_failed(hw);
 }
 
 /*
-- 
2.17.1



* [PATCH v13 5/5] net/bnxt: support proactive error handling mode
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
                     ` (3 preceding siblings ...)
  2022-10-13 12:42   ` [PATCH v13 4/5] net/hns3: support proactive error handling mode Chengwen Feng
@ 2022-10-13 12:42   ` Chengwen Feng
  2022-10-17  7:42   ` [PATCH v13 0/5] support " Andrew Rybchenko
  5 siblings, 0 replies; 41+ messages in thread
From: Chengwen Feng @ 2022-10-13 12:42 UTC (permalink / raw)
  To: thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr,
	Andrew.Rybchenko

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for proactive error handling mode to the bnxt
PMD.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/bnxt/bnxt_cpr.c    |  4 ++++
 drivers/net/bnxt/bnxt_ethdev.c | 13 ++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 99af0f9e87..5bb376d4d5 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -180,6 +180,10 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			return;
 		}
 
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_ERR_RECOVERING,
+					     NULL);
+
 		pthread_mutex_lock(&bp->err_recovery_lock);
 		event_data = data1;
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index 3dfe9efc09..b3de490d36 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -1063,6 +1063,8 @@ static int bnxt_dev_info_get_op(struct rte_eth_dev *eth_dev,
 	dev_info->vmdq_pool_base = 0;
 	dev_info->vmdq_queue_base = 0;
 
+	dev_info->err_handle_mode = RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE;
+
 	return 0;
 }
 
@@ -4382,13 +4384,18 @@ static void bnxt_dev_recover(void *arg)
 	PMD_DRV_LOG(INFO, "Port: %u Recovered from FW reset\n",
 		    bp->eth_dev->data->port_id);
 	pthread_mutex_unlock(&bp->err_recovery_lock);
-
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_SUCCESS,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop(bp->eth_dev);
 err:
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERY_FAILED,
+				     NULL);
 	if (bp->eth_dev->data->dev_conf.intr_conf.rmv)
 		rte_eth_dev_callback_process(bp->eth_dev,
 					     RTE_ETH_EVENT_INTR_RMV,
@@ -4560,6 +4567,10 @@ static void bnxt_check_fw_health(void *arg)
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
 
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_ERR_RECOVERING,
+				     NULL);
+
 	if (bnxt_is_primary_func(bp))
 		wait_msec = info->primary_func_wait_period;
 	else
-- 
2.17.1



* Re: [PATCH v12 2/5] ethdev: support proactive error handling mode
  2022-10-13  8:58     ` Andrew Rybchenko
@ 2022-10-13 12:50       ` fengchengwen
  0 siblings, 0 replies; 41+ messages in thread
From: fengchengwen @ 2022-10-13 12:50 UTC (permalink / raw)
  To: Andrew Rybchenko, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr

Hi Andrew,

  I reworked part of the rst according to your comments and sent it as v13; please take a look.

Thanks.

On 2022/10/13 16:58, Andrew Rybchenko wrote:
> On 10/12/22 06:45, Chengwen Feng wrote:
>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>
>> Some PMDs (e.g. hns3) could detect hardware or firmware errors, one
>> error recovery mode is to report RTE_ETH_EVENT_INTR_RESET event, and
>> wait for application invoke rte_eth_dev_reset() to recover the port,
>> however, this mode has the following weaknesses:
>>
>> 1) Due to different hardware and software design, some NIC port recovery
>> process requires multiple handshakes with the firmware and PF (when the
>> port is VF). It takes a long time to complete the entire operation for
>> one port, If multiple ports (for example, multiple VFs of a PF) are
>> reset at the same time, other VFs may fail to be reset. (Because the
>> reset processing is serial, the previous VFs must be processed before
>> the subsequent VFs).
>>
>> 2) The impact on the application layer is great, and it should stop
>> working queues, stop calling Rx and Tx functions, and then call
>> rte_eth_dev_reset(), and re-setup all again.
>>
>> This patch introduces proactive error handling mode, the PMD will try
>> to recover from the errors itself. In this process, the PMD sets the
>> data path pointers to dummy functions (which will prevent the crash),
>> and also make sure the control path operations failed with retcode
>> -EBUSY.
>>
>> Because the PMD recovers automatically, the application can only sense
>> that the data flow is disconnected for a while and the control API
>> returns an error in this period.
>>
>> In order to sense the error happening/recovering, three events were
>> introduced:
>>
>> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
>> detected an error and the recovery is being started. Upon receiving the
>> event, the application should not invoke any control path APIs until
>> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>
>> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
>> it recovers successful from the error, the PMD already re-configures the
>> port, and the effect is the same as that of the restart operation.
>>
>> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> recovers failed from the error, the port should not usable anymore. The
>> application should close the port.
>>
>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> 
> With few nits below,
> 
> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> 
> [snip]
> 
>> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
>> index 9d081b1cba..73941a74bd 100644
>> --- a/doc/guides/prog_guide/poll_mode_drv.rst
>> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
>> @@ -627,3 +627,41 @@ by application.
>>   The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>>   the application to handle reset event. It is duty of application to
>>   handle all synchronization before it calls rte_eth_dev_reset().
>> +
>> +The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
>> +
>> +Proactive Error Handling Mode
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +If PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, it means once detect
>> +hardware or firmware errors, the PMD will try to recover from the errors. In
>> +this process, the PMD sets the data path pointers to dummy functions (which
>> +will prevent the crash), and also make sure the control path operations failed
>> +with retcode -EBUSY.
>> +
>> +Also in this process, from the perspective of application, services are
>> +affected. For example, the Rx/Tx bust APIs cannot receive and send packets,
> 
> bust -> burst
> 
>> +and the control plane API return failure.
> 
> I think we need to highlight here that the key advantage of
> proactive error recovery is that it requires nothing from PMD by
> default. The recovery simply happens.
> 
>> +
>> +In some service scenarios, application needs to be aware of the event to
>> +determine whether to migrate services. So three events were introduced:
>> +
>> +* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it detected
>> +  an error and the recovery is being started. Upon receiving the event, the
>> +  application should not invoke any control path APIs until receiving
>> +  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>> +
>> +* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that it
>> +  recovers successful from the error, the PMD already re-configures the port,
>> +  and the effect is the same as that of the restart operation.
>> +
>> +* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> +  recovers failed from the error, the port should not usable anymore. the
>> +  application should close the port.
>> +
>> +.. note::
>> +        * Before the PMD reports the recovery result, the PMD may report the
>> +          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
>> +          may occur during the recovery.
>> +        * The error handling mode supported by the PMD can be reported through
>> +          the ``rte_eth_dev_info_get`` API.
>> diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
>> +     *    - LRO configuration
>> +     *    - LSC configuration
>> +     *    - MTU
>> +     *    - Mac address (default and those supplied by MAC address array)
>> +     *    - Promiscuous and allmulticast mode
>> +     *    - PTP configuration
>> +     *    - Queue (Rx/Tx) settings
>> +     *    - Queue statistics mappings
>> +     *    - RSS configuration by rte_eth_dev_rss_xxx() family
>> +     *    - Rx checksum configuration
>> +     *    - Rx interrupt settings
>> +     *    - Traffic management configuration
>> +     *    - VLAN configuration (including filtering, tpid, strip, pvid)
>> +     *    - VMDq configuration
>> +     * b) the following configuration maybe retained or not depending on the
>> +     *    device capabilities:
>> +     *    - flow rules
>> +     *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
>> +     *    - shared flow objects
>> +     *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
>> +     * c) the other configuration will not be stored and will need to be
>> +     *    re-configured.
>> +     */
>> +    RTE_ETH_EVENT_RECOVERY_SUCCESS,
>> +    /** Port recovers failed from the error.
>> +     * It means that the port should not usable anymore. The application
>> +     * should close the port.
>> +     */
>> +    RTE_ETH_EVENT_RECOVERY_FAILED,
>>       RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>   };
> 
> [snip]
> 
> 
> .


* Re: [PATCH v13 0/5] support error handling mode
  2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
                     ` (4 preceding siblings ...)
  2022-10-13 12:42   ` [PATCH v13 5/5] net/bnxt: " Chengwen Feng
@ 2022-10-17  7:42   ` Andrew Rybchenko
  5 siblings, 0 replies; 41+ messages in thread
From: Andrew Rybchenko @ 2022-10-17  7:42 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit, ferruh.yigit
  Cc: dev, kalesh-anakkur.purayil, somnath.kotur, ajit.khaparde, mdr

On 10/13/22 15:42, Chengwen Feng wrote:
> This patchset introduce error handling mode concept, the supported modes
> are as follows:
> 
> 1) PASSIVE: passive error handling, after the PMD detect that a reset
> is required, the PMD reports RTE_ETH_EVENT_INTR_RESET event, and
> application invoke rte_eth_dev_reset to recover the port.
>    
> 2) PROACTIVE: proactive error handling, after the PMD detect that a reset
> is required, the PMD reports RTE_ETH_EVENT_ERR_RECOVERING event, and do
> recovery internally, finally, reports the recovery result event.
> 
> Chengwen Feng (2):
>    ethdev: add error handling mode to device info
>    net/hns3: support proactive error handling mode
> 
> Kalesh AP (3):
>    ethdev: support proactive error handling mode
>    app/testpmd: support error handling mode event
>    net/bnxt: support proactive error handling mode
> 
> ---
> v13: Address comments from Andrew (rework part of rst).
> v12: Address comments from Andrew.
> v11: Fix clang-static fail due wrong experimental placement.
> v10: Accurately describe the recovery success scenario so that
>       addressed comments from Ferruh.
> v9: Introduce error handling mode concept.
>      Addressed comments from Thomas and Ray.
> v8: Addressed comments from Thomas and Ferruh.
>      Also introduced RECOVER_FAIL event.
>      Add hns3 driver patch.
> v7: Addressed comments from Thomas and Andrew.
> v6: Addressed comments from Asaf Penso.
>      1. Updated 20.11 release notes with the new events added.
>      2. updated testpmd parse_event_printing_config function.
> v5: Addressed comments from Ophir Munk.
>      1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
>      2. Fixed testpmd logs.
>      3. Documented the new recovery events.
> v4: Addressed comments from Thomas Monjalon
>      1. Added doxygen comments about new events.
> V3: Fixed a typo in commit log.
> V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
>      RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.
> 
>   app/test-pmd/config.c                   | 15 +++++
>   app/test-pmd/parameters.c               | 10 +++-
>   app/test-pmd/testpmd.c                  |  8 ++-
>   doc/guides/prog_guide/poll_mode_drv.rst | 41 +++++++++++++
>   doc/guides/rel_notes/release_22_11.rst  | 12 ++++
>   drivers/net/bnxt/bnxt_cpr.c             |  4 ++
>   drivers/net/bnxt/bnxt_ethdev.c          | 13 ++++-
>   drivers/net/e1000/igb_ethdev.c          |  2 +
>   drivers/net/ena/ena_ethdev.c            |  2 +
>   drivers/net/hns3/hns3_common.c          |  2 +
>   drivers/net/hns3/hns3_intr.c            | 24 ++++++++
>   drivers/net/iavf/iavf_ethdev.c          |  2 +
>   drivers/net/ixgbe/ixgbe_ethdev.c        |  2 +
>   drivers/net/txgbe/txgbe_ethdev_vf.c     |  2 +
>   lib/ethdev/rte_ethdev.h                 | 77 +++++++++++++++++++++++++
>   15 files changed, 212 insertions(+), 4 deletions(-)
> 

Applied to dpdk-next-net/main, thanks.



Thread overview: 41+ messages
     [not found] <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
2022-09-22  7:41 ` [PATCH v9 0/5] support error handling mode Chengwen Feng
2022-09-22  7:41   ` [PATCH v9 1/5] ethdev: support get port " Chengwen Feng
2022-10-03 17:35     ` Ferruh Yigit
2022-10-05  1:56       ` fengchengwen
2022-09-22  7:41   ` [PATCH v9 2/5] ethdev: support proactive " Chengwen Feng
2022-10-03 17:35     ` Ferruh Yigit
2022-09-22  7:41   ` [PATCH v9 3/5] app/testpmd: support error handling mode event Chengwen Feng
2022-09-22  7:41   ` [PATCH v9 4/5] net/hns3: support proactive error handling mode Chengwen Feng
2022-09-22  7:41   ` [PATCH v9 5/5] net/bnxt: " Chengwen Feng
2022-10-09  7:53 ` [PATCH v10 0/5] support " Chengwen Feng
2022-10-09  7:53   ` [PATCH v10 1/5] ethdev: support get port " Chengwen Feng
2022-10-09  7:53   ` [PATCH v10 2/5] ethdev: support proactive " Chengwen Feng
2022-10-09  7:53   ` [PATCH v10 3/5] app/testpmd: support error handling mode event Chengwen Feng
2022-10-09  7:53   ` [PATCH v10 4/5] net/hns3: support proactive error handling mode Chengwen Feng
2022-10-09  7:53   ` [PATCH v10 5/5] net/bnxt: " Chengwen Feng
2022-10-09  9:10 ` [PATCH v11 0/5] support " Chengwen Feng
2022-10-09  9:10   ` [PATCH v11 1/5] ethdev: support get port " Chengwen Feng
2022-10-10  8:38     ` Andrew Rybchenko
2022-10-10  8:44     ` Andrew Rybchenko
2022-10-09  9:10   ` [PATCH v11 2/5] ethdev: support proactive " Chengwen Feng
2022-10-10  8:47     ` Andrew Rybchenko
2022-10-11 14:48       ` fengchengwen
2022-10-09  9:10   ` [PATCH v11 3/5] app/testpmd: support error handling mode event Chengwen Feng
2022-10-09  9:10   ` [PATCH v11 4/5] net/hns3: support proactive error handling mode Chengwen Feng
2022-10-09 11:05     ` Dongdong Liu
2022-10-09  9:10   ` [PATCH v11 5/5] net/bnxt: " Chengwen Feng
2022-10-12  3:45 ` [PATCH v12 0/5] support " Chengwen Feng
2022-10-12  3:45   ` [PATCH v12 1/5] ethdev: add error handling mode to device info Chengwen Feng
2022-10-12  3:45   ` [PATCH v12 2/5] ethdev: support proactive error handling mode Chengwen Feng
2022-10-13  8:58     ` Andrew Rybchenko
2022-10-13 12:50       ` fengchengwen
2022-10-12  3:45   ` [PATCH v12 3/5] app/testpmd: support error handling mode event Chengwen Feng
2022-10-12  3:45   ` [PATCH v12 4/5] net/hns3: support proactive error handling mode Chengwen Feng
2022-10-12  3:45   ` [PATCH v12 5/5] net/bnxt: " Chengwen Feng
2022-10-13 12:42 ` [PATCH v13 0/5] support " Chengwen Feng
2022-10-13 12:42   ` [PATCH v13 1/5] ethdev: add error handling mode to device info Chengwen Feng
2022-10-13 12:42   ` [PATCH v13 2/5] ethdev: support proactive error handling mode Chengwen Feng
2022-10-13 12:42   ` [PATCH v13 3/5] app/testpmd: support error handling mode event Chengwen Feng
2022-10-13 12:42   ` [PATCH v13 4/5] net/hns3: support proactive error handling mode Chengwen Feng
2022-10-13 12:42   ` [PATCH v13 5/5] net/bnxt: " Chengwen Feng
2022-10-17  7:42   ` [PATCH v13 0/5] support " Andrew Rybchenko
