All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thomas Monjalon <thomas@monjalon.net>
To: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>,
	Ajit Khaparde <ajit.khaparde@broadcom.com>
Cc: Ferruh Yigit <ferruh.yigit@intel.com>,
	dev@dpdk.org, Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>,
	Konstantin Ananyev <konstantin.ananyev@intel.com>,
	Asaf Penso <asafp@nvidia.com>,
	Stephen Hemminger <stephen@networkplumber.org>,
	"jerinj@marvell.com" <jerinj@marvell.com>
Subject: Re: [dpdk-dev] [PATCH v7 1/4] ethdev: support device reset and recovery events
Date: Thu, 10 Feb 2022 23:42:27 +0100	[thread overview]
Message-ID: <2837453.o0KrE1Onz3@thomas> (raw)
In-Reply-To: <CACZ4nhuTX4dvDJvfq7edSbbJ7X+sE1uvTS0zS6uU8+qRhQCE6A@mail.gmail.com>

03/02/2022 21:28, Ajit Khaparde:
> On Tue, Feb 1, 2022 at 5:20 AM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> > On 2/1/2022 1:09 PM, Kalesh Anakkur Purayil wrote:
> > > On Tue, Feb 1, 2022 at 5:41 PM Ferruh Yigit <ferruh.yigit@intel.com <mailto:ferruh.yigit@intel.com>> wrote:
> > >     On 1/28/2022 12:48 PM, Kalesh A P wrote:
> > >      > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com <mailto:kalesh-anakkur.purayil@broadcom.com>>
> > >      > --- a/doc/guides/prog_guide/poll_mode_drv.rst
> > >      > +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> > >      > +Error recovery support
> > >      > +~~~~~~~~~~~~~~~~~~~~~~
> > >      > +
> > >      > +When the PMD detects a FW reset or error condition, it may try to recover
> > >      > +from the error without needing the application intervention. In such cases,

Wrong, the application needs to react on the event.

> > >      > +PMD would need events to notify the application that it is undergoing
> > >      > +an error recovery.
> > >      > +
> > >      > +The PMD should trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> > >      > +application that PMD detected a FW reset or FW error condition. PMD may
> > >      > +try to recover from the error by itself. Data path may be quiesced and

PMD may try to recover by itself but the application must take action.

> > >      > +control path operations may fail during the recovery period. The application
> > >      > +should stop polling till it receives RTE_ETH_EVENT_RECOVERED event from the PMD.

The application should stop polling and doing anything else with the device,
otherwise it will receive an error code during the recovery,
meaning until the RECOVERED event is received.
Said like that, it is less vague, you see.

> > >     Between the time FW error occurred and the application receive the RECOVERING event,
> > >     datapath will continue to poll and application may call control APIs, so the event
> > >     really is not solving the issue and driver somehow should be sure this won't crash
> > >     the application, in that case not sure about the benefit of this event.
> > >
> > > [Kalesh]: As soon as the driver detects a FW dead or reset condition, it sets the fastpath pointers to dummy functions. This will prevent the crash. All control path operations would fail with -EBUSY. This change is already there in bnxt PMD. This event is a notification to the application that the PMD is recovering from a FW error condition so that it can stop polling and issue control path operations.
> >
> > This was my point indeed. PMD can't rely on this event and should take actions (for both
> > datapath and control path) to prevent possible crash/error. If that is the case event
> > doesn't have that much benefit since PMD already has to protect itself.
> 
> This is an attempt to prevent PMD and applications from encountering
> unexpected situations because of an error in hardware or firmware.
> The unexpected scenarios can include lockups to segfaults apart from
> other situations.
> 
> When hardware or firmware encounters and error, propagating it to the
> application for it to react can be slow, time consuming and could be
> vendor specific.
> 
> Instead in the case of Broadcom devices we are trying to contain the
> detection and recovery within the hardware, firmware mostly and to a
> small extent driver framework.
> In the case of kernel drivers, this allows avoiding a system reset as
> much as possible while in the case of DPDK PMD, it can prevent
> application reloads or ungraceful exits.
> 
> The PMD generated notification to the application will be more like a
> "For Your Information" for most of the applications.

The most important for the application is to be tolerant to the errors
returned in the datapath and in control API.

> But applications can decide to act on the notification and take
> specific steps - like flushing or deleting all the existing flow
> rules, meters etc.. without waiting for success or failure indication.
> Applications can also re-offload the flow rules, recreate meters based
> on the notification allowing them to sync with the fresh hardware,
> firmware state.

We need to know what has to be recreated.
I think we should keep the same rule as for a restart.

> We think other PMD can also take advantage of these notifications to
> provide better reliability, accessibility and serviceability in
> general.

Indeed others can benefit of some mechanisms if they are common
*and well explained*.

[...]
> > >      > +The PMD should trigger RTE_ETH_EVENT_RECOVERED event to notify the application

s/should/must/

> > >      > +that the it has recovered from the error condition. PMD re-configures the port
> > >      > +to the state prior to the error condition. Control path and data path are up now.
> > >      > +Since the device has undergone a reset, flow rules offloaded prior to reset
> > >      > +may be lost and the application should recreate the rules again.

Please refer to the restart situation, including these capabilities:

/** Device supports keeping flow rules across restart. */
#define RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP         RTE_BIT64(3)
/** Device supports keeping shared flow objects across restart. */
#define RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP RTE_BIT64(4)

> > >     I think the most difficult part here is clarify what application should do
> > >     when this event received consistent for all devices, "flow rules may be lost"
> > >     looks very vague to me.

This is the rule for a restart:

 * Please note that some configuration is not stored between calls to
 * rte_eth_dev_stop()/rte_eth_dev_start(). The following configuration will
 * be retained:
 *
 *     - MTU
 *     - flow control settings
 *     - receive mode configuration (promiscuous mode, all-multicast mode,
 *       hardware checksum mode, RSS/VMDq settings etc.)
 *     - VLAN filtering configuration
 *     - default MAC address
 *     - MAC addresses supplied to MAC address array
 *     - flow director filtering mode (but not filtering rules)
 *     - NIC queue statistics mappings
 *
 * The following configuration may be retained or not
 * depending on the device capabilities:
 *
 *     - flow rules
 *     - flow-related shared objects, e.g. indirect actions
 *
 * Any other configuration will not be stored and will need to be re-entered
 * before a call to rte_eth_dev_start().

> > >     Unless it is not clear for application what to do when this event is received,
> > >     it is not that useful or it will be specific to some PMDs. And I can see it is
> > >     hard to clarify this but perhaps we can define a set of common behavior.
> > >     Not sure what others are thinking.

+1

> > > [Kalesh]: Sure, let's wait for others' opinions as well.

Nothing new, we already did such comment in 2020.

> > >      > +The PMD should trigger RTE_ETH_EVENT_INTR_RMV event to notify the application
> > >      > +that it has failed to recover from the error condition. The device may not be
> > >      > +usable anymore.

Yes good idea to use RMV event in case of failure.
You should update the comment of RTE_ETH_EVENT_INTR_RMV
to list cases when it is fired:
	- device unplugged
	- recovery failure
	- reset failure?

> > >      > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > >      > index 147cc1c..a46819f 100644
> > >      > --- a/lib/ethdev/rte_ethdev.h
> > >      > +++ b/lib/ethdev/rte_ethdev.h
> > >      > @@ -3818,6 +3818,24 @@ enum rte_eth_event_type {
> > >      >       RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > >      >       RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > >      >       RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> > >      > +     RTE_ETH_EVENT_ERR_RECOVERING,
> > >      > +                     /**< port recovering from an error
> > >      > +                      *
> > >      > +                      * PMD detected a FW reset or error condition.
> > >      > +                      * PMD will try to recover from the error.
> > >      > +                      * Data path may be quiesced and Control path operations
> > >      > +                      * may fail at this time.
> > >      > +                      */
> > >
> > >     Please put multi line comments before enum, Andrew did a set of cleanups for these.
> > >
> > > [Kalesh]: Sure, will do.
> > >
> > >
> > >      > +     RTE_ETH_EVENT_RECOVERED,
> > >      > +                     /**< port recovered from an error
> > >      > +                      *
> > >      > +                      * PMD has recovered from the error condition.
> > >      > +                      * Control path and Data path are up now.
> > >      > +                      * PMD re-configures the port to the state prior to the error.
> > >      > +                      * Since the device has undergone a reset, flow rules
> > >      > +                      * offloaded prior to reset may be lost and
> > >      > +                      * the application should recreate the rules again.
> > >      > +                      */

Same as for the RST, the comments need to be very clear,
while being concise here.

I think there are a lot of tips in this thread to make this mechanism
work in a generic way. Please work on a precise new version, thanks.



  reply	other threads:[~2022-02-10 22:42 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event Kalesh A P
2020-03-11 13:20   ` Thomas Monjalon
2020-03-12  3:31     ` Kalesh Anakkur Purayil
2020-03-12  7:29       ` Thomas Monjalon
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 2/3] net/bnxt: notify applications about device reset Kalesh A P
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-03-11 13:19 ` [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Thomas Monjalon
2020-03-12  3:25   ` Kalesh Anakkur Purayil
2020-03-12  7:34     ` Thomas Monjalon
2020-07-03 16:12       ` Ferruh Yigit
2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:50     ` Thomas Monjalon
2020-09-30  8:35       ` Kalesh Anakkur Purayil
2020-09-30  9:31         ` Thomas Monjalon
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-08 10:53     ` Asaf Penso
2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-06 17:25     ` Ophir Munk
2020-10-07  4:46       ` Kalesh Anakkur Purayil
2020-10-07  8:36         ` Ophir Munk
2020-10-07  9:37         ` Ferruh Yigit
2020-10-07 18:42           ` Ajit Khaparde
2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-10-08 10:49     ` Asaf Penso
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-10-11 21:29     ` Thomas Monjalon
2020-10-12  8:09       ` Andrew Rybchenko
2021-02-18 15:32         ` Ferruh Yigit
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 3/3] app/testpmd: handle device recovery event Kalesh A P
2022-01-28 12:48   ` [dpdk-dev] [PATCH v7 0/4] ethdev: error recovery support Kalesh A P
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 1/4] ethdev: support device reset and recovery events Kalesh A P
2022-02-01 12:11       ` Ferruh Yigit
2022-02-01 13:09         ` Kalesh Anakkur Purayil
2022-02-01 13:19           ` Ferruh Yigit
2022-02-03 20:28             ` Ajit Khaparde
2022-02-10 22:42               ` Thomas Monjalon [this message]
2022-02-01 12:52       ` Ferruh Yigit
2022-02-02 11:44         ` Ray Kinsella
2022-02-10 22:16           ` Thomas Monjalon
2022-02-11 10:09             ` Ray Kinsella
2022-02-14 10:16               ` Ray Kinsella
2022-02-14 11:15                 ` Thomas Monjalon
2022-02-14 16:06                   ` Ray Kinsella
2022-02-14 16:25                     ` Thomas Monjalon
2022-02-14 18:27                       ` Ray Kinsella
2022-02-15 13:55                         ` Ray Kinsella
2022-02-15 15:12                           ` Thomas Monjalon
2022-02-15 16:12                             ` Ray Kinsella
2022-05-21 10:33                     ` fengchengwen
2022-05-24 15:11                       ` Ray Kinsella
2022-06-10  0:16                         ` fengchengwen
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 2/4] app/testpmd: handle device recovery event Kalesh A P
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 3/4] net/bnxt: notify applications about device reset/recovery Kalesh A P
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 4/4] doc: update release notes Kalesh A P
2022-02-01 12:12       ` Ferruh Yigit
2022-06-16  9:41     ` [PATCH v8 0/4] ethdev: support error recovery notification Chengwen Feng
2022-06-16  9:41       ` [PATCH v8 1/4] ethdev: support device " Chengwen Feng
2022-06-20 17:42         ` Thomas Monjalon
2022-06-21  1:38           ` fengchengwen
2022-06-21  7:04             ` Thomas Monjalon
2022-09-22  7:53               ` fengchengwen
2022-06-23 15:58         ` Ray Kinsella
2022-06-16  9:41       ` [PATCH v8 2/4] app/testpmd: handle error recovery notification event Chengwen Feng
2022-06-16  9:41       ` [PATCH v8 3/4] net/hns3: support " Chengwen Feng
2022-06-16  9:41       ` [PATCH v8 4/4] net/bnxt: notify applications about device reset/recovery Chengwen Feng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2837453.o0KrE1Onz3@thomas \
    --to=thomas@monjalon.net \
    --cc=ajit.khaparde@broadcom.com \
    --cc=andrew.rybchenko@oktetlabs.ru \
    --cc=asafp@nvidia.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=jerinj@marvell.com \
    --cc=kalesh-anakkur.purayil@broadcom.com \
    --cc=konstantin.ananyev@intel.com \
    --cc=stephen@networkplumber.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.