* [RFC PATCH iproute2-next] System specification health API
@ 2018-09-13  8:18 Eran Ben Elisha
  2018-09-13  8:18 ` [RFC PATCH iproute2-next] man: Add devlink health man page Eran Ben Elisha
  2018-09-13 17:36 ` [RFC PATCH iproute2-next] System specification health API Jakub Kicinski
  0 siblings, 2 replies; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-13  8:18 UTC (permalink / raw)
  To: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck
  Cc: Andrew Lunn, Florian Fainelli, Tal Alon, Ariel Almog, Eran Ben Elisha

The health spec targets real-time alerting, so that we know when
something bad has happened to a PCI device:
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
  information.

The health spec defines sensors that detect malfunctions. Once a sensor is
triggered, actions such as logging and correction can be taken.

The sensors are divided into the following groups:
- Hardware sensor - a sensor which is triggered by the device due to
  malfunction.
- Software sensor - a sensor which is triggered by the software due to
  malfunction.
Both groups of sensors can be triggered either by an error event or by a periodic check.

Actions are the way to handle sensor events. An action belongs to one of the
following groups:
- Dump - SW trace, SW dump, HW trace, HW dump
- Reset - Surgical correction (e.g. modify Q, flush Q, reset of device)
Actions can be performed by SW or HW.

The user can enable or disable sensors and the sensor2action mapping.

This RFC man page patch describes the suggested API of devlink-health in order
to control sensors and actions.
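
As a reading aid, the model described above (sensors, actions, and a per-sensor
sensor2action mapping that the user can switch on or off) can be sketched in a
few lines of Python. This is only an illustration under stated assumptions: the
class name, field names and the "hw"/"sw" strings are hypothetical and are not
part of the proposed devlink API.

```python
from dataclasses import dataclass, field


@dataclass
class Sensor:
    """Hypothetical model of one health sensor (names are illustrative only)."""
    name: str
    kind: str                 # "hw" or "sw", per the two sensor groups above
    enabled: bool = True
    # sensor2action mapping: action name -> whether it runs on trigger
    actions: dict = field(default_factory=dict)

    def trigger(self):
        """Return the names of the actions to run for one sensor event."""
        if not self.enabled:
            return []
        return sorted(name for name, active in self.actions.items() if active)


sensor = Sensor("TX_COMP_ERROR", kind="sw",
                actions={"dump": True, "reset": False})
triggered = sensor.trigger()      # only the active actions are returned

sensor.enabled = False
after_disable = sensor.trigger()  # a disabled sensor triggers nothing
```

Keeping the mapping on the sensor mirrors the "sensor set ... action NAME"
form of the proposed command; a real implementation may organise it differently.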

Eran Ben Elisha (1):
  man: Add devlink health man page

 man/man8/devlink-health.8 | 171 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 man/man8/devlink-health.8

-- 
1.8.3.1


* [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13  8:18 [RFC PATCH iproute2-next] System specification health API Eran Ben Elisha
@ 2018-09-13  8:18 ` Eran Ben Elisha
  2018-09-13 10:27   ` Tobin C. Harding
  2018-09-13 12:08   ` Andrew Lunn
  2018-09-13 17:36 ` [RFC PATCH iproute2-next] System specification health API Jakub Kicinski
  1 sibling, 2 replies; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-13  8:18 UTC (permalink / raw)
  To: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck
  Cc: Andrew Lunn, Florian Fainelli, Tal Alon, Ariel Almog, Eran Ben Elisha

Add a devlink-health man page. The devlink-health tool will control device
health attributes, sensors, actions and logging.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>

-------------------------------------------------------
The rendered man output is pasted here to make the RFC easier to review.

DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)

NAME
       devlink-health - devlink health configuration

SYNOPSIS
       devlink [ OPTIONS ] health  { COMMAND | help }

       OPTIONS := { -V[ersion] | -n[no-nice-names] }

       devlink health show [ DEV ] [ sensor NAME ]

       devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]

       devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }

       devlink health action reinit DEV name NAME

       devlink health help

DESCRIPTION
       devlink-health allows the user to configure how the driver treats unexpected status. It configures the sensors that can trigger health activity and, for each sensor, the follow-up operations, such as reset and dump of info.
       It also sets the health activity termination action.

   devlink health show - Display devlink health sensors and actions attributes
       DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.

           Format is:
             BUS_NAME/BUS_ADDRESS

       sensor NAME - Specifies the devlink sensor to show.

   devlink health sensor set - sets devlink health sensor attributes
       DEV    Specifies the devlink device to show.

       name NAME
              Name of the sensor to set.

       action NAME { active | inactive }
                  Specify which actions to activate and which to deactivate once a sensor is triggered. Actions can be dump, reset, etc.

   devlink health action set - sets devlink action attributes
       DEV    Specifies the devlink device to set.

       name NAME
              Specifies the devlink action to set.

       period PERIOD
              The period on which we limit the amount of performed actions, measured in seconds.

       count COUNT
              The maximum amount of actions performed in a limit time frame.

       fail   { ignore | down }
                  Specify the behavior once count limit was reached.

                  ignore - Ignore errors without execution of any action.

                  down - Driver will remain in nonoperational state.

   devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
       DEV    Specifies the devlink device to set.

       name NAME
              Specifies the devlink action to set.

EXAMPLES
       devlink health show
           Shows the health state of all devlink devices on the system.

       devlink health show pci/0000:01:00.0
           Shows the health state of specified devlink device.

       devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
           Sets TX_COMP_ERROR sensor parameters for a specific device.

       devlink health action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
           Sets health attributes for reset action.

SEE ALSO
       devlink(8), devlink-port(8), devlink-sb(8), devlink-monitor(8), devlink-dev(8),

AUTHOR
       Eran Ben Elisha <eranbe@mellanox.com>

iproute2                                                                                                     15 Aug 2018                                                                                           DEVLINK-HEALTH(8)
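
For reviewers, the intended interaction of period, count and fail can be
sketched as a small rate limiter. This is a minimal sketch, not the driver
logic: the man page does not say whether the window is sliding or fixed, nor
whether "down" is sticky, so the sliding window and the sticky "down" state
below are assumptions, and the class and method names are hypothetical.

```python
from collections import deque


class ActionLimiter:
    """Allow at most `count` runs of an action per `period` seconds."""

    def __init__(self, period, count, fail="ignore"):
        if fail not in ("ignore", "down"):
            raise ValueError("fail must be 'ignore' or 'down'")
        self.period = period
        self.count = count
        self.fail = fail
        self.runs = deque()   # timestamps of runs inside the current window
        self.down = False     # assumption: 'down' stays set once reached

    def on_event(self, now):
        """Return 'run', 'ignore' or 'down' for an event at time `now` (seconds)."""
        if self.down:
            return "down"
        # Assumption: sliding window; drop runs older than `period`.
        while self.runs and now - self.runs[0] >= self.period:
            self.runs.popleft()
        if len(self.runs) < self.count:
            self.runs.append(now)
            return "run"
        if self.fail == "down":
            self.down = True
            return "down"
        return "ignore"


# Mirrors the example: period 3600 count 5 fail ignore
limiter = ActionLimiter(period=3600, count=5, fail="ignore")
first_six = [limiter.on_event(t) for t in range(6)]
after_window = limiter.on_event(3600)

strict = ActionLimiter(period=10, count=1, fail="down")
strict_results = [strict.on_event(t) for t in (0, 1, 100)]
```

With "fail ignore" the sixth event inside the hour is dropped and the action
becomes available again once the oldest run leaves the window; with "fail down"
the sketch never recovers on its own.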
---
 man/man8/devlink-health.8 | 171 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 man/man8/devlink-health.8

diff --git a/man/man8/devlink-health.8 b/man/man8/devlink-health.8
new file mode 100644
index 000000000000..ac28b020be0d
--- /dev/null
+++ b/man/man8/devlink-health.8
@@ -0,0 +1,171 @@
+.TH DEVLINK\-HEALTH 8 "15 Aug 2018" "iproute2" "Linux"
+.SH NAME
+devlink-health \- devlink health configuration
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B devlink
+.RI "[ " OPTIONS " ]"
+.BR health
+.RI  " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.IR OPTIONS " := { "
+\fB\-V\fR[\fIersion\fR] |
+\fB\-n\fR[\fIno-nice-names\fR] }
+
+.ti -8
+.B devlink health show
+.RI "[ " DEV " ]"
+.RI "[ "
+.B sensor
+.IR NAME
+.RI "]"
+
+.ti -8
+.B devlink health sensor set
+.IR DEV
+.B name
+.IR NAME
+.RI "[ "
+.BR action
+.IR NAME
+.RB "{ " active " | " inactive " } ]"
+
+.ti -8
+.B devlink health action set
+.IR DEV
+.B name
+.IR NAME
+.BR period
+.IR PERIOD
+.BR count
+.IR COUNT
+.BR fail " { "
+.IR ignore
+.BR "| "
+.IR down
+.R "} "
+
+.ti -8
+.B devlink health action reinit
+.IR DEV
+.B name
+.IR NAME
+
+.ti -8
+.B devlink health help
+
+.SH "DESCRIPTION"
+.B devlink-health
+allows the user to configure how the driver treats unexpected status. It configures the sensors that can trigger health activity and, for each sensor, the follow-up operations, such as reset and dump of info. It also sets the health activity termination action.
+
+.SS devlink health show - Display devlink health sensors and actions attributes
+.PP
+.B "DEV"
+- Specifies the devlink device to show.
+If this argument is omitted, all devices are listed.
+
+.in +4
+Format is:
+.in +2
+BUS_NAME/BUS_ADDRESS
+
+.PP
+.BR sensor
+.IR "NAME"
+- Specifies the devlink sensor to show.
+
+.SS devlink health sensor set - sets devlink health sensor attributes
+
+.TP
+.B "DEV"
+Specifies the devlink device to show.
+
+.TP
+.BI name " NAME"
+Name of the sensor to set.
+
+.TP
+.BR action
+.IR NAME
+.R "{" active "|" inactive "} "
+.in +4
+Specify which actions to activate and which to deactivate once a sensor is triggered. Actions can be dump, reset, etc.
+
+.SS devlink health action set - sets devlink action attributes
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Specifies the devlink action to set.
+
+.TP
+.BI period " PERIOD"
+The period on which we limit the amount of performed actions, measured in seconds.
+
+.TP
+.BI count " COUNT"
+The maximum amount of actions performed in a limit time frame.
+
+.TP
+.BR fail
+.R "{" ignore "|" down "}"
+.in +4
+Specify the behavior once count limit was reached.
+
+.I ignore
+- Ignore errors without execution of any action.
+
+.I down
+- Driver will remain in nonoperational state.
+
+.SS devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Specifies the devlink action to set.
+
+.SH "EXAMPLES"
+.PP
+devlink health show
+.RS 4
+Shows the health state of all devlink devices on the system.
+.RE
+.PP
+devlink health show pci/0000:01:00.0
+.RS 4
+Shows the health state of specified devlink device.
+.RE
+.PP
+devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
+.RS 4
+Sets TX_COMP_ERROR sensor parameters for a specific device.
+.RE
+.PP
+devlink health action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
+.RS 4
+Sets health attributes for reset action.
+.RE
+
+.SH SEE ALSO
+.BR devlink (8),
+.BR devlink-port (8),
+.BR devlink-sb (8),
+.BR devlink-monitor (8),
+.BR devlink-dev (8),
+.br
+
+.SH AUTHOR
+Eran Ben Elisha <eranbe@mellanox.com>
-- 
1.8.3.1


* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13  8:18 ` [RFC PATCH iproute2-next] man: Add devlink health man page Eran Ben Elisha
@ 2018-09-13 10:27   ` Tobin C. Harding
  2018-09-13 11:58     ` Eran Ben Elisha
  2018-09-13 12:08   ` Andrew Lunn
  1 sibling, 1 reply; 17+ messages in thread
From: Tobin C. Harding @ 2018-09-13 10:27 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Andrew Lunn,
	Florian Fainelli, Tal Alon, Ariel Almog

On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
> Add devlink-health man page. Devlink-health tool will control device
> health attributes, sensors, actions and logging.
> 
> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> 
> -------------------------------------------------------
> Copy paste man output to here for easier review process of the RFC.
> 
> DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)
> 
> NAME
>        devlink-health - devlink health configuration
> 
> SYNOPSIS
>        devlink [ OPTIONS ] health  { COMMAND | help }
> 
>        OPTIONS := { -V[ersion] | -n[no-nice-names] }
> 
>        devlink health show [ DEV ] [ sensor NAME ]
> 
>        devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"
> 
>        devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
> 
>        devlink health action reinit DEV name NAME
> 
>        devlink health help
> 
> DESCRIPTION
>        devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
>        reset and dump of info. In addition, set the health activity termination action.
> 
>    devlink health show - Display devlink health sensors and actions attributes
>        DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.
> 
>            Format is:
>              BUS_NAME/BUS_ADDRESS
> 
>        sensor NAME - Specifies the devlink sensor to show.
> 

Perhaps the commands should include the optional arguments, so that when
reading the description one doesn't have to scroll to the top of the
page all the time,

e.g.:
     devlink health show [ DEV ] [ sensor NAME ] - Display devlink health sensors and actions attributes

>    devlink health sensor set - sets devlink health sensor attributes
>        DEV    Specifies the devlink device to show.

s/show/set/

>        name NAME
>               Name of the sensor to set.
> 
>        action NAME { active | inactive }
>                   Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
> 
>    devlink health action set - sets devlink action attributes
>        DEV    Specifies the devlink device to set.
> 
>        name NAME
>               Specifies the devlink action to set.

This is a little unclear to me?

>        period PERIOD
>               The period on which we limit the amount of performed actions, measured in seconds.
> 
>        count COUNT
>               The maximum amount of actions performed in a limit time frame.

Perhaps:
                The maximum number of actions performed in a limited time frame.

>        fail   { ignore | down }
>                   Specify the behavior once count limit was reached.
> 
>                   ignore - Ignore errors without execution of any action.
> 
>                   down - Driver will remain in nonoperational state.
> 
>    devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
>        DEV    Specifies the devlink device to set.
> 
>        name NAME
>               Specifies the devlink action to set.

Perhaps s/set/reinitialise/g for the above two descriptions.

Hope this helps,
Tobin.


* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13 10:27   ` Tobin C. Harding
@ 2018-09-13 11:58     ` Eran Ben Elisha
  2018-09-13 22:06       ` Tobin C. Harding
  0 siblings, 1 reply; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-13 11:58 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Andrew Lunn,
	Florian Fainelli, Tal Alon, Ariel Almog



On 9/13/2018 1:27 PM, Tobin C. Harding wrote:
> On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
>> Add devlink-health man page. Devlink-health tool will control device
>> health attributes, sensors, actions and logging.
>>
>> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
>>
>> -------------------------------------------------------
>> Copy paste man output to here for easier review process of the RFC.
>>
>> DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)
>>
>> NAME
>>         devlink-health - devlink health configuration
>>
>> SYNOPSIS
>>         devlink [ OPTIONS ] health  { COMMAND | help }
>>
>>         OPTIONS := { -V[ersion] | -n[no-nice-names] }
>>
>>         devlink health show [ DEV ] [ sensor NAME ]
>>
>>         devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"
>>
>>         devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
>>
>>         devlink health action reinit DEV name NAME
>>
>>         devlink health help
>>
>> DESCRIPTION
>>         devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
>>         reset and dump of info. In addition, set the health activity termination action.
>>
>>     devlink health show - Display devlink health sensors and actions attributes
>>         DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.
>>
>>             Format is:
>>               BUS_NAME/BUS_ADDRESS
>>
>>         sensor NAME - Specifies the devlink sensor to show.
>>
> 
> Perhaps the commands should include the optional arguments so when
> reading the description one doesn't have to scroll to the top of the
> page all the time
> 
> e.g
>       devlink health show [ DEV ] [ sensor NAME ] - Display devlink health sensors and actions attributes
> 

I followed the scheme used in all other devlink man pages;
see devlink-region, devlink-port, etc.

From my perspective, I am fine with adding it to devlink-health, but I need an
ack from the devlink maintainer to see if he likes it...

>>     devlink health sensor set - sets devlink health sensor attributes
>>         DEV    Specifies the devlink device to show.
> 
> 	 		      	      	     	set
> 
>>         name NAME
>>                Name of the sensor to set.
>>
>>         action NAME { active | inactive }
>>                    Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
>>
>>     devlink health action set - sets devlink action attributes
>>         DEV    Specifies the devlink device to set.
>>
>>         name NAME
>>                Specifies the devlink action to set.
> 
> This is a little unclear to me?

What is not clear? The term 'action' or the naming? Can you elaborate?

> 
>>         period PERIOD
>>                The period on which we limit the amount of performed actions, measured in seconds.
>>
>>         count COUNT
>>                The maximum amount of actions performed in a limit time frame.
> 
> Perhaps		    	    	      	
>                  The maximum number of actions performed in a limited time frame.

ack

> 
>>         fail   { ignore | down }
>>                    Specify the behavior once count limit was reached.
>>
>>                    ignore - Ignore errors without execution of any action.
>>
>>                    down - Driver will remain in nonoperational state.
>>
>>     devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
>>         DEV    Specifies the devlink device to set.
>>
>>         name NAME
>>                Specifies the devlink action to set.
> 
> Perhaps s/set/reinitialise/g for the above two descriptions.

ack

> 
> Hope this helps,
> Tobin.

thanks


* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13  8:18 ` [RFC PATCH iproute2-next] man: Add devlink health man page Eran Ben Elisha
  2018-09-13 10:27   ` Tobin C. Harding
@ 2018-09-13 12:08   ` Andrew Lunn
  2018-09-13 12:49     ` Eran Ben Elisha
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Lunn @ 2018-09-13 12:08 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Florian Fainelli,
	Tal Alon, Ariel Almog

>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>            Sets TX_COMP_ERROR sensor parameters for a specific device.

I hope the real sensors have more understandable names. If I remember
correctly, the same sort of comment was given for resource
management. It was pretty unclear what the resource names actually
mean. Is an average user going to have any idea how to actually use
these sensors and actions?

Can you give more examples of sensors? We should understand if there
are any overlaps with hwmon.

    Andrew


* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13 12:08   ` Andrew Lunn
@ 2018-09-13 12:49     ` Eran Ben Elisha
  2018-09-13 13:24       ` Andrew Lunn
  0 siblings, 1 reply; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-13 12:49 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Florian Fainelli,
	Tal Alon, Ariel Almog



On 9/13/2018 3:08 PM, Andrew Lunn wrote:
>>         devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>>             Sets TX_COMP_ERROR sensor parameters for a specific device.
> 
> I hope the real sensors have more understandable names. If i remember
> correctly, the same sort of comment was given for resource
> management. It was pretty unclear what the resource names actually
> mean. Is an average user going to have any idea how to actually use
> these sensors and actions?

Well, hopefully. The whole point is to have it fully controlled by the
user. However, names used in the command should be short. I guess we should
document them (the challenge is to fit multiple vendors).

> 
> Can you give more examples of sensors. We should understand if there
> are any overlaps with hwmon.

To restate: we will have SW sensors as well, not only HW
sensors.

This is what I had in mind:
1. command interface error
2. command interface timeout
3. stuck TX queue (like tx_timeout)
4. stuck TX completion queue (driver did not process packets in a 
reasonable time period)
5. stuck RX queue
6. RX completion error
7. TX completion error
8. HW / FW catastrophic error report
9. completion queue overrun

Eran

> 
>      Andrew
> 


* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13 12:49     ` Eran Ben Elisha
@ 2018-09-13 13:24       ` Andrew Lunn
  2018-09-13 14:30         ` Eran Ben Elisha
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Lunn @ 2018-09-13 13:24 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Florian Fainelli,
	Tal Alon, Ariel Almog

On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 3:08 PM, Andrew Lunn wrote:
> >>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >>            Sets TX_COMP_ERROR sensor parameters for a specific device.
> >
> >I hope the real sensors have more understandable names. If i remember
> >correctly, the same sort of comment was given for resource
> >management. It was pretty unclear what the resource names actually
> >mean. Is an average user going to have any idea how to actually use
> >these sensors and actions?
> 
> well, hopefully. the whole point is to have it fully controlled by the user.
> However, names for the command should be short. I guess we shall have it
> documented (challenge is to fit to multi vendors).
> 
> >
> >Can you give more examples of sensors. We should understand if there
> >are any overlaps with hwmon.
> 
> I restate here that we shall have SW sensors as well, and not only HW
> sensors.
> 
> This is what I had in mind:
> 1. command interface error
> 2. command interface timeout
> 3. stuck TX queue (like tx_timeout)
> 4. stuck TX completion queue (driver did not process packets in a reasonable
> time period)
> 5. stuck RX queue
> 6. RX completion error
> 7. TX completion error
> 8. HW / FW catastrophic error report
> 9. completion queue overrun

Hi Eran

I'm having trouble differentiating between these SW sensors and bugs
which need fixing. What causes a command interface error? Sending it a
command it does not understand? A wrongly formatted command? A command
the version of the firmware does not support? These all sound just
like plain old bugs which need fixing, not something which needs a
framework to detect them and try to recover from them by resetting
something.

I would have expected all the issues to be about physical
properties. Something similar to SMART for hard disks. The power
supplies are starting to droop, suggesting it might die soon. The
tacho on the fan suggests the FAN is not rotating as fast as it
should, so the motor is going to die soon. An SFP is giving i2c
errors, suggesting it is not seated correctly. The card as a whole is
overheating, despite the fan working, suggesting the ambient
temperature is just too high.

	Andrew


* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13 13:24       ` Andrew Lunn
@ 2018-09-13 14:30         ` Eran Ben Elisha
  2018-09-13 15:12           ` Andrew Lunn
  0 siblings, 1 reply; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-13 14:30 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Florian Fainelli,
	Tal Alon, Ariel Almog



On 9/13/2018 4:24 PM, Andrew Lunn wrote:
> On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:
>>
>>
>> On 9/13/2018 3:08 PM, Andrew Lunn wrote:
>>>>         devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>>>>             Sets TX_COMP_ERROR sensor parameters for a specific device.
>>>
>>> I hope the real sensors have more understandable names. If i remember
>>> correctly, the same sort of comment was given for resource
>>> management. It was pretty unclear what the resource names actually
>>> mean. Is an average user going to have any idea how to actually use
>>> these sensors and actions?
>>
>> well, hopefully. the whole point is to have it fully controlled by the user.
>> However, names for the command should be short. I guess we shall have it
>> documented (challenge is to fit to multi vendors).
>>
>>>
>>> Can you give more examples of sensors. We should understand if there
>>> are any overlaps with hwmon.
>>
>> I restate here that we shall have SW sensors as well, and not only HW
>> sensors.
>>
>> This is what I had in mind:
>> 1. command interface error
>> 2. command interface timeout
>> 3. stuck TX queue (like tx_timeout)
>> 4. stuck TX completion queue (driver did not process packets in a reasonable
>> time period)
>> 5. stuck RX queue
>> 6. RX completion error
>> 7. TX completion error
>> 8. HW / FW catastrophic error report
>> 9. completion queue overrun
> 
> Hi Eran
> 
> I'm having trouble differentiating between these SW sensors and bugs
> which need fixing. What causes a command interface error? Sending it a
> command it does not understand? A wrongly formatted command? A command
> the version of the firmware does not support? These all sound just
> like plain old bugs which need fixing, not something which needs a
> framework to detect them and try to recover from them by resetting
> something.

Such issues do exist in production environments, and need to be handled
even if the root cause is a bug that will be fixed in the latest release. This
feature should help developers and administrators control and recover
their live systems through auto-correction and logging support.
The goal is:
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed 
debugging information.

> 
> I would of expected that all the issues are about physical
> properties. Something similar to SMART for hard disks. The power
> supplies are starting to droop, suggesting it might die soon. The
> tacho on the fan suggests the FAN is not rotating as fast as it
> should, so the motor is going to die soon. An SFP is giving i2c
> errors, suggesting it is not seated correctly. The card as a whole is
> overheating, despite the fan working, suggesting the ambient
> temperature is just too high.

AFAIU, the kind of sensors you suggest here require a manual fix:
physically going to the setup, replacing HW, installing a new fan, etc.
Monitoring such events is easy; the driver can just log events from HW to
dmesg and end its handling there.
None of these is a real networking issue of the kind I would like to handle with
devlink-health.

Eran

> 
> 	Andrew
> 


* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13 14:30         ` Eran Ben Elisha
@ 2018-09-13 15:12           ` Andrew Lunn
  2018-09-16  9:14             ` Eran Ben Elisha
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Lunn @ 2018-09-13 15:12 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Florian Fainelli,
	Tal Alon, Ariel Almog

> >>>>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >>>>            Sets TX_COMP_ERROR sensor parameters for a specific device.

> >>This is what I had in mind:
> >>1. command interface error
> >>2. command interface timeout
> >>3. stuck TX queue (like tx_timeout)
> >>4. stuck TX completion queue (driver did not process packets in a reasonable
> >>time period)
> >>5. stuck RX queue
> >>6. RX completion error
> >>7. TX completion error
> >>8. HW / FW catastrophic error report
> >>9. completion queue overrun

> Such issues do exist in production environment, and need to be handled even
> if root cause is a bug which will be fixed in latest release. My feature
> should help developers / administrator to control and recover their live
> systems, by auto correction and logging support.
> Goal is:
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed
> debugging information.

So maybe you have the wrong name for this. Health is nice in terms of
Marketing, but we are actually talking about bug recovery.

devlink bug sensor set pci/0000:01:00.0 name command_interface_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name command_interface_timeout action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name transmit_completion_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name completion_queue_overrun action reset off action dump on

seems a lot more understandable than:

devlink health set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
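
To make the argument grammar in these examples concrete, here is a small
Python sketch of how "DEV name NAME" followed by repeated "action NAME STATE"
clauses could be parsed. It is only an illustration: the function name is
hypothetical, and accepting both on/off (as in the examples) and
active/inactive (as in the SYNOPSIS) is an assumption, since the RFC uses
both spellings.

```python
# Assumption: both spellings seen in the RFC are accepted.
STATES = {"on": True, "active": True, "off": False, "inactive": False}


def parse_sensor_set(argv):
    """Parse: DEV name NAME [action NAME STATE]... into a dict."""
    if len(argv) < 3 or argv[1] != "name":
        raise ValueError("expected: DEV name NAME [action NAME STATE]...")
    result = {"dev": argv[0], "sensor": argv[2], "actions": {}}
    rest = argv[3:]
    if len(rest) % 3 != 0:
        raise ValueError("each action clause needs a name and a state")
    for i in range(0, len(rest), 3):
        keyword, action, state = rest[i:i + 3]
        if keyword != "action" or state not in STATES:
            raise ValueError(f"bad action clause: {keyword} {action} {state}")
        result["actions"][action] = STATES[state]
    return result


parsed = parse_sensor_set(
    "pci/0000:01:00.0 name command_interface_error "
    "action reset off action dump on".split()
)
```

The sketch accepts any number of action clauses, as the examples do, although
the SYNOPSIS as written shows only one optional clause.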

	Andrew


* Re: [RFC PATCH iproute2-next] System specification health API
  2018-09-13  8:18 [RFC PATCH iproute2-next] System specification health API Eran Ben Elisha
  2018-09-13  8:18 ` [RFC PATCH iproute2-next] man: Add devlink health man page Eran Ben Elisha
@ 2018-09-13 17:36 ` Jakub Kicinski
  2018-09-16 10:37   ` Eran Ben Elisha
  2018-09-16 19:29   ` Stephen Hemminger
  1 sibling, 2 replies; 17+ messages in thread
From: Jakub Kicinski @ 2018-09-13 17:36 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan, Simon Horman,
	Alexander Duyck, Andrew Lunn, Florian Fainelli, Tal Alon,
	Ariel Almog

On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> The health spec is targeted for Real Time Alerting, in order to know when
> something bad had happened to a PCI device

By spec, do you mean some standards-body spec you implement, or is this
proposal the spec?

> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed debugging
>   information.
> 
> The health contains sensors which sense for malfunction. Once sensor triggered,
> actions such as logs and correction can be taken.
> Sensors are sensing the health state and can trigger correction action.
> 
> The sensors are divided into the following groups
> - Hardware sensor - a sensor which is triggered by the device due to
>   malfunction.
> - Software sensor - a sensor which is triggered by the software due to
>   malfunction.
> Both group of sensors can be triggered due to error event or due to a periodic check.
> 
> Actions are the way to handle sensor events. Action can be in one of the
> following groups:
> - Dump -  SW trace, SW dump, HW trace, HW dump
> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> Actions can be performed by SW or HW.
> 
> User is allowed to enable or disable sensors and sensor2action mapping.
> 
> This RFC man page patch describes the suggested API of devlink-health in order
> to control sensors and actions.

I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better?  Are there going to be values reported?

I'm not so sure about HW sensors in relation to existing HWMON
infrastructure...  I assume you're targeting things like say some HW
engine/block reporting it encountered an error?  Sounds good, too.

Are the actions all envisioned to be performed by the driver?
Firmware?  Hardware?  I guess that distinction can be added later.
For FW/HW actions we would go back to the problem of persistence of 
the setting since it was only implemented for params :S

Is the dump option going to tie back into region snapshots?

* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13 11:58     ` Eran Ben Elisha
@ 2018-09-13 22:06       ` Tobin C. Harding
  0 siblings, 0 replies; 17+ messages in thread
From: Tobin C. Harding @ 2018-09-13 22:06 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Andrew Lunn,
	Florian Fainelli, Tal Alon, Ariel Almog

On Thu, Sep 13, 2018 at 02:58:52PM +0300, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 1:27 PM, Tobin C. Harding wrote:
> > On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
> > > Add devlink-health man page. Devlink-health tool will control device
> > > health attributes, sensors, actions and logging.
> > > 
> > > Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> > > 
> > > -------------------------------------------------------
> > > Copy paste man output to here for easier review process of the RFC.
> > > 
> > > DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)
> > > 
> > > NAME
> > >         devlink-health - devlink health configuration
> > > 
> > > SYNOPSIS
> > >         devlink [ OPTIONS ] health  { COMMAND | help }
> > > 
> > >         OPTIONS := { -V[ersion] | -n[no-nice-names] }
> > > 
> > >         devlink health show [ DEV ] [ sensor NAME ]
> > > 
> > >         devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"
> > > 
> > >         devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
> > > 
> > >         devlink health action reinit DEV name NAME
> > > 
> > >         devlink health help
> > > 
> > > DESCRIPTION
> > >         devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
> > >         reset and dump of info. In addition, set the health activity termination action.
> > > 
> > >     devlink health show - Display devlink health sensors and actions attributes
> > >         DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.
> > > 
> > >             Format is:
> > >               BUS_NAME/BUS_ADDRESS
> > > 
> > >         sensor NAME - Specifies the devlink sensor to show.
> > > 
> > 
> > Perhaps the commands should include the optional arguments so when
> > reading the description one doesn't have to scroll to the top of the
> > page all the time
> > 
> > e.g
> >       devlink health show [ DEV ] [ sensor NAME ] - Display devlink health sensors and actions attributes
> > 
> 
> I followed the scheme presented in all other devlink man pages.
> see devlink-region, devlink-port, etc.

Oh ok, my mistake.  I'd stick with what you have then.  Thanks for
pointing this out.

> From my perspective, I am fine with adding it to devlink-health, need ack
> from the devlink maintainer to see if he likes it...
> 
> > >     devlink health sensor set - sets devlink health sensor attributes
> > >         DEV    Specifies the devlink device to show.
> > 
> > 	 		      	      	     	set
> > 
> > >         name NAME
> > >                Name of the sensor to set.
> > > 
> > >         action NAME { active | inactive }
> > >                    Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
> > > 
> > >     devlink health action set - sets devlink action attributes
> > >         DEV    Specifies the devlink device to set.
> > > 
> > >         name NAME
> > >                Specifies the devlink action to set.
> > 
> > This is a little unclear to me?
> 
> what is not clear? the term 'action' or the naming? can you elaborate?

It wasn't immediately clear what 'name' referred to.  But following on
from discussion above this may be because I have not read any of the
other devlink man pages.

thanks,
Tobin.

* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
  2018-09-13 15:12           ` Andrew Lunn
@ 2018-09-16  9:14             ` Eran Ben Elisha
  0 siblings, 0 replies; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-16  9:14 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan,
	Jakub Kicinski, Simon Horman, Alexander Duyck, Florian Fainelli,
	Tal Alon, Ariel Almog



On 9/13/2018 6:12 PM, Andrew Lunn wrote:
>>>>>>         devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>>>>>>             Sets TX_COMP_ERROR sensor parameters for a specific device.
> 
>>>> This is what I had in mind:
>>>> 1. command interface error
>>>> 2. command interface timeout
>>>> 3. stuck TX queue (like tx_timeout)
>>>> 4. stuck TX completion queue (driver did not process packets in a reasonable
>>>> time period)
>>>> 5. stuck RX queue
>>>> 6. RX completion error
>>>> 7. TX completion error
>>>> 8. HW / FW catastrophic error report
>>>> 9. completion queue overrun
> 
>> Such issues do exist in production environment, and need to be handled even
>> if root cause is a bug which will be fixed in latest release. My feature
>> should help developers / administrator to control and recover their live
>> systems, by auto correction and logging support.
>> Goal is:
>> - Provide alert debug information
>> - Self healing
>> - If problem needs vendor support, provide a way to gather all needed
>> debugging information.
> 
> So maybe you have the wrong name for this. Health is nice in terms of
> Marketing, but we are actually talking about bug recovery.

The way I see it, this feature is responsible for the health of the 
system from the pci/xxxx perspective.
I thought about devlink-recover, for example, but I really wouldn't like
to name the feature after only one of its actions. The same goes for
devlink-bug, which highlights only part of the range of capabilities
(the sensors).

My work is currently focused on error reporting and recovery, but I
wouldn't like to see the API limited to "bugs" only.

Eran

> 
> devlink bug sensor set pci/0000:01:00.0 name command_interface_error action reset off action dump on
> devlink bug sensor set pci/0000:01:00.0 name command_interface_timeout action reset off action dump on
> devlink bug sensor set pci/0000:01:00.0 name transmit_completion_error action reset off action dump on
> devlink bug sensor set pci/0000:01:00.0 name completion_queue_overrun action reset off action dump on
> 
> seems a lot more understandable than:
> 
> devlink health set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> 
> 	Andrew
> 

* Re: [RFC PATCH iproute2-next] System specification health API
  2018-09-13 17:36 ` [RFC PATCH iproute2-next] System specification health API Jakub Kicinski
@ 2018-09-16 10:37   ` Eran Ben Elisha
  2018-09-25 12:00     ` Eran Ben Elisha
  2018-09-16 19:29   ` Stephen Hemminger
  1 sibling, 1 reply; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-16 10:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan, Simon Horman,
	Alexander Duyck, Andrew Lunn, Florian Fainelli, Tal Alon,
	Ariel Almog



On 9/13/2018 8:36 PM, Jakub Kicinski wrote:
> On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
>> The health spec is targeted for Real Time Alerting, in order to know when
>> something bad had happened to a PCI device
> 
> By spec you mean some standards body spec you implement or this
> proposal is a spec?

This proposal is a spec

> 
>> - Provide alert debug information
>> - Self healing
>> - If problem needs vendor support, provide a way to gather all needed debugging
>>    information.
>>
>> The health contains sensors which sense for malfunction. Once sensor triggered,
>> actions such as logs and correction can be taken.
>> Sensors are sensing the health state and can trigger correction action.
>>
>> The sensors are divided into the following groups
>> - Hardware sensor - a sensor which is triggered by the device due to
>>    malfunction.
>> - Software sensor - a sensor which is triggered by the software due to
>>    malfunction.
>> Both group of sensors can be triggered due to error event or due to a periodic check.
>>
>> Actions are the way to handle sensor events. Action can be in one of the
>> following groups:
>> - Dump -  SW trace, SW dump, HW trace, HW dump
>> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
>> Actions can be performed by SW or HW.
>>
>> User is allowed to enable or disable sensors and sensor2action mapping.
>>
>> This RFC man page patch describes the suggested API of devlink-health in order
>> to control sensors and actions.
> 
> I like the idea of configuring response to events like this, although
> I'm not sure the name sensor is appropriate here - perhaps exception or
> error would be better?

I was trying to avoid a negative description. I called it sensor
to avoid restricting the API to errors / exceptions only. I got the
same type of comment from Andrew as well (devlink-health -> devlink-bug).

But if driver developers from other vendors don't see how it can be
expanded to sensors which are not errors, then I guess we can rename.

> Are there going to be values reported?

It depends on the sensor. If it has data that would help in the debug, 
then I assume yes, via the dumps.

> 
> I'm not so sure about HW sensors in relation to existing HWMON
> infrastructure...  I assume you're targeting things like say some HW
> engine/block reporting it encountered an error?  Sounds good, too.

yes, exactly.

> 
> Are the actions all envisioned to be performed by the driver?
> Firmware?  Hardware?  I guess that distinction can be added later.
> For FW/HW actions we would go back to the problem of persistence of
> the setting since it was only implemented for params :S

The problem is not with FW actions; the problem is when you try to set
the sensor2action mapping for the FW/HW. That will need a persistent
configuration mode. Sensor2action in SW shall be run-time mode (at least
as a start).
But it sounds like this needs some more tuning to make it clear.
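
To make the run-time model a bit more concrete, here is a rough sketch in
Python. It is purely illustrative: the class and method names, the default
of having both actions active, and the choice to dump before reset (so the
state is captured before it is wiped) are all assumptions of the sketch,
not part of the RFC or of any driver.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    DUMP = "dump"
    RESET = "reset"


def _default_actions():
    # Assumption: both actions start active until the user changes them.
    return {Action.DUMP: True, Action.RESET: True}


@dataclass
class Sensor:
    """Hypothetical run-time sensor2action mapping for one sensor."""
    name: str
    # Run-time only: held in memory, lost when the driver reloads.
    actions: dict = field(default_factory=_default_actions)

    def set_action(self, action, active):
        self.actions[action] = active

    def trigger(self):
        """Return the actions that would run, dump first, then reset."""
        order = [Action.DUMP, Action.RESET]
        return [a for a in order if self.actions.get(a, False)]


sensor = Sensor("TX_COMP_ERROR")
sensor.set_action(Action.RESET, False)
print([a.value for a in sensor.trigger()])
```

A FW/HW mapping would need the same table stored persistently in the
device, which is the part that needs more thought.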

> 
> Is the dump option going to tie back into region snapshots?
> 
Not necessarily; dumping SW objects can be helpful as well.

* Re: [RFC PATCH iproute2-next] System specification health API
  2018-09-13 17:36 ` [RFC PATCH iproute2-next] System specification health API Jakub Kicinski
  2018-09-16 10:37   ` Eran Ben Elisha
@ 2018-09-16 19:29   ` Stephen Hemminger
  2018-09-16 19:57     ` Andrew Lunn
  1 sibling, 1 reply; 17+ messages in thread
From: Stephen Hemminger @ 2018-09-16 19:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eran Ben Elisha, netdev, Jiri Pirko, Andy Gospodarek,
	Michael Chan, Simon Horman, Alexander Duyck, Andrew Lunn,
	Florian Fainelli, Tal Alon, Ariel Almog

On Thu, 13 Sep 2018 10:36:04 -0700
Jakub Kicinski <jakub.kicinski@netronome.com> wrote:

> On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> > The health spec is targeted for Real Time Alerting, in order to know when
> > something bad had happened to a PCI device  
> 
> By spec you mean some standards body spec you implement or this
> proposal is a spec?
> 
> > - Provide alert debug information
> > - Self healing
> > - If problem needs vendor support, provide a way to gather all needed debugging
> >   information.
> > 
> > The health contains sensors which sense for malfunction. Once sensor triggered,
> > actions such as logs and correction can be taken.
> > Sensors are sensing the health state and can trigger correction action.
> > 
> > The sensors are divided into the following groups
> > - Hardware sensor - a sensor which is triggered by the device due to
> >   malfunction.
> > - Software sensor - a sensor which is triggered by the software due to
> >   malfunction.
> > Both group of sensors can be triggered due to error event or due to a periodic check.
> > 
> > Actions are the way to handle sensor events. Action can be in one of the
> > following groups:
> > - Dump -  SW trace, SW dump, HW trace, HW dump
> > - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> > Actions can be performed by SW or HW.
> > 
> > User is allowed to enable or disable sensors and sensor2action mapping.
> > 
> > This RFC man page patch describes the suggested API of devlink-health in order
> > to control sensors and actions.  
> 
> I like the idea of configuring response to events like this, although
> I'm not sure the name sensor is appropriate here - perhaps exception or
> error would be better?  Are there going to be values reported?
> 
> I'm not so sure about HW sensors in relation to existing HWMON
> infrastructure...  I assume you're targeting things like say some HW
> engine/block reporting it encountered an error?  Sounds good, too.
> 
> Are the actions all envisioned to be performed by the driver?
> Firmware?  Hardware?  I guess that distinction can be added later.
> For FW/HW actions we would go back to the problem of persistence of 
> the setting since it was only implemented for params :S
> 
> Is the dump option going to tie back into region snapshots?

Why is this going under iproute rather than using one of the existing sensor APIs?
For example, Intel NICs have thermal sensors, etc.

* Re: [RFC PATCH iproute2-next] System specification health API
  2018-09-16 19:29   ` Stephen Hemminger
@ 2018-09-16 19:57     ` Andrew Lunn
  2018-09-25 12:17       ` Eran Ben Elisha
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Lunn @ 2018-09-16 19:57 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jakub Kicinski, Eran Ben Elisha, netdev, Jiri Pirko,
	Andy Gospodarek, Michael Chan, Simon Horman, Alexander Duyck,
	Florian Fainelli, Tal Alon, Ariel Almog

> Why is this going under iproute rather than using one of the existing sensor API's.
> For example Intel NIC's have thermal sensors etc.

Hi Stephen

These are not that sort of sensor. This is part of the naming problem
here. It is not really to do with health; it is about exceptions and
bugs. And the sensors are more like timeouts and watchdogs.

It is clear that the current names lead to a lot of confusion. Maybe:

health -> exception
sensor -> condition

       Andrew

* Re: [RFC PATCH iproute2-next] System specification health API
  2018-09-16 10:37   ` Eran Ben Elisha
@ 2018-09-25 12:00     ` Eran Ben Elisha
  0 siblings, 0 replies; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-25 12:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan, Simon Horman,
	Alexander Duyck, Andrew Lunn, Florian Fainelli, Tal Alon,
	Ariel Almog



On 9/16/2018 1:37 PM, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 8:36 PM, Jakub Kicinski wrote:
>> On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
>>> The health spec is targeted for Real Time Alerting, in order to know 
>>> when
>>> something bad had happened to a PCI device
>>
>> By spec you mean some standards body spec you implement or this
>> proposal is a spec?
> 
> This proposal is a spec
> 
>>
>>> - Provide alert debug information
>>> - Self healing
>>> - If problem needs vendor support, provide a way to gather all needed 
>>> debugging
>>>    information.
>>>
>>> The health contains sensors which sense for malfunction. Once sensor 
>>> triggered,
>>> actions such as logs and correction can be taken.
>>> Sensors are sensing the health state and can trigger correction action.
>>>
>>> The sensors are divided into the following groups
>>> - Hardware sensor - a sensor which is triggered by the device due to
>>>    malfunction.
>>> - Software sensor - a sensor which is triggered by the software due to
>>>    malfunction.
>>> Both group of sensors can be triggered due to error event or due to a 
>>> periodic check.
>>>
>>> Actions are the way to handle sensor events. Action can be in one of the
>>> following groups:
>>> - Dump -  SW trace, SW dump, HW trace, HW dump
>>> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of 
>>> device, etc)
>>> Actions can be performed by SW or HW.
>>>
>>> User is allowed to enable or disable sensors and sensor2action mapping.
>>>
>>> This RFC man page patch describes the suggested API of devlink-health 
>>> in order
>>> to control sensors and actions.
>>
>> I like the idea of configuring response to events like this, although
>> I'm not sure the name sensor is appropriate here - perhaps exception or
>> error would be better?
> 
> I was trying to avoid the negativity description. Have it called sensor 
> to avoid restricting the API for errors / exceptions only. I got the 
> same type of comment from Andrew as well devlink-health->devlink-bug.
> 
> But if other vendors driver developers don't see it can be expanded to 
> sensor which are not errors, then I guess we can refactor the names.
> 
>> Are there going to be values reported?
> 
> It depends on the sensor. If it has data that would help in the debug, 
> then I assume yes, via the dumps.
> 
>>
>> I'm not so sure about HW sensors in relation to existing HWMON
>> infrastructure...  I assume you're targeting things like say some HW
>> engine/block reporting it encountered an error?  Sounds good, too.
> 
> yes, exactly.
> 
>>
>> Are the actions all envisioned to be performed by the driver?
>> Firmware?  Hardware?  I guess that distinction can be added later.
>> For FW/HW actions we would go back to the problem of persistence of
>> the setting since it was only implemented for params :S
> 
> The problem is not with FW action, the problem is when you try to set 
> sensor2action mapping for the FW/HW. this will need persistence 
> configuration mode. Sensor2action in SW shall be run-time mode (at least 
> as a start).
> But it sound as this need some more tuning, to make it clear.

Revisiting this (before sending V2). My guideline is that persistence
inside the device is needed only when persistent information is
needed before the driver loads. For any other configuration (i.e. post HW
boot), one can use standard Linux scripts in order to control its
persistent information.

If a new sensor is added that requires pre HW boot information,
the API can be extended later.
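
As a rough illustration of the script approach, the sketch below (Python)
rebuilds the sensor settings from a saved description so they can be
re-applied after the driver loads. It only prints the command lines. The
command syntax is the one proposed in the RFC man page synopsis, not a
shipped devlink interface, and repeating "action" on one line follows the
RFC example. The DESIRED layout and the build_commands() helper are
hypothetical.

```python
import shlex

# Desired state, e.g. kept in a file by the administrator (hypothetical layout).
DESIRED = {
    "pci/0000:01:00.0": {
        "TX_COMP_ERROR": {"reset": False, "dump": True},
    },
}


def build_commands(desired):
    """Build one 'devlink health sensor set' line per sensor (RFC syntax)."""
    commands = []
    for dev, sensors in sorted(desired.items()):
        for sensor, actions in sorted(sensors.items()):
            argv = ["devlink", "health", "sensor", "set", dev, "name", sensor]
            for action, active in sorted(actions.items()):
                argv += ["action", action, "active" if active else "inactive"]
            commands.append(shlex.join(argv))
    return commands


for line in build_commands(DESIRED):
    print(line)
```

Such a script could be run from whatever hook the distribution provides
once the device is up, so nothing has to be stored in the device itself.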

> 
>>
>> Is the dump option going to tie back into region snapshots?
>>
> no necessarily, dumping SW objects as well can be helpful

* Re: [RFC PATCH iproute2-next] System specification health API
  2018-09-16 19:57     ` Andrew Lunn
@ 2018-09-25 12:17       ` Eran Ben Elisha
  0 siblings, 0 replies; 17+ messages in thread
From: Eran Ben Elisha @ 2018-09-25 12:17 UTC (permalink / raw)
  To: Andrew Lunn, Stephen Hemminger
  Cc: Jakub Kicinski, netdev, Jiri Pirko, Andy Gospodarek,
	Michael Chan, Simon Horman, Alexander Duyck, Florian Fainelli,
	Tal Alon, Ariel Almog



On 9/16/2018 10:57 PM, Andrew Lunn wrote:
>> Why is this going under iproute rather than using one of the existing sensor API's.
>> For example Intel NIC's have thermal sensors etc.
> 
> Hi Stephen
> 
> These are not that sort of sensors. This is part of the naming problem
> here. It is not really to do with health, it is about exceptions and
> bugs. And the sensors are more like timeouts and watchdogs.
> 
> It is clear that the current names lead to a lot of confusion. Maybe:
> 
> health -> exception
> sensor -> condition
> 
>         Andrew
> 

I think those renames can work well.

(Sorry for the late response, local holiday season...)

Eran

Thread overview: 17+ messages
2018-09-13  8:18 [RFC PATCH iproute2-next] System specification health API Eran Ben Elisha
2018-09-13  8:18 ` [RFC PATCH iproute2-next] man: Add devlink health man page Eran Ben Elisha
2018-09-13 10:27   ` Tobin C. Harding
2018-09-13 11:58     ` Eran Ben Elisha
2018-09-13 22:06       ` Tobin C. Harding
2018-09-13 12:08   ` Andrew Lunn
2018-09-13 12:49     ` Eran Ben Elisha
2018-09-13 13:24       ` Andrew Lunn
2018-09-13 14:30         ` Eran Ben Elisha
2018-09-13 15:12           ` Andrew Lunn
2018-09-16  9:14             ` Eran Ben Elisha
2018-09-13 17:36 ` [RFC PATCH iproute2-next] System specification health API Jakub Kicinski
2018-09-16 10:37   ` Eran Ben Elisha
2018-09-25 12:00     ` Eran Ben Elisha
2018-09-16 19:29   ` Stephen Hemminger
2018-09-16 19:57     ` Andrew Lunn
2018-09-25 12:17       ` Eran Ben Elisha
