All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eran Ben Elisha <eranbe@mellanox.com>
To: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	Jiri Pirko <jiri@mellanox.com>
Cc: Moshe Shemesh <moshe@mellanox.com>, Aya Levin <ayal@mellanox.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	Tal Alon <talal@mellanox.com>, Ariel Almog <ariela@mellanox.com>
Subject: [PATCH RFC net-next 19/19] devlink: Add Documentation/networking/devlink-health.txt
Date: Mon, 31 Dec 2018 16:32:13 +0200	[thread overview]
Message-ID: <1546266733-9512-20-git-send-email-eranbe@mellanox.com> (raw)
In-Reply-To: <1546266733-9512-1-git-send-email-eranbe@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

This patch adds a new file to add information about devlink health
mechanism.

Signed-off-by: Aya Levin <ayal@mellanox.com>
---
 Documentation/networking/devlink-health.txt | 86 +++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 Documentation/networking/devlink-health.txt

diff --git a/Documentation/networking/devlink-health.txt b/Documentation/networking/devlink-health.txt
new file mode 100644
index 000000000000..ea8a9cc773a2
--- /dev/null
+++ b/Documentation/networking/devlink-health.txt
@@ -0,0 +1,86 @@
+The health mechanism is targeted for Real Time Alerting, in order to know when
+something bad had happened to a PCI device
+- Provide alert debug information
+- Self healing
+- If problem needs vendor support, provide a way to gather all needed debugging
+  information.
+
+The main idea is to unify and centralize driver health reports in the
+generic devlink instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The devlink health reporter:
+Device driver creates a "health reporter" per each error/health type.
+Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
+or unknown (driver specific).
+For each registered health reporter a driver can issue error/health reports
+asynchronously. All health reports handling is done by devlink.
+Device driver can provide specific callbacks for each "health reporter", e.g.
+ - Recovery procedures
+ - Diagnostics and object dump procedures
+ - OOB initial parameters
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
+Once an error is reported, devlink health will do the following actions:
+  * A log is being send to the kernel trace events buffer
+  * Health status and statistics are being updated for the reporter instance
+  * Object dump is being taken and saved at the reporter instance (as long as
+    there is no other Objdump which is already stored)
+  * Auto recovery attempt is being done. Depends on:
+    - Auto-recovery configuration
+    - Grace period vs. time passed since last recover
+
+The user interface:
+User can access/change each reporter's parameters and driver specific callbacks
+via devlink, e.g per error type (per health reporter)
+ - Configure reporter's generic parameters (like: disable/enable auto recovery)
+ - Invoke recovery procedure
+ - Run diagnostics
+ - Object dump
+
+The devlink health interface (via netlink):
+DEVLINK_CMD_HEALTH_REPORTER_GET
+  Retrieves status and configuration info per DEV and reporter.
+DEVLINK_CMD_HEALTH_REPORTER_SET
+  Allows reporter-related configuration setting.
+DEVLINK_CMD_HEALTH_REPORTER_RECOVER
+  Triggers a reporter's recovery procedure.
+DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
+  Retrieves diagnostics data from a reporter on a device.
+DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET
+  Retrieves the last stored objdump. Devlink health
+  saves a single objdump. If an objdump is not already stored by the devlink
+  for this reporter, devlink generates a new objdump.
+  Objdump output is defined by the reporter.
+DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
+  Clears the last saved objdump file for the specified reporter.
+
+
+                                               netlink
+                                      +--------------------------+
+                                      |                          |
+                                      |            +             |
+                                      |            |             |
+                                      +--------------------------+
+                                                   |request for ops
+                                                   |(diagnose,
+ mlx5_core                             devlink     |recover,
+                                                   |dump)
++--------+                            +--------------------------+
+|        |                            |    reporter|             |
+|        |                            |  +---------v----------+  |
+|        |   ops execution            |  |                    |  |
+|     <----------------------------------+                    |  |
+|        |                            |  |                    |  |
+|        |                            |  + ^------------------+  |
+|        |                            |    | request for ops     |
+|        |                            |    | (recover, dump)     |
+|        |                            |    |                     |
+|        |                            |  +-+------------------+  |
+|        |     health report          |  | health handler     |  |
+|        +------------------------------->                    |  |
+|        |                            |  +--------------------+  |
+|        |     health reporter create |                          |
+|        +---------------------------->                          |
++--------+                            +--------------------------+
-- 
2.17.1

  parent reply	other threads:[~2018-12-31 14:32 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-31 14:31 [PATCH RFC net-next 00/19] Devlink health reporting and recovery system Eran Ben Elisha
2018-12-31 14:31 ` [PATCH RFC net-next 01/19] devlink: Add health buffer support Eran Ben Elisha
2018-12-31 14:31 ` [PATCH RFC net-next 02/19] devlink: Add health reporter create/destroy functionality Eran Ben Elisha
2018-12-31 14:31 ` [PATCH RFC net-next 03/19] devlink: Add health report functionality Eran Ben Elisha
2018-12-31 14:31 ` [PATCH RFC net-next 04/19] devlink: Add health get command Eran Ben Elisha
2018-12-31 14:31 ` [PATCH RFC net-next 05/19] devlink: Add health set command Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 06/19] devlink: Add health recover command Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 07/19] devlink: Add health diagnose command Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 08/19] devlink: Add health objdump {get,clear} commands Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 09/19] net/mlx5e: Add TX reporter support Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 10/19] net/mlx5e: Add TX timeout support for mlx5e TX reporter Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 11/19] net/mlx5: Move all devlink related functions calls to devlink.c Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 12/19] net/mlx5: Refactor print health info Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 13/19] net/mlx5: Create FW devlink_health_reporter Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 14/19] net/mlx5: Add core dump register access functions Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 15/19] net/mlx5: Add support for FW reporter objdump Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 16/19] net/mlx5: Report devlink health on FW issues Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 17/19] net/mlx5: Add FW fatal devlink_health_reporter Eran Ben Elisha
2018-12-31 14:32 ` [PATCH RFC net-next 18/19] net/mlx5: Report devlink health on FW fatal issues Eran Ben Elisha
2018-12-31 14:32 ` Eran Ben Elisha [this message]
2019-01-01  1:47   ` [PATCH RFC net-next 19/19] devlink: Add Documentation/networking/devlink-health.txt Jakub Kicinski
2019-01-01 10:01     ` Eran Ben Elisha
2018-12-31 14:41 ` [PATCH RFC iproute2-next] devlink: Add health command support Aya Levin
2019-01-01  1:47 ` [PATCH RFC net-next 00/19] Devlink health reporting and recovery system Jakub Kicinski
2019-01-01  9:58   ` Eran Ben Elisha
2019-01-02 22:46     ` Jakub Kicinski
2019-01-03 13:31       ` Eran Ben Elisha
2019-01-03 22:28         ` Jakub Kicinski
2019-01-04  9:12           ` Jiri Pirko
2019-01-04  9:01         ` Jiri Pirko
2019-01-04  8:57   ` Jiri Pirko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1546266733-9512-20-git-send-email-eranbe@mellanox.com \
    --to=eranbe@mellanox.com \
    --cc=ariela@mellanox.com \
    --cc=ayal@mellanox.com \
    --cc=davem@davemloft.net \
    --cc=jiri@mellanox.com \
    --cc=moshe@mellanox.com \
    --cc=netdev@vger.kernel.org \
    --cc=talal@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.