linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv5 net-next 0/3] Add devlink and devlink health reporters to
@ 2020-11-26 14:02 George Cherian
  2020-11-26 14:02 ` [PATCHv5 net-next 1/3] octeontx2-af: Add devlink suppoort to af driver George Cherian
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: George Cherian @ 2020-11-26 14:02 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: kuba, davem, sgoutham, lcherian, gakula, masahiroy,
	george.cherian, willemdebruijn.kernel, saeed, jiri


Add basic devlink and devlink health reporters.
Devlink health reporters are added for NPA and NIX blocks.
These reporters report the error count in respective blocks.

Address Jakub's comment to add devlink support for error reporting.
https://www.spinics.net/lists/netdev/msg670712.html

Change-log:
v5 
 - Address Jiri's comment
 - use devlink_fmsg_arr_pair_nest_start() for NIX blocks 

v4 
 - Rebase to net-next (no logic changes).
 
v3
 - Address Saeed's comments on v2.
 - Renamed the reporter name as hw_*.
 - Call devlink_health_report() when an event is raised.
 - Added recover op too.

v2
 - Address Willem's comments on v1.
 - Fixed the sparse issues, reported by Jakub.


George Cherian (3):
  octeontx2-af: Add devlink suppoort to af driver
  octeontx2-af: Add devlink health reporters for NPA
  octeontx2-af: Add devlink health reporters for NIX

 .../net/ethernet/marvell/octeontx2/Kconfig    |   1 +
 .../ethernet/marvell/octeontx2/af/Makefile    |   2 +-
 .../net/ethernet/marvell/octeontx2/af/rvu.c   |   9 +-
 .../net/ethernet/marvell/octeontx2/af/rvu.h   |   4 +
 .../marvell/octeontx2/af/rvu_devlink.c        | 978 ++++++++++++++++++
 .../marvell/octeontx2/af/rvu_devlink.h        |  82 ++
 .../marvell/octeontx2/af/rvu_struct.h         |  33 +
 7 files changed, 1107 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCHv5 net-next 1/3] octeontx2-af: Add devlink suppoort to af driver
  2020-11-26 14:02 [PATCHv5 net-next 0/3] Add devlink and devlink health reporters to George Cherian
@ 2020-11-26 14:02 ` George Cherian
  2020-12-01  2:22   ` Jakub Kicinski
  2020-11-26 14:02 ` [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA George Cherian
  2020-11-26 14:02 ` [PATCHv5 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX George Cherian
  2 siblings, 1 reply; 10+ messages in thread
From: George Cherian @ 2020-11-26 14:02 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: kuba, davem, sgoutham, lcherian, gakula, masahiroy,
	george.cherian, willemdebruijn.kernel, saeed, jiri

Add devlink support to AF driver. Basic devlink support is added.
Currently info_get is the only supported devlink ops.

devlink ouptput looks like this
 # devlink dev
 pci/0002:01:00.0
 # devlink dev info
 pci/0002:01:00.0:
  driver octeontx2-af
  versions:
      fixed:
        mbox version: 9

Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: George Cherian <george.cherian@marvell.com>
---
 .../net/ethernet/marvell/octeontx2/Kconfig    |  1 +
 .../ethernet/marvell/octeontx2/af/Makefile    |  2 +-
 .../net/ethernet/marvell/octeontx2/af/rvu.c   |  9 ++-
 .../net/ethernet/marvell/octeontx2/af/rvu.h   |  4 ++
 .../marvell/octeontx2/af/rvu_devlink.c        | 72 +++++++++++++++++++
 .../marvell/octeontx2/af/rvu_devlink.h        | 20 ++++++
 6 files changed, 106 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
 create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h

diff --git a/drivers/net/ethernet/marvell/octeontx2/Kconfig b/drivers/net/ethernet/marvell/octeontx2/Kconfig
index 543a1d047567..16caa02095fe 100644
--- a/drivers/net/ethernet/marvell/octeontx2/Kconfig
+++ b/drivers/net/ethernet/marvell/octeontx2/Kconfig
@@ -9,6 +9,7 @@ config OCTEONTX2_MBOX
 config OCTEONTX2_AF
 	tristate "Marvell OcteonTX2 RVU Admin Function driver"
 	select OCTEONTX2_MBOX
+	select NET_DEVLINK
 	depends on (64BIT && COMPILE_TEST) || ARM64
 	depends on PCI
 	help
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/Makefile b/drivers/net/ethernet/marvell/octeontx2/af/Makefile
index 7100d1dd856e..eb535c98ca38 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/Makefile
+++ b/drivers/net/ethernet/marvell/octeontx2/af/Makefile
@@ -10,4 +10,4 @@ obj-$(CONFIG_OCTEONTX2_AF) += octeontx2_af.o
 octeontx2_mbox-y := mbox.o rvu_trace.o
 octeontx2_af-y := cgx.o rvu.o rvu_cgx.o rvu_npa.o rvu_nix.o \
 		  rvu_reg.o rvu_npc.o rvu_debugfs.o ptp.o rvu_npc_fs.o \
-		  rvu_cpt.o
+		  rvu_cpt.o rvu_devlink.o
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
index 9f901c0edcbb..e8fd712860a1 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
@@ -2826,17 +2826,23 @@ static int rvu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (err)
 		goto err_flr;
 
+	err = rvu_register_dl(rvu);
+	if (err)
+		goto err_irq;
+
 	rvu_setup_rvum_blk_revid(rvu);
 
 	/* Enable AF's VFs (if any) */
 	err = rvu_enable_sriov(rvu);
 	if (err)
-		goto err_irq;
+		goto err_dl;
 
 	/* Initialize debugfs */
 	rvu_dbg_init(rvu);
 
 	return 0;
+err_dl:
+	rvu_unregister_dl(rvu);
 err_irq:
 	rvu_unregister_interrupts(rvu);
 err_flr:
@@ -2868,6 +2874,7 @@ static void rvu_remove(struct pci_dev *pdev)
 
 	rvu_dbg_exit(rvu);
 	rvu_unregister_interrupts(rvu);
+	rvu_unregister_dl(rvu);
 	rvu_flr_wq_destroy(rvu);
 	rvu_cgx_exit(rvu);
 	rvu_fwdata_exit(rvu);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index b6c0977499ab..b1a6ecfd563e 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -12,7 +12,10 @@
 #define RVU_H
 
 #include <linux/pci.h>
+#include <net/devlink.h>
+
 #include "rvu_struct.h"
+#include "rvu_devlink.h"
 #include "common.h"
 #include "mbox.h"
 #include "npc.h"
@@ -422,6 +425,7 @@ struct rvu {
 #ifdef CONFIG_DEBUG_FS
 	struct rvu_debugfs	rvu_dbg;
 #endif
+	struct rvu_devlink	*rvu_dl;
 };
 
 static inline void rvu_write64(struct rvu *rvu, u64 block, u64 offset, u64 val)
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
new file mode 100644
index 000000000000..04ef945e7e75
--- /dev/null
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Marvell OcteonTx2 RVU Devlink
+ *
+ * Copyright (C) 2020 Marvell.
+ *
+ */
+
+#include "rvu.h"
+
+#define DRV_NAME "octeontx2-af"
+
+static int rvu_devlink_info_get(struct devlink *devlink, struct devlink_info_req *req,
+				struct netlink_ext_ack *extack)
+{
+	char buf[10];
+	int err;
+
+	err = devlink_info_driver_name_put(req, DRV_NAME);
+	if (err)
+		return err;
+
+	sprintf(buf, "%X", OTX2_MBOX_VERSION);
+	return devlink_info_version_fixed_put(req, "mbox version:", buf);
+}
+
+static const struct devlink_ops rvu_devlink_ops = {
+	.info_get = rvu_devlink_info_get,
+};
+
+int rvu_register_dl(struct rvu *rvu)
+{
+	struct rvu_devlink *rvu_dl;
+	struct devlink *dl;
+	int err;
+
+	rvu_dl = kzalloc(sizeof(*rvu_dl), GFP_KERNEL);
+	if (!rvu_dl)
+		return -ENOMEM;
+
+	dl = devlink_alloc(&rvu_devlink_ops, sizeof(struct rvu_devlink));
+	if (!dl) {
+		dev_warn(rvu->dev, "devlink_alloc failed\n");
+		kfree(rvu_dl);
+		return -ENOMEM;
+	}
+
+	err = devlink_register(dl, rvu->dev);
+	if (err) {
+		dev_err(rvu->dev, "devlink register failed with error %d\n", err);
+		devlink_free(dl);
+		kfree(rvu_dl);
+		return err;
+	}
+
+	rvu_dl->dl = dl;
+	rvu_dl->rvu = rvu;
+	rvu->rvu_dl = rvu_dl;
+	return 0;
+}
+
+void rvu_unregister_dl(struct rvu *rvu)
+{
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	struct devlink *dl = rvu_dl->dl;
+
+	if (!dl)
+		return;
+
+	devlink_unregister(dl);
+	devlink_free(dl);
+	kfree(rvu_dl);
+}
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
new file mode 100644
index 000000000000..1ed6dde79a4e
--- /dev/null
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*  Marvell OcteonTx2 RVU Devlink
+ *
+ * Copyright (C) 2020 Marvell.
+ *
+ */
+
+#ifndef RVU_DEVLINK_H
+#define  RVU_DEVLINK_H
+
+struct rvu_devlink {
+	struct devlink *dl;
+	struct rvu *rvu;
+};
+
+/* Devlink APIs */
+int rvu_register_dl(struct rvu *rvu);
+void rvu_unregister_dl(struct rvu *rvu);
+
+#endif /* RVU_DEVLINK_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA
  2020-11-26 14:02 [PATCHv5 net-next 0/3] Add devlink and devlink health reporters to George Cherian
  2020-11-26 14:02 ` [PATCHv5 net-next 1/3] octeontx2-af: Add devlink suppoort to af driver George Cherian
@ 2020-11-26 14:02 ` George Cherian
  2020-12-01  2:29   ` Jakub Kicinski
  2020-11-26 14:02 ` [PATCHv5 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX George Cherian
  2 siblings, 1 reply; 10+ messages in thread
From: George Cherian @ 2020-11-26 14:02 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: kuba, davem, sgoutham, lcherian, gakula, masahiroy,
	george.cherian, willemdebruijn.kernel, saeed, jiri

Add health reporters for RVU NPA block.
NPA Health reporters handle following HW event groups
 - GENERAL events
 - ERROR events
 - RAS events
 - RVU event
An event counter per event is maintained in SW.

Output:
 # devlink health
 pci/0002:01:00.0:
   reporter hw_npa
     state healthy error 0 recover 0
 # devlink  health dump show pci/0002:01:00.0 reporter hw_npa
 NPA_AF_GENERAL:
        Unmap PF Error: 0
        NIX:
        0: free disabled RX: 0 free disabled TX: 0
        1: free disabled RX: 0 free disabled TX: 0
        Free Disabled for SSO: 0
        Free Disabled for TIM: 0
        Free Disabled for DPI: 0
        Free Disabled for AURA: 0
        Alloc Disabled for Resvd: 0
  NPA_AF_ERR:
        Memory Fault on NPA_AQ_INST_S read: 0
        Memory Fault on NPA_AQ_RES_S write: 0
        AQ Doorbell Error: 0
        Poisoned data on NPA_AQ_INST_S read: 0
        Poisoned data on NPA_AQ_RES_S write: 0
        Poisoned data on HW context read: 0
  NPA_AF_RVU:
        Unmap Slot Error: 0

Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: George Cherian <george.cherian@marvell.com>
---
 .../marvell/octeontx2/af/rvu_devlink.c        | 498 +++++++++++++++++-
 .../marvell/octeontx2/af/rvu_devlink.h        |  31 ++
 .../marvell/octeontx2/af/rvu_struct.h         |  23 +
 3 files changed, 551 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
index 04ef945e7e75..377264d65d0c 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
@@ -5,10 +5,504 @@
  *
  */
 
+#include<linux/bitfield.h>
+
 #include "rvu.h"
+#include "rvu_reg.h"
+#include "rvu_struct.h"
 
 #define DRV_NAME "octeontx2-af"
 
+static int rvu_report_pair_start(struct devlink_fmsg *fmsg, const char *name)
+{
+	int err;
+
+	err = devlink_fmsg_pair_nest_start(fmsg, name);
+	if (err)
+		return err;
+
+	return  devlink_fmsg_obj_nest_start(fmsg);
+}
+
+static int rvu_report_pair_end(struct devlink_fmsg *fmsg)
+{
+	int err;
+
+	err = devlink_fmsg_obj_nest_end(fmsg);
+	if (err)
+		return err;
+
+	return devlink_fmsg_pair_nest_end(fmsg);
+}
+
+static bool rvu_common_request_irq(struct rvu *rvu, int offset,
+				   const char *name, irq_handler_t fn)
+{
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	int rc;
+
+	sprintf(&rvu->irq_name[offset * NAME_SIZE], name);
+	rc = request_irq(pci_irq_vector(rvu->pdev, offset), fn, 0,
+			 &rvu->irq_name[offset * NAME_SIZE], rvu_dl);
+	if (rc)
+		dev_warn(rvu->dev, "Failed to register %s irq\n", name);
+	else
+		rvu->irq_allocated[offset] = true;
+
+	return rvu->irq_allocated[offset];
+}
+
+static irqreturn_t rvu_npa_af_rvu_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_npa_event_ctx *npa_event_context;
+	struct rvu_npa_event_cnt *npa_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NPA, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	npa_event_context = rvu_dl->npa_event_ctx;
+	npa_event_count = &npa_event_context->npa_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NPA_AF_RVU_INT);
+	npa_event_context->npa_af_rvu_int = intr;
+
+	if (intr & BIT_ULL(0))
+		npa_event_count->unmap_slot_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NPA_AF_RVU_INT, intr);
+	rvu_write64(rvu, blkaddr, NPA_AF_RVU_INT_ENA_W1C, ~0ULL);
+	devlink_health_report(rvu_dl->rvu_npa_health_reporter, "NPA_AF_RVU Error",
+			      npa_event_context);
+
+	return IRQ_HANDLED;
+}
+
+static int rvu_npa_inpq_to_cnt(u16 in,
+			       struct rvu_npa_event_cnt *npa_event_count)
+{
+	switch (in) {
+	case 0:
+		return 0;
+	case BIT(NPA_INPQ_NIX0_RX):
+		return npa_event_count->free_dis_nix0_rx_count++;
+	case BIT(NPA_INPQ_NIX0_TX):
+		return npa_event_count->free_dis_nix0_tx_count++;
+	case BIT(NPA_INPQ_NIX1_RX):
+		return npa_event_count->free_dis_nix1_rx_count++;
+	case BIT(NPA_INPQ_NIX1_TX):
+		return npa_event_count->free_dis_nix1_tx_count++;
+	case BIT(NPA_INPQ_SSO):
+		return npa_event_count->free_dis_sso_count++;
+	case BIT(NPA_INPQ_TIM):
+		return npa_event_count->free_dis_tim_count++;
+	case BIT(NPA_INPQ_DPI):
+		return npa_event_count->free_dis_dpi_count++;
+	case BIT(NPA_INPQ_AURA_OP):
+		return npa_event_count->free_dis_aura_count++;
+	case BIT(NPA_INPQ_INTERNAL_RSV):
+		return npa_event_count->free_dis_rsvd_count++;
+	}
+
+	return npa_event_count->alloc_dis_rsvd_count++;
+}
+
+static irqreturn_t rvu_npa_af_gen_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_npa_event_ctx *npa_event_context;
+	struct rvu_npa_event_cnt *npa_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr, val;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NPA, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	npa_event_context = rvu_dl->npa_event_ctx;
+	npa_event_count = &npa_event_context->npa_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NPA_AF_GEN_INT);
+	npa_event_context->npa_af_rvu_gen = intr;
+
+	if (intr & BIT_ULL(32))
+		npa_event_count->unmap_pf_count++;
+
+	val = FIELD_GET(GENMASK(31, 16), intr);
+	rvu_npa_inpq_to_cnt(val, npa_event_count);
+
+	val = FIELD_GET(GENMASK(15, 0), intr);
+	rvu_npa_inpq_to_cnt(val, npa_event_count);
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NPA_AF_GEN_INT, intr);
+	rvu_write64(rvu, blkaddr, NPA_AF_GEN_INT_ENA_W1C, ~0ULL);
+	devlink_health_report(rvu_dl->rvu_npa_health_reporter, "NPA_AF_GEN Error",
+			      npa_event_context);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rvu_npa_af_err_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_npa_event_ctx *npa_event_context;
+	struct rvu_npa_event_cnt *npa_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NPA, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+	npa_event_context = rvu_dl->npa_event_ctx;
+	npa_event_count = &npa_event_context->npa_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NPA_AF_ERR_INT);
+	npa_event_context->npa_af_rvu_err = intr;
+
+	if (intr & BIT_ULL(14))
+		npa_event_count->aq_inst_count++;
+
+	if (intr & BIT_ULL(13))
+		npa_event_count->aq_res_count++;
+
+	if (intr & BIT_ULL(12))
+		npa_event_count->aq_db_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NPA_AF_ERR_INT, intr);
+	rvu_write64(rvu, blkaddr, NPA_AF_ERR_INT_ENA_W1C, ~0ULL);
+	devlink_health_report(rvu_dl->rvu_npa_health_reporter, "NPA_AF_ERR Error",
+			      npa_event_context);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rvu_npa_af_ras_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_npa_event_ctx *npa_event_context;
+	struct rvu_npa_event_cnt *npa_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NPA, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	npa_event_context = rvu_dl->npa_event_ctx;
+	npa_event_count = &npa_event_context->npa_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NPA_AF_RAS);
+	npa_event_context->npa_af_rvu_ras = intr;
+
+	if (intr & BIT_ULL(34))
+		npa_event_count->poison_aq_inst_count++;
+
+	if (intr & BIT_ULL(33))
+		npa_event_count->poison_aq_res_count++;
+
+	if (intr & BIT_ULL(32))
+		npa_event_count->poison_aq_cxt_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NPA_AF_RAS, intr);
+	rvu_write64(rvu, blkaddr, NPA_AF_RAS_ENA_W1C, ~0ULL);
+	devlink_health_report(rvu_dl->rvu_npa_health_reporter, "HW NPA_AF_RAS Error reported",
+			      npa_event_context);
+	return IRQ_HANDLED;
+}
+
+static void rvu_npa_unregister_interrupts(struct rvu *rvu)
+{
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	int i, offs, blkaddr;
+	u64 reg;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NPA, 0);
+	if (blkaddr < 0)
+		return;
+
+	reg = rvu_read64(rvu, blkaddr, NPA_PRIV_AF_INT_CFG);
+	offs = reg & 0x3FF;
+
+	rvu_write64(rvu, blkaddr, NPA_AF_RVU_INT_ENA_W1C, ~0ULL);
+	rvu_write64(rvu, blkaddr, NPA_AF_GEN_INT_ENA_W1C, ~0ULL);
+	rvu_write64(rvu, blkaddr, NPA_AF_ERR_INT_ENA_W1C, ~0ULL);
+	rvu_write64(rvu, blkaddr, NPA_AF_RAS_ENA_W1C, ~0ULL);
+
+	for (i = 0; i < NPA_AF_INT_VEC_CNT; i++)
+		if (rvu->irq_allocated[offs + i]) {
+			free_irq(pci_irq_vector(rvu->pdev, offs + i), rvu_dl);
+			rvu->irq_allocated[offs + i] = false;
+		}
+}
+
+static int rvu_npa_register_interrupts(struct rvu *rvu)
+{
+	int blkaddr, base;
+	bool rc;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NPA, 0);
+	if (blkaddr < 0)
+		return blkaddr;
+
+	/* Get NPA AF MSIX vectors offset. */
+	base = rvu_read64(rvu, blkaddr, NPA_PRIV_AF_INT_CFG) & 0x3ff;
+	if (!base) {
+		dev_warn(rvu->dev,
+			 "Failed to get NPA_AF_INT vector offsets\n");
+		return 0;
+	}
+
+	/* Register and enable NPA_AF_RVU_INT interrupt */
+	rc = rvu_common_request_irq(rvu, base +  NPA_AF_INT_VEC_RVU,
+				    "NPA_AF_RVU_INT",
+				    rvu_npa_af_rvu_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NPA_AF_RVU_INT_ENA_W1S, ~0ULL);
+
+	/* Register and enable NPA_AF_GEN_INT interrupt */
+	rc = rvu_common_request_irq(rvu, base + NPA_AF_INT_VEC_GEN,
+				    "NPA_AF_RVU_GEN",
+				    rvu_npa_af_gen_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NPA_AF_GEN_INT_ENA_W1S, ~0ULL);
+
+	/* Register and enable NPA_AF_ERR_INT interrupt */
+	rc = rvu_common_request_irq(rvu, base + NPA_AF_INT_VEC_AF_ERR,
+				    "NPA_AF_ERR_INT",
+				    rvu_npa_af_err_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NPA_AF_ERR_INT_ENA_W1S, ~0ULL);
+
+	/* Register and enable NPA_AF_RAS interrupt */
+	rc = rvu_common_request_irq(rvu, base + NPA_AF_INT_VEC_POISON,
+				    "NPA_AF_RAS",
+				    rvu_npa_af_ras_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NPA_AF_RAS_ENA_W1S, ~0ULL);
+
+	return 0;
+err:
+	rvu_npa_unregister_interrupts(rvu);
+	return rc;
+}
+
+static int rvu_npa_report_show(struct devlink_fmsg *fmsg, struct rvu *rvu)
+{
+	struct rvu_npa_event_ctx *npa_event_context;
+	struct rvu_npa_event_cnt *npa_event_count;
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	int err;
+
+	npa_event_context = rvu_dl->npa_event_ctx;
+	npa_event_count = &npa_event_context->npa_event_cnt;
+	err = rvu_report_pair_start(fmsg, "NPA_AF_GENERAL");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tUnmap PF Error",
+					npa_event_count->unmap_pf_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_arr_pair_nest_start(fmsg, "\tNIX");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\t0: free disabled RX",
+					npa_event_count->free_dis_nix0_rx_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "free disabled TX",
+					npa_event_count->free_dis_nix0_tx_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\t1: free disabled RX",
+					npa_event_count->free_dis_nix1_rx_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "free disabled TX",
+					npa_event_count->free_dis_nix1_tx_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_arr_pair_nest_end(fmsg);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tFree Disabled for SSO",
+					npa_event_count->free_dis_sso_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tFree Disabled for TIM",
+					npa_event_count->free_dis_tim_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tFree Disabled for DPI",
+					npa_event_count->free_dis_dpi_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tFree Disabled for AURA",
+					npa_event_count->free_dis_aura_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tAlloc Disabled for Resvd",
+					npa_event_count->alloc_dis_rsvd_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	err = rvu_report_pair_start(fmsg, "NPA_AF_ERR");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tMemory Fault on NPA_AQ_INST_S read",
+					npa_event_count->aq_inst_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory Fault on NPA_AQ_RES_S write",
+					npa_event_count->aq_res_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tAQ Doorbell Error",
+					npa_event_count->aq_db_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on NPA_AQ_INST_S read",
+					npa_event_count->poison_aq_inst_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on NPA_AQ_RES_S write",
+					npa_event_count->poison_aq_res_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on HW context read",
+					npa_event_count->poison_aq_cxt_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	err = rvu_report_pair_start(fmsg, "NPA_AF_RVU");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tUnmap Slot Error",
+					npa_event_count->unmap_slot_count);
+	if (err)
+		return err;
+	return rvu_report_pair_end(fmsg);
+}
+
+static int rvu_npa_reporter_dump(struct devlink_health_reporter *reporter,
+				 struct devlink_fmsg *fmsg, void *ctx,
+				 struct netlink_ext_ack *netlink_extack)
+{
+	struct rvu *rvu = devlink_health_reporter_priv(reporter);
+
+	return rvu_npa_report_show(fmsg, rvu);
+}
+
+static int rvu_npa_reporter_recover(struct devlink_health_reporter *reporter,
+				    void *ctx, struct netlink_ext_ack *netlink_extack)
+{
+	struct rvu *rvu = devlink_health_reporter_priv(reporter);
+	struct rvu_npa_event_ctx *npa_event_ctx = ctx;
+	int blkaddr;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NPA, 0);
+	if (blkaddr < 0)
+		return blkaddr;
+
+	if (npa_event_ctx->npa_af_rvu_int) {
+		rvu_write64(rvu, blkaddr, NPA_AF_RVU_INT_ENA_W1S, ~0ULL);
+		npa_event_ctx->npa_af_rvu_int = 0;
+	}
+	if (npa_event_ctx->npa_af_rvu_gen) {
+		rvu_write64(rvu, blkaddr, NPA_AF_GEN_INT_ENA_W1S, ~0ULL);
+		npa_event_ctx->npa_af_rvu_gen = 0;
+	}
+	if (npa_event_ctx->npa_af_rvu_err) {
+		rvu_write64(rvu, blkaddr, NPA_AF_ERR_INT_ENA_W1S, ~0ULL);
+		npa_event_ctx->npa_af_rvu_err = 0;
+	}
+	if (npa_event_ctx->npa_af_rvu_ras) {
+		rvu_write64(rvu, blkaddr, NPA_AF_RAS_ENA_W1S, ~0ULL);
+		npa_event_ctx->npa_af_rvu_ras = 0;
+	}
+
+	return 0;
+}
+
+static const struct devlink_health_reporter_ops rvu_npa_hw_fault_reporter_ops = {
+		.name = "hw_npa",
+		.recover = rvu_npa_reporter_recover,
+		.dump = rvu_npa_reporter_dump,
+};
+
+static int rvu_npa_health_reporters_create(struct rvu_devlink *rvu_dl)
+{
+	struct devlink_health_reporter *rvu_npa_health_reporter;
+	struct rvu_npa_event_ctx *npa_event_context;
+	struct rvu *rvu = rvu_dl->rvu;
+
+	npa_event_context = kzalloc(sizeof(*npa_event_context), GFP_KERNEL);
+	if (!npa_event_context)
+		return -ENOMEM;
+
+	rvu_dl->npa_event_ctx = npa_event_context;
+	rvu_npa_health_reporter = devlink_health_reporter_create(rvu_dl->dl,
+								 &rvu_npa_hw_fault_reporter_ops,
+								 0, rvu);
+	if (IS_ERR(rvu_npa_health_reporter)) {
+		dev_warn(rvu->dev, "Failed to create npa reporter, err =%ld\n",
+			 PTR_ERR(rvu_npa_health_reporter));
+		return PTR_ERR(rvu_npa_health_reporter);
+	}
+
+	rvu_dl->rvu_npa_health_reporter = rvu_npa_health_reporter;
+	rvu_npa_register_interrupts(rvu);
+
+	return 0;
+}
+
+static void rvu_npa_health_reporters_destroy(struct rvu_devlink *rvu_dl)
+{
+	struct rvu *rvu = rvu_dl->rvu;
+
+	if (!rvu_dl->rvu_npa_health_reporter)
+		return;
+
+	devlink_health_reporter_destroy(rvu_dl->rvu_npa_health_reporter);
+	rvu_npa_unregister_interrupts(rvu);
+	kfree(rvu_dl->npa_event_ctx);
+}
+
+static int rvu_health_reporters_create(struct rvu *rvu)
+{
+	struct rvu_devlink *rvu_dl;
+
+	rvu_dl = rvu->rvu_dl;
+	return rvu_npa_health_reporters_create(rvu_dl);
+}
+
+static void rvu_health_reporters_destroy(struct rvu *rvu)
+{
+	struct rvu_devlink *rvu_dl;
+
+	if (!rvu->rvu_dl)
+		return;
+
+	rvu_dl = rvu->rvu_dl;
+	rvu_npa_health_reporters_destroy(rvu_dl);
+}
+
 static int rvu_devlink_info_get(struct devlink *devlink, struct devlink_info_req *req,
 				struct netlink_ext_ack *extack)
 {
@@ -55,7 +549,8 @@ int rvu_register_dl(struct rvu *rvu)
 	rvu_dl->dl = dl;
 	rvu_dl->rvu = rvu;
 	rvu->rvu_dl = rvu_dl;
-	return 0;
+
+	return rvu_health_reporters_create(rvu);
 }
 
 void rvu_unregister_dl(struct rvu *rvu)
@@ -66,6 +561,7 @@ void rvu_unregister_dl(struct rvu *rvu)
 	if (!dl)
 		return;
 
+	rvu_health_reporters_destroy(rvu);
 	devlink_unregister(dl);
 	devlink_free(dl);
 	kfree(rvu_dl);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
index 1ed6dde79a4e..e04603a9952c 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
@@ -8,9 +8,40 @@
 #ifndef RVU_DEVLINK_H
 #define  RVU_DEVLINK_H
 
+struct rvu_npa_event_cnt {
+	u64 unmap_slot_count;
+	u64 unmap_pf_count;
+	u64 free_dis_nix0_rx_count;
+	u64 free_dis_nix0_tx_count;
+	u64 free_dis_nix1_rx_count;
+	u64 free_dis_nix1_tx_count;
+	u64 free_dis_sso_count;
+	u64 free_dis_tim_count;
+	u64 free_dis_dpi_count;
+	u64 free_dis_aura_count;
+	u64 free_dis_rsvd_count;
+	u64 alloc_dis_rsvd_count;
+	u64 aq_inst_count;
+	u64 aq_res_count;
+	u64 aq_db_count;
+	u64 poison_aq_inst_count;
+	u64 poison_aq_res_count;
+	u64 poison_aq_cxt_count;
+};
+
+struct rvu_npa_event_ctx {
+	struct rvu_npa_event_cnt npa_event_cnt;
+	u64 npa_af_rvu_int;
+	u64 npa_af_rvu_gen;
+	u64 npa_af_rvu_err;
+	u64 npa_af_rvu_ras;
+};
+
 struct rvu_devlink {
 	struct devlink *dl;
 	struct rvu *rvu;
+	struct devlink_health_reporter *rvu_npa_health_reporter;
+	struct rvu_npa_event_ctx *npa_event_ctx;
 };
 
 /* Devlink APIs */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
index 723643868589..e2153d47c373 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
@@ -64,6 +64,16 @@ enum rvu_af_int_vec_e {
 	RVU_AF_INT_VEC_CNT    = 0x5,
 };
 
+/* NPA Admin function Interrupt Vector Enumeration */
+enum npa_af_int_vec_e {
+	NPA_AF_INT_VEC_RVU	= 0x0,
+	NPA_AF_INT_VEC_GEN	= 0x1,
+	NPA_AF_INT_VEC_AQ_DONE	= 0x2,
+	NPA_AF_INT_VEC_AF_ERR	= 0x3,
+	NPA_AF_INT_VEC_POISON	= 0x4,
+	NPA_AF_INT_VEC_CNT	= 0x5,
+};
+
 /**
  * RVU PF Interrupt Vector Enumeration
  */
@@ -104,6 +114,19 @@ enum npa_aq_instop {
 	NPA_AQ_INSTOP_UNLOCK = 0x5,
 };
 
+/* ALLOC/FREE input queues Enumeration from coprocessors */
+enum npa_inpq {
+	NPA_INPQ_NIX0_RX       = 0x0,
+	NPA_INPQ_NIX0_TX       = 0x1,
+	NPA_INPQ_NIX1_RX       = 0x2,
+	NPA_INPQ_NIX1_TX       = 0x3,
+	NPA_INPQ_SSO           = 0x4,
+	NPA_INPQ_TIM           = 0x5,
+	NPA_INPQ_DPI           = 0x6,
+	NPA_INPQ_AURA_OP       = 0xe,
+	NPA_INPQ_INTERNAL_RSV  = 0xf,
+};
+
 /* NPA admin queue instruction structure */
 struct npa_aq_inst_s {
 #if defined(__BIG_ENDIAN_BITFIELD)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCHv5 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-26 14:02 [PATCHv5 net-next 0/3] Add devlink and devlink health reporters to George Cherian
  2020-11-26 14:02 ` [PATCHv5 net-next 1/3] octeontx2-af: Add devlink suppoort to af driver George Cherian
  2020-11-26 14:02 ` [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA George Cherian
@ 2020-11-26 14:02 ` George Cherian
  2 siblings, 0 replies; 10+ messages in thread
From: George Cherian @ 2020-11-26 14:02 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: kuba, davem, sgoutham, lcherian, gakula, masahiroy,
	george.cherian, willemdebruijn.kernel, saeed, jiri

Add health reporters for RVU NIX block.
NIX Health reporter handle following HW event groups
 - GENERAL events
 - RAS events
 - RVU event
An event counter per event is maintained in SW.

Output:
 # devlink health
 pci/0002:01:00.0:
   reporter hw_npa
     state healthy error 0 recover 0
   reporter hw_nix
     state healthy error 0 recover 0
 # devlink  health dump show pci/0002:01:00.0 reporter hw_nix
  NIX_AF_GENERAL:
         Memory Fault on NIX_AQ_INST_S read: 0
         Memory Fault on NIX_AQ_RES_S write: 0
         AQ Doorbell error: 0
         Rx on unmapped PF_FUNC: 0
         Rx multicast replication error: 0
         Memory fault on NIX_RX_MCE_S read: 0
         Memory fault on multicast WQE read: 0
         Memory fault on mirror WQE read: 0
         Memory fault on mirror pkt write: 0
         Memory fault on multicast pkt write: 0
   NIX_AF_RAS:
         Poisoned data on NIX_AQ_INST_S read: 0
         Poisoned data on NIX_AQ_RES_S write: 0
         Poisoned data on HW context read: 0
         Poisoned data on packet read from mirror buffer: 0
         Poisoned data on packet read from mcast buffer: 0
         Poisoned data on WQE read from mirror buffer: 0
         Poisoned data on WQE read from multicast buffer: 0
         Poisoned data on NIX_RX_MCE_S read: 0
   NIX_AF_RVU:
         Unmap Slot Error: 0

Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: George Cherian <george.cherian@marvell.com>
---
 .../marvell/octeontx2/af/rvu_devlink.c        | 414 +++++++++++++++++-
 .../marvell/octeontx2/af/rvu_devlink.h        |  31 ++
 .../marvell/octeontx2/af/rvu_struct.h         |  10 +
 3 files changed, 453 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
index 377264d65d0c..2f20d8b9eef3 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
@@ -35,6 +35,131 @@ static int rvu_report_pair_end(struct devlink_fmsg *fmsg)
 	return devlink_fmsg_pair_nest_end(fmsg);
 }
 
+static irqreturn_t rvu_nix_af_rvu_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_nix_event_ctx *nix_event_context;
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	nix_event_context = rvu_dl->nix_event_ctx;
+	nix_event_count = &nix_event_context->nix_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NIX_AF_RVU_INT);
+	nix_event_context->nix_af_rvu_int = intr;
+
+	if (intr & BIT_ULL(0))
+		nix_event_count->unmap_slot_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT, intr);
+	rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT_ENA_W1C, ~0ULL);
+	devlink_health_report(rvu_dl->rvu_nix_health_reporter, "NIX_AF_RVU Error",
+			      nix_event_context);
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rvu_nix_af_err_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_nix_event_ctx *nix_event_context;
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	nix_event_context = rvu_dl->nix_event_ctx;
+	nix_event_count = &nix_event_context->nix_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NIX_AF_ERR_INT);
+	nix_event_context->nix_af_rvu_err = intr;
+
+	if (intr & BIT_ULL(14))
+		nix_event_count->aq_inst_count++;
+	if (intr & BIT_ULL(13))
+		nix_event_count->aq_res_count++;
+	if (intr & BIT_ULL(12))
+		nix_event_count->aq_db_count++;
+	if (intr & BIT_ULL(6))
+		nix_event_count->rx_on_unmap_pf_count++;
+	if (intr & BIT_ULL(5))
+		nix_event_count->rx_mcast_repl_count++;
+	if (intr & BIT_ULL(4))
+		nix_event_count->rx_mcast_memfault_count++;
+	if (intr & BIT_ULL(3))
+		nix_event_count->rx_mcast_wqe_memfault_count++;
+	if (intr & BIT_ULL(2))
+		nix_event_count->rx_mirror_wqe_memfault_count++;
+	if (intr & BIT_ULL(1))
+		nix_event_count->rx_mirror_pktw_memfault_count++;
+	if (intr & BIT_ULL(0))
+		nix_event_count->rx_mcast_pktw_memfault_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT, intr);
+	rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT_ENA_W1C, ~0ULL);
+	devlink_health_report(rvu_dl->rvu_nix_health_reporter, "NIX_AF_ERR Error",
+			      nix_event_context);
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rvu_nix_af_ras_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_nix_event_ctx *nix_event_context;
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	nix_event_context = rvu_dl->nix_event_ctx;
+	nix_event_count = &nix_event_context->nix_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NIX_AF_RAS);
+	nix_event_context->nix_af_rvu_ras = intr;
+
+	if (intr & BIT_ULL(34))
+		nix_event_count->poison_aq_inst_count++;
+	if (intr & BIT_ULL(33))
+		nix_event_count->poison_aq_res_count++;
+	if (intr & BIT_ULL(32))
+		nix_event_count->poison_aq_cxt_count++;
+	if (intr & BIT_ULL(4))
+		nix_event_count->rx_mirror_data_poison_count++;
+	if (intr & BIT_ULL(3))
+		nix_event_count->rx_mcast_data_poison_count++;
+	if (intr & BIT_ULL(2))
+		nix_event_count->rx_mirror_wqe_poison_count++;
+	if (intr & BIT_ULL(1))
+		nix_event_count->rx_mcast_wqe_poison_count++;
+	if (intr & BIT_ULL(0))
+		nix_event_count->rx_mce_poison_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NIX_AF_RAS, intr);
+	rvu_write64(rvu, blkaddr, NIX_AF_RAS_ENA_W1C, ~0ULL);
+	devlink_health_report(rvu_dl->rvu_nix_health_reporter, "NIX_AF_RAS Error",
+			      nix_event_context);
+
+	return IRQ_HANDLED;
+}
+
 static bool rvu_common_request_irq(struct rvu *rvu, int offset,
 				   const char *name, irq_handler_t fn)
 {
@@ -52,6 +177,285 @@ static bool rvu_common_request_irq(struct rvu *rvu, int offset,
 	return rvu->irq_allocated[offset];
 }
 
+static void rvu_nix_blk_unregister_interrupts(struct rvu *rvu,
+					      int blkaddr)
+{
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	int offs, i;
+
+	offs = rvu_read64(rvu, blkaddr, NIX_PRIV_AF_INT_CFG) & 0x3ff;
+	if (!offs)
+		return;
+
+	rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT_ENA_W1C, ~0ULL);
+	rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT_ENA_W1C, ~0ULL);
+	rvu_write64(rvu, blkaddr, NIX_AF_RAS_ENA_W1C, ~0ULL);
+
+	if (rvu->irq_allocated[offs + NIX_AF_INT_VEC_RVU]) {
+		free_irq(pci_irq_vector(rvu->pdev, offs + NIX_AF_INT_VEC_RVU),
+			 rvu_dl);
+		rvu->irq_allocated[offs + NIX_AF_INT_VEC_RVU] = false;
+	}
+
+	for (i = NIX_AF_INT_VEC_AF_ERR; i < NIX_AF_INT_VEC_CNT; i++)
+		if (rvu->irq_allocated[offs + i]) {
+			free_irq(pci_irq_vector(rvu->pdev, offs + i), rvu_dl);
+			rvu->irq_allocated[offs + i] = false;
+		}
+}
+
+static void rvu_nix_unregister_interrupts(struct rvu *rvu)
+{
+	int blkaddr = 0;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return;
+
+	rvu_nix_blk_unregister_interrupts(rvu, blkaddr);
+}
+
+static int rvu_nix_blk_register_interrupts(struct rvu *rvu,
+					   int blkaddr)
+{
+	int base;
+	bool rc;
+
+	/* Get NIX AF MSIX vectors offset. */
+	base = rvu_read64(rvu, blkaddr, NIX_PRIV_AF_INT_CFG) & 0x3ff;
+	if (!base) {
+		dev_warn(rvu->dev,
+			 "Failed to get NIX%d NIX_AF_INT vector offsets\n",
+			 blkaddr - BLKADDR_NIX0);
+		return 0;
+	}
+	/* Register and enable NIX_AF_RVU_INT interrupt */
+	rc = rvu_common_request_irq(rvu, base +  NIX_AF_INT_VEC_RVU,
+				    "NIX_AF_RVU_INT",
+				    rvu_nix_af_rvu_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT_ENA_W1S, ~0ULL);
+
+	/* Register and enable NIX_AF_ERR_INT interrupt */
+	rc = rvu_common_request_irq(rvu, base + NIX_AF_INT_VEC_AF_ERR,
+				    "NIX_AF_ERR_INT",
+				    rvu_nix_af_err_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT_ENA_W1S, ~0ULL);
+
+	/* Register and enable NIX_AF_RAS interrupt */
+	rc = rvu_common_request_irq(rvu, base + NIX_AF_INT_VEC_POISON,
+				    "NIX_AF_RAS",
+				    rvu_nix_af_ras_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NIX_AF_RAS_ENA_W1S, ~0ULL);
+
+	return 0;
+err:
+	rvu_nix_unregister_interrupts(rvu);
+	return -1;
+}
+
+static int rvu_nix_register_interrupts(struct rvu *rvu)
+{
+	int blkaddr = 0;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return blkaddr;
+
+	rvu_nix_blk_register_interrupts(rvu, blkaddr);
+
+	return 0;
+}
+
+static int rvu_nix_report_show(struct devlink_fmsg *fmsg, struct rvu *rvu)
+{
+	struct rvu_nix_event_ctx *nix_event_context;
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	int err;
+
+	nix_event_context = rvu_dl->nix_event_ctx;
+	nix_event_count = &nix_event_context->nix_event_cnt;
+	err = rvu_report_pair_start(fmsg, "NIX_AF_GENERAL");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tMemory Fault on NIX_AQ_INST_S read",
+					nix_event_count->aq_inst_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory Fault on NIX_AQ_RES_S write",
+					nix_event_count->aq_res_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tAQ Doorbell error",
+					nix_event_count->aq_db_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tRx on unmapped PF_FUNC",
+					nix_event_count->rx_on_unmap_pf_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tRx multicast replication error",
+					nix_event_count->rx_mcast_repl_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on NIX_RX_MCE_S read",
+					nix_event_count->rx_mcast_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on multicast WQE read",
+					nix_event_count->rx_mcast_wqe_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on mirror WQE read",
+					nix_event_count->rx_mirror_wqe_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on mirror pkt write",
+					nix_event_count->rx_mirror_pktw_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on multicast pkt write",
+					nix_event_count->rx_mcast_pktw_memfault_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	err = rvu_report_pair_start(fmsg, "NIX_AF_RAS");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tPoisoned data on NIX_AQ_INST_S read",
+					nix_event_count->poison_aq_inst_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on NIX_AQ_RES_S write",
+					nix_event_count->poison_aq_res_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on HW context read",
+					nix_event_count->poison_aq_cxt_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on packet read from mirror buffer",
+					nix_event_count->rx_mirror_data_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on packet read from mcast buffer",
+					nix_event_count->rx_mcast_data_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on WQE read from mirror buffer",
+					nix_event_count->rx_mirror_wqe_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on WQE read from multicast buffer",
+					nix_event_count->rx_mcast_wqe_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on NIX_RX_MCE_S read",
+					nix_event_count->rx_mce_poison_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	err = rvu_report_pair_start(fmsg, "NIX_AF_RVU");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tUnmap Slot Error",
+					nix_event_count->unmap_slot_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	return 0;
+}
+
+static int rvu_nix_reporter_dump(struct devlink_health_reporter *reporter,
+				 struct devlink_fmsg *fmsg, void *ctx,
+				 struct netlink_ext_ack *netlink_extack)
+{
+	struct rvu *rvu = devlink_health_reporter_priv(reporter);
+
+	return rvu_nix_report_show(fmsg, rvu);
+}
+
+static int rvu_nix_reporter_recover(struct devlink_health_reporter *reporter,
+				    void *ctx, struct netlink_ext_ack *netlink_extack)
+{
+	struct rvu *rvu = devlink_health_reporter_priv(reporter);
+	struct rvu_nix_event_ctx *nix_event_ctx = ctx;
+	int blkaddr;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return blkaddr;
+
+	if (nix_event_ctx->nix_af_rvu_int) {
+		rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT_ENA_W1S, ~0ULL);
+		nix_event_ctx->nix_af_rvu_int = 0;
+	}
+	if (nix_event_ctx->nix_af_rvu_err) {
+		rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT_ENA_W1S, ~0ULL);
+		nix_event_ctx->nix_af_rvu_err = 0;
+	}
+	if (nix_event_ctx->nix_af_rvu_ras) {
+		rvu_write64(rvu, blkaddr, NIX_AF_RAS_ENA_W1S, ~0ULL);
+		nix_event_ctx->nix_af_rvu_ras = 0;
+	}
+
+	return 0;
+}
+
+static const struct devlink_health_reporter_ops rvu_nix_fault_reporter_ops = {
+		.name = "hw_nix",
+		.dump = rvu_nix_reporter_dump,
+		.recover = rvu_nix_reporter_recover,
+};
+
+static int rvu_nix_health_reporters_create(struct rvu_devlink *rvu_dl)
+{
+	struct devlink_health_reporter *rvu_nix_health_reporter;
+	struct rvu_nix_event_ctx *nix_event_context;
+	struct rvu *rvu = rvu_dl->rvu;
+
+	nix_event_context = kzalloc(sizeof(*nix_event_context), GFP_KERNEL);
+	if (!nix_event_context)
+		return -ENOMEM;
+
+	rvu_dl->nix_event_ctx = nix_event_context;
+	rvu_nix_health_reporter = devlink_health_reporter_create(rvu_dl->dl,
+								 &rvu_nix_fault_reporter_ops,
+								 0, rvu);
+	if (IS_ERR(rvu_nix_health_reporter)) {
+		dev_warn(rvu->dev, "Failed to create nix reporter, err = %ld\n",
+			 PTR_ERR(rvu_nix_health_reporter));
+		return PTR_ERR(rvu_nix_health_reporter);
+	}
+
+	rvu_dl->rvu_nix_health_reporter = rvu_nix_health_reporter;
+	rvu_nix_register_interrupts(rvu);
+	return 0;
+}
+
+static void rvu_nix_health_reporters_destroy(struct rvu_devlink *rvu_dl)
+{
+	struct rvu *rvu = rvu_dl->rvu;
+
+	if (!rvu_dl->rvu_nix_health_reporter)
+		return;
+
+	devlink_health_reporter_destroy(rvu_dl->rvu_nix_health_reporter);
+	rvu_nix_unregister_interrupts(rvu);
+}
+
 static irqreturn_t rvu_npa_af_rvu_intr_handler(int irq, void *rvu_irq)
 {
 	struct rvu_npa_event_ctx *npa_event_context;
@@ -214,7 +618,7 @@ static irqreturn_t rvu_npa_af_ras_intr_handler(int irq, void *rvu_irq)
 	/* Clear interrupts */
 	rvu_write64(rvu, blkaddr, NPA_AF_RAS, intr);
 	rvu_write64(rvu, blkaddr, NPA_AF_RAS_ENA_W1C, ~0ULL);
-	devlink_health_report(rvu_dl->rvu_npa_health_reporter, "HW NPA_AF_RAS Error reported",
+	devlink_health_report(rvu_dl->rvu_npa_health_reporter, "NPA_AF_RAS Error",
 			      npa_event_context);
 	return IRQ_HANDLED;
 }
@@ -487,9 +891,14 @@ static void rvu_npa_health_reporters_destroy(struct rvu_devlink *rvu_dl)
 static int rvu_health_reporters_create(struct rvu *rvu)
 {
 	struct rvu_devlink *rvu_dl;
+	int err;
 
 	rvu_dl = rvu->rvu_dl;
-	return rvu_npa_health_reporters_create(rvu_dl);
+	err = rvu_npa_health_reporters_create(rvu_dl);
+	if (err)
+		return err;
+
+	return rvu_nix_health_reporters_create(rvu_dl);
 }
 
 static void rvu_health_reporters_destroy(struct rvu *rvu)
@@ -501,6 +910,7 @@ static void rvu_health_reporters_destroy(struct rvu *rvu)
 
 	rvu_dl = rvu->rvu_dl;
 	rvu_npa_health_reporters_destroy(rvu_dl);
+	rvu_nix_health_reporters_destroy(rvu_dl);
 }
 
 static int rvu_devlink_info_get(struct devlink *devlink, struct devlink_info_req *req,
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
index e04603a9952c..cfc513d945a0 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
@@ -37,11 +37,42 @@ struct rvu_npa_event_ctx {
 	u64 npa_af_rvu_ras;
 };
 
+struct rvu_nix_event_cnt {
+	u64 unmap_slot_count;
+	u64 aq_inst_count;
+	u64 aq_res_count;
+	u64 aq_db_count;
+	u64 rx_on_unmap_pf_count;
+	u64 rx_mcast_repl_count;
+	u64 rx_mcast_memfault_count;
+	u64 rx_mcast_wqe_memfault_count;
+	u64 rx_mirror_wqe_memfault_count;
+	u64 rx_mirror_pktw_memfault_count;
+	u64 rx_mcast_pktw_memfault_count;
+	u64 poison_aq_inst_count;
+	u64 poison_aq_res_count;
+	u64 poison_aq_cxt_count;
+	u64 rx_mirror_data_poison_count;
+	u64 rx_mcast_data_poison_count;
+	u64 rx_mirror_wqe_poison_count;
+	u64 rx_mcast_wqe_poison_count;
+	u64 rx_mce_poison_count;
+};
+
+struct rvu_nix_event_ctx {
+	struct rvu_nix_event_cnt nix_event_cnt;
+	u64 nix_af_rvu_int;
+	u64 nix_af_rvu_err;
+	u64 nix_af_rvu_ras;
+};
+
 struct rvu_devlink {
 	struct devlink *dl;
 	struct rvu *rvu;
 	struct devlink_health_reporter *rvu_npa_health_reporter;
 	struct rvu_npa_event_ctx *npa_event_ctx;
+	struct devlink_health_reporter *rvu_nix_health_reporter;
+	struct rvu_nix_event_ctx *nix_event_ctx;
 };
 
 /* Devlink APIs */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
index e2153d47c373..5e15f4fc11e3 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
@@ -74,6 +74,16 @@ enum npa_af_int_vec_e {
 	NPA_AF_INT_VEC_CNT	= 0x5,
 };
 
+/* NIX Admin function Interrupt Vector Enumeration */
+enum nix_af_int_vec_e {
+	NIX_AF_INT_VEC_RVU	= 0x0,
+	NIX_AF_INT_VEC_GEN	= 0x1,
+	NIX_AF_INT_VEC_AQ_DONE	= 0x2,
+	NIX_AF_INT_VEC_AF_ERR	= 0x3,
+	NIX_AF_INT_VEC_POISON	= 0x4,
+	NIX_AF_INT_VEC_CNT	= 0x5,
+};
+
 /**
  * RVU PF Interrupt Vector Enumeration
  */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCHv5 net-next 1/3] octeontx2-af: Add devlink suppoort to af driver
  2020-11-26 14:02 ` [PATCHv5 net-next 1/3] octeontx2-af: Add devlink suppoort to af driver George Cherian
@ 2020-12-01  2:22   ` Jakub Kicinski
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Kicinski @ 2020-12-01  2:22 UTC (permalink / raw)
  To: George Cherian
  Cc: netdev, linux-kernel, davem, sgoutham, lcherian, gakula,
	masahiroy, willemdebruijn.kernel, saeed, jiri

On Thu, 26 Nov 2020 19:32:49 +0530 George Cherian wrote:
> Add devlink support to AF driver. Basic devlink support is added.
> Currently info_get is the only supported devlink ops.
> 
> devlink ouptput looks like this
>  # devlink dev
>  pci/0002:01:00.0
>  # devlink dev info
>  pci/0002:01:00.0:
>   driver octeontx2-af
>   versions:
>       fixed:
>         mbox version: 9

You need to document what this version is exactly. See how other
drivers document things. It's also strongly preferred to use one 
of the existing identifiers if any one is matching the semantics.

Fixed versions are for hardware versions, like ASIC rev or PCB board
id or version.

Your thing looks like a SW message format. It's not even FW.
It doesn't belong in devlink at all.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA
  2020-11-26 14:02 ` [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA George Cherian
@ 2020-12-01  2:29   ` Jakub Kicinski
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Kicinski @ 2020-12-01  2:29 UTC (permalink / raw)
  To: George Cherian
  Cc: netdev, linux-kernel, davem, sgoutham, lcherian, gakula,
	masahiroy, willemdebruijn.kernel, saeed, jiri

On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> Add health reporters for RVU NPA block.
> NPA Health reporters handle following HW event groups
>  - GENERAL events
>  - ERROR events
>  - RAS events
>  - RVU event
> An event counter per event is maintained in SW.
> 
> Output:
>  # devlink health
>  pci/0002:01:00.0:
>    reporter hw_npa
>      state healthy error 0 recover 0
>  # devlink  health dump show pci/0002:01:00.0 reporter hw_npa
>  NPA_AF_GENERAL:
>         Unmap PF Error: 0
>         NIX:
>         0: free disabled RX: 0 free disabled TX: 0
>         1: free disabled RX: 0 free disabled TX: 0
>         Free Disabled for SSO: 0
>         Free Disabled for TIM: 0
>         Free Disabled for DPI: 0
>         Free Disabled for AURA: 0
>         Alloc Disabled for Resvd: 0
>   NPA_AF_ERR:
>         Memory Fault on NPA_AQ_INST_S read: 0
>         Memory Fault on NPA_AQ_RES_S write: 0
>         AQ Doorbell Error: 0
>         Poisoned data on NPA_AQ_INST_S read: 0
>         Poisoned data on NPA_AQ_RES_S write: 0
>         Poisoned data on HW context read: 0
>   NPA_AF_RVU:
>         Unmap Slot Error: 0

You seem to have missed the feedback Saeed and I gave you on v2.

Did you test this with the errors actually triggering? Devlink should
store only one dump, are the counters not going to get out of sync
unless something clears the dump every time it triggers?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA
  2020-12-01  5:23   ` George Cherian
@ 2020-12-01 18:59     ` Jakub Kicinski
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Kicinski @ 2020-12-01 18:59 UTC (permalink / raw)
  To: George Cherian
  Cc: netdev, linux-kernel, davem, Sunil Kovvuri Goutham, Linu Cherian,
	Geethasowjanya Akula, masahiroy, willemdebruijn.kernel, saeed,
	jiri

On Tue, 1 Dec 2020 05:23:23 +0000 George Cherian wrote:
> > > > You seem to have missed the feedback Saeed and I gave you on v2.
> > > >
> > > > Did you test this with the errors actually triggering? Devlink
> > > > should store only  
> > > Yes, the same was tested using devlink health test interface by
> > > injecting errors.
> > > The dump gets generated automatically and the counters do get out of
> > > sync, in case of continuous error.
> > > That wouldn't be much of an issue as the user could manually trigger a
> > > dump clear and Re-dump the counters to get the exact status of the
> > > counters at any point of time.  
> > 
> > Now that recover op is added the devlink error counter and recover counter
> > will be proper. The internal counter for each event is needed just to
> > understand within a specific reporter, how many such events occurred.
> > 
> > Following is the log snippet of the devlink health test being done on hw_nix
> > reporter.
> > # for i in `seq 1 33` ; do  devlink health test pci/0002:01:00.0 reporter hw_nix;
> > done //Inject 33 errors (16  of NIX_AF_RVU and 17 of NIX_AF_RAS and
> > NIX_AF_GENERAL errors) # devlink health
> > pci/0002:01:00.0:
> >   reporter hw_npa
> >     state healthy error 0 recover 0 grace_period 0 auto_recover true
> > auto_dump true
> >   reporter hw_nix
> >     state healthy error 250 recover 250 last_dump_date 1970-01-01
> > last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true  
> Oops, There was a log copy paste error above its not 250 (that was from a run, in which test was done
> for 250 error injections)  
> # devlink health
> pci/0002:01:00.0:
>   reporter hw_npa
>     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
>   reporter hw_nix
>     state healthy error 33 recover 33

I thought it'd be better to just add each error as its own reporter
rather than combining them and abusing context for reporting detailed
stats.

This seems to be harder to get done than I thought. Maybe just go back
to the prints and we can move on.

> last_dump_date 1970-01-01 last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true

Why the weird date? Is this something on your system?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA
  2020-12-01  5:18 ` George Cherian
@ 2020-12-01  5:23   ` George Cherian
  2020-12-01 18:59     ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: George Cherian @ 2020-12-01  5:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, linux-kernel, davem, Sunil Kovvuri Goutham, Linu Cherian,
	Geethasowjanya Akula, masahiroy, willemdebruijn.kernel, saeed,
	jiri



> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 10:49 AM
> To: 'Jakub Kicinski' <kuba@kernel.org>
> Cc: 'netdev@vger.kernel.org' <netdev@vger.kernel.org>; 'linux-
> kernel@vger.kernel.org' <linux-kernel@vger.kernel.org>;
> 'davem@davemloft.net' <davem@davemloft.net>; Sunil Kovvuri Goutham
> <sgoutham@marvell.com>; Linu Cherian <lcherian@marvell.com>;
> Geethasowjanya Akula <gakula@marvell.com>; 'masahiroy@kernel.org'
> <masahiroy@kernel.org>; 'willemdebruijn.kernel@gmail.com'
> <willemdebruijn.kernel@gmail.com>; 'saeed@kernel.org'
> <saeed@kernel.org>; 'jiri@resnulli.us' <jiri@resnulli.us>
> Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
> 
> Jakub,
> 
> > -----Original Message-----
> > From: George Cherian
> > Sent: Tuesday, December 1, 2020 9:06 AM
> > To: Jakub Kicinski <kuba@kernel.org>
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > davem@davemloft.net; Sunil Kovvuri Goutham
> <sgoutham@marvell.com>;
> > Linu Cherian <lcherian@marvell.com>; Geethasowjanya Akula
> > <gakula@marvell.com>; masahiroy@kernel.org;
> > willemdebruijn.kernel@gmail.com; saeed@kernel.org; jiri@resnulli.us
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > Hi Jakub,
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Sent: Tuesday, December 1, 2020 7:59 AM
> > > To: George Cherian <gcherian@marvell.com>
> > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > davem@davemloft.net; Sunil Kovvuri Goutham
> > <sgoutham@marvell.com>;
> > > Linu Cherian <lcherian@marvell.com>; Geethasowjanya Akula
> > > <gakula@marvell.com>; masahiroy@kernel.org;
> > > willemdebruijn.kernel@gmail.com; saeed@kernel.org; jiri@resnulli.us
> > > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > > reporters for NPA
> > >
> > > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > > Add health reporters for RVU NPA block.
> > > > NPA Health reporters handle following HW event groups
> > > >  - GENERAL events
> > > >  - ERROR events
> > > >  - RAS events
> > > >  - RVU event
> > > > An event counter per event is maintained in SW.
> > > >
> > > > Output:
> > > >  # devlink health
> > > >  pci/0002:01:00.0:
> > > >    reporter hw_npa
> > > >      state healthy error 0 recover 0  # devlink  health dump show
> > > > pci/0002:01:00.0 reporter hw_npa
> > > >  NPA_AF_GENERAL:
> > > >         Unmap PF Error: 0
> > > >         NIX:
> > > >         0: free disabled RX: 0 free disabled TX: 0
> > > >         1: free disabled RX: 0 free disabled TX: 0
> > > >         Free Disabled for SSO: 0
> > > >         Free Disabled for TIM: 0
> > > >         Free Disabled for DPI: 0
> > > >         Free Disabled for AURA: 0
> > > >         Alloc Disabled for Resvd: 0
> > > >   NPA_AF_ERR:
> > > >         Memory Fault on NPA_AQ_INST_S read: 0
> > > >         Memory Fault on NPA_AQ_RES_S write: 0
> > > >         AQ Doorbell Error: 0
> > > >         Poisoned data on NPA_AQ_INST_S read: 0
> > > >         Poisoned data on NPA_AQ_RES_S write: 0
> > > >         Poisoned data on HW context read: 0
> > > >   NPA_AF_RVU:
> > > >         Unmap Slot Error: 0
> > >
> > > You seem to have missed the feedback Saeed and I gave you on v2.
> > >
> > > Did you test this with the errors actually triggering? Devlink
> > > should store only
> > Yes, the same was tested using devlink health test interface by
> > injecting errors.
> > The dump gets generated automatically and the counters do get out of
> > sync, in case of continuous error.
> > That wouldn't be much of an issue as the user could manually trigger a
> > dump clear and Re-dump the counters to get the exact status of the
> > counters at any point of time.
> 
> Now that recover op is added the devlink error counter and recover counter
> will be proper. The internal counter for each event is needed just to
> understand within a specific reporter, how many such events occurred.
> 
> Following is the log snippet of the devlink health test being done on hw_nix
> reporter.
> # for i in `seq 1 33` ; do  devlink health test pci/0002:01:00.0 reporter hw_nix;
> done //Inject 33 errors (16  of NIX_AF_RVU and 17 of NIX_AF_RAS and
> NIX_AF_GENERAL errors) # devlink health
> pci/0002:01:00.0:
>   reporter hw_npa
>     state healthy error 0 recover 0 grace_period 0 auto_recover true
> auto_dump true
>   reporter hw_nix
>     state healthy error 250 recover 250 last_dump_date 1970-01-01
> last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
Oops, There was a log copy paste error above its not 250 (that was from a run, in which test was done
for 250 error injections)  
# devlink health
pci/0002:01:00.0:
  reporter hw_npa
    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
  reporter hw_nix
    state healthy error 33 recover 33 last_dump_date 1970-01-01 last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true

> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>         Memory Fault on NIX_AQ_INST_S read: 1
>         Memory Fault on NIX_AQ_RES_S write: 1
>         AQ Doorbell error: 1
>         Rx on unmapped PF_FUNC: 1
>         Rx multicast replication error: 1
>         Memory fault on NIX_RX_MCE_S read: 1
>         Memory fault on multicast WQE read: 1
>         Memory fault on mirror WQE read: 1
>         Memory fault on mirror pkt write: 1
>         Memory fault on multicast pkt write: 1
>   NIX_AF_RAS:
>         Poisoned data on NIX_AQ_INST_S read: 1
>         Poisoned data on NIX_AQ_RES_S write: 1
>         Poisoned data on HW context read: 1
>         Poisoned data on packet read from mirror buffer: 1
>         Poisoned data on packet read from mcast buffer: 1
>         Poisoned data on WQE read from mirror buffer: 1
>         Poisoned data on WQE read from multicast buffer: 1
>         Poisoned data on NIX_RX_MCE_S read: 1
>   NIX_AF_RVU:
>         Unmap Slot Error: 0
> # devlink health dump clear pci/0002:01:00.0 reporter hw_nix # devlink
> health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>         Memory Fault on NIX_AQ_INST_S read: 17
>         Memory Fault on NIX_AQ_RES_S write: 17
>         AQ Doorbell error: 17
>         Rx on unmapped PF_FUNC: 17
>         Rx multicast replication error: 17
>         Memory fault on NIX_RX_MCE_S read: 17
>         Memory fault on multicast WQE read: 17
>         Memory fault on mirror WQE read: 17
>         Memory fault on mirror pkt write: 17
>         Memory fault on multicast pkt write: 17
>   NIX_AF_RAS:
>         Poisoned data on NIX_AQ_INST_S read: 17
>         Poisoned data on NIX_AQ_RES_S write: 17
>         Poisoned data on HW context read: 17
>         Poisoned data on packet read from mirror buffer: 17
>         Poisoned data on packet read from mcast buffer: 17
>         Poisoned data on WQE read from mirror buffer: 17
>         Poisoned data on WQE read from multicast buffer: 17
>         Poisoned data on NIX_RX_MCE_S read: 17
>   NIX_AF_RVU:
>         Unmap Slot Error: 16
> >
> > > one dump, are the counters not going to get out of sync unless
> > > something clears the dump every time it triggers?
> Also, note that auto_dump is something which can be turned off by user.
> # devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false So
> that user can dump whenever required, which will always return the correct
> counter values.
> 
> >
> > Regards,
> > -George

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA
  2020-12-01  3:35 [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA George Cherian
@ 2020-12-01  5:18 ` George Cherian
  2020-12-01  5:23   ` George Cherian
  0 siblings, 1 reply; 10+ messages in thread
From: George Cherian @ 2020-12-01  5:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, linux-kernel, davem, Sunil Kovvuri Goutham, Linu Cherian,
	Geethasowjanya Akula, masahiroy, willemdebruijn.kernel, saeed,
	jiri

Jakub,

> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 9:06 AM
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> davem@davemloft.net; Sunil Kovvuri Goutham <sgoutham@marvell.com>;
> Linu Cherian <lcherian@marvell.com>; Geethasowjanya Akula
> <gakula@marvell.com>; masahiroy@kernel.org;
> willemdebruijn.kernel@gmail.com; saeed@kernel.org; jiri@resnulli.us
> Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
> 
> Hi Jakub,
> 
> > -----Original Message-----
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Tuesday, December 1, 2020 7:59 AM
> > To: George Cherian <gcherian@marvell.com>
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > davem@davemloft.net; Sunil Kovvuri Goutham
> <sgoutham@marvell.com>;
> > Linu Cherian <lcherian@marvell.com>; Geethasowjanya Akula
> > <gakula@marvell.com>; masahiroy@kernel.org;
> > willemdebruijn.kernel@gmail.com; saeed@kernel.org; jiri@resnulli.us
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > Add health reporters for RVU NPA block.
> > > NPA Health reporters handle following HW event groups
> > >  - GENERAL events
> > >  - ERROR events
> > >  - RAS events
> > >  - RVU event
> > > An event counter per event is maintained in SW.
> > >
> > > Output:
> > >  # devlink health
> > >  pci/0002:01:00.0:
> > >    reporter hw_npa
> > >      state healthy error 0 recover 0  # devlink  health dump show
> > > pci/0002:01:00.0 reporter hw_npa
> > >  NPA_AF_GENERAL:
> > >         Unmap PF Error: 0
> > >         NIX:
> > >         0: free disabled RX: 0 free disabled TX: 0
> > >         1: free disabled RX: 0 free disabled TX: 0
> > >         Free Disabled for SSO: 0
> > >         Free Disabled for TIM: 0
> > >         Free Disabled for DPI: 0
> > >         Free Disabled for AURA: 0
> > >         Alloc Disabled for Resvd: 0
> > >   NPA_AF_ERR:
> > >         Memory Fault on NPA_AQ_INST_S read: 0
> > >         Memory Fault on NPA_AQ_RES_S write: 0
> > >         AQ Doorbell Error: 0
> > >         Poisoned data on NPA_AQ_INST_S read: 0
> > >         Poisoned data on NPA_AQ_RES_S write: 0
> > >         Poisoned data on HW context read: 0
> > >   NPA_AF_RVU:
> > >         Unmap Slot Error: 0
> >
> > You seem to have missed the feedback Saeed and I gave you on v2.
> >
> > Did you test this with the errors actually triggering? Devlink should
> > store only
> Yes, the same was tested using devlink health test interface by injecting
> errors.
> The dump gets generated automatically and the counters do get out of sync,
> in case of continuous error.
> That wouldn't be much of an issue as the user could manually trigger a dump
> clear and Re-dump the counters to get the exact status of the counters at
> any point of time.

Now that recover op is added the devlink error counter and recover counter will be 
proper. The internal counter for each event is needed just to understand within a specific reporter, how 
many such events occurred. 

Following is the log snippet of the devlink health test being done on hw_nix reporter.
# for i in `seq 1 33` ; do  devlink health test pci/0002:01:00.0 reporter hw_nix; done
//Inject 33 errors (16  of NIX_AF_RVU and 17 of NIX_AF_RAS and  NIX_AF_GENERAL errors)
# devlink health 
pci/0002:01:00.0:
  reporter hw_npa
    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
  reporter hw_nix
    state healthy error 250 recover 250 last_dump_date 1970-01-01 last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
# devlink health dump show pci/0002:01:00.0 reporter hw_nix
NIX_AF_GENERAL:
        Memory Fault on NIX_AQ_INST_S read: 1 
        Memory Fault on NIX_AQ_RES_S write: 1 
        AQ Doorbell error: 1 
        Rx on unmapped PF_FUNC: 1 
        Rx multicast replication error: 1 
        Memory fault on NIX_RX_MCE_S read: 1 
        Memory fault on multicast WQE read: 1 
        Memory fault on mirror WQE read: 1 
        Memory fault on mirror pkt write: 1 
        Memory fault on multicast pkt write: 1
  NIX_AF_RAS:
        Poisoned data on NIX_AQ_INST_S read: 1 
        Poisoned data on NIX_AQ_RES_S write: 1 
        Poisoned data on HW context read: 1 
        Poisoned data on packet read from mirror buffer: 1 
        Poisoned data on packet read from mcast buffer: 1 
        Poisoned data on WQE read from mirror buffer: 1 
        Poisoned data on WQE read from multicast buffer: 1 
        Poisoned data on NIX_RX_MCE_S read: 1
  NIX_AF_RVU:
        Unmap Slot Error: 0
# devlink health dump clear pci/0002:01:00.0 reporter hw_nix
# devlink health dump show pci/0002:01:00.0 reporter hw_nix
NIX_AF_GENERAL:
        Memory Fault on NIX_AQ_INST_S read: 17 
        Memory Fault on NIX_AQ_RES_S write: 17 
        AQ Doorbell error: 17 
        Rx on unmapped PF_FUNC: 17 
        Rx multicast replication error: 17 
        Memory fault on NIX_RX_MCE_S read: 17 
        Memory fault on multicast WQE read: 17 
        Memory fault on mirror WQE read: 17 
        Memory fault on mirror pkt write: 17 
        Memory fault on multicast pkt write: 17
  NIX_AF_RAS:
        Poisoned data on NIX_AQ_INST_S read: 17 
        Poisoned data on NIX_AQ_RES_S write: 17 
        Poisoned data on HW context read: 17 
        Poisoned data on packet read from mirror buffer: 17 
        Poisoned data on packet read from mcast buffer: 17 
        Poisoned data on WQE read from mirror buffer: 17 
        Poisoned data on WQE read from multicast buffer: 17 
        Poisoned data on NIX_RX_MCE_S read: 17
  NIX_AF_RVU:
        Unmap Slot Error: 16
> 
> > one dump, are the counters not going to get out of sync unless
> > something clears the dump every time it triggers?
Also, note that auto_dump is something which can be turned off by user.
# devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false
So that user can dump whenever required, which will always return the correct counter values.

> 
> Regards,
> -George

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA
@ 2020-12-01  3:35 George Cherian
  2020-12-01  5:18 ` George Cherian
  0 siblings, 1 reply; 10+ messages in thread
From: George Cherian @ 2020-12-01  3:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, linux-kernel, davem, Sunil Kovvuri Goutham, Linu Cherian,
	Geethasowjanya Akula, masahiroy, willemdebruijn.kernel, saeed,
	jiri

Hi Jakub,

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, December 1, 2020 7:59 AM
> To: George Cherian <gcherian@marvell.com>
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> davem@davemloft.net; Sunil Kovvuri Goutham <sgoutham@marvell.com>;
> Linu Cherian <lcherian@marvell.com>; Geethasowjanya Akula
> <gakula@marvell.com>; masahiroy@kernel.org;
> willemdebruijn.kernel@gmail.com; saeed@kernel.org; jiri@resnulli.us
> Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
> 
> On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > Add health reporters for RVU NPA block.
> > NPA Health reporters handle following HW event groups
> >  - GENERAL events
> >  - ERROR events
> >  - RAS events
> >  - RVU event
> > An event counter per event is maintained in SW.
> >
> > Output:
> >  # devlink health
> >  pci/0002:01:00.0:
> >    reporter hw_npa
> >      state healthy error 0 recover 0
> >  # devlink  health dump show pci/0002:01:00.0 reporter hw_npa
> >  NPA_AF_GENERAL:
> >         Unmap PF Error: 0
> >         NIX:
> >         0: free disabled RX: 0 free disabled TX: 0
> >         1: free disabled RX: 0 free disabled TX: 0
> >         Free Disabled for SSO: 0
> >         Free Disabled for TIM: 0
> >         Free Disabled for DPI: 0
> >         Free Disabled for AURA: 0
> >         Alloc Disabled for Resvd: 0
> >   NPA_AF_ERR:
> >         Memory Fault on NPA_AQ_INST_S read: 0
> >         Memory Fault on NPA_AQ_RES_S write: 0
> >         AQ Doorbell Error: 0
> >         Poisoned data on NPA_AQ_INST_S read: 0
> >         Poisoned data on NPA_AQ_RES_S write: 0
> >         Poisoned data on HW context read: 0
> >   NPA_AF_RVU:
> >         Unmap Slot Error: 0
> 
> You seem to have missed the feedback Saeed and I gave you on v2.
> 
> Did you test this with the errors actually triggering? Devlink should store only
Yes, the same was tested using devlink health test interface by injecting errors.
The dump gets generated automatically and the counters do get out of sync, 
in case of continuous error.
That wouldn't be much of an issue as the user could manually trigger a dump clear and 
Re-dump the counters to get the exact status of the counters at any point of time.

> one dump, are the counters not going to get out of sync unless something
> clears the dump every time it triggers?

Regards,
-George

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-12-01 19:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-26 14:02 [PATCHv5 net-next 0/3] Add devlink and devlink health reporters to George Cherian
2020-11-26 14:02 ` [PATCHv5 net-next 1/3] octeontx2-af: Add devlink suppoort to af driver George Cherian
2020-12-01  2:22   ` Jakub Kicinski
2020-11-26 14:02 ` [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA George Cherian
2020-12-01  2:29   ` Jakub Kicinski
2020-11-26 14:02 ` [PATCHv5 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX George Cherian
2020-12-01  3:35 [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA George Cherian
2020-12-01  5:18 ` George Cherian
2020-12-01  5:23   ` George Cherian
2020-12-01 18:59     ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).