[PATCH v2 for-next 0/5] RDMA/hns: Supports recovery of on-chip RAM 1bit ECC errors

* [PATCH v2 for-next 0/5] RDMA/hns: Supports recovery of on-chip RAM 1bit ECC errors
@ 2022-07-13  9:26 Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 1/5] RDMA/hns: Remove unused abnormal interrupt of type RAS Wenpeng Liang
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-13  9:26 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm, liangwenpeng

Add support for recovering 1bit ECC errors that are reported through the
abnormal interrupt, and adjust the structure of the abnormal interrupt handler.

The following is the outline of each patch:
(1) #1~#4: Cleanups and bugfixes for the abnormal interrupt handler.
(2) #5: Support for 1bit ECC error recovery.

Changes since v1:
* Embed ecc_work into struct hns_roce_dev instead of allocating it dynamically in #5.
* Add the const keyword to the string array that does not change in #5.
* v1 Link: https://patchwork.kernel.org/project/linux-rdma/cover/20220624110845.48184-1-liangwenpeng@huawei.com/

Haoyue Xu (5):
  RDMA/hns: Remove unused abnormal interrupt of type RAS
  RDMA/hns: Fix the wrong type of return value of the interrupt handler
  RDMA/hns: Fix incorrect clearing of interrupt status register
  RDMA/hns: Refactor the abnormal interrupt handler function
  RDMA/hns: Recover 1bit-ECC error of RAM on chip

 drivers/infiniband/hw/hns/hns_roce_device.h |   1 +
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c  | 250 +++++++++++++++++---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.h  |  13 +-
 3 files changed, 229 insertions(+), 35 deletions(-)

--
2.33.0


* [PATCH v2 for-next 1/5] RDMA/hns: Remove unused abnormal interrupt of type RAS
  2022-07-13  9:26 [PATCH v2 for-next 0/5] RDMA/hns: Supports recovery of on-chip RAM 1bit ECC errors Wenpeng Liang
@ 2022-07-13  9:26 ` Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 2/5] RDMA/hns: Fix the wrong type of return value of the interrupt handler Wenpeng Liang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-13  9:26 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm, liangwenpeng

From: Haoyue Xu <xuhaoyue1@hisilicon.com>

The HNS NIC driver receives and handles the RAS-type abnormal interrupt
generated by ROCEE, so the HNS RDMA driver does not need to handle this
type of interrupt. Therefore, delete the unused code in the HNS RDMA
driver.

Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 10 ----------
 drivers/infiniband/hw/hns/hns_roce_hw_v2.h |  1 -
 2 files changed, 11 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index ba3c742258ef..617713084383 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -6013,16 +6013,6 @@ static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 		int_en |= 1 << HNS_ROCE_V2_VF_ABN_INT_EN_S;
 		roce_write(hr_dev, ROCEE_VF_ABN_INT_EN_REG, int_en);
 
-		int_work = 1;
-	} else if (int_st & BIT(HNS_ROCE_V2_VF_INT_ST_RAS_INT_S)) {
-		dev_err(dev, "RAS interrupt!\n");
-
-		int_st |= 1 << HNS_ROCE_V2_VF_INT_ST_RAS_INT_S;
-		roce_write(hr_dev, ROCEE_VF_ABN_INT_ST_REG, int_st);
-
-		int_en |= 1 << HNS_ROCE_V2_VF_ABN_INT_EN_S;
-		roce_write(hr_dev, ROCEE_VF_ABN_INT_EN_REG, int_en);
-
 		int_work = 1;
 	} else {
 		dev_err(dev, "There is no abnormal irq found!\n");
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
index 7ffb7824d268..e6186149ef19 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
@@ -1382,7 +1382,6 @@ struct hns_roce_dip {
 #define HNS_ROCE_V2_ASYNC_EQE_NUM		0x1000
 
 #define HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S	0
-#define HNS_ROCE_V2_VF_INT_ST_RAS_INT_S		1
 
 #define HNS_ROCE_EQ_DB_CMD_AEQ			0x0
 #define HNS_ROCE_EQ_DB_CMD_AEQ_ARMED		0x1
-- 
2.33.0



* [PATCH v2 for-next 2/5] RDMA/hns: Fix the wrong type of return value of the interrupt handler
  2022-07-13  9:26 [PATCH v2 for-next 0/5] RDMA/hns: Supports recovery of on-chip RAM 1bit ECC errors Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 1/5] RDMA/hns: Remove unused abnormal interrupt of type RAS Wenpeng Liang
@ 2022-07-13  9:26 ` Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 3/5] RDMA/hns: Fix incorrect clearing of interrupt status register Wenpeng Liang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-13  9:26 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm, liangwenpeng

From: Haoyue Xu <xuhaoyue1@hisilicon.com>

The return type of the interrupt handlers should be irqreturn_t, not int.
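
A minimal userspace sketch of the convention this patch adopts. The enum and
macro below are stand-ins written to mirror the kernel's IRQ_NONE, IRQ_HANDLED
and IRQ_RETVAL(); struct demo_eq and demo_eq_int() are hypothetical names used
only for illustration and are not part of the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's irqreturn definitions. */
typedef enum {
	IRQ_NONE = 0,		/* interrupt was not for this device */
	IRQ_HANDLED = 1,	/* interrupt was handled by this device */
} irqreturn_t;

#define IRQ_RETVAL(x)	((x) ? IRQ_HANDLED : IRQ_NONE)

/* Hypothetical event queue: 'pending' entries are waiting to be consumed. */
struct demo_eq {
	unsigned int pending;
	unsigned int cons_index;
};

/* Drain the queue and report whether any entry was found. */
irqreturn_t demo_eq_int(struct demo_eq *eq)
{
	irqreturn_t found = IRQ_NONE;

	assert(eq != NULL);

	while (eq->pending) {
		eq->pending--;
		eq->cons_index++;
		found = IRQ_HANDLED;
	}

	return IRQ_RETVAL(found);
}
```

Carrying irqreturn_t from the helper up to the registered handler keeps the
"was anything handled" answer in one type instead of mixing it with a plain int.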

Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 27 +++++++++++-----------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 617713084383..bb6073635c53 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -5855,12 +5855,12 @@ static struct hns_roce_aeqe *next_aeqe_sw_v2(struct hns_roce_eq *eq)
 		!!(eq->cons_index & eq->entries)) ? aeqe : NULL;
 }
 
-static int hns_roce_v2_aeq_int(struct hns_roce_dev *hr_dev,
-			       struct hns_roce_eq *eq)
+static irqreturn_t hns_roce_v2_aeq_int(struct hns_roce_dev *hr_dev,
+				       struct hns_roce_eq *eq)
 {
 	struct device *dev = hr_dev->dev;
 	struct hns_roce_aeqe *aeqe = next_aeqe_sw_v2(eq);
-	int aeqe_found = 0;
+	irqreturn_t aeqe_found = IRQ_NONE;
 	int event_type;
 	u32 queue_num;
 	int sub_type;
@@ -5914,7 +5914,7 @@ static int hns_roce_v2_aeq_int(struct hns_roce_dev *hr_dev,
 		eq->event_type = event_type;
 		eq->sub_type = sub_type;
 		++eq->cons_index;
-		aeqe_found = 1;
+		aeqe_found = IRQ_HANDLED;
 
 		hns_roce_v2_init_irq_work(hr_dev, eq, queue_num);
 
@@ -5922,7 +5922,8 @@ static int hns_roce_v2_aeq_int(struct hns_roce_dev *hr_dev,
 	}
 
 	update_eq_db(eq);
-	return aeqe_found;
+
+	return IRQ_RETVAL(aeqe_found);
 }
 
 static struct hns_roce_ceqe *next_ceqe_sw_v2(struct hns_roce_eq *eq)
@@ -5937,11 +5938,11 @@ static struct hns_roce_ceqe *next_ceqe_sw_v2(struct hns_roce_eq *eq)
 		!!(eq->cons_index & eq->entries)) ? ceqe : NULL;
 }
 
-static int hns_roce_v2_ceq_int(struct hns_roce_dev *hr_dev,
-			       struct hns_roce_eq *eq)
+static irqreturn_t hns_roce_v2_ceq_int(struct hns_roce_dev *hr_dev,
+				       struct hns_roce_eq *eq)
 {
 	struct hns_roce_ceqe *ceqe = next_ceqe_sw_v2(eq);
-	int ceqe_found = 0;
+	irqreturn_t ceqe_found = IRQ_NONE;
 	u32 cqn;
 
 	while (ceqe) {
@@ -5955,21 +5956,21 @@ static int hns_roce_v2_ceq_int(struct hns_roce_dev *hr_dev,
 		hns_roce_cq_completion(hr_dev, cqn);
 
 		++eq->cons_index;
-		ceqe_found = 1;
+		ceqe_found = IRQ_HANDLED;
 
 		ceqe = next_ceqe_sw_v2(eq);
 	}
 
 	update_eq_db(eq);
 
-	return ceqe_found;
+	return IRQ_RETVAL(ceqe_found);
 }
 
 static irqreturn_t hns_roce_v2_msix_interrupt_eq(int irq, void *eq_ptr)
 {
 	struct hns_roce_eq *eq = eq_ptr;
 	struct hns_roce_dev *hr_dev = eq->hr_dev;
-	int int_work;
+	irqreturn_t int_work;
 
 	if (eq->type_flag == HNS_ROCE_CEQ)
 		/* Completion event interrupt */
@@ -5985,7 +5986,7 @@ static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 {
 	struct hns_roce_dev *hr_dev = dev_id;
 	struct device *dev = hr_dev->dev;
-	int int_work = 0;
+	irqreturn_t int_work = IRQ_NONE;
 	u32 int_st;
 	u32 int_en;
 
@@ -6013,7 +6014,7 @@ static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 		int_en |= 1 << HNS_ROCE_V2_VF_ABN_INT_EN_S;
 		roce_write(hr_dev, ROCEE_VF_ABN_INT_EN_REG, int_en);
 
-		int_work = 1;
+		int_work = IRQ_HANDLED;
 	} else {
 		dev_err(dev, "There is no abnormal irq found!\n");
 	}
-- 
2.33.0



* [PATCH v2 for-next 3/5] RDMA/hns: Fix incorrect clearing of interrupt status register
  2022-07-13  9:26 [PATCH v2 for-next 0/5] RDMA/hns: Supports recovery of on-chip RAM 1bit ECC errors Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 1/5] RDMA/hns: Remove unused abnormal interrupt of type RAS Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 2/5] RDMA/hns: Fix the wrong type of return value of the interrupt handler Wenpeng Liang
@ 2022-07-13  9:26 ` Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 4/5] RDMA/hns: Refactor the abnormal interrupt handler function Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip Wenpeng Liang
  4 siblings, 0 replies; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-13  9:26 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm, liangwenpeng

From: Haoyue Xu <xuhaoyue1@hisilicon.com>

When the driver handles an AEQ overflow interrupt, it writes back the
whole status value it has just read, which clears every pending interrupt
in the same register. It should write only the AEQ overflow status bit.
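
A small userspace model of the difference, offered as a sketch only. It assumes
the status register behaves as write-1-to-clear, which the commit message
implies but does not state; struct demo_reg, the demo_* helpers and the second
interrupt source are hypothetical and do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INT_ST_AEQ_OVERFLOW_S	0	/* mirrors HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S */
#define DEMO_INT_ST_OTHER_S		1	/* hypothetical second interrupt source */

/* Hypothetical status register, ASSUMED to be write-1-to-clear. */
struct demo_reg {
	uint32_t val;
};

uint32_t demo_read(const struct demo_reg *r)
{
	return r->val;
}

void demo_write_w1c(struct demo_reg *r, uint32_t v)
{
	r->val &= ~v;	/* every bit written as 1 is cleared */
}

/* Old pattern: write back the value that was read, plus the overflow bit. */
void demo_ack_old(struct demo_reg *r)
{
	uint32_t int_st = demo_read(r);

	int_st |= 1u << DEMO_INT_ST_AEQ_OVERFLOW_S;
	demo_write_w1c(r, int_st);
}

/* New pattern: write only the bit that is being acknowledged. */
void demo_ack_new(struct demo_reg *r)
{
	demo_write_w1c(r, 1u << DEMO_INT_ST_AEQ_OVERFLOW_S);
}

/* Usage: both registers start with the overflow bit and the other bit set. */
void demo_show_difference(void)
{
	struct demo_reg a = { .val = 0x3 };
	struct demo_reg b = { .val = 0x3 };

	demo_ack_old(&a);	/* clears both pending bits */
	demo_ack_new(&b);	/* clears only the AEQ overflow bit */

	assert(a.val == 0x0);
	assert(b.val == 0x2);
}
```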

Fixes: a5073d6054f7 ("RDMA/hns: Add eq support of hip08")
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index bb6073635c53..35bf58fcaeb3 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -6001,8 +6001,8 @@ static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 
 		dev_err(dev, "AEQ overflow!\n");
 
-		int_st |= 1 << HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S;
-		roce_write(hr_dev, ROCEE_VF_ABN_INT_ST_REG, int_st);
+		roce_write(hr_dev, ROCEE_VF_ABN_INT_ST_REG,
+			   1 << HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S);
 
 		/* Set reset level for reset_event() */
 		if (ops->set_default_reset_request)
-- 
2.33.0



* [PATCH v2 for-next 4/5] RDMA/hns: Refactor the abnormal interrupt handler function
  2022-07-13  9:26 [PATCH v2 for-next 0/5] RDMA/hns: Supports recovery of on-chip RAM 1bit ECC errors Wenpeng Liang
                   ` (2 preceding siblings ...)
  2022-07-13  9:26 ` [PATCH v2 for-next 3/5] RDMA/hns: Fix incorrect clearing of interrupt status register Wenpeng Liang
@ 2022-07-13  9:26 ` Wenpeng Liang
  2022-07-13  9:26 ` [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip Wenpeng Liang
  4 siblings, 0 replies; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-13  9:26 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm, liangwenpeng

From: Haoyue Xu <xuhaoyue1@hisilicon.com>

Use a single function to handle the same kind of abnormal interrupts.

Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 35 ++++++++++++++--------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 35bf58fcaeb3..782f09a7f8af 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -5982,24 +5982,19 @@ static irqreturn_t hns_roce_v2_msix_interrupt_eq(int irq, void *eq_ptr)
 	return IRQ_RETVAL(int_work);
 }
 
-static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
+static irqreturn_t abnormal_interrupt_basic(struct hns_roce_dev *hr_dev,
+					    u32 int_st)
 {
-	struct hns_roce_dev *hr_dev = dev_id;
-	struct device *dev = hr_dev->dev;
+	struct pci_dev *pdev = hr_dev->pci_dev;
+	struct hnae3_ae_dev *ae_dev = pci_get_drvdata(pdev);
+	const struct hnae3_ae_ops *ops = ae_dev->ops;
 	irqreturn_t int_work = IRQ_NONE;
-	u32 int_st;
 	u32 int_en;
 
-	/* Abnormal interrupt */
-	int_st = roce_read(hr_dev, ROCEE_VF_ABN_INT_ST_REG);
 	int_en = roce_read(hr_dev, ROCEE_VF_ABN_INT_EN_REG);
 
 	if (int_st & BIT(HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S)) {
-		struct pci_dev *pdev = hr_dev->pci_dev;
-		struct hnae3_ae_dev *ae_dev = pci_get_drvdata(pdev);
-		const struct hnae3_ae_ops *ops = ae_dev->ops;
-
-		dev_err(dev, "AEQ overflow!\n");
+		dev_err(hr_dev->dev, "AEQ overflow!\n");
 
 		roce_write(hr_dev, ROCEE_VF_ABN_INT_ST_REG,
 			   1 << HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S);
@@ -6016,12 +6011,28 @@ static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 
 		int_work = IRQ_HANDLED;
 	} else {
-		dev_err(dev, "There is no abnormal irq found!\n");
+		dev_err(hr_dev->dev, "there is no basic abn irq found.\n");
 	}
 
 	return IRQ_RETVAL(int_work);
 }
 
+static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
+{
+	struct hns_roce_dev *hr_dev = dev_id;
+	irqreturn_t int_work = IRQ_NONE;
+	u32 int_st;
+
+	int_st = roce_read(hr_dev, ROCEE_VF_ABN_INT_ST_REG);
+
+	if (int_st)
+		int_work = abnormal_interrupt_basic(hr_dev, int_st);
+	else
+		dev_err(hr_dev->dev, "there is no abnormal irq found.\n");
+
+	return IRQ_RETVAL(int_work);
+}
+
 static void hns_roce_v2_int_mask_enable(struct hns_roce_dev *hr_dev,
 					int eq_num, u32 enable_flag)
 {
-- 
2.33.0



* [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip
  2022-07-13  9:26 [PATCH v2 for-next 0/5] RDMA/hns: Supports recovery of on-chip RAM 1bit ECC errors Wenpeng Liang
                   ` (3 preceding siblings ...)
  2022-07-13  9:26 ` [PATCH v2 for-next 4/5] RDMA/hns: Refactor the abnormal interrupt handler function Wenpeng Liang
@ 2022-07-13  9:26 ` Wenpeng Liang
  2022-07-13  9:36   ` Cheng Xu
  4 siblings, 1 reply; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-13  9:26 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm, liangwenpeng

From: Haoyue Xu <xuhaoyue1@hisilicon.com>

ECC memory keeps a memory system immune to single-bit errors. Add support
for correcting 1bit-ECC errors, which prevents a 1bit-ECC error from
becoming an uncorrectable error. When a 1bit-ECC error is detected while
reading the internal RAM of the ROCE engine, such as the QPC table, the
ROCE engine corrects the stored entry only when that entry is written, so
the driver reads the affected entry and writes it back.
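
The read-then-write-back idea can be sketched in userspace as follows. This is
an illustration under stated assumptions, not the driver's implementation: it
assumes a read returns corrected data while the stored copy stays faulty until
the next write, and struct demo_ram and the demo_ram_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ENTRIES	4

/* Hypothetical model of an on-chip table protected by ECC. */
struct demo_ram {
	uint64_t data[DEMO_ENTRIES];
	bool ecc_1bit[DEMO_ENTRIES];	/* entry holds a latent 1-bit error */
};

/* ASSUMED: a read returns corrected data but does not repair the stored copy. */
uint64_t demo_ram_read(const struct demo_ram *ram, uint32_t idx)
{
	assert(idx < DEMO_ENTRIES);
	return ram->data[idx];
}

/* ASSUMED: a write regenerates the stored word together with its check bits. */
void demo_ram_write(struct demo_ram *ram, uint32_t idx, uint64_t val)
{
	assert(idx < DEMO_ENTRIES);
	ram->data[idx] = val;
	ram->ecc_1bit[idx] = false;
}

/* Recover one entry: read it, then write the same value back. */
int demo_recover_entry(struct demo_ram *ram, uint32_t idx)
{
	if (idx >= DEMO_ENTRIES)
		return -1;	/* index reported by hardware is out of range */

	demo_ram_write(ram, idx, demo_ram_read(ram, idx));
	return 0;
}
```

In the patch itself the read and the write go through mailbox or cmdq commands
rather than direct memory accesses; the sketch only shows the ordering.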

Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_device.h |   1 +
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c  | 184 +++++++++++++++++++-
 drivers/infiniband/hw/hns/hns_roce_hw_v2.h  |  12 ++
 3 files changed, 195 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index 2855e9ad4b32..f848eedc6a23 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -959,6 +959,7 @@ struct hns_roce_dev {
 	const struct hns_roce_hw *hw;
 	void			*priv;
 	struct workqueue_struct *irq_workq;
+	struct work_struct ecc_work;
 	const struct hns_roce_dfx_hw *dfx;
 	u32 func_num;
 	u32 is_vf;
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 782f09a7f8af..04133bfb5a93 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -55,6 +55,42 @@ enum {
 	CMD_RST_PRC_EBUSY,
 };
 
+enum ecc_resource_type {
+	ECC_RESOURCE_QPC,
+	ECC_RESOURCE_CQC,
+	ECC_RESOURCE_MPT,
+	ECC_RESOURCE_SRQC,
+	ECC_RESOURCE_GMV,
+	ECC_RESOURCE_QPC_TIMER,
+	ECC_RESOURCE_CQC_TIMER,
+	ECC_RESOURCE_SCCC,
+	ECC_RESOURCE_COUNT,
+};
+
+static const struct {
+	const char *name;
+	u8 read_bt0_op;
+	u8 write_bt0_op;
+} fmea_ram_res[] = {
+	{ "ECC_RESOURCE_QPC",
+	  HNS_ROCE_CMD_READ_QPC_BT0, HNS_ROCE_CMD_WRITE_QPC_BT0 },
+	{ "ECC_RESOURCE_CQC",
+	  HNS_ROCE_CMD_READ_CQC_BT0, HNS_ROCE_CMD_WRITE_CQC_BT0 },
+	{ "ECC_RESOURCE_MPT",
+	  HNS_ROCE_CMD_READ_MPT_BT0, HNS_ROCE_CMD_WRITE_MPT_BT0 },
+	{ "ECC_RESOURCE_SRQC",
+	  HNS_ROCE_CMD_READ_SRQC_BT0, HNS_ROCE_CMD_WRITE_SRQC_BT0 },
+	/* ECC_RESOURCE_GMV is handled by cmdq, not mailbox */
+	{ "ECC_RESOURCE_GMV",
+	  0, 0 },
+	{ "ECC_RESOURCE_QPC_TIMER",
+	  HNS_ROCE_CMD_READ_QPC_TIMER_BT0, HNS_ROCE_CMD_WRITE_QPC_TIMER_BT0 },
+	{ "ECC_RESOURCE_CQC_TIMER",
+	  HNS_ROCE_CMD_READ_CQC_TIMER_BT0, HNS_ROCE_CMD_WRITE_CQC_TIMER_BT0 },
+	{ "ECC_RESOURCE_SCCC",
+	  HNS_ROCE_CMD_READ_SCCC_BT0, HNS_ROCE_CMD_WRITE_SCCC_BT0 },
+};
+
 static inline void set_data_seg_v2(struct hns_roce_v2_wqe_data_seg *dseg,
 				   struct ib_sge *sg)
 {
@@ -6017,6 +6053,144 @@ static irqreturn_t abnormal_interrupt_basic(struct hns_roce_dev *hr_dev,
 	return IRQ_RETVAL(int_work);
 }
 
+static int fmea_ram_ecc_query(struct hns_roce_dev *hr_dev,
+			       struct fmea_ram_ecc *ecc_info)
+{
+	struct hns_roce_cmq_desc desc;
+	struct hns_roce_cmq_req *req = (struct hns_roce_cmq_req *)desc.data;
+	int ret;
+
+	hns_roce_cmq_setup_basic_desc(&desc, HNS_ROCE_QUERY_RAM_ECC, true);
+	ret = hns_roce_cmq_send(hr_dev, &desc, 1);
+	if (ret)
+		return ret;
+
+	ecc_info->is_ecc_err = hr_reg_read(req, QUERY_RAM_ECC_1BIT_ERR);
+	ecc_info->res_type = hr_reg_read(req, QUERY_RAM_ECC_RES_TYPE);
+	ecc_info->index = hr_reg_read(req, QUERY_RAM_ECC_TAG);
+
+	return 0;
+}
+
+static int fmea_recover_gmv(struct hns_roce_dev *hr_dev, u32 idx)
+{
+	struct hns_roce_cmq_desc desc;
+	struct hns_roce_cmq_req *req = (struct hns_roce_cmq_req *)desc.data;
+	u32 addr_upper;
+	u32 addr_low;
+	int ret;
+
+	hns_roce_cmq_setup_basic_desc(&desc, HNS_ROCE_OPC_CFG_GMV_BT, true);
+	hr_reg_write(req, CFG_GMV_BT_IDX, idx);
+
+	ret = hns_roce_cmq_send(hr_dev, &desc, 1);
+	if (ret) {
+		dev_err(hr_dev->dev,
+			"failed to execute cmd to read gmv, ret = %d.\n", ret);
+		return ret;
+	}
+
+	addr_low =  hr_reg_read(req, CFG_GMV_BT_BA_L);
+	addr_upper = hr_reg_read(req, CFG_GMV_BT_BA_H);
+
+	hns_roce_cmq_setup_basic_desc(&desc, HNS_ROCE_OPC_CFG_GMV_BT, false);
+	hr_reg_write(req, CFG_GMV_BT_BA_L, addr_low);
+	hr_reg_write(req, CFG_GMV_BT_BA_H, addr_upper);
+	hr_reg_write(req, CFG_GMV_BT_IDX, idx);
+
+	return hns_roce_cmq_send(hr_dev, &desc, 1);
+}
+
+static u64 fmea_get_ram_res_addr(u32 res_type, __le64 *data)
+{
+	if (res_type == ECC_RESOURCE_QPC_TIMER ||
+	    res_type == ECC_RESOURCE_CQC_TIMER ||
+	    res_type == ECC_RESOURCE_SCCC)
+		return le64_to_cpu(*data);
+
+	return le64_to_cpu(*data) << PAGE_SHIFT;
+}
+
+static int fmea_recover_others(struct hns_roce_dev *hr_dev, u32 res_type,
+			       u32 index)
+{
+	u8 write_bt0_op = fmea_ram_res[res_type].write_bt0_op;
+	u8 read_bt0_op = fmea_ram_res[res_type].read_bt0_op;
+	struct hns_roce_cmd_mailbox *mailbox;
+	u64 addr;
+	int ret;
+
+	mailbox = hns_roce_alloc_cmd_mailbox(hr_dev);
+	if (IS_ERR(mailbox))
+		return PTR_ERR(mailbox);
+
+	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, read_bt0_op, index);
+	if (ret) {
+		dev_err(hr_dev->dev,
+			"failed to execute cmd to read fmea ram, ret = %d.\n",
+			ret);
+		goto err;
+	}
+
+	addr = fmea_get_ram_res_addr(res_type, mailbox->buf);
+
+	ret = hns_roce_cmd_mbox(hr_dev, addr, 0, write_bt0_op, index);
+	if (ret) {
+		dev_err(hr_dev->dev,
+			"failed to execute cmd to write fmea ram, ret = %d.\n",
+			ret);
+		goto err;
+	}
+
+err:
+	hns_roce_free_cmd_mailbox(hr_dev, mailbox);
+	return ret;
+}
+
+static void fmea_ram_ecc_recover(struct hns_roce_dev *hr_dev,
+				 struct fmea_ram_ecc *ecc_info)
+{
+	u32 res_type = ecc_info->res_type;
+	u32 index = ecc_info->index;
+	int ret;
+
+	BUILD_BUG_ON(ARRAY_SIZE(fmea_ram_res) != ECC_RESOURCE_COUNT);
+
+	if (res_type >= ECC_RESOURCE_COUNT) {
+		dev_err(hr_dev->dev, "unsupported fmea ram ecc type %u.\n",
+			res_type);
+		return;
+	}
+
+	if (res_type == ECC_RESOURCE_GMV)
+		ret = fmea_recover_gmv(hr_dev, index);
+	else
+		ret = fmea_recover_others(hr_dev, res_type, index);
+	if (ret)
+		dev_err(hr_dev->dev,
+			"failed to recover %s, index = %u, ret = %d.\n",
+			fmea_ram_res[res_type].name, index, ret);
+}
+
+static void fmea_ram_ecc_work(struct work_struct *ecc_work)
+{
+	struct hns_roce_dev *hr_dev =
+		container_of(ecc_work, struct hns_roce_dev, ecc_work);
+	struct fmea_ram_ecc ecc_info = {};
+
+	if (fmea_ram_ecc_query(hr_dev, &ecc_info)) {
+		dev_err(hr_dev->dev, "failed to query fmea ram ecc.\n");
+		return;
+	}
+
+	if (!ecc_info.is_ecc_err) {
+		dev_err(hr_dev->dev, "there is no fmea ram ecc err found.\n");
+		return;
+	}
+
+	fmea_ram_ecc_recover(hr_dev, &ecc_info);
+}
+
 static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 {
 	struct hns_roce_dev *hr_dev = dev_id;
@@ -6025,10 +6199,14 @@ static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 
 	int_st = roce_read(hr_dev, ROCEE_VF_ABN_INT_ST_REG);
 
-	if (int_st)
+	if (int_st) {
 		int_work = abnormal_interrupt_basic(hr_dev, int_st);
-	else
+	} else if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP09) {
+		queue_work(hr_dev->irq_workq, &hr_dev->ecc_work);
+		int_work = IRQ_HANDLED;
+	} else {
 		dev_err(hr_dev->dev, "there is no abnormal irq found.\n");
+	}
 
 	return IRQ_RETVAL(int_work);
 }
@@ -6344,6 +6522,8 @@ static int hns_roce_v2_init_eq_table(struct hns_roce_dev *hr_dev)
 		}
 	}
 
+	INIT_WORK(&hr_dev->ecc_work, fmea_ram_ecc_work);
+
 	hr_dev->irq_workq = alloc_ordered_workqueue("hns_roce_irq_workq", 0);
 	if (!hr_dev->irq_workq) {
 		dev_err(dev, "failed to create irq workqueue.\n");
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
index e6186149ef19..f96debac30fe 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
@@ -250,6 +250,7 @@ enum hns_roce_opcode_type {
 	HNS_ROCE_OPC_CFG_GMV_TBL			= 0x850f,
 	HNS_ROCE_OPC_CFG_GMV_BT				= 0x8510,
 	HNS_ROCE_OPC_EXT_CFG				= 0x8512,
+	HNS_ROCE_QUERY_RAM_ECC				= 0x8513,
 	HNS_SWITCH_PARAMETER_CFG			= 0x1033,
 };
 
@@ -1107,6 +1108,11 @@ enum {
 #define CFG_GMV_BT_BA_H CMQ_REQ_FIELD_LOC(51, 32)
 #define CFG_GMV_BT_IDX CMQ_REQ_FIELD_LOC(95, 64)
 
+/* Fields of HNS_ROCE_QUERY_RAM_ECC */
+#define QUERY_RAM_ECC_1BIT_ERR CMQ_REQ_FIELD_LOC(31, 0)
+#define QUERY_RAM_ECC_RES_TYPE CMQ_REQ_FIELD_LOC(63, 32)
+#define QUERY_RAM_ECC_TAG CMQ_REQ_FIELD_LOC(95, 64)
+
 struct hns_roce_cfg_sgid_tb {
 	__le32	table_idx_rsv;
 	__le32	vf_sgid_l;
@@ -1343,6 +1349,12 @@ struct hns_roce_dip {
 	struct list_head node; /* all dips are on a list */
 };
 
+struct fmea_ram_ecc {
+	u32	is_ecc_err;
+	u32	res_type;
+	u32	index;
+};
+
 /* only for RNR timeout issue of HIP08 */
 #define HNS_ROCE_CLOCK_ADJUST 1000
 #define HNS_ROCE_MAX_CQ_PERIOD 65
-- 
2.33.0



* Re: [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip
  2022-07-13  9:26 ` [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip Wenpeng Liang
@ 2022-07-13  9:36   ` Cheng Xu
  2022-07-13 10:03     ` Wenpeng Liang
  0 siblings, 1 reply; 10+ messages in thread
From: Cheng Xu @ 2022-07-13  9:36 UTC (permalink / raw)
  To: Wenpeng Liang, jgg, leon; +Cc: linux-rdma, linuxarm



On 7/13/22 5:26 PM, Wenpeng Liang wrote:

<...>

> +static int fmea_recover_others(struct hns_roce_dev *hr_dev, u32 res_type,
> +			       u32 index)
> +{
> +	u8 write_bt0_op = fmea_ram_res[res_type].write_bt0_op;
> +	u8 read_bt0_op = fmea_ram_res[res_type].read_bt0_op;
> +	struct hns_roce_cmd_mailbox *mailbox;
> +	u64 addr;
> +	int ret;
> +
> +	mailbox = hns_roce_alloc_cmd_mailbox(hr_dev);
> +	if (IS_ERR(mailbox))
> +		return PTR_ERR(mailbox);
> +
> +	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, read_bt0_op, index);
> +	if (ret) {
> +		dev_err(hr_dev->dev,
> +			"failed to execute cmd to read fmea ram, ret = %d.\n",
> +			ret);
> +		goto err;
> +	}
> +
> +	addr = fmea_get_ram_res_addr(res_type, mailbox->buf);
> +
> +	ret = hns_roce_cmd_mbox(hr_dev, addr, 0, write_bt0_op, index);
> +	if (ret) {
> +		dev_err(hr_dev->dev,
> +			"failed to execute cmd to write fmea ram, ret = %d.\n",
> +			ret);
> +		goto err;
> +	}
> +

It seems that either a "return 0" is missing here or the "goto err;" is unnecessary.

Thanks,
Cheng Xu

> +err:
> +	hns_roce_free_cmd_mailbox(hr_dev, mailbox);
> +	return ret;
> +}
> +


* Re: [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip
  2022-07-13  9:36   ` Cheng Xu
@ 2022-07-13 10:03     ` Wenpeng Liang
  2022-07-13 10:14       ` Cheng Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-13 10:03 UTC (permalink / raw)
  To: Cheng Xu, jgg, leon; +Cc: linux-rdma, linuxarm


On 2022/7/13 17:36, Cheng Xu wrote:
> 
> On 7/13/22 5:26 PM, Wenpeng Liang wrote:
> 
> <...>
> 
>> +static int fmea_recover_others(struct hns_roce_dev *hr_dev, u32 res_type,
>> +			       u32 index)
>> +{
>> +	u8 write_bt0_op = fmea_ram_res[res_type].write_bt0_op;
>> +	u8 read_bt0_op = fmea_ram_res[res_type].read_bt0_op;
>> +	struct hns_roce_cmd_mailbox *mailbox;
>> +	u64 addr;
>> +	int ret;
>> +
>> +	mailbox = hns_roce_alloc_cmd_mailbox(hr_dev);
>> +	if (IS_ERR(mailbox))
>> +		return PTR_ERR(mailbox);
>> +
>> +	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, read_bt0_op, index);
>> +	if (ret) {
>> +		dev_err(hr_dev->dev,
>> +			"failed to execute cmd to read fmea ram, ret = %d.\n",
>> +			ret);
>> +		goto err;
>> +	}
>> +
>> +	addr = fmea_get_ram_res_addr(res_type, mailbox->buf);
>> +
>> +	ret = hns_roce_cmd_mbox(hr_dev, addr, 0, write_bt0_op, index);
>> +	if (ret) {
>> +		dev_err(hr_dev->dev,
>> +			"failed to execute cmd to write fmea ram, ret = %d.\n",
>> +			ret);
>> +		goto err;
>> +	}
>> +
> Here it seems that you miss a "return 0" or the "goto err;" is unnecessary.
> 

Will remove the "goto err;".

Thanks,
Wenpeng

> Thanks,
> Cheng Xu
> 
>> +err:
>> +	hns_roce_free_cmd_mailbox(hr_dev, mailbox);
>> +	return ret;
>> +}
>> +
> .
> 


* Re: [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip
  2022-07-13 10:03     ` Wenpeng Liang
@ 2022-07-13 10:14       ` Cheng Xu
  2022-07-14 13:34         ` Wenpeng Liang
  0 siblings, 1 reply; 10+ messages in thread
From: Cheng Xu @ 2022-07-13 10:14 UTC (permalink / raw)
  To: Wenpeng Liang, jgg, leon; +Cc: linux-rdma, linuxarm



On 7/13/22 6:03 PM, Wenpeng Liang wrote:
> 
> On 2022/7/13 17:36, Cheng Xu wrote:
>>
>> On 7/13/22 5:26 PM, Wenpeng Liang wrote:
>>
>> <...>
>>
>>> +static int fmea_recover_others(struct hns_roce_dev *hr_dev, u32 res_type,
>>> +			       u32 index)
>>> +{
>>> +	u8 write_bt0_op = fmea_ram_res[res_type].write_bt0_op;
>>> +	u8 read_bt0_op = fmea_ram_res[res_type].read_bt0_op;
>>> +	struct hns_roce_cmd_mailbox *mailbox;
>>> +	u64 addr;
>>> +	int ret;
>>> +
>>> +	mailbox = hns_roce_alloc_cmd_mailbox(hr_dev);
>>> +	if (IS_ERR(mailbox))
>>> +		return PTR_ERR(mailbox);
>>> +
>>> +	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, read_bt0_op, index);
>>> +	if (ret) {
>>> +		dev_err(hr_dev->dev,
>>> +			"failed to execute cmd to read fmea ram, ret = %d.\n",
>>> +			ret);
>>> +		goto err;
>>> +	}
>>> +
>>> +	addr = fmea_get_ram_res_addr(res_type, mailbox->buf);
>>> +
>>> +	ret = hns_roce_cmd_mbox(hr_dev, addr, 0, write_bt0_op, index);
>>> +	if (ret) {
>>> +		dev_err(hr_dev->dev,
>>> +			"failed to execute cmd to write fmea ram, ret = %d.\n",
>>> +			ret);
>>> +		goto err;
>>> +	}
>>> +
>> Here it seems that you miss a "return 0" or the "goto err;" is unnecessary.
>>
> 
> Will remove the "goto err;".
> 

Also, since hns_roce_free_cmd_mailbox() is called in both the normal and the
error flow, maybe "out" or some other name would be a better jump label than "err"?

Thanks,
Cheng Xu
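
The two review comments in this sub-thread can be sketched together as a single
exit label shared by the success and failure paths. This is a userspace
illustration only: demo_cmd(), demo_recover_others() and demo_frees are
hypothetical, and malloc()/free() stand in for the mailbox alloc/free helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mailbox command; fails on request. */
int demo_cmd(int fail)
{
	return fail ? -EIO : 0;
}

int demo_frees;	/* counts how often the cleanup path ran */

int demo_recover_others(int read_fails, int write_fails)
{
	void *mailbox;
	int ret;

	mailbox = malloc(64);
	if (!mailbox)
		return -ENOMEM;

	ret = demo_cmd(read_fails);
	if (ret)
		goto out;

	/* Last step: success and failure both fall through to the cleanup. */
	ret = demo_cmd(write_fails);

out:
	free(mailbox);
	demo_frees++;
	return ret;
}

/* Usage: the cleanup runs exactly once per call, whatever the outcome. */
void demo_check_cleanup(void)
{
	demo_frees = 0;
	assert(demo_recover_others(0, 0) == 0);
	assert(demo_recover_others(1, 0) == -EIO);
	assert(demo_recover_others(0, 1) == -EIO);
	assert(demo_frees == 3);
}
```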


* Re: [PATCH v2 for-next 5/5] RDMA/hns: Recover 1bit-ECC error of RAM on chip
  2022-07-13 10:14       ` Cheng Xu
@ 2022-07-14 13:34         ` Wenpeng Liang
  0 siblings, 0 replies; 10+ messages in thread
From: Wenpeng Liang @ 2022-07-14 13:34 UTC (permalink / raw)
  To: Cheng Xu, jgg, leon; +Cc: linux-rdma, linuxarm


On 2022/7/13 18:14, Cheng Xu wrote:
> 
> On 7/13/22 6:03 PM, Wenpeng Liang wrote:
>> On 2022/7/13 17:36, Cheng Xu wrote:
>>> On 7/13/22 5:26 PM, Wenpeng Liang wrote:
>>>
>>> <...>
>>>
>>>> +static int fmea_recover_others(struct hns_roce_dev *hr_dev, u32 res_type,
>>>> +			       u32 index)
>>>> +{
>>>> +	u8 write_bt0_op = fmea_ram_res[res_type].write_bt0_op;
>>>> +	u8 read_bt0_op = fmea_ram_res[res_type].read_bt0_op;
>>>> +	struct hns_roce_cmd_mailbox *mailbox;
>>>> +	u64 addr;
>>>> +	int ret;
>>>> +
>>>> +	mailbox = hns_roce_alloc_cmd_mailbox(hr_dev);
>>>> +	if (IS_ERR(mailbox))
>>>> +		return PTR_ERR(mailbox);
>>>> +
>>>> +	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, read_bt0_op, index);
>>>> +	if (ret) {
>>>> +		dev_err(hr_dev->dev,
>>>> +			"failed to execute cmd to read fmea ram, ret = %d.\n",
>>>> +			ret);
>>>> +		goto err;
>>>> +	}
>>>> +
>>>> +	addr = fmea_get_ram_res_addr(res_type, mailbox->buf);
>>>> +
>>>> +	ret = hns_roce_cmd_mbox(hr_dev, addr, 0, write_bt0_op, index);
>>>> +	if (ret) {
>>>> +		dev_err(hr_dev->dev,
>>>> +			"failed to execute cmd to write fmea ram, ret = %d.\n",
>>>> +			ret);
>>>> +		goto err;
>>>> +	}
>>>> +
>>> Here it seems that you miss a "return 0" or the "goto err;" is unnecessary.
>>>
>> Will remove the "goto err;".
>>
> And, if the hns_roce_free_cmd_mailbox is called in both normal and error flow,
> Maybe using "out" or some name else is better than "err" as the jump label?
> 

Will fix it.

Thanks,
Wenpeng

> Thanks,
> Cheng Xu
