netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/2] add some vf fault detect patch for hns
@ 2023-01-13  2:08 Hao Lan
  2023-01-13  2:08 ` [PATCH net-next 1/2] net: hns3: add hns3 vf fault detect cap bit support Hao Lan
  2023-01-13  2:08 ` [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras Hao Lan
  0 siblings, 2 replies; 15+ messages in thread
From: Hao Lan @ 2023-01-13  2:08 UTC (permalink / raw)
  To: davem, kuba
  Cc: yisen.zhuang, salil.mehta, edumazet, pabeni, richardcochran,
	shenjian15, wangjie125, netdev

Currently hns3 driver supports vf fault detect feature.Patch #1 is
add hns3 vf fault detect cap bit support.Patch #2 is add vf fault
process in hns3 ras.

Jie Wang (2):
  net: hns3: add hns3 vf fault detect cap bit support
  net: hns3: add vf fault process in hns3 ras

 drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   5 +
 .../hns3/hns3_common/hclge_comm_cmd.c         |   1 +
 .../hns3/hns3_common/hclge_comm_cmd.h         |   2 +
 .../ethernet/hisilicon/hns3/hns3_debugfs.c    |   3 +
 .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
 .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
 .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
 .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
 8 files changed, 124 insertions(+), 6 deletions(-)

-- 
2.30.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH net-next 1/2] net: hns3: add hns3 vf fault detect cap bit support
  2023-01-13  2:08 [PATCH net-next 0/2] add some vf fault detect patch for hns Hao Lan
@ 2023-01-13  2:08 ` Hao Lan
  2023-01-13  2:08 ` [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras Hao Lan
  1 sibling, 0 replies; 15+ messages in thread
From: Hao Lan @ 2023-01-13  2:08 UTC (permalink / raw)
  To: davem, kuba
  Cc: yisen.zhuang, salil.mehta, edumazet, pabeni, richardcochran,
	shenjian15, wangjie125, netdev

From: Jie Wang <wangjie125@huawei.com>

Currently hns3 driver is designed to support VF fault detect feature in
new hardwares. For code compatibility, vf fault detect cap bit is added to
the driver.

Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h                   | 4 ++++
 .../net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c  | 1 +
 .../net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h  | 1 +
 drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c            | 3 +++
 4 files changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 17137de9338c..cd85c360335d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -98,6 +98,7 @@ enum HNAE3_DEV_CAP_BITS {
 	HNAE3_DEV_SUPPORT_MC_MAC_MNG_B,
 	HNAE3_DEV_SUPPORT_CQ_B,
 	HNAE3_DEV_SUPPORT_FEC_STATS_B,
+	HNAE3_DEV_SUPPORT_VF_FAULT_B,
 	HNAE3_DEV_SUPPORT_LANE_NUM_B,
 };
 
@@ -164,6 +165,9 @@ enum HNAE3_DEV_CAP_BITS {
 #define hnae3_ae_dev_fec_stats_supported(ae_dev) \
 	test_bit(HNAE3_DEV_SUPPORT_FEC_STATS_B, (ae_dev)->caps)
 
+#define hnae3_ae_dev_vf_fault_supported(ae_dev) \
+	test_bit(HNAE3_DEV_SUPPORT_VF_FAULT_B, (ae_dev)->caps)
+
 #define hnae3_ae_dev_lane_num_supported(ae_dev) \
 	test_bit(HNAE3_DEV_SUPPORT_LANE_NUM_B, (ae_dev)->caps)
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c
index f671a63cecde..2c2d8c7b4e2a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.c
@@ -155,6 +155,7 @@ static const struct hclge_comm_caps_bit_map hclge_pf_cmd_caps[] = {
 	{HCLGE_COMM_CAP_FD_B, HNAE3_DEV_SUPPORT_FD_B},
 	{HCLGE_COMM_CAP_FEC_STATS_B, HNAE3_DEV_SUPPORT_FEC_STATS_B},
 	{HCLGE_COMM_CAP_LANE_NUM_B, HNAE3_DEV_SUPPORT_LANE_NUM_B},
+	{HCLGE_COMM_CAP_VF_FAULT_B, HNAE3_DEV_SUPPORT_VF_FAULT_B},
 };
 
 static const struct hclge_comm_caps_bit_map hclge_vf_cmd_caps[] = {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
index b1f9383b418f..ca3692dc2848 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
@@ -344,6 +344,7 @@ enum HCLGE_COMM_CAP_BITS {
 	HCLGE_COMM_CAP_GRO_B = 20,
 	HCLGE_COMM_CAP_FD_B = 21,
 	HCLGE_COMM_CAP_FEC_STATS_B = 25,
+	HCLGE_COMM_CAP_VF_FAULT_B = 26,
 	HCLGE_COMM_CAP_LANE_NUM_B = 27,
 };
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
index 66feb23f7b7b..cf86ba96716b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
@@ -408,6 +408,9 @@ static struct hns3_dbg_cap_info hns3_dbg_cap[] = {
 	}, {
 		.name = "support lane num",
 		.cap_bit = HNAE3_DEV_SUPPORT_LANE_NUM_B,
+	}, {
+		.name = "support vf fault detect",
+		.cap_bit = HNAE3_DEV_SUPPORT_VF_FAULT_B,
 	}
 };
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-13  2:08 [PATCH net-next 0/2] add some vf fault detect patch for hns Hao Lan
  2023-01-13  2:08 ` [PATCH net-next 1/2] net: hns3: add hns3 vf fault detect cap bit support Hao Lan
@ 2023-01-13  2:08 ` Hao Lan
  2023-01-13  6:51   ` Leon Romanovsky
  2023-01-17  8:04   ` Paolo Abeni
  1 sibling, 2 replies; 15+ messages in thread
From: Hao Lan @ 2023-01-13  2:08 UTC (permalink / raw)
  To: davem, kuba
  Cc: yisen.zhuang, salil.mehta, edumazet, pabeni, richardcochran,
	shenjian15, wangjie125, netdev

From: Jie Wang <wangjie125@huawei.com>

Currently hns3 driver supports vf fault detect feature. Several ras caused
by VF resources don't need to do PF function reset for recovery. The driver
only needs to reset the specified VF.

So this patch adds process in ras module. New process will get detailed
information about ras and do the most correct measures based on these
accurate information.

Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
 .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
 .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
 .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
 .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
 .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
 6 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index cd85c360335d..76a6230ccfee 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -265,6 +265,7 @@ enum hnae3_reset_type {
 	HNAE3_GLOBAL_RESET,
 	HNAE3_IMP_RESET,
 	HNAE3_NONE_RESET,
+	HNAE3_VF_EXP_RESET,
 	HNAE3_MAX_RESET,
 };
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
index ca3692dc2848..8f823cdc0543 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
@@ -92,6 +92,7 @@ enum hclge_opcode_type {
 	HCLGE_OPC_DFX_SSU_REG_2		= 0x004F,
 
 	HCLGE_OPC_QUERY_DEV_SPECS	= 0x0050,
+	HCLGE_OPC_GET_QUEUE_ERR_VF      = 0x0067,
 
 	/* MAC command */
 	HCLGE_OPC_CONFIG_MAC_MODE	= 0x0301,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
index 6efd768cc07c..cf8f3882304a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
@@ -1299,10 +1299,12 @@ static const struct hclge_hw_type_id hclge_hw_type_id_st[] = {
 		.msg = "tqp_int_ecc_error"
 	}, {
 		.type_id = PF_ABNORMAL_INT_ERROR,
-		.msg = "pf_abnormal_int_error"
+		.msg = "pf_abnormal_int_error",
+		.cause_by_vf = true
 	}, {
 		.type_id = MPF_ABNORMAL_INT_ERROR,
-		.msg = "mpf_abnormal_int_error"
+		.msg = "mpf_abnormal_int_error",
+		.cause_by_vf = true
 	}, {
 		.type_id = COMMON_ERROR,
 		.msg = "common_error"
@@ -2757,7 +2759,7 @@ void hclge_handle_occurred_error(struct hclge_dev *hdev)
 		hclge_handle_error_info_log(ae_dev);
 }
 
-static void
+static bool
 hclge_handle_error_type_reg_log(struct device *dev,
 				struct hclge_mod_err_info *mod_info,
 				struct hclge_type_reg_err_info *type_reg_info)
@@ -2768,6 +2770,7 @@ hclge_handle_error_type_reg_log(struct device *dev,
 	u8 mod_id, total_module, type_id, total_type, i, is_ras;
 	u8 index_module = MODULE_NONE;
 	u8 index_type = NONE_ERROR;
+	bool cause_by_vf = false;
 
 	mod_id = mod_info->mod_id;
 	type_id = type_reg_info->type_id & HCLGE_ERR_TYPE_MASK;
@@ -2786,6 +2789,7 @@ hclge_handle_error_type_reg_log(struct device *dev,
 	for (i = 0; i < total_type; i++) {
 		if (type_id == hclge_hw_type_id_st[i].type_id) {
 			index_type = i;
+			cause_by_vf = hclge_hw_type_id_st[i].cause_by_vf;
 			break;
 		}
 	}
@@ -2803,6 +2807,8 @@ hclge_handle_error_type_reg_log(struct device *dev,
 	dev_err(dev, "reg_value:\n");
 	for (i = 0; i < type_reg_info->reg_num; i++)
 		dev_err(dev, "0x%08x\n", type_reg_info->hclge_reg[i]);
+
+	return cause_by_vf;
 }
 
 static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev,
@@ -2813,6 +2819,7 @@ static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev,
 	struct device *dev = &hdev->pdev->dev;
 	struct hclge_mod_err_info *mod_info;
 	struct hclge_sum_err_info *sum_info;
+	bool cause_by_vf = false;
 	u8 mod_num, err_num, i;
 	u32 offset = 0;
 
@@ -2841,12 +2848,16 @@ static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev,
 
 			type_reg_info = (struct hclge_type_reg_err_info *)
 					    &buf[offset++];
-			hclge_handle_error_type_reg_log(dev, mod_info,
-							type_reg_info);
+			if (hclge_handle_error_type_reg_log(dev, mod_info,
+							    type_reg_info))
+				cause_by_vf = true;
 
 			offset += type_reg_info->reg_num;
 		}
 	}
+
+	if (hnae3_ae_dev_vf_fault_supported(hdev->ae_dev) && cause_by_vf)
+		set_bit(HNAE3_VF_EXP_RESET, &ae_dev->hw_err_reset_req);
 }
 
 static int hclge_query_all_err_bd_num(struct hclge_dev *hdev, u32 *bd_num)
@@ -2938,3 +2949,95 @@ int hclge_handle_error_info_log(struct hnae3_ae_dev *ae_dev)
 out:
 	return ret;
 }
+
+static bool hclge_reset_vf_in_bitmap(struct hclge_dev *hdev,
+				     unsigned long *bitmap)
+{
+	struct hclge_vport *vport;
+	bool exist_set = false;
+	int func_id;
+	int ret;
+
+	func_id = find_first_bit(bitmap, HCLGE_VPORT_NUM);
+	if (func_id == PF_VPORT_ID)
+		return false;
+
+	while (func_id != HCLGE_VPORT_NUM) {
+		vport = hclge_get_vf_vport(hdev,
+					   func_id - HCLGE_VF_VPORT_START_NUM);
+		if (!vport) {
+			dev_err(&hdev->pdev->dev, "invalid func id(%d)\n",
+				func_id);
+			return false;
+		}
+
+		dev_info(&hdev->pdev->dev, "do function %d recovery.", func_id);
+
+		ret = hclge_reset_tqp(&vport->nic);
+		if (ret) {
+			dev_err(&hdev->pdev->dev,
+				"failed to reset tqp, ret = %d.", ret);
+			return false;
+		}
+
+		ret = hclge_func_reset_cmd(hdev, func_id);
+		if (ret) {
+			dev_err(&hdev->pdev->dev,
+				"failed to reset func %d, ret = %d.",
+				func_id, ret);
+			return false;
+		}
+
+		exist_set = true;
+		clear_bit(func_id, bitmap);
+		func_id = find_first_bit(bitmap, HCLGE_VPORT_NUM);
+	}
+
+	return exist_set;
+}
+
+static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
+				      unsigned long *bitmap)
+{
+#define HCLGE_FIR_FAULT_BYTES	24
+#define HCLGE_SEC_FAULT_BYTES	8
+
+	u8 *buff;
+
+	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
+	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
+	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
+}
+
+int hclge_handle_vf_queue_err_ras(struct hclge_dev *hdev)
+{
+	unsigned long vf_fault_bitmap[BITS_TO_LONGS(HCLGE_VPORT_NUM)];
+	struct hclge_desc desc[2];
+	bool cause_by_vf = false;
+	int ret;
+
+	if (!hnae3_ae_dev_vf_fault_supported(hdev->ae_dev) ||
+	    !test_and_clear_bit(HNAE3_VF_EXP_RESET,
+				&hdev->ae_dev->hw_err_reset_req))
+		return 0;
+
+	hclge_comm_cmd_setup_basic_desc(&desc[0], HCLGE_OPC_GET_QUEUE_ERR_VF,
+					true);
+	desc[0].flag |= cpu_to_le16(HCLGE_COMM_CMD_FLAG_NEXT);
+	hclge_comm_cmd_setup_basic_desc(&desc[1], HCLGE_OPC_GET_QUEUE_ERR_VF,
+					true);
+
+	ret = hclge_comm_cmd_send(&hdev->hw.hw, desc, 2);
+	if (ret) {
+		dev_err(&hdev->pdev->dev,
+			"failed to get vf bitmap, ret = %d.\n", ret);
+		return ret;
+	}
+	hclge_get_vf_fault_bitmap(desc, vf_fault_bitmap);
+
+	cause_by_vf = hclge_reset_vf_in_bitmap(hdev, vf_fault_bitmap);
+	if (cause_by_vf)
+		hdev->ae_dev->hw_err_reset_req = 0;
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
index 86be6fb32990..68b738affa66 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
@@ -196,6 +196,7 @@ struct hclge_hw_module_id {
 struct hclge_hw_type_id {
 	enum hclge_err_type_list type_id;
 	const char *msg;
+	bool cause_by_vf; /* indicate the error may from vf exception */
 };
 
 struct hclge_sum_err_info {
@@ -228,4 +229,5 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
 			       unsigned long *reset_requests);
 int hclge_handle_error_info_log(struct hnae3_ae_dev *ae_dev);
 int hclge_handle_mac_tnl(struct hclge_dev *hdev);
+int hclge_handle_vf_queue_err_ras(struct hclge_dev *hdev);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 07ad5f35219e..a1f1f72db8a6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3520,7 +3520,7 @@ static int hclge_get_status(struct hnae3_handle *handle)
 	return hdev->hw.mac.link;
 }
 
-static struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf)
+struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf)
 {
 	if (!pci_num_vf(hdev->pdev)) {
 		dev_err(&hdev->pdev->dev,
@@ -4559,6 +4559,7 @@ static void hclge_handle_err_recovery(struct hclge_dev *hdev)
 	if (hclge_find_error_source(hdev)) {
 		hclge_handle_error_info_log(ae_dev);
 		hclge_handle_mac_tnl(hdev);
+		hclge_handle_vf_queue_err_ras(hdev);
 	}
 
 	hclge_handle_err_reset_request(hdev);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 13f23d606e77..73cfc26f5389 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -1148,4 +1148,5 @@ int hclge_dbg_dump_rst_info(struct hclge_dev *hdev, char *buf, int len);
 int hclge_push_vf_link_status(struct hclge_vport *vport);
 int hclge_enable_vport_vlan_filter(struct hclge_vport *vport, bool request_en);
 int hclge_mac_update_stats(struct hclge_dev *hdev);
+struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf);
 #endif
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-13  2:08 ` [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras Hao Lan
@ 2023-01-13  6:51   ` Leon Romanovsky
  2023-01-17  7:04     ` wangjie (L)
  2023-01-17  8:04   ` Paolo Abeni
  1 sibling, 1 reply; 15+ messages in thread
From: Leon Romanovsky @ 2023-01-13  6:51 UTC (permalink / raw)
  To: Hao Lan
  Cc: davem, kuba, yisen.zhuang, salil.mehta, edumazet, pabeni,
	richardcochran, shenjian15, wangjie125, netdev

On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> From: Jie Wang <wangjie125@huawei.com>
> 
> Currently hns3 driver supports vf fault detect feature. Several ras caused
> by VF resources don't need to do PF function reset for recovery. The driver
> only needs to reset the specified VF.
> 
> So this patch adds process in ras module. New process will get detailed
> information about ras and do the most correct measures based on these
> accurate information.
> 
> Signed-off-by: Jie Wang <wangjie125@huawei.com>
> Signed-off-by: Hao Lan <lanhao@huawei.com>
> ---
>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>  6 files changed, 115 insertions(+), 6 deletions(-)

Why is it good idea to reset VF from PF?
What will happen with driver bound to this VF?
Shouldn't PCI recovery handle it?

Thanks

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-13  6:51   ` Leon Romanovsky
@ 2023-01-17  7:04     ` wangjie (L)
  2023-01-17 11:21       ` Leon Romanovsky
  0 siblings, 1 reply; 15+ messages in thread
From: wangjie (L) @ 2023-01-17  7:04 UTC (permalink / raw)
  To: Leon Romanovsky, Hao Lan
  Cc: davem, kuba, yisen.zhuang, salil.mehta, edumazet, pabeni,
	richardcochran, shenjian15, netdev



On 2023/1/13 14:51, Leon Romanovsky wrote:
> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>> From: Jie Wang <wangjie125@huawei.com>
>>
>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>> by VF resources don't need to do PF function reset for recovery. The driver
>> only needs to reset the specified VF.
>>
>> So this patch adds process in ras module. New process will get detailed
>> information about ras and do the most correct measures based on these
>> accurate information.
>>
>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>> ---
>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>  6 files changed, 115 insertions(+), 6 deletions(-)
>
> Why is it good idea to reset VF from PF?
> What will happen with driver bound to this VF?
> Shouldn't PCI recovery handle it?
>
> Thanks
> .
PF doesn't reset VF directly. These VF faults are detected by hardware,
and only reported to PF. PF get the VF id from firmware, then notify the 
VF that it needs reset. VF will do reset after receive the request.

These hardware faults are not standard PCI ras event, so we prefer to 
use MSIx path.
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-13  2:08 ` [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras Hao Lan
  2023-01-13  6:51   ` Leon Romanovsky
@ 2023-01-17  8:04   ` Paolo Abeni
  2023-01-18 12:36     ` wangjie (L)
  2023-01-18 12:36     ` wangjie (L)
  1 sibling, 2 replies; 15+ messages in thread
From: Paolo Abeni @ 2023-01-17  8:04 UTC (permalink / raw)
  To: Hao Lan, kuba, Leon Romanovsky
  Cc: yisen.zhuang, salil.mehta, edumazet, richardcochran, shenjian15,
	wangjie125, netdev, davem

Hello,

On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote:
[...]

> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
> +				      unsigned long *bitmap)
> +{
> +#define HCLGE_FIR_FAULT_BYTES	24
> +#define HCLGE_SEC_FAULT_BYTES	8
> +
> +	u8 *buff;
> +
> +	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
> +	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
> +	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
> +}

The above works under the assumption that:

	HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM)

I think it's better to enforce such condition at build time with a
BUILD_BUG_ON(), to avoid future issues.

Also I think Leon still deserve a reply to one of his questions,
specifically: What will happen (at recovery time) with driver bound to
this VF?

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-17  7:04     ` wangjie (L)
@ 2023-01-17 11:21       ` Leon Romanovsky
  2023-01-18 12:34         ` wangjie (L)
  0 siblings, 1 reply; 15+ messages in thread
From: Leon Romanovsky @ 2023-01-17 11:21 UTC (permalink / raw)
  To: wangjie (L)
  Cc: Hao Lan, davem, kuba, yisen.zhuang, salil.mehta, edumazet,
	pabeni, richardcochran, shenjian15, netdev

On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/13 14:51, Leon Romanovsky wrote:
> > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > From: Jie Wang <wangjie125@huawei.com>
> > > 
> > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > by VF resources don't need to do PF function reset for recovery. The driver
> > > only needs to reset the specified VF.
> > > 
> > > So this patch adds process in ras module. New process will get detailed
> > > information about ras and do the most correct measures based on these
> > > accurate information.
> > > 
> > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > ---
> > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > 
> > Why is it good idea to reset VF from PF?
> > What will happen with driver bound to this VF?
> > Shouldn't PCI recovery handle it?
> > 
> > Thanks
> > .
> PF doesn't reset VF directly. These VF faults are detected by hardware,
> and only reported to PF. PF get the VF id from firmware, then notify the VF
> that it needs reset. VF will do reset after receive the request.

This description isn't aligned with the code. You are issuing
hclge_func_reset_cmd() command which will reset VF, while notification
are handled by hclge_func_reset_notify_vf().

It also doesn't make any sense to send notification event to VF through
FW while the goal is to recover from stuck FW in that VF.

> 
> These hardware faults are not standard PCI ras event, so we prefer to use
> MSIx path.

What is different here?

> > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-17 11:21       ` Leon Romanovsky
@ 2023-01-18 12:34         ` wangjie (L)
  2023-01-20 17:12           ` Leon Romanovsky
  0 siblings, 1 reply; 15+ messages in thread
From: wangjie (L) @ 2023-01-18 12:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Hao Lan, davem, kuba, yisen.zhuang, salil.mehta, edumazet,
	pabeni, richardcochran, shenjian15, netdev



On 2023/1/17 19:21, Leon Romanovsky wrote:
> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
>>
>>
>> On 2023/1/13 14:51, Leon Romanovsky wrote:
>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>>>> From: Jie Wang <wangjie125@huawei.com>
>>>>
>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>>>> by VF resources don't need to do PF function reset for recovery. The driver
>>>> only needs to reset the specified VF.
>>>>
>>>> So this patch adds process in ras module. New process will get detailed
>>>> information about ras and do the most correct measures based on these
>>>> accurate information.
>>>>
>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>>>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>>>> ---
>>>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>>>  6 files changed, 115 insertions(+), 6 deletions(-)
>>>
>>> Why is it good idea to reset VF from PF?
>>> What will happen with driver bound to this VF?
>>> Shouldn't PCI recovery handle it?
>>>
>>> Thanks
>>> .
>> PF doesn't reset VF directly. These VF faults are detected by hardware,
>> and only reported to PF. PF get the VF id from firmware, then notify the VF
>> that it needs reset. VF will do reset after receive the request.
>
> This description isn't aligned with the code. You are issuing
> hclge_func_reset_cmd() command which will reset VF, while notification
> are handled by hclge_func_reset_notify_vf().
>
> It also doesn't make any sense to send notification event to VF through
> FW while the goal is to recover from stuck FW in that VF.
>
Yes, I misunderstand the hclge_func_reset_notify_vf and
hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
the VF for recovery. I will fix and retest it in V2.

This patch is used to recover specific vf hardware errors, for example the
tx queue configuration exceptions. It make sense in these cases for the
firmware is still working properly and can do the recovery rightly.
>>
>> These hardware faults are not standard PCI ras event, so we prefer to use
>> MSIx path.
>
> What is different here?
>
These hardware faults are reported by MSIx interrupts instead of PCI ras
path.

Thanks!
>>>
> .
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-17  8:04   ` Paolo Abeni
@ 2023-01-18 12:36     ` wangjie (L)
  2023-01-18 12:36     ` wangjie (L)
  1 sibling, 0 replies; 15+ messages in thread
From: wangjie (L) @ 2023-01-18 12:36 UTC (permalink / raw)
  To: Paolo Abeni, Hao Lan, kuba, Leon Romanovsky
  Cc: yisen.zhuang, salil.mehta, edumazet, richardcochran, shenjian15,
	netdev, davem



On 2023/1/17 16:04, Paolo Abeni wrote:
> Hello,
>
> On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote:
> [...]
>
>> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
>> +				      unsigned long *bitmap)
>> +{
>> +#define HCLGE_FIR_FAULT_BYTES	24
>> +#define HCLGE_SEC_FAULT_BYTES	8
>> +
>> +	u8 *buff;
>> +
>> +	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
>> +	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
>> +	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
>> +}
>
> The above works under the assumption that:
>
> 	HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM)
>
> I think it's better to enforce such condition at build time with a
> BUILD_BUG_ON(), to avoid future issues.
>
> Also I think Leon still deserve a reply to one of his questions,
> specifically: What will happen (at recovery time) with driver bound to
> this VF?
>
> Thanks!
>
> Paolo
>
As discussed in reply to Leon, it missed to notify VF driver to prepare for
the reset in V1, So I will add notify to VF driver in v2 to make sure the
VF driver will init to normal at recovery time.

Thanks!
> .
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-17  8:04   ` Paolo Abeni
  2023-01-18 12:36     ` wangjie (L)
@ 2023-01-18 12:36     ` wangjie (L)
  1 sibling, 0 replies; 15+ messages in thread
From: wangjie (L) @ 2023-01-18 12:36 UTC (permalink / raw)
  To: Paolo Abeni, Hao Lan, kuba, Leon Romanovsky
  Cc: yisen.zhuang, salil.mehta, edumazet, richardcochran, shenjian15,
	netdev, davem



On 2023/1/17 16:04, Paolo Abeni wrote:
> Hello,
>
> On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote:
> [...]
>
>> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
>> +				      unsigned long *bitmap)
>> +{
>> +#define HCLGE_FIR_FAULT_BYTES	24
>> +#define HCLGE_SEC_FAULT_BYTES	8
>> +
>> +	u8 *buff;
>> +
>> +	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
>> +	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
>> +	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
>> +}
>
> The above works under the assumption that:
>
> 	HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM)
>
> I think it's better to enforce such condition at build time with a
> BUILD_BUG_ON(), to avoid future issues.
>

OK, I will add it in v2

> Also I think Leon still deserve a reply to one of his questions,
> specifically: What will happen (at recovery time) with driver bound to
> this VF?
>
> Thanks!
>
> Paolo
>
> .
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-18 12:34         ` wangjie (L)
@ 2023-01-20 17:12           ` Leon Romanovsky
  2023-01-31 12:04             ` wangjie (L)
  0 siblings, 1 reply; 15+ messages in thread
From: Leon Romanovsky @ 2023-01-20 17:12 UTC (permalink / raw)
  To: wangjie (L)
  Cc: Hao Lan, davem, kuba, yisen.zhuang, salil.mehta, edumazet,
	pabeni, richardcochran, shenjian15, netdev

On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/17 19:21, Leon Romanovsky wrote:
> > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> > > 
> > > 
> > > On 2023/1/13 14:51, Leon Romanovsky wrote:
> > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > > > From: Jie Wang <wangjie125@huawei.com>
> > > > > 
> > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > > > by VF resources don't need to do PF function reset for recovery. The driver
> > > > > only needs to reset the specified VF.
> > > > > 
> > > > > So this patch adds process in ras module. New process will get detailed
> > > > > information about ras and do the most correct measures based on these
> > > > > accurate information.
> > > > > 
> > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > > > ---
> > > > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > > > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > > > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > > > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > > > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > > > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > > > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > > > 
> > > > Why is it good idea to reset VF from PF?
> > > > What will happen with driver bound to this VF?
> > > > Shouldn't PCI recovery handle it?
> > > > 
> > > > Thanks
> > > > .
> > > PF doesn't reset VF directly. These VF faults are detected by hardware,
> > > and only reported to PF. PF get the VF id from firmware, then notify the VF
> > > that it needs reset. VF will do reset after receive the request.
> > 
> > This description isn't aligned with the code. You are issuing
> > hclge_func_reset_cmd() command which will reset VF, while notification
> > are handled by hclge_func_reset_notify_vf().
> > 
> > It also doesn't make any sense to send notification event to VF through
> > FW while the goal is to recover from stuck FW in that VF.
> > 
> Yes, I misunderstand the hclge_func_reset_notify_vf and
> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
> the VF for recovery. I will fix and retest it in V2.
> 
> This patch is used to recover specific vf hardware errors, for example the
> tx queue configuration exceptions. It make sense in these cases for the
> firmware is still working properly and can do the recovery rightly.

If FW is operational and knows about failure, why can't FW do recovery
internally to that VF without PF involvement?

Thanks

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-20 17:12           ` Leon Romanovsky
@ 2023-01-31 12:04             ` wangjie (L)
  2023-01-31 13:24               ` Leon Romanovsky
  0 siblings, 1 reply; 15+ messages in thread
From: wangjie (L) @ 2023-01-31 12:04 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Hao Lan, davem, kuba, yisen.zhuang, salil.mehta, edumazet,
	pabeni, richardcochran, shenjian15, netdev



On 2023/1/21 1:12, Leon Romanovsky wrote:
> On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
>>
>>
>> On 2023/1/17 19:21, Leon Romanovsky wrote:
>>> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
>>>>
>>>>
>>>> On 2023/1/13 14:51, Leon Romanovsky wrote:
>>>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>>>>>> From: Jie Wang <wangjie125@huawei.com>
>>>>>>
>>>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>>>>>> by VF resources don't need to do PF function reset for recovery. The driver
>>>>>> only needs to reset the specified VF.
>>>>>>
>>>>>> So this patch adds process in ras module. New process will get detailed
>>>>>> information about ras and do the most correct measures based on these
>>>>>> accurate information.
>>>>>>
>>>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>>>>>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>>>>>> ---
>>>>>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>>>>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>>>>>  6 files changed, 115 insertions(+), 6 deletions(-)
>>>>>
>>>>> Why is it good idea to reset VF from PF?
>>>>> What will happen with driver bound to this VF?
>>>>> Shouldn't PCI recovery handle it?
>>>>>
>>>>> Thanks
>>>>> .
>>>> PF doesn't reset VF directly. These VF faults are detected by hardware,
>>>> and only reported to PF. PF get the VF id from firmware, then notify the VF
>>>> that it needs reset. VF will do reset after receive the request.
>>>
>>> This description isn't aligned with the code. You are issuing
>>> hclge_func_reset_cmd() command which will reset VF, while notification
>>> are handled by hclge_func_reset_notify_vf().
>>>
>>> It also doesn't make any sense to send notification event to VF through
>>> FW while the goal is to recover from stuck FW in that VF.
>>>
>> Yes, I misunderstand the hclge_func_reset_notify_vf and
>> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
>> the VF for recovery. I will fix and retest it in V2.
>>
>> This patch is used to recover specific vf hardware errors, for example the
>> tx queue configuration exceptions. It make sense in these cases for the
>> firmware is still working properly and can do the recovery rightly.
>
> If FW is operational and knows about failure, why can't FW do recovery
> internally to that VF without PF involvement?
I'm sorry to reply so late because I took a vacation. If firmware reset VF
hardware directly without notify the running VF driver, it will cause VF
driver works abnormal.

Thanks
>
> Thanks
> .
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-31 12:04             ` wangjie (L)
@ 2023-01-31 13:24               ` Leon Romanovsky
  2023-02-02 13:08                 ` wangjie (L)
  0 siblings, 1 reply; 15+ messages in thread
From: Leon Romanovsky @ 2023-01-31 13:24 UTC (permalink / raw)
  To: wangjie (L)
  Cc: Hao Lan, davem, kuba, yisen.zhuang, salil.mehta, edumazet,
	pabeni, richardcochran, shenjian15, netdev

On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/21 1:12, Leon Romanovsky wrote:
> > On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
> > > 
> > > 
> > > On 2023/1/17 19:21, Leon Romanovsky wrote:
> > > > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> > > > > 
> > > > > 
> > > > > On 2023/1/13 14:51, Leon Romanovsky wrote:
> > > > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > > > > > From: Jie Wang <wangjie125@huawei.com>
> > > > > > > 
> > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > > > > > by VF resources don't need to do PF function reset for recovery. The driver
> > > > > > > only needs to reset the specified VF.
> > > > > > > 
> > > > > > > So this patch adds process in ras module. New process will get detailed
> > > > > > > information about ras and do the most correct measures based on these
> > > > > > > accurate information.
> > > > > > > 
> > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > > > > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > > > > > ---
> > > > > > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > > > > > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > > > > > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > Why is it good idea to reset VF from PF?
> > > > > > What will happen with driver bound to this VF?
> > > > > > Shouldn't PCI recovery handle it?
> > > > > > 
> > > > > > Thanks
> > > > > > .
> > > > > PF doesn't reset VF directly. These VF faults are detected by hardware,
> > > > > and only reported to PF. PF get the VF id from firmware, then notify the VF
> > > > > that it needs reset. VF will do reset after receive the request.
> > > > 
> > > > This description isn't aligned with the code. You are issuing
> > > > hclge_func_reset_cmd() command which will reset VF, while notification
> > > > are handled by hclge_func_reset_notify_vf().
> > > > 
> > > > It also doesn't make any sense to send notification event to VF through
> > > > FW while the goal is to recover from stuck FW in that VF.
> > > > 
> > > Yes, I misunderstand the hclge_func_reset_notify_vf and
> > > hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
> > > the VF for recovery. I will fix and retest it in V2.
> > > 
> > > This patch is used to recover specific vf hardware errors, for example the
> > > tx queue configuration exceptions. It make sense in these cases for the
> > > firmware is still working properly and can do the recovery rightly.
> > 
> > If FW is operational and knows about failure, why can't FW do recovery
> > internally to that VF without PF involvement?
> I'm sorry to reply so late because I took a vacation. If firmware reset VF
> hardware directly without notify the running VF driver, it will cause VF
> driver works abnormal.

mlx5 health recovery code proves that it is possible to do.
Even in your case, FW can notify VF without PF in the middle.

Thanks

> 
> Thanks
> > 
> > Thanks
> > .
> > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-01-31 13:24               ` Leon Romanovsky
@ 2023-02-02 13:08                 ` wangjie (L)
  2023-02-06 13:36                   ` Leon Romanovsky
  0 siblings, 1 reply; 15+ messages in thread
From: wangjie (L) @ 2023-02-02 13:08 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Hao Lan, davem, kuba, yisen.zhuang, salil.mehta, edumazet,
	pabeni, richardcochran, shenjian15, netdev



On 2023/1/31 21:24, Leon Romanovsky wrote:
> On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote:
>>
>>
>> On 2023/1/21 1:12, Leon Romanovsky wrote:
>>> On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
>>>>
>>>>
>>>> On 2023/1/17 19:21, Leon Romanovsky wrote:
>>>>> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
>>>>>>
>>>>>>
>>>>>> On 2023/1/13 14:51, Leon Romanovsky wrote:
>>>>>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>>>>>>>> From: Jie Wang <wangjie125@huawei.com>
>>>>>>>>
>>>>>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>>>>>>>> by VF resources don't need to do PF function reset for recovery. The driver
>>>>>>>> only needs to reset the specified VF.
>>>>>>>>
>>>>>>>> So this patch adds process in ras module. New process will get detailed
>>>>>>>> information about ras and do the most correct measures based on these
>>>>>>>> accurate information.
>>>>>>>>
>>>>>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>>>>>>>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>>>>>>>> ---
>>>>>>>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>>>>>>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>>>>>>>  6 files changed, 115 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> Why is it good idea to reset VF from PF?
>>>>>>> What will happen with driver bound to this VF?
>>>>>>> Shouldn't PCI recovery handle it?
>>>>>>>
>>>>>>> Thanks
>>>>>>> .
>>>>>> PF doesn't reset VF directly. These VF faults are detected by hardware,
>>>>>> and only reported to PF. PF get the VF id from firmware, then notify the VF
>>>>>> that it needs reset. VF will do reset after receive the request.
>>>>>
>>>>> This description isn't aligned with the code. You are issuing
>>>>> hclge_func_reset_cmd() command which will reset VF, while notification
>>>>> are handled by hclge_func_reset_notify_vf().
>>>>>
>>>>> It also doesn't make any sense to send notification event to VF through
>>>>> FW while the goal is to recover from stuck FW in that VF.
>>>>>
>>>> Yes, I misunderstand the hclge_func_reset_notify_vf and
>>>> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
>>>> the VF for recovery. I will fix and retest it in V2.
>>>>
>>>> This patch is used to recover specific vf hardware errors, for example the
>>>> tx queue configuration exceptions. It make sense in these cases for the
>>>> firmware is still working properly and can do the recovery rightly.
>>>
>>> If FW is operational and knows about failure, why can't FW do recovery
>>> internally to that VF without PF involvement?
>> I'm sorry to reply so late because I took a vacation. If firmware reset VF
>> hardware directly without notify the running VF driver, it will cause VF
>> driver works abnormal.
>
> mlx5 health recovery code proves that it is possible to do.
> Even in your case, FW can notify VF without PF in the middle.
>
> Thanks
>
These faults only report to PF in hns3 devices, even if devlink health is
used in hns3 driver, these faults also need to report to PF.

Thanks
>>
>> Thanks
>>>
>>> Thanks
>>> .
>>>
> .
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras
  2023-02-02 13:08                 ` wangjie (L)
@ 2023-02-06 13:36                   ` Leon Romanovsky
  0 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-02-06 13:36 UTC (permalink / raw)
  To: wangjie (L)
  Cc: Hao Lan, davem, kuba, yisen.zhuang, salil.mehta, edumazet,
	pabeni, richardcochran, shenjian15, netdev

On Thu, Feb 02, 2023 at 09:08:37PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/31 21:24, Leon Romanovsky wrote:
> > On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote:
> > > 
> > > 
> > > On 2023/1/21 1:12, Leon Romanovsky wrote:
> > > > On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
> > > > > 
> > > > > 
> > > > > On 2023/1/17 19:21, Leon Romanovsky wrote:
> > > > > > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 2023/1/13 14:51, Leon Romanovsky wrote:
> > > > > > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > > > > > > > From: Jie Wang <wangjie125@huawei.com>
> > > > > > > > > 
> > > > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > > > > > > > by VF resources don't need to do PF function reset for recovery. The driver
> > > > > > > > > only needs to reset the specified VF.
> > > > > > > > > 
> > > > > > > > > So this patch adds process in ras module. New process will get detailed
> > > > > > > > > information about ras and do the most correct measures based on these
> > > > > > > > > accurate information.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > > > > > > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > > > > > > > ---
> > > > > > > > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > > > > > > > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > > > > > > > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > > > > > > > 
> > > > > > > > Why is it good idea to reset VF from PF?
> > > > > > > > What will happen with driver bound to this VF?
> > > > > > > > Shouldn't PCI recovery handle it?
> > > > > > > > 
> > > > > > > > Thanks
> > > > > > > > .
> > > > > > > PF doesn't reset VF directly. These VF faults are detected by hardware,
> > > > > > > and only reported to PF. PF get the VF id from firmware, then notify the VF
> > > > > > > that it needs reset. VF will do reset after receive the request.
> > > > > > 
> > > > > > This description isn't aligned with the code. You are issuing
> > > > > > hclge_func_reset_cmd() command which will reset VF, while notification
> > > > > > are handled by hclge_func_reset_notify_vf().
> > > > > > 
> > > > > > It also doesn't make any sense to send notification event to VF through
> > > > > > FW while the goal is to recover from stuck FW in that VF.
> > > > > > 
> > > > > Yes, I misunderstand the hclge_func_reset_notify_vf and
> > > > > hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
> > > > > the VF for recovery. I will fix and retest it in V2.
> > > > > 
> > > > > This patch is used to recover specific vf hardware errors, for example the
> > > > > tx queue configuration exceptions. It make sense in these cases for the
> > > > > firmware is still working properly and can do the recovery rightly.
> > > > 
> > > > If FW is operational and knows about failure, why can't FW do recovery
> > > > internally to that VF without PF involvement?
> > > I'm sorry to reply so late because I took a vacation. If firmware reset VF
> > > hardware directly without notify the running VF driver, it will cause VF
> > > driver works abnormal.
> > 
> > mlx5 health recovery code proves that it is possible to do.
> > Even in your case, FW can notify VF without PF in the middle.
> > 
> > Thanks
> > 
> These faults only report to PF in hns3 devices, even if devlink health is
> used in hns3 driver, these faults also need to report to PF.

I don't think that it is correct, but let's see your v2 without VF reset
and with PF notifications.

Thanks

> 
> Thanks
> > > 
> > > Thanks
> > > > 
> > > > Thanks
> > > > .
> > > > 
> > .
> > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-02-06 13:36 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-13  2:08 [PATCH net-next 0/2] add some vf fault detect patch for hns Hao Lan
2023-01-13  2:08 ` [PATCH net-next 1/2] net: hns3: add hns3 vf fault detect cap bit support Hao Lan
2023-01-13  2:08 ` [PATCH net-next 2/2] net: hns3: add vf fault process in hns3 ras Hao Lan
2023-01-13  6:51   ` Leon Romanovsky
2023-01-17  7:04     ` wangjie (L)
2023-01-17 11:21       ` Leon Romanovsky
2023-01-18 12:34         ` wangjie (L)
2023-01-20 17:12           ` Leon Romanovsky
2023-01-31 12:04             ` wangjie (L)
2023-01-31 13:24               ` Leon Romanovsky
2023-02-02 13:08                 ` wangjie (L)
2023-02-06 13:36                   ` Leon Romanovsky
2023-01-17  8:04   ` Paolo Abeni
2023-01-18 12:36     ` wangjie (L)
2023-01-18 12:36     ` wangjie (L)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).