* [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver
@ 2023-06-21  9:10 Lingyu Liu
  2023-06-21  9:10 ` [Intel-wired-lan] [PATCH iwl-next V2 01/15] ice: Fix missing legacy 32byte RXDID in the supported bitmap Lingyu Liu
                   ` (14 more replies)
  0 siblings, 15 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:10 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

This series adds vfio live migration support for Intel E810 VF
devices based on the v2 migration protocol.
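
For orientation (not part of this series), here is a minimal userspace
sketch that probes whether a bound VFIO device reports the v2 migration
protocol through the standard VFIO_DEVICE_FEATURE ioctl. Opening the
device fd is assumed to happen elsewhere.

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Probe VFIO_DEVICE_FEATURE_MIGRATION on an already-open device fd. */
static int probe_v2_migration(int device_fd)
{
	/* Feature header followed by the migration payload, 8-byte aligned. */
	uint8_t buf[sizeof(struct vfio_device_feature) +
		    sizeof(struct vfio_device_feature_migration)]
		    __attribute__((aligned(8))) = {0};
	struct vfio_device_feature *feat = (void *)buf;
	struct vfio_device_feature_migration *mig = (void *)feat->data;

	feat->argsz = sizeof(buf);
	feat->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feat))
		return -1;	/* no v2 migration support reported */

	printf("migration flags 0x%llx (STOP_COPY=%d)\n",
	       (unsigned long long)mig->flags,
	       !!(mig->flags & VFIO_MIGRATION_STOP_COPY));
	return 0;
}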

v2:
 - clarified comments and commit message
---
Lingyu Liu (11):
  ice: check VF migration status before sending messages to VF
  ice: add migration init field and helper functions
  ice: save VF messages as device state
  ice: save and restore device state
  ice: do not notify VF link state during migration
  ice: change VSI id in virtual channel message after migration
  ice: save and restore RX queue head
  ice: save and restore TX queue head
  ice: stop device before saving device states
  ice: mask VF advanced capabilities if live migration is activated
  vfio/ice: implement vfio_pci driver for E800 devices

Xu Ting (1):
  ice: Fix missing legacy 32byte RXDID in the supported bitmap

Yahui Cao (3):
  ice: add function to get rxq context
  vfio: Expose vfio_device_has_container()
  vfio/ice: support iommufd vfio compat mode

 MAINTAINERS                                   |   7 +
 drivers/net/ethernet/intel/ice/Makefile       |   1 +
 drivers/net/ethernet/intel/ice/ice.h          |   2 +
 drivers/net/ethernet/intel/ice/ice_common.c   | 268 +++++
 drivers/net/ethernet/intel/ice/ice_common.h   |   5 +
 .../net/ethernet/intel/ice/ice_migration.c    | 778 ++++++++++++++
 .../intel/ice/ice_migration_private.h         |  31 +
 drivers/net/ethernet/intel/ice/ice_vf_lib.c   |   8 +
 drivers/net/ethernet/intel/ice/ice_vf_lib.h   |   6 +
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 238 +++--
 drivers/net/ethernet/intel/ice/ice_virtchnl.h |  15 +-
 .../ethernet/intel/ice/ice_virtchnl_fdir.c    |  28 +-
 drivers/vfio/group.c                          |   1 +
 drivers/vfio/pci/Kconfig                      |   2 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/ice/Kconfig                  |  10 +
 drivers/vfio/pci/ice/Makefile                 |   4 +
 drivers/vfio/pci/ice/ice_vfio_pci.c           | 998 ++++++++++++++++++
 include/linux/net/intel/ice_migration.h       |  53 +
 include/linux/vfio.h                          |   1 +
 20 files changed, 2347 insertions(+), 111 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_migration.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_migration_private.h
 create mode 100644 drivers/vfio/pci/ice/Kconfig
 create mode 100644 drivers/vfio/pci/ice/Makefile
 create mode 100644 drivers/vfio/pci/ice/ice_vfio_pci.c
 create mode 100644 include/linux/net/intel/ice_migration.h

-- 
2.25.1

* [Intel-wired-lan] [PATCH iwl-next V2 01/15] ice: Fix missing legacy 32byte RXDID in the supported bitmap
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
@ 2023-06-21  9:10 ` Lingyu Liu
  2023-06-21  9:10 ` [Intel-wired-lan] [PATCH iwl-next V2 02/15] ice: add function to get rxq context Lingyu Liu
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:10 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra, Xu Ting

From: Xu Ting <ting.xu@intel.com>

The 32-byte legacy descriptor format is preassigned.
Commit e753df8fbca5 ("ice: Add support Flex RXD") created a
supported RXDIDs bitmap according to the DDP package, but it missed
the legacy 32-byte RXDID since that format is not listed in the
package. Mark the legacy 32-byte descriptor format as supported in
the supported RXDIDs bitmap.
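
Purely as an illustration (none of this is in the patch): once the bit
is reported, a VF driver negotiating descriptor formats over
VIRTCHNL_OP_GET_SUPPORTED_RXDIDS can detect the legacy 32-byte layout
with a check along these lines; the helper name is hypothetical.

/* Hypothetical VF-side check; the bit position mirrors ICE_RXDID_LEGACY_1. */
static bool vf_supports_legacy_32byte_rxdid(u64 supported_rxdids)
{
	return supported_rxdids & BIT(ICE_RXDID_LEGACY_1);
}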

Signed-off-by: Xu Ting <ting.xu@intel.com>
Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 92490fe655ea..9d74dcafde60 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -2617,10 +2617,13 @@ static int ice_vc_query_rxdid(struct ice_vf *vf)
 
 	/* Read flexiflag registers to determine whether the
 	 * corresponding RXDID is configured and supported or not.
-	 * Since Legacy 16byte descriptor format is not supported,
-	 * start from Legacy 32byte descriptor.
+	 * But the legacy 32byte RXDID is not listed in DDP package,
+	 * add it in the bitmap manually and skip check for it in the loop.
+	 * Legacy 16byte descriptor is not supported.
 	 */
-	for (i = ICE_RXDID_LEGACY_1; i < ICE_FLEX_DESC_RXDID_MAX_NUM; i++) {
+	rxdid->supported_rxdids |= BIT(ICE_RXDID_LEGACY_1);
+
+	for (i = ICE_RXDID_FLEX_NIC; i < ICE_FLEX_DESC_RXDID_MAX_NUM; i++) {
 		regval = rd32(hw, GLFLXP_RXDID_FLAGS(i, 0));
 		if ((regval >> GLFLXP_RXDID_FLAGS_FLEXIFLAG_4N_S)
 			& GLFLXP_RXDID_FLAGS_FLEXIFLAG_4N_M)
-- 
2.25.1

* [Intel-wired-lan] [PATCH iwl-next V2 02/15] ice: add function to get rxq context
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
  2023-06-21  9:10 ` [Intel-wired-lan] [PATCH iwl-next V2 01/15] ice: Fix missing legacy 32byte RXDID in the supported bitmap Lingyu Liu
@ 2023-06-21  9:10 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 03/15] ice: check VF migration status before sending messages to VF Lingyu Liu
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:10 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

From: Yahui Cao <yahui.cao@intel.com>

Export the Rx queue context. Read the raw context from the HW
registers, then use offsets and masks to extract the individual
context fields into the sparse structure.

The ice_vfio_pci driver introduced in later patches of this series
will use the exported rxq context when saving and restoring device
state.
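
A rough usage sketch of the new helper (the actual caller arrives later
in this series; the sketch assumes the 'head' field of
struct ice_rlan_ctx):

/* Sketch: read back the RX queue context for an absolute queue index
 * and report its current head pointer.
 */
static int ice_migration_peek_rx_head(struct ice_hw *hw, u32 rxq_index,
				      u16 *head)
{
	struct ice_rlan_ctx rlan_ctx = {};
	int err;

	err = ice_read_rxq_ctx(hw, &rlan_ctx, rxq_index);
	if (err)
		return err;

	*head = rlan_ctx.head;
	return 0;
}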

Signed-off-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_common.c | 268 ++++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_common.h |   5 +
 2 files changed, 273 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 7a9e88f3f4b5..21cd617f9620 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -1428,6 +1428,34 @@ ice_copy_rxq_ctx_to_hw(struct ice_hw *hw, u8 *ice_rxq_ctx, u32 rxq_index)
 	return 0;
 }
 
+/**
+ * ice_copy_rxq_ctx_from_hw - Copy rxq context register from HW
+ * @hw: pointer to the hardware structure
+ * @ice_rxq_ctx: pointer to the rxq context
+ * @rxq_index: the index of the Rx queue
+ *
+ * Copies rxq context from HW register space to dense structure
+ */
+static int
+ice_copy_rxq_ctx_from_hw(struct ice_hw *hw, u8 *ice_rxq_ctx, u32 rxq_index)
+{
+	u8 i;
+
+	if (!ice_rxq_ctx || rxq_index > QRX_CTRL_MAX_INDEX)
+		return -EINVAL;
+
+	/* Copy each dword separately from HW */
+	for (i = 0; i < ICE_RXQ_CTX_SIZE_DWORDS; i++) {
+		u32 *ctx = (u32 *)(ice_rxq_ctx + (i * sizeof(u32)));
+
+		*ctx = rd32(hw, QRX_CONTEXT(i, rxq_index));
+
+		ice_debug(hw, ICE_DBG_QCTX, "qrxdata[%d]: %08X\n", i, *ctx);
+	}
+
+	return 0;
+}
+
 /* LAN Rx Queue Context */
 static const struct ice_ctx_ele ice_rlan_ctx_info[] = {
 	/* Field		Width	LSB */
@@ -1479,6 +1507,32 @@ ice_write_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx *rlan_ctx,
 	return ice_copy_rxq_ctx_to_hw(hw, ctx_buf, rxq_index);
 }
 
+/**
+ * ice_read_rxq_ctx - Read rxq context from HW
+ * @hw: pointer to the hardware structure
+ * @rlan_ctx: pointer to the rxq context
+ * @rxq_index: the index of the Rx queue
+ *
+ * Read rxq context from HW register space and then converts it from dense
+ * structure to sparse
+ */
+int
+ice_read_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx *rlan_ctx,
+		 u32 rxq_index)
+{
+	u8 ctx_buf[ICE_RXQ_CTX_SZ] = { 0 };
+	int status;
+
+	if (!rlan_ctx)
+		return -EINVAL;
+
+	status = ice_copy_rxq_ctx_from_hw(hw, ctx_buf, rxq_index);
+	if (status)
+		return status;
+
+	return ice_get_ctx(ctx_buf, (u8 *)rlan_ctx, ice_rlan_ctx_info);
+}
+
 /* LAN Tx Queue Context */
 const struct ice_ctx_ele ice_tlan_ctx_info[] = {
 				    /* Field			Width	LSB */
@@ -4469,6 +4523,220 @@ ice_set_ctx(struct ice_hw *hw, u8 *src_ctx, u8 *dest_ctx,
 	return 0;
 }
 
+/**
+ * ice_read_byte - read context byte into struct
+ * @src_ctx:  the context structure to read from
+ * @dest_ctx: the context to be written to
+ * @ce_info:  a description of the struct to be filled
+ */
+static void
+ice_read_byte(u8 *src_ctx, u8 *dest_ctx, const struct ice_ctx_ele *ce_info)
+{
+	u8 dest_byte, mask;
+	u8 *src, *target;
+	u16 shift_width;
+
+	/* prepare the bits and mask */
+	shift_width = ce_info->lsb % 8;
+	mask = (u8)(BIT(ce_info->width) - 1);
+
+	/* shift to correct alignment */
+	mask <<= shift_width;
+
+	/* get the current bits from the src bit string */
+	src = src_ctx + (ce_info->lsb / 8);
+
+	memcpy(&dest_byte, src, sizeof(dest_byte));
+
+	dest_byte &= mask;
+
+	dest_byte >>= shift_width;
+
+	/* get the address from the struct field */
+	target = dest_ctx + ce_info->offset;
+
+	/* put it back in the struct */
+	memcpy(target, &dest_byte, sizeof(dest_byte));
+}
+
+/**
+ * ice_read_word - read context word into struct
+ * @src_ctx:  the context structure to read from
+ * @dest_ctx: the context to be written to
+ * @ce_info:  a description of the struct to be filled
+ */
+static void
+ice_read_word(u8 *src_ctx, u8 *dest_ctx, const struct ice_ctx_ele *ce_info)
+{
+	u16 dest_word, mask;
+	u8 *src, *target;
+	__le16 src_word;
+	u16 shift_width;
+
+	/* prepare the bits and mask */
+	shift_width = ce_info->lsb % 8;
+	mask = BIT(ce_info->width) - 1;
+
+	/* shift to correct alignment */
+	mask <<= shift_width;
+
+	/* get the current bits from the src bit string */
+	src = src_ctx + (ce_info->lsb / 8);
+
+	memcpy(&src_word, src, sizeof(src_word));
+
+	/* the data in the memory is stored as little endian so mask it
+	 * correctly
+	 */
+	src_word &= cpu_to_le16(mask);
+
+	/* get the data back into host order before shifting */
+	dest_word = le16_to_cpu(src_word);
+
+	dest_word >>= shift_width;
+
+	/* get the address from the struct field */
+	target = dest_ctx + ce_info->offset;
+
+	/* put it back in the struct */
+	memcpy(target, &dest_word, sizeof(dest_word));
+}
+
+/**
+ * ice_read_dword - read context dword into struct
+ * @src_ctx:  the context structure to read from
+ * @dest_ctx: the context to be written to
+ * @ce_info:  a description of the struct to be filled
+ */
+static void
+ice_read_dword(u8 *src_ctx, u8 *dest_ctx, const struct ice_ctx_ele *ce_info)
+{
+	u32 dest_dword, mask;
+	__le32 src_dword;
+	u8 *src, *target;
+	u16 shift_width;
+
+	/* prepare the bits and mask */
+	shift_width = ce_info->lsb % 8;
+
+	/* if the field width is exactly 32 on an x86 machine, then the shift
+	 * operation will not work because the SHL instructions count is masked
+	 * to 5 bits so the shift will do nothing
+	 */
+	if (ce_info->width < 32)
+		mask = BIT(ce_info->width) - 1;
+	else
+		mask = (u32)~0;
+
+	/* shift to correct alignment */
+	mask <<= shift_width;
+
+	/* get the current bits from the src bit string */
+	src = src_ctx + (ce_info->lsb / 8);
+
+	memcpy(&src_dword, src, sizeof(src_dword));
+
+	/* the data in the memory is stored as little endian so mask it
+	 * correctly
+	 */
+	src_dword &= cpu_to_le32(mask);
+
+	/* get the data back into host order before shifting */
+	dest_dword = le32_to_cpu(src_dword);
+
+	dest_dword >>= shift_width;
+
+	/* get the address from the struct field */
+	target = dest_ctx + ce_info->offset;
+
+	/* put it back in the struct */
+	memcpy(target, &dest_dword, sizeof(dest_dword));
+}
+
+/**
+ * ice_read_qword - read context qword into struct
+ * @src_ctx:  the context structure to read from
+ * @dest_ctx: the context to be written to
+ * @ce_info:  a description of the struct to be filled
+ */
+static void
+ice_read_qword(u8 *src_ctx, u8 *dest_ctx, const struct ice_ctx_ele *ce_info)
+{
+	u64 dest_qword, mask;
+	__le64 src_qword;
+	u8 *src, *target;
+	u16 shift_width;
+
+	/* prepare the bits and mask */
+	shift_width = ce_info->lsb % 8;
+
+	/* if the field width is exactly 64 on an x86 machine, then the shift
+	 * operation will not work because the SHL instructions count is masked
+	 * to 6 bits so the shift will do nothing
+	 */
+	if (ce_info->width < 64)
+		mask = BIT_ULL(ce_info->width) - 1;
+	else
+		mask = (u64)~0;
+
+	/* shift to correct alignment */
+	mask <<= shift_width;
+
+	/* get the current bits from the src bit string */
+	src = src_ctx + (ce_info->lsb / 8);
+
+	memcpy(&src_qword, src, sizeof(src_qword));
+
+	/* the data in the memory is stored as little endian so mask it
+	 * correctly
+	 */
+	src_qword &= cpu_to_le64(mask);
+
+	/* get the data back into host order before shifting */
+	dest_qword = le64_to_cpu(src_qword);
+
+	dest_qword >>= shift_width;
+
+	/* get the address from the struct field */
+	target = dest_ctx + ce_info->offset;
+
+	/* put it back in the struct */
+	memcpy(target, &dest_qword, sizeof(dest_qword));
+}
+
+/**
+ * ice_get_ctx - extract context bits from a packed structure
+ * @src_ctx:  pointer to a generic packed context structure
+ * @dest_ctx: pointer to a generic non-packed context structure
+ * @ce_info:  a description of the structure to be read from
+ */
+int
+ice_get_ctx(u8 *src_ctx, u8 *dest_ctx, const struct ice_ctx_ele *ce_info)
+{
+	int i;
+
+	for (i = 0; ce_info[i].width; i++) {
+		switch (ce_info[i].size_of) {
+		case 1:
+			ice_read_byte(src_ctx, dest_ctx, &ce_info[i]);
+			break;
+		case 2:
+			ice_read_word(src_ctx, dest_ctx, &ce_info[i]);
+			break;
+		case 4:
+			ice_read_dword(src_ctx, dest_ctx, &ce_info[i]);
+			break;
+		case 8:
+			ice_read_qword(src_ctx, dest_ctx, &ce_info[i]);
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * ice_get_lan_q_ctx - get the LAN queue context for the given VSI and TC
  * @hw: pointer to the HW struct
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h
index 81961a7d6598..85c477049a34 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -56,6 +56,9 @@ void ice_set_safe_mode_caps(struct ice_hw *hw);
 int
 ice_write_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx *rlan_ctx,
 		  u32 rxq_index);
+int
+ice_read_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx *rlan_ctx,
+		 u32 rxq_index);
 
 int
 ice_aq_get_rss_lut(struct ice_hw *hw, struct ice_aq_get_set_rss_lut_params *get_params);
@@ -75,6 +78,8 @@ extern const struct ice_ctx_ele ice_tlan_ctx_info[];
 int
 ice_set_ctx(struct ice_hw *hw, u8 *src_ctx, u8 *dest_ctx,
 	    const struct ice_ctx_ele *ce_info);
+int
+ice_get_ctx(u8 *src_ctx, u8 *dest_ctx, const struct ice_ctx_ele *ce_info);
 
 extern struct mutex ice_global_cfg_lock_sw;
 
-- 
2.25.1

* [Intel-wired-lan] [PATCH iwl-next V2 03/15] ice: check VF migration status before sending messages to VF
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
  2023-06-21  9:10 ` [Intel-wired-lan] [PATCH iwl-next V2 01/15] ice: Fix missing legacy 32byte RXDID in the supported bitmap Lingyu Liu
  2023-06-21  9:10 ` [Intel-wired-lan] [PATCH iwl-next V2 02/15] ice: add function to get rxq context Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 04/15] ice: add migration init field and helper functions Lingyu Liu
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Add a new VF state, ICE_VF_STATE_REPLAYING_VC. During live migration,
virtual channel commands are replayed to restore the VF device state.
Check this state before the PF sends messages to the VF so that no
responses are sent back to the VF while replaying.

ice_vc_send_msg_to_vf() implies that a message is always sent, but it
actually sends nothing while replaying. Rename ice_vc_send_msg_to_vf()
to ice_vc_respond_to_vf() to describe its behavior correctly.
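
Simplified (the dev_dbg logging is dropped), the gating added to
ice_vc_respond_to_vf() in this patch amounts to:

int
ice_vc_respond_to_vf(struct ice_vf *vf, u32 v_opcode,
		     enum virtchnl_status_code v_retval, u8 *msg, u16 msglen)
{
	if (test_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states)) {
		/* Replay path: the VF is not listening, so never touch
		 * the mailbox; only report replay failures to the caller.
		 */
		return v_retval == VIRTCHNL_STATUS_SUCCESS ? 0 : -EIO;
	}

	return ice_vc_send_response_to_vf(vf, v_opcode, v_retval, msg, msglen);
}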

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_vf_lib.h   |   1 +
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 182 ++++++++++--------
 drivers/net/ethernet/intel/ice/ice_virtchnl.h |   8 +-
 .../ethernet/intel/ice/ice_virtchnl_fdir.c    |  28 +--
 4 files changed, 120 insertions(+), 99 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.h b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
index 67172fdd9bc2..0c8dd7910129 100644
--- a/drivers/net/ethernet/intel/ice/ice_vf_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
@@ -37,6 +37,7 @@ enum ice_vf_states {
 	ICE_VF_STATE_DIS,
 	ICE_VF_STATE_MC_PROMISC,
 	ICE_VF_STATE_UC_PROMISC,
+	ICE_VF_STATE_REPLAYING_VC,
 	ICE_VF_STATES_NBITS
 };
 
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 9d74dcafde60..e14d0c1584d5 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -281,19 +281,10 @@ void ice_vc_notify_reset(struct ice_pf *pf)
 			    (u8 *)&pfe, sizeof(struct virtchnl_pf_event));
 }
 
-/**
- * ice_vc_send_msg_to_vf - Send message to VF
- * @vf: pointer to the VF info
- * @v_opcode: virtual channel opcode
- * @v_retval: virtual channel return value
- * @msg: pointer to the msg buffer
- * @msglen: msg length
- *
- * send msg to VF
- */
-int
-ice_vc_send_msg_to_vf(struct ice_vf *vf, u32 v_opcode,
-		      enum virtchnl_status_code v_retval, u8 *msg, u16 msglen)
+static int
+ice_vc_send_response_to_vf(struct ice_vf *vf, u32 v_opcode,
+			   enum virtchnl_status_code v_retval,
+			   u8 *msg, u16 msglen)
 {
 	struct device *dev;
 	struct ice_pf *pf;
@@ -314,6 +305,37 @@ ice_vc_send_msg_to_vf(struct ice_vf *vf, u32 v_opcode,
 	return 0;
 }
 
+/**
+ * ice_vc_respond_to_vf - Respond to VF
+ * @vf: pointer to the VF info
+ * @v_opcode: virtual channel opcode
+ * @v_retval: virtual channel return value
+ * @msg: pointer to the msg buffer
+ * @msglen: msg length
+ *
+ * Send a response message to the VF, unless virtchnl replay is in progress
+ */
+int
+ice_vc_respond_to_vf(struct ice_vf *vf, u32 v_opcode,
+		     enum virtchnl_status_code v_retval, u8 *msg, u16 msglen)
+{
+	struct device *dev;
+	struct ice_pf *pf = vf->pf;
+
+	dev = ice_pf_to_dev(pf);
+
+	if (test_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states)) {
+		if (v_retval == VIRTCHNL_STATUS_SUCCESS)
+			return 0;
+
+		dev_dbg(dev, "Unable to replay virt channel command, VF ID %d, virtchnl status code %d. op code %d, len %d.\n",
+			vf->vf_id, v_retval, v_opcode, msglen);
+		return -EIO;
+	}
+
+	return ice_vc_send_response_to_vf(vf, v_opcode, v_retval, msg, msglen);
+}
+
 /**
  * ice_vc_get_ver_msg
  * @vf: pointer to the VF info
@@ -332,9 +354,9 @@ static int ice_vc_get_ver_msg(struct ice_vf *vf, u8 *msg)
 	if (VF_IS_V10(&vf->vf_ver))
 		info.minor = VIRTCHNL_VERSION_MINOR_NO_VF_CAPS;
 
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_VERSION,
-				     VIRTCHNL_STATUS_SUCCESS, (u8 *)&info,
-				     sizeof(struct virtchnl_version_info));
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_VERSION,
+				    VIRTCHNL_STATUS_SUCCESS, (u8 *)&info,
+				    sizeof(struct virtchnl_version_info));
 }
 
 /**
@@ -519,8 +541,8 @@ static int ice_vc_get_vf_res_msg(struct ice_vf *vf, u8 *msg)
 
 err:
 	/* send the response back to the VF */
-	ret = ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_GET_VF_RESOURCES, v_ret,
-				    (u8 *)vfres, len);
+	ret = ice_vc_respond_to_vf(vf, VIRTCHNL_OP_GET_VF_RESOURCES, v_ret,
+				   (u8 *)vfres, len);
 
 	kfree(vfres);
 	return ret;
@@ -889,7 +911,7 @@ static int ice_vc_handle_rss_cfg(struct ice_vf *vf, u8 *msg, bool add)
 	}
 
 error_param:
-	return ice_vc_send_msg_to_vf(vf, v_opcode, v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, v_opcode, v_ret, NULL, 0);
 }
 
 /**
@@ -935,8 +957,8 @@ static int ice_vc_config_rss_key(struct ice_vf *vf, u8 *msg)
 	if (ice_set_rss_key(vsi, vrk->key))
 		v_ret = VIRTCHNL_STATUS_ERR_ADMIN_QUEUE_ERROR;
 error_param:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_RSS_KEY, v_ret,
-				     NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_CONFIG_RSS_KEY, v_ret,
+				    NULL, 0);
 }
 
 /**
@@ -981,8 +1003,8 @@ static int ice_vc_config_rss_lut(struct ice_vf *vf, u8 *msg)
 	if (ice_set_rss_lut(vsi, vrl->lut, ICE_LUT_VSI_SIZE))
 		v_ret = VIRTCHNL_STATUS_ERR_ADMIN_QUEUE_ERROR;
 error_param:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_RSS_LUT, v_ret,
-				     NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_CONFIG_RSS_LUT, v_ret,
+				    NULL, 0);
 }
 
 /**
@@ -1121,8 +1143,8 @@ static int ice_vc_cfg_promiscuous_mode_msg(struct ice_vf *vf, u8 *msg)
 	}
 
 error_param:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_PROMISCUOUS_MODE,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_CONFIG_PROMISCUOUS_MODE,
+				    v_ret, NULL, 0);
 }
 
 /**
@@ -1162,8 +1184,8 @@ static int ice_vc_get_stats_msg(struct ice_vf *vf, u8 *msg)
 
 error_param:
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_GET_STATS, v_ret,
-				     (u8 *)&stats, sizeof(stats));
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_GET_STATS, v_ret,
+				    (u8 *)&stats, sizeof(stats));
 }
 
 /**
@@ -1312,8 +1334,8 @@ static int ice_vc_ena_qs_msg(struct ice_vf *vf, u8 *msg)
 
 error_param:
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ENABLE_QUEUES, v_ret,
-				     NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ENABLE_QUEUES, v_ret,
+				    NULL, 0);
 }
 
 /**
@@ -1452,8 +1474,8 @@ static int ice_vc_dis_qs_msg(struct ice_vf *vf, u8 *msg)
 
 error_param:
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DISABLE_QUEUES, v_ret,
-				     NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DISABLE_QUEUES, v_ret,
+				    NULL, 0);
 }
 
 /**
@@ -1583,8 +1605,8 @@ static int ice_vc_cfg_irq_map_msg(struct ice_vf *vf, u8 *msg)
 
 error_param:
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_IRQ_MAP, v_ret,
-				     NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_CONFIG_IRQ_MAP, v_ret,
+				    NULL, 0);
 }
 
 /**
@@ -1711,8 +1733,8 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg)
 	}
 
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES,
-				     VIRTCHNL_STATUS_SUCCESS, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES,
+				    VIRTCHNL_STATUS_SUCCESS, NULL, 0);
 error_param:
 	/* disable whatever we can */
 	for (; i >= 0; i--) {
@@ -1725,8 +1747,8 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg)
 	}
 
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES,
-				     VIRTCHNL_STATUS_ERR_PARAM, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES,
+				    VIRTCHNL_STATUS_ERR_PARAM, NULL, 0);
 }
 
 /**
@@ -2028,7 +2050,7 @@ ice_vc_handle_mac_addr_msg(struct ice_vf *vf, u8 *msg, bool set)
 
 handle_mac_exit:
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, vc_op, v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, vc_op, v_ret, NULL, 0);
 }
 
 /**
@@ -2111,8 +2133,8 @@ static int ice_vc_request_qs_msg(struct ice_vf *vf, u8 *msg)
 
 error_param:
 	/* send the response to the VF */
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_REQUEST_QUEUES,
-				     v_ret, (u8 *)vfres, sizeof(*vfres));
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_REQUEST_QUEUES,
+				    v_ret, (u8 *)vfres, sizeof(*vfres));
 }
 
 /**
@@ -2377,11 +2399,11 @@ static int ice_vc_process_vlan_msg(struct ice_vf *vf, u8 *msg, bool add_v)
 error_param:
 	/* send the response to the VF */
 	if (add_v)
-		return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ADD_VLAN, v_ret,
-					     NULL, 0);
+		return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ADD_VLAN, v_ret,
+					    NULL, 0);
 	else
-		return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DEL_VLAN, v_ret,
-					     NULL, 0);
+		return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DEL_VLAN, v_ret,
+					    NULL, 0);
 }
 
 /**
@@ -2439,8 +2461,8 @@ static int ice_vc_ena_vlan_stripping(struct ice_vf *vf)
 		v_ret = VIRTCHNL_STATUS_ERR_PARAM;
 
 error_param:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ENABLE_VLAN_STRIPPING,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ENABLE_VLAN_STRIPPING,
+				    v_ret, NULL, 0);
 }
 
 /**
@@ -2474,8 +2496,8 @@ static int ice_vc_dis_vlan_stripping(struct ice_vf *vf)
 		v_ret = VIRTCHNL_STATUS_ERR_PARAM;
 
 error_param:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DISABLE_VLAN_STRIPPING,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DISABLE_VLAN_STRIPPING,
+				    v_ret, NULL, 0);
 }
 
 /**
@@ -2510,8 +2532,8 @@ static int ice_vc_get_rss_hena(struct ice_vf *vf)
 	vrh->hena = ICE_DEFAULT_RSS_HENA;
 err:
 	/* send the response back to the VF */
-	ret = ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_GET_RSS_HENA_CAPS, v_ret,
-				    (u8 *)vrh, len);
+	ret = ice_vc_respond_to_vf(vf, VIRTCHNL_OP_GET_RSS_HENA_CAPS, v_ret,
+				   (u8 *)vrh, len);
 	kfree(vrh);
 	return ret;
 }
@@ -2576,8 +2598,8 @@ static int ice_vc_set_rss_hena(struct ice_vf *vf, u8 *msg)
 
 	/* send the response to the VF */
 err:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_SET_RSS_HENA, v_ret,
-				     NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_SET_RSS_HENA, v_ret,
+				    NULL, 0);
 }
 
 /**
@@ -2633,8 +2655,8 @@ static int ice_vc_query_rxdid(struct ice_vf *vf)
 	pf->supported_rxdids = rxdid->supported_rxdids;
 
 err:
-	ret = ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_GET_SUPPORTED_RXDIDS,
-				    v_ret, (u8 *)rxdid, len);
+	ret = ice_vc_respond_to_vf(vf, VIRTCHNL_OP_GET_SUPPORTED_RXDIDS,
+				   v_ret, (u8 *)rxdid, len);
 	kfree(rxdid);
 	return ret;
 }
@@ -2862,8 +2884,8 @@ static int ice_vc_get_offload_vlan_v2_caps(struct ice_vf *vf)
 	memcpy(&vf->vlan_v2_caps, caps, sizeof(*caps));
 
 out:
-	err = ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS,
-				    v_ret, (u8 *)caps, len);
+	err = ice_vc_respond_to_vf(vf, VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS,
+				   v_ret, (u8 *)caps, len);
 	kfree(caps);
 	return err;
 }
@@ -3104,8 +3126,7 @@ static int ice_vc_remove_vlan_v2_msg(struct ice_vf *vf, u8 *msg)
 		v_ret = VIRTCHNL_STATUS_ERR_PARAM;
 
 out:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DEL_VLAN_V2, v_ret, NULL,
-				     0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DEL_VLAN_V2, v_ret, NULL, 0);
 }
 
 /**
@@ -3246,8 +3267,7 @@ static int ice_vc_add_vlan_v2_msg(struct ice_vf *vf, u8 *msg)
 		v_ret = VIRTCHNL_STATUS_ERR_PARAM;
 
 out:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ADD_VLAN_V2, v_ret, NULL,
-				     0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ADD_VLAN_V2, v_ret, NULL, 0);
 }
 
 /**
@@ -3468,8 +3488,8 @@ static int ice_vc_ena_vlan_stripping_v2_msg(struct ice_vf *vf, u8 *msg)
 	}
 
 out:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ENABLE_VLAN_STRIPPING_V2,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ENABLE_VLAN_STRIPPING_V2,
+				    v_ret, NULL, 0);
 }
 
 /**
@@ -3538,8 +3558,8 @@ static int ice_vc_dis_vlan_stripping_v2_msg(struct ice_vf *vf, u8 *msg)
 	}
 
 out:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DISABLE_VLAN_STRIPPING_V2,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DISABLE_VLAN_STRIPPING_V2,
+				    v_ret, NULL, 0);
 }
 
 /**
@@ -3597,8 +3617,8 @@ static int ice_vc_ena_vlan_insertion_v2_msg(struct ice_vf *vf, u8 *msg)
 	}
 
 out:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ENABLE_VLAN_INSERTION_V2,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ENABLE_VLAN_INSERTION_V2,
+				    v_ret, NULL, 0);
 }
 
 /**
@@ -3652,8 +3672,8 @@ static int ice_vc_dis_vlan_insertion_v2_msg(struct ice_vf *vf, u8 *msg)
 	}
 
 out:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DISABLE_VLAN_INSERTION_V2,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DISABLE_VLAN_INSERTION_V2,
+				    v_ret, NULL, 0);
 }
 
 static const struct ice_virtchnl_ops ice_virtchnl_dflt_ops = {
@@ -3750,8 +3770,8 @@ static int ice_vc_repr_add_mac(struct ice_vf *vf, u8 *msg)
 	}
 
 handle_mac_exit:
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ADD_ETH_ADDR,
-				     v_ret, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ADD_ETH_ADDR,
+				    v_ret, NULL, 0);
 }
 
 /**
@@ -3770,8 +3790,8 @@ ice_vc_repr_del_mac(struct ice_vf __always_unused *vf, u8 __always_unused *msg)
 
 	ice_update_legacy_cached_mac(vf, &al->list[0]);
 
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DEL_ETH_ADDR,
-				     VIRTCHNL_STATUS_SUCCESS, NULL, 0);
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DEL_ETH_ADDR,
+				    VIRTCHNL_STATUS_SUCCESS, NULL, 0);
 }
 
 static int
@@ -3780,8 +3800,8 @@ ice_vc_repr_cfg_promiscuous_mode(struct ice_vf *vf, u8 __always_unused *msg)
 	dev_dbg(ice_pf_to_dev(vf->pf),
 		"Can't config promiscuous mode in switchdev mode for VF %d\n",
 		vf->vf_id);
-	return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_PROMISCUOUS_MODE,
-				     VIRTCHNL_STATUS_ERR_NOT_SUPPORTED,
+	return ice_vc_respond_to_vf(vf, VIRTCHNL_OP_CONFIG_PROMISCUOUS_MODE,
+				    VIRTCHNL_STATUS_ERR_NOT_SUPPORTED,
 				     NULL, 0);
 }
 
@@ -3924,16 +3944,16 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 
 error_handler:
 	if (err) {
-		ice_vc_send_msg_to_vf(vf, v_opcode, VIRTCHNL_STATUS_ERR_PARAM,
-				      NULL, 0);
+		ice_vc_respond_to_vf(vf, v_opcode, VIRTCHNL_STATUS_ERR_PARAM,
+				     NULL, 0);
 		dev_err(dev, "Invalid message from VF %d, opcode %d, len %d, error %d\n",
 			vf_id, v_opcode, msglen, err);
 		goto finish;
 	}
 
 	if (!ice_vc_is_opcode_allowed(vf, v_opcode)) {
-		ice_vc_send_msg_to_vf(vf, v_opcode,
-				      VIRTCHNL_STATUS_ERR_NOT_SUPPORTED, NULL,
+		ice_vc_respond_to_vf(vf, v_opcode,
+				     VIRTCHNL_STATUS_ERR_NOT_SUPPORTED, NULL,
 				      0);
 		goto finish;
 	}
@@ -4045,9 +4065,9 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 	default:
 		dev_err(dev, "Unsupported opcode %d from VF %d\n", v_opcode,
 			vf_id);
-		err = ice_vc_send_msg_to_vf(vf, v_opcode,
-					    VIRTCHNL_STATUS_ERR_NOT_SUPPORTED,
-					    NULL, 0);
+		err = ice_vc_respond_to_vf(vf, v_opcode,
+					   VIRTCHNL_STATUS_ERR_NOT_SUPPORTED,
+					   NULL, 0);
 		break;
 	}
 	if (err) {
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.h b/drivers/net/ethernet/intel/ice/ice_virtchnl.h
index cd747718de73..a2b6094e2f2f 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.h
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.h
@@ -60,8 +60,8 @@ void ice_vc_notify_vf_link_state(struct ice_vf *vf);
 void ice_vc_notify_link_state(struct ice_pf *pf);
 void ice_vc_notify_reset(struct ice_pf *pf);
 int
-ice_vc_send_msg_to_vf(struct ice_vf *vf, u32 v_opcode,
-		      enum virtchnl_status_code v_retval, u8 *msg, u16 msglen);
+ice_vc_respond_to_vf(struct ice_vf *vf, u32 v_opcode,
+		     enum virtchnl_status_code v_retval, u8 *msg, u16 msglen);
 bool ice_vc_isvalid_vsi_id(struct ice_vf *vf, u16 vsi_id);
 void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 			   struct ice_mbx_data *mbxdata);
@@ -73,8 +73,8 @@ static inline void ice_vc_notify_link_state(struct ice_pf *pf) { }
 static inline void ice_vc_notify_reset(struct ice_pf *pf) { }
 
 static inline int
-ice_vc_send_msg_to_vf(struct ice_vf *vf, u32 v_opcode,
-		      enum virtchnl_status_code v_retval, u8 *msg, u16 msglen)
+ice_vc_respond_to_vf(struct ice_vf *vf, u32 v_opcode,
+		     enum virtchnl_status_code v_retval, u8 *msg, u16 msglen)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c b/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c
index daa6a1e894cf..bf6c24901cb0 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c
@@ -1571,8 +1571,8 @@ ice_vc_add_fdir_fltr_post(struct ice_vf *vf, struct ice_vf_fdir_ctx *ctx,
 	resp->flow_id = conf->flow_id;
 	vf->fdir.fdir_fltr_cnt[conf->input.flow_type][is_tun]++;
 
-	ret = ice_vc_send_msg_to_vf(vf, ctx->v_opcode, v_ret,
-				    (u8 *)resp, len);
+	ret = ice_vc_respond_to_vf(vf, ctx->v_opcode, v_ret,
+				   (u8 *)resp, len);
 	kfree(resp);
 
 	dev_dbg(dev, "VF %d: flow_id:0x%X, FDIR %s success!\n",
@@ -1587,8 +1587,8 @@ ice_vc_add_fdir_fltr_post(struct ice_vf *vf, struct ice_vf_fdir_ctx *ctx,
 	ice_vc_fdir_remove_entry(vf, conf, conf->flow_id);
 	devm_kfree(dev, conf);
 
-	ret = ice_vc_send_msg_to_vf(vf, ctx->v_opcode, v_ret,
-				    (u8 *)resp, len);
+	ret = ice_vc_respond_to_vf(vf, ctx->v_opcode, v_ret,
+				   (u8 *)resp, len);
 	kfree(resp);
 	return ret;
 }
@@ -1635,8 +1635,8 @@ ice_vc_del_fdir_fltr_post(struct ice_vf *vf, struct ice_vf_fdir_ctx *ctx,
 	ice_vc_fdir_remove_entry(vf, conf, conf->flow_id);
 	vf->fdir.fdir_fltr_cnt[conf->input.flow_type][is_tun]--;
 
-	ret = ice_vc_send_msg_to_vf(vf, ctx->v_opcode, v_ret,
-				    (u8 *)resp, len);
+	ret = ice_vc_respond_to_vf(vf, ctx->v_opcode, v_ret,
+				   (u8 *)resp, len);
 	kfree(resp);
 
 	dev_dbg(dev, "VF %d: flow_id:0x%X, FDIR %s success!\n",
@@ -1652,8 +1652,8 @@ ice_vc_del_fdir_fltr_post(struct ice_vf *vf, struct ice_vf_fdir_ctx *ctx,
 	if (success)
 		devm_kfree(dev, conf);
 
-	ret = ice_vc_send_msg_to_vf(vf, ctx->v_opcode, v_ret,
-				    (u8 *)resp, len);
+	ret = ice_vc_respond_to_vf(vf, ctx->v_opcode, v_ret,
+				   (u8 *)resp, len);
 	kfree(resp);
 	return ret;
 }
@@ -1850,8 +1850,8 @@ int ice_vc_add_fdir_fltr(struct ice_vf *vf, u8 *msg)
 		v_ret = VIRTCHNL_STATUS_SUCCESS;
 		stat->status = VIRTCHNL_FDIR_SUCCESS;
 		devm_kfree(dev, conf);
-		ret = ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ADD_FDIR_FILTER,
-					    v_ret, (u8 *)stat, len);
+		ret = ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ADD_FDIR_FILTER,
+					   v_ret, (u8 *)stat, len);
 		goto exit;
 	}
 
@@ -1909,8 +1909,8 @@ int ice_vc_add_fdir_fltr(struct ice_vf *vf, u8 *msg)
 err_free_conf:
 	devm_kfree(dev, conf);
 err_exit:
-	ret = ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_ADD_FDIR_FILTER, v_ret,
-				    (u8 *)stat, len);
+	ret = ice_vc_respond_to_vf(vf, VIRTCHNL_OP_ADD_FDIR_FILTER, v_ret,
+				   (u8 *)stat, len);
 	kfree(stat);
 	return ret;
 }
@@ -1993,8 +1993,8 @@ int ice_vc_del_fdir_fltr(struct ice_vf *vf, u8 *msg)
 err_del_tmr:
 	ice_vc_fdir_clear_irq_ctx(vf);
 err_exit:
-	ret = ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_DEL_FDIR_FILTER, v_ret,
-				    (u8 *)stat, len);
+	ret = ice_vc_respond_to_vf(vf, VIRTCHNL_OP_DEL_FDIR_FILTER, v_ret,
+				   (u8 *)stat, len);
 	kfree(stat);
 	return ret;
 }
-- 
2.25.1

* [Intel-wired-lan] [PATCH iwl-next V2 04/15] ice: add migration init field and helper functions
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (2 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 03/15] ice: check VF migration status before sending messages to VF Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21 13:35   ` Jason Gunthorpe
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 05/15] ice: save VF messages as device state Lingyu Liu
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Add a function to get the ice VF structure from a PCI device.
Add a field in the VF structure to indicate the migration init state,
along with functions to init and uninit migration.

These will be used by the ice_vfio_pci driver introduced in later
patches of this series.
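
A hedged sketch of how the vfio-side driver added later in this series
might consume these helpers. The vfio-side function names are
illustrative; only the ice_migration_*() calls come from this patch.

/* Illustrative bind/unbind flow in the (later) ice_vfio_pci driver. */
static int ice_vfio_migration_bind(struct pci_dev *vf_pdev, void **vf_handle)
{
	void *vf;

	vf = ice_migration_get_vf(vf_pdev);	/* takes a VF reference */
	if (!vf)
		return -ENODEV;

	ice_migration_init_vf(vf);		/* start tracking device state */
	*vf_handle = vf;
	return 0;
}

static void ice_vfio_migration_unbind(void *vf_handle)
{
	ice_migration_uninit_vf(vf_handle);
	ice_migration_put_vf(vf_handle);	/* drop the VF reference */
}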

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
 drivers/net/ethernet/intel/ice/Makefile       |  1 +
 drivers/net/ethernet/intel/ice/ice.h          |  1 +
 .../net/ethernet/intel/ice/ice_migration.c    | 68 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_vf_lib.c   |  7 ++
 drivers/net/ethernet/intel/ice/ice_vf_lib.h   |  1 +
 include/linux/net/intel/ice_migration.h       | 26 +++++++
 6 files changed, 104 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_migration.c
 create mode 100644 include/linux/net/intel/ice_migration.h

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index 960277d78e09..915b70588f79 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -49,3 +49,4 @@ ice-$(CONFIG_RFS_ACCEL) += ice_arfs.o
 ice-$(CONFIG_XDP_SOCKETS) += ice_xsk.o
 ice-$(CONFIG_ICE_SWITCHDEV) += ice_eswitch.o ice_eswitch_br.o
 ice-$(CONFIG_GNSS) += ice_gnss.o
+ice-$(CONFIG_ICE_VFIO_PCI) += ice_migration.o
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 9109006336f0..ec7f27d93924 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -55,6 +55,7 @@
 #include <net/vxlan.h>
 #include <net/gtp.h>
 #include <linux/ppp_defs.h>
+#include <linux/net/intel/ice_migration.h>
 #include "ice_devids.h"
 #include "ice_type.h"
 #include "ice_txrx.h"
diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
new file mode 100644
index 000000000000..1aadb8577a41
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2018-2023 Intel Corporation */
+
+#include "ice.h"
+
+/**
+ * ice_migration_get_vf - Get ice VF structure pointer by pdev
+ * @vf_pdev: pointer to ice vfio pci VF pdev structure
+ *
+ * Return non-NULL VF pointer for success, NULL for failure.
+ *
+ * ice_put_vf() should be called after finishing accessing VF
+ */
+void *ice_migration_get_vf(struct pci_dev *vf_pdev)
+{
+	struct pci_dev *pf_pdev = vf_pdev->physfn;
+	int vf_id = pci_iov_vf_id(vf_pdev);
+	struct ice_pf *pf;
+
+	if (!pf_pdev || vf_id < 0)
+		return NULL;
+
+	pf = pci_get_drvdata(pf_pdev);
+	return ice_get_vf_by_id(pf, vf_id);
+}
+EXPORT_SYMBOL(ice_migration_get_vf);
+
+/**
+ * ice_migration_put_vf - Release a VF reference
+ * @opaque: pointer to VF handler in ice vdev
+ *
+ * This must be called after ice_migration_get_vf() once the reference
+ * to the VF is no longer needed.
+ */
+void ice_migration_put_vf(void *opaque)
+{
+	struct ice_vf *vf = (struct ice_vf *)opaque;
+
+	ice_put_vf(vf);
+}
+EXPORT_SYMBOL(ice_migration_put_vf);
+
+/**
+ * ice_migration_init_vf - Init ice VF device state data
+ * @opaque: pointer to VF handler in ice vdev
+ */
+void ice_migration_init_vf(void *opaque)
+{
+	struct ice_vf *vf = (struct ice_vf *)opaque;
+
+	vf->migration_active = true;
+}
+EXPORT_SYMBOL(ice_migration_init_vf);
+
+/**
+ * ice_migration_uninit_vf - uninit VF device state data
+ * @opaque: pointer to VF handler in ice vdev
+ */
+void ice_migration_uninit_vf(void *opaque)
+{
+	struct ice_vf *vf = (struct ice_vf *)opaque;
+
+	if (!vf->migration_active)
+		return;
+
+	vf->migration_active = false;
+}
+EXPORT_SYMBOL(ice_migration_uninit_vf);
diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethernet/intel/ice/ice_vf_lib.c
index b26ce4425f45..4b1940487b27 100644
--- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c
@@ -56,6 +56,9 @@ static void ice_release_vf(struct kref *ref)
 {
 	struct ice_vf *vf = container_of(ref, struct ice_vf, refcnt);
 
+	if (vf->migration_active)
+		ice_migration_uninit_vf(vf);
+
 	vf->vf_ops->free(vf);
 }
 
@@ -260,6 +263,10 @@ static void ice_vf_pre_vsi_rebuild(struct ice_vf *vf)
 	if (vf->vf_ops->irq_close)
 		vf->vf_ops->irq_close(vf);
 
+	if (vf->migration_active) {
+		ice_migration_uninit_vf(vf);
+		ice_migration_init_vf(vf);
+	}
 	ice_vf_clear_counters(vf);
 	vf->vf_ops->clear_reset_trigger(vf);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.h b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
index 0c8dd7910129..cbff9b5aacd6 100644
--- a/drivers/net/ethernet/intel/ice/ice_vf_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
@@ -134,6 +134,7 @@ struct ice_vf {
 
 	/* devlink port data */
 	struct devlink_port devlink_port;
+	bool migration_active;
 };
 
 /* Flags for controlling behavior of ice_reset_vf */
diff --git a/include/linux/net/intel/ice_migration.h b/include/linux/net/intel/ice_migration.h
new file mode 100644
index 000000000000..5f1c765ed582
--- /dev/null
+++ b/include/linux/net/intel/ice_migration.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2018-2023 Intel Corporation */
+
+#ifndef _ICE_MIGRATION_H_
+#define _ICE_MIGRATION_H_
+
+#if IS_ENABLED(CONFIG_ICE_VFIO_PCI)
+void *ice_migration_get_vf(struct pci_dev *vf_pdev);
+void ice_migration_put_vf(void *opaque);
+void ice_migration_init_vf(void *opaque);
+void ice_migration_uninit_vf(void *opaque);
+
+#else
+static inline void *ice_migration_get_vf(struct pci_dev *vf_pdev)
+{
+	return NULL;
+}
+static inline void ice_migration_put_vf(void *opaque)
+{
+}
+
+static inline void ice_migration_init_vf(void *opaque) { }
+static inline void ice_migration_uninit_vf(void *opaque) { }
+#endif /* CONFIG_ICE_VFIO_PCI */
+
+#endif /* _ICE_MIGRATION_H_ */
-- 
2.25.1

* [Intel-wired-lan] [PATCH iwl-next V2 05/15] ice: save VF messages as device state
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (3 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 04/15] ice: add migration init field and helper functions Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 06/15] ice: save and restore " Lingyu Liu
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Add a list to the VF structure to save the virtual channel
messages sent by the VF.

The ice_vfio_pci driver introduced in later patches of this series
needs the saved virtual channel list to restore the VF configuration.
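
As a debugging-oriented sketch (not part of this patch), the saved list
can be walked inside ice_migration.c with the structures added here,
for example to log what would later be replayed on the destination:

/* Sketch: dump the virtchnl commands tracked for a VF. */
static void ice_migration_dump_saved_msgs(struct ice_vf *vf)
{
	struct ice_migration_virtchnl_msg_listnode *n;

	list_for_each_entry(n, &vf->virtchnl_msg_list, node)
		dev_dbg(ice_pf_to_dev(vf->pf),
			"VF %d saved op %u, len %u\n",
			vf->vf_id, n->msg_slot.opcode, n->msg_slot.msg_len);
}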

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h          |  1 +
 .../net/ethernet/intel/ice/ice_migration.c    | 94 +++++++++++++++++++
 .../intel/ice/ice_migration_private.h         | 22 +++++
 drivers/net/ethernet/intel/ice/ice_vf_lib.h   |  3 +
 drivers/net/ethernet/intel/ice/ice_virtchnl.c |  3 +
 5 files changed, 123 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_migration_private.h

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index ec7f27d93924..d264a17e0d95 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -77,6 +77,7 @@
 #include "ice_vsi_vlan_ops.h"
 #include "ice_gnss.h"
 #include "ice_irq.h"
+#include "ice_migration_private.h"
 
 #define ICE_BAR0		0
 #define ICE_REQ_DESC_MULTIPLE	32
diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index 1aadb8577a41..6f658daf89a5 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -3,6 +3,17 @@
 
 #include "ice.h"
 
+struct ice_migration_virtchnl_msg_slot {
+	u32 opcode;
+	u16 msg_len;
+	char msg_buffer[];
+};
+
+struct ice_migration_virtchnl_msg_listnode {
+	struct list_head node;
+	struct ice_migration_virtchnl_msg_slot msg_slot;
+};
+
 /**
  * ice_migration_get_vf - Get ice VF structure pointer by pdev
  * @vf_pdev: pointer to ice vfio pci VF pdev structure
@@ -49,6 +60,8 @@ void ice_migration_init_vf(void *opaque)
 	struct ice_vf *vf = (struct ice_vf *)opaque;
 
 	vf->migration_active = true;
+	INIT_LIST_HEAD(&vf->virtchnl_msg_list);
+	vf->virtchnl_msg_num = 0;
 }
 EXPORT_SYMBOL(ice_migration_init_vf);
 
@@ -58,11 +71,92 @@ EXPORT_SYMBOL(ice_migration_init_vf);
  */
 void ice_migration_uninit_vf(void *opaque)
 {
+	struct ice_migration_virtchnl_msg_listnode *msg_listnode;
+	struct ice_migration_virtchnl_msg_listnode *dtmp;
 	struct ice_vf *vf = (struct ice_vf *)opaque;
 
 	if (!vf->migration_active)
 		return;
 
 	vf->migration_active = false;
+
+	if (list_empty(&vf->virtchnl_msg_list))
+		return;
+	list_for_each_entry_safe(msg_listnode, dtmp,
+				 &vf->virtchnl_msg_list,
+				 node) {
+		list_del(&msg_listnode->node);
+		kfree(msg_listnode);
+	}
+	vf->virtchnl_msg_num = 0;
 }
 EXPORT_SYMBOL(ice_migration_uninit_vf);
+
+/**
+ * ice_migration_save_vf_msg - Save request message from VF
+ * @vf: pointer to the VF structure
+ * @event: pointer to the AQ event
+ *
+ * save VF message for later restore during live migration
+ */
+void ice_migration_save_vf_msg(struct ice_vf *vf,
+			       struct ice_rq_event_info *event)
+{
+	struct ice_migration_virtchnl_msg_listnode *msg_listnode;
+	u32 v_opcode = le32_to_cpu(event->desc.cookie_high);
+	u16 msglen = event->msg_len;
+	u8 *msg = event->msg_buf;
+	struct device *dev;
+	struct ice_pf *pf;
+
+	pf = vf->pf;
+	dev = ice_pf_to_dev(pf);
+
+	if (!vf->migration_active)
+		return;
+
+	switch (v_opcode) {
+	case VIRTCHNL_OP_VERSION:
+	case VIRTCHNL_OP_GET_VF_RESOURCES:
+	case VIRTCHNL_OP_CONFIG_VSI_QUEUES:
+	case VIRTCHNL_OP_CONFIG_IRQ_MAP:
+	case VIRTCHNL_OP_ADD_ETH_ADDR:
+	case VIRTCHNL_OP_DEL_ETH_ADDR:
+	case VIRTCHNL_OP_CONFIG_PROMISCUOUS_MODE:
+	case VIRTCHNL_OP_ENABLE_QUEUES:
+	case VIRTCHNL_OP_DISABLE_QUEUES:
+	case VIRTCHNL_OP_ADD_VLAN:
+	case VIRTCHNL_OP_DEL_VLAN:
+	case VIRTCHNL_OP_ENABLE_VLAN_STRIPPING:
+	case VIRTCHNL_OP_DISABLE_VLAN_STRIPPING:
+	case VIRTCHNL_OP_CONFIG_RSS_KEY:
+	case VIRTCHNL_OP_CONFIG_RSS_LUT:
+	case VIRTCHNL_OP_GET_SUPPORTED_RXDIDS:
+		if (vf->virtchnl_msg_num >= VIRTCHNL_MSG_MAX) {
+			dev_warn(dev, "VF %d has maximum number virtual channel commands\n",
+				 vf->vf_id);
+			return;
+		}
+
+		msg_listnode = (struct ice_migration_virtchnl_msg_listnode *)
+				kzalloc(struct_size(msg_listnode,
+						    msg_slot.msg_buffer,
+						    msglen),
+					GFP_KERNEL);
+		if (!msg_listnode) {
+			dev_err(dev, "VF %d failed to allocate memory for msg listnode\n",
+				vf->vf_id);
+			return;
+		}
+		dev_dbg(dev, "VF %d save virtual channel command, op code: %d, len: %d\n",
+			vf->vf_id, v_opcode, msglen);
+		msg_listnode->msg_slot.opcode = v_opcode;
+		msg_listnode->msg_slot.msg_len = msglen;
+		memcpy(msg_listnode->msg_slot.msg_buffer, msg, msglen);
+		list_add_tail(&msg_listnode->node, &vf->virtchnl_msg_list);
+		vf->virtchnl_msg_num++;
+		break;
+	default:
+		break;
+	}
+}
diff --git a/drivers/net/ethernet/intel/ice/ice_migration_private.h b/drivers/net/ethernet/intel/ice/ice_migration_private.h
new file mode 100644
index 000000000000..4773fbc6b099
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/ice_migration_private.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2018-2023 Intel Corporation */
+
+#ifndef _ICE_MIGRATION_PRIVATE_H_
+#define _ICE_MIGRATION_PRIVATE_H_
+
+/* This header file is for exposing functions in ice_migration.c to
+ * files which will be compiled in ice.ko.
+ * Functions which may be used by other files which will be compiled
+ * in ice-vfio-pci.ko should be exposed as part of ice_migration.h.
+ */
+
+#if IS_ENABLED(CONFIG_ICE_VFIO_PCI)
+void ice_migration_save_vf_msg(struct ice_vf *vf,
+			       struct ice_rq_event_info *event);
+#else
+static inline void
+ice_migration_save_vf_msg(struct ice_vf *vf,
+			  struct ice_rq_event_info *event) { }
+#endif /* CONFIG_ICE_VFIO_PCI */
+
+#endif /* _ICE_MIGRATION_PRIVATE_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.h b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
index cbff9b5aacd6..b77daa7d310c 100644
--- a/drivers/net/ethernet/intel/ice/ice_vf_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
@@ -77,6 +77,7 @@ struct ice_vfs {
 	unsigned long last_printed_mdd_jiffies;	/* MDD message rate limit */
 };
 
+#define VIRTCHNL_MSG_MAX 1000
 /* VF information structure */
 struct ice_vf {
 	struct hlist_node entry;
@@ -135,6 +136,8 @@ struct ice_vf {
 	/* devlink port data */
 	struct devlink_port devlink_port;
 	bool migration_active;
+	struct list_head virtchnl_msg_list;
+	u64 virtchnl_msg_num;
 };
 
 /* Flags for controlling behavior of ice_reset_vf */
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index e14d0c1584d5..9eb2ff5c10f1 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -4078,6 +4078,9 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 			 vf_id, v_opcode, err);
 	}
 
+	if (!err && vf->migration_active)
+		ice_migration_save_vf_msg(vf, event);
+
 finish:
 	mutex_unlock(&vf->cfg_lock);
 	ice_put_vf(vf);
-- 
2.25.1

* [Intel-wired-lan] [PATCH iwl-next V2 06/15] ice: save and restore device state
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (4 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 05/15] ice: save VF messages as device state Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 07/15] ice: do not notify VF link state during migration Lingyu Liu
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Add functions to save and restore the VF device state.

The ice_vfio_pci driver introduced in later patches of this series
needs these exported functions to save and restore device state.
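
A hedged sketch of the intended call flow from the vfio side. The real
caller and its buffer management arrive later in this series; the 1 MiB
buffer size below is only an assumption for illustration.

#define ASSUMED_MIGRATION_BUF_SZ	SZ_1M	/* illustrative only */

/* Source side, STOP_COPY: serialize the tracked device state. */
static int ice_vfio_save(void *vf_handle, u8 *buf)
{
	return ice_migration_save_devstate(vf_handle, buf,
					   ASSUMED_MIGRATION_BUF_SZ);
}

/* Destination side, RESUMING: replay the serialized state. */
static int ice_vfio_restore(void *vf_handle, const u8 *buf)
{
	return ice_migration_restore_devstate(vf_handle, buf,
					      ASSUMED_MIGRATION_BUF_SZ);
}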

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
 .../net/ethernet/intel/ice/ice_migration.c    | 125 ++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_virtchnl.c |  26 +++-
 drivers/net/ethernet/intel/ice/ice_virtchnl.h |   7 +-
 include/linux/net/intel/ice_migration.h       |  12 ++
 4 files changed, 161 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index 6f658daf89a5..49ad3c252f03 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -160,3 +160,128 @@ void ice_migration_save_vf_msg(struct ice_vf *vf,
 		break;
 	}
 }
+
+/**
+ * ice_migration_save_devstate - save VF msg to migration buffer
+ * @opaque: pointer to VF handler in ice vdev
+ * @buf: pointer to VF msg in migration buffer
+ * @buf_sz: size of migration buffer
+ *
+ * Return 0 for success, negative for error
+ */
+int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
+{
+	struct ice_migration_virtchnl_msg_listnode *msg_listnode;
+	struct ice_migration_virtchnl_msg_slot *last_op;
+	struct ice_vf *vf = (struct ice_vf *)opaque;
+	struct device *dev;
+	u64 total_sz = 0;
+
+	if (!vf)
+		return -EINVAL;
+	dev = ice_pf_to_dev(vf->pf);
+	list_for_each_entry(msg_listnode, &vf->virtchnl_msg_list, node) {
+		struct ice_migration_virtchnl_msg_slot *msg_slot;
+		u64 slot_size;
+
+		msg_slot = &msg_listnode->msg_slot;
+		slot_size = struct_size(msg_slot, msg_buffer,
+					msg_slot->msg_len);
+		total_sz += slot_size;
+		if (total_sz > buf_sz) {
+			dev_err(dev, "Insufficient buffer to store virtchnl message for VF %d op: %d, len: %d\n",
+				vf->vf_id, msg_slot->opcode, msg_slot->msg_len);
+			return -ENOBUFS;
+		}
+		dev_dbg(dev, "VF %d copy virtchnl message to migration buffer op: %d, len: %d\n",
+			vf->vf_id, msg_slot->opcode, msg_slot->msg_len);
+		memcpy(buf, msg_slot, slot_size);
+		buf += slot_size;
+	}
+	/* reserve space to mark end of vf messages */
+	total_sz += sizeof(struct ice_migration_virtchnl_msg_slot);
+	if (total_sz > buf_sz) {
+		dev_err(dev, "Insufficient buffer to store virtchnl message for VF %d\n",
+			vf->vf_id);
+		return -ENOBUFS;
+	}
+
+	/* use op code unknown to mark end of vc messages */
+	last_op = (struct ice_migration_virtchnl_msg_slot *)buf;
+	last_op->opcode = VIRTCHNL_OP_UNKNOWN;
+	return 0;
+}
+EXPORT_SYMBOL(ice_migration_save_devstate);
+
+/**
+ * ice_migration_restore_devstate - restore device state at dst
+ * @opaque: pointer to VF handler in ice vdev
+ * @buf: pointer to device state buf in migration buffer
+ * @buf_sz: size of migration buffer
+ *
+ * This function uses the device state saved in migration buffer
+ * to restore device state at dst VM
+ *
+ * Return 0 for success, negative for error
+ */
+int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
+{
+	struct ice_migration_virtchnl_msg_slot *msg_slot;
+	struct ice_vf *vf = (struct ice_vf *)opaque;
+	struct device *dev = ice_pf_to_dev(vf->pf);
+	struct ice_rq_event_info event;
+	u64 total_sz = 0;
+	u64 op_msglen_sz;
+	u64 slot_sz;
+	int ret = 0;
+
+	if (!buf)
+		return -EINVAL;
+
+	msg_slot = (struct ice_migration_virtchnl_msg_slot *)buf;
+	op_msglen_sz = sizeof(struct ice_migration_virtchnl_msg_slot);
+	/* check whether enough space for opcode and msg_len */
+	if (total_sz + op_msglen_sz > buf_sz) {
+		dev_err(dev, "VF %d msg size exceeds buffer size\n", vf->vf_id);
+		return -ENOBUFS;
+	}
+
+	set_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states);
+
+	while (msg_slot->opcode != VIRTCHNL_OP_UNKNOWN) {
+		slot_sz = struct_size(msg_slot, msg_buffer, msg_slot->msg_len);
+		total_sz += slot_sz;
+		/* check whether enough space for whole message */
+		if (total_sz > buf_sz) {
+			dev_err(dev, "VF %d msg size exceeds buffer size\n",
+				vf->vf_id);
+			ret = -ENOBUFS;
+			break;
+		}
+		dev_dbg(dev, "VF %d replay virtchnl message op code: %d, msg len: %d\n",
+			vf->vf_id, msg_slot->opcode, msg_slot->msg_len);
+		event.desc.cookie_high = msg_slot->opcode;
+		event.msg_len = msg_slot->msg_len;
+		event.desc.retval = vf->vf_id;
+		event.msg_buf = (unsigned char *)msg_slot->msg_buffer;
+		ret = ice_vc_process_vf_msg(vf->pf, &event, NULL);
+		if (ret) {
+			dev_err(dev, "failed to replay virtchnl message op code: %d\n",
+				msg_slot->opcode);
+			break;
+		}
+		event.msg_buf = NULL;
+		msg_slot = (struct ice_migration_virtchnl_msg_slot *)
+					((char *)msg_slot + slot_sz);
+		/* check whether enough space for opcode and msg_len */
+		if (total_sz + op_msglen_sz > buf_sz) {
+			dev_err(dev, "VF %d msg size exceeds buffer size\n",
+				vf->vf_id);
+			ret = -ENOBUFS;
+			break;
+		}
+	}
+	clear_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states);
+	return ret;
+}
+EXPORT_SYMBOL(ice_migration_restore_devstate);
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 9eb2ff5c10f1..6c99ad7ac2e0 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -3895,11 +3895,24 @@ ice_is_malicious_vf(struct ice_vf *vf, struct ice_mbx_data *mbxdata)
  * @event: pointer to the AQ event
  * @mbxdata: information used to detect VF attempting mailbox overflow
  *
- * called from the common asq/arq handler to
- * process request from VF
+ * This function will be called from:
+ * 1. the common asq/arq handler to process request from VF
+ *
+ *    The return value is ignored, as the command will send the status of the
+ *    request as a response to the VF. This flow sets the mbxdata to
+ *    a non-NULL value and must call ice_is_malicious_vf to determine if this
+ *    VF might be attempting to overflow the PF message queue.
+ *
+ * 2. replay virtual channel commands during live migration
+ *
+ *    The return value is used to indicate failure to replay vc commands and
+ *    that the migration failed. This flow sets mbxdata to NULL and skips the
+ *    ice_is_malicious_vf checks which are unnecessary during replay.
+ *
+ * Return 0 if success, negative for failure.
  */
-void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
-			   struct ice_mbx_data *mbxdata)
+int ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
+			  struct ice_mbx_data *mbxdata)
 {
 	u32 v_opcode = le32_to_cpu(event->desc.cookie_high);
 	s16 vf_id = le16_to_cpu(event->desc.retval);
@@ -3916,13 +3929,13 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 	if (!vf) {
 		dev_err(dev, "Unable to locate VF for message from VF ID %d, opcode %d, len %d\n",
 			vf_id, v_opcode, msglen);
-		return;
+		return -EINVAL;
 	}
 
 	mutex_lock(&vf->cfg_lock);
 
 	/* Check if the VF is trying to overflow the mailbox */
-	if (ice_is_malicious_vf(vf, mbxdata))
+	if (!test_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states) &&
+	    ice_is_malicious_vf(vf, mbxdata))
 		goto finish;
 
 	/* Check if VF is disabled. */
@@ -4084,4 +4097,5 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 finish:
 	mutex_unlock(&vf->cfg_lock);
 	ice_put_vf(vf);
+	return err;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.h b/drivers/net/ethernet/intel/ice/ice_virtchnl.h
index a2b6094e2f2f..4b151a228c52 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.h
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.h
@@ -63,8 +63,8 @@ int
 ice_vc_respond_to_vf(struct ice_vf *vf, u32 v_opcode,
 		     enum virtchnl_status_code v_retval, u8 *msg, u16 msglen);
 bool ice_vc_isvalid_vsi_id(struct ice_vf *vf, u16 vsi_id);
-void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
-			   struct ice_mbx_data *mbxdata);
+int ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
+			  struct ice_mbx_data *mbxdata);
 #else /* CONFIG_PCI_IOV */
 static inline void ice_virtchnl_set_dflt_ops(struct ice_vf *vf) { }
 static inline void ice_virtchnl_set_repr_ops(struct ice_vf *vf) { }
@@ -84,10 +84,11 @@ static inline bool ice_vc_isvalid_vsi_id(struct ice_vf *vf, u16 vsi_id)
 	return false;
 }
 
-static inline void
+static inline int
 ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 		      struct ice_mbx_data *mbxdata)
 {
+	return -EOPNOTSUPP;
 }
 #endif /* !CONFIG_PCI_IOV */
 
diff --git a/include/linux/net/intel/ice_migration.h b/include/linux/net/intel/ice_migration.h
index 5f1c765ed582..741a242558a1 100644
--- a/include/linux/net/intel/ice_migration.h
+++ b/include/linux/net/intel/ice_migration.h
@@ -9,6 +9,8 @@ void *ice_migration_get_vf(struct pci_dev *vf_pdev);
 void ice_migration_put_vf(void *opaque);
 void ice_migration_init_vf(void *opaque);
 void ice_migration_uninit_vf(void *opaque);
+int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz);
+int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz);
 
 #else
 static inline void *ice_migration_get_vf(struct pci_dev *vf_pdev)
@@ -21,6 +23,16 @@ static inline void ice_migration_put_vf(void *opaque)
 }
 static inline void ice_migration_init_vf(void *opaque) { }
 static inline void ice_migration_uninit_vf(void *opaque) { }
+static inline int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
+{
+	return 0;
+}
+
+static inline int ice_migration_restore_devstate(void *opaque, const u8 *buf,
+						 u64 buf_sz)
+{
+	return 0;
+}
 #endif /* CONFIG_ICE_VFIO_PCI */
 
 #endif /* _ICE_MIGRATION_H_ */
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 07/15] ice: do not notify VF link state during migration
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (5 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 06/15] ice: save and restore " Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 08/15] ice: change VSI id in virtual channel message after migration Lingyu Liu
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

The PF driver notifies the VF of its link state when the virtual channel
commands GET_VF_RESOURCE and ENABLE_QUEUES are replayed.

Block the PF driver from sending this notification to the VF while the
VF is migrating.

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 6c99ad7ac2e0..0a10ea0b3b6d 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -233,6 +233,9 @@ void ice_vc_notify_vf_link_state(struct ice_vf *vf)
 	struct virtchnl_pf_event pfe = { 0 };
 	struct ice_hw *hw = &vf->pf->hw;
 
+	if (test_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states))
+		return;
+
 	pfe.event = VIRTCHNL_EVENT_LINK_CHANGE;
 	pfe.severity = PF_EVENT_SEVERITY_INFO;
 
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 08/15] ice: change VSI id in virtual channel message after migration
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (6 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 07/15] ice: do not notify VF link state during migration Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 09/15] ice: save and restore RX queue head Lingyu Liu
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Save the VSI number used in the migration source VM in the VF structure
and change the VSI id in the virtual channel message payload to the
destination VF's VSI id under either of the following conditions:
1) the VSI id in the virtual channel message payload matches the source
VF's VSI id, or
2) the virtual channel message is being replayed.

This prevents the PF from rejecting VF messages during migration.

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
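
A small compilable user-space sketch of the rewrite rule above. It assumes,
as the code below does, that the affected "simple" opcodes carry a 16-bit
VSI id (host byte order) in the first two bytes of the payload; the helper
name and the VSI values are made up for illustration:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Minimal model of the rewrite rule, not the driver's API. */
static void fixup_simple_vsi_id(uint8_t *msg, uint16_t src_vm_vsi,
				uint16_t dst_vf_vsi, int replaying)
{
	uint16_t vsi_id;

	memcpy(&vsi_id, msg, sizeof(vsi_id));	/* first two bytes of payload */
	if (vsi_id == src_vm_vsi || replaying) {
		vsi_id = dst_vf_vsi;
		memcpy(msg, &vsi_id, sizeof(vsi_id));
	}
}

int main(void)
{
	uint8_t payload[8] = { 0x23, 0x00 };	/* VSI id 0x23 saved on the source */

	fixup_simple_vsi_id(payload, 0x23, 0x31, 0);
	printf("rewritten VSI id: 0x%x\n", payload[0] | (payload[1] << 8));
	return 0;
}
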
 .../net/ethernet/intel/ice/ice_migration.c    | 108 ++++++++++++++++++
 .../intel/ice/ice_migration_private.h         |   3 +
 drivers/net/ethernet/intel/ice/ice_vf_lib.c   |   1 +
 drivers/net/ethernet/intel/ice/ice_vf_lib.h   |   1 +
 drivers/net/ethernet/intel/ice/ice_virtchnl.c |   3 +
 5 files changed, 116 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index 49ad3c252f03..68f9ff843d12 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -14,6 +14,10 @@ struct ice_migration_virtchnl_msg_listnode {
 	struct ice_migration_virtchnl_msg_slot msg_slot;
 };
 
+struct ice_migration_dev_state {
+	u16 vsi_id;
+} __aligned(8);
+
 /**
  * ice_migration_get_vf - Get ice VF structure pointer by pdev
  * @vf_pdev: pointer to ice vfio pci VF pdev structure
@@ -62,6 +66,7 @@ void ice_migration_init_vf(void *opaque)
 	vf->migration_active = true;
 	INIT_LIST_HEAD(&vf->virtchnl_msg_list);
 	vf->virtchnl_msg_num = 0;
+	vf->vm_vsi_num = vf->lan_vsi_num;
 }
 EXPORT_SYMBOL(ice_migration_init_vf);
 
@@ -175,11 +180,24 @@ int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
 	struct ice_migration_virtchnl_msg_slot *last_op;
 	struct ice_vf *vf = (struct ice_vf *)opaque;
 	struct device *dev = ice_pf_to_dev(vf->pf);
+	struct ice_migration_dev_state *devstate;
 	u64 total_sz = 0;
 
 	if (vf == NULL)
 		return -EINVAL;
 
+	/* reserve space to store device state, saving VSI id in the beginning */
+	total_sz += sizeof(struct ice_migration_dev_state);
+	if (total_sz > buf_sz) {
+		dev_err(dev, "Insufficient buffer to store device state for VF %d\n",
+			vf->vf_id);
+		return -ENOBUFS;
+	}
+
+	devstate = (struct ice_migration_dev_state *)buf;
+	devstate->vsi_id = vf->vm_vsi_num;
+	buf += sizeof(*devstate);
+
 	list_for_each_entry(msg_listnode, &vf->virtchnl_msg_list, node) {
 		struct ice_migration_virtchnl_msg_slot *msg_slot;
 		u64 slot_size;
@@ -229,6 +247,7 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 	struct ice_migration_virtchnl_msg_slot *msg_slot;
 	struct ice_vf *vf = (struct ice_vf *)opaque;
 	struct device *dev = ice_pf_to_dev(vf->pf);
+	struct ice_migration_dev_state *devstate;
 	struct ice_rq_event_info event;
 	u64 total_sz = 0;
 	u64 op_msglen_sz;
@@ -238,6 +257,16 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 	if (!buf)
 		return -EINVAL;
 
+	total_sz += sizeof(struct ice_migration_dev_state);
+	if (total_sz > buf_sz) {
+		dev_err(dev, "VF %d msg size exceeds buffer size\n", vf->vf_id);
+		return -ENOBUFS;
+	}
+
+	devstate = (struct ice_migration_dev_state *)buf;
+	vf->vm_vsi_num = devstate->vsi_id;
+	dev_dbg(dev, "VF %d vm vsi num is:%d\n", vf->vf_id, vf->vm_vsi_num);
+	buf += sizeof(*devstate);
 	msg_slot = (struct ice_migration_virtchnl_msg_slot *)buf;
 	op_msglen_sz = sizeof(struct ice_migration_virtchnl_msg_slot);
 	/* check whether enough space for opcode and msg_len */
@@ -285,3 +314,82 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 	return ret;
 }
 EXPORT_SYMBOL(ice_migration_restore_devstate);
+
+/**
+ * ice_migration_fix_msg_vsi - change virtual channel msg VSI id
+ *
+ * @vf: pointer to the VF structure
+ * @v_opcode: virtchnl message operation code
+ * @msg: pointer to the virtual channel message
+ *
+ * After migration, the VSI id carried in virtual channel messages is
+ * still the migration source VSI id, so some virtual channel commands
+ * would fail due to the mismatched VSI id.
+ * Change the VSI id in the message payload to the real (local) VSI id.
+ */
+void ice_migration_fix_msg_vsi(struct ice_vf *vf, u32 v_opcode, u8 *msg)
+{
+	if (!vf->migration_active)
+		return;
+
+	switch (v_opcode) {
+	case VIRTCHNL_OP_ADD_ETH_ADDR:
+	case VIRTCHNL_OP_DEL_ETH_ADDR:
+	case VIRTCHNL_OP_ENABLE_QUEUES:
+	case VIRTCHNL_OP_DISABLE_QUEUES:
+	case VIRTCHNL_OP_CONFIG_RSS_KEY:
+	case VIRTCHNL_OP_CONFIG_RSS_LUT:
+	case VIRTCHNL_OP_GET_STATS:
+	case VIRTCHNL_OP_CONFIG_PROMISCUOUS_MODE:
+	case VIRTCHNL_OP_ADD_FDIR_FILTER:
+	case VIRTCHNL_OP_DEL_FDIR_FILTER:
+	case VIRTCHNL_OP_ADD_VLAN:
+	case VIRTCHNL_OP_DEL_VLAN: {
+		/* Read the beginning two bytes of message for VSI id */
+		u16 *vsi_id = (u16 *)msg;
+
+		if (*vsi_id == vf->vm_vsi_num ||
+		    test_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states))
+			*vsi_id = vf->lan_vsi_num;
+		break;
+	}
+	case VIRTCHNL_OP_CONFIG_IRQ_MAP: {
+		struct virtchnl_irq_map_info *irqmap_info;
+		u16 num_q_vectors_mapped;
+		int i;
+
+		irqmap_info = (struct virtchnl_irq_map_info *)msg;
+		num_q_vectors_mapped = irqmap_info->num_vectors;
+		for (i = 0; i < num_q_vectors_mapped; i++) {
+			struct virtchnl_vector_map *map;
+
+			map = &irqmap_info->vecmap[i];
+			if (map->vsi_id == vf->vm_vsi_num ||
+			    test_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states))
+				map->vsi_id = vf->lan_vsi_num;
+		}
+		break;
+	}
+	case VIRTCHNL_OP_CONFIG_VSI_QUEUES: {
+		struct virtchnl_vsi_queue_config_info *qci;
+
+		qci = (struct virtchnl_vsi_queue_config_info *)msg;
+		if (qci->vsi_id == vf->vm_vsi_num ||
+		    test_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states)) {
+			int i;
+
+			qci->vsi_id = vf->lan_vsi_num;
+			for (i = 0; i < qci->num_queue_pairs; i++) {
+				struct virtchnl_queue_pair_info *qpi;
+
+				qpi = &qci->qpair[i];
+				qpi->txq.vsi_id = vf->lan_vsi_num;
+				qpi->rxq.vsi_id = vf->lan_vsi_num;
+			}
+		}
+		break;
+	}
+	default:
+		break;
+	}
+}
diff --git a/drivers/net/ethernet/intel/ice/ice_migration_private.h b/drivers/net/ethernet/intel/ice/ice_migration_private.h
index 4773fbc6b099..728acfaefbdf 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration_private.h
+++ b/drivers/net/ethernet/intel/ice/ice_migration_private.h
@@ -13,10 +13,13 @@
 #if IS_ENABLED(CONFIG_ICE_VFIO_PCI)
 void ice_migration_save_vf_msg(struct ice_vf *vf,
 			       struct ice_rq_event_info *event);
+void ice_migration_fix_msg_vsi(struct ice_vf *vf, u32 v_opcode, u8 *msg);
 #else
 static inline void
 ice_migration_save_vf_msg(struct ice_vf *vf,
 			  struct ice_rq_event_info *event) { }
+static inline void
+ice_migration_fix_msg_vsi(struct ice_vf *vf, u32 v_opcode, u8 *msg) { }
 #endif /* CONFIG_ICE_VFIO_PCI */
 
 #endif /* _ICE_MIGRATION_PRIVATE_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethernet/intel/ice/ice_vf_lib.c
index 4b1940487b27..200c6eebd5c3 100644
--- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c
@@ -1332,6 +1332,7 @@ void ice_vf_set_initialized(struct ice_vf *vf)
 	clear_bit(ICE_VF_STATE_DIS, vf->vf_states);
 	set_bit(ICE_VF_STATE_INIT, vf->vf_states);
 	memset(&vf->vlan_v2_caps, 0, sizeof(vf->vlan_v2_caps));
+	vf->vm_vsi_num = vf->lan_vsi_num;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.h b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
index b77daa7d310c..7304bb854f44 100644
--- a/drivers/net/ethernet/intel/ice/ice_vf_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.h
@@ -138,6 +138,7 @@ struct ice_vf {
 	bool migration_active;
 	struct list_head virtchnl_msg_list;
 	u64 virtchnl_msg_num;
+	u16 vm_vsi_num;
 };
 
 /* Flags for controlling behavior of ice_reset_vf */
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 0a10ea0b3b6d..7f5868c975d7 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -3974,6 +3974,9 @@ int ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event,
 		goto finish;
 	}
 
+	if (vf->migration_active)
+		ice_migration_fix_msg_vsi(vf, v_opcode, msg);
+
 	switch (v_opcode) {
 	case VIRTCHNL_OP_VERSION:
 		err = ops->get_ver_msg(vf, msg);
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 09/15] ice: save and restore RX queue head
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (7 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 08/15] ice: change VSI id in virtual channel message after migration Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX " Lingyu Liu
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Save the RX queue head in the device migration region and restore it
at the migration destination by writing the RX queue context after
replaying the virtual channel command VIRTCHNL_OP_CONFIG_VSI_QUEUES.

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
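
To visualize the blob this and the preceding patches build, here is a
stand-alone model of the layout: a fixed, 8-byte-aligned device-state
header (VSI id plus the RX heads added here), followed by variable-length
virtchnl message slots and a terminating slot whose opcode is "unknown".
Sizes, names and opcode values are placeholders, not the driver's:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_RXQ 4	/* stand-in; the driver sizes this at IAVF_QRX_TAIL_MAX */
#define OP_UNKNOWN 0	/* stands in for VIRTCHNL_OP_UNKNOWN */

struct dev_state_hdr {
	uint16_t vsi_id;
	uint16_t rx_head[NUM_RXQ];
} __attribute__((aligned(8)));

struct msg_slot {
	uint32_t opcode;
	uint16_t msg_len;
	uint8_t msg_buffer[];
};

int main(void)
{
	uint64_t storage[32] = { 0 };		/* keep the blob 8-byte aligned */
	uint8_t *blob = (uint8_t *)storage;
	struct dev_state_hdr hdr = { .vsi_id = 5, .rx_head = { 12, 0, 7, 3 } };
	struct msg_slot *slot;

	/* save side: header first, then the recorded virtchnl messages */
	memcpy(blob, &hdr, sizeof(hdr));
	slot = (struct msg_slot *)(blob + sizeof(hdr));
	slot->opcode = 3;			/* some saved opcode */
	slot->msg_len = 0;
	slot = (struct msg_slot *)((uint8_t *)slot + sizeof(*slot) + slot->msg_len);
	slot->opcode = OP_UNKNOWN;		/* terminator */

	/* restore side walks the same layout until the terminator */
	for (slot = (struct msg_slot *)(blob + sizeof(hdr));
	     slot->opcode != OP_UNKNOWN;
	     slot = (struct msg_slot *)((uint8_t *)slot + sizeof(*slot) + slot->msg_len))
		printf("replay opcode %u, msg len %u\n",
		       slot->opcode, (unsigned int)slot->msg_len);

	printf("RX queue 0 resumes at descriptor %u\n", (unsigned int)hdr.rx_head[0]);
	return 0;
}
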
 .../net/ethernet/intel/ice/ice_migration.c    | 98 +++++++++++++++++++
 include/linux/net/intel/ice_migration.h       |  3 +
 2 files changed, 101 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index 68f9ff843d12..2579bc0bd193 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -2,6 +2,7 @@
 /* Copyright (C) 2018-2023 Intel Corporation */
 
 #include "ice.h"
+#include "ice_base.h"
 
 struct ice_migration_virtchnl_msg_slot {
 	u32 opcode;
@@ -16,6 +17,8 @@ struct ice_migration_virtchnl_msg_listnode {
 
 struct ice_migration_dev_state {
 	u16 vsi_id;
+	/* next RX desc index to be processed by the device */
+	u16 rx_head[IAVF_QRX_TAIL_MAX];
 } __aligned(8);
 
 /**
@@ -166,6 +169,44 @@ void ice_migration_save_vf_msg(struct ice_vf *vf,
 	}
 }
 
+static int
+ice_migration_save_rx_head(struct ice_vf *vf,
+			   struct ice_migration_dev_state *devstate)
+{
+	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
+	struct ice_pf *pf = vf->pf;
+	struct device *dev;
+	struct ice_hw *hw;
+	int i;
+
+	dev = ice_pf_to_dev(pf);
+	hw = &pf->hw;
+
+	if (!vsi) {
+		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
+		return -EINVAL;
+	}
+	ice_for_each_rxq(vsi, i) {
+		struct ice_rx_ring *rx_ring = vsi->rx_rings[i];
+		struct ice_rlan_ctx rlan_ctx = {0};
+		int status;
+		u16 pf_q;
+
+		if (!test_bit(i, vf->rxq_ena))
+			continue;
+
+		pf_q = rx_ring->reg_idx;
+		status = ice_read_rxq_ctx(hw, &rlan_ctx, pf_q);
+		if (status) {
+			dev_err(dev, "Failed to read RXQ[%d] context, err=%d\n",
+				rx_ring->q_index, status);
+			return -EIO;
+		}
+		devstate->rx_head[i] = rlan_ctx.head;
+	}
+	return 0;
+}
+
 /**
  * ice_migration_save_devstate - save VF msg to migration buffer
  * @opaque: pointer to VF handler in ice vdev
@@ -182,6 +223,7 @@ int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
 	struct device *dev = ice_pf_to_dev(vf->pf);
 	struct ice_migration_dev_state *devstate;
 	u64 total_sz = 0;
+	int ret;
 
 	if (vf == NULL)
 		return -EINVAL;
@@ -196,6 +238,11 @@ int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
 
 	devstate = (struct ice_migration_dev_state *)buf;
 	devstate->vsi_id = vf->vm_vsi_num;
+	ret = ice_migration_save_rx_head(vf, devstate);
+	if (ret) {
+		dev_err(dev, "VF %d failed to save rxq head\n", vf->vf_id);
+		return ret;
+	}
 	buf += sizeof(*devstate);
 
 	list_for_each_entry(msg_listnode, &vf->virtchnl_msg_list, node) {
@@ -231,6 +278,48 @@ int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
 }
 EXPORT_SYMBOL(ice_migration_save_devstate);
 
+static int
+ice_migration_restore_rx_head(struct ice_vf *vf,
+			      struct ice_migration_dev_state *devstate)
+{
+	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
+	struct ice_pf *pf = vf->pf;
+	struct device *dev;
+	int i;
+
+	dev = ice_pf_to_dev(pf);
+
+	if (!vsi) {
+		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
+		return -EINVAL;
+	}
+	ice_for_each_rxq(vsi, i) {
+		struct ice_rx_ring *rx_ring = vsi->rx_rings[i];
+		struct ice_rlan_ctx rlan_ctx = {0};
+		int status;
+		u16 pf_q;
+
+		if (!rx_ring)
+			return -EINVAL;
+		pf_q = rx_ring->reg_idx;
+		status = ice_read_rxq_ctx(&pf->hw, &rlan_ctx, pf_q);
+		if (status) {
+			dev_err(dev, "Failed to read RXQ[%d] context, err=%d\n",
+				rx_ring->q_index, status);
+			return -EIO;
+		}
+
+		rlan_ctx.head = devstate->rx_head[i];
+		status = ice_write_rxq_ctx(&pf->hw, &rlan_ctx, pf_q);
+		if (status) {
+			dev_err(dev, "Failed to set LAN RXQ[%d] context, err=%d\n",
+				rx_ring->q_index, status);
+			return -EIO;
+		}
+	}
+	return 0;
+}
+
 /**
  * ice_migration_restore_devstate - restore device state at dst
  * @opaque: pointer to VF handler in ice vdev
@@ -299,6 +388,15 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 				msg_slot->opcode);
 			break;
 		}
+		if (msg_slot->opcode == VIRTCHNL_OP_CONFIG_VSI_QUEUES) {
+			ret = ice_migration_restore_rx_head(vf, devstate);
+			if (ret) {
+				dev_err(dev, "VF %d failed to restore rx head\n",
+					vf->vf_id);
+				break;
+			}
+		}
+
 		event.msg_buf = NULL;
 		msg_slot = (struct ice_migration_virtchnl_msg_slot *)
 					((char *)msg_slot + slot_sz);
diff --git a/include/linux/net/intel/ice_migration.h b/include/linux/net/intel/ice_migration.h
index 741a242558a1..68e567791b5c 100644
--- a/include/linux/net/intel/ice_migration.h
+++ b/include/linux/net/intel/ice_migration.h
@@ -5,6 +5,9 @@
 #define _ICE_MIGRATION_H_
 
 #if IS_ENABLED(CONFIG_ICE_VFIO_PCI)
+
+#define IAVF_QRX_TAIL_MAX 256
+
 void *ice_migration_get_vf(struct pci_dev *vf_pdev);
 void ice_migration_put_vf(void *opaque);
 void ice_migration_init_vf(void *opaque);
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (8 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 09/15] ice: save and restore RX queue head Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21 14:37   ` Jason Gunthorpe
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 11/15] ice: stop device before saving device states Lingyu Liu
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Save the TX queue head in the device migration region and restore it
at the migration destination by inserting NOP descriptors and kicking
the doorbell.

By default, the VF driver's TX DMA ring context is configured by the
ice driver. In the device resuming stage of live migration, it is the
ice driver's responsibility to recover the TX DMA ring context. Due to
a HW limitation, part of the TX DMA ring context cannot be recovered by
writing MMIO registers. Instead, it can be recovered indirectly by
injecting NOP descriptors into the DMA ring. The NOP descriptor
injection needs assistance from the host kernel driver to read and
write the guest's TX DMA ring descriptors, so vfio_dma_rw() is used to
recover the TX DMA ring context.

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
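
A worked example of the head bookkeeping described above, as a stand-alone
program. The 0x1FFF marker value comes from the comment added to
ice_vc_ena_qs_msg() in this patch; the function and macro names here are
illustrative only:

#include <stdint.h>
#include <stdio.h>

/* Mirrors QTX_COMM_HEAD_HEAD_M, written at queue-enable time as the
 * "nothing sent yet" marker. */
#define QTX_COMM_HEAD_MARKER 0x1FFF

/* How the saved "next descriptor to process" index is derived from a
 * raw QTX_COMM_HEAD read, per the rule in ice_migration_save_tx_head(). */
static uint16_t next_tx_desc(uint16_t head, uint16_t ring_len)
{
	if (head == QTX_COMM_HEAD_MARKER || head == ring_len - 1)
		return 0;	/* nothing sent yet, or the ring wrapped */
	return head + 1;
}

int main(void)
{
	printf("%u\n", (unsigned int)next_tx_desc(0x1FFF, 512));	/* no packets sent -> 0 */
	printf("%u\n", (unsigned int)next_tx_desc(511, 512));		/* ring wrapped    -> 0 */
	printf("%u\n", (unsigned int)next_tx_desc(9, 512));		/* head 9 -> next is 10 */
	return 0;
}
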
 .../net/ethernet/intel/ice/ice_migration.c    | 203 +++++++++++++++++-
 drivers/net/ethernet/intel/ice/ice_virtchnl.c |  10 +
 include/linux/net/intel/ice_migration.h       |   9 +-
 3 files changed, 213 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index 2579bc0bd193..c2a83a97af05 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -3,6 +3,7 @@
 
 #include "ice.h"
 #include "ice_base.h"
+#include "ice_txrx_lib.h"
 
 struct ice_migration_virtchnl_msg_slot {
 	u32 opcode;
@@ -19,6 +20,8 @@ struct ice_migration_dev_state {
 	u16 vsi_id;
 	/* next RX desc index to be processed by the device */
 	u16 rx_head[IAVF_QRX_TAIL_MAX];
+	/* next TX desc index to be processed by the device */
+	u16 tx_head[IAVF_QRX_TAIL_MAX];
 } __aligned(8);
 
 /**
@@ -207,10 +210,51 @@ ice_migration_save_rx_head(struct ice_vf *vf,
 	return 0;
 }
 
+static int
+ice_migration_save_tx_head(struct ice_vf *vf,
+			   struct ice_migration_dev_state *devstate)
+{
+	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
+	struct ice_pf *pf = vf->pf;
+	struct device *dev;
+	int i = 0;
+
+	dev = ice_pf_to_dev(pf);
+
+	if (!vsi) {
+		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
+		return -EINVAL;
+	}
+
+	ice_for_each_txq(vsi, i) {
+		u16 tx_head;
+		u32 reg;
+
+		if (!test_bit(i, vf->txq_ena))
+			continue;
+
+		reg = rd32(&pf->hw, QTX_COMM_HEAD(vsi->txq_map[i]));
+		tx_head = (reg & QTX_COMM_HEAD_HEAD_M)
+					>> QTX_COMM_HEAD_HEAD_S;
+
+		if (tx_head == QTX_COMM_HEAD_HEAD_M ||
+		    tx_head == (vsi->tx_rings[i]->count - 1))
+			/* No packets sent yet (marker value), or the ring has
+			 * wrapped: the next descriptor to be processed is 0.
+			 */
+			tx_head = 0;
+		else
+			tx_head++;
+
+		devstate->tx_head[i] = tx_head;
+	}
+	return 0;
+}
+
 /**
  * ice_migration_save_devstate - save VF msg to migration buffer
- * @opaque: pointer to VF handler in ice vdev
- * @buf: pointer to VF msg in migration buffer
+ * @opaque: pointer to vf handler in ice vdev
+ * @buf: pointer to vf msg in migration buffer
  * @buf_sz: size of migration buffer
  *
  * Return 0 for success, negative for error
@@ -243,6 +287,11 @@ int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
 		dev_err(dev, "VF %d failed to save rxq head\n", vf->vf_id);
 		return ret;
 	}
+	ret = ice_migration_save_tx_head(vf, devstate);
+	if (ret) {
+		dev_err(dev, "VF %d failed to save txq head\n", vf->vf_id);
+		return ret;
+	}
 	buf += sizeof(*devstate);
 
 	list_for_each_entry(msg_listnode, &vf->virtchnl_msg_list, node) {
@@ -320,18 +369,145 @@ ice_migration_restore_rx_head(struct ice_vf *vf,
 	return 0;
 }
 
+static int
+ice_migration_restore_tx_head(struct ice_vf *vf,
+			      struct ice_migration_dev_state *devstate,
+			      struct vfio_device *vdev)
+{
+	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
+	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
+	struct ice_pf *pf = vf->pf;
+	u16 max_ring_len = 0;
+	struct device *dev;
+	int ret = 0;
+	int i = 0;
+
+	dev = ice_pf_to_dev(vf->pf);
+
+	if (!vsi) {
+		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
+		return -EINVAL;
+	}
+
+	ice_for_each_txq(vsi, i) {
+		if (!test_bit(i, vf->txq_ena))
+			continue;
+
+		max_ring_len = max(vsi->tx_rings[i]->count, max_ring_len);
+	}
+
+	if (max_ring_len == 0)
+		return 0;
+
+	tx_desc = kcalloc(max_ring_len, sizeof(*tx_desc), GFP_KERNEL);
+	tx_desc_dummy = kcalloc(max_ring_len, sizeof(*tx_desc_dummy),
+				GFP_KERNEL);
+	if (!tx_desc || !tx_desc_dummy) {
+		dev_err(dev, "VF %d failed to allocate memory for tx descriptors to restore tx head\n",
+			vf->vf_id);
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	for (i = 0; i < max_ring_len; i++) {
+		u32 td_cmd;
+
+		td_cmd = ICE_TXD_LAST_DESC_CMD | ICE_TX_DESC_CMD_DUMMY;
+		tx_desc_dummy[i].cmd_type_offset_bsz =
+					ice_build_ctob(td_cmd, 0, SZ_256, 0);
+	}
+
+	/* For each tx queue, we restore the tx head following below steps:
+	 * 1. backup original tx ring descriptor memory
+	 * 2. overwrite the tx ring descriptor with dummy packets
+	 * 3. kick doorbell register to trigger descriptor writeback,
+	 *    then tx head will move from 0 to tail - 1 and tx head is restored
+	 *    to the place we expect.
+	 * 4. restore the tx ring with original tx ring descriptor memory in
+	 *    order not to corrupt the ring context.
+	 */
+	ice_for_each_txq(vsi, i) {
+		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
+		u16 *tx_heads = devstate->tx_head;
+		u32 tx_head;
+		int j;
+
+		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
+			continue;
+
+		if (tx_heads[i] >= tx_ring->count) {
+			dev_err(dev, "saved tx ring head exceeds tx ring count\n");
+			ret = -EINVAL;
+			goto err;
+		}
+		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
+				  tx_ring->count * sizeof(tx_desc[0]), false);
+		if (ret) {
+			dev_err(dev, "kvm read guest tx ring error: %d\n",
+				ret);
+			goto err;
+		}
+		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc_dummy,
+				  tx_heads[i] * sizeof(tx_desc_dummy[0]), true);
+		if (ret) {
+			dev_err(dev, "kvm write guest return error: %d\n",
+				ret);
+			goto err;
+		}
+
+		/* Force memory writes to complete before letting h/w know there
+		 * are new descriptors to fetch.
+		 */
+		wmb();
+		writel(tx_heads[i], tx_ring->tail);
+		/* wait until tx_head equals tx_heads[i] - 1 */
+		tx_head = rd32(&pf->hw, QTX_COMM_HEAD(vsi->txq_map[i]));
+		tx_head = (tx_head & QTX_COMM_HEAD_HEAD_M)
+			   >> QTX_COMM_HEAD_HEAD_S;
+		for (j = 0; j < QTX_HEAD_RESTORE_DELAY_MAX &&
+				tx_head != (u32)(tx_heads[i] - 1); j++) {
+			usleep_range(QTX_HEAD_RESTORE_DELAY_SLEEP_US_MIN,
+				     QTX_HEAD_RESTORE_DELAY_SLEEP_US_MAX);
+			tx_head = rd32(&pf->hw, QTX_COMM_HEAD(vsi->txq_map[i]));
+			tx_head = (tx_head & QTX_COMM_HEAD_HEAD_M)
+				   >> QTX_COMM_HEAD_HEAD_S;
+		}
+		if (j == QTX_HEAD_RESTORE_DELAY_MAX) {
+			ret = -EIO;
+			dev_err(dev, "VF %d txq[%d] head restore timeout\n",
+				vf->vf_id, i);
+			goto err;
+		}
+		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
+				  tx_ring->count * sizeof(tx_desc[0]), true);
+		if (ret) {
+			dev_err(dev, "kvm write guest tx ring error: %d\n",
+				ret);
+			goto err;
+		}
+	}
+
+err:
+	kfree(tx_desc_dummy);
+	kfree(tx_desc);
+
+	return ret;
+}
+
 /**
  * ice_migration_restore_devstate - restore device state at dst
  * @opaque: pointer to VF handler in ice vdev
  * @buf: pointer to device state buf in migration buffer
  * @buf_sz: size of migration buffer
+ * @vdev: pointer to vfio device
  *
  * This function uses the device state saved in migration buffer
  * to restore device state at dst VM
  *
  * Return 0 for success, negative for error
  */
-int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
+int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
+				   struct vfio_device *vdev)
 {
 	struct ice_migration_virtchnl_msg_slot *msg_slot;
 	struct ice_vf *vf = (struct ice_vf *)opaque;
@@ -343,7 +519,7 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 	u64 slot_sz;
 	int ret = 0;
 
-	if (!buf)
+	if (!buf || !vdev)
 		return -EINVAL;
 
 	total_sz += sizeof(struct ice_migration_dev_state);
@@ -374,7 +550,7 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 			dev_err(dev, "VF %d msg size exceeds buffer size\n",
 				vf->vf_id);
 			ret = -ENOBUFS;
-			break;
+			goto err;
 		}
 		dev_dbg(dev, "VF %d replay virtchnl message op code: %d, msg len: %d\n",
 			vf->vf_id, msg_slot->opcode, msg_slot->msg_len);
@@ -386,7 +562,7 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 		if (ret) {
 			dev_err(dev, "failed to replay virtchnl message op code: %d\n",
 				msg_slot->opcode);
-			break;
+			goto err;
 		}
 		if (msg_slot->opcode == VIRTCHNL_OP_CONFIG_VSI_QUEUES) {
 			ret = ice_migration_restore_rx_head(vf, devstate);
@@ -405,9 +581,22 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz)
 			dev_err(dev, "VF %d msg size exceeds buffer size\n",
 				vf->vf_id);
 			ret = -ENOBUFS;
-			break;
+			goto err;
 		}
 	}
+
+	/* Since we can't restore tx head directly due to HW limitation, we
+	 * could only restore tx head indirectly by dummy packets injection.
+	 * After virtual channel replay completes, tx rings are enabled.
+	 * Then restore tx head for tx rings by injecting dummy packets.
+	 */
+	ret = ice_migration_restore_tx_head(vf, devstate, vdev);
+	if (ret) {
+		dev_err(dev, "failed to restore tx queue head\n");
+		goto err;
+	}
+
+err:
 	clear_bit(ICE_VF_STATE_REPLAYING_VC, vf->vf_states);
 	return ret;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 7f5868c975d7..e6018bf0b6a8 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -1328,6 +1328,16 @@ static int ice_vc_ena_qs_msg(struct ice_vf *vf, u8 *msg)
 			continue;
 
 		ice_vf_ena_txq_interrupt(vsi, vf_q_id);
+		/* Set the TXQ head value to 0x1FFF to indicate that no packet
+		 * has been sent. According to the CVL HAS Transmit Queue
+		 * Context Structure, the descriptor queue size ranges from 8
+		 * descriptors (QLEN=0x8) to 8K-32 descriptors (QLEN=0x1FE0), so
+		 * QTX_COMM_HEAD.HEAD values from 0x1FE0 to 0x1FFF are reserved
+		 * and never used by HW. Use 0x1FFF as a marker for live
+		 * migration.
+		 */
+		if (vf->migration_active)
+			wr32(&vsi->back->hw, QTX_COMM_HEAD(vsi->txq_map[vf_q_id]),
+			     QTX_COMM_HEAD_HEAD_M);
 		set_bit(vf_q_id, vf->txq_ena);
 	}
 
diff --git a/include/linux/net/intel/ice_migration.h b/include/linux/net/intel/ice_migration.h
index 68e567791b5c..b59200a0a059 100644
--- a/include/linux/net/intel/ice_migration.h
+++ b/include/linux/net/intel/ice_migration.h
@@ -3,17 +3,22 @@
 
 #ifndef _ICE_MIGRATION_H_
 #define _ICE_MIGRATION_H_
+#include <linux/vfio.h>
 
 #if IS_ENABLED(CONFIG_ICE_VFIO_PCI)
 
 #define IAVF_QRX_TAIL_MAX 256
+#define QTX_HEAD_RESTORE_DELAY_MAX 100
+#define QTX_HEAD_RESTORE_DELAY_SLEEP_US_MIN 10
+#define QTX_HEAD_RESTORE_DELAY_SLEEP_US_MAX 10
 
 void *ice_migration_get_vf(struct pci_dev *vf_pdev);
 void ice_migration_put_vf(void *opaque);
 void ice_migration_init_vf(void *opaque);
 void ice_migration_uninit_vf(void *opaque);
 int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz);
-int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz);
+int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
+				   struct vfio_device *vdev);
 
 #else
 static inline void *ice_migration_get_vf(struct pci_dev *vf_pdev)
@@ -32,7 +37,7 @@ static inline int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
 }
 
 static inline int ice_migration_restore_devstate(void *opaque, const u8 *buf,
-						 u64 buf_sz)
+						 u64 buf_sz, struct vfio_device *vdev)
 {
 	return 0;
 }
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 11/15] ice: stop device before saving device states
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (9 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX " Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 12/15] ice: mask VF advanced capabilities if live migration is activated Lingyu Liu
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Stop the device by disabling its TX and RX queues.

The ice_vfio_pci driver introduced later in this series needs this
function when the device state transitions from
VFIO_DEVICE_STATE_RUNNING to VFIO_DEVICE_STATE_STOP.

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
 .../net/ethernet/intel/ice/ice_migration.c    | 68 +++++++++++++++++++
 include/linux/net/intel/ice_migration.h       |  5 ++
 2 files changed, 73 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index c2a83a97af05..c588738828ab 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -2,6 +2,8 @@
 /* Copyright (C) 2018-2023 Intel Corporation */
 
 #include "ice.h"
+#include "ice_lib.h"
+#include "ice_fltr.h"
 #include "ice_base.h"
 #include "ice_txrx_lib.h"
 
@@ -172,6 +174,72 @@ void ice_migration_save_vf_msg(struct ice_vf *vf,
 	}
 }
 
+/**
+ * ice_migration_suspend_vf - suspend device on src
+ * @opaque: pointer to VF handler in ice vdev
+ * @is_dst: true if this VF is the migration destination, false if it is the source
+ *
+ * Return 0 for success, negative for error
+ */
+int ice_migration_suspend_vf(void *opaque, bool is_dst)
+{
+	struct ice_vf *vf = (struct ice_vf *)opaque;
+	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
+	struct ice_pf *pf = vf->pf;
+	struct device *dev;
+	int ret;
+
+	if (is_dst)
+		return 0;
+
+	dev = ice_pf_to_dev(pf);
+	if (vf->virtchnl_msg_num >= VIRTCHNL_MSG_MAX) {
+		dev_err(dev, "SR-IOV live migration disabled on VF %d. Migration buffer exceeded\n",
+			vf->vf_id);
+		return -EIO;
+	}
+
+	if (!vsi) {
+		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
+		return -EINVAL;
+	}
+	/* Shield the VSI from incoming packets by removing all filters before
+	 * stopping the rx rings and draining the traffic. The rx ring head
+	 * value may jitter if the rx ring is stopped while a large amount of
+	 * packets is still incoming, in which case HW and SW disagree on the
+	 * rx ring head state. After the rx ring head is restored on the
+	 * destination VM, the missing rx descriptors would then never be
+	 * written back, causing packet reception failures and drops.
+	 */
+	ice_fltr_remove_all(vsi);
+	/* The MAC based filter rules are removed at this point. Zero the MAC
+	 * address to keep 'ip link' output consistent with that.
+	 */
+	eth_zero_addr(vf->hw_lan_addr);
+	eth_zero_addr(vf->dev_lan_addr);
+	/* On the tx side, some descriptors may still be pending transmission
+	 * by HW. Since the VM is stopped now, wait a while to make sure all
+	 * transmission has completed.
+	 * On the rx side, head value jitter may happen at high packet rates.
+	 * Since all forwarding filters have been removed, wait a while to make
+	 * sure all reception has completed and the rx head no longer moves.
+	 */
+	usleep_range(1000, 2000);
+	ret = ice_vsi_stop_lan_tx_rings(vsi, ICE_NO_RESET, vf->vf_id);
+	if (ret) {
+		dev_err(dev, "VF %d failed to stop tx rings\n", vf->vf_id);
+		return -EIO;
+	}
+	ret = ice_vsi_stop_all_rx_rings(vsi);
+	if (ret) {
+		dev_err(dev, "VF %d failed to stop rx rings\n", vf->vf_id);
+		return -EIO;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(ice_migration_suspend_vf);
+
 static int
 ice_migration_save_rx_head(struct ice_vf *vf,
 			   struct ice_migration_dev_state *devstate)
diff --git a/include/linux/net/intel/ice_migration.h b/include/linux/net/intel/ice_migration.h
index b59200a0a059..45c3469df55d 100644
--- a/include/linux/net/intel/ice_migration.h
+++ b/include/linux/net/intel/ice_migration.h
@@ -16,6 +16,7 @@ void *ice_migration_get_vf(struct pci_dev *vf_pdev);
 void ice_migration_put_vf(void *opaque);
 void ice_migration_init_vf(void *opaque);
 void ice_migration_uninit_vf(void *opaque);
+int ice_migration_suspend_vf(void *opaque, bool mig_dst);
 int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz);
 int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
 				   struct vfio_device *vdev);
@@ -31,6 +32,10 @@ static inline void ice_migration_put_vf(void *opaque)
 }
 static inline void ice_migration_init_vf(void *opaque) { }
 static inline void ice_migration_uninit_vf(void *opaque) { }
+static inline int ice_migration_suspend_vf(void *opaque, bool mig_dst)
+{
+	return 0;
+}
 static inline int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz)
 {
 	return 0;
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 12/15] ice: mask VF advanced capabilities if live migration is activated
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (10 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 11/15] ice: stop device before saving device states Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 13/15] vfio/ice: implement vfio_pci driver for E800 devices Lingyu Liu
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Mask the advanced VF capability flags that are
not supported when VF migration is activated.

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
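
A toy illustration of what the masking step means for capability
negotiation: the GET_VF_RESOURCES reply ends up advertising requested AND
supported, so flags outside the migration-supported set are dropped rather
than failing the request. Flag names and values below are invented for the
example:

#include <stdint.h>
#include <stdio.h>

#define CAP_BASIC_L2	0x1	/* in the migration-supported set */
#define CAP_RSS		0x2	/* in the migration-supported set */
#define CAP_FANCY	0x4	/* not in the set */

int main(void)
{
	uint32_t supported = CAP_BASIC_L2 | CAP_RSS;
	uint32_t requested = CAP_BASIC_L2 | CAP_FANCY;

	printf("granted caps: 0x%x\n", requested & supported);	/* 0x1 */
	return 0;
}
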
 .../net/ethernet/intel/ice/ice_migration.c    | 27 +++++++++++++++++++
 .../intel/ice/ice_migration_private.h         |  6 +++++
 drivers/net/ethernet/intel/ice/ice_virtchnl.c |  2 ++
 3 files changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index c588738828ab..0bc897ab0dc2 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -748,3 +748,30 @@ void ice_migration_fix_msg_vsi(struct ice_vf *vf, u32 v_opcode, u8 *msg)
 		break;
 	}
 }
+
+#define VIRTCHNL_VF_MIGRATION_SUPPORT_FEATURE \
+				(VIRTCHNL_VF_OFFLOAD_L2 | \
+				 VIRTCHNL_VF_OFFLOAD_RSS_PF | \
+				 VIRTCHNL_VF_OFFLOAD_RSS_AQ | \
+				 VIRTCHNL_VF_OFFLOAD_RSS_REG | \
+				 VIRTCHNL_VF_OFFLOAD_RSS_PCTYPE_V2 | \
+				 VIRTCHNL_VF_OFFLOAD_ENCAP | \
+				 VIRTCHNL_VF_OFFLOAD_ENCAP_CSUM | \
+				 VIRTCHNL_VF_OFFLOAD_RX_POLLING | \
+				 VIRTCHNL_VF_OFFLOAD_WB_ON_ITR | \
+				 VIRTCHNL_VF_CAP_ADV_LINK_SPEED | \
+				 VIRTCHNL_VF_OFFLOAD_VLAN | \
+				 VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC | \
+				 VIRTCHNL_VF_OFFLOAD_USO)
+
+/**
+ * ice_migration_supported_caps - get migration-supported VF capabilities
+ *
+ * When migration is activated, some VF capabilities are not supported.
+ * Return the capability flags that still are, so the caller can mask out
+ * the rest when building the VF resources.
+ */
+u32 ice_migration_supported_caps(void)
+{
+	return VIRTCHNL_VF_MIGRATION_SUPPORT_FEATURE;
+}
diff --git a/drivers/net/ethernet/intel/ice/ice_migration_private.h b/drivers/net/ethernet/intel/ice/ice_migration_private.h
index 728acfaefbdf..47ad45c2b737 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration_private.h
+++ b/drivers/net/ethernet/intel/ice/ice_migration_private.h
@@ -14,12 +14,18 @@
 void ice_migration_save_vf_msg(struct ice_vf *vf,
 			       struct ice_rq_event_info *event);
 void ice_migration_fix_msg_vsi(struct ice_vf *vf, u32 v_opcode, u8 *msg);
+u32 ice_migration_supported_caps(void);
 #else
 static inline void
 ice_migration_save_vf_msg(struct ice_vf *vf,
 			  struct ice_rq_event_info *event) { }
 static inline void
 ice_migration_fix_msg_vsi(struct ice_vf *vf, u32 v_opcode, u8 *msg) { }
+static inline u32
+ice_migration_supported_caps(void)
+{
+	return 0xFFFFFFFF;
+}
 #endif /* CONFIG_ICE_VFIO_PCI */
 
 #endif /* _ICE_MIGRATION_PRIVATE_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index e6018bf0b6a8..87d049d6ec03 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -468,6 +468,8 @@ static int ice_vc_get_vf_res_msg(struct ice_vf *vf, u8 *msg)
 				  VIRTCHNL_VF_OFFLOAD_RSS_REG |
 				  VIRTCHNL_VF_OFFLOAD_VLAN;
 
+	if (vf->migration_active)
+		vf->driver_caps &= ice_migration_supported_caps();
 	vfres->vf_cap_flags = VIRTCHNL_VF_OFFLOAD_L2;
 	vsi = ice_get_vf_vsi(vf);
 	if (!vsi) {
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 13/15] vfio/ice: implement vfio_pci driver for E800 devices
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (11 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 12/15] ice: mask VF advanced capabilities if live migration is activated Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21 14:23   ` Jason Gunthorpe
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 14/15] vfio: Expose vfio_device_has_container() Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 15/15] vfio/ice: support iommufd vfio compat mode Lingyu Liu
  14 siblings, 1 reply; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

Add a vendor-specific vfio_pci driver for E800 devices.

It uses vfio_pci_core to register with the VFIO subsystem and then
implements the E800-specific logic to support VF live migration.

It implements the device state transition flow for live
migration.

Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
---
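
For orientation, a stand-alone sketch of the precopy-less state machine the
driver has to cover, annotated with the action this series takes on each
arc. It compiles against the v2 migration uapi (linux/vfio.h from v5.18 or
later); the mapping is a plausible reading of the series, not the driver's
exact code:

#include <stdio.h>
#include <linux/vfio.h>		/* enum vfio_device_mig_state */

static const char *arc_action(enum vfio_device_mig_state cur,
			      enum vfio_device_mig_state next)
{
	if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_STOP)
		return "suspend the VF: drain and stop TX/RX rings";
	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_STOP_COPY)
		return "save VF registers and device state into the saving file";
	if (cur == VFIO_DEVICE_STATE_STOP_COPY && next == VFIO_DEVICE_STATE_STOP)
		return "release the saving file";
	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RESUMING)
		return "create the resuming file for userspace to write into";
	if (cur == VFIO_DEVICE_STATE_RESUMING && next == VFIO_DEVICE_STATE_STOP)
		return "load VF registers and replay the saved virtchnl messages";
	if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING)
		return "let the VF run on the destination";
	return "not covered in this sketch";
}

int main(void)
{
	printf("RUNNING -> STOP: %s\n",
	       arc_action(VFIO_DEVICE_STATE_RUNNING, VFIO_DEVICE_STATE_STOP));
	printf("RESUMING -> STOP: %s\n",
	       arc_action(VFIO_DEVICE_STATE_RESUMING, VFIO_DEVICE_STATE_STOP));
	return 0;
}
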
 MAINTAINERS                         |   7 +
 drivers/vfio/pci/Kconfig            |   2 +
 drivers/vfio/pci/Makefile           |   2 +
 drivers/vfio/pci/ice/Kconfig        |  10 +
 drivers/vfio/pci/ice/Makefile       |   4 +
 drivers/vfio/pci/ice/ice_vfio_pci.c | 841 ++++++++++++++++++++++++++++
 6 files changed, 866 insertions(+)
 create mode 100644 drivers/vfio/pci/ice/Kconfig
 create mode 100644 drivers/vfio/pci/ice/Makefile
 create mode 100644 drivers/vfio/pci/ice/ice_vfio_pci.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7322963b0670..39a2d7c15dc4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22170,6 +22170,13 @@ L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/pci/mlx5/
 
+VFIO ICE PCI DRIVER
+M:	Yahui Cao <yahui.cao@intel.com>
+M:	Lingyu Liu <lingyu.liu@intel.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	drivers/vfio/pci/ice/
+
 VFIO PCI DEVICE SPECIFIC DRIVERS
 R:	Jason Gunthorpe <jgg@nvidia.com>
 R:	Yishai Hadas <yishaih@nvidia.com>
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..834ad57c7455 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
 
+source "drivers/vfio/pci/ice/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..12d2ee3350c5 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_ICE_VFIO_PCI) += ice/
diff --git a/drivers/vfio/pci/ice/Kconfig b/drivers/vfio/pci/ice/Kconfig
new file mode 100644
index 000000000000..4c6f348d3062
--- /dev/null
+++ b/drivers/vfio/pci/ice/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config ICE_VFIO_PCI
+	tristate "VFIO support for Intel(R) Ethernet Connection E800 Series"
+	depends on ICE
+	depends on VFIO_PCI_CORE
+	help
+	  This provides migration support for Intel(R) Ethernet connection E800
+	  series devices using the VFIO framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/ice/Makefile b/drivers/vfio/pci/ice/Makefile
new file mode 100644
index 000000000000..259d4ab89105
--- /dev/null
+++ b/drivers/vfio/pci/ice/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_ICE_VFIO_PCI) += ice-vfio-pci.o
+ice-vfio-pci-y := ice_vfio_pci.o
+
diff --git a/drivers/vfio/pci/ice/ice_vfio_pci.c b/drivers/vfio/pci/ice/ice_vfio_pci.c
new file mode 100644
index 000000000000..389a2be41896
--- /dev/null
+++ b/drivers/vfio/pci/ice/ice_vfio_pci.c
@@ -0,0 +1,841 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2018-2023 Intel Corporation */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/file.h>
+
+#include <linux/net/intel/ice_migration.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/anon_inodes.h>
+
+#define DRIVER_DESC     "ICE VFIO PCI - User Level meta-driver for Intel E800 device family"
+
+/* IAVF registers description */
+#define IAVF_VF_ARQBAH1 0x00006000 /* Reset: EMPR */
+#define IAVF_VF_ATQH1 0x00006400 /* Reset: EMPR */
+#define IAVF_VF_ATQLEN1 0x00006800 /* Reset: EMPR */
+#define IAVF_VF_ARQBAL1 0x00006C00 /* Reset: EMPR */
+#define IAVF_VF_ARQT1 0x00007000   /* Reset: EMPR */
+#define IAVF_VF_ARQH1 0x00007400   /* Reset: EMPR */
+#define IAVF_VF_ATQBAH1 0x00007800 /* Reset: EMPR */
+#define IAVF_VF_ATQBAL1 0x00007C00 /* Reset: EMPR */
+#define IAVF_VF_ARQLEN1 0x00008000 /* Reset: EMPR */
+#define IAVF_VF_ATQT1 0x00008400   /* Reset: EMPR */
+#define IAVF_VFINT_DYN_CTL01 0x00005C00 /* Reset: VFR */
+#define IAVF_VFINT_DYN_CTLN1(_INTVF) \
+	(0x00003800 + ((_INTVF) * 4)) /* _INTVF=0...16 */ /* Reset: VFR */
+#define IAVF_VFINT_DYN_CTLN_NUM 16
+#define IAVF_VFINT_ITRN0(_i) \
+	(0x00004C00 + (_i) * 4) /* _i=0...2 */ /* Reset: VFR */
+#define IAVF_VFINT_ITRN0_NUM 3
+#define IAVF_VFINT_ITRN1(_i, _INTVF) (0x00002800 + ((_i) * 64 + (_INTVF) * 4))
+	/* _i=0...2, _INTVF=0...15 */ /* Reset: VFR */
+#define IAVF_VFINT_ITRN_NUM 3
+#define IAVF_QRX_TAIL1(_Q) \
+	(0x00002000 + ((_Q) * 4)) /* _Q=0...256 */ /* Reset: CORER */
+
+/* Registers for saving and loading during live Migration */
+struct ice_vfio_pci_regs {
+	/* VF interrupts */
+	u32 int_dyn_ctl0;
+	u32 int_dyn_ctln[IAVF_VFINT_DYN_CTLN_NUM];
+	u32 int_intr0[IAVF_VFINT_ITRN0_NUM];
+	u32 int_intrn[IAVF_VFINT_ITRN_NUM][IAVF_VFINT_DYN_CTLN_NUM];
+
+	/* VF Control Queues */
+	u32 asq_bal;
+	u32 asq_bah;
+	u32 asq_len;
+	u32 asq_head;
+	u32 asq_tail;
+	u32 arq_bal;
+	u32 arq_bah;
+	u32 arq_len;
+	u32 arq_head;
+	u32 arq_tail;
+
+	/* VF LAN RX */
+	u32 rx_tail[IAVF_QRX_TAIL_MAX];
+};
+
+struct ice_vfio_pci_migration_data {
+	struct ice_vfio_pci_regs regs;
+
+	u8 __aligned(8) dev_state[SZ_128K];
+};
+
+struct ice_vfio_pci_migration_file {
+	struct file *filp;
+	struct mutex lock;
+	bool disabled;
+
+	struct ice_vfio_pci_migration_data mig_data;
+	size_t total_length;
+};
+
+struct ice_vfio_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	u8 deferred_reset:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	/* protect the reset_done flow */
+	spinlock_t reset_lock;
+	struct ice_vfio_pci_migration_file *resuming_migf;
+	struct ice_vfio_pci_migration_file *saving_migf;
+	struct vfio_device_migration_info mig_info;
+	struct ice_vfio_pci_migration_data *mig_data;
+	u8 __iomem *io_base;
+	void *vf_handle;
+	bool is_dst;
+};
+
+/**
+ * ice_vfio_pci_save_regs - Save migration register data
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ * @regs: pointer to ice_vfio_pci_regs structure
+ *
+ */
+static void
+ice_vfio_pci_save_regs(struct ice_vfio_pci_core_device *ice_vdev,
+		       struct ice_vfio_pci_regs *regs)
+{
+	u8 __iomem *io_base = ice_vdev->io_base;
+	int i, j;
+
+	regs->int_dyn_ctl0 = readl(io_base + IAVF_VFINT_DYN_CTL01);
+
+	for (i = 0; i < IAVF_VFINT_DYN_CTLN_NUM; i++)
+		regs->int_dyn_ctln[i] =
+		    readl(io_base + IAVF_VFINT_DYN_CTLN1(i));
+
+	for (i = 0; i < IAVF_VFINT_ITRN0_NUM; i++)
+		regs->int_intr0[i] = readl(io_base + IAVF_VFINT_ITRN0(i));
+
+	for (i = 0; i < IAVF_VFINT_ITRN_NUM; i++)
+		for (j = 0; j < IAVF_VFINT_DYN_CTLN_NUM; j++)
+			regs->int_intrn[i][j] =
+			    readl(io_base + IAVF_VFINT_ITRN1(i, j));
+
+	regs->asq_bal = readl(io_base + IAVF_VF_ATQBAL1);
+	regs->asq_bah = readl(io_base + IAVF_VF_ATQBAH1);
+	regs->asq_len = readl(io_base + IAVF_VF_ATQLEN1);
+	regs->asq_head = readl(io_base + IAVF_VF_ATQH1);
+	regs->asq_tail = readl(io_base + IAVF_VF_ATQT1);
+	regs->arq_bal = readl(io_base + IAVF_VF_ARQBAL1);
+	regs->arq_bah = readl(io_base + IAVF_VF_ARQBAH1);
+	regs->arq_len = readl(io_base +  IAVF_VF_ARQLEN1);
+	regs->arq_head = readl(io_base + IAVF_VF_ARQH1);
+	regs->arq_tail = readl(io_base + IAVF_VF_ARQT1);
+
+	for (i = 0; i < IAVF_QRX_TAIL_MAX; i++)
+		regs->rx_tail[i] = readl(io_base + IAVF_QRX_TAIL1(i));
+}
+
+/**
+ * ice_vfio_pci_load_regs - Load migration register data
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ * @regs: pointer to ice_vfio_pci_regs structure
+ *
+ */
+static void
+ice_vfio_pci_load_regs(struct ice_vfio_pci_core_device *ice_vdev,
+		       struct ice_vfio_pci_regs *regs)
+{
+	u8 __iomem *io_base = ice_vdev->io_base;
+	int i, j;
+
+	writel(regs->int_dyn_ctl0, io_base + IAVF_VFINT_DYN_CTL01);
+
+	for (i = 0; i < IAVF_VFINT_DYN_CTLN_NUM; i++)
+		writel(regs->int_dyn_ctln[i],
+		       io_base + IAVF_VFINT_DYN_CTLN1(i));
+
+	for (i = 0; i < IAVF_VFINT_ITRN0_NUM; i++)
+		writel(regs->int_intr0[i], io_base + IAVF_VFINT_ITRN0(i));
+
+	for (i = 0; i < IAVF_VFINT_ITRN_NUM; i++)
+		for (j = 0; j < IAVF_VFINT_DYN_CTLN_NUM; j++)
+			writel(regs->int_intrn[i][j],
+			       io_base + IAVF_VFINT_ITRN1(i, j));
+
+	writel(regs->asq_bal, io_base + IAVF_VF_ATQBAL1);
+	writel(regs->asq_bah, io_base + IAVF_VF_ATQBAH1);
+	writel(regs->asq_len, io_base + IAVF_VF_ATQLEN1);
+	writel(regs->asq_head, io_base + IAVF_VF_ATQH1);
+	writel(regs->asq_tail, io_base + IAVF_VF_ATQT1);
+	writel(regs->arq_bal, io_base + IAVF_VF_ARQBAL1);
+	writel(regs->arq_bah, io_base + IAVF_VF_ARQBAH1);
+	writel(regs->arq_len, io_base +  IAVF_VF_ARQLEN1);
+	writel(regs->arq_head, io_base + IAVF_VF_ARQH1);
+	writel(regs->arq_tail, io_base + IAVF_VF_ARQT1);
+
+	for (i = 0; i < IAVF_QRX_TAIL_MAX; i++)
+		writel(regs->rx_tail[i], io_base + IAVF_QRX_TAIL1(i));
+}
+
+/**
+ * ice_vfio_pci_load_state - VFIO device state reloading
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ *
+ * Load device state and restore it. This function is called when the VFIO uAPI
+ * consumer wants to load the device state info from the VFIO migration region
+ * and restore it into the device. This function must make sure all of the
+ * device state info is loaded and restored successfully, so the return value
+ * is mandatory to check.
+ *
+ * Return 0 for success, negative value for failure.
+ */
+static int __must_check
+ice_vfio_pci_load_state(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	struct ice_vfio_pci_migration_file *migf = ice_vdev->resuming_migf;
+
+	ice_vfio_pci_load_regs(ice_vdev, &migf->mig_data.regs);
+	return ice_migration_restore_devstate(ice_vdev->vf_handle,
+					      migf->mig_data.dev_state,
+					      SZ_128K,
+					      &ice_vdev->core_device.vdev);
+}
+
+/**
+ * ice_vfio_pci_save_state - VFIO device state saving
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ * @migf: pointer to migration file
+ *
+ * Snapshot the device state and save it. This function is called when the
+ * VFIO uAPI consumer wants to snapshot the current device state and save
+ * it into the VFIO migration region. This function must make sure all
+ * of the device state info is collected and saved successfully, so the
+ * return value is mandatory to check.
+ *
+ * Return 0 for success, negative value for failure.
+ */
+static int __must_check
+ice_vfio_pci_save_state(struct ice_vfio_pci_core_device *ice_vdev,
+			struct ice_vfio_pci_migration_file *migf)
+{
+	ice_vfio_pci_save_regs(ice_vdev, &migf->mig_data.regs);
+	migf->total_length = SZ_128K;
+
+	return ice_migration_save_devstate(ice_vdev->vf_handle,
+					   migf->mig_data.dev_state,
+					   SZ_128K);
+}
+
+/**
+ * ice_vfio_migration_init - Initialization for live migration function
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ *
+ * Returns 0 on success, negative value on error
+ */
+static int ice_vfio_migration_init(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	struct pci_dev *pdev = ice_vdev->core_device.pdev;
+
+	ice_vdev->vf_handle = ice_migration_get_vf(pdev);
+	if (!ice_vdev->vf_handle)
+		return -EFAULT;
+
+	ice_migration_init_vf(ice_vdev->vf_handle);
+	ice_vdev->io_base = (u8 __iomem *)pci_iomap(pdev, 0, 0);
+	if (!ice_vdev->io_base)
+		return -EFAULT;
+
+	return 0;
+}
+
+/**
+ * ice_vfio_migration_uninit - Cleanup for live migration function
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ */
+static void ice_vfio_migration_uninit(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	pci_iounmap(ice_vdev->core_device.pdev, ice_vdev->io_base);
+	ice_migration_uninit_vf(ice_vdev->vf_handle);
+	ice_migration_put_vf(ice_vdev->vf_handle);
+}
+
+/**
+ * ice_vfio_pci_disable_fd - Close migration file
+ * @migf: pointer to ice vfio pci migration file
+ */
+static void ice_vfio_pci_disable_fd(struct ice_vfio_pci_migration_file *migf)
+{
+	mutex_lock(&migf->lock);
+	migf->disabled = true;
+	migf->total_length = 0;
+	migf->filp->f_pos = 0;
+	mutex_unlock(&migf->lock);
+}
+
+/**
+ * ice_vfio_pci_disable_fds - Close migration files of ice vfio pci device
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ */
+static void ice_vfio_pci_disable_fds(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	if (ice_vdev->resuming_migf) {
+		ice_vfio_pci_disable_fd(ice_vdev->resuming_migf);
+		fput(ice_vdev->resuming_migf->filp);
+		ice_vdev->resuming_migf = NULL;
+	}
+	if (ice_vdev->saving_migf) {
+		ice_vfio_pci_disable_fd(ice_vdev->saving_migf);
+		fput(ice_vdev->saving_migf->filp);
+		ice_vdev->saving_migf = NULL;
+	}
+}
+
+/*
+ * This function is called in all state_mutex unlock cases to
+ * handle a 'deferred_reset' if one exists.
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ */
+static void
+ice_vfio_pci_state_mutex_unlock(struct ice_vfio_pci_core_device *ice_vdev)
+{
+again:
+	spin_lock(&ice_vdev->reset_lock);
+	if (ice_vdev->deferred_reset) {
+		ice_vdev->deferred_reset = false;
+		spin_unlock(&ice_vdev->reset_lock);
+		ice_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+		ice_vfio_pci_disable_fds(ice_vdev);
+		goto again;
+	}
+	mutex_unlock(&ice_vdev->state_mutex);
+	spin_unlock(&ice_vdev->reset_lock);
+}
+
+static void ice_vfio_pci_reset_done(struct pci_dev *pdev)
+{
+	struct ice_vfio_pci_core_device *ice_vdev =
+		(struct ice_vfio_pci_core_device *)dev_get_drvdata(&pdev->dev);
+
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock we need to prevent ABBA deadlock
+	 * with the state_mutex and mm_lock.
+	 * In case the state_mutex was taken already we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&ice_vdev->reset_lock);
+	ice_vdev->deferred_reset = true;
+	if (!mutex_trylock(&ice_vdev->state_mutex)) {
+		spin_unlock(&ice_vdev->reset_lock);
+		return;
+	}
+	spin_unlock(&ice_vdev->reset_lock);
+	ice_vfio_pci_state_mutex_unlock(ice_vdev);
+}
+
+/**
+ * ice_vfio_pci_open_device - Called when the vfio device fd is opened
+ * @core_vdev: the vfio device to open
+ *
+ * Initialization of the vfio device
+ *
+ * Returns 0 on success, negative value on error
+ */
+static int ice_vfio_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(core_vdev,
+			struct ice_vfio_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &ice_vdev->core_device;
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	ret = ice_vfio_migration_init(ice_vdev);
+	if (ret) {
+		vfio_pci_core_disable(vdev);
+		return ret;
+	}
+	ice_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+	vfio_pci_core_finish_enable(vdev);
+
+	return 0;
+}
+
+/**
+ * ice_vfio_pci_close_device - Called when a vfio device fd is closed
+ * @core_vdev: the vfio device to close
+ */
+static void ice_vfio_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(core_vdev,
+			struct ice_vfio_pci_core_device, core_device.vdev);
+
+	ice_vfio_pci_disable_fds(ice_vdev);
+	vfio_pci_core_close_device(core_vdev);
+	ice_vfio_migration_uninit(ice_vdev);
+}
+
+/**
+ * ice_vfio_pci_release_file - release ice vfio pci migration file
+ * @inode: pointer to inode
+ * @filp: pointer to the file to release
+ *
+ * Return 0 for success, negative for error
+ */
+static int ice_vfio_pci_release_file(struct inode *inode, struct file *filp)
+{
+	struct ice_vfio_pci_migration_file *migf = filp->private_data;
+
+	ice_vfio_pci_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+	return 0;
+}
+
+/**
+ * ice_vfio_pci_save_read - save migration file data to user space
+ * @filp: pointer to migration file
+ * @buf: pointer to user space buffer
+ * @len: data length to be saved
+ * @pos: should be 0
+ *
+ * Return len of saved data, negative for error
+ */
+static ssize_t ice_vfio_pci_save_read(struct file *filp, char __user *buf,
+				      size_t len, loff_t *pos)
+{
+	struct ice_vfio_pci_migration_file *migf = filp->private_data;
+	loff_t *off = &filp->f_pos;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+
+	mutex_lock(&migf->lock);
+	if (*off > migf->total_length) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, migf->total_length - *off, len);
+	if (len) {
+		ret = copy_to_user(buf, (u8 *)(&migf->mig_data) + *off, len);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*off += len;
+		done = len;
+	}
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations ice_vfio_pci_save_fops = {
+	.owner = THIS_MODULE,
+	.read = ice_vfio_pci_save_read,
+	.release = ice_vfio_pci_release_file,
+	.llseek = no_llseek,
+};
+
+/**
+ * ice_vfio_pci_stop_copy - create migration file and save migration state to it
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ *
+ * Return migration file handle on success, ERR_PTR() on failure
+ */
+static struct ice_vfio_pci_migration_file *
+ice_vfio_pci_stop_copy(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	struct ice_vfio_pci_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("ice_vfio_pci_mig",
+					&ice_vfio_pci_save_fops, migf,
+					O_RDONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	ret = ice_vfio_pci_save_state(ice_vdev, migf);
+	if (ret) {
+		fput(migf->filp);
+		kfree(migf);
+		return ERR_PTR(ret);
+	}
+
+	return migf;
+}
+
+/**
+ * ice_vfio_pci_resume_write - copy migration file data from user space
+ * @filp: pointer to migration file
+ * @buf: pointer to user space buffer
+ * @len: data length to be copied
+ * @pos: should be 0
+ *
+ * Return number of bytes written, negative value for error
+ */
+static ssize_t
+ice_vfio_pci_resume_write(struct file *filp, const char __user *buf,
+			  size_t len, loff_t *pos)
+{
+	struct ice_vfio_pci_migration_file *migf = filp->private_data;
+	loff_t *off = &filp->f_pos;
+	loff_t requested_length;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+
+	if (*off < 0 ||
+	    check_add_overflow((loff_t)len, *off, &requested_length))
+		return -EINVAL;
+
+	if (requested_length > sizeof(struct ice_vfio_pci_migration_data))
+		return -ENOMEM;
+
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	ret = copy_from_user((u8 *)(&migf->mig_data) + *off, buf, len);
+	if (ret) {
+		done = -EFAULT;
+		goto out_unlock;
+	}
+	*off += len;
+	done = len;
+	migf->total_length += len;
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations ice_vfio_pci_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = ice_vfio_pci_resume_write,
+	.release = ice_vfio_pci_release_file,
+	.llseek = no_llseek,
+};
+
+/**
+ * ice_vfio_pci_resume - create resuming migration file
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ *
+ * Return migration file handle on success, ERR_PTR() on failure
+ */
+static struct ice_vfio_pci_migration_file *
+ice_vfio_pci_resume(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	struct ice_vfio_pci_migration_file *migf;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("ice_vfio_pci_mig",
+					&ice_vfio_pci_resume_fops, migf,
+					O_WRONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+	return migf;
+}
+
+/**
+ * ice_vfio_pci_step_device_state_locked - process device state change
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ * @new: new device state
+ * @final: final device state
+ *
+ * Return migration file pointer or NULL on success, ERR_PTR() on failure
+ */
+static struct file *
+ice_vfio_pci_step_device_state_locked(struct ice_vfio_pci_core_device *ice_vdev,
+				      u32 new, u32 final)
+{
+	struct device *dev = &ice_vdev->core_device.pdev->dev;
+	u32 cur = ice_vdev->mig_state;
+	int ret;
+
+	if (final == VFIO_DEVICE_STATE_RESUMING)
+		ice_vdev->is_dst = true;
+	else if (final == VFIO_DEVICE_STATE_STOP)
+		ice_vdev->is_dst = false;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) {
+		if (!ice_vdev->is_dst)
+			dev_info(dev, "Live migration begins\n");
+		ice_migration_suspend_vf(ice_vdev->vf_handle, ice_vdev->is_dst);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct ice_vfio_pci_migration_file *migf;
+
+		migf = ice_vfio_pci_stop_copy(ice_vdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		ice_vdev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
+		ice_vfio_pci_disable_fds(ice_vdev);
+		dev_info(dev, "Live migration ends\n");
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
+		struct ice_vfio_pci_migration_file *migf;
+
+		migf = ice_vfio_pci_resume(ice_vdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		ice_vdev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = ice_vfio_pci_load_state(ice_vdev);
+		if (ret)
+			return ERR_PTR(ret);
+		ice_vfio_pci_disable_fds(ice_vdev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING)
+		return NULL;
+
+	/*
+	 * vfio_mig_get_next_state() does not use arcs other than the above
+	 */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+/**
+ * ice_vfio_pci_set_device_state - Configure device state
+ * @vdev: pointer to vfio pci device
+ * @new_state: new device state
+ *
+ * Return migration file pointer or NULL on success, ERR_PTR() on failure.
+ */
+static struct file *
+ice_vfio_pci_set_device_state(struct vfio_device *vdev,
+			    enum vfio_device_mig_state new_state)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(
+		vdev, struct ice_vfio_pci_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+	int ret;
+
+	mutex_lock(&ice_vdev->state_mutex);
+	while (new_state != ice_vdev->mig_state) {
+		ret = vfio_mig_get_next_state(vdev, ice_vdev->mig_state,
+					      new_state, &next_state);
+		if (ret) {
+			res = ERR_PTR(ret);
+			break;
+		}
+		res = ice_vfio_pci_step_device_state_locked(ice_vdev, next_state,
+							    new_state);
+		if (IS_ERR(res))
+			break;
+		ice_vdev->mig_state = next_state;
+		if (WARN_ON(res && new_state != ice_vdev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	ice_vfio_pci_state_mutex_unlock(ice_vdev);
+	return res;
+}
+
+/**
+ * ice_vfio_pci_get_device_state - get device state
+ * @vdev: pointer to vfio pci device
+ * @curr_state: device state
+ *
+ * Return 0 for success
+ */
+static int ice_vfio_pci_get_device_state(struct vfio_device *vdev,
+					 enum vfio_device_mig_state *curr_state)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(
+		vdev, struct ice_vfio_pci_core_device, core_device.vdev);
+
+	mutex_lock(&ice_vdev->state_mutex);
+	*curr_state = ice_vdev->mig_state;
+	ice_vfio_pci_state_mutex_unlock(ice_vdev);
+	return 0;
+}
+
+/**
+ * ice_vfio_pci_get_data_size - get migration data size
+ * @vdev: pointer to vfio pci device
+ * @stop_copy_length: migration data size
+ *
+ * Return 0 for success
+ */
+static int
+ice_vfio_pci_get_data_size(struct vfio_device *vdev,
+			   unsigned long *stop_copy_length)
+{
+	*stop_copy_length = SZ_128K;
+	return 0;
+}
+
+static const struct vfio_migration_ops ice_vfio_pci_migrn_state_ops = {
+	.migration_set_state = ice_vfio_pci_set_device_state,
+	.migration_get_state = ice_vfio_pci_get_device_state,
+	.migration_get_data_size = ice_vfio_pci_get_data_size,
+};
+
+/**
+ * ice_vfio_pci_core_init_dev - initialize vfio device
+ * @core_vdev: pointer to vfio device
+ *
+ * Return 0 for success
+ */
+static int ice_vfio_pci_core_init_dev(struct vfio_device *core_vdev)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(core_vdev,
+			struct ice_vfio_pci_core_device, core_device.vdev);
+
+	mutex_init(&ice_vdev->state_mutex);
+	spin_lock_init(&ice_vdev->reset_lock);
+
+	core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
+	core_vdev->mig_ops = &ice_vfio_pci_migrn_state_ops;
+
+	return vfio_pci_core_init_dev(core_vdev);
+}
+
+static const struct vfio_device_ops ice_vfio_pci_ops = {
+	.name		= "ice-vfio-pci",
+	.init		= ice_vfio_pci_core_init_dev,
+	.release	= vfio_pci_core_release_dev,
+	.open_device	= ice_vfio_pci_open_device,
+	.close_device	= ice_vfio_pci_close_device,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read		= vfio_pci_core_read,
+	.write		= vfio_pci_core_write,
+	.ioctl		= vfio_pci_core_ioctl,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+	.bind_iommufd	= vfio_iommufd_physical_bind,
+	.unbind_iommufd	= vfio_iommufd_physical_unbind,
+	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+};
+
+/**
+ * ice_vfio_pci_probe - Device initialization routine
+ * @pdev: PCI device information struct
+ * @id: entry in ice_vfio_pci_table
+ *
+ * Returns 0 on success, negative on failure
+ */
+static int
+ice_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct ice_vfio_pci_core_device *ice_vdev;
+	int ret;
+
+	ice_vdev = vfio_alloc_device(ice_vfio_pci_core_device, core_device.vdev,
+					&pdev->dev, &ice_vfio_pci_ops);
+	if (IS_ERR(ice_vdev))
+		return PTR_ERR(ice_vdev);
+
+	dev_set_drvdata(&pdev->dev, &ice_vdev->core_device);
+
+	ret = vfio_pci_core_register_device(&ice_vdev->core_device);
+	if (ret)
+		goto out_free;
+
+	return 0;
+
+out_free:
+	vfio_put_device(&ice_vdev->core_device.vdev);
+	return ret;
+}
+
+/**
+ * ice_vfio_pci_remove - Device removal routine
+ * @pdev: PCI device information struct
+ */
+static void ice_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct ice_vfio_pci_core_device *ice_vdev =
+		(struct ice_vfio_pci_core_device *)dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(&ice_vdev->core_device);
+	vfio_put_device(&ice_vdev->core_device.vdev);
+}
+
+/* ice_vfio_pci_table - PCI Device ID Table
+ *
+ * Wildcard entries (PCI_ANY_ID) should come last
+ * Last entry must be all 0s
+ *
+ * { Vendor ID, Device ID, SubVendor ID, SubDevice ID,
+ *   Class, Class Mask, private data (not used) }
+ */
+static const struct pci_device_id ice_vfio_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x1889) },
+	{}
+};
+MODULE_DEVICE_TABLE(pci, ice_vfio_pci_table);
+
+static const struct pci_error_handlers ice_vfio_pci_core_err_handlers = {
+	.reset_done = ice_vfio_pci_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
+static struct pci_driver ice_vfio_pci_driver = {
+	.name			= "ice-vfio-pci",
+	.id_table		= ice_vfio_pci_table,
+	.probe			= ice_vfio_pci_probe,
+	.remove			= ice_vfio_pci_remove,
+	.err_handler            = &ice_vfio_pci_core_err_handlers,
+	.driver_managed_dma	= true,
+};
+
+module_pci_driver(ice_vfio_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Intel Corporation, <linux.nics@intel.com>");
+MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 14/15] vfio: Expose vfio_device_has_container()
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (12 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 13/15] vfio/ice: implement vfio_pci driver for E800 devices Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 15/15] vfio/ice: support iommufd vfio compat mode Lingyu Liu
  14 siblings, 0 replies; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

From: Yahui Cao <yahui.cao@intel.com>

Expose vfio_device_has_container() so that drivers can tell which
vfio uAPI (legacy container or iommufd) the user application has
probed the device with.

The next patch, for the ice vfio pci driver, will use it.
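
For illustration only, a minimal sketch of how a variant driver is
expected to branch on this helper; the wrapper below is hypothetical,
only vfio_device_has_container() and vfio_dma_rw() are existing
vfio symbols:

	static int example_dma_rw(struct vfio_device *vdev, dma_addr_t iova,
				  void *buf, size_t len, bool write)
	{
		/* Legacy vfio: the type1 container path still works */
		if (vfio_device_has_container(vdev))
			return vfio_dma_rw(vdev, iova, buf, len, write);

		/* Otherwise the driver has to provide its own iommufd
		 * access, as the following ice patch does.
		 */
		return -EOPNOTSUPP;
	}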

Signed-off-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
---
 drivers/vfio/group.c | 1 +
 include/linux/vfio.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index fc75c1000d74..ae8f9ed4708a 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -744,6 +744,7 @@ bool vfio_device_has_container(struct vfio_device *device)
 {
 	return device->group->container;
 }
+EXPORT_SYMBOL_GPL(vfio_device_has_container);
 
 /**
  * vfio_file_iommu_group - Return the struct iommu_group for the vfio group file
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 2c137ea94a3e..ad8ed8fd8d56 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -253,6 +253,7 @@ bool vfio_file_is_group(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
+bool vfio_device_has_container(struct vfio_device *device);
 
 #define VFIO_PIN_PAGES_MAX_ENTRIES	(PAGE_SIZE/sizeof(unsigned long))
 
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Intel-wired-lan] [PATCH iwl-next V2 15/15] vfio/ice: support iommufd vfio compat mode
  2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
                   ` (13 preceding siblings ...)
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 14/15] vfio: Expose vfio_device_has_container() Lingyu Liu
@ 2023-06-21  9:11 ` Lingyu Liu
  2023-06-21 14:40   ` Jason Gunthorpe
  14 siblings, 1 reply; 40+ messages in thread
From: Lingyu Liu @ 2023-06-21  9:11 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: kevin.tian, yi.l.liu, phani.r.burra

From: Yahui Cao <yahui.cao@intel.com>

In iommufd vfio compat mode, vfio_dma_rw() fails, since
vfio_device_has_container() returns false and device->iommufd_access is
NULL.

Currently device->iommufd_access is not created when the vfio device is
backed by a pci device. To support IOVA access, manually create an
iommufd_access context with iommufd_access_create()/iommufd_access_attach()
and access the IOVA space with iommufd_access_rw(). In order to minimize
the iommufd_access's impact, store the iommufd_access context in the
driver data, create it only before loading the device state and destroy
it once the device state has been loaded.

To stay compatible with legacy vfio, use vfio_device_has_container() to
check which vfio uAPI is in use. In legacy vfio mode, call vfio_dma_rw()
directly, otherwise call iommufd_access_rw().
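
A condensed sketch of the access lifecycle this patch adds, simplified
from the diff below (error handling, locking and the legacy-container
short cut are omitted):

	/* before loading the device state (iommufd mode only) */
	user = iommufd_access_create(ice_vdev->ictx, &ice_vfio_user_ops,
				     ice_vdev, &pt_id);
	iommufd_access_attach(user, ice_vdev->pt_id);

	/* for each guest memory access during restore */
	iommufd_access_rw(user, iova, buf, len, flags);

	/* once the device state has been loaded */
	iommufd_access_destroy(user);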

Signed-off-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
---
 .../net/ethernet/intel/ice/ice_migration.c    |  23 +--
 drivers/vfio/pci/ice/ice_vfio_pci.c           | 171 +++++++++++++++++-
 include/linux/net/intel/ice_migration.h       |   4 +-
 3 files changed, 179 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
index 0bc897ab0dc2..c5bdfee1e3b0 100644
--- a/drivers/net/ethernet/intel/ice/ice_migration.c
+++ b/drivers/net/ethernet/intel/ice/ice_migration.c
@@ -440,7 +440,7 @@ ice_migration_restore_rx_head(struct ice_vf *vf,
 static int
 ice_migration_restore_tx_head(struct ice_vf *vf,
 			      struct ice_migration_dev_state *devstate,
-			      struct vfio_device *vdev)
+			      dma_rw_handler_t handler, void *data)
 {
 	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
 	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
@@ -509,15 +509,15 @@ ice_migration_restore_tx_head(struct ice_vf *vf,
 			ret = -EINVAL;
 			goto err;
 		}
-		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
-				  tx_ring->count * sizeof(tx_desc[0]), false);
+		ret = handler(data, tx_ring->dma, (void *)tx_desc,
+			      tx_ring->count * sizeof(tx_desc[0]), false);
 		if (ret) {
 			dev_err(dev, "kvm read guest tx ring error: %d\n",
 				ret);
 			goto err;
 		}
-		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc_dummy,
-				  tx_heads[i] * sizeof(tx_desc_dummy[0]), true);
+		ret = handler(data, tx_ring->dma, (void *)tx_desc_dummy,
+			      tx_heads[i] * sizeof(tx_desc_dummy[0]), true);
 		if (ret) {
 			dev_err(dev, "kvm write guest return error: %d\n",
 				ret);
@@ -546,8 +546,8 @@ ice_migration_restore_tx_head(struct ice_vf *vf,
 				vf->vf_id, i);
 			goto err;
 		}
-		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
-				  tx_ring->count * sizeof(tx_desc[0]), true);
+		ret = handler(data, tx_ring->dma, (void *)tx_desc,
+			      tx_ring->count * sizeof(tx_desc[0]), true);
 		if (ret) {
 			dev_err(dev, "kvm write guest tx ring error: %d\n",
 				ret);
@@ -567,7 +567,8 @@ ice_migration_restore_tx_head(struct ice_vf *vf,
  * @opaque: pointer to VF handler in ice vdev
  * @buf: pointer to device state buf in migration buffer
  * @buf_sz: size of migration buffer
- * @vdev: pointer to vfio device
+ * @handler: DMA read/write callback used to access guest memory
+ * @data: opaque data passed to @handler
  *
  * This function uses the device state saved in migration buffer
  * to restore device state at dst VM
@@ -575,7 +576,7 @@ ice_migration_restore_tx_head(struct ice_vf *vf,
  * Return 0 for success, negative for error
  */
 int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
-				   struct vfio_device *vdev)
+				   dma_rw_handler_t handler, void *data)
 {
 	struct ice_migration_virtchnl_msg_slot *msg_slot;
 	struct ice_vf *vf = (struct ice_vf *)opaque;
@@ -587,7 +588,7 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
 	u64 slot_sz;
 	int ret = 0;
 
-	if (!buf || !vdev)
+	if (!buf)
 		return -EINVAL;
 
 	total_sz += sizeof(struct ice_migration_dev_state);
@@ -658,7 +659,7 @@ int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
 	 * After virtual channel replay completes, tx rings are enabled.
 	 * Then restore tx head for tx rings by injecting dummy packets.
 	 */
-	ret = ice_migration_restore_tx_head(vf, devstate, vdev);
+	ret = ice_migration_restore_tx_head(vf, devstate, handler, data);
 	if (ret) {
 		dev_err(dev, "failed to restore tx queue head\n");
 		goto err;
diff --git a/drivers/vfio/pci/ice/ice_vfio_pci.c b/drivers/vfio/pci/ice/ice_vfio_pci.c
index 389a2be41896..45b95d8eef5c 100644
--- a/drivers/vfio/pci/ice/ice_vfio_pci.c
+++ b/drivers/vfio/pci/ice/ice_vfio_pci.c
@@ -9,6 +9,9 @@
 #include <linux/net/intel/ice_migration.h>
 #include <linux/vfio_pci_core.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommufd.h>
+
+MODULE_IMPORT_NS(IOMMUFD);
 
 #define DRIVER_DESC     "ICE VFIO PCI - User Level meta-driver for Intel E800 device family"
 
@@ -90,6 +93,10 @@ struct ice_vfio_pci_core_device {
 	u8 __iomem *io_base;
 	void *vf_handle;
 	bool is_dst;
+
+	u32 pt_id;
+	struct iommufd_ctx *ictx;
+	struct iommufd_access *user;
 };
 
 /**
@@ -176,6 +183,112 @@ ice_vfio_pci_load_regs(struct ice_vfio_pci_core_device *ice_vdev,
 		writel(regs->rx_tail[i], io_base + IAVF_QRX_TAIL1(i));
 }
 
+/**
+ * ice_vfio_pci_emulated_unmap - callback to unmap IOVA
+ * @data: function handler data
+ * @iova: I/O virtuall address
+ * @len: IOVA length
+ *
+ * This function is called when application are doing DMA unmap and in some
+ * cases driver needs to explicitly do some unmap ops if this device does not
+ * have backed iommu. Nothing is required here since this is pci baseed vfio
+ * device, which has backed iommu.
+ */
+static void
+ice_vfio_pci_emulated_unmap(void *data, unsigned long iova, unsigned long len)
+{
+}
+
+static const struct iommufd_access_ops ice_vfio_user_ops = {
+	.needs_pin_pages = 1,
+	.unmap = ice_vfio_pci_emulated_unmap,
+};
+
+/**
+ * ice_vfio_dma_rw - read/write function for device IOVA address space
+ * @data: function handler data
+ * @iova: I/O virtual address
+ * @buf: buffer for read/write access
+ * @len: buffer length
+ * @write: true for write, false for read
+ *
+ * Read/write function for device IOVA access. Since vfio_dma_rw() may fail
+ * in iommufd vfio compatible mode, we need a runtime check of which uAPI is
+ * in use and must pick the corresponding access method for IOVA access.
+ *
+ * Return 0 for success, negative value for failure.
+ */
+static int ice_vfio_dma_rw(void *data, dma_addr_t iova,
+			   void *buf, size_t len, bool write)
+{
+	struct ice_vfio_pci_core_device *ice_vdev =
+			(struct ice_vfio_pci_core_device *)data;
+	struct vfio_device *vdev = &ice_vdev->core_device.vdev;
+	unsigned int flags = 0;
+
+	if (vfio_device_has_container(vdev))
+		return vfio_dma_rw(vdev, iova, buf, len, write);
+
+	if (!current->mm)
+		flags |= IOMMUFD_ACCESS_RW_KTHREAD;
+	if (write)
+		flags |= IOMMUFD_ACCESS_RW_WRITE;
+	return iommufd_access_rw(ice_vdev->user, iova, buf, len, flags);
+}
+
+/**
+ * ice_vfio_pci_load_state_init - VFIO device state reloading initialization
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ *
+ * Initialization procedure before loading device state.
+ *
+ * Return 0 for success, negative value for failure.
+ */
+static int
+ice_vfio_pci_load_state_init(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	struct device *dev = &ice_vdev->core_device.pdev->dev;
+	struct iommufd_access *user;
+	int pt_id = 0;
+	int ret;
+
+	if (vfio_device_has_container(&ice_vdev->core_device.vdev))
+		return 0;
+
+	user = iommufd_access_create(ice_vdev->ictx, &ice_vfio_user_ops,
+				     ice_vdev, &pt_id);
+	if (IS_ERR(user)) {
+		ret = PTR_ERR(user);
+		dev_err(dev, "iommufd_access_create() return %d", ret);
+		return ret;
+	}
+
+	ret = iommufd_access_attach(user, ice_vdev->pt_id);
+	if (ret) {
+		dev_err(dev, "iommufd_access_attach() return %d", ret);
+		iommufd_access_destroy(user);
+		return ret;
+	}
+
+	ice_vdev->user = user;
+	return 0;
+}
+
+/**
+ * ice_vfio_pci_load_state_exit - VFIO device state reloading exit
+ * @ice_vdev: pointer to ice vfio pci core device structure
+ *
+ * Exit procedure after loading device state.
+ */
+static void
+ice_vfio_pci_load_state_exit(struct ice_vfio_pci_core_device *ice_vdev)
+{
+	if (vfio_device_has_container(&ice_vdev->core_device.vdev))
+		return;
+
+	iommufd_access_destroy(ice_vdev->user);
+}
+
 /**
  * ice_vfio_pci_load_state - VFIO device state reloading
  * @ice_vdev: pointer to ice vfio pci core device structure
@@ -192,12 +305,19 @@ static int __must_check
 ice_vfio_pci_load_state(struct ice_vfio_pci_core_device *ice_vdev)
 {
 	struct ice_vfio_pci_migration_file *migf = ice_vdev->resuming_migf;
+	int ret;
 
+	ret = ice_vfio_pci_load_state_init(ice_vdev);
+	if (ret)
+		return ret;
 	ice_vfio_pci_load_regs(ice_vdev, &migf->mig_data.regs);
-	return ice_migration_restore_devstate(ice_vdev->vf_handle,
-					      migf->mig_data.dev_state,
-					      SZ_128K,
-					      &ice_vdev->core_device.vdev);
+	ret = ice_migration_restore_devstate(ice_vdev->vf_handle,
+					     migf->mig_data.dev_state,
+					     SZ_128K,
+					     ice_vfio_dma_rw, ice_vdev);
+	ice_vfio_pci_load_state_exit(ice_vdev);
+
+	return ret;
 }
 
 /**
@@ -744,6 +864,43 @@ static int ice_vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	return vfio_pci_core_init_dev(core_vdev);
 }
 
+static int ice_vfio_pci_attach_ioas(struct vfio_device *core_vdev, u32 *pt_id)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(core_vdev,
+			struct ice_vfio_pci_core_device, core_device.vdev);
+
+	ice_vdev->pt_id = *pt_id;
+	return vfio_iommufd_physical_attach_ioas(core_vdev, pt_id);
+}
+
+static int ice_vfio_pci_bind(struct vfio_device *core_vdev,
+			     struct iommufd_ctx *ictx, u32 *out_device_id)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(core_vdev,
+			struct ice_vfio_pci_core_device, core_device.vdev);
+	int ret;
+
+	ice_vdev->ictx = ictx;
+	iommufd_ctx_get(ictx);
+
+	ret = vfio_iommufd_physical_bind(core_vdev, ictx, out_device_id);
+	if (ret)
+		iommufd_ctx_put(ictx);
+
+	return ret;
+}
+
+static void ice_vfio_pci_unbind(struct vfio_device *core_vdev)
+{
+	struct ice_vfio_pci_core_device *ice_vdev = container_of(core_vdev,
+			struct ice_vfio_pci_core_device, core_device.vdev);
+
+	vfio_iommufd_physical_unbind(core_vdev);
+
+	iommufd_ctx_put(ice_vdev->ictx);
+	ice_vdev->ictx = NULL;
+}
+
 static const struct vfio_device_ops ice_vfio_pci_ops = {
 	.name		= "ice-vfio-pci",
 	.init		= ice_vfio_pci_core_init_dev,
@@ -757,9 +914,9 @@ static const struct vfio_device_ops ice_vfio_pci_ops = {
 	.mmap		= vfio_pci_core_mmap,
 	.request	= vfio_pci_core_request,
 	.match		= vfio_pci_core_match,
-	.bind_iommufd	= vfio_iommufd_physical_bind,
-	.unbind_iommufd	= vfio_iommufd_physical_unbind,
-	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+	.bind_iommufd	= ice_vfio_pci_bind,
+	.unbind_iommufd	= ice_vfio_pci_unbind,
+	.attach_ioas	= ice_vfio_pci_attach_ioas,
 };
 
 /**
diff --git a/include/linux/net/intel/ice_migration.h b/include/linux/net/intel/ice_migration.h
index 45c3469df55d..f97ed6940afd 100644
--- a/include/linux/net/intel/ice_migration.h
+++ b/include/linux/net/intel/ice_migration.h
@@ -7,6 +7,8 @@
 
 #if IS_ENABLED(CONFIG_ICE_VFIO_PCI)
 
+typedef int (*dma_rw_handler_t)(void *data, dma_addr_t iova, void *buf,
+				size_t len, bool write);
 #define IAVF_QRX_TAIL_MAX 256
 #define QTX_HEAD_RESTORE_DELAY_MAX 100
 #define QTX_HEAD_RESTORE_DELAY_SLEEP_US_MIN 10
@@ -19,7 +21,7 @@ void ice_migration_uninit_vf(void *opaque);
 int ice_migration_suspend_vf(void *opaque, bool mig_dst);
 int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz);
 int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
-				   struct vfio_device *vdev);
+				   dma_rw_handler_t handler, void *data);
 
 #else
 static inline void *ice_migration_get_vf(struct pci_dev *vf_pdev)
-- 
2.25.1

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-next V2 04/15] ice: add migration init field and helper functions
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 04/15] ice: add migration init field and helper functions Lingyu Liu
@ 2023-06-21 13:35   ` Jason Gunthorpe
  2023-06-27  7:50     ` Cao, Yahui
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-21 13:35 UTC (permalink / raw)
  To: Lingyu Liu; +Cc: kevin.tian, yi.l.liu, intel-wired-lan, phani.r.burra

On Wed, Jun 21, 2023 at 09:11:01AM +0000, Lingyu Liu wrote:
> Adds a function to get ice VF device from pci device.
> Adds a field in VF structure to indicate migration init state,
> and functions to init and uninit migration.
> 
> This will be used by ice_vfio_pci driver introduced in coming patches
> from this series.
> 
> Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
> Signed-off-by: Yahui Cao <yahui.cao@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/Makefile       |  1 +
>  drivers/net/ethernet/intel/ice/ice.h          |  1 +
>  .../net/ethernet/intel/ice/ice_migration.c    | 68 +++++++++++++++++++
>  drivers/net/ethernet/intel/ice/ice_vf_lib.c   |  7 ++
>  drivers/net/ethernet/intel/ice/ice_vf_lib.h   |  1 +
>  include/linux/net/intel/ice_migration.h       | 26 +++++++
>  6 files changed, 104 insertions(+)
>  create mode 100644 drivers/net/ethernet/intel/ice/ice_migration.c
>  create mode 100644 include/linux/net/intel/ice_migration.h
> 
> diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
> index 960277d78e09..915b70588f79 100644
> --- a/drivers/net/ethernet/intel/ice/Makefile
> +++ b/drivers/net/ethernet/intel/ice/Makefile
> @@ -49,3 +49,4 @@ ice-$(CONFIG_RFS_ACCEL) += ice_arfs.o
>  ice-$(CONFIG_XDP_SOCKETS) += ice_xsk.o
>  ice-$(CONFIG_ICE_SWITCHDEV) += ice_eswitch.o ice_eswitch_br.o
>  ice-$(CONFIG_GNSS) += ice_gnss.o
> +ice-$(CONFIG_ICE_VFIO_PCI) += ice_migration.o
> diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> index 9109006336f0..ec7f27d93924 100644
> --- a/drivers/net/ethernet/intel/ice/ice.h
> +++ b/drivers/net/ethernet/intel/ice/ice.h
> @@ -55,6 +55,7 @@
>  #include <net/vxlan.h>
>  #include <net/gtp.h>
>  #include <linux/ppp_defs.h>
> +#include <linux/net/intel/ice_migration.h>
>  #include "ice_devids.h"
>  #include "ice_type.h"
>  #include "ice_txrx.h"
> diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
> new file mode 100644
> index 000000000000..1aadb8577a41
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/ice/ice_migration.c
> @@ -0,0 +1,68 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2018-2023 Intel Corporation */
> +
> +#include "ice.h"
> +
> +/**
> + * ice_migration_get_vf - Get ice VF structure pointer by pdev
> + * @vf_pdev: pointer to ice vfio pci VF pdev structure
> + *
> + * Return nonzero for success, NULL for failure.
> + *
> + * ice_put_vf() should be called after finishing accessing VF
> + */
> +void *ice_migration_get_vf(struct pci_dev *vf_pdev)
> +{
> +	struct pci_dev *pf_pdev = vf_pdev->physfn;
> +	int vf_id = pci_iov_vf_id(vf_pdev);
> +	struct ice_pf *pf;
> +
> +	if (!pf_pdev || vf_id < 0)
> +		return NULL;
> +
> +	pf = pci_get_drvdata(pf_pdev);
> +	return ice_get_vf_by_id(pf, vf_id);
> +}
> +EXPORT_SYMBOL(ice_migration_get_vf);

This doesn't look right, you shouldn't need functions like this.

The VF knows itself, and it goes back to the PF safely from the VFIO
code. You should not be doing things like 'vf_pdev->physfn'.

Look at how the other drivers are structured.
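
For reference, the pattern the existing variant drivers use looks
roughly like the sketch below; reaching the PF's pci_driver (assumed
here to be an exported 'ice_driver') is not something this series
provides yet:

	struct ice_pf *pf;

	/* pci_iov_get_pf_drvdata() checks that the PF is bound to the
	 * expected driver before handing back its drvdata.
	 */
	pf = pci_iov_get_pf_drvdata(vf_pdev, &ice_driver);
	if (IS_ERR(pf))
		return NULL;

	return ice_get_vf_by_id(pf, pci_iov_vf_id(vf_pdev));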

Jason
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-next V2 13/15] vfio/ice: implement vfio_pci driver for E800 devices
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 13/15] vfio/ice: implement vfio_pci driver for E800 devices Lingyu Liu
@ 2023-06-21 14:23   ` Jason Gunthorpe
  2023-06-27  9:00     ` Liu, Lingyu
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-21 14:23 UTC (permalink / raw)
  To: Lingyu Liu; +Cc: kevin.tian, yi.l.liu, intel-wired-lan, phani.r.burra

On Wed, Jun 21, 2023 at 09:11:10AM +0000, Lingyu Liu wrote:

> +static struct file *
> +ice_vfio_pci_step_device_state_locked(struct ice_vfio_pci_core_device *ice_vdev,
> +				      u32 new, u32 final)
> +{
> +	struct device *dev = &ice_vdev->core_device.pdev->dev;
> +	u32 cur = ice_vdev->mig_state;
> +	int ret;
> +
> +	if (final == VFIO_DEVICE_STATE_RESUMING)
> +		ice_vdev->is_dst = true;
> +	else if (final == VFIO_DEVICE_STATE_STOP)
> +		ice_vdev->is_dst = false;

Definitely not. The kernel should not be guessing which end is which,
the protocol already makes it clear.

> +
> +	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) {
> +		if (!ice_vdev->is_dst)
> +			dev_info(dev, "Live migration begins\n");
> +		ice_migration_suspend_vf(ice_vdev->vf_handle, ice_vdev->is_dst);
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
> +		struct ice_vfio_pci_migration_file *migf;
> +
> +		migf = ice_vfio_pci_stop_copy(ice_vdev);
> +		if (IS_ERR(migf))
> +			return ERR_CAST(migf);
> +		get_file(migf->filp);
> +		ice_vdev->saving_migf = migf;
> +		return migf->filp;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
> +		ice_vfio_pci_disable_fds(ice_vdev);
> +		dev_info(dev, "Live migration ends\n");
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
> +		struct ice_vfio_pci_migration_file *migf;
> +
> +		migf = ice_vfio_pci_resume(ice_vdev);
> +		if (IS_ERR(migf))
> +			return ERR_CAST(migf);
> +		get_file(migf->filp);
> +		ice_vdev->resuming_migf = migf;
> +		return migf->filp;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
> +		ret = ice_vfio_pci_load_state(ice_vdev);
> +		if (ret)
> +			return ERR_PTR(ret);
> +		ice_vfio_pci_disable_fds(ice_vdev);
> +		return NULL;
> +	}
> +
> +	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING)
> +		return NULL;

Lack of P2P is going to be a problem too

Jason
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX " Lingyu Liu
@ 2023-06-21 14:37   ` Jason Gunthorpe
  2023-06-27  6:55       ` Tian, Kevin
  2023-06-28  8:11     ` Liu, Yi L
  0 siblings, 2 replies; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-21 14:37 UTC (permalink / raw)
  To: Lingyu Liu; +Cc: kevin.tian, yi.l.liu, intel-wired-lan, phani.r.burra

On Wed, Jun 21, 2023 at 09:11:07AM +0000, Lingyu Liu wrote:
> diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
> index 2579bc0bd193..c2a83a97af05 100644
> --- a/drivers/net/ethernet/intel/ice/ice_migration.c
> +++ b/drivers/net/ethernet/intel/ice/ice_migration.c

> +static int
> +ice_migration_restore_tx_head(struct ice_vf *vf,
> +			      struct ice_migration_dev_state *devstate,
> +			      struct vfio_device *vdev)
> +{
> +	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
> +	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
> +	struct ice_pf *pf = vf->pf;
> +	u16 max_ring_len = 0;
> +	struct device *dev;
> +	int ret = 0;
> +	int i = 0;
> +
> +	dev = ice_pf_to_dev(vf->pf);
> +
> +	if (!vsi) {
> +		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
> +		return -EINVAL;
> +	}
> +
> +	ice_for_each_txq(vsi, i) {
> +		if (!test_bit(i, vf->txq_ena))
> +			continue;
> +
> +		max_ring_len = max(vsi->tx_rings[i]->count, max_ring_len);
> +	}
> +
> +	if (max_ring_len == 0)
> +		return 0;
> +
> +	tx_desc = (struct ice_tx_desc *)kcalloc
> +		  (max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
> +	tx_desc_dummy = (struct ice_tx_desc *)kcalloc
> +			(max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
> +	if (!tx_desc || !tx_desc_dummy) {
> +		dev_err(dev, "VF %d failed to allocate memory for tx descriptors to restore tx head\n",
> +			vf->vf_id);
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	for (i = 0; i < max_ring_len; i++) {
> +		u32 td_cmd;
> +
> +		td_cmd = ICE_TXD_LAST_DESC_CMD | ICE_TX_DESC_CMD_DUMMY;
> +		tx_desc_dummy[i].cmd_type_offset_bsz =
> +					ice_build_ctob(td_cmd, 0, SZ_256, 0);
> +	}
> +
> +	/* For each tx queue, we restore the tx head following below steps:
> +	 * 1. backup original tx ring descriptor memory
> +	 * 2. overwrite the tx ring descriptor with dummy packets
> +	 * 3. kick doorbell register to trigger descriptor writeback,
> +	 *    then tx head will move from 0 to tail - 1 and tx head is restored
> +	 *    to the place we expect.
> +	 * 4. restore the tx ring with original tx ring descriptor memory in
> +	 *    order not to corrupt the ring context.
> +	 */
> +	ice_for_each_txq(vsi, i) {
> +		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
> +		u16 *tx_heads = devstate->tx_head;
> +		u32 tx_head;
> +		int j;
> +
> +		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
> +			continue;
> +
> +		if (tx_heads[i] >= tx_ring->count) {
> +			dev_err(dev, "saved tx ring head exceeds tx ring count\n");
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
> +				  tx_ring->count * sizeof(tx_desc[0]), false);
> +		if (ret) {
> +			dev_err(dev, "kvm read guest tx ring error: %d\n",
> +				ret);
> +			goto err;

You can't call VFIO functions from a netdev driver. All this code
needs to be moved into the variant driver.

This design seems pretty wild to me, it doesn't seem too robust
against a hostile VM - eg these DMAs can all fail under guest control,
and then what?

We also don't have any guarantees defined for the VFIO protocol about
what state the vIOMMU will be in prior to reaching RUNNING.

IDK, all of this looks like it is trying really hard to hackily force
HW that was never meant to support live migration to somehow do
something that looks like it.

You really need to present an explanation in the VFIO driver comments
about how this whole scheme actually works and is secure and
functional against a hostile guest.

Jason

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-next V2 15/15] vfio/ice: support iommufd vfio compat mode
  2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 15/15] vfio/ice: support iommufd vfio compat mode Lingyu Liu
@ 2023-06-21 14:40   ` Jason Gunthorpe
  2023-06-27  8:09     ` Cao, Yahui
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-21 14:40 UTC (permalink / raw)
  To: Lingyu Liu; +Cc: kevin.tian, yi.l.liu, intel-wired-lan, phani.r.burra

On Wed, Jun 21, 2023 at 09:11:12AM +0000, Lingyu Liu wrote:
> From: Yahui Cao <yahui.cao@intel.com>
> 
> In iommufd vfio compat mode, vfio_dma_rw() will return failure, since
> vfio_device_has_container() returns false and device->iommufd_access is
> NULL.
> 
> Currently device->iommufd_access will not be created if vfio device is
> backed by pci device. To support IOVA access, manually create
> iommufd_access context by iommufd_access_create/attach() and access IOVA
> by iommufd_access_rw(). And in order to minimize the iommufd_access's
> impact, store the iommufd_access context in driver data, create it only
> before loading the device state and destroy it once finishing loading
> the device state.
> 
> To be compatible with legacy vfio, use vfio_device_has_container() to
> check the vfio uAPI. If in legacy vfio mode, call vfio_dma_rw()
> directly, otherwise call iommufd_access_rw().

This is not the right approach, you should create your access by
overloading the iommufd ops. Nak on exposing vfio_device_has_container().

> +/**
> + * ice_vfio_pci_emulated_unmap - callback to unmap IOVA
> + * @data: function handler data
> + * @iova: I/O virtuall address
> + * @len: IOVA length
> + *
> + * This function is called when application are doing DMA unmap and in some
> + * cases driver needs to explicitly do some unmap ops if this device does not
> + * have backed iommu. Nothing is required here since this is pci baseed vfio
> + * device, which has backed iommu.
> + */
> +static void
> +ice_vfio_pci_emulated_unmap(void *data, unsigned long iova, unsigned long len)
> +{
> +}
> +
> +static const struct iommufd_access_ops ice_vfio_user_ops = {
> +	.needs_pin_pages = 1,
> +	.unmap = ice_vfio_pci_emulated_unmap,
> +};

If you don't call pin pages then you shouldn't set needs_pin_pages?

An empty unmap op is unconditionally wrong.

> + * ice_vfio_dma_rw - read/write function for device IOVA address space
> + * @data: function handler data
> + * @iova: I/O virtuall address
> + * @buf: buffer for read/write access
> + * @len: buffer length
> + * @write: true for write, false for read
> + *
> + * Read/write function for device IOVA access. Since vfio_dma_rw() may fail
> + * at iommufd vfio compatiable mode, we need runtime check what uAPI it is
> + * using and use corresponding access method for IOVA access.
> + *
> + * Return 0 for success, negative value for failure.
> + */
> +static int ice_vfio_dma_rw(void *data, dma_addr_t iova,
> +			   void *buf, size_t len, bool write)
> +{
> +	struct ice_vfio_pci_core_device *ice_vdev =
> +			(struct ice_vfio_pci_core_device *)data;
> +	struct vfio_device *vdev = &ice_vdev->core_device.vdev;
> +	unsigned int flags = 0;
> +
> +	if (vfio_device_has_container(vdev))
> +		return vfio_dma_rw(vdev, iova, buf, len, write);
> +
> +	if (!current->mm)
> +		flags |= IOMMUFD_ACCESS_RW_KTHREAD;

No, you need to know your own calling context, you can't guess like this.

I suppose this is always called from an ioctl?

> @@ -19,7 +21,7 @@ void ice_migration_uninit_vf(void *opaque);
>  int ice_migration_suspend_vf(void *opaque, bool mig_dst);
>  int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz);
>  int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
> -				   struct vfio_device *vdev);
> +				   dma_rw_handler_t handler, void *data);

Please remove all the wild function pointers and void * opaques I see
in this driver. Use proper types and get your layering right so you
don't have to fake up improper cross-layer calls like this.

Jason
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-21 14:37   ` Jason Gunthorpe
@ 2023-06-27  6:55       ` Tian, Kevin
  2023-06-28  8:11     ` Liu, Yi L
  1 sibling, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2023-06-27  6:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Lingyu
  Cc: intel-wired-lan, Liu, Yi L, Burra, Phani R, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 21, 2023 10:37 PM
> 
> On Wed, Jun 21, 2023 at 09:11:07AM +0000, Lingyu Liu wrote:
> > diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c
> b/drivers/net/ethernet/intel/ice/ice_migration.c
> > index 2579bc0bd193..c2a83a97af05 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_migration.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_migration.c
> 
> > +static int
> > +ice_migration_restore_tx_head(struct ice_vf *vf,
> > +			      struct ice_migration_dev_state *devstate,
> > +			      struct vfio_device *vdev)
> > +{
> > +	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
> > +	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
> > +	struct ice_pf *pf = vf->pf;
> > +	u16 max_ring_len = 0;
> > +	struct device *dev;
> > +	int ret = 0;
> > +	int i = 0;
> > +
> > +	dev = ice_pf_to_dev(vf->pf);
> > +
> > +	if (!vsi) {
> > +		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
> > +		return -EINVAL;
> > +	}
> > +
> > +	ice_for_each_txq(vsi, i) {
> > +		if (!test_bit(i, vf->txq_ena))
> > +			continue;
> > +
> > +		max_ring_len = max(vsi->tx_rings[i]->count, max_ring_len);
> > +	}
> > +
> > +	if (max_ring_len == 0)
> > +		return 0;
> > +
> > +	tx_desc = (struct ice_tx_desc *)kcalloc
> > +		  (max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
> > +	tx_desc_dummy = (struct ice_tx_desc *)kcalloc
> > +			(max_ring_len, sizeof(struct ice_tx_desc),
> GFP_KERNEL);
> > +	if (!tx_desc || !tx_desc_dummy) {
> > +		dev_err(dev, "VF %d failed to allocate memory for tx
> descriptors to restore tx head\n",
> > +			vf->vf_id);
> > +		ret = -ENOMEM;
> > +		goto err;
> > +	}
> > +
> > +	for (i = 0; i < max_ring_len; i++) {
> > +		u32 td_cmd;
> > +
> > +		td_cmd = ICE_TXD_LAST_DESC_CMD |
> ICE_TX_DESC_CMD_DUMMY;
> > +		tx_desc_dummy[i].cmd_type_offset_bsz =
> > +					ice_build_ctob(td_cmd, 0, SZ_256, 0);
> > +	}
> > +
> > +	/* For each tx queue, we restore the tx head following below steps:
> > +	 * 1. backup original tx ring descriptor memory
> > +	 * 2. overwrite the tx ring descriptor with dummy packets
> > +	 * 3. kick doorbell register to trigger descriptor writeback,
> > +	 *    then tx head will move from 0 to tail - 1 and tx head is restored
> > +	 *    to the place we expect.
> > +	 * 4. restore the tx ring with original tx ring descriptor memory in
> > +	 *    order not to corrupt the ring context.
> > +	 */
> > +	ice_for_each_txq(vsi, i) {
> > +		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
> > +		u16 *tx_heads = devstate->tx_head;
> > +		u32 tx_head;
> > +		int j;
> > +
> > +		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
> > +			continue;
> > +
> > +		if (tx_heads[i] >= tx_ring->count) {
> > +			dev_err(dev, "saved tx ring head exceeds tx ring
> count\n");
> > +			ret = -EINVAL;
> > +			goto err;
> > +		}
> > +		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
> > +				  tx_ring->count * sizeof(tx_desc[0]), false);
> > +		if (ret) {
> > +			dev_err(dev, "kvm read guest tx ring error: %d\n",
> > +				ret);
> > +			goto err;
> 
> You can't call VFIO functions from a netdev driver. All this code
> needs to be moved into the variant driver.
> 
> This design seems pretty wild to me, it doesn't seem too robust
> against a hostile VM - eg these DMAs can all fail under guest control,
> and then what?

Yeah that sounds fragile.

At least the range which will be overwritten in the resuming path should
be verified on the src side. If it is inaccessible then the driver should
fail the state transition immediately instead of letting it be identified
in the resuming path, which is unrecoverable.
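
(One possible shape of that pre-check, sketched against helpers already
quoted in this series -- a throwaway read of each enabled ring while
saving the device state, so that an inaccessible ring fails the migration
on the source; the 'scratch' buffer is assumed to be allocated elsewhere:)

	ice_for_each_txq(vsi, i) {
		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];

		if (!test_bit(i, vf->txq_ena))
			continue;

		/* Fail the save early if the guest-controlled ring memory
		 * cannot be reached through the IOMMU.
		 */
		ret = vfio_dma_rw(vdev, tx_ring->dma, scratch,
				  tx_ring->count * sizeof(struct ice_tx_desc),
				  false);
		if (ret)
			return ret;
	}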

btw I don't know how its spec describes the hw behavior in such a
situation. If the behavior is undefined when hostile software deliberately
causes DMA failures on the TX queue, then not restoring the queue head
could also be an option to continue the migration in that scenario.

> 
> We also don't have any guarantees defined for the VFIO protocol about
> what state the vIOMMU will be in prior to reaching RUNNING.

This is a good point. Actually it's not just a gap on the vIOMMU side;
it's a dependency on IOMMUFD, no matter whether the IOAS which the
migrated device is currently attached to holds GPA or GIOVA. The device
state can be restored only after IOMMUFD is fully recovered and the
device is re-attached to the IOAS.

Need a way for the migration driver to communicate such a dependency to
the user.

> 
> IDK, all of this looks like it is trying really hard to hackily force
> HW that was never meant to support live migration to somehow do
> something that looks like it.
> 
> You really need to present an explanation in the VFIO driver comments
> about how this whole scheme actually works and is secure and
> functional against a hostile guest.
> 

Agree. And please post the next version to the VFIO community to gain
more attention.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-next V2 04/15] ice: add migration init field and helper functions
  2023-06-21 13:35   ` Jason Gunthorpe
@ 2023-06-27  7:50     ` Cao, Yahui
  0 siblings, 0 replies; 40+ messages in thread
From: Cao, Yahui @ 2023-06-27  7:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Lingyu Liu
  Cc: kevin.tian, yi.l.liu, intel-wired-lan, phani.r.burra


On 6/21/2023 9:35 PM, Jason Gunthorpe wrote:
> On Wed, Jun 21, 2023 at 09:11:01AM +0000, Lingyu Liu wrote:
>> Adds a function to get ice VF device from pci device.
>> Adds a field in VF structure to indicate migration init state,
>> and functions to init and uninit migration.
>>
>> This will be used by ice_vfio_pci driver introduced in coming patches
>> from this series.
>>
>> Signed-off-by: Lingyu Liu <lingyu.liu@intel.com>
>> Signed-off-by: Yahui Cao <yahui.cao@intel.com>
>> ---
>>   drivers/net/ethernet/intel/ice/Makefile       |  1 +
>>   drivers/net/ethernet/intel/ice/ice.h          |  1 +
>>   .../net/ethernet/intel/ice/ice_migration.c    | 68 +++++++++++++++++++
>>   drivers/net/ethernet/intel/ice/ice_vf_lib.c   |  7 ++
>>   drivers/net/ethernet/intel/ice/ice_vf_lib.h   |  1 +
>>   include/linux/net/intel/ice_migration.h       | 26 +++++++
>>   6 files changed, 104 insertions(+)
>>   create mode 100644 drivers/net/ethernet/intel/ice/ice_migration.c
>>   create mode 100644 include/linux/net/intel/ice_migration.h
>>
>> diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
>> index 960277d78e09..915b70588f79 100644
>> --- a/drivers/net/ethernet/intel/ice/Makefile
>> +++ b/drivers/net/ethernet/intel/ice/Makefile
>> @@ -49,3 +49,4 @@ ice-$(CONFIG_RFS_ACCEL) += ice_arfs.o
>>   ice-$(CONFIG_XDP_SOCKETS) += ice_xsk.o
>>   ice-$(CONFIG_ICE_SWITCHDEV) += ice_eswitch.o ice_eswitch_br.o
>>   ice-$(CONFIG_GNSS) += ice_gnss.o
>> +ice-$(CONFIG_ICE_VFIO_PCI) += ice_migration.o
>> diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
>> index 9109006336f0..ec7f27d93924 100644
>> --- a/drivers/net/ethernet/intel/ice/ice.h
>> +++ b/drivers/net/ethernet/intel/ice/ice.h
>> @@ -55,6 +55,7 @@
>>   #include <net/vxlan.h>
>>   #include <net/gtp.h>
>>   #include <linux/ppp_defs.h>
>> +#include <linux/net/intel/ice_migration.h>
>>   #include "ice_devids.h"
>>   #include "ice_type.h"
>>   #include "ice_txrx.h"
>> diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
>> new file mode 100644
>> index 000000000000..1aadb8577a41
>> --- /dev/null
>> +++ b/drivers/net/ethernet/intel/ice/ice_migration.c
>> @@ -0,0 +1,68 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2018-2023 Intel Corporation */
>> +
>> +#include "ice.h"
>> +
>> +/**
>> + * ice_migration_get_vf - Get ice VF structure pointer by pdev
>> + * @vf_pdev: pointer to ice vfio pci VF pdev structure
>> + *
>> + * Return the VF pointer on success, NULL on failure.
>> + *
>> + * ice_put_vf() should be called after finishing accessing VF
>> + */
>> +void *ice_migration_get_vf(struct pci_dev *vf_pdev)
>> +{
>> +	struct pci_dev *pf_pdev = vf_pdev->physfn;
>> +	int vf_id = pci_iov_vf_id(vf_pdev);
>> +	struct ice_pf *pf;
>> +
>> +	if (!pf_pdev || vf_id < 0)
>> +		return NULL;
>> +
>> +	pf = pci_get_drvdata(pf_pdev);
>> +	return ice_get_vf_by_id(pf, vf_id);
>> +}
>> +EXPORT_SYMBOL(ice_migration_get_vf);
> This doesn't look right, you shouldn't need functions like this.
>
> The VF knows itself, and it goes back to the PF safely from the VFIO
> code. You should not be doing things like 'vf_pdev->physfn'
>
> Look at how the other drivers are structured.
>
> Jason

I'll change how we go back to the PF, following what other drivers do.
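
For reference, a minimal sketch of the lookup pattern other variant drivers
use (going through pci_iov_get_pf_drvdata() instead of dereferencing
vf_pdev->physfn). It assumes the ice PF driver's struct pci_driver is
reachable from this helper; illustrative only, not the actual next revision:

struct ice_vf *ice_migration_get_vf(struct pci_dev *vf_pdev)
{
	struct ice_pf *pf;
	int vf_id;

	/* Resolves the parent PF's drvdata and checks that the PF is really
	 * bound to the expected driver; returns ERR_PTR() otherwise.
	 */
	pf = pci_iov_get_pf_drvdata(vf_pdev, &ice_driver);
	if (IS_ERR(pf))
		return NULL;

	vf_id = pci_iov_vf_id(vf_pdev);
	if (vf_id < 0)
		return NULL;

	return ice_get_vf_by_id(pf, vf_id);
}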

Thanks.
Yahui.


* Re: [Intel-wired-lan] [PATCH iwl-next V2 15/15] vfio/ice: support iommufd vfio compat mode
  2023-06-21 14:40   ` Jason Gunthorpe
@ 2023-06-27  8:09     ` Cao, Yahui
  0 siblings, 0 replies; 40+ messages in thread
From: Cao, Yahui @ 2023-06-27  8:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Lingyu Liu
  Cc: kevin.tian, yi.l.liu, intel-wired-lan, phani.r.burra

Hi Jason,

On 6/21/2023 10:40 PM, Jason Gunthorpe wrote:
> On Wed, Jun 21, 2023 at 09:11:12AM +0000, Lingyu Liu wrote:
>> From: Yahui Cao <yahui.cao@intel.com>
>>
>> In iommufd vfio compat mode, vfio_dma_rw() will return failure, since
>> vfio_device_has_container() returns false and device->iommufd_access is
>> NULL.
>>
>> Currently device->iommufd_access will not be created if vfio device is
>> backed by pci device. To support IOVA access, manually create
>> iommufd_access context by iommufd_access_create/attach() and access IOVA
>> by iommufd_access_rw(). And in order to minimize the iommufd_access's
>> impact, store the iommufd_access context in driver data, create it only
>> before loading the device state and destroy it once finishing loading
>> the device state.
>>
>> To be compatible with legacy vfio, use vfio_device_has_container() to
>> check the vfio uAPI. If in legacy vfio mode, call vfio_dma_rw()
>> directly, otherwise call iommufd_access_rw().
> This is not the right approach, you should create your access by
> overloading the iommufd ops. Nak on exposing vfio_device_has_container
Could you explain a little bit more about "create your access by
overloading the iommufd ops"?

>> +/**
>> + * ice_vfio_pci_emulated_unmap - callback to unmap IOVA
>> + * @data: function handler data
>> + * @iova: I/O virtual address
>> + * @len: IOVA length
>> + *
>> + * This function is called when the application is doing DMA unmap and in
>> + * some cases the driver needs to explicitly do some unmap ops if this
>> + * device does not have a backing IOMMU. Nothing is required here since
>> + * this is a PCI-based vfio device, which has a backing IOMMU.
>> + */
>> +static void
>> +ice_vfio_pci_emulated_unmap(void *data, unsigned long iova, unsigned long len)
>> +{
>> +}
>> +
>> +static const struct iommufd_access_ops ice_vfio_user_ops = {
>> +	.needs_pin_pages = 1,
>> +	.unmap = ice_vfio_pci_emulated_unmap,
>> +};
> If you don't call pin pages then you shouldn't set needs_pin_pages?
>
> An empty unmap op is unconditionally wrong.
Will change in next version.
>
>> + * ice_vfio_dma_rw - read/write function for device IOVA address space
>> + * @data: function handler data
>> + * @iova: I/O virtual address
>> + * @buf: buffer for read/write access
>> + * @len: buffer length
>> + * @write: true for write, false for read
>> + *
>> + * Read/write function for device IOVA access. Since vfio_dma_rw() may fail
>> + * in iommufd vfio compatible mode, we need to check at runtime which uAPI
>> + * is in use and pick the corresponding access method for IOVA access.
>> + *
>> + * Return 0 for success, negative value for failure.
>> + */
>> +static int ice_vfio_dma_rw(void *data, dma_addr_t iova,
>> +			   void *buf, size_t len, bool write)
>> +{
>> +	struct ice_vfio_pci_core_device *ice_vdev =
>> +			(struct ice_vfio_pci_core_device *)data;
>> +	struct vfio_device *vdev = &ice_vdev->core_device.vdev;
>> +	unsigned int flags = 0;
>> +
>> +	if (vfio_device_has_container(vdev))
>> +		return vfio_dma_rw(vdev, iova, buf, len, write);
>> +
>> +	if (!current->mm)
>> +		flags |= IOMMUFD_ACCESS_RW_KTHREAD;
> No, you need to know your own calling context, you can't guess like this.
>
> I suppose this is always called from an ioctl?
Yes. This is always called from ioctl. Will remove this calling context 
check.
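
Just for illustration (and setting aside the larger question of whether the
driver should create its own iommufd_access at all), a minimal sketch of the
helper once the current->mm guess is dropped; the iommufd_access field name
is an assumption:

static int ice_vfio_dma_rw(struct ice_vfio_pci_core_device *ice_vdev,
			   dma_addr_t iova, void *buf, size_t len, bool write)
{
	unsigned int flags = write ? IOMMUFD_ACCESS_RW_WRITE : 0;

	/* Only reached from the migration ioctl, i.e. task context, so
	 * IOMMUFD_ACCESS_RW_KTHREAD is never needed here.
	 */
	return iommufd_access_rw(ice_vdev->iommufd_access, iova, buf, len,
				 flags);
}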
>> @@ -19,7 +21,7 @@ void ice_migration_uninit_vf(void *opaque);
>>   int ice_migration_suspend_vf(void *opaque, bool mig_dst);
>>   int ice_migration_save_devstate(void *opaque, u8 *buf, u64 buf_sz);
>>   int ice_migration_restore_devstate(void *opaque, const u8 *buf, u64 buf_sz,
>> -				   struct vfio_device *vdev);
>> +				   dma_rw_handler_t handler, void *data);
> Please remove all the wild function pointers and void * opaques I see
> in this driver. Use proper types and get your layering right so you
> don't have to fake up improper cross-layer calls like this.
>
> Jason

Will refactor the API with proper types.
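
Roughly what the exported interface could look like once the void * opaques
and the dma_rw callback are gone (prototypes only; a hedged sketch rather
than the actual next revision, with any guest-memory access the restore path
needs performed by the vfio variant driver before calling in):

struct ice_vf *ice_migration_get_vf(struct pci_dev *vf_pdev);
void ice_migration_put_vf(struct ice_vf *vf);
int ice_migration_init_vf(struct ice_vf *vf);
void ice_migration_uninit_vf(struct ice_vf *vf);
int ice_migration_suspend_vf(struct ice_vf *vf);
int ice_migration_save_devstate(struct ice_vf *vf, u8 *buf, u64 buf_sz);
int ice_migration_restore_devstate(struct ice_vf *vf, const u8 *buf, u64 buf_sz);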

Thanks.
Yahui.


* Re: [Intel-wired-lan] [PATCH iwl-next V2 13/15] vfio/ice: implement vfio_pci driver for E800 devices
  2023-06-21 14:23   ` Jason Gunthorpe
@ 2023-06-27  9:00     ` Liu, Lingyu
  0 siblings, 0 replies; 40+ messages in thread
From: Liu, Lingyu @ 2023-06-27  9:00 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kevin.tian, yi.l.liu, intel-wired-lan, phani.r.burra


On 6/21/2023 10:23 PM, Jason Gunthorpe wrote:
> On Wed, Jun 21, 2023 at 09:11:10AM +0000, Lingyu Liu wrote:
>
>> +static struct file *
>> +ice_vfio_pci_step_device_state_locked(struct ice_vfio_pci_core_device *ice_vdev,
>> +				      u32 new, u32 final)
>> +{
>> +	struct device *dev = &ice_vdev->core_device.pdev->dev;
>> +	u32 cur = ice_vdev->mig_state;
>> +	int ret;
>> +
>> +	if (final == VFIO_DEVICE_STATE_RESUMING)
>> +		ice_vdev->is_dst = true;
>> +	else if (final == VFIO_DEVICE_STATE_STOP)
>> +		ice_vdev->is_dst = false;
> Definitely not. The kernel should not be guessing which end is which,
> the protocol already makes it clear.

Hi Jason, thanks for your review. This will be changed in next revision.

>> +
>> +	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) {
>> +		if (!ice_vdev->is_dst)
>> +			dev_info(dev, "Live migration begins\n");
>> +		ice_migration_suspend_vf(ice_vdev->vf_handle, ice_vdev->is_dst);
>> +		return NULL;
>> +	}
>> +
>> +	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
>> +		struct ice_vfio_pci_migration_file *migf;
>> +
>> +		migf = ice_vfio_pci_stop_copy(ice_vdev);
>> +		if (IS_ERR(migf))
>> +			return ERR_CAST(migf);
>> +		get_file(migf->filp);
>> +		ice_vdev->saving_migf = migf;
>> +		return migf->filp;
>> +	}
>> +
>> +	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
>> +		ice_vfio_pci_disable_fds(ice_vdev);
>> +		dev_info(dev, "Live migration ends\n");
>> +		return NULL;
>> +	}
>> +
>> +	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
>> +		struct ice_vfio_pci_migration_file *migf;
>> +
>> +		migf = ice_vfio_pci_resume(ice_vdev);
>> +		if (IS_ERR(migf))
>> +			return ERR_CAST(migf);
>> +		get_file(migf->filp);
>> +		ice_vdev->resuming_migf = migf;
>> +		return migf->filp;
>> +	}
>> +
>> +	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
>> +		ret = ice_vfio_pci_load_state(ice_vdev);
>> +		if (ret)
>> +			return ERR_PTR(ret);
>> +		ice_vfio_pci_disable_fds(ice_vdev);
>> +		return NULL;
>> +	}
>> +
>> +	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING)
>> +		return NULL;
> Lack of P2P is going to be a problem too

Will add P2P support in next revision. Thanks.
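
For readers unfamiliar with the P2P arcs: supporting them roughly means
advertising VFIO_MIGRATION_P2P alongside VFIO_MIGRATION_STOP_COPY and
handling the RUNNING <-> RUNNING_P2P <-> STOP transitions in the same arc
table, along the lines of the sketch below (modelled on the documented v2
protocol, not on this series; the quiesce/suspend helpers are placeholders):

	if (cur == VFIO_DEVICE_STATE_RUNNING &&
	    new == VFIO_DEVICE_STATE_RUNNING_P2P) {
		/* Stop the VF from generating outbound DMA/P2P traffic
		 * while leaving it otherwise operational.
		 */
		ret = ice_migration_quiesce_vf(ice_vdev->vf_handle); /* placeholder */
		if (ret)
			return ERR_PTR(ret);
		return NULL;
	}

	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P &&
	    new == VFIO_DEVICE_STATE_STOP) {
		/* Fully stop the device before the device state is read. */
		ret = ice_migration_suspend_vf(ice_vdev->vf_handle); /* placeholder */
		if (ret)
			return ERR_PTR(ret);
		return NULL;
	}

	if (cur == VFIO_DEVICE_STATE_STOP &&
	    new == VFIO_DEVICE_STATE_RUNNING_P2P)
		return NULL;

	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P &&
	    new == VFIO_DEVICE_STATE_RUNNING)
		return NULL;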

>
> Jason

* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-21 14:37   ` Jason Gunthorpe
  2023-06-27  6:55       ` Tian, Kevin
@ 2023-06-28  8:11     ` Liu, Yi L
  2023-06-28 12:39       ` Jason Gunthorpe
  1 sibling, 1 reply; 40+ messages in thread
From: Liu, Yi L @ 2023-06-28  8:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Lingyu; +Cc: Tian, Kevin, intel-wired-lan, Burra, Phani R

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> On Wed, Jun 21, 2023 at 09:11:07AM +0000, Lingyu Liu wrote:
> > diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c
> b/drivers/net/ethernet/intel/ice/ice_migration.c
> > index 2579bc0bd193..c2a83a97af05 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_migration.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_migration.c
> 
> > +static int
> > +ice_migration_restore_tx_head(struct ice_vf *vf,
> > +			      struct ice_migration_dev_state *devstate,
> > +			      struct vfio_device *vdev)
> > +{
> > +	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
> > +	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
> > +	struct ice_pf *pf = vf->pf;
> > +	u16 max_ring_len = 0;
> > +	struct device *dev;
> > +	int ret = 0;
> > +	int i = 0;
> > +
> > +	dev = ice_pf_to_dev(vf->pf);
> > +
> > +	if (!vsi) {
> > +		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
> > +		return -EINVAL;
> > +	}
> > +
> > +	ice_for_each_txq(vsi, i) {
> > +		if (!test_bit(i, vf->txq_ena))
> > +			continue;
> > +
> > +		max_ring_len = max(vsi->tx_rings[i]->count, max_ring_len);
> > +	}
> > +
> > +	if (max_ring_len == 0)
> > +		return 0;
> > +
> > +	tx_desc = (struct ice_tx_desc *)kcalloc
> > +		  (max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
> > +	tx_desc_dummy = (struct ice_tx_desc *)kcalloc
> > +			(max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
> > +	if (!tx_desc || !tx_desc_dummy) {
> > +		dev_err(dev, "VF %d failed to allocate memory for tx descriptors to restore tx head\n",
> > +			vf->vf_id);
> > +		ret = -ENOMEM;
> > +		goto err;
> > +	}
> > +
> > +	for (i = 0; i < max_ring_len; i++) {
> > +		u32 td_cmd;
> > +
> > +		td_cmd = ICE_TXD_LAST_DESC_CMD | ICE_TX_DESC_CMD_DUMMY;
> > +		tx_desc_dummy[i].cmd_type_offset_bsz =
> > +					ice_build_ctob(td_cmd, 0, SZ_256, 0);
> > +	}
> > +
> > +	/* For each tx queue, we restore the tx head following below steps:
> > +	 * 1. backup original tx ring descriptor memory
> > +	 * 2. overwrite the tx ring descriptor with dummy packets
> > +	 * 3. kick doorbell register to trigger descriptor writeback,
> > +	 *    then tx head will move from 0 to tail - 1 and tx head is restored
> > +	 *    to the place we expect.
> > +	 * 4. restore the tx ring with original tx ring descriptor memory in
> > +	 *    order not to corrupt the ring context.
> > +	 */
> > +	ice_for_each_txq(vsi, i) {
> > +		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
> > +		u16 *tx_heads = devstate->tx_head;
> > +		u32 tx_head;
> > +		int j;
> > +
> > +		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
> > +			continue;
> > +
> > +		if (tx_heads[i] >= tx_ring->count) {
> > +			dev_err(dev, "saved tx ring head exceeds tx ring count\n");
> > +			ret = -EINVAL;
> > +			goto err;
> > +		}
> > +		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
> > +				  tx_ring->count * sizeof(tx_desc[0]), false);
> > +		if (ret) {
> > +			dev_err(dev, "kvm read guest tx ring error: %d\n",
> > +				ret);
> > +			goto err;
> 
> You can't call VFIO functions from a netdev driver. All this code
> needs to be moved into the variant driver.
> 
> This design seems pretty wild to me, it doesn't seem too robust
> against a hostile VM - eg these DMAs can all fail under guest control,
> and then what?
> 
> We also don't have any guarantees defined for the VFIO protocol about
> what state the vIOMMU will be in prior to reaching RUNNING.

For QEMU, vIOMMU is supposed to be restored prior to devices per
the below patch. But indeed, there is no guarantee that all VMMs
will do the same as QEMU does.

https://lore.kernel.org/qemu-devel/1483675573-12636-3-git-send-email-peterx@redhat.com/

Regards,
Yi Liu

> IDK, all of this looks like it is trying really hard to hackily force
> HW that was never meant to support live migration to somehow do
> something that looks like it.
> 
> You really need to present an explanation in the VFIO driver comments
> about how this whole scheme actually works and is secure and
> functional against a hostile guest.
> 
> Jason


* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-28  8:11     ` Liu, Yi L
@ 2023-06-28 12:39       ` Jason Gunthorpe
  2023-07-03 12:54         ` Liu, Yi L
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-06-28 12:39 UTC (permalink / raw)
  To: Liu, Yi L; +Cc: Tian, Kevin, intel-wired-lan, Burra, Phani R

On Wed, Jun 28, 2023 at 08:11:07AM +0000, Liu, Yi L wrote:

> > You can't call VFIO functions from a netdev driver. All this code
> > needs to be moved into the variant driver.
> > 
> > This design seems pretty wild to me, it doesn't seem too robust
> > against a hostile VM - eg these DMAs can all fail under guest control,
> > and then what?
> > 
> > We also don't have any guarantees defined for the VFIO protocol about
> > what state the vIOMMU will be in prior to reaching RUNNING.
> 
> For QEMU, vIOMMU is supposed to be restored prior to devices per
> the below patch. But indeed, there is no guarantee that all VMMs
> will do the same as QEMU does.

That doesn't seem consistent with how the kernel interface is defined
to work, I wonder why it was done?

Since it is 2017, I suppose it has to do with internal qemu devices?

Jason

* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-27  6:55       ` Tian, Kevin
@ 2023-07-03  5:27         ` Cao, Yahui
  -1 siblings, 0 replies; 40+ messages in thread
From: Cao, Yahui @ 2023-07-03  5:27 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Liu, Lingyu; +Cc: intel-wired-lan, Liu, Yi L, kvm

Hi Jason & Kevin,

On 6/27/2023 2:55 PM, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, June 21, 2023 10:37 PM
>>
>> On Wed, Jun 21, 2023 at 09:11:07AM +0000, Lingyu Liu wrote:
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c
>> b/drivers/net/ethernet/intel/ice/ice_migration.c
>>> index 2579bc0bd193..c2a83a97af05 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_migration.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_migration.c
>>> +static int
>>> +ice_migration_restore_tx_head(struct ice_vf *vf,
>>> +			      struct ice_migration_dev_state *devstate,
>>> +			      struct vfio_device *vdev)
>>> +{
>>> +	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
>>> +	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
>>> +	struct ice_pf *pf = vf->pf;
>>> +	u16 max_ring_len = 0;
>>> +	struct device *dev;
>>> +	int ret = 0;
>>> +	int i = 0;
>>> +
>>> +	dev = ice_pf_to_dev(vf->pf);
>>> +
>>> +	if (!vsi) {
>>> +		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	ice_for_each_txq(vsi, i) {
>>> +		if (!test_bit(i, vf->txq_ena))
>>> +			continue;
>>> +
>>> +		max_ring_len = max(vsi->tx_rings[i]->count, max_ring_len);
>>> +	}
>>> +
>>> +	if (max_ring_len == 0)
>>> +		return 0;
>>> +
>>> +	tx_desc = (struct ice_tx_desc *)kcalloc
>>> +		  (max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
>>> +	tx_desc_dummy = (struct ice_tx_desc *)kcalloc
>>> +			(max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
>>> +	if (!tx_desc || !tx_desc_dummy) {
>>> +		dev_err(dev, "VF %d failed to allocate memory for tx descriptors to restore tx head\n",
>>> +			vf->vf_id);
>>> +		ret = -ENOMEM;
>>> +		goto err;
>>> +	}
>>> +
>>> +	for (i = 0; i < max_ring_len; i++) {
>>> +		u32 td_cmd;
>>> +
>>> +		td_cmd = ICE_TXD_LAST_DESC_CMD | ICE_TX_DESC_CMD_DUMMY;
>>> +		tx_desc_dummy[i].cmd_type_offset_bsz =
>>> +					ice_build_ctob(td_cmd, 0, SZ_256, 0);
>>> +	}
>>> +
>>> +	/* For each tx queue, we restore the tx head following below steps:
>>> +	 * 1. backup original tx ring descriptor memory
>>> +	 * 2. overwrite the tx ring descriptor with dummy packets
>>> +	 * 3. kick doorbell register to trigger descriptor writeback,
>>> +	 *    then tx head will move from 0 to tail - 1 and tx head is restored
>>> +	 *    to the place we expect.
>>> +	 * 4. restore the tx ring with original tx ring descriptor memory in
>>> +	 *    order not to corrupt the ring context.
>>> +	 */
>>> +	ice_for_each_txq(vsi, i) {
>>> +		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
>>> +		u16 *tx_heads = devstate->tx_head;
>>> +		u32 tx_head;
>>> +		int j;
>>> +
>>> +		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
>>> +			continue;
>>> +
>>> +		if (tx_heads[i] >= tx_ring->count) {
>>> +			dev_err(dev, "saved tx ring head exceeds tx ring count\n");
>>> +			ret = -EINVAL;
>>> +			goto err;
>>> +		}
>>> +		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
>>> +				  tx_ring->count * sizeof(tx_desc[0]), false);
>>> +		if (ret) {
>>> +			dev_err(dev, "kvm read guest tx ring error: %d\n",
>>> +				ret);
>>> +			goto err;
>> You can't call VFIO functions from a netdev driver. All this code
>> needs to be moved into the variant driver.


Will move vfio_dma_rw() into the vfio driver and pass a callback function
into the netdev driver.


>>
>> This design seems pretty wild to me, it doesn't seem too robust
>> against a hostile VM - eg these DMAs can all fail under guest control,
>> and then what?
> Yeah that sounds fragile.
>
> at least the range which will be overwritten in the resuming path should
> be verified in the src side. If inaccessible then the driver should fail the
> state transition immediately instead of letting it identified in the resuming
> path which is unrecoverable.
>
> btw I don't know how its spec describes the hw behavior in such situation.
> If the behavior is undefined when a hostile software deliberately causes
> DMA failures to TX queue then not restoring the queue head could also be
> an option to continue the migration in such scenario.


Thanks for the advice. Will check the vfio_dma_rw() correctness on the source
side and fail the state transition once the function returns failure.

When hostile software deliberately causes DMA failures on the TX queue, the
TX queue head will remain at its original value, which is 0 on the
destination side. In this case, I'll let the VM resume with the TX head
staying at that original value.
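
For readers following along, a condensed sketch of the per-queue restore flow
being discussed (error handling trimmed; the doorbell/head accessors are
placeholders for the real register writes, and "backup"/"dummy" stand for the
shadow descriptor buffers allocated earlier):

	if (saved_head == 0 || saved_head >= tx_ring->count)
		return -EINVAL;	/* or skip the queue, per the fallback above */

	/* 1. back up the guest's descriptor memory; verified on the source
	 *    side too, so a failure here aborts the state transition
	 */
	ret = vfio_dma_rw(vdev, tx_ring->dma, backup,
			  tx_ring->count * sizeof(*backup), false);
	if (ret)
		return ret;

	/* 2. overwrite descriptors [0, saved_head) with dummy packets */
	ret = vfio_dma_rw(vdev, tx_ring->dma, dummy,
			  saved_head * sizeof(*dummy), true);
	if (ret)
		return ret;

	/* 3. bump the tail doorbell so HW writes back the dummies and its
	 *    HEAD lands on saved_head
	 */
	ice_txq_set_tail(tx_ring, saved_head);		/* placeholder */
	ret = ice_txq_wait_head(tx_ring, saved_head);	/* placeholder */
	if (ret)
		return ret;

	/* 4. put the guest's original descriptors back so the ring context
	 *    is not corrupted
	 */
	return vfio_dma_rw(vdev, tx_ring->dma, backup,
			   tx_ring->count * sizeof(*backup), true);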


>
>> We also don't have any guarantees defined for the VFIO protocol about
>> what state the vIOMMU will be in prior to reaching RUNNING.
> This is a good point. Actually it's not just a gap on vIOMMU. it's kind
> of a dependency on IOMMUFD no matter the IOAS which the migrated
> device is currently attached to is GPA or GIOVA. The device state can
> be restored only after IOMMUFD is fully recovered and the device is
> re-attached to the IOAS.
>
> Need a way for migration driver to advocate such dependency to the user.


Since this part is new to me, I may need further guidance from you and other
community experts on how to resolve the dependency.

Thanks.


>
>> IDK, all of this looks like it is trying really hard to hackily force
>> HW that was never meant to support live migration to somehow do
>> something that looks like it.
>>
>> You really need to present an explanation in the VFIO driver comments
>> about how this whole scheme actually works and is secure and
>> functional against a hostile guest.
>>
> Agree. And please post the next version to the VFIO community to gain
> more attention.


I'll add more comments about the whole scheme and post the next version to
the VFIO community.

Thank you Jason and Kevin for the valuable feedback.

Thanks.
Yahui.




* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-28 12:39       ` Jason Gunthorpe
@ 2023-07-03 12:54         ` Liu, Yi L
  2023-07-04  7:38           ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Liu, Yi L @ 2023-07-03 12:54 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Tian, Kevin, intel-wired-lan, Burra, Phani R, Peter Xu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 28, 2023 8:39 PM
> 
> On Wed, Jun 28, 2023 at 08:11:07AM +0000, Liu, Yi L wrote:
> 
> > > You can't call VFIO functions from a netdev driver. All this code
> > > needs to be moved into the variant driver.
> > >
> > > This design seems pretty wild to me, it doesn't seem too robust
> > > against a hostile VM - eg these DMAs can all fail under guest control,
> > > and then what?
> > >
> > > We also don't have any guarantees defined for the VFIO protocol about
> > > what state the vIOMMU will be in prior to reaching RUNNING.
> >
> > For QEMU, vIOMMU is supposed to be restored prior to devices per
> > the below patch. But indeed, there is no guarantee that all VMMs
> > will do the same as QEMU does.
> 
> That doesn't seem consistent with how the kernel interface is defined
> to work, I wonder why it was done?

Add Peter.

> 
> Since it is 2017, I suppose it has to do with internal qemu devices?

Do you mean the devices emulated by QEMU?

Regards,
Yi Liu

* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-03  5:27         ` Cao, Yahui
@ 2023-07-03 21:03           ` Jason Gunthorpe
  -1 siblings, 0 replies; 40+ messages in thread
From: Jason Gunthorpe @ 2023-07-03 21:03 UTC (permalink / raw)
  To: Cao, Yahui; +Cc: Tian, Kevin, Liu, Lingyu, intel-wired-lan, Liu, Yi L, kvm

On Mon, Jul 03, 2023 at 01:27:51PM +0800, Cao, Yahui wrote:

> > > You can't call VFIO functions from a netdev driver. All this code
> > > needs to be moved into the variant driver.
> 
> Will move vfio_dma_rw() into the vfio driver and pass a callback function
> into the netdev driver.

Please make proper layers; you should not need to stitch your driver
together with weird function pointers.
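
(For illustration, one way to structure that, with invented names and the
dummy-descriptor writeback elided: keep every guest-memory access inside the
vfio variant driver and hand the ice core driver only plain kernel buffers,
so neither VFIO types nor function pointers cross the boundary.)

/* In the ice_vfio_pci variant driver */
static int ice_vfio_pci_load_tx_ring(struct ice_vfio_pci_core_device *ice_vdev,
				     u16 qid, dma_addr_t ring_dma,
				     u16 ring_len, u16 saved_head)
{
	struct vfio_device *vdev = &ice_vdev->core_device.vdev;
	struct ice_tx_desc *shadow;
	int ret;

	shadow = kcalloc(ring_len, sizeof(*shadow), GFP_KERNEL);
	if (!shadow)
		return -ENOMEM;

	/* All vfio_dma_rw() calls stay here, in the VFIO driver. */
	ret = vfio_dma_rw(vdev, ring_dma, shadow,
			  ring_len * sizeof(*shadow), false);
	if (!ret)
		/* The core ice export only ever sees a plain buffer. */
		ret = ice_migration_restore_tx_head(ice_vdev->vf, qid,
						    shadow, saved_head);

	kfree(shadow);
	return ret;
}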
 
> > > We also don't have any guarantees defined for the VFIO protocol about
> > > what state the vIOMMU will be in prior to reaching RUNNING.
> > This is a good point. Actually it's not just a gap on vIOMMU. it's kind
> > of a dependency on IOMMUFD no matter the IOAS which the migrated
> > device is currently attached to is GPA or GIOVA. The device state can
> > be restored only after IOMMUFD is fully recovered and the device is
> > re-attached to the IOAS.
> > 
> > Need a way for migration driver to advocate such dependency to the user. 
> 
> Since this part is new to me, may need further guidance on how to resolve
> the dependency from you and other community experts.

Personally I'm quite uncomfortable with a driver that tries to work
this way, I'm not sure we should encourage this. Can Intel really be
convincing that this is safe and correct?

Jason



* RE: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-03 21:03           ` Jason Gunthorpe
@ 2023-07-04  7:35             ` Tian, Kevin
  -1 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2023-07-04  7:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Cao, Yahui; +Cc: Liu, Lingyu, intel-wired-lan, Liu, Yi L, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, July 4, 2023 5:04 AM
> 
> On Mon, Jul 03, 2023 at 01:27:51PM +0800, Cao, Yahui wrote:
> 
> > > > We also don't have any guarantees defined for the VFIO protocol about
> > > > what state the vIOMMU will be in prior to reaching RUNNING.
> > > This is a good point. Actually it's not just a gap on vIOMMU. it's kind
> > > of a dependency on IOMMUFD no matter the IOAS which the migrated
> > > device is currently attached to is GPA or GIOVA. The device state can
> > > be restored only after IOMMUFD is fully recovered and the device is
> > > re-attached to the IOAS.
> > >
> > > Need a way for migration driver to advocate such dependency to the user.
> >
> > Since this part is new to me, may need further guidance on how to resolve
> > the dependency from you and other community experts.
> 
> Personally I'm quite uncomfortable with a driver that tries to work
> this way, I'm not sure we should encourage this. Can Intel really be
> convincing that this is safe and correct?
> 

I dislike it too. Will discuss with Yahui on the correctness of this approach
and any cleaner alternative.



* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-03 12:54         ` Liu, Yi L
@ 2023-07-04  7:38           ` Tian, Kevin
  2023-07-04 17:59             ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2023-07-04  7:38 UTC (permalink / raw)
  To: Liu, Yi L, Jason Gunthorpe; +Cc: Burra, Phani R, intel-wired-lan, Peter Xu

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 3, 2023 8:54 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, June 28, 2023 8:39 PM
> >
> > On Wed, Jun 28, 2023 at 08:11:07AM +0000, Liu, Yi L wrote:
> >
> > > > You can't call VFIO functions from a netdev driver. All this code
> > > > needs to be moved into the variant driver.
> > > >
> > > > This design seems pretty wild to me, it doesn't seem too robust
> > > > against a hostile VM - eg these DMAs can all fail under guest control,
> > > > and then what?
> > > >
> > > > We also don't have any guarantees defined for the VFIO protocol about
> > > > what state the vIOMMU will be in prior to reaching RUNNING.
> > >
> > > For QEMU, vIOMMU is supposed to be restored prior to devices per
> > > the below patch. But indeed, there is no guarantee that all VMMs
> > > will do the same as QEMU does.
> >
> > That doesn't seem consistent with how the kernel interface is defined
> > to work, I wonder why it was done?
> 
> Add Peter.
> 
> >
> > Since it is 2017, I suppose it has to do with internal qemu devices?
> 
> Do you mean the devices emulated by QEMU?
> 

This sounds counter-intuitive even for emulated devices. vIOMMU is
involved only when the device wants to access guest memory. But
by definition the RESUMING path should just restore the device to
the point where it was stopped. Why would such a restore create a
dependency on the vIOMMU state?

* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-04  7:38           ` Tian, Kevin
@ 2023-07-04 17:59             ` Peter Xu
  2023-07-10 15:54               ` Jason Gunthorpe
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2023-07-04 17:59 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Burra, Phani R, Liu, Yi L, intel-wired-lan, Jason Gunthorpe

On Tue, Jul 04, 2023 at 07:38:22AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, July 3, 2023 8:54 PM
> > 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, June 28, 2023 8:39 PM
> > >
> > > On Wed, Jun 28, 2023 at 08:11:07AM +0000, Liu, Yi L wrote:
> > >
> > > > > You can't call VFIO functions from a netdev driver. All this code
> > > > > needs to be moved into the variant driver.
> > > > >
> > > > > This design seems pretty wild to me, it doesn't seem too robust
> > > > > against a hostile VM - eg these DMAs can all fail under guest control,
> > > > > and then what?
> > > > >
> > > > > We also don't have any guarantees defined for the VFIO protocol about
> > > > > what state the vIOMMU will be in prior to reaching RUNNING.
> > > >
> > > > For QEMU, vIOMMU is supposed to be restored prior to devices per
> > > > the below patch. But indeed, there is no guarantee that all VMMs
> > > > will do the same as QEMU does.
> > >
> > > That doesn't seem consistent with how the kernel interface is defined
> > > to work, I wonder why it was done?
> > 
> > Add Peter.

Yi - I think I replied to your private email; I'm not 100% sure this is the
same thing being asked? I hope you received the email.

> > 
> > >
> > > Since it is 2017, I suppose it has to do with internal qemu devices?
> > 
> > Do you mean the devices emulated by QEMU?
> > 
> 
> This sounds counter-intuitive even for emulated devices. vIOMMU is
> involved only when the device wants to access guest memory. But
> by definition the RESUMING path should just restore the device to
> the point where it is stopped. Why would such restore create a
> dependency on vIOMMU state?

As for why QEMU migrates the vIOMMU earlier than generic PCI devices, I can
try to repeat it here: IIRC it's because some device needs DMA translations
during post_load(). I forget which device and why, but it can make some sense
to me if recovering the state of a PCI device may need help from a
translation.

If someone wants to know the details, one can make the vIOMMU priority lower
than the PCI devices, give it a shot, and see what explodes first.

Thanks,

-- 
Peter Xu


* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-04 17:59             ` Peter Xu
@ 2023-07-10 15:54               ` Jason Gunthorpe
  2023-07-17 21:43                 ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-07-10 15:54 UTC (permalink / raw)
  To: Peter Xu; +Cc: Tian, Kevin, Liu, Yi L, intel-wired-lan, Burra, Phani R

On Tue, Jul 04, 2023 at 01:59:49PM -0400, Peter Xu wrote:

> > This sounds counter-intuitive even for emulated devices. vIOMMU is
> > involved only when the device wants to access guest memory. But
> > by definition the RESUMING path should just restore the device to
> > the point where it is stopped. Why would such restore create a
> > dependency on vIOMMU state?
> 
> For why QEMU migrates vIOMMU earlier than PCI generic devices, I can try to
> repeat here: IIRC it's because some device needs dma translations during
> post_load(), I forgot which device and why, but it can make some sense to
> me if recovering states of the pci device may need help from a translation.

Well, it has nothing to do with VFIO given the timeframe, and if
internal qemu devices are doing weird things like this it also
suggests they have problems with P2P as well.

The VFIO definition we have is that a device may not do any DMA while in
resuming/resuming_p2p, it is only allowed to initiate DMA once it
fully reaches running.

Obviously we have to setup the vIOMMU before going to running.

The internal qemu devices must follow this definition as well or they
are not P2P compatible.

About the only exception I could think of is if some internal device was
caching translations for some purpose, but it really should set up that
caching at entry to running.

Jason

* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-10 15:54               ` Jason Gunthorpe
@ 2023-07-17 21:43                 ` Peter Xu
  2023-07-18 15:38                   ` Jason Gunthorpe
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2023-07-17 21:43 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Tian, Kevin, Liu, Yi L, intel-wired-lan, Burra, Phani R

On Mon, Jul 10, 2023 at 12:54:03PM -0300, Jason Gunthorpe wrote:
> On Tue, Jul 04, 2023 at 01:59:49PM -0400, Peter Xu wrote:
> 
> > > This sounds counter-intuitive even for emulated devices. vIOMMU is
> > > involved only when the device wants to access guest memory. But
> > > by definition the RESUMING path should just restore the device to
> > > the point where it is stopped. Why would such restore create a
> > > dependency on vIOMMU state?
> > 
> > For why QEMU migrates vIOMMU earlier than PCI generic devices, I can try to
> > repeat here: IIRC it's because some device needs dma translations during
> > post_load(), I forgot which device and why, but it can make some sense to
> > me if recovering states of the pci device may need help from a translation.
> 
> Well, it has nothing to do with VFIO given the timeframe, and if
> internal qemu devices are doing weird things like this it also
> suggests they have problems with P2P as well.
> 
> The VFIO definition we have is that a device may not do any DMA while in
> resuming/resuming_p2p, it is only allowed to initiate DMA once it
> fully reaches running.
> 
> Obviously we have to setup the vIOMMU before going to running.
> 
> The internal qemu devices must follow this definition as well or they
> are not P2P compatible.
> 
> About the only exception I could think is if some internal devices was
> caching translations for some purpose, but it really should setup that
> caching at entry to running..

Yes, it's something like that as far as I remember - a pure translation
request rather than a real DMA. It does sound fine to me to postpone that
into a VM state change handler, but it is worth checking when someone really
works on it, and also whether it's feasible - e.g. hopefully only a few
devices need such a workaround.

It may boil down to why we need to avoid migrating vIOMMU before other
devices (which I am still unsure about..), and which way is easier.

Thanks,

-- 
Peter Xu


* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-17 21:43                 ` Peter Xu
@ 2023-07-18 15:38                   ` Jason Gunthorpe
  2023-07-18 17:36                     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Gunthorpe @ 2023-07-18 15:38 UTC (permalink / raw)
  To: Peter Xu; +Cc: Tian, Kevin, Liu, Yi L, intel-wired-lan, Burra, Phani R

On Mon, Jul 17, 2023 at 05:43:34PM -0400, Peter Xu wrote:

> It may boil down to why we need to avoid migrating vIOMMU before other
> devices (which I am still unsure about..), and which way is easier.

The statement is just that the kernel cannot assume anything about
vIOMMU ordering and people can't make VFIO migration drivers that
assume the vIOMMU is set up before the entry to running.

That qemu enforces this ordering today for pointless reasons is
interesting, but it should not leak into becoming the ABI contract of
the kernel.

Jason

* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-18 15:38                   ` Jason Gunthorpe
@ 2023-07-18 17:36                     ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2023-07-18 17:36 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Tian, Kevin, Liu, Yi L, intel-wired-lan, Burra, Phani R

On Tue, Jul 18, 2023 at 12:38:58PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 17, 2023 at 05:43:34PM -0400, Peter Xu wrote:
> 
> > It may boil down to why we need to avoid migrating vIOMMU before other
> > devices (which I am still unsure about..), and which way is easier.
> 
> The statement is just that the kernel cannot assume anything about
> vIOMMU ordering and people can't make VFIO migration drivers that
> assume the vIOMMU is setup before the entry to running.
> 
> That qemu enforces this ordering today for pointless reasons is
> interesting, but it should not leak into becoming the ABI contract of
> the kernel.

I wouldn't say it's pointless because we obviously did that for a reason.
But I agree it should be qemu internal and shouldn't even be exposed unless
extremely necessary.

-- 
Peter Xu



Thread overview: 40+ messages
2023-06-21  9:10 [Intel-wired-lan] [PATCH iwl-next V2 00/15] Add E800 live migration driver Lingyu Liu
2023-06-21  9:10 ` [Intel-wired-lan] [PATCH iwl-next V2 01/15] ice: Fix missing legacy 32byte RXDID in the supported bitmap Lingyu Liu
2023-06-21  9:10 ` [Intel-wired-lan] [PATCH iwl-next V2 02/15] ice: add function to get rxq context Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 03/15] ice: check VF migration status before sending messages to VF Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 04/15] ice: add migration init field and helper functions Lingyu Liu
2023-06-21 13:35   ` Jason Gunthorpe
2023-06-27  7:50     ` Cao, Yahui
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 05/15] ice: save VF messages as device state Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 06/15] ice: save and restore " Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 07/15] ice: do not notify VF link state during migration Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 08/15] ice: change VSI id in virtual channel message after migration Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 09/15] ice: save and restore RX queue head Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX " Lingyu Liu
2023-06-21 14:37   ` Jason Gunthorpe
2023-06-27  6:55     ` Tian, Kevin
2023-06-27  6:55       ` Tian, Kevin
2023-07-03  5:27       ` Cao, Yahui
2023-07-03  5:27         ` Cao, Yahui
2023-07-03 21:03         ` Jason Gunthorpe
2023-07-03 21:03           ` Jason Gunthorpe
2023-07-04  7:35           ` Tian, Kevin
2023-07-04  7:35             ` Tian, Kevin
2023-06-28  8:11     ` Liu, Yi L
2023-06-28 12:39       ` Jason Gunthorpe
2023-07-03 12:54         ` Liu, Yi L
2023-07-04  7:38           ` Tian, Kevin
2023-07-04 17:59             ` Peter Xu
2023-07-10 15:54               ` Jason Gunthorpe
2023-07-17 21:43                 ` Peter Xu
2023-07-18 15:38                   ` Jason Gunthorpe
2023-07-18 17:36                     ` Peter Xu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 11/15] ice: stop device before saving device states Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 12/15] ice: mask VF advanced capabilities if live migration is activated Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 13/15] vfio/ice: implement vfio_pci driver for E800 devices Lingyu Liu
2023-06-21 14:23   ` Jason Gunthorpe
2023-06-27  9:00     ` Liu, Lingyu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 14/15] vfio: Expose vfio_device_has_container() Lingyu Liu
2023-06-21  9:11 ` [Intel-wired-lan] [PATCH iwl-next V2 15/15] vfio/ice: support iommufd vfio compat mode Lingyu Liu
2023-06-21 14:40   ` Jason Gunthorpe
2023-06-27  8:09     ` Cao, Yahui
