linux-rdma.vger.kernel.org archive mirror
* [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
       [not found] <cover.1632305919.git.leonro@nvidia.com>
@ 2021-09-22 10:38 ` Leon Romanovsky
  2021-09-22 21:59   ` Bjorn Helgaas
  2021-09-22 10:38 ` [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Leon Romanovsky
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-22 10:38 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Alex Williamson, Bjorn Helgaas, David S. Miller, Jakub Kicinski,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Yishai Hadas

From: Jason Gunthorpe <jgg@nvidia.com>

The PCI core uses the VF index internally, often called the vf_id,
during the setup of the VF, e.g. in pci_iov_add_virtfn().

This index is needed for device drivers that implement live migration
for their internal operations that configure/control their VFs.

Specifically, the mlx5_vfio_pci driver that is introduced in coming
patches from this series needs it, rather than the bus/device/function
which is exposed today.

Add pci_iov_vf_id() which computes the vf_id by reversing the math that
was used to create the bus/device/function.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/iov.c   | 14 ++++++++++++++
 include/linux/pci.h |  7 ++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index dafdc652fcd0..e7751fa3fe0b 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 }
 EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
 
+int pci_iov_vf_id(struct pci_dev *dev)
+{
+	struct pci_dev *pf;
+
+	if (!dev->is_virtfn)
+		return -EINVAL;
+
+	pf = pci_physfn(dev);
+	return (((dev->bus->number << 8) + dev->devfn) -
+		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
+	       pf->sriov->stride;
+}
+EXPORT_SYMBOL_GPL(pci_iov_vf_id);
+
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
  * change when NumVFs changes.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cd8aa6fce204..4d6c73506e18 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2153,7 +2153,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 #ifdef CONFIG_PCI_IOV
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
-
+int pci_iov_vf_id(struct pci_dev *dev);
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 
@@ -2181,6 +2181,11 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 {
 	return -ENOSYS;
 }
+static inline int pci_iov_vf_id(struct pci_dev *dev)
+{
+	return -ENOSYS;
+}
+
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
       [not found] <cover.1632305919.git.leonro@nvidia.com>
  2021-09-22 10:38 ` [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index Leon Romanovsky
@ 2021-09-22 10:38 ` Leon Romanovsky
  2021-09-23 10:33   ` Shameerali Kolothum Thodi
  2021-09-27 22:46   ` Alex Williamson
  2021-09-22 10:38 ` [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional Leon Romanovsky
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-22 10:38 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, David S. Miller,
	Jakub Kicinski, Kirti Wankhede, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed

From: Yishai Hadas <yishaih@nvidia.com>

Add an API in the core layer to check migration state transition validity
as part of a migration flow.

The valid transitions follow the expected usage as described in
uapi/vfio.h and triggered by QEMU.

This ensures that all migration implementations follow a consistent
migration state machine.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/vfio.c  | 41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h |  1 +
 2 files changed, 42 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 3c034fe14ccb..c3ca33e513c8 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+/**
+ * vfio_change_migration_state_allowed - Checks whether a migration state
+ *   transition is valid.
+ * @new_state: The new state to move to.
+ * @old_state: The old state.
+ * Return: true if the transition is valid.
+ */
+bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state)
+{
+	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
+	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
+		[VFIO_DEVICE_STATE_STOP] = {
+			[VFIO_DEVICE_STATE_RUNNING] = 1,
+			[VFIO_DEVICE_STATE_RESUMING] = 1,
+		},
+		[VFIO_DEVICE_STATE_RUNNING] = {
+			[VFIO_DEVICE_STATE_STOP] = 1,
+			[VFIO_DEVICE_STATE_SAVING] = 1,
+			[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING] = 1,
+		},
+		[VFIO_DEVICE_STATE_SAVING] = {
+			[VFIO_DEVICE_STATE_STOP] = 1,
+			[VFIO_DEVICE_STATE_RUNNING] = 1,
+		},
+		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING] = {
+			[VFIO_DEVICE_STATE_RUNNING] = 1,
+			[VFIO_DEVICE_STATE_SAVING] = 1,
+		},
+		[VFIO_DEVICE_STATE_RESUMING] = {
+			[VFIO_DEVICE_STATE_RUNNING] = 1,
+			[VFIO_DEVICE_STATE_STOP] = 1,
+		},
+	};
+
+	if (new_state > MAX_STATE || old_state > MAX_STATE)
+		return false;
+
+	return vfio_from_state_table[old_state][new_state];
+}
+EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
+
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b53a9557884a..e65137a708f1 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -83,6 +83,7 @@ extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
 extern void vfio_device_put(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
+bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {
-- 
2.31.1



* [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional
       [not found] <cover.1632305919.git.leonro@nvidia.com>
  2021-09-22 10:38 ` [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index Leon Romanovsky
  2021-09-22 10:38 ` [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Leon Romanovsky
@ 2021-09-22 10:38 ` Leon Romanovsky
  2021-09-23 13:57   ` Max Gurtovoy
  2021-09-22 10:38 ` [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures Leon Romanovsky
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-22 10:38 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, David S. Miller,
	Jakub Kicinski, Kirti Wankhede, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed

From: Yishai Hadas <yishaih@nvidia.com>

Make the region->release() function optional, as in some cases there is
nothing for the driver to do as part of it.

This is needed for a coming patch from this series: once we add the
mlx5_vfio_pci driver to support live migration, we don't need a
migration region release function.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 68198e0f2a63..3ddc3adb24de 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -341,7 +341,8 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	vdev->virq_disabled = false;
 
 	for (i = 0; i < vdev->num_regions; i++)
-		vdev->region[i].ops->release(vdev, &vdev->region[i]);
+		if (vdev->region[i].ops->release)
+			vdev->region[i].ops->release(vdev, &vdev->region[i]);
 
 	vdev->num_regions = 0;
 	kfree(vdev->region);
-- 
2.31.1



* [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures
       [not found] <cover.1632305919.git.leonro@nvidia.com>
                   ` (2 preceding siblings ...)
  2021-09-22 10:38 ` [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional Leon Romanovsky
@ 2021-09-22 10:38 ` Leon Romanovsky
  2021-09-24  5:48   ` Mark Zhang
  2021-09-22 10:38 ` [PATCH mlx5-next 5/7] net/mlx5: Expose APIs to get/put the mlx5 core device Leon Romanovsky
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-22 10:38 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, Jakub Kicinski,
	Kirti Wankhede, kvm, linux-kernel, linux-pci, linux-rdma, netdev,
	Saeed Mahameed

From: Yishai Hadas <yishaih@nvidia.com>

Introduce the migration-related IFC bits and structures needed to enable
the migration commands.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 145 +++++++++++++++++++++++++++++++++-
 1 file changed, 144 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index d90a65b6824f..366c7b030eb7 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -126,6 +126,11 @@ enum {
 	MLX5_CMD_OP_QUERY_SF_PARTITION            = 0x111,
 	MLX5_CMD_OP_ALLOC_SF                      = 0x113,
 	MLX5_CMD_OP_DEALLOC_SF                    = 0x114,
+	MLX5_CMD_OP_SUSPEND_VHCA                  = 0x115,
+	MLX5_CMD_OP_RESUME_VHCA                   = 0x116,
+	MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE    = 0x117,
+	MLX5_CMD_OP_SAVE_VHCA_STATE               = 0x118,
+	MLX5_CMD_OP_LOAD_VHCA_STATE               = 0x119,
 	MLX5_CMD_OP_CREATE_MKEY                   = 0x200,
 	MLX5_CMD_OP_QUERY_MKEY                    = 0x201,
 	MLX5_CMD_OP_DESTROY_MKEY                  = 0x202,
@@ -1719,7 +1724,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         reserved_at_682[0x1];
 	u8         log_max_sf[0x5];
 	u8         apu[0x1];
-	u8         reserved_at_689[0x7];
+	u8         reserved_at_689[0x4];
+	u8         migration[0x1];
+	u8         reserved_at_68d[0x2];
 	u8         log_min_sf_size[0x8];
 	u8         max_num_sf_partitions[0x8];
 
@@ -11146,4 +11153,140 @@ enum {
 	MLX5_MTT_PERM_RW	= MLX5_MTT_PERM_READ | MLX5_MTT_PERM_WRITE,
 };
 
+enum {
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER  = 0x0,
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE   = 0x1,
+};
+
+struct mlx5_ifc_suspend_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_suspend_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+enum {
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE   = 0x0,
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER  = 0x1,
+};
+
+struct mlx5_ifc_resume_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_resume_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+
+	u8         required_umem_size[0x20];
+
+	u8         reserved_at_a0[0x160];
+};
+
+struct mlx5_ifc_save_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_save_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_load_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_load_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
 #endif /* MLX5_IFC_H */
-- 
2.31.1



* [PATCH mlx5-next 5/7] net/mlx5: Expose APIs to get/put the mlx5 core device
       [not found] <cover.1632305919.git.leonro@nvidia.com>
                   ` (3 preceding siblings ...)
  2021-09-22 10:38 ` [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures Leon Romanovsky
@ 2021-09-22 10:38 ` Leon Romanovsky
  2021-09-22 10:38 ` [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device Leon Romanovsky
  2021-09-22 10:38 ` [PATCH mlx5-next 7/7] mlx5_vfio_pci: Implement vfio_pci driver for mlx5 devices Leon Romanovsky
  6 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-22 10:38 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, Jakub Kicinski,
	Kirti Wankhede, kvm, linux-kernel, linux-pci, linux-rdma, netdev,
	Saeed Mahameed

From: Yishai Hadas <yishaih@nvidia.com>

Expose an API to get the mlx5 core device from a given PCI device if
mlx5_core is its driver.

Upon the get API we keep the intf_state_mutex locked to make sure that
the device can't be removed/unloaded until the caller completes its job
over the device. Any flow that takes the lock is expected to hold it
only for a short period of time.

Upon the put API we unlock the intf_state_mutex.

The use case for those APIs is the migration flow of a VF over VFIO PCI.
In that case the VF doesn't ride on mlx5_core, because two different
drivers control two different PCI devices: the PF owned by mlx5_core and
the VF owned by the vfio driver.

The mlx5_core of the PF is accessed only during the narrow window of the
VF's ioctl that requires its services.

This allows the PF driver to be more independent of the VF driver, so
long as it doesn't reset the FW.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 43 +++++++++++++++++++
 include/linux/mlx5/driver.h                   |  3 ++
 2 files changed, 46 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 79482824c64f..fcc8b7830421 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1795,6 +1795,49 @@ static struct pci_driver mlx5_core_driver = {
 	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };
 
+/**
+ * mlx5_get_core_dev - Get the mlx5 core device from a given PCI device if
+ *                     mlx5_core is its driver.
+ * @pdev: The associated PCI device.
+ *
+ * Upon return the interface state lock stays held to let the caller use it
+ * safely. The caller must use the returned mlx5 device only for a narrow
+ * window and put it back with mlx5_put_core_dev() immediately once usage is over.
+ *
+ * Return: Pointer to the associated mlx5_core_dev or NULL.
+ */
+struct mlx5_core_dev *mlx5_get_core_dev(struct pci_dev *pdev)
+			__acquires(&mdev->intf_state_mutex)
+{
+	struct mlx5_core_dev *mdev;
+
+	device_lock(&pdev->dev);
+	if (pdev->driver != &mlx5_core_driver) {
+		device_unlock(&pdev->dev);
+		return NULL;
+	}
+
+	mdev = pci_get_drvdata(pdev);
+	mutex_lock(&mdev->intf_state_mutex);
+	device_unlock(&pdev->dev);
+
+	return mdev;
+}
+EXPORT_SYMBOL(mlx5_get_core_dev);
+
+/**
+ * mlx5_put_core_dev - Put the mlx5 core device back.
+ * @mdev: The mlx5 core device.
+ *
+ * Upon return the interface state lock is unlocked and caller should not
+ * access the mdev any more.
+ */
+void mlx5_put_core_dev(struct mlx5_core_dev *mdev)
+{
+	mutex_unlock(&mdev->intf_state_mutex);
+}
+EXPORT_SYMBOL(mlx5_put_core_dev);
+
 static void mlx5_core_verify_params(void)
 {
 	if (prof_sel >= ARRAY_SIZE(profile)) {
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 1b8bae246b28..e9a96904d6f1 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1156,6 +1156,9 @@ int mlx5_dm_sw_icm_alloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 			   u64 length, u16 uid, phys_addr_t addr, u32 obj_id);
 
+struct mlx5_core_dev *mlx5_get_core_dev(struct pci_dev *pdev);
+void mlx5_put_core_dev(struct mlx5_core_dev *mdev);
+
 #ifdef CONFIG_MLX5_CORE_IPOIB
 struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
 					  struct ib_device *ibdev,
-- 
2.31.1



* [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device
       [not found] <cover.1632305919.git.leonro@nvidia.com>
                   ` (4 preceding siblings ...)
  2021-09-22 10:38 ` [PATCH mlx5-next 5/7] net/mlx5: Expose APIs to get/put the mlx5 core device Leon Romanovsky
@ 2021-09-22 10:38 ` Leon Romanovsky
  2021-09-28 20:22   ` Alex Williamson
  2021-09-22 10:38 ` [PATCH mlx5-next 7/7] mlx5_vfio_pci: Implement vfio_pci driver for mlx5 devices Leon Romanovsky
  6 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-22 10:38 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, David S. Miller,
	Jakub Kicinski, Kirti Wankhede, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed

From: Yishai Hadas <yishaih@nvidia.com>

Expose migration commands over the device: suspend, resume, get vhca id,
and query/save/load state.

As part of this, add the APIs and data structures that are needed to
manage the migration data.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5_vfio_pci_cmd.c | 358 +++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5_vfio_pci_cmd.h |  43 ++++
 2 files changed, 401 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci_cmd.c
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci_cmd.h

diff --git a/drivers/vfio/pci/mlx5_vfio_pci_cmd.c b/drivers/vfio/pci/mlx5_vfio_pci_cmd.c
new file mode 100644
index 000000000000..7e4f83d196a8
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci_cmd.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "mlx5_vfio_pci_cmd.h"
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_get_core_dev(pci_physfn(pdev));
+	u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA);
+	MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(suspend_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, suspend_vhca, in, out);
+	mlx5_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_get_core_dev(pci_physfn(pdev));
+	u32 out[MLX5_ST_SZ_DW(resume_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(resume_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(resume_vhca_in, in, opcode, MLX5_CMD_OP_RESUME_VHCA);
+	MLX5_SET(resume_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(resume_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, resume_vhca, in, out);
+	mlx5_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+					  u32 *state_size)
+{
+	struct mlx5_core_dev *mdev = mlx5_get_core_dev(pci_physfn(pdev));
+	u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(query_vhca_migration_state_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE);
+	MLX5_SET(query_vhca_migration_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(query_vhca_migration_state_in, in, op_mod, 0);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_vhca_migration_state, in, out);
+	if (ret)
+		goto end;
+
+	*state_size = MLX5_GET(query_vhca_migration_state_out, out,
+			       required_umem_size);
+
+end:
+	mlx5_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id)
+{
+	struct mlx5_core_dev *mdev = mlx5_get_core_dev(pci_physfn(pdev));
+	u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
+	int out_size;
+	void *out;
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	out_size = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	out = kzalloc(out_size, GFP_KERNEL);
+	if (!out) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+	MLX5_SET(query_hca_cap_in, in, other_function, 1);
+	MLX5_SET(query_hca_cap_in, in, function_id, function_id);
+	MLX5_SET(query_hca_cap_in, in, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1 |
+		 HCA_CAP_OPMOD_GET_CUR);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_hca_cap, in, out);
+	if (ret)
+		goto err_exec;
+
+	*vhca_id = MLX5_GET(query_hca_cap_out, out, capability.cmd_hca_cap.vhca_id);
+
+err_exec:
+	kfree(out);
+end:
+	mlx5_put_core_dev(mdev);
+	return ret;
+}
+
+static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
+			      struct mlx5_vhca_state_data *state,
+			      struct mlx5_core_mkey *mkey)
+{
+	struct sg_dma_page_iter dma_iter;
+	int err = 0, inlen;
+	__be64 *mtt;
+	void *mkc;
+	u32 *in;
+
+	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+			sizeof(*mtt) * round_up(state->num_pages, 2);
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
+		 DIV_ROUND_UP(state->num_pages, 2));
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
+
+	for_each_sgtable_dma_page(&state->mig_data.table.sgt, &dma_iter, 0)
+		*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, pd, pdn);
+	MLX5_SET(mkc, mkc, bsf_octword_size, 0);
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, translations_octword_size,
+		 DIV_ROUND_UP(state->num_pages, 2));
+	MLX5_SET64(mkc, mkc, start_addr, mkey->iova);
+	MLX5_SET64(mkc, mkc, len, state->num_pages * PAGE_SIZE);
+	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
+
+	kvfree(in);
+
+	return err;
+}
+
+struct page *mlx5vf_get_migration_page(struct migration_data *data,
+				       unsigned long offset)
+{
+	unsigned long cur_offset = 0;
+	struct scatterlist *sg;
+	unsigned int i;
+
+	if (offset < data->last_offset || !data->last_offset_sg) {
+		data->last_offset = 0;
+		data->last_offset_sg = data->table.sgt.sgl;
+		data->sg_last_entry = 0;
+	}
+
+	cur_offset = data->last_offset;
+
+	for_each_sg(data->last_offset_sg, sg,
+			data->table.sgt.orig_nents - data->sg_last_entry, i) {
+		if (offset < sg->length + cur_offset) {
+			data->last_offset_sg = sg;
+			data->sg_last_entry += i;
+			data->last_offset = cur_offset;
+			return nth_page(sg_page(sg),
+					(offset - cur_offset) / PAGE_SIZE);
+		}
+		cur_offset += sg->length;
+	}
+	return NULL;
+}
+
+void mlx5vf_reset_vhca_state(struct mlx5_vhca_state_data *state)
+{
+	struct migration_data *data = &state->mig_data;
+	struct sg_page_iter sg_iter;
+
+	if (!data->table.prv)
+		goto end;
+
+	/* Undo alloc_pages_bulk_array() */
+	for_each_sgtable_page(&data->table.sgt, &sg_iter, 0)
+		__free_page(sg_page_iter_page(&sg_iter));
+	sg_free_append_table(&data->table);
+end:
+	memset(state, 0, sizeof(*state));
+}
+
+int mlx5vf_add_migration_pages(struct mlx5_vhca_state_data *state,
+			       unsigned int npages)
+{
+	unsigned int to_alloc = npages;
+	struct page **page_list;
+	unsigned long filled;
+	unsigned int to_fill;
+	int ret = 0;
+
+	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
+	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL);
+	if (!page_list)
+		return -ENOMEM;
+
+	do {
+		filled = alloc_pages_bulk_array(GFP_KERNEL, to_fill,
+						page_list);
+		if (!filled) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		to_alloc -= filled;
+		ret = sg_alloc_append_table_from_pages(
+			&state->mig_data.table, page_list, filled, 0,
+			filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
+			GFP_KERNEL);
+
+		if (ret)
+			goto err;
+		/* clean input for another bulk allocation */
+		memset(page_list, 0, filled * sizeof(*page_list));
+		to_fill = min_t(unsigned int, to_alloc,
+				PAGE_SIZE / sizeof(*page_list));
+	} while (to_alloc > 0);
+
+	kvfree(page_list);
+	state->num_pages += npages;
+
+	return 0;
+
+err:
+	kvfree(page_list);
+	return ret;
+}
+
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       u64 state_size,
+			       struct mlx5_vhca_state_data *state)
+{
+	struct mlx5_core_dev *mdev = mlx5_get_core_dev(pci_physfn(pdev));
+	u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {};
+	struct mlx5_core_mkey mkey = {};
+	u32 pdn;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = mlx5vf_add_migration_pages(state,
+				DIV_ROUND_UP_ULL(state_size, PAGE_SIZE));
+	if (err < 0)
+		goto err_alloc_pages;
+
+	err = dma_map_sgtable(mdev->device, &state->mig_data.table.sgt,
+			      DMA_FROM_DEVICE, 0);
+	if (err)
+		goto err_reg_dma;
+
+	err = _create_state_mkey(mdev, pdn, state, &mkey);
+	if (err)
+		goto err_create_mkey;
+
+	MLX5_SET(save_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_SAVE_VHCA_STATE);
+	MLX5_SET(save_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(save_vhca_state_in, in, vhca_id, vhca_id);
+
+	MLX5_SET64(save_vhca_state_in, in, va, mkey.iova);
+	MLX5_SET(save_vhca_state_in, in, mkey, mkey.key);
+	MLX5_SET(save_vhca_state_in, in, size, mkey.size);
+
+	err = mlx5_cmd_exec_inout(mdev, save_vhca_state, in, out);
+	if (err)
+		goto err_exec;
+
+	state->state_size = state_size;
+
+	mlx5_core_destroy_mkey(mdev, &mkey);
+	mlx5_core_dealloc_pd(mdev, pdn);
+	dma_unmap_sgtable(mdev->device, &state->mig_data.table.sgt,
+			  DMA_FROM_DEVICE, 0);
+	mlx5_put_core_dev(mdev);
+
+	return 0;
+
+err_exec:
+	mlx5_core_destroy_mkey(mdev, &mkey);
+err_create_mkey:
+	dma_unmap_sgtable(mdev->device, &state->mig_data.table.sgt,
+			  DMA_FROM_DEVICE, 0);
+err_reg_dma:
+	mlx5vf_reset_vhca_state(state);
+err_alloc_pages:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_put_core_dev(mdev);
+	return err;
+}
+
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vhca_state_data *state)
+{
+	struct mlx5_core_dev *mdev = mlx5_get_core_dev(pci_physfn(pdev));
+	u32 out[MLX5_ST_SZ_DW(load_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(load_vhca_state_in)] = {};
+	struct mlx5_core_mkey mkey = {};
+	u32 pdn;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = dma_map_sgtable(mdev->device, &state->mig_data.table.sgt,
+			      DMA_TO_DEVICE, 0);
+	if (err)
+		goto err_reg;
+
+	err = _create_state_mkey(mdev, pdn, state, &mkey);
+	if (err)
+		goto err_mkey;
+
+	MLX5_SET(load_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_LOAD_VHCA_STATE);
+	MLX5_SET(load_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(load_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET64(load_vhca_state_in, in, va, mkey.iova);
+	MLX5_SET(load_vhca_state_in, in, mkey, mkey.key);
+	MLX5_SET(load_vhca_state_in, in, size, state->state_size);
+
+	err = mlx5_cmd_exec_inout(mdev, load_vhca_state, in, out);
+	mlx5_core_destroy_mkey(mdev, &mkey);
+err_mkey:
+	dma_unmap_sgtable(mdev->device, &state->mig_data.table.sgt,
+			  DMA_TO_DEVICE, 0);
+err_reg:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_put_core_dev(mdev);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5_vfio_pci_cmd.h b/drivers/vfio/pci/mlx5_vfio_pci_cmd.h
new file mode 100644
index 000000000000..66221df24b19
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci_cmd.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#ifndef MLX5_VFIO_CMD_H
+#define MLX5_VFIO_CMD_H
+
+#include <linux/kernel.h>
+#include <linux/mlx5/driver.h>
+
+struct migration_data {
+	struct sg_append_table table;
+
+	struct scatterlist *last_offset_sg;
+	unsigned int sg_last_entry;
+	unsigned long last_offset;
+};
+
+/* state data of vhca to be used as part of migration flow */
+struct mlx5_vhca_state_data {
+	u64 state_size;
+	u64 num_pages;
+	u32 win_start_offset;
+	struct migration_data mig_data;
+};
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
					  u32 *state_size);
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id);
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       u64 state_size,
+			       struct mlx5_vhca_state_data *state);
+void mlx5vf_reset_vhca_state(struct mlx5_vhca_state_data *state);
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vhca_state_data *state);
+int mlx5vf_add_migration_pages(struct mlx5_vhca_state_data *state,
+			       unsigned int npages);
+struct page *mlx5vf_get_migration_page(struct migration_data *data,
+				       unsigned long offset);
+#endif /* MLX5_VFIO_CMD_H */
-- 
2.31.1



* [PATCH mlx5-next 7/7] mlx5_vfio_pci: Implement vfio_pci driver for mlx5 devices
       [not found] <cover.1632305919.git.leonro@nvidia.com>
                   ` (5 preceding siblings ...)
  2021-09-22 10:38 ` [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device Leon Romanovsky
@ 2021-09-22 10:38 ` Leon Romanovsky
  6 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-22 10:38 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, David S. Miller,
	Jakub Kicinski, Kirti Wankhede, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed

From: Yishai Hadas <yishaih@nvidia.com>

This patch adds vfio_pci driver support for mlx5 devices.

It uses vfio_pci_core to register to the VFIO subsystem and then
implements the mlx5-specific logic in the migration area.

The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.

This patch implements the suspend/resume flows.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/Kconfig         |  11 +
 drivers/vfio/pci/Makefile        |   3 +
 drivers/vfio/pci/mlx5_vfio_pci.c | 736 +++++++++++++++++++++++++++++++
 3 files changed, 750 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 860424ccda1b..c10b53028309 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -43,4 +43,15 @@ config VFIO_PCI_IGD
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
 endif
+
+config MLX5_VFIO_PCI
+	tristate "VFIO support for MLX5 PCI devices"
+	depends on MLX5_CORE
+	select VFIO_PCI_CORE
+	help
+	  This provides PCI support for MLX5 devices using the VFIO
+	  framework. The device-specific driver supports suspend/resume
+	  of the MLX5 device.
+
+	  If you don't know what to do here, say N.
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 349d68d242b4..b9448bba0c83 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -7,3 +7,6 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+mlx5-vfio-pci-y := mlx5_vfio_pci.o mlx5_vfio_pci_cmd.o
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
diff --git a/drivers/vfio/pci/mlx5_vfio_pci.c b/drivers/vfio/pci/mlx5_vfio_pci.c
new file mode 100644
index 000000000000..710a3ff9cbcc
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci.c
@@ -0,0 +1,736 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/vfio_pci_core.h>
+
+#include "mlx5_vfio_pci_cmd.h"
+
+enum {
+	MLX5VF_PCI_QUIESCED = 1 << 0,
+	MLX5VF_PCI_FREEZED = 1 << 1,
+};
+
+enum {
+	MLX5VF_REGION_PENDING_BYTES = 1 << 0,
+	MLX5VF_REGION_DATA_SIZE = 1 << 1,
+};
+
+#define MLX5VF_MIG_REGION_DATA_SIZE SZ_128K
+/* Data section offset from migration region */
+#define MLX5VF_MIG_REGION_DATA_OFFSET                                          \
+	(sizeof(struct vfio_device_migration_info))
+
+#define VFIO_DEVICE_MIGRATION_OFFSET(x)                                        \
+	(offsetof(struct vfio_device_migration_info, x))
+
+struct mlx5vf_pci_migration_info {
+	u32 vfio_dev_state; /* VFIO_DEVICE_STATE_XXX */
+	u32 dev_state; /* device migration state */
+	u32 region_state; /* Use MLX5VF_REGION_XXX */
+	u16 vhca_id;
+	struct mlx5_vhca_state_data vhca_state_data;
+};
+
+struct mlx5vf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	u8 migrate_cap:1;
+	struct mlx5vf_pci_migration_info vmig;
+};
+
+static int mlx5vf_pci_unquiesce_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	int ret;
+
+	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_QUIESCED))
+		return 0;
+
+	ret = mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
+				     mvdev->vmig.vhca_id,
+				     MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
+	if (ret)
+		return ret;
+
+	mvdev->vmig.dev_state &= ~MLX5VF_PCI_QUIESCED;
+	return 0;
+}
+
+static int mlx5vf_pci_quiesce_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	int ret;
+
+	if (mvdev->vmig.dev_state & MLX5VF_PCI_QUIESCED)
+		return 0;
+
+	ret = mlx5vf_cmd_suspend_vhca(
+		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
+		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
+	if (ret)
+		return ret;
+
+	mvdev->vmig.dev_state |= MLX5VF_PCI_QUIESCED;
+	return 0;
+}
+
+static int mlx5vf_pci_unfreeze_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	int ret;
+
+	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED))
+		return 0;
+
+	ret = mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
+				     mvdev->vmig.vhca_id,
+				     MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
+	if (ret)
+		return ret;
+
+	mvdev->vmig.dev_state &= ~MLX5VF_PCI_FREEZED;
+	return 0;
+}
+
+static int mlx5vf_pci_freeze_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	int ret;
+
+	if (mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED)
+		return 0;
+
+	ret = mlx5vf_cmd_suspend_vhca(
+		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
+		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
+	if (ret)
+		return ret;
+
+	mvdev->vmig.dev_state |= MLX5VF_PCI_FREEZED;
+	return 0;
+}
+
+static int mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
+{
+	u32 state_size = 0;
+	int ret;
+
+	if (!(mvdev->vmig.vfio_dev_state & VFIO_DEVICE_STATE_SAVING))
+		return -EFAULT;
+
+	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED))
+		return -EFAULT;
+
+	/* If we already read the state, there is no reason to re-read it */
+	if (mvdev->vmig.vhca_state_data.state_size)
+		return 0;
+
+	ret = mlx5vf_cmd_query_vhca_migration_state(
+		mvdev->core_device.pdev, mvdev->vmig.vhca_id, &state_size);
+	if (ret)
+		return ret;
+
+	return mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
+					  mvdev->vmig.vhca_id, state_size,
+					  &mvdev->vmig.vhca_state_data);
+}
+
+static int mlx5vf_pci_new_write_window(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
+	u32 num_pages_needed;
+	u64 allocated_ready;
+	u32 bytes_needed;
+
+	/* Check how many bytes are available from previous flows */
+	WARN_ON(state_data->num_pages * PAGE_SIZE <
+		state_data->win_start_offset);
+	allocated_ready = (state_data->num_pages * PAGE_SIZE) -
+			  state_data->win_start_offset;
+	WARN_ON(allocated_ready > MLX5VF_MIG_REGION_DATA_SIZE);
+
+	bytes_needed = MLX5VF_MIG_REGION_DATA_SIZE - allocated_ready;
+	if (!bytes_needed)
+		return 0;
+
+	num_pages_needed = DIV_ROUND_UP_ULL(bytes_needed, PAGE_SIZE);
+	return mlx5vf_add_migration_pages(state_data, num_pages_needed);
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_data_size(struct mlx5vf_pci_core_device *mvdev,
+				      char __user *buf, bool iswrite)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+	u64 data_size;
+	int ret;
+
+	if (iswrite) {
+		/* data_size is writable only during resuming state */
+		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_RESUMING)
+			return -EINVAL;
+
+		ret = copy_from_user(&data_size, buf, sizeof(data_size));
+		if (ret)
+			return -EFAULT;
+
+		vmig->vhca_state_data.state_size += data_size;
+		vmig->vhca_state_data.win_start_offset += data_size;
+		ret = mlx5vf_pci_new_write_window(mvdev);
+		if (ret)
+			return ret;
+
+	} else {
+		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_SAVING)
+			return -EINVAL;
+
+		data_size = min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
+				  vmig->vhca_state_data.state_size -
+				  vmig->vhca_state_data.win_start_offset);
+		ret = copy_to_user(buf, &data_size, sizeof(data_size));
+		if (ret)
+			return -EFAULT;
+	}
+
+	vmig->region_state |= MLX5VF_REGION_DATA_SIZE;
+	return sizeof(data_size);
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_data_offset(struct mlx5vf_pci_core_device *mvdev,
+					char __user *buf, bool iswrite)
+{
+	static const u64 data_offset = MLX5VF_MIG_REGION_DATA_OFFSET;
+	int ret;
+
+	/* RO field */
+	if (iswrite)
+		return -EFAULT;
+
+	ret = copy_to_user(buf, &data_offset, sizeof(data_offset));
+	if (ret)
+		return -EFAULT;
+
+	return sizeof(data_offset);
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_pending_bytes(struct mlx5vf_pci_core_device *mvdev,
+					  char __user *buf, bool iswrite)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+	u64 pending_bytes;
+	int ret;
+
+	/* RO field */
+	if (iswrite)
+		return -EFAULT;
+
+	if (vmig->vfio_dev_state == (VFIO_DEVICE_STATE_SAVING |
+				     VFIO_DEVICE_STATE_RUNNING)) {
+		/* In the pre-copy state there is no data to return yet,
+		 * so report 0 pending bytes.
+		 */
+		pending_bytes = 0;
+	} else {
+		/*
+		 * If the device is quiesced, we can freeze it, since it is
+		 * guaranteed that all other DMA masters are quiesced as
+		 * well.
+		 */
+		if (vmig->dev_state & MLX5VF_PCI_QUIESCED) {
+			ret = mlx5vf_pci_freeze_device(mvdev);
+			if (ret)
+				return ret;
+		}
+
+		ret = mlx5vf_pci_save_device_data(mvdev);
+		if (ret)
+			return ret;
+
+		pending_bytes = vmig->vhca_state_data.state_size -
+				vmig->vhca_state_data.win_start_offset;
+	}
+
+	ret = copy_to_user(buf, &pending_bytes, sizeof(pending_bytes));
+	if (ret)
+		return -EFAULT;
+
+	/* The window moves forward once the previous iteration's data was read */
+	if (vmig->region_state & MLX5VF_REGION_DATA_SIZE)
+		vmig->vhca_state_data.win_start_offset +=
+			min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE, pending_bytes);
+
+	WARN_ON(vmig->vhca_state_data.win_start_offset >
+		vmig->vhca_state_data.state_size);
+
+	/* New iteration started */
+	vmig->region_state = MLX5VF_REGION_PENDING_BYTES;
+	return sizeof(pending_bytes);
+}
+
+static int mlx5vf_load_state(struct mlx5vf_pci_core_device *mvdev)
+{
+	if (!mvdev->vmig.vhca_state_data.state_size)
+		return 0;
+
+	return mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
+					  mvdev->vmig.vhca_id,
+					  &mvdev->vmig.vhca_state_data);
+}
+
+static void mlx5vf_reset_mig_state(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+
+	vmig->region_state = 0;
+	mlx5vf_reset_vhca_state(&vmig->vhca_state_data);
+}
+
+static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
+				       u32 state)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+	int ret;
+
+	if (state == vmig->vfio_dev_state)
+		return 0;
+
+	if (!vfio_change_migration_state_allowed(state, vmig->vfio_dev_state))
+		return -EINVAL;
+
+	switch (state) {
+	case VFIO_DEVICE_STATE_RUNNING:
+		/*
+		 * (running) - When we move to _RUNNING state we must:
+		 *             1. stop dirty tracking (in case we got here
+		 *                after recovering from an error).
+		 *             2. reset the device migration info fields
+		 *             3. make sure the device is unfrozen
+		 *             4. make sure the device is unquiesced
+		 */
+
+		/* When moving from resuming to running, we may load state
+		 * to the device if it was previously set.
+		 */
+		if (vmig->vfio_dev_state == VFIO_DEVICE_STATE_RESUMING) {
+			ret = mlx5vf_load_state(mvdev);
+			if (ret)
+				return ret;
+		}
+
+		/* Any previous migration state, if it exists, should be reset */
+		mlx5vf_reset_mig_state(mvdev);
+
+		ret = mlx5vf_pci_unfreeze_device(mvdev);
+		if (ret)
+			return ret;
+		ret = mlx5vf_pci_unquiesce_device(mvdev);
+		break;
+	case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+		/*
+		 * (pre-copy) - device should start logging data.
+		 */
+		ret = 0;
+		break;
+	case VFIO_DEVICE_STATE_SAVING:
+		/*
+		 * (stop-and-copy) - Stop the device as DMA master.
+		 *                   At this stage the device can't dirty more
+		 *                   pages so we can stop logging for it.
+		 */
+		ret = mlx5vf_pci_quiesce_device(mvdev);
+		break;
+	case VFIO_DEVICE_STATE_STOP:
+		/*
+		 * (stop) - device stopped, not saving or resuming data.
+		 */
+		ret = 0;
+		break;
+	case VFIO_DEVICE_STATE_RESUMING:
+		/*
+		 * (resuming) - device stopped, should soon start resuming
+		 * data. The device must be quiesced (not a DMA master) and
+		 * frozen (not a DMA slave). The migration info should also
+		 * be reset.
+		 */
+		ret = mlx5vf_pci_quiesce_device(mvdev);
+		if (ret)
+			break;
+		ret = mlx5vf_pci_freeze_device(mvdev);
+		if (ret)
+			break;
+		mlx5vf_reset_mig_state(mvdev);
+		ret = mlx5vf_pci_new_write_window(mvdev);
+		break;
+	default:
+		return -EFAULT;
+	}
+	if (ret)
+		return ret;
+
+	vmig->vfio_dev_state = state;
+	return 0;
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_device_state(struct mlx5vf_pci_core_device *mvdev,
+					 char __user *buf, bool iswrite)
+{
+	size_t count = sizeof(mvdev->vmig.vfio_dev_state);
+	int ret;
+
+	if (iswrite) {
+		u32 device_state;
+
+		ret = copy_from_user(&device_state, buf, count);
+		if (ret)
+			return -EFAULT;
+
+		ret = mlx5vf_pci_set_device_state(mvdev, device_state);
+		if (ret)
+			return ret;
+	} else {
+		ret = copy_to_user(buf, &mvdev->vmig.vfio_dev_state, count);
+		if (ret)
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+static ssize_t
+mlx5vf_pci_copy_user_data_to_device_state(struct mlx5vf_pci_core_device *mvdev,
+					  char __user *buf, size_t count,
+					  u64 offset)
+{
+	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
+	u32 curr_offset;
+	char __user *from_buff = buf;
+	u32 win_page_offset;
+	u32 copy_count;
+	struct page *page;
+	char *to_buff;
+	int ret;
+
+	curr_offset = state_data->win_start_offset + offset;
+
+	do {
+		page = mlx5vf_get_migration_page(&state_data->mig_data,
+						 curr_offset);
+		if (!page)
+			return -EINVAL;
+
+		win_page_offset = curr_offset % PAGE_SIZE;
+		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
+
+		to_buff = kmap_local_page(page);
+		ret = copy_from_user(to_buff + win_page_offset, from_buff,
+				     copy_count);
+		kunmap_local(to_buff);
+		if (ret)
+			return -EFAULT;
+
+		from_buff += copy_count;
+		curr_offset += copy_count;
+		count -= copy_count;
+	} while (count > 0);
+
+	return 0;
+}
+
+static ssize_t
+mlx5vf_pci_copy_device_state_to_user(struct mlx5vf_pci_core_device *mvdev,
+				     char __user *buf, u64 offset, size_t count)
+{
+	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
+	u32 win_available_bytes;
+	u32 win_page_offset;
+	char __user *to_buff = buf;
+	u32 copy_count;
+	u32 curr_offset;
+	char *from_buff;
+	struct page *page;
+	int ret;
+
+	win_available_bytes =
+		min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
+		      mvdev->vmig.vhca_state_data.state_size -
+			      mvdev->vmig.vhca_state_data.win_start_offset);
+
+	if (count + offset > win_available_bytes)
+		return -EINVAL;
+
+	curr_offset = state_data->win_start_offset + offset;
+
+	do {
+		page = mlx5vf_get_migration_page(&state_data->mig_data,
+						 curr_offset);
+		if (!page)
+			return -EINVAL;
+
+		win_page_offset = curr_offset % PAGE_SIZE;
+		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
+
+		from_buff = kmap_local_page(page);
+		ret = copy_to_user(to_buff, from_buff + win_page_offset,
+				   copy_count);
+		kunmap_local(from_buff);
+		if (ret)
+			return -EFAULT;
+
+		curr_offset += copy_count;
+		count -= copy_count;
+		to_buff += copy_count;
+	} while (count);
+
+	return 0;
+}
+
+static ssize_t
+mlx5vf_pci_migration_data_rw(struct mlx5vf_pci_core_device *mvdev,
+			     char __user *buf, size_t count, u64 offset,
+			     bool iswrite)
+{
+	int ret;
+
+	if (offset + count > MLX5VF_MIG_REGION_DATA_SIZE)
+		return -EINVAL;
+
+	if (iswrite)
+		ret = mlx5vf_pci_copy_user_data_to_device_state(mvdev, buf,
+								count, offset);
+	else
+		ret = mlx5vf_pci_copy_device_state_to_user(mvdev, buf, offset,
+							   count);
+	if (ret)
+		return ret;
+	return count;
+}
+
+static ssize_t mlx5vf_pci_mig_rw(struct vfio_pci_core_device *vdev,
+				 char __user *buf, size_t count, loff_t *ppos,
+				 bool iswrite)
+{
+	struct mlx5vf_pci_core_device *mvdev =
+		container_of(vdev, struct mlx5vf_pci_core_device, core_device);
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int ret;
+
+	/* Copy to/from the migration region data section */
+	if (pos >= MLX5VF_MIG_REGION_DATA_OFFSET)
+		return mlx5vf_pci_migration_data_rw(
+			mvdev, buf, count, pos - MLX5VF_MIG_REGION_DATA_OFFSET,
+			iswrite);
+
+	switch (pos) {
+	case VFIO_DEVICE_MIGRATION_OFFSET(device_state):
+		/* This is a RW field. */
+		if (count != sizeof(mvdev->vmig.vfio_dev_state)) {
+			ret = -EINVAL;
+			break;
+		}
+		ret = mlx5vf_pci_handle_migration_device_state(mvdev, buf,
+							       iswrite);
+		break;
+	case VFIO_DEVICE_MIGRATION_OFFSET(pending_bytes):
+		/*
+		 * The number of pending bytes still to be migrated from the
+		 * vendor driver. This is a RO field.
+		 * Reading this field indicates the start of a new iteration
+		 * to get device data.
+		 *
+		 */
+		ret = mlx5vf_pci_handle_migration_pending_bytes(mvdev, buf,
+								iswrite);
+		break;
+	case VFIO_DEVICE_MIGRATION_OFFSET(data_offset):
+		/*
+		 * The user application should read data_offset field from the
+		 * migration region. The user application should read the
+		 * device data from this offset within the migration region
+		 * during the _SAVING mode or write the device data during the
+		 * _RESUMING mode. This is RO field.
+		 */
+		ret = mlx5vf_pci_handle_migration_data_offset(mvdev, buf,
+							      iswrite);
+		break;
+	case VFIO_DEVICE_MIGRATION_OFFSET(data_size):
+		/*
+		 * The user application should read data_size to get the size
+		 * in bytes of the data copied to the migration region during
+		 * the _SAVING state by the device. The user application should
+		 * write the size in bytes of the data that was copied to
+		 * the migration region during the _RESUMING state by the user.
+		 * This is a RW field.
+		 */
+		ret = mlx5vf_pci_handle_migration_data_size(mvdev, buf,
+							    iswrite);
+		break;
+	default:
+		ret = -EFAULT;
+		break;
+	}
+
+	return ret;
+}
+
+static struct vfio_pci_regops migration_ops = {
+	.rw = mlx5vf_pci_mig_rw,
+};
+
+static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &mvdev->core_device;
+	int vf_id;
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	if (!mvdev->migrate_cap) {
+		vfio_pci_core_finish_enable(vdev);
+		return 0;
+	}
+
+	vf_id = pci_iov_vf_id(vdev->pdev);
+	if (vf_id < 0) {
+		ret = vf_id;
+		goto out_disable;
+	}
+
+	ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1,
+				     &mvdev->vmig.vhca_id);
+	if (ret)
+		goto out_disable;
+
+	ret = vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
+					   VFIO_REGION_SUBTYPE_MIGRATION,
+					   &migration_ops,
+					   MLX5VF_MIG_REGION_DATA_OFFSET +
+					   MLX5VF_MIG_REGION_DATA_SIZE,
+					   VFIO_REGION_INFO_FLAG_READ |
+					   VFIO_REGION_INFO_FLAG_WRITE,
+					   NULL);
+	if (ret)
+		goto out_disable;
+
+	vfio_pci_core_finish_enable(vdev);
+	return 0;
+out_disable:
+	vfio_pci_core_disable(vdev);
+	return ret;
+}
+
+static void mlx5vf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+
+	vfio_pci_core_close_device(core_vdev);
+	mlx5vf_reset_mig_state(mvdev);
+}
+
+static const struct vfio_device_ops mlx5vf_pci_ops = {
+	.name = "mlx5-vfio-pci",
+	.open_device = mlx5vf_pci_open_device,
+	.close_device = mlx5vf_pci_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+};
+
+static int mlx5vf_pci_probe(struct pci_dev *pdev,
+			    const struct pci_device_id *id)
+{
+	struct mlx5vf_pci_core_device *mvdev;
+	int ret;
+
+	mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL);
+	if (!mvdev)
+		return -ENOMEM;
+	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
+
+	if (pdev->is_virtfn) {
+		struct mlx5_core_dev *mdev =
+			mlx5_get_core_dev(pci_physfn(pdev));
+
+		if (mdev) {
+			if (MLX5_CAP_GEN(mdev, migration))
+				mvdev->migrate_cap = 1;
+			mlx5_put_core_dev(mdev);
+		}
+	}
+
+	ret = vfio_pci_core_register_device(&mvdev->core_device);
+	if (ret)
+		goto out_free;
+
+	dev_set_drvdata(&pdev->dev, mvdev);
+	return 0;
+
+out_free:
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+	return ret;
+}
+
+static void mlx5vf_pci_remove(struct pci_dev *pdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(&mvdev->core_device);
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+}
+
+static const struct pci_device_id mlx5vf_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_MELLANOX, 0x101e) }, /* ConnectX Family mlx5Gen Virtual Function */
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
+
+static struct pci_driver mlx5vf_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = mlx5vf_pci_table,
+	.probe = mlx5vf_pci_probe,
+	.remove = mlx5vf_pci_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+};
+
+static void __exit mlx5vf_pci_cleanup(void)
+{
+	pci_unregister_driver(&mlx5vf_pci_driver);
+}
+
+static int __init mlx5vf_pci_init(void)
+{
+	return pci_register_driver(&mlx5vf_pci_driver);
+}
+
+module_init(mlx5vf_pci_init);
+module_exit(mlx5vf_pci_cleanup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
+MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
+MODULE_DESCRIPTION(
+	"MLX5 VFIO PCI - User Level meta-driver for MLX5 device family");
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-22 10:38 ` [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index Leon Romanovsky
@ 2021-09-22 21:59   ` Bjorn Helgaas
  2021-09-23  6:35     ` Leon Romanovsky
  0 siblings, 1 reply; 57+ messages in thread
From: Bjorn Helgaas @ 2021-09-22 21:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The PCI core uses the VF index internally, often called the vf_id,
> during the setup of the VF, eg pci_iov_add_virtfn().
> 
> This index is needed for device drivers that implement live migration
> for their internal operations that configure/control their VFs.
>
> Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> from this series needs it and not the bus/device/function which is
> exposed today.
> 
> Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> was used to create the bus/device/function.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
same thing as pci_iov_vf_id() by iterating through VFs until it finds
one with a matching devfn (although it *doesn't* check for a matching
bus number, which seems like a bug).

Maybe that should use pci_iov_vf_id()?

> ---
>  drivers/pci/iov.c   | 14 ++++++++++++++
>  include/linux/pci.h |  7 ++++++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index dafdc652fcd0..e7751fa3fe0b 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>  }
>  EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
>  
> +int pci_iov_vf_id(struct pci_dev *dev)
> +{
> +	struct pci_dev *pf;
> +
> +	if (!dev->is_virtfn)
> +		return -EINVAL;
> +
> +	pf = pci_physfn(dev);
> +	return (((dev->bus->number << 8) + dev->devfn) -
> +		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
> +	       pf->sriov->stride;
> +}
> +EXPORT_SYMBOL_GPL(pci_iov_vf_id);
> +
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>   * change when NumVFs changes.
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index cd8aa6fce204..4d6c73506e18 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2153,7 +2153,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>  #ifdef CONFIG_PCI_IOV
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> -
> +int pci_iov_vf_id(struct pci_dev *dev);
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
>  
> @@ -2181,6 +2181,11 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
>  {
>  	return -ENOSYS;
>  }
> +static inline int pci_iov_vf_id(struct pci_dev *dev)
> +{
> +	return -ENOSYS;
> +}
> +

Drop the blank line to match the surrounding stubs.

>  static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>  { return -ENODEV; }


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-22 21:59   ` Bjorn Helgaas
@ 2021-09-23  6:35     ` Leon Romanovsky
  2021-09-24 13:08       ` Bjorn Helgaas
  0 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-23  6:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > The PCI core uses the VF index internally, often called the vf_id,
> > during the setup of the VF, eg pci_iov_add_virtfn().
> > 
> > This index is needed for device drivers that implement live migration
> > for their internal operations that configure/control their VFs.
> >
> > Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> > from this series needs it and not the bus/device/function which is
> > exposed today.
> > 
> > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > was used to create the bus/device/function.
> > 
> > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> same thing as pci_iov_vf_id() by iterating through VFs until it finds
> one with a matching devfn (although it *doesn't* check for a matching
> bus number, which seems like a bug).
> 
> Maybe that should use pci_iov_vf_id()?

Yes, I gave the same comment internally, and we decided to simply reduce
the number of changes in mlx5_core to have fewer distractions and submit
it as a follow-up. Most likely I will add this hunk in v1.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index e8185b69ac6c..b66be0b4244a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -209,15 +209,8 @@ int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
        /* Reversed translation of PCI VF function number to the internal
         * function_id, which exists in the name of virtfn symlink.
         */
-       for (id = 0; id < pci_num_vf(pf); id++) {
-               if (!sriov->vfs_ctx[id].enabled)
-                       continue;
-
-               if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
-                       break;
-       }
-
-       if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+       id = pci_iov_vf_id(vf);
+       if (id < 0 || !sriov->vfs_ctx[id].enabled)
                return -EINVAL;

        return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);

Thanks

> 
> > ---
> >  drivers/pci/iov.c   | 14 ++++++++++++++
> >  include/linux/pci.h |  7 ++++++-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index dafdc652fcd0..e7751fa3fe0b 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> >  }
> >  EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
> >  
> > +int pci_iov_vf_id(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *pf;
> > +
> > +	if (!dev->is_virtfn)
> > +		return -EINVAL;
> > +
> > +	pf = pci_physfn(dev);
> > +	return (((dev->bus->number << 8) + dev->devfn) -
> > +		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
> > +	       pf->sriov->stride;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_iov_vf_id);
> > +
> >  /*
> >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> >   * change when NumVFs changes.
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index cd8aa6fce204..4d6c73506e18 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -2153,7 +2153,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
> >  #ifdef CONFIG_PCI_IOV
> >  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> >  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> > -
> > +int pci_iov_vf_id(struct pci_dev *dev);
> >  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> >  void pci_disable_sriov(struct pci_dev *dev);
> >  
> > @@ -2181,6 +2181,11 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
> >  {
> >  	return -ENOSYS;
> >  }
> > +static inline int pci_iov_vf_id(struct pci_dev *dev)
> > +{
> > +	return -ENOSYS;
> > +}
> > +
> 
> Drop the blank line to match the surrounding stubs.

Sure, thanks

> 
> >  static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
> >  { return -ENODEV; }
> 

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* RE: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-22 10:38 ` [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Leon Romanovsky
@ 2021-09-23 10:33   ` Shameerali Kolothum Thodi
  2021-09-23 11:17     ` Leon Romanovsky
  2021-09-27 22:46   ` Alex Williamson
  1 sibling, 1 reply; 57+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-09-23 10:33 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, David S. Miller,
	Jakub Kicinski, Kirti Wankhede, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed



> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: 22 September 2021 11:39
> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@nvidia.com>
> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@nvidia.com>
> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
> transition validity
> 
> From: Yishai Hadas <yishaih@nvidia.com>
> 
> Add an API in the core layer to check migration state transition validity
> as part of a migration flow.
> 
> The valid transitions follow the expected usage as described in
> uapi/vfio.h and triggered by QEMU.
> 
> This ensures that all migration implementations follow a consistent
> migration state machine.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/vfio.c  | 41 +++++++++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h |  1 +
>  2 files changed, 42 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 3c034fe14ccb..c3ca33e513c8 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct inode
> *inode, struct file *filep)
>  	return 0;
>  }
> 
> +/**
> + * vfio_change_migration_state_allowed - Checks whether a migration state
> + *   transition is valid.
> + * @new_state: The new state to move to.
> + * @old_state: The old state.
> + * Return: true if the transition is valid.
> + */
> +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state)
> +{
> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> +		[VFIO_DEVICE_STATE_STOP] = {
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> +		},
> +		[VFIO_DEVICE_STATE_RUNNING] = {
> +			[VFIO_DEVICE_STATE_STOP] = 1,
> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> +			[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
> = 1,

Do we need to allow _RESUMING state here or not? As per the "State transitions"
section from uapi/linux/vfio.h, 

" * 4. To start the resuming phase, the device state should be transitioned from
 *    the _RUNNING to the _RESUMING state."

IIRC, I have seen that transition happening on the destination dev while testing the 
HiSilicon ACC dev migration. 

Thanks,
Shameer

> +		},
> +		[VFIO_DEVICE_STATE_SAVING] = {
> +			[VFIO_DEVICE_STATE_STOP] = 1,
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +		},
> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING] = {
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> +		},
> +		[VFIO_DEVICE_STATE_RESUMING] = {
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +			[VFIO_DEVICE_STATE_STOP] = 1,
> +		},
> +	};
> +
> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
> +		return false;
> +
> +	return vfio_from_state_table[old_state][new_state];
> +}
> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
> +
>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index b53a9557884a..e65137a708f1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -83,6 +83,7 @@ extern struct vfio_device
> *vfio_device_get_from_dev(struct device *dev);
>  extern void vfio_device_put(struct vfio_device *device);
> 
>  int vfio_assign_device_set(struct vfio_device *device, void *set_id);
> +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state);
> 
>  /* events for the backend driver notify callback */
>  enum vfio_iommu_notify_type {
> --
> 2.31.1



* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-23 10:33   ` Shameerali Kolothum Thodi
@ 2021-09-23 11:17     ` Leon Romanovsky
  2021-09-23 13:55       ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-23 11:17 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed

On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum Thodi wrote:
> 
> 
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: 22 September 2021 11:39
> > To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> > <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
> > S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
> > Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
> > linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> > linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> > <saeedm@nvidia.com>
> > Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
> > transition validity
> > 
> > From: Yishai Hadas <yishaih@nvidia.com>
> > 
> > Add an API in the core layer to check migration state transition validity
> > as part of a migration flow.
> > 
> > The valid transitions follow the expected usage as described in
> > uapi/vfio.h and triggered by QEMU.
> > 
> > This ensures that all migration implementations follow a consistent
> > migration state machine.
> > 
> > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/vfio/vfio.c  | 41 +++++++++++++++++++++++++++++++++++++++++
> >  include/linux/vfio.h |  1 +
> >  2 files changed, 42 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 3c034fe14ccb..c3ca33e513c8 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct inode
> > *inode, struct file *filep)
> >  	return 0;
> >  }
> > 
> > +/**
> > + * vfio_change_migration_state_allowed - Checks whether a migration state
> > + *   transition is valid.
> > + * @new_state: The new state to move to.
> > + * @old_state: The old state.
> > + * Return: true if the transition is valid.
> > + */
> > +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state)
> > +{
> > +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> > +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> > +		[VFIO_DEVICE_STATE_STOP] = {
> > +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> > +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> > +		},
> > +		[VFIO_DEVICE_STATE_RUNNING] = {
> > +			[VFIO_DEVICE_STATE_STOP] = 1,
> > +			[VFIO_DEVICE_STATE_SAVING] = 1,
> > +			[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
> > = 1,
> 
> Do we need to allow _RESUMING state here or not? As per the "State transitions"
> section from uapi/linux/vfio.h, 

It looks like we missed this state transition.

Thanks

> 
> " * 4. To start the resuming phase, the device state should be transitioned from
>  *    the _RUNNING to the _RESUMING state."
> 
> IIRC, I have seen that transition happening on the destination dev while testing the 
> HiSilicon ACC dev migration. 
> 
> Thanks,
> Shameer
> 
> > +		},
> > +		[VFIO_DEVICE_STATE_SAVING] = {
> > +			[VFIO_DEVICE_STATE_STOP] = 1,
> > +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> > +		},
> > +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING] = {
> > +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> > +			[VFIO_DEVICE_STATE_SAVING] = 1,
> > +		},
> > +		[VFIO_DEVICE_STATE_RESUMING] = {
> > +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> > +			[VFIO_DEVICE_STATE_STOP] = 1,
> > +		},
> > +	};
> > +
> > +	if (new_state > MAX_STATE || old_state > MAX_STATE)
> > +		return false;
> > +
> > +	return vfio_from_state_table[old_state][new_state];
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
> > +
> >  static long vfio_device_fops_unl_ioctl(struct file *filep,
> >  				       unsigned int cmd, unsigned long arg)
> >  {
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index b53a9557884a..e65137a708f1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -83,6 +83,7 @@ extern struct vfio_device
> > *vfio_device_get_from_dev(struct device *dev);
> >  extern void vfio_device_put(struct vfio_device *device);
> > 
> >  int vfio_assign_device_set(struct vfio_device *device, void *set_id);
> > +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state);
> > 
> >  /* events for the backend driver notify callback */
> >  enum vfio_iommu_notify_type {
> > --
> > 2.31.1
> 


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-23 11:17     ` Leon Romanovsky
@ 2021-09-23 13:55       ` Max Gurtovoy
  2021-09-24  7:44         ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-23 13:55 UTC (permalink / raw)
  To: Leon Romanovsky, Shameerali Kolothum Thodi
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed


On 9/23/2021 2:17 PM, Leon Romanovsky wrote:
> On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum Thodi wrote:
>>
>>> -----Original Message-----
>>> From: Leon Romanovsky [mailto:leon@kernel.org]
>>> Sent: 22 September 2021 11:39
>>> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@nvidia.com>
>>> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
>>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
>>> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
>>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
>>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
>>> <saeedm@nvidia.com>
>>> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
>>> transition validity
>>>
>>> From: Yishai Hadas <yishaih@nvidia.com>
>>>
>>> Add an API in the core layer to check migration state transition validity
>>> as part of a migration flow.
>>>
>>> The valid transitions follow the expected usage as described in
>>> uapi/vfio.h and triggered by QEMU.
>>>
>>> This ensures that all migration implementations follow a consistent
>>> migration state machine.
>>>
>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>> ---
>>>   drivers/vfio/vfio.c  | 41 +++++++++++++++++++++++++++++++++++++++++
>>>   include/linux/vfio.h |  1 +
>>>   2 files changed, 42 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>> index 3c034fe14ccb..c3ca33e513c8 100644
>>> --- a/drivers/vfio/vfio.c
>>> +++ b/drivers/vfio/vfio.c
>>> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct inode
>>> *inode, struct file *filep)
>>>   	return 0;
>>>   }
>>>
>>> +/**
>>> + * vfio_change_migration_state_allowed - Checks whether a migration state
>>> + *   transition is valid.
>>> + * @new_state: The new state to move to.
>>> + * @old_state: The old state.
>>> + * Return: true if the transition is valid.
>>> + */
>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state)
>>> +{
>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>> +		},
>>> +		[VFIO_DEVICE_STATE_RUNNING] = {
>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>> +			[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
>>> = 1,
>> Do we need to allow _RESUMING state here or not? As per the "State transitions"
>> section from uapi/linux/vfio.h,
> It looks like we missed this state transition.
>
> Thanks

I'm not sure this state transition is valid.

Kirti, when would we want to move from RUNNING to RESUMING?

Shameerali, can you please re-test and update if you see this transition?


>
>> " * 4. To start the resuming phase, the device state should be transitioned from
>>   *    the _RUNNING to the _RESUMING state."
>>
>> IIRC, I have seen that transition happening on the destination dev while testing the
>> HiSilicon ACC dev migration.
>>
>> Thanks,
>> Shameer
>>
>>> +		},
>>> +		[VFIO_DEVICE_STATE_SAVING] = {
>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>> +		},
>>> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING] = {
>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>> +		},
>>> +		[VFIO_DEVICE_STATE_RESUMING] = {
>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>> +		},
>>> +	};
>>> +
>>> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
>>> +		return false;
>>> +
>>> +	return vfio_from_state_table[old_state][new_state];
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
>>> +
>>>   static long vfio_device_fops_unl_ioctl(struct file *filep,
>>>   				       unsigned int cmd, unsigned long arg)
>>>   {
>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>> index b53a9557884a..e65137a708f1 100644
>>> --- a/include/linux/vfio.h
>>> +++ b/include/linux/vfio.h
>>> @@ -83,6 +83,7 @@ extern struct vfio_device
>>> *vfio_device_get_from_dev(struct device *dev);
>>>   extern void vfio_device_put(struct vfio_device *device);
>>>
>>>   int vfio_assign_device_set(struct vfio_device *device, void *set_id);
>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state);
>>>
>>>   /* events for the backend driver notify callback */
>>>   enum vfio_iommu_notify_type {
>>> --
>>> 2.31.1


* Re: [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional
  2021-09-22 10:38 ` [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional Leon Romanovsky
@ 2021-09-23 13:57   ` Max Gurtovoy
  0 siblings, 0 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-23 13:57 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, David S. Miller,
	Jakub Kicinski, Kirti Wankhede, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed


On 9/22/2021 1:38 PM, Leon Romanovsky wrote:
> From: Yishai Hadas <yishaih@nvidia.com>
>
> Make the region->release() function optional as in some cases there is
> nothing to do by driver as part of it.
>
> This is needed for coming patch from this series once we add
> mlx5_vfio_cpi driver to support live migration but we don't need a

mlx5_vfio_pci *typo


> migration release function.
>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>   drivers/vfio/pci/vfio_pci_core.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 68198e0f2a63..3ddc3adb24de 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -341,7 +341,8 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>   	vdev->virq_disabled = false;
>   
>   	for (i = 0; i < vdev->num_regions; i++)
> -		vdev->region[i].ops->release(vdev, &vdev->region[i]);
> +		if (vdev->region[i].ops->release)
> +			vdev->region[i].ops->release(vdev, &vdev->region[i]);
>   
>   	vdev->num_regions = 0;
>   	kfree(vdev->region);

Looks good,

Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>



* Re: [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures
  2021-09-22 10:38 ` [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures Leon Romanovsky
@ 2021-09-24  5:48   ` Mark Zhang
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Zhang @ 2021-09-24  5:48 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, Alex Williamson, Bjorn Helgaas, Jakub Kicinski,
	Kirti Wankhede, kvm, linux-kernel, linux-pci, linux-rdma, netdev,
	Saeed Mahameed

On 9/22/2021 6:38 PM, Leon Romanovsky wrote:
> From: Yishai Hadas <yishaih@nvidia.com>
> 
> Introduce migration IFC related stuff to enable migration commands.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>   include/linux/mlx5/mlx5_ifc.h | 145 +++++++++++++++++++++++++++++++++-
>   1 file changed, 144 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index d90a65b6824f..366c7b030eb7 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -126,6 +126,11 @@ enum {
>   	MLX5_CMD_OP_QUERY_SF_PARTITION            = 0x111,
>   	MLX5_CMD_OP_ALLOC_SF                      = 0x113,
>   	MLX5_CMD_OP_DEALLOC_SF                    = 0x114,
> +	MLX5_CMD_OP_SUSPEND_VHCA                  = 0x115,
> +	MLX5_CMD_OP_RESUME_VHCA                   = 0x116,
> +	MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE    = 0x117,
> +	MLX5_CMD_OP_SAVE_VHCA_STATE               = 0x118,
> +	MLX5_CMD_OP_LOAD_VHCA_STATE               = 0x119,
>   	MLX5_CMD_OP_CREATE_MKEY                   = 0x200,
>   	MLX5_CMD_OP_QUERY_MKEY                    = 0x201,
>   	MLX5_CMD_OP_DESTROY_MKEY                  = 0x202,
> @@ -1719,7 +1724,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
>   	u8         reserved_at_682[0x1];
>   	u8         log_max_sf[0x5];
>   	u8         apu[0x1];
> -	u8         reserved_at_689[0x7];
> +	u8         reserved_at_689[0x4];
> +	u8         migration[0x1];
> +	u8         reserved_at_68d[0x2];

Should it be "reserved_at_68e[0x2]"?

>   	u8         log_min_sf_size[0x8];
>   	u8         max_num_sf_partitions[0x8];
>   



* RE: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-23 13:55       ` Max Gurtovoy
@ 2021-09-24  7:44         ` Shameerali Kolothum Thodi
  2021-09-24  9:37           ` Kirti Wankhede
  2021-09-26  9:09           ` Max Gurtovoy
  0 siblings, 2 replies; 57+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-09-24  7:44 UTC (permalink / raw)
  To: Max Gurtovoy, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	liulongfang



> -----Original Message-----
> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> Sent: 23 September 2021 14:56
> To: Leon Romanovsky <leon@kernel.org>; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@nvidia.com>
> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
> transition validity
> 
> 
> On 9/23/2021 2:17 PM, Leon Romanovsky wrote:
> > On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum Thodi
> wrote:
> >>
> >>> -----Original Message-----
> >>> From: Leon Romanovsky [mailto:leon@kernel.org]
> >>> Sent: 22 September 2021 11:39
> >>> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> <jgg@nvidia.com>
> >>> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> >>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
> David
> >>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
> >>> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
> >>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> >>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> >>> <saeedm@nvidia.com>
> >>> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
> >>> transition validity
> >>>
> >>> From: Yishai Hadas <yishaih@nvidia.com>
> >>>
> >>> Add an API in the core layer to check migration state transition validity
> >>> as part of a migration flow.
> >>>
> >>> The valid transitions follow the expected usage as described in
> >>> uapi/vfio.h and triggered by QEMU.
> >>>
> >>> This ensures that all migration implementations follow a consistent
> >>> migration state machine.
> >>>
> >>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >>> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >>> ---
> >>>   drivers/vfio/vfio.c  | 41
> +++++++++++++++++++++++++++++++++++++++++
> >>>   include/linux/vfio.h |  1 +
> >>>   2 files changed, 42 insertions(+)
> >>>
> >>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>> index 3c034fe14ccb..c3ca33e513c8 100644
> >>> --- a/drivers/vfio/vfio.c
> >>> +++ b/drivers/vfio/vfio.c
> >>> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct
> inode
> >>> *inode, struct file *filep)
> >>>   	return 0;
> >>>   }
> >>>
> >>> +/**
> >>> + * vfio_change_migration_state_allowed - Checks whether a migration
> state
> >>> + *   transition is valid.
> >>> + * @new_state: The new state to move to.
> >>> + * @old_state: The old state.
> >>> + * Return: true if the transition is valid.
> >>> + */
> >>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
> old_state)
> >>> +{
> >>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> >>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE +
> 1] = {
> >>> +		[VFIO_DEVICE_STATE_STOP] = {
> >>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> >>> +		},
> >>> +		[VFIO_DEVICE_STATE_RUNNING] = {
> >>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> >>> +			[VFIO_DEVICE_STATE_SAVING |
> VFIO_DEVICE_STATE_RUNNING]
> >>> = 1,
> >> Do we need to allow _RESUMING state here or not? As per the "State
> transitions"
> >> section from uapi/linux/vfio.h,
> > It looks like we missed this state transition.
> >
> > Thanks
> 
> I'm not sure this state transition is valid.
> 
> Kirti, When we would like to move from RUNNING to RESUMING ?

I guess it depends on what you report as your device's default state.

For HiSilicon ACC migration driver, we set the default to _RUNNING.

And when the migration starts, the destination-side QEMU sets the
device state to _RESUMING (vfio_load_state()).

From the documentation, it looks like the assumed default state of
a VFIO device is _RUNNING.

"
*  001b => Device running, which is the default state
"

> 
> Sameerali, can you please re-test and update if you see this transition ?

Yes. And if I change the default state to _STOP, then the transition
is from _STOP --> _RESUMING.

But the documentation on state transitions doesn't list _STOP --> _RESUMING
as a valid transition.

Thanks,
Shameer 

> 
> 
> >
> >> " * 4. To start the resuming phase, the device state should be transitioned
> from
> >>   *    the _RUNNING to the _RESUMING state."
> >>
> >> IIRC, I have seen that transition happening on the destination dev while
> testing the
> >> HiSilicon ACC dev migration.
> >>
> >> Thanks,
> >> Shameer
> >>
> >>> +		},
> >>> +		[VFIO_DEVICE_STATE_SAVING] = {
> >>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>> +		},
> >>> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
> = {
> >>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> >>> +		},
> >>> +		[VFIO_DEVICE_STATE_RESUMING] = {
> >>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>> +		},
> >>> +	};
> >>> +
> >>> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
> >>> +		return false;
> >>> +
> >>> +	return vfio_from_state_table[old_state][new_state];
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
> >>> +
> >>>   static long vfio_device_fops_unl_ioctl(struct file *filep,
> >>>   				       unsigned int cmd, unsigned long arg)
> >>>   {
> >>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>> index b53a9557884a..e65137a708f1 100644
> >>> --- a/include/linux/vfio.h
> >>> +++ b/include/linux/vfio.h
> >>> @@ -83,6 +83,7 @@ extern struct vfio_device
> >>> *vfio_device_get_from_dev(struct device *dev);
> >>>   extern void vfio_device_put(struct vfio_device *device);
> >>>
> >>>   int vfio_assign_device_set(struct vfio_device *device, void *set_id);
> >>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
> old_state);
> >>>
> >>>   /* events for the backend driver notify callback */
> >>>   enum vfio_iommu_notify_type {
> >>> --
> >>> 2.31.1


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-24  7:44         ` Shameerali Kolothum Thodi
@ 2021-09-24  9:37           ` Kirti Wankhede
  2021-09-26  9:09           ` Max Gurtovoy
  1 sibling, 0 replies; 57+ messages in thread
From: Kirti Wankhede @ 2021-09-24  9:37 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Max Gurtovoy, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	liulongfang



On 9/24/2021 1:14 PM, Shameerali Kolothum Thodi wrote:
> 
> 
>> -----Original Message-----
>> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
>> Sent: 23 September 2021 14:56
>> To: Leon Romanovsky <leon@kernel.org>; Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>
>> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
>> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
>> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
>> <saeedm@nvidia.com>
>> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
>> transition validity
>>
>>
>> On 9/23/2021 2:17 PM, Leon Romanovsky wrote:
>>> On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum Thodi
>> wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Leon Romanovsky [mailto:leon@kernel.org]
>>>>> Sent: 22 September 2021 11:39
>>>>> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
>> <jgg@nvidia.com>
>>>>> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>>>>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
>> David
>>>>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
>>>>> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
>>>>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
>>>>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
>>>>> <saeedm@nvidia.com>
>>>>> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
>>>>> transition validity
>>>>>
>>>>> From: Yishai Hadas <yishaih@nvidia.com>
>>>>>
>>>>> Add an API in the core layer to check migration state transition validity
>>>>> as part of a migration flow.
>>>>>
>>>>> The valid transitions follow the expected usage as described in
>>>>> uapi/vfio.h and triggered by QEMU.
>>>>>
>>>>> This ensures that all migration implementations follow a consistent
>>>>> migration state machine.
>>>>>
>>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>>> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>>>> ---
>>>>>    drivers/vfio/vfio.c  | 41
>> +++++++++++++++++++++++++++++++++++++++++
>>>>>    include/linux/vfio.h |  1 +
>>>>>    2 files changed, 42 insertions(+)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>> index 3c034fe14ccb..c3ca33e513c8 100644
>>>>> --- a/drivers/vfio/vfio.c
>>>>> +++ b/drivers/vfio/vfio.c
>>>>> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct
>> inode
>>>>> *inode, struct file *filep)
>>>>>    	return 0;
>>>>>    }
>>>>>
>>>>> +/**
>>>>> + * vfio_change_migration_state_allowed - Checks whether a migration
>> state
>>>>> + *   transition is valid.
>>>>> + * @new_state: The new state to move to.
>>>>> + * @old_state: The old state.
>>>>> + * Return: true if the transition is valid.
>>>>> + */
>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
>> old_state)
>>>>> +{
>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE +
>> 1] = {
>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_RUNNING] = {
>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_SAVING |
>> VFIO_DEVICE_STATE_RUNNING]
>>>>> = 1,
>>>> Do we need to allow _RESUMING state here or not? As per the "State
>> transitions"
>>>> section from uapi/linux/vfio.h,
>>> It looks like we missed this state transition.
>>>
>>> Thanks
>>
>> I'm not sure this state transition is valid.
>>
>> Kirti, When we would like to move from RUNNING to RESUMING ?
> 
> I guess it depends on what you report as your dev default state.
> 
> For HiSilicon ACC migration driver, we set the default to _RUNNING.
> 
> And when the migration starts, the destination side Qemu, set the
> device state to _RESUMING(vfio_load_state()).
> 
>  From the documentation, it looks like the assumption on default state of
> the VFIO dev is _RUNNING.
> 

That's right. In QEMU, the VFIO device state at init is _RUNNING to maintain
backward compatibility, since migration support was added later.

RUNNING -> RESUMING state transition is valid.

Thanks,
Kirti

> "
> *  001b => Device running, which is the default state
> "
> 
>>
>> Sameerali, can you please re-test and update if you see this transition ?
> 
> Yes. And if I change the default state to _STOP, then the transition
> is from _STOP --> _RESUMING.
> 
> But the documentation on State transitions doesn't have _STOP --> _RESUMING
> transition as valid.
> 
> Thanks,
> Shameer
> 
>>
>>
>>>
>>>> " * 4. To start the resuming phase, the device state should be transitioned
>> from
>>>>    *    the _RUNNING to the _RESUMING state."
>>>>
>>>> IIRC, I have seen that transition happening on the destination dev while
>> testing the
>>>> HiSilicon ACC dev migration.
>>>>
>>>> Thanks,
>>>> Shameer
>>>>
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_SAVING] = {
>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
>> = {
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_RESUMING] = {
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>> +		},
>>>>> +	};
>>>>> +
>>>>> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
>>>>> +		return false;
>>>>> +
>>>>> +	return vfio_from_state_table[old_state][new_state];
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
>>>>> +
>>>>>    static long vfio_device_fops_unl_ioctl(struct file *filep,
>>>>>    				       unsigned int cmd, unsigned long arg)
>>>>>    {
>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>> index b53a9557884a..e65137a708f1 100644
>>>>> --- a/include/linux/vfio.h
>>>>> +++ b/include/linux/vfio.h
>>>>> @@ -83,6 +83,7 @@ extern struct vfio_device
>>>>> *vfio_device_get_from_dev(struct device *dev);
>>>>>    extern void vfio_device_put(struct vfio_device *device);
>>>>>
>>>>>    int vfio_assign_device_set(struct vfio_device *device, void *set_id);
>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
>> old_state);
>>>>>
>>>>>    /* events for the backend driver notify callback */
>>>>>    enum vfio_iommu_notify_type {
>>>>> --
>>>>> 2.31.1

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-23  6:35     ` Leon Romanovsky
@ 2021-09-24 13:08       ` Bjorn Helgaas
  2021-09-25 10:10         ` Leon Romanovsky
  0 siblings, 1 reply; 57+ messages in thread
From: Bjorn Helgaas @ 2021-09-24 13:08 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Thu, Sep 23, 2021 at 09:35:32AM +0300, Leon Romanovsky wrote:
> On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> > On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > 
> > > The PCI core uses the VF index internally, often called the vf_id,
> > > during the setup of the VF, eg pci_iov_add_virtfn().
> > > 
> > > This index is needed for device drivers that implement live migration
> > > for their internal operations that configure/control their VFs.
> > >
> > > Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> > > from this series needs it and not the bus/device/function which is
> > > exposed today.
> > > 
> > > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > > was used to create the bus/device/function.
> > > 
> > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> > same thing as pci_iov_vf_id() by iterating through VFs until it finds
> > one with a matching devfn (although it *doesn't* check for a matching
> > bus number, which seems like a bug).
> > 
> > Maybe that should use pci_iov_vf_id()?
> 
> Yes, I gave the same comment internally, and we decided to simply reduce the
> amount of changes in mlx5_core to have fewer distractions and submit it as a
> follow-up. Most likely we will add this hunk in v1.

I guess it backfired as far as reducing distractions, because now it
just looks like a job half-done.

And it still looks like the existing code is buggy.  This is called
via sysfs, so if the PF is on bus X and the user writes to
sriov_vf_msix_count for a VF on bus X+1, it looks like
mlx5_core_sriov_set_msix_vec_count() will set the count for the wrong
VF.

Bjorn

* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-24 13:08       ` Bjorn Helgaas
@ 2021-09-25 10:10         ` Leon Romanovsky
  2021-09-25 17:41           ` Bjorn Helgaas
  0 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-25 10:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Fri, Sep 24, 2021 at 08:08:45AM -0500, Bjorn Helgaas wrote:
> On Thu, Sep 23, 2021 at 09:35:32AM +0300, Leon Romanovsky wrote:
> > On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > 
> > > > The PCI core uses the VF index internally, often called the vf_id,
> > > > during the setup of the VF, eg pci_iov_add_virtfn().
> > > > 
> > > > This index is needed for device drivers that implement live migration
> > > > for their internal operations that configure/control their VFs.
> > > >
> > > > Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> > > > from this series needs it and not the bus/device/function which is
> > > > exposed today.
> > > > 
> > > > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > > > was used to create the bus/device/function.
> > > > 
> > > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > > 
> > > mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> > > same thing as pci_iov_vf_id() by iterating through VFs until it finds
> > > one with a matching devfn (although it *doesn't* check for a matching
> > > bus number, which seems like a bug).
> > > 
> > > Maybe that should use pci_iov_vf_id()?
> > 
> > Yes, I gave the same comment internally, and we decided to simply reduce the
> > amount of changes in mlx5_core to have fewer distractions and submit it as a
> > follow-up. Most likely we will add this hunk in v1.
> 
> I guess it backfired as far as reducing distractions, because now it
> just looks like a job half-done.

Partially :)
I didn't expect this series to be accepted at v0; we wanted to
gather feedback as early as possible.

> 
> And it still looks like the existing code is buggy.  This is called
> via sysfs, so if the PF is on bus X and the user writes to
> sriov_vf_msix_count for a VF on bus X+1, it looks like
> mlx5_core_sriov_set_msix_vec_count() will set the count for the wrong
> VF.

In mlx5_core_sriov_set_msix_vec_count(), we receive a VF that is connected
to a PF which has a "struct mlx5_core_dev". My expectation is that they share
the same bus, as that PF was the one that created the VFs. The mlx5 devices
support up to 256 VFs, which is far below the bus split mentioned in the PCI spec.

How can VF and their respective PF have different bus numbers?

Thanks

> 
> Bjorn

* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-25 10:10         ` Leon Romanovsky
@ 2021-09-25 17:41           ` Bjorn Helgaas
  2021-09-26  6:36             ` Leon Romanovsky
  0 siblings, 1 reply; 57+ messages in thread
From: Bjorn Helgaas @ 2021-09-25 17:41 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Sat, Sep 25, 2021 at 01:10:39PM +0300, Leon Romanovsky wrote:
> On Fri, Sep 24, 2021 at 08:08:45AM -0500, Bjorn Helgaas wrote:
> > On Thu, Sep 23, 2021 at 09:35:32AM +0300, Leon Romanovsky wrote:
> > > On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > 
> > > > > The PCI core uses the VF index internally, often called the vf_id,
> > > > > during the setup of the VF, eg pci_iov_add_virtfn().
> > > > > 
> > > > > This index is needed for device drivers that implement live migration
> > > > > for their internal operations that configure/control their VFs.
> > > > >
> > > > > Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> > > > > from this series needs it and not the bus/device/function which is
> > > > > exposed today.
> > > > > 
> > > > > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > > > > was used to create the bus/device/function.
> > > > > 
> > > > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > > > 
> > > > mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> > > > same thing as pci_iov_vf_id() by iterating through VFs until it finds
> > > > one with a matching devfn (although it *doesn't* check for a matching
> > > > bus number, which seems like a bug).
> ...

> > And it still looks like the existing code is buggy.  This is called
> > via sysfs, so if the PF is on bus X and the user writes to
> > sriov_vf_msix_count for a VF on bus X+1, it looks like
> > mlx5_core_sriov_set_msix_vec_count() will set the count for the wrong
> > VF.
> 
> In mlx5_core_sriov_set_msix_vec_count(), we receive a VF that is connected
> to a PF which has a "struct mlx5_core_dev". My expectation is that they share
> the same bus, as that PF was the one that created the VFs. The mlx5 devices
> support up to 256 VFs, which is far below the bus split mentioned in the PCI spec.
> 
> How can VF and their respective PF have different bus numbers?

See PCIe r5.0, sec 9.2.1.2.  For example,

  PF 0 on bus 20
    First VF Offset   1
    VF Stride         1
    NumVFs          511
  VF 0,1   through VF 0,255 on bus 20
  VF 0,256 through VF 0,511 on bus 21

This is implemented in pci_iov_add_virtfn(), which computes the bus
number and devfn from the VF ID.

pci_iov_virtfn_devfn(VF 0,1) == pci_iov_virtfn_devfn(VF 0,256), so if
the user writes to sriov_vf_msix_count for VF 0,256, it looks like
we'll call mlx5_set_msix_vec_count() for VF 0,1 instead of VF 0,256.

The spec encourages devices that require no more than 256 devices to
locate them all on the same bus number (PCIe r5.0, sec 9.1), so if you
only have 255 VFs, you may avoid the problem.

But in mlx5_core_sriov_set_msix_vec_count(), it's not obvious that it
is safe to assume the bus number is the same.

Bjorn

* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-25 17:41           ` Bjorn Helgaas
@ 2021-09-26  6:36             ` Leon Romanovsky
  2021-09-26 20:23               ` Bjorn Helgaas
  0 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-26  6:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Sat, Sep 25, 2021 at 12:41:15PM -0500, Bjorn Helgaas wrote:
> On Sat, Sep 25, 2021 at 01:10:39PM +0300, Leon Romanovsky wrote:
> > On Fri, Sep 24, 2021 at 08:08:45AM -0500, Bjorn Helgaas wrote:
> > > On Thu, Sep 23, 2021 at 09:35:32AM +0300, Leon Romanovsky wrote:
> > > > On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> > > > > On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > 
> > > > > > The PCI core uses the VF index internally, often called the vf_id,
> > > > > > during the setup of the VF, eg pci_iov_add_virtfn().
> > > > > > 
> > > > > > This index is needed for device drivers that implement live migration
> > > > > > for their internal operations that configure/control their VFs.
> > > > > >
> > > > > > Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> > > > > > from this series needs it and not the bus/device/function which is
> > > > > > exposed today.
> > > > > > 
> > > > > > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > > > > > was used to create the bus/device/function.
> > > > > > 
> > > > > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > 
> > > > > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > > > > 
> > > > > mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> > > > > same thing as pci_iov_vf_id() by iterating through VFs until it finds
> > > > > one with a matching devfn (although it *doesn't* check for a matching
> > > > > bus number, which seems like a bug).
> > ...
> 
> > > And it still looks like the existing code is buggy.  This is called
> > > via sysfs, so if the PF is on bus X and the user writes to
> > > sriov_vf_msix_count for a VF on bus X+1, it looks like
> > > mlx5_core_sriov_set_msix_vec_count() will set the count for the wrong
> > > VF.
> > 
> > In mlx5_core_sriov_set_msix_vec_count(), we receive a VF that is connected
> > to a PF which has a "struct mlx5_core_dev". My expectation is that they share
> > the same bus, as that PF was the one that created the VFs. The mlx5 devices
> > support up to 256 VFs, which is far below the bus split mentioned in the PCI spec.
> > 
> > How can VF and their respective PF have different bus numbers?
> 
> See PCIe r5.0, sec 9.2.1.2.  For example,
> 
>   PF 0 on bus 20
>     First VF Offset   1
>     VF Stride         1
>     NumVFs          511
>   VF 0,1   through VF 0,255 on bus 20
>   VF 0,256 through VF 0,511 on bus 21
> 
> This is implemented in pci_iov_add_virtfn(), which computes the bus
> number and devfn from the VF ID.
> 
> pci_iov_virtfn_devfn(VF 0,1) == pci_iov_virtfn_devfn(VF 0,256), so if
> the user writes to sriov_vf_msix_count for VF 0,256, it looks like
> we'll call mlx5_set_msix_vec_count() for VF 0,1 instead of VF 0,256.

This is the PCI spec bus split that I mentioned.

> 
> The spec encourages devices that require no more than 256 devices to
> locate them all on the same bus number (PCIe r5.0, sec 9.1), so if you
> only have 255 VFs, you may avoid the problem.
> 
> But in mlx5_core_sriov_set_msix_vec_count(), it's not obvious that it
> is safe to assume the bus number is the same.

No problem, we will make it clearer.

> 
> Bjorn

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-24  7:44         ` Shameerali Kolothum Thodi
  2021-09-24  9:37           ` Kirti Wankhede
@ 2021-09-26  9:09           ` Max Gurtovoy
  2021-09-26 16:17             ` Shameerali Kolothum Thodi
  1 sibling, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-26  9:09 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	liulongfang


On 9/24/2021 10:44 AM, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
>> Sent: 23 September 2021 14:56
>> To: Leon Romanovsky <leon@kernel.org>; Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>
>> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
>> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
>> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
>> <saeedm@nvidia.com>
>> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
>> transition validity
>>
>>
>> On 9/23/2021 2:17 PM, Leon Romanovsky wrote:
>>> On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum Thodi
>> wrote:
>>>>> -----Original Message-----
>>>>> From: Leon Romanovsky [mailto:leon@kernel.org]
>>>>> Sent: 22 September 2021 11:39
>>>>> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
>> <jgg@nvidia.com>
>>>>> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>>>>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
>> David
>>>>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
>>>>> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
>>>>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
>>>>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
>>>>> <saeedm@nvidia.com>
>>>>> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
>>>>> transition validity
>>>>>
>>>>> From: Yishai Hadas <yishaih@nvidia.com>
>>>>>
>>>>> Add an API in the core layer to check migration state transition validity
>>>>> as part of a migration flow.
>>>>>
>>>>> The valid transitions follow the expected usage as described in
>>>>> uapi/vfio.h and triggered by QEMU.
>>>>>
>>>>> This ensures that all migration implementations follow a consistent
>>>>> migration state machine.
>>>>>
>>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>>> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>>>> ---
>>>>>    drivers/vfio/vfio.c  | 41
>> +++++++++++++++++++++++++++++++++++++++++
>>>>>    include/linux/vfio.h |  1 +
>>>>>    2 files changed, 42 insertions(+)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>> index 3c034fe14ccb..c3ca33e513c8 100644
>>>>> --- a/drivers/vfio/vfio.c
>>>>> +++ b/drivers/vfio/vfio.c
>>>>> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct
>> inode
>>>>> *inode, struct file *filep)
>>>>>    	return 0;
>>>>>    }
>>>>>
>>>>> +/**
>>>>> + * vfio_change_migration_state_allowed - Checks whether a migration
>> state
>>>>> + *   transition is valid.
>>>>> + * @new_state: The new state to move to.
>>>>> + * @old_state: The old state.
>>>>> + * Return: true if the transition is valid.
>>>>> + */
>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
>> old_state)
>>>>> +{
>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE +
>> 1] = {
>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_RUNNING] = {
>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_SAVING |
>> VFIO_DEVICE_STATE_RUNNING]
>>>>> = 1,
>>>> Do we need to allow _RESUMING state here or not? As per the "State
>> transitions"
>>>> section from uapi/linux/vfio.h,
>>> It looks like we missed this state transition.
>>>
>>> Thanks
>> I'm not sure this state transition is valid.
>>
>> Kirti, when would we like to move from RUNNING to RESUMING?
> I guess it depends on what you report as your dev default state.
>
> For HiSilicon ACC migration driver, we set the default to _RUNNING.

Where do you set it and report it?


>
> And when the migration starts, the destination-side QEMU sets the
> device state to _RESUMING (vfio_load_state()).
>
>  From the documentation, it looks like the assumption on default state of
> the VFIO dev is _RUNNING.
>
> "
> *  001b => Device running, which is the default state
> "
>
>> Shameerali, can you please re-test and update if you see this transition?
> Yes. And if I change the default state to _STOP, then the transition
> is from _STOP --> _RESUMING.
>
> But the documentation on State transitions doesn't have _STOP --> _RESUMING
> transition as valid.
>
> Thanks,
> Shameer
>
>>
>>>> " * 4. To start the resuming phase, the device state should be transitioned
>> from
>>>>    *    the _RUNNING to the _RESUMING state."
>>>>
>>>> IIRC, I have seen that transition happening on the destination dev while
>> testing the
>>>> HiSilicon ACC dev migration.
>>>>
>>>> Thanks,
>>>> Shameer
>>>>
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_SAVING] = {
>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
>> = {
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>>>> +		},
>>>>> +		[VFIO_DEVICE_STATE_RESUMING] = {
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>> +		},
>>>>> +	};
>>>>> +
>>>>> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
>>>>> +		return false;
>>>>> +
>>>>> +	return vfio_from_state_table[old_state][new_state];
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
>>>>> +
>>>>>    static long vfio_device_fops_unl_ioctl(struct file *filep,
>>>>>    				       unsigned int cmd, unsigned long arg)
>>>>>    {
>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>> index b53a9557884a..e65137a708f1 100644
>>>>> --- a/include/linux/vfio.h
>>>>> +++ b/include/linux/vfio.h
>>>>> @@ -83,6 +83,7 @@ extern struct vfio_device
>>>>> *vfio_device_get_from_dev(struct device *dev);
>>>>>    extern void vfio_device_put(struct vfio_device *device);
>>>>>
>>>>>    int vfio_assign_device_set(struct vfio_device *device, void *set_id);
>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
>> old_state);
>>>>>    /* events for the backend driver notify callback */
>>>>>    enum vfio_iommu_notify_type {
>>>>> --
>>>>> 2.31.1

* RE: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-26  9:09           ` Max Gurtovoy
@ 2021-09-26 16:17             ` Shameerali Kolothum Thodi
  2021-09-27 18:24               ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-09-26 16:17 UTC (permalink / raw)
  To: Max Gurtovoy, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	liulongfang



> -----Original Message-----
> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> Sent: 26 September 2021 10:10
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> Leon Romanovsky <leon@kernel.org>
> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@nvidia.com>; liulongfang <liulongfang@huawei.com>
> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
> transition validity
> 
> 
> On 9/24/2021 10:44 AM, Shameerali Kolothum Thodi wrote:
> >
> >> -----Original Message-----
> >> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> >> Sent: 23 September 2021 14:56
> >> To: Leon Romanovsky <leon@kernel.org>; Shameerali Kolothum Thodi
> >> <shameerali.kolothum.thodi@huawei.com>
> >> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> >> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> >> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
> >> David S. Miller <davem@davemloft.net>; Jakub Kicinski
> >> <kuba@kernel.org>; Kirti Wankhede <kwankhede@nvidia.com>;
> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >> linux-pci@vger.kernel.org; linux-rdma@vger.kernel.org;
> >> netdev@vger.kernel.org; Saeed Mahameed <saeedm@nvidia.com>
> >> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check
> >> migration state transition validity
> >>
> >>
> >> On 9/23/2021 2:17 PM, Leon Romanovsky wrote:
> >>> On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum Thodi
> >> wrote:
> >>>>> -----Original Message-----
> >>>>> From: Leon Romanovsky [mailto:leon@kernel.org]
> >>>>> Sent: 22 September 2021 11:39
> >>>>> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> >> <jgg@nvidia.com>
> >>>>> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> >>>>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
> >> David
> >>>>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>;
> >>>>> Kirti Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
> >>>>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> >>>>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> >>>>> <saeedm@nvidia.com>
> >>>>> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration
> >>>>> state transition validity
> >>>>>
> >>>>> From: Yishai Hadas <yishaih@nvidia.com>
> >>>>>
> >>>>> Add an API in the core layer to check migration state transition
> >>>>> validity as part of a migration flow.
> >>>>>
> >>>>> The valid transitions follow the expected usage as described in
> >>>>> uapi/vfio.h and triggered by QEMU.
> >>>>>
> >>>>> This ensures that all migration implementations follow a
> >>>>> consistent migration state machine.
> >>>>>
> >>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >>>>> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >>>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >>>>> ---
> >>>>>    drivers/vfio/vfio.c  | 41
> >> +++++++++++++++++++++++++++++++++++++++++
> >>>>>    include/linux/vfio.h |  1 +
> >>>>>    2 files changed, 42 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> >>>>> 3c034fe14ccb..c3ca33e513c8 100644
> >>>>> --- a/drivers/vfio/vfio.c
> >>>>> +++ b/drivers/vfio/vfio.c
> >>>>> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct
> >> inode
> >>>>> *inode, struct file *filep)
> >>>>>    	return 0;
> >>>>>    }
> >>>>>
> >>>>> +/**
> >>>>> + * vfio_change_migration_state_allowed - Checks whether a
> >>>>> +migration
> >> state
> >>>>> + *   transition is valid.
> >>>>> + * @new_state: The new state to move to.
> >>>>> + * @old_state: The old state.
> >>>>> + * Return: true if the transition is valid.
> >>>>> + */
> >>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
> >> old_state)
> >>>>> +{
> >>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> >>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE +
> >> 1] = {
> >>>>> +		[VFIO_DEVICE_STATE_STOP] = {
> >>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> >>>>> +		},
> >>>>> +		[VFIO_DEVICE_STATE_RUNNING] = {
> >>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> >>>>> +			[VFIO_DEVICE_STATE_SAVING |
> >> VFIO_DEVICE_STATE_RUNNING]
> >>>>> = 1,
> >>>> Do we need to allow _RESUMING state here or not? As per the "State
> >> transitions"
> >>>> section from uapi/linux/vfio.h,
> >>> It looks like we missed this state transition.
> >>>
> >>> Thanks
> >> I'm not sure this state transition is valid.
> >>
> >> Kirti, when would we like to move from RUNNING to RESUMING?
> > I guess it depends on what you report as your dev default state.
> >
> > For HiSilicon ACC migration driver, we set the default to _RUNNING.
> 
> Where do you set it and report it?

Currently, in _open_device() we set the device_state to _RUNNING.

I think in your case the default is vmig->vfio_dev_state == 0 (_STOP).

> 
> >
> > And when the migration starts, the destination-side QEMU sets the
> > device state to _RESUMING (vfio_load_state()).
> >
> >  From the documentation, it looks like the assumption on default state
> > of the VFIO dev is _RUNNING.
> >
> > "
> > *  001b => Device running, which is the default state "
> >
> >> Shameerali, can you please re-test and update if you see this transition?
> > Yes. And if I change the default state to _STOP, then the transition
> > is from _STOP --> _RESUMING.
> >
> > But the documentation on State transitions doesn't have _STOP -->
> > _RESUMING transition as valid.
> >
> > Thanks,
> > Shameer
> >
> >>
> >>>> " * 4. To start the resuming phase, the device state should be
> >>>> transitioned
> >> from
> >>>>    *    the _RUNNING to the _RESUMING state."
> >>>>
> >>>> IIRC, I have seen that transition happening on the destination dev
> >>>> while
> >> testing the
> >>>> HiSilicon ACC dev migration.
> >>>>
> >>>> Thanks,
> >>>> Shameer
> >>>>
> >>>>> +		},
> >>>>> +		[VFIO_DEVICE_STATE_SAVING] = {
> >>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>> +		},
> >>>>> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
> >> = {
> >>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> >>>>> +		},
> >>>>> +		[VFIO_DEVICE_STATE_RESUMING] = {
> >>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>>>> +		},
> >>>>> +	};
> >>>>> +
> >>>>> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
> >>>>> +		return false;
> >>>>> +
> >>>>> +	return vfio_from_state_table[old_state][new_state];
> >>>>> +}
> >>>>> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
> >>>>> +
> >>>>>    static long vfio_device_fops_unl_ioctl(struct file *filep,
> >>>>>    				       unsigned int cmd, unsigned long arg)
> >>>>>    {
> >>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h index
> >>>>> b53a9557884a..e65137a708f1 100644
> >>>>> --- a/include/linux/vfio.h
> >>>>> +++ b/include/linux/vfio.h
> >>>>> @@ -83,6 +83,7 @@ extern struct vfio_device
> >>>>> *vfio_device_get_from_dev(struct device *dev);
> >>>>>    extern void vfio_device_put(struct vfio_device *device);
> >>>>>
> >>>>>    int vfio_assign_device_set(struct vfio_device *device, void
> >>>>> *set_id);
> >>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
> >> old_state);
> >>>>>    /* events for the backend driver notify callback */
> >>>>>    enum vfio_iommu_notify_type {
> >>>>> --
> >>>>> 2.31.1

* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-26  6:36             ` Leon Romanovsky
@ 2021-09-26 20:23               ` Bjorn Helgaas
  2021-09-27 11:55                 ` Leon Romanovsky
  0 siblings, 1 reply; 57+ messages in thread
From: Bjorn Helgaas @ 2021-09-26 20:23 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Sun, Sep 26, 2021 at 09:36:49AM +0300, Leon Romanovsky wrote:
> On Sat, Sep 25, 2021 at 12:41:15PM -0500, Bjorn Helgaas wrote:
> > On Sat, Sep 25, 2021 at 01:10:39PM +0300, Leon Romanovsky wrote:
> > > On Fri, Sep 24, 2021 at 08:08:45AM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Sep 23, 2021 at 09:35:32AM +0300, Leon Romanovsky wrote:
> > > > > On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> > > > > > On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > 
> > > > > > > The PCI core uses the VF index internally, often called the vf_id,
> > > > > > > during the setup of the VF, eg pci_iov_add_virtfn().
> > > > > > > 
> > > > > > > This index is needed for device drivers that implement live migration
> > > > > > > for their internal operations that configure/control their VFs.
> > > > > > >
> > > > > > > Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> > > > > > > from this series needs it and not the bus/device/function which is
> > > > > > > exposed today.
> > > > > > > 
> > > > > > > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > > > > > > was used to create the bus/device/function.
> > > > > > > 
> > > > > > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > > > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > > 
> > > > > > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > > > > > 
> > > > > > mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> > > > > > same thing as pci_iov_vf_id() by iterating through VFs until it finds
> > > > > > one with a matching devfn (although it *doesn't* check for a matching
> > > > > > bus number, which seems like a bug).
> > > ...
> > 
> > > > And it still looks like the existing code is buggy.  This is called
> > > > via sysfs, so if the PF is on bus X and the user writes to
> > > > sriov_vf_msix_count for a VF on bus X+1, it looks like
> > > > mlx5_core_sriov_set_msix_vec_count() will set the count for the wrong
> > > > VF.
> > > 
> > > In mlx5_core_sriov_set_msix_vec_count(), we receive the VF that is connected
> > > to the PF which has "struct mlx5_core_dev". My expectation is that they share
> > > the same bus, as the PF was the one that created the VFs. The mlx5 devices
> > > support up to 256 VFs, which is far below the bus split mentioned in the PCI spec.
> > > 
> > > How can VFs and their respective PF have different bus numbers?
> > 
> > See PCIe r5.0, sec 9.2.1.2.  For example,
> > 
> >   PF 0 on bus 20
> >     First VF Offset   1
> >     VF Stride         1
> >     NumVFs          511
> >   VF 0,1   through VF 0,255 on bus 20
> >   VF 0,256 through VF 0,511 on bus 21
> > 
> > This is implemented in pci_iov_add_virtfn(), which computes the bus
> > number and devfn from the VF ID.
> > 
> > pci_iov_virtfn_devfn(VF 0,1) == pci_iov_virtfn_devfn(VF 0,256), so if
> > the user writes to sriov_vf_msix_count for VF 0,256, it looks like
> > we'll call mlx5_set_msix_vec_count() for VF 0,1 instead of VF 0,256.
> 
> This is the PCI spec split that I mentioned.
> 
> > 
> > The spec encourages devices that require no more than 256 devices to
> > locate them all on the same bus number (PCIe r5.0, sec 9.1), so if you
> > only have 255 VFs, you may avoid the problem.
> > 
> > But in mlx5_core_sriov_set_msix_vec_count(), it's not obvious that it
> > is safe to assume the bus number is the same.
> 
> No problem, we will make it more clear.

IMHO you should resolve it by using the new interface.  Better
performing, unambiguous regardless of how many VFs the device
supports.  What's the down side?


* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-26 20:23               ` Bjorn Helgaas
@ 2021-09-27 11:55                 ` Leon Romanovsky
  2021-09-27 14:47                   ` Bjorn Helgaas
  0 siblings, 1 reply; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-27 11:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Sun, Sep 26, 2021 at 03:23:41PM -0500, Bjorn Helgaas wrote:
> On Sun, Sep 26, 2021 at 09:36:49AM +0300, Leon Romanovsky wrote:
> > On Sat, Sep 25, 2021 at 12:41:15PM -0500, Bjorn Helgaas wrote:
> > > On Sat, Sep 25, 2021 at 01:10:39PM +0300, Leon Romanovsky wrote:
> > > > On Fri, Sep 24, 2021 at 08:08:45AM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, Sep 23, 2021 at 09:35:32AM +0300, Leon Romanovsky wrote:
> > > > > > On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> > > > > > > On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > > 
> > > > > > > > The PCI core uses the VF index internally, often called the vf_id,
> > > > > > > > during the setup of the VF, eg pci_iov_add_virtfn().
> > > > > > > > 
> > > > > > > > This index is needed for device drivers that implement live migration
> > > > > > > > for their internal operations that configure/control their VFs.
> > > > > > > >
> > > > > > > > Specifically, the mlx5_vfio_pci driver that is introduced in coming
> > > > > > > > patches from this series needs it, and not the bus/device/function
> > > > > > > > which is exposed today.
> > > > > > > > 
> > > > > > > > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > > > > > > > was used to create the bus/device/function.
> > > > > > > > 
> > > > > > > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > > > > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > 
> > > > > > > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > > > > > > 
> > > > > > > mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> > > > > > > same thing as pci_iov_vf_id() by iterating through VFs until it finds
> > > > > > > one with a matching devfn (although it *doesn't* check for a matching
> > > > > > > bus number, which seems like a bug).
> > > > ...
> > > 
> > > > > And it still looks like the existing code is buggy.  This is called
> > > > > via sysfs, so if the PF is on bus X and the user writes to
> > > > > sriov_vf_msix_count for a VF on bus X+1, it looks like
> > > > > mlx5_core_sriov_set_msix_vec_count() will set the count for the wrong
> > > > > VF.
> > > > 
> > > > In mlx5_core_sriov_set_msix_vec_count(), we receive the VF that is connected
> > > > to the PF which has "struct mlx5_core_dev". My expectation is that they share
> > > > the same bus, as the PF was the one that created the VFs. The mlx5 devices
> > > > support up to 256 VFs, which is far below the bus split mentioned in the PCI spec.
> > > > 
> > > > How can VFs and their respective PF have different bus numbers?
> > > 
> > > See PCIe r5.0, sec 9.2.1.2.  For example,
> > > 
> > >   PF 0 on bus 20
> > >     First VF Offset   1
> > >     VF Stride         1
> > >     NumVFs          511
> > >   VF 0,1   through VF 0,255 on bus 20
> > >   VF 0,256 through VF 0,511 on bus 21
> > > 
> > > This is implemented in pci_iov_add_virtfn(), which computes the bus
> > > number and devfn from the VF ID.
> > > 
> > > pci_iov_virtfn_devfn(VF 0,1) == pci_iov_virtfn_devfn(VF 0,256), so if
> > > the user writes to sriov_vf_msix_count for VF 0,256, it looks like
> > > we'll call mlx5_set_msix_vec_count() for VF 0,1 instead of VF 0,256.
> > 
> > This is the PCI spec split that I mentioned.
> > 
> > > 
> > > The spec encourages devices that require no more than 256 devices to
> > > locate them all on the same bus number (PCIe r5.0, sec 9.1), so if you
> > > only have 255 VFs, you may avoid the problem.
> > > 
> > > But in mlx5_core_sriov_set_msix_vec_count(), it's not obvious that it
> > > is safe to assume the bus number is the same.
> > 
> > No problem, we will make it more clear.
> 
> IMHO you should resolve it by using the new interface.  Better
> performing, unambiguous regardless of how many VFs the device
> supports.  What's the down side?

I don't see any. My previous answer should have been written as:
"No problem, we will make it more clear with this new function".

Thanks


* Re: [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index
  2021-09-27 11:55                 ` Leon Romanovsky
@ 2021-09-27 14:47                   ` Bjorn Helgaas
  0 siblings, 0 replies; 57+ messages in thread
From: Bjorn Helgaas @ 2021-09-27 14:47 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Alex Williamson, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, kvm, linux-kernel, linux-pci,
	linux-rdma, netdev, Saeed Mahameed, Yishai Hadas

On Mon, Sep 27, 2021 at 02:55:24PM +0300, Leon Romanovsky wrote:
> On Sun, Sep 26, 2021 at 03:23:41PM -0500, Bjorn Helgaas wrote:
> > On Sun, Sep 26, 2021 at 09:36:49AM +0300, Leon Romanovsky wrote:
> > > On Sat, Sep 25, 2021 at 12:41:15PM -0500, Bjorn Helgaas wrote:
> > > > On Sat, Sep 25, 2021 at 01:10:39PM +0300, Leon Romanovsky wrote:
> > > > > On Fri, Sep 24, 2021 at 08:08:45AM -0500, Bjorn Helgaas wrote:
> > > > > > On Thu, Sep 23, 2021 at 09:35:32AM +0300, Leon Romanovsky wrote:
> > > > > > > On Wed, Sep 22, 2021 at 04:59:30PM -0500, Bjorn Helgaas wrote:
> > > > > > > > On Wed, Sep 22, 2021 at 01:38:50PM +0300, Leon Romanovsky wrote:
> > > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > > > 
> > > > > > > > > The PCI core uses the VF index internally, often called the vf_id,
> > > > > > > > > during the setup of the VF, eg pci_iov_add_virtfn().
> > > > > > > > > 
> > > > > > > > > This index is needed for device drivers that implement live migration
> > > > > > > > > for their internal operations that configure/control their VFs.
> > > > > > > > >
> > > > > > > > > Specifically, the mlx5_vfio_pci driver that is introduced in coming
> > > > > > > > > patches from this series needs it, and not the bus/device/function
> > > > > > > > > which is exposed today.
> > > > > > > > > 
> > > > > > > > > Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> > > > > > > > > was used to create the bus/device/function.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > > > > > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > > 
> > > > > > > > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > > > > > > > 
> > > > > > > > mlx5_core_sriov_set_msix_vec_count() looks like it does basically the
> > > > > > > > same thing as pci_iov_vf_id() by iterating through VFs until it finds
> > > > > > > > one with a matching devfn (although it *doesn't* check for a matching
> > > > > > > > bus number, which seems like a bug).
> > > > > ...
> > > > 
> > > > > > And it still looks like the existing code is buggy.  This is called
> > > > > > via sysfs, so if the PF is on bus X and the user writes to
> > > > > > sriov_vf_msix_count for a VF on bus X+1, it looks like
> > > > > > mlx5_core_sriov_set_msix_vec_count() will set the count for the wrong
> > > > > > VF.
> > > > > 
> > > > > In mlx5_core_sriov_set_msix_vec_count(), we receive the VF that is connected
> > > > > to the PF which has "struct mlx5_core_dev". My expectation is that they share
> > > > > the same bus, as the PF was the one that created the VFs. The mlx5 devices
> > > > > support up to 256 VFs, which is far below the bus split mentioned in the PCI spec.
> > > > > 
> > > > > How can VFs and their respective PF have different bus numbers?
> > > > 
> > > > See PCIe r5.0, sec 9.2.1.2.  For example,
> > > > 
> > > >   PF 0 on bus 20
> > > >     First VF Offset   1
> > > >     VF Stride         1
> > > >     NumVFs          511
> > > >   VF 0,1   through VF 0,255 on bus 20
> > > >   VF 0,256 through VF 0,511 on bus 21
> > > > 
> > > > This is implemented in pci_iov_add_virtfn(), which computes the bus
> > > > number and devfn from the VF ID.
> > > > 
> > > > pci_iov_virtfn_devfn(VF 0,1) == pci_iov_virtfn_devfn(VF 0,256), so if
> > > > the user writes to sriov_vf_msix_count for VF 0,256, it looks like
> > > > we'll call mlx5_set_msix_vec_count() for VF 0,1 instead of VF 0,256.
> > > 
> > > This is the PCI spec split that I mentioned.
> > > 
> > > > 
> > > > The spec encourages devices that require no more than 256 devices to
> > > > locate them all on the same bus number (PCIe r5.0, sec 9.1), so if you
> > > > only have 255 VFs, you may avoid the problem.
> > > > 
> > > > But in mlx5_core_sriov_set_msix_vec_count(), it's not obvious that it
> > > > is safe to assume the bus number is the same.
> > > 
> > > No problem, we will make it more clear.
> > 
> > IMHO you should resolve it by using the new interface.  Better
> > performing, unambiguous regardless of how many VFs the device
> > supports.  What's the down side?
> 
> I don't see any. My previous answer should have been written as:
> "No problem, we will make it more clear with this new function".

Great, sorry I missed that nuance :)


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-26 16:17             ` Shameerali Kolothum Thodi
@ 2021-09-27 18:24               ` Max Gurtovoy
  2021-09-27 18:29                 ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-27 18:24 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	liulongfang


On 9/26/2021 7:17 PM, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
>> Sent: 26 September 2021 10:10
>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> Leon Romanovsky <leon@kernel.org>
>> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
>> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
>> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
>> <saeedm@nvidia.com>; liulongfang <liulongfang@huawei.com>
>> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
>> transition validity
>>
>>
>> On 9/24/2021 10:44 AM, Shameerali Kolothum Thodi wrote:
>>>> -----Original Message-----
>>>> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
>>>> Sent: 23 September 2021 14:56
>>>> To: Leon Romanovsky <leon@kernel.org>; Shameerali Kolothum Thodi
>>>> <shameerali.kolothum.thodi@huawei.com>
>>>> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
>>>> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>>>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
>>>> David S. Miller <davem@davemloft.net>; Jakub Kicinski
>>>> <kuba@kernel.org>; Kirti Wankhede <kwankhede@nvidia.com>;
>>>> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>>>> linux-pci@vger.kernel.org; linux-rdma@vger.kernel.org;
>>>> netdev@vger.kernel.org; Saeed Mahameed <saeedm@nvidia.com>
>>>> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check
>>>> migration state transition validity
>>>>
>>>>
>>>> On 9/23/2021 2:17 PM, Leon Romanovsky wrote:
>>>>> On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum Thodi
>>>> wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Leon Romanovsky [mailto:leon@kernel.org]
>>>>>>> Sent: 22 September 2021 11:39
>>>>>>> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
>>>> <jgg@nvidia.com>
>>>>>>> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
>>>>>>> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
>>>> David
>>>>>>> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>;
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
>>>>>>> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
>>>>>>> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
>>>>>>> <saeedm@nvidia.com>
>>>>>>> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check migration
>>>>>>> state transition validity
>>>>>>>
>>>>>>> From: Yishai Hadas <yishaih@nvidia.com>
>>>>>>>
>>>>>>> Add an API in the core layer to check migration state transition
>>>>>>> validity as part of a migration flow.
>>>>>>>
>>>>>>> The valid transitions follow the expected usage as described in
>>>>>>> uapi/vfio.h and triggered by QEMU.
>>>>>>>
>>>>>>> This ensures that all migration implementations follow a
>>>>>>> consistent migration state machine.
>>>>>>>
>>>>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>>>>> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>>>>>> ---
>>>>>>>     drivers/vfio/vfio.c  | 41
>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>>>>     include/linux/vfio.h |  1 +
>>>>>>>     2 files changed, 42 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
>>>>>>> 3c034fe14ccb..c3ca33e513c8 100644
>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct
>>>> inode
>>>>>>> *inode, struct file *filep)
>>>>>>>     	return 0;
>>>>>>>     }
>>>>>>>
>>>>>>> +/**
>>>>>>> + * vfio_change_migration_state_allowed - Checks whether a
>>>>>>> +migration
>>>> state
>>>>>>> + *   transition is valid.
>>>>>>> + * @new_state: The new state to move to.
>>>>>>> + * @old_state: The old state.
>>>>>>> + * Return: true if the transition is valid.
>>>>>>> + */
>>>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
>>>> old_state)
>>>>>>> +{
>>>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>>>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE +
>>>> 1] = {
>>>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>>>>>> +		},
>>>>>>> +		[VFIO_DEVICE_STATE_RUNNING] = {
>>>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>>>>>> +			[VFIO_DEVICE_STATE_SAVING |
>>>> VFIO_DEVICE_STATE_RUNNING]
>>>>>>> = 1,
>>>>>> Do we need to allow _RESUMING state here or not? As per the "State
>>>> transitions"
>>>>>> section from uapi/linux/vfio.h,
>>>>> It looks like we missed this state transition.
>>>>>
>>>>> Thanks
>>>> I'm not sure this state transition is valid.
>>>>
>>>> Kirti, when would we want to move from RUNNING to RESUMING?
>>> I guess it depends on what you report as your dev default state.
>>>
>>> For HiSilicon ACC migration driver, we set the default to _RUNNING.
>> Where do you set it and report it ?
> Currently, in _open_device() we set the device_state to _RUNNING.

Why do you do it?

>
> I think in your case the default of vmig->vfio_dev_state == 0 (_STOP).
>
>>> And when the migration starts, the destination side Qemu, set the
>>> device state to _RESUMING(vfio_load_state()).
>>>
>>>   From the documentation, it looks like the assumption on default state
>>> of the VFIO dev is _RUNNING.
>>>
>>> "
>>> *  001b => Device running, which is the default state "
>>>
>>>> Shameerali, can you please re-test and update if you see this transition?
>>> Yes. And if I change the default state to _STOP, then the transition
>>> is from _STOP --> _RESUMING.
>>>
>>> But the documentation on State transitions doesn't have _STOP -->
>>> _RESUMING transition as valid.
>>>
>>> Thanks,
>>> Shameer
>>>
>>>>>> " * 4. To start the resuming phase, the device state should be
>>>>>> transitioned
>>>> from
>>>>>>     *    the _RUNNING to the _RESUMING state."
>>>>>>
>>>>>> IIRC, I have seen that transition happening on the destination dev
>>>>>> while
>>>> testing the
>>>>>> HiSilicon ACC dev migration.
>>>>>>
>>>>>> Thanks,
>>>>>> Shameer
>>>>>>
>>>>>>> +		},
>>>>>>> +		[VFIO_DEVICE_STATE_SAVING] = {
>>>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>>>> +		},
>>>>>>> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING]
>>>> = {
>>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
>>>>>>> +		},
>>>>>>> +		[VFIO_DEVICE_STATE_RESUMING] = {
>>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
>>>>>>> +		},
>>>>>>> +	};
>>>>>>> +
>>>>>>> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	return vfio_from_state_table[old_state][new_state];
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
>>>>>>> +
>>>>>>>     static long vfio_device_fops_unl_ioctl(struct file *filep,
>>>>>>>     				       unsigned int cmd, unsigned long arg)
>>>>>>>     {
>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h index
>>>>>>> b53a9557884a..e65137a708f1 100644
>>>>>>> --- a/include/linux/vfio.h
>>>>>>> +++ b/include/linux/vfio.h
>>>>>>> @@ -83,6 +83,7 @@ extern struct vfio_device
>>>>>>> *vfio_device_get_from_dev(struct device *dev);
>>>>>>>     extern void vfio_device_put(struct vfio_device *device);
>>>>>>>
>>>>>>>     int vfio_assign_device_set(struct vfio_device *device, void
>>>>>>> *set_id);
>>>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
>>>> old_state);
>>>>>>>     /* events for the backend driver notify callback */
>>>>>>>     enum vfio_iommu_notify_type {
>>>>>>> --
>>>>>>> 2.31.1


* RE: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-27 18:24               ` Max Gurtovoy
@ 2021-09-27 18:29                 ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 57+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-09-27 18:29 UTC (permalink / raw)
  To: Max Gurtovoy, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Alex Williamson,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	liulongfang



> -----Original Message-----
> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> Sent: 27 September 2021 19:24
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> Leon Romanovsky <leon@kernel.org>
> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>; David
> S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Kirti
> Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@nvidia.com>; liulongfang <liulongfang@huawei.com>
> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
> transition validity
> 
> 
> On 9/26/2021 7:17 PM, Shameerali Kolothum Thodi wrote:
> >
> >> -----Original Message-----
> >> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> >> Sent: 26 September 2021 10:10
> >> To: Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>;
> >> Leon Romanovsky <leon@kernel.org>
> >> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> >> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> >> <alex.williamson@redhat.com>; Bjorn Helgaas <bhelgaas@google.com>;
> >> David S. Miller <davem@davemloft.net>; Jakub Kicinski
> >> <kuba@kernel.org>; Kirti Wankhede <kwankhede@nvidia.com>;
> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >> linux-pci@vger.kernel.org; linux-rdma@vger.kernel.org;
> >> netdev@vger.kernel.org; Saeed Mahameed <saeedm@nvidia.com>;
> >> liulongfang <liulongfang@huawei.com>
> >> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check
> >> migration state transition validity
> >>
> >>
> >> On 9/24/2021 10:44 AM, Shameerali Kolothum Thodi wrote:
> >>>> -----Original Message-----
> >>>> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> >>>> Sent: 23 September 2021 14:56
> >>>> To: Leon Romanovsky <leon@kernel.org>; Shameerali Kolothum Thodi
> >>>> <shameerali.kolothum.thodi@huawei.com>
> >>>> Cc: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> >>>> <jgg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Alex
> >>>> Williamson <alex.williamson@redhat.com>; Bjorn Helgaas
> >>>> <bhelgaas@google.com>; David S. Miller <davem@davemloft.net>;
> Jakub
> >>>> Kicinski <kuba@kernel.org>; Kirti Wankhede <kwankhede@nvidia.com>;
> >>>> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >>>> linux-pci@vger.kernel.org; linux-rdma@vger.kernel.org;
> >>>> netdev@vger.kernel.org; Saeed Mahameed <saeedm@nvidia.com>
> >>>> Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check
> >>>> migration state transition validity
> >>>>
> >>>>
> >>>> On 9/23/2021 2:17 PM, Leon Romanovsky wrote:
> >>>>> On Thu, Sep 23, 2021 at 10:33:10AM +0000, Shameerali Kolothum
> >>>>> Thodi
> >>>> wrote:
> >>>>>>> -----Original Message-----
> >>>>>>> From: Leon Romanovsky [mailto:leon@kernel.org]
> >>>>>>> Sent: 22 September 2021 11:39
> >>>>>>> To: Doug Ledford <dledford@redhat.com>; Jason Gunthorpe
> >>>> <jgg@nvidia.com>
> >>>>>>> Cc: Yishai Hadas <yishaih@nvidia.com>; Alex Williamson
> >>>>>>> <alex.williamson@redhat.com>; Bjorn Helgaas
> >>>>>>> <bhelgaas@google.com>;
> >>>> David
> >>>>>>> S. Miller <davem@davemloft.net>; Jakub Kicinski
> >>>>>>> <kuba@kernel.org>; Kirti Wankhede <kwankhede@nvidia.com>;
> >>>>>>> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >>>>>>> linux-pci@vger.kernel.org; linux-rdma@vger.kernel.org;
> >>>>>>> netdev@vger.kernel.org; Saeed Mahameed <saeedm@nvidia.com>
> >>>>>>> Subject: [PATCH mlx5-next 2/7] vfio: Add an API to check
> >>>>>>> migration state transition validity
> >>>>>>>
> >>>>>>> From: Yishai Hadas <yishaih@nvidia.com>
> >>>>>>>
> >>>>>>> Add an API in the core layer to check migration state transition
> >>>>>>> validity as part of a migration flow.
> >>>>>>>
> >>>>>>> The valid transitions follow the expected usage as described in
> >>>>>>> uapi/vfio.h and triggered by QEMU.
> >>>>>>>
> >>>>>>> This ensures that all migration implementations follow a
> >>>>>>> consistent migration state machine.
> >>>>>>>
> >>>>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >>>>>>> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >>>>>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >>>>>>> ---
> >>>>>>>     drivers/vfio/vfio.c  | 41
> >>>> +++++++++++++++++++++++++++++++++++++++++
> >>>>>>>     include/linux/vfio.h |  1 +
> >>>>>>>     2 files changed, 42 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> >>>>>>> 3c034fe14ccb..c3ca33e513c8 100644
> >>>>>>> --- a/drivers/vfio/vfio.c
> >>>>>>> +++ b/drivers/vfio/vfio.c
> >>>>>>> @@ -1664,6 +1664,47 @@ static int
> >>>>>>> vfio_device_fops_release(struct
> >>>> inode
> >>>>>>> *inode, struct file *filep)
> >>>>>>>     	return 0;
> >>>>>>>     }
> >>>>>>>
> >>>>>>> +/**
> >>>>>>> + * vfio_change_migration_state_allowed - Checks whether a
> >>>>>>> +migration
> >>>> state
> >>>>>>> + *   transition is valid.
> >>>>>>> + * @new_state: The new state to move to.
> >>>>>>> + * @old_state: The old state.
> >>>>>>> + * Return: true if the transition is valid.
> >>>>>>> + */
> >>>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
> >>>> old_state)
> >>>>>>> +{
> >>>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> >>>>>>> +	static const u8 vfio_from_state_table[MAX_STATE +
> 1][MAX_STATE
> >>>>>>> ++
> >>>> 1] = {
> >>>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
> >>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> >>>>>>> +		},
> >>>>>>> +		[VFIO_DEVICE_STATE_RUNNING] = {
> >>>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> >>>>>>> +			[VFIO_DEVICE_STATE_SAVING |
> >>>> VFIO_DEVICE_STATE_RUNNING]
> >>>>>>> = 1,
> >>>>>> Do we need to allow _RESUMING state here or not? As per the
> >>>>>> "State
> >>>> transitions"
> >>>>>> section from uapi/linux/vfio.h,
> >>>>> It looks like we missed this state transition.
> >>>>>
> >>>>> Thanks
> >>>> I'm not sure this state transition is valid.
> >>>>
> >>>> Kirti, when would we want to move from RUNNING to RESUMING?
> >>> I guess it depends on what you report as your dev default state.
> >>>
> >>> For HiSilicon ACC migration driver, we set the default to _RUNNING.
> >> Where do you set it and report it ?
> > Currently, in _open_device() we set the device_state to _RUNNING.
> 
> Why do you do it?

It is based on the assumption that the default state is _RUNNING, and we
take it from there for migration state changes.

Is there any particular reason why it needs to be in the _STOP state? In
that case we need to update the documentation to allow _STOP --> _RESUMING.

> 
> >
> > I think in your case the default of vmig->vfio_dev_state == 0 (_STOP).
> >
> >>> And when the migration starts, the destination side Qemu, set the
> >>> device state to _RESUMING(vfio_load_state()).
> >>>
> >>>   From the documentation, it looks like the assumption on default
> >>> state of the VFIO dev is _RUNNING.
> >>>
> >>> "
> >>> *  001b => Device running, which is the default state "
> >>>
> >>>> Shameerali, can you please re-test and update if you see this transition?
> >>> Yes. And if I change the default state to _STOP, then the transition
> >>> is from _STOP --> _RESUMING.
> >>>
> >>> But the documentation on State transitions doesn't have _STOP -->
> >>> _RESUMING transition as valid.
> >>>
> >>> Thanks,
> >>> Shameer
> >>>
> >>>>>> " * 4. To start the resuming phase, the device state should be
> >>>>>> transitioned
> >>>> from
> >>>>>>     *    the _RUNNING to the _RESUMING state."
> >>>>>>
> >>>>>> IIRC, I have seen that transition happening on the destination
> >>>>>> dev while
> >>>> testing the
> >>>>>> HiSilicon ACC dev migration.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Shameer
> >>>>>>
> >>>>>>> +		},
> >>>>>>> +		[VFIO_DEVICE_STATE_SAVING] = {
> >>>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>>>> +		},
> >>>>>>> +		[VFIO_DEVICE_STATE_SAVING |
> VFIO_DEVICE_STATE_RUNNING]
> >>>> = {
> >>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>>>> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> >>>>>>> +		},
> >>>>>>> +		[VFIO_DEVICE_STATE_RESUMING] = {
> >>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>>>> +			[VFIO_DEVICE_STATE_STOP] = 1,
> >>>>>>> +		},
> >>>>>>> +	};
> >>>>>>> +
> >>>>>>> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
> >>>>>>> +		return false;
> >>>>>>> +
> >>>>>>> +	return vfio_from_state_table[old_state][new_state];
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
> >>>>>>> +
> >>>>>>>     static long vfio_device_fops_unl_ioctl(struct file *filep,
> >>>>>>>     				       unsigned int cmd, unsigned long arg)
> >>>>>>>     {
> >>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h index
> >>>>>>> b53a9557884a..e65137a708f1 100644
> >>>>>>> --- a/include/linux/vfio.h
> >>>>>>> +++ b/include/linux/vfio.h
> >>>>>>> @@ -83,6 +83,7 @@ extern struct vfio_device
> >>>>>>> *vfio_device_get_from_dev(struct device *dev);
> >>>>>>>     extern void vfio_device_put(struct vfio_device *device);
> >>>>>>>
> >>>>>>>     int vfio_assign_device_set(struct vfio_device *device, void
> >>>>>>> *set_id);
> >>>>>>> +bool vfio_change_migration_state_allowed(u32 new_state, u32
> >>>> old_state);
> >>>>>>>     /* events for the backend driver notify callback */
> >>>>>>>     enum vfio_iommu_notify_type {
> >>>>>>> --
> >>>>>>> 2.31.1


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-22 10:38 ` [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Leon Romanovsky
  2021-09-23 10:33   ` Shameerali Kolothum Thodi
@ 2021-09-27 22:46   ` Alex Williamson
  2021-09-27 23:12     ` Jason Gunthorpe
  1 sibling, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-27 22:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Wed, 22 Sep 2021 13:38:51 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Yishai Hadas <yishaih@nvidia.com>
> 
> Add an API in the core layer to check migration state transition validity
> as part of a migration flow.
> 
> The valid transitions follow the expected usage as described in
> uapi/vfio.h and triggered by QEMU.
> 
> This ensures that all migration implementations follow a consistent
> migration state machine.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/vfio.c  | 41 +++++++++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h |  1 +
>  2 files changed, 42 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 3c034fe14ccb..c3ca33e513c8 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1664,6 +1664,47 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>  	return 0;
>  }
>  
> +/**
> + * vfio_change_migration_state_allowed - Checks whether a migration state
> + *   transition is valid.
> + * @new_state: The new state to move to.
> + * @old_state: The old state.
> + * Return: true if the transition is valid.
> + */
> +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state)
> +{
> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> +		[VFIO_DEVICE_STATE_STOP] = {
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> +		},

Our state transition diagram is pretty weak on reachable transitions
out of the _STOP state, why do we select only these two as valid?

Consistent behavior to userspace is of course nice, but I wonder if we
were expecting a device reset to get us back to _RUNNING, or if the
drivers would make use of the protocol through which a driver can nak
(write error, no state change) or fault (_ERROR device state) a state
change.

There does need to be a way to get back to _RUNNING to support a
migration failure without a reset, but would that be from _SAVING
or from _STOP and what's our rationale for the excluded states?

I'll see if I can dig through emails to find what was intended to be
reachable from _STOP.  Kirti or Connie, do you recall?

Also, I think the _ERROR state is implicitly handled correctly here,
its value is >MAX_STATE so we can't transition into or out of it, but a
comment to indicate that it's been considered for this would be nice.

> +		[VFIO_DEVICE_STATE_RUNNING] = {
> +			[VFIO_DEVICE_STATE_STOP] = 1,
> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> +			[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING] = 1,
> +		},

Shameer's comment is correct here, _RESUMING is a valid next state
since the default state is _RUNNING.

> +		[VFIO_DEVICE_STATE_SAVING] = {
> +			[VFIO_DEVICE_STATE_STOP] = 1,
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +		},

What's the rationale that we can't return to _SAVING|_RUNNING here?

> +		[VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING] = {
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +			[VFIO_DEVICE_STATE_SAVING] = 1,
> +		},

Can't we always _STOP the device at any point?

> +		[VFIO_DEVICE_STATE_RESUMING] = {
> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> +			[VFIO_DEVICE_STATE_STOP] = 1,
> +		},

Couldn't it be possible to switch immediately to _RUNNING|_SAVING for
tracing purposes?  Or _SAVING, perhaps to validate the restored state
without starting the device?  Thanks,

Alex

> +	};
> +
> +	if (new_state > MAX_STATE || old_state > MAX_STATE)
> +		return false;
> +
> +	return vfio_from_state_table[old_state][new_state];
> +}
> +EXPORT_SYMBOL_GPL(vfio_change_migration_state_allowed);
> +
>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index b53a9557884a..e65137a708f1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -83,6 +83,7 @@ extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
>  extern void vfio_device_put(struct vfio_device *device);
>  
>  int vfio_assign_device_set(struct vfio_device *device, void *set_id);
> +bool vfio_change_migration_state_allowed(u32 new_state, u32 old_state);
>  
>  /* events for the backend driver notify callback */
>  enum vfio_iommu_notify_type {



* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-27 22:46   ` Alex Williamson
@ 2021-09-27 23:12     ` Jason Gunthorpe
  2021-09-28 19:19       ` Alex Williamson
  2021-09-29 10:44       ` Max Gurtovoy
  0 siblings, 2 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-27 23:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:
> > +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> > +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> > +		[VFIO_DEVICE_STATE_STOP] = {
> > +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> > +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> > +		},
> 
> Our state transition diagram is pretty weak on reachable transitions
> out of the _STOP state, why do we select only these two as valid?

I have no particular opinion on specific states here, however adding
more states means more stuff for drivers to implement and more risk
driver writers will mess up this uAPI.

So only on those grounds I'd suggest to keep this to the minimum
needed instead of the maximum logically possible..

Also, probably the FSM comment from the uapi header file should be
moved into a function comment above this function?

Jason


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-27 23:12     ` Jason Gunthorpe
@ 2021-09-28 19:19       ` Alex Williamson
  2021-09-28 19:35         ` Jason Gunthorpe
  2021-09-29 10:57         ` Max Gurtovoy
  2021-09-29 10:44       ` Max Gurtovoy
  1 sibling, 2 replies; 57+ messages in thread
From: Alex Williamson @ 2021-09-28 19:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Mon, 27 Sep 2021 20:12:39 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:
> > > +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> > > +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> > > +		[VFIO_DEVICE_STATE_STOP] = {
> > > +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> > > +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> > > +		},  
> > 
> > Our state transition diagram is pretty weak on reachable transitions
> > out of the _STOP state, why do we select only these two as valid?  
> 
> I have no particular opinion on specific states here, however adding
> more states means more stuff for drivers to implement and more risk
> driver writers will mess up this uAPI.

It looks like state transitions were largely discussed in v9 and v10 of
the migration proposals:

https://lore.kernel.org/all/1573578220-7530-2-git-send-email-kwankhede@nvidia.com/
https://lore.kernel.org/all/1576527700-21805-2-git-send-email-kwankhede@nvidia.com/

I'm not seeing that we really excluded many transitions there.

> So only on those grounds I'd suggest to keep this to the minimum
> needed instead of the maximum logically possible..
> 
> Also, probably the FSM comment from the uapi header file should be
> moved into a function comment above this function?

It's not clear that this function should be anything more than:

	if (new_state > MAX_STATE || old_state > MAX_STATE)
		return false;	/* exited via device reset, */
				/* entered via transition fault */

	return true;

That's still only 5 fully interconnected states to work between, and
potentially a 6th if we decide _RESUMING|_RUNNING is valid for a device
supporting post-copy.

In defining the device state, we tried to steer away from defining it
in terms of the QEMU migration API, but rather as a set of controls
that could be used to support that API to leave us some degree of
independence that QEMU implementation might evolve.

To that extent, it actually seems easier for a device implementation to
focus on the bit definitions rather than the state machine nodes.

I'd also vote that any clarification of state validity and transitions
belongs in the uAPI header and a transition test function should
reference that header as the source of truth, rather than the other way
around.  Thanks,

Alex



* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-28 19:19       ` Alex Williamson
@ 2021-09-28 19:35         ` Jason Gunthorpe
  2021-09-28 20:18           ` Alex Williamson
  2021-09-29 10:57         ` Max Gurtovoy
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:

> In defining the device state, we tried to steer away from defining it
> in terms of the QEMU migration API, but rather as a set of controls
> that could be used to support that API to leave us some degree of
> independence that QEMU implementation might evolve.

That is certainly a different perspective; it would have been
better not to express this idea as an FSM in that case...

So each state in mlx5vf_pci_set_device_state() should call the correct
combination of (un)freeze, (un)quiesce and so on so each state
reflects a defined operation of the device?

Jason


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-28 19:35         ` Jason Gunthorpe
@ 2021-09-28 20:18           ` Alex Williamson
  2021-09-29 16:16             ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-28 20:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Tue, 28 Sep 2021 16:35:50 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> 
> > In defining the device state, we tried to steer away from defining it
> > in terms of the QEMU migration API, but rather as a set of controls
> > that could be used to support that API to leave us some degree of
> > independence that QEMU implementation might evolve.  
> 
> That is certainly a different perspective, it would have been
> better to not express this idea as a FSM in that case...
> 
> So each state in mlx5vf_pci_set_device_state() should call the correct
> combination of (un)freeze, (un)quiesce and so on so each state
> reflects a defined operation of the device?

I'd expect so, for instance the implementation of entering the _STOP
state presumes a previous state where the device is apparently
already quiesced.  That doesn't support a direct _RUNNING -> _STOP
transition where I argued in the linked threads that those states
should be reachable from any other state.  Thanks,

Alex



* Re: [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device
  2021-09-22 10:38 ` [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device Leon Romanovsky
@ 2021-09-28 20:22   ` Alex Williamson
  2021-09-29  5:36     ` Leon Romanovsky
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-28 20:22 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed

On Wed, 22 Sep 2021 13:38:55 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Yishai Hadas <yishaih@nvidia.com>
> 
> Expose migration commands over the device; they include: suspend, resume,
> get vhca id, query/save/load state.
> 
> As part of this, add the APIs and data structures that are needed to
> manage the migration data.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/pci/mlx5_vfio_pci_cmd.c | 358 +++++++++++++++++++++++++++
>  drivers/vfio/pci/mlx5_vfio_pci_cmd.h |  43 ++++
>  2 files changed, 401 insertions(+)
>  create mode 100644 drivers/vfio/pci/mlx5_vfio_pci_cmd.c
>  create mode 100644 drivers/vfio/pci/mlx5_vfio_pci_cmd.h

Should we set the precedent of a vendor sub-directory like we have
elsewhere?  Either way I'd like to see a MAINTAINERS file update for the
new driver.  Thanks,

Alex



* Re: [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device
  2021-09-28 20:22   ` Alex Williamson
@ 2021-09-29  5:36     ` Leon Romanovsky
  0 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2021-09-29  5:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Doug Ledford, Jason Gunthorpe, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed

On Tue, Sep 28, 2021 at 02:22:30PM -0600, Alex Williamson wrote:
> On Wed, 22 Sep 2021 13:38:55 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
> 
> > From: Yishai Hadas <yishaih@nvidia.com>
> > 
> > Expose migration commands over the device, it includes: suspend, resume,
> > get vhca id, query/save/load state.
> > 
> > As part of this adds the APIs and data structure that are needed to
> > manage the migration data.
> > 
> > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/vfio/pci/mlx5_vfio_pci_cmd.c | 358 +++++++++++++++++++++++++++
> >  drivers/vfio/pci/mlx5_vfio_pci_cmd.h |  43 ++++
> >  2 files changed, 401 insertions(+)
> >  create mode 100644 drivers/vfio/pci/mlx5_vfio_pci_cmd.c
> >  create mode 100644 drivers/vfio/pci/mlx5_vfio_pci_cmd.h
> 
> Should we set the precedent of a vendor sub-directory like we have
> elsewhere?  Either way I'd like to see a MAINTAINERS file update for the
> new driver.  Thanks,

I would like to see subfolders, because all these vendor_xxxx.c filenames
look awful to me.

Thanks

> 
> Alex
> 


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-27 23:12     ` Jason Gunthorpe
  2021-09-28 19:19       ` Alex Williamson
@ 2021-09-29 10:44       ` Max Gurtovoy
  2021-09-29 12:35         ` Alex Williamson
  1 sibling, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-29 10:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/28/2021 2:12 AM, Jason Gunthorpe wrote:
> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:
>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>> +		},
>> Our state transition diagram is pretty weak on reachable transitions
>> out of the _STOP state, why do we select only these two as valid?
> I have no particular opinion on specific states here, however adding
> more states means more stuff for drivers to implement and more risk
> driver writers will mess up this uAPI.

_STOP == 000b => Device Stopped, not saving or resuming (from UAPI).

This is the default initial state and not RUNNING.

The user application should move the device from STOP => RUNNING or
STOP => RESUMING.

Maybe we need to extend the comment in the UAPI file.

>
> So only on those grounds I'd suggest to keep this to the minimum
> needed instead of the maximum logically possible..
>
> Also, probably the FSM comment from the uapi header file should be
> moved into a function comment above this function?
>
> Jason


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-28 19:19       ` Alex Williamson
  2021-09-28 19:35         ` Jason Gunthorpe
@ 2021-09-29 10:57         ` Max Gurtovoy
  1 sibling, 0 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-29 10:57 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/28/2021 10:19 PM, Alex Williamson wrote:
> On Mon, 27 Sep 2021 20:12:39 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
>> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:
>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
>>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>>> +		},
>>> Our state transition diagram is pretty weak on reachable transitions
>>> out of the _STOP state, why do we select only these two as valid?
>> I have no particular opinion on specific states here, however adding
>> more states means more stuff for drivers to implement and more risk
>> driver writers will mess up this uAPI.
> It looks like state transitions were largely discussed in v9 and v10 of
> the migration proposals:
>
> https://lore.kernel.org/all/1573578220-7530-2-git-send-email-kwankhede@nvidia.com/
> https://lore.kernel.org/all/1576527700-21805-2-git-send-email-kwankhede@nvidia.com/
>
> I'm not seeing that we really excluded many transitions there.
>
>> So only on those grounds I'd suggest to keep this to the minimum
>> needed instead of the maximum logically possible..
>>
>> Also, probably the FSM comment from the uapi header file should be
>> moved into a function comment above this function?
> It's not clear that this function should be anything more than:
>
> 	if (new_state > MAX_STATE || old_state > MAX_STATE)
> 		return false;	/* exited via device reset, */
> 				/* entered via transition fault */
>
> 	return true;
>
> That's still only 5 fully interconnected states to work between, and
> potentially a 6th if we decide _RESUMING|_RUNNING is valid for a device
> supporting post-copy.
>
> In defining the device state, we tried to steer away from defining it
> in terms of the QEMU migration API, but rather as a set of controls
> that could be used to support that API to leave us some degree of
> independence that QEMU implementation might evolve.

The state machine is not related to QEMU specifically.

The state machine defines an agreement between user application (let's 
say QEMU) and VFIO.

If a user application would like to move, for example, from RESUMING to
SAVING state, then the kernel should fail the request. I don't think there
is a device that can support that transition.

If you prefer, we can check this inside our mlx5 vfio driver. But we think
this is common logic that follows from the defined FSM.

Do you prefer code duplication in vendor vfio-pci drivers?

> To that extent, it actually seems easier for a device implementation to
> focus on bit definition rather than the state machine node.
>
> I'd also vote that any clarification of state validity and transitions
> belongs in the uAPI header and a transition test function should
> reference that header as the source of truth, rather than the other way
> around.  Thanks,

Yes, I guess this is possible.

>
> Alex
>


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 10:44       ` Max Gurtovoy
@ 2021-09-29 12:35         ` Alex Williamson
  2021-09-29 13:26           ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-29 12:35 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Wed, 29 Sep 2021 13:44:10 +0300
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 9/28/2021 2:12 AM, Jason Gunthorpe wrote:
> > On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:  
> >>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> >>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> >>> +		[VFIO_DEVICE_STATE_STOP] = {
> >>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> >>> +		},  
> >> Our state transition diagram is pretty weak on reachable transitions
> >> out of the _STOP state, why do we select only these two as valid?  
> > I have no particular opinion on specific states here, however adding
> > more states means more stuff for drivers to implement and more risk
> > driver writers will mess up this uAPI.  
> 
> _STOP == 000b => Device Stopped, not saving or resuming (from UAPI).
> 
> This is the default initial state and not RUNNING.
> 
> The user application should move device from STOP => RUNNING or STOP => 
> RESUMING.
> 
> Maybe we need to extend the comment in the UAPI file.


include/uapi/linux/vfio.h:
...
 *  +------- _RESUMING
 *  |+------ _SAVING
 *  ||+----- _RUNNING
 *  |||
 *  000b => Device Stopped, not saving or resuming
 *  001b => Device running, which is the default state
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
...
 * State transitions:
 *
 *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
 *                (100b)     (001b)     (011b)        (010b)       (000b)
 * 0. Running or default state
 *                             |
                 ^^^^^^^^^^^^^
...
 * 0. Default state of VFIO device is _RUNNING when the user application starts.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The uAPI is pretty clear here.  A default state of _STOP is not
compatible with existing devices and userspace that does not support
migration.  Thanks,

Alex



* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 12:35         ` Alex Williamson
@ 2021-09-29 13:26           ` Max Gurtovoy
  2021-09-29 13:50             ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-29 13:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/29/2021 3:35 PM, Alex Williamson wrote:
> On Wed, 29 Sep 2021 13:44:10 +0300
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>> On 9/28/2021 2:12 AM, Jason Gunthorpe wrote:
>>> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:
>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>>>> +		},
>>>> Our state transition diagram is pretty weak on reachable transitions
>>>> out of the _STOP state, why do we select only these two as valid?
>>> I have no particular opinion on specific states here, however adding
>>> more states means more stuff for drivers to implement and more risk
>>> driver writers will mess up this uAPI.
>> _STOP == 000b => Device Stopped, not saving or resuming (from UAPI).
>>
>> This is the default initial state and not RUNNING.
>>
>> The user application should move device from STOP => RUNNING or STOP =>
>> RESUMING.
>>
>> Maybe we need to extend the comment in the UAPI file.
>
> include/uapi/linux/vfio.h:
> ...
>   *  +------- _RESUMING
>   *  |+------ _SAVING
>   *  ||+----- _RUNNING
>   *  |||
>   *  000b => Device Stopped, not saving or resuming
>   *  001b => Device running, which is the default state
>                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
> ...
>   * State transitions:
>   *
>   *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>   *                (100b)     (001b)     (011b)        (010b)       (000b)
>   * 0. Running or default state
>   *                             |
>                   ^^^^^^^^^^^^^
> ...
>   * 0. Default state of VFIO device is _RUNNING when the user application starts.
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The uAPI is pretty clear here.  A default state of _STOP is not
> compatible with existing devices and userspace that does not support
> migration.  Thanks,

Why do you need this state machine for userspace that doesn't support
migration?

What is the definition of the RUNNING state for a paused VM that is waiting
for an incoming migration blob?

>
> Alex
>


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 13:26           ` Max Gurtovoy
@ 2021-09-29 13:50             ` Alex Williamson
  2021-09-29 14:36               ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-29 13:50 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Wed, 29 Sep 2021 16:26:55 +0300
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 9/29/2021 3:35 PM, Alex Williamson wrote:
> > On Wed, 29 Sep 2021 13:44:10 +0300
> > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >  
> >> On 9/28/2021 2:12 AM, Jason Gunthorpe wrote:  
> >>> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:  
> >>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> >>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> >>>>> +		[VFIO_DEVICE_STATE_STOP] = {
> >>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> >>>>> +		},  
> >>>> Our state transition diagram is pretty weak on reachable transitions
> >>>> out of the _STOP state, why do we select only these two as valid?  
> >>> I have no particular opinion on specific states here, however adding
> >>> more states means more stuff for drivers to implement and more risk
> >>> driver writers will mess up this uAPI.  
> >> _STOP == 000b => Device Stopped, not saving or resuming (from UAPI).
> >>
> >> This is the default initial state and not RUNNING.
> >>
> >> The user application should move device from STOP => RUNNING or STOP =>
> >> RESUMING.
> >>
> >> Maybe we need to extend the comment in the UAPI file.  
> >
> > include/uapi/linux/vfio.h:
> > ...
> >   *  +------- _RESUMING
> >   *  |+------ _SAVING
> >   *  ||+----- _RUNNING
> >   *  |||
> >   *  000b => Device Stopped, not saving or resuming
> >   *  001b => Device running, which is the default state
> >                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > ...
> >   * State transitions:
> >   *
> >   *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> >   *                (100b)     (001b)     (011b)        (010b)       (000b)
> >   * 0. Running or default state
> >   *                             |
> >                   ^^^^^^^^^^^^^
> > ...
> >   * 0. Default state of VFIO device is _RUNNING when the user application starts.
> >        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The uAPI is pretty clear here.  A default state of _STOP is not
> > compatible with existing devices and userspace that does not support
> > migration.  Thanks,  
> 
> Why do you need this state machine for userspace that doesn't support 
> migration ?

For userspace that doesn't support migration, there's one state,
_RUNNING.  That's what we're trying to be compatible and consistent
with.  Migration is an extension, not a base requirement.

> What is the definition of RUNNING state for a paused VM that is waiting 
> for incoming migration blob ?

A VM supporting migration of the device would move the device to
_RESUMING to load the incoming data.  If the VM leaves the device in
_RUNNING, then it doesn't support migration of the device and it's out
of scope how it handles that device state.  Existing devices continue
running regardless of whether the VM state is paused, it's only devices
supporting migration where userspace could optionally have the device
run state follow the VM run state.  Thanks,

Alex



* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 13:50             ` Alex Williamson
@ 2021-09-29 14:36               ` Max Gurtovoy
  2021-09-29 15:17                 ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-29 14:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/29/2021 4:50 PM, Alex Williamson wrote:
> On Wed, 29 Sep 2021 16:26:55 +0300
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>> On 9/29/2021 3:35 PM, Alex Williamson wrote:
>>> On Wed, 29 Sep 2021 13:44:10 +0300
>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>   
>>>> On 9/28/2021 2:12 AM, Jason Gunthorpe wrote:
>>>>> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:
>>>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>>>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
>>>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>>>>>> +		},
>>>>>> Our state transition diagram is pretty weak on reachable transitions
>>>>>> out of the _STOP state, why do we select only these two as valid?
>>>>> I have no particular opinion on specific states here, however adding
>>>>> more states means more stuff for drivers to implement and more risk
>>>>> driver writers will mess up this uAPI.
>>>> _STOP == 000b => Device Stopped, not saving or resuming (from UAPI).
>>>>
>>>> This is the default initial state and not RUNNING.
>>>>
>>>> The user application should move device from STOP => RUNNING or STOP =>
>>>> RESUMING.
>>>>
>>>> Maybe we need to extend the comment in the UAPI file.
>>> include/uapi/linux/vfio.h:
>>> ...
>>>    *  +------- _RESUMING
>>>    *  |+------ _SAVING
>>>    *  ||+----- _RUNNING
>>>    *  |||
>>>    *  000b => Device Stopped, not saving or resuming
>>>    *  001b => Device running, which is the default state
>>>                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> ...
>>>    * State transitions:
>>>    *
>>>    *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>>>    *                (100b)     (001b)     (011b)        (010b)       (000b)
>>>    * 0. Running or default state
>>>    *                             |
>>>                    ^^^^^^^^^^^^^
>>> ...
>>>    * 0. Default state of VFIO device is _RUNNING when the user application starts.
>>>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> The uAPI is pretty clear here.  A default state of _STOP is not
>>> compatible with existing devices and userspace that does not support
>>> migration.  Thanks,
>> Why do you need this state machine for userspace that doesn't support
>> migration ?
> For userspace that doesn't support migration, there's one state,
> _RUNNING.  That's what we're trying to be compatible and consistent
> with.  Migration is an extension, not a base requirement.

Userspace without migration doesn't care about this state.

That leaves the kernel. vfio-pci today doesn't support migration, right?
Its state is, in theory, 0 (STOP).

This state machine is controlled by the migration SW. The drivers don't 
move state implicitly.

mlx5-vfio-pci supports migration and will work fine with non-migration SW
(it will stay at state = 0 unless someone moves it, but nobody will),
exactly like vfio-pci does today.

So where is the problem ?

>> What is the definition of RUNNING state for a paused VM that is waiting
>> for incoming migration blob ?
> A VM supporting migration of the device would move the device to
> _RESUMING to load the incoming data.  If the VM leaves the device in
> _RUNNING, then it doesn't support migration of the device and it's out
> of scope how it handles that device state.  Existing devices continue
> running regardless of whether the VM state is paused, it's only devices
> supporting migration where userspace could optionally have the device
> run state follow the VM run state.  Thanks,
>
> Alex
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 14:36               ` Max Gurtovoy
@ 2021-09-29 15:17                 ` Alex Williamson
  2021-09-29 15:28                   ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-29 15:17 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Wed, 29 Sep 2021 17:36:59 +0300
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 9/29/2021 4:50 PM, Alex Williamson wrote:
> > On Wed, 29 Sep 2021 16:26:55 +0300
> > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >  
> >> On 9/29/2021 3:35 PM, Alex Williamson wrote:  
> >>> On Wed, 29 Sep 2021 13:44:10 +0300
> >>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >>>     
> >>>> On 9/28/2021 2:12 AM, Jason Gunthorpe wrote:  
> >>>>> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:  
> >>>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
> >>>>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
> >>>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
> >>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
> >>>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
> >>>>>>> +		},  
> >>>>>> Our state transition diagram is pretty weak on reachable transitions
> >>>>>> out of the _STOP state, why do we select only these two as valid?  
> >>>>> I have no particular opinion on specific states here, however adding
> >>>>> more states means more stuff for drivers to implement and more risk
> >>>>> driver writers will mess up this uAPI.  
> >>>> _STOP == 000b => Device Stopped, not saving or resuming (from UAPI).
> >>>>
> >>>> This is the default initial state and not RUNNING.
> >>>>
> >>>> The user application should move device from STOP => RUNNING or STOP =>
> >>>> RESUMING.
> >>>>
> >>>> Maybe we need to extend the comment in the UAPI file.  
> >>> include/uapi/linux/vfio.h:
> >>> ...
> >>>    *  +------- _RESUMING
> >>>    *  |+------ _SAVING
> >>>    *  ||+----- _RUNNING
> >>>    *  |||
> >>>    *  000b => Device Stopped, not saving or resuming
> >>>    *  001b => Device running, which is the default state
> >>>                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
> >>> ...
> >>>    * State transitions:
> >>>    *
> >>>    *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> >>>    *                (100b)     (001b)     (011b)        (010b)       (000b)
> >>>    * 0. Running or default state
> >>>    *                             |
> >>>                    ^^^^^^^^^^^^^
> >>> ...
> >>>    * 0. Default state of VFIO device is _RUNNING when the user application starts.
> >>>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >>>
> >>> The uAPI is pretty clear here.  A default state of _STOP is not
> >>> compatible with existing devices and userspace that does not support
> >>> migration.  Thanks,  
> >> Why do you need this state machine for userspace that doesn't support
> >> migration ?  
> > For userspace that doesn't support migration, there's one state,
> > _RUNNING.  That's what we're trying to be compatible and consistent
> > with.  Migration is an extension, not a base requirement.  
> 
> Userspace without migration doesn't care about this state.
> 
> We left with kernel now. vfio-pci today doesn't support migration, right 
> ? state is in theory is 0 (STOP).
> 
> This state machine is controlled by the migration SW. The drivers don't 
> move state implicitly.
> 
> mlx5-vfio-pci support migration and will work fine with non-migration SW 
> (it will stay with state = 0 unless someone will move it. but nobody 
> will) exactly like vfio-pci does today.
> 
> So where is the problem ?

So you have a device that's actively modifying its internal state,
performing I/O, including DMA (thereby dirtying VM memory), all while
in the _STOP state?  And you don't see this as a problem?

There's a major inconsistency if the migration interface is telling us
something different than we can actually observe through the behavior of
the device.  Thanks,

Alex



* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 15:17                 ` Alex Williamson
@ 2021-09-29 15:28                   ` Max Gurtovoy
  2021-09-29 16:14                     ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-29 15:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/29/2021 6:17 PM, Alex Williamson wrote:
> On Wed, 29 Sep 2021 17:36:59 +0300
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>> On 9/29/2021 4:50 PM, Alex Williamson wrote:
>>> On Wed, 29 Sep 2021 16:26:55 +0300
>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>   
>>>> On 9/29/2021 3:35 PM, Alex Williamson wrote:
>>>>> On Wed, 29 Sep 2021 13:44:10 +0300
>>>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>      
>>>>>> On 9/28/2021 2:12 AM, Jason Gunthorpe wrote:
>>>>>>> On Mon, Sep 27, 2021 at 04:46:48PM -0600, Alex Williamson wrote:
>>>>>>>>> +	enum { MAX_STATE = VFIO_DEVICE_STATE_RESUMING };
>>>>>>>>> +	static const u8 vfio_from_state_table[MAX_STATE + 1][MAX_STATE + 1] = {
>>>>>>>>> +		[VFIO_DEVICE_STATE_STOP] = {
>>>>>>>>> +			[VFIO_DEVICE_STATE_RUNNING] = 1,
>>>>>>>>> +			[VFIO_DEVICE_STATE_RESUMING] = 1,
>>>>>>>>> +		},
>>>>>>>> Our state transition diagram is pretty weak on reachable transitions
>>>>>>>> out of the _STOP state, why do we select only these two as valid?
>>>>>>> I have no particular opinion on specific states here, however adding
>>>>>>> more states means more stuff for drivers to implement and more risk
>>>>>>> driver writers will mess up this uAPI.
>>>>>> _STOP == 000b => Device Stopped, not saving or resuming (from UAPI).
>>>>>>
>>>>>> This is the default initial state and not RUNNING.
>>>>>>
>>>>>> The user application should move device from STOP => RUNNING or STOP =>
>>>>>> RESUMING.
>>>>>>
>>>>>> Maybe we need to extend the comment in the UAPI file.
>>>>> include/uapi/linux/vfio.h:
>>>>> ...
>>>>>     *  +------- _RESUMING
>>>>>     *  |+------ _SAVING
>>>>>     *  ||+----- _RUNNING
>>>>>     *  |||
>>>>>     *  000b => Device Stopped, not saving or resuming
>>>>>     *  001b => Device running, which is the default state
>>>>>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> ...
>>>>>     * State transitions:
>>>>>     *
>>>>>     *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
>>>>>     *                (100b)     (001b)     (011b)        (010b)       (000b)
>>>>>     * 0. Running or default state
>>>>>     *                             |
>>>>>                     ^^^^^^^^^^^^^
>>>>> ...
>>>>>     * 0. Default state of VFIO device is _RUNNING when the user application starts.
>>>>>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>
>>>>> The uAPI is pretty clear here.  A default state of _STOP is not
>>>>> compatible with existing devices and userspace that does not support
>>>>> migration.  Thanks,
>>>> Why do you need this state machine for userspace that doesn't support
>>>> migration ?
>>> For userspace that doesn't support migration, there's one state,
>>> _RUNNING.  That's what we're trying to be compatible and consistent
>>> with.  Migration is an extension, not a base requirement.
>> Userspace without migration doesn't care about this state.
>>
>> We left with kernel now. vfio-pci today doesn't support migration, right
>> ? state is in theory is 0 (STOP).
>>
>> This state machine is controlled by the migration SW. The drivers don't
>> move state implicitly.
>>
>> mlx5-vfio-pci support migration and will work fine with non-migration SW
>> (it will stay with state = 0 unless someone will move it. but nobody
>> will) exactly like vfio-pci does today.
>>
>> So where is the problem ?
> So you have a device that's actively modifying its internal state,
> performing I/O, including DMA (thereby dirtying VM memory), all while
> in the _STOP state?  And you don't see this as a problem?

I don't see how it is different from the vfio-pci situation.

And you said you're worried about compatibility. I can't see a 
compatibility issue here.

Maybe we need to rename the STOP state. We can call it READY, LIVE, or 
NON_MIGRATION_STATE.

>
> There's a major inconsistency if the migration interface is telling us
> something different than we can actually observe through the behavior of
> the device.  Thanks,
>
> Alex
>


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 15:28                   ` Max Gurtovoy
@ 2021-09-29 16:14                     ` Jason Gunthorpe
  2021-09-29 21:48                       ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 16:14 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote:

> > So you have a device that's actively modifying its internal state,
> > performing I/O, including DMA (thereby dirtying VM memory), all while
> > in the _STOP state?  And you don't see this as a problem?
> 
> I don't see how is it different from vfio-pci situation.

vfio-pci provides no way to observe the migration state. It isn't
"000b"

> Maybe we need to rename STOP state. We can call it READY or LIVE or
> NON_MIGRATION_STATE.

It was a poor choice to use 000b as stop, but it doesn't really
matter. The mlx5 driver should just pre-init this readable value to running.

Jason


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-28 20:18           ` Alex Williamson
@ 2021-09-29 16:16             ` Jason Gunthorpe
  2021-09-29 18:06               ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 16:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> On Tue, 28 Sep 2021 16:35:50 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > 
> > > In defining the device state, we tried to steer away from defining it
> > > in terms of the QEMU migration API, but rather as a set of controls
> > > that could be used to support that API to leave us some degree of
> > > independence that QEMU implementation might evolve.  
> > 
> > That is certainly a different perspective, it would have been
> > better to not express this idea as a FSM in that case...
> > 
> > So each state in mlx5vf_pci_set_device_state() should call the correct
> > combination of (un)freeze, (un)quiesce and so on so each state
> > reflects a defined operation of the device?
> 
> I'd expect so, for instance the implementation of entering the _STOP
> state presumes a previous state that where the device is apparently
> already quiesced.  That doesn't support a direct _RUNNING -> _STOP
> transition where I argued in the linked threads that those states
> should be reachable from any other state.  Thanks,

If we focus on mlx5 there are two device 'flags' to manage:
 - Device cannot issue DMAs
 - Device internal state cannot change (ie cannot receive DMAs)

This is necessary to co-ordinate across multiple devices that might be
doing peer to peer DMA between them. The whole multi-device complex
should be moved to "cannot issue DMA's" then the whole complex would
go to "state cannot change" and be serialized.

The expected sequence at the device is thus

Resuming
 full stop -> does not issue DMAs -> full operation
Suspend
 full operation -> does not issue DMAs -> full stop

Further the device has two actions
 - Trigger serializing the device state
 - Trigger de-serializing the device state

So, what is the behavior upon each state:

 *  000b => Device Stopped, not saving or resuming
     Does not issue DMAs
     Internal state cannot change

 *  001b => Device running, which is the default state
     Neither flags

 *  010b => Stop the device & save the device state, stop-and-copy state
     Does not issue DMAs
     Internal state cannot change

 *  011b => Device running and save the device state, pre-copy state
     Neither flags
     (future, DMA tracking turned on)

 *  100b => Device stopped and the device state is resuming
     Does not issue DMAs
     Internal state cannot change
     
 *  110b => Error state
    ???

 *  101b => Invalid state
 *  111b => Invalid state

    ???

What should the ??'s be? It looks like mlx5 doesn't use these, so it
should just refuse to enter these states in the first place..

The two actions:
 trigger serializing the device state
   Done when asked to go to 010b ?

 trigger de-serializing the device state
   Done when transition from 100b -> 000b ?

There is a missing state "Stop Active Transactions" which would be
only "does not issue DMAs". I've seen a proposal to add that.

I'm happy enough with this and it seems clean and easy enough to
implement.
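The per-state behavior above can be written down as a small lookup. A sketch only, with illustrative names (nothing here is actual mlx5 or vfio code):

```c
#include <assert.h>
#include <stdbool.h>

/* Bit layout from the vfio uAPI comment quoted earlier in the thread */
#define ST_RUNNING  (1u << 0)
#define ST_SAVING   (1u << 1)
#define ST_RESUMING (1u << 2)

/* The two device-side flags described above (names are illustrative) */
struct dev_flags {
	bool no_dma;	/* device cannot issue DMAs */
	bool frozen;	/* internal state cannot change */
};

/* Hypothetical helper mapping each reachable state to its flags */
static struct dev_flags flags_for_state(unsigned int state)
{
	switch (state) {
	case 0:				/* 000b, _STOP */
	case ST_SAVING:			/* 010b, stop-and-copy */
	case ST_RESUMING:		/* 100b, resuming */
		return (struct dev_flags){ .no_dma = true, .frozen = true };
	case ST_RUNNING:		/* 001b, running */
	case ST_RUNNING | ST_SAVING:	/* 011b, pre-copy */
		return (struct dev_flags){ .no_dma = false, .frozen = false };
	default:			/* 101b/110b/111b: never entered */
		assert(0);
		return (struct dev_flags){ 0 };
	}
}
```

The invalid combinations fall through to the default arm, matching the "just refuse to enter these states" suggestion.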

Jason


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 16:16             ` Jason Gunthorpe
@ 2021-09-29 18:06               ` Alex Williamson
  2021-09-29 18:26                 ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-29 18:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Wed, 29 Sep 2021 13:16:02 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> > On Tue, 28 Sep 2021 16:35:50 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >   
> > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > >   
> > > > In defining the device state, we tried to steer away from defining it
> > > > in terms of the QEMU migration API, but rather as a set of controls
> > > > that could be used to support that API to leave us some degree of
> > > > independence that QEMU implementation might evolve.    
> > > 
> > > That is certainly a different perspective, it would have been
> > > better to not express this idea as a FSM in that case...
> > > 
> > > So each state in mlx5vf_pci_set_device_state() should call the correct
> > > combination of (un)freeze, (un)quiesce and so on so each state
> > > reflects a defined operation of the device?  
> > 
> > I'd expect so, for instance the implementation of entering the _STOP
> > state presumes a previous state that where the device is apparently
> > already quiesced.  That doesn't support a direct _RUNNING -> _STOP
> > transition where I argued in the linked threads that those states
> > should be reachable from any other state.  Thanks,  
> 
> If we focus on mlx5 there are two device 'flags' to manage:
>  - Device cannot issue DMAs
>  - Device internal state cannot change (ie cannot receive DMAs)
> 
> This is necessary to co-ordinate across multiple devices that might be
> doing peer to peer DMA between them. The whole multi-device complex
> should be moved to "cannot issue DMA's" then the whole complex would
> go to "state cannot change" and be serialized.

Are you anticipating p2p from outside the VM?  The typical scenario
here would be that p2p occurs only intra-VM, so all the devices would
stop issuing DMA (modulo trying to quiesce devices simultaneously).

> The expected sequence at the device is thus
> 
> Resuming
>  full stop -> does not issue DMAs -> full operation
> Suspend
>  full operation -> does not issue DMAs -> full stop
> 
> Further the device has two actions
>  - Trigger serializating the device state
>  - Trigger de-serializing the device state
> 
> So, what is the behavior upon each state:
> 
>  *  000b => Device Stopped, not saving or resuming
>      Does not issue DMAs
>      Internal state cannot change
> 
>  *  001b => Device running, which is the default state
>      Neither flags
> 
>  *  010b => Stop the device & save the device state, stop-and-copy state
>      Does not issue DMAs
>      Internal state cannot change
> 
>  *  011b => Device running and save the device state, pre-copy state
>      Neither flags
>      (future, DMA tracking turned on)
> 
>  *  100b => Device stopped and the device state is resuming
>      Does not issue DMAs
>      Internal state cannot change

cannot change... other that as loaded via migration region.

>      
>  *  110b => Error state
>     ???
> 
>  *  101b => Invalid state
>  *  111b => Invalid state
> 
>     ???
> 
> What should the ??'s be? It looks like mlx5 doesn't use these, so it
> should just refuse to enter these states in the first place..

_SAVING and _RESUMING are considered mutually exclusive, therefore any
combination of both of them is invalid.  We've chosen to use the
combination of 110b as an error state to indicate the device state is
undefined, but not _RUNNING.  This state is only reachable by an
internal error of the driver during a state transition.

The expected protocol is that if the user write to the device_state
register returns an errno, the user reevaluates the device_state to
determine if the desired transition is unavailable (previous state
value is returned) or generated a fault (error state value returned).
Due to the undefined state of the device, the only exit from the error
state is to re-initialize the device state via a reset.  Therefore a
successful device reset should always return the device to the 001b
state.
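That read-back protocol could be sketched as a small userspace helper; the names and return codes below are illustrative, not from any real VMM (only the 110b error-state value comes from the uAPI under discussion):

```c
#include <assert.h>
#include <stdint.h>

#define ST_ERROR 6u	/* 110b: _SAVING | _RESUMING without _RUNNING */

enum write_failure {
	TRANSITION_REFUSED,	/* device kept the previous state */
	DEVICE_NEEDS_RESET,	/* device entered the error state */
	STATE_UNEXPECTED,	/* neither: treat as a driver bug */
};

/*
 * After a write to device_state returns an errno, userspace re-reads the
 * register and classifies what happened, per the protocol described above.
 */
static enum write_failure classify_failed_write(uint32_t reread,
						uint32_t prev)
{
	if (reread == ST_ERROR)
		return DEVICE_NEEDS_RESET;	/* only exit is a device reset */
	if (reread == prev)
		return TRANSITION_REFUSED;	/* transition unavailable */
	return STATE_UNEXPECTED;
}
```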

The 111b state is also considered unreachable through normal means due
to the _SAVING | _RESUMING conflict, but suggests the device is also
_RUNNING in this undefined state.  This combination has no currently
defined use case and should not be reachable.

The 101b state indicates _RUNNING while _RESUMING, which is simply not
a mode that has been spec'd at this time as it would require some
mechanism for the device to fault in state on demand.
 
> The two actions:
>  trigger serializing the device state
>    Done when asked to go to 010b ?

When the _SAVING bit is set.  The exact mechanics depends on the size
and volatility of the device state.  A GPU might begin in pre-copy
(011b) to transmit chunks of framebuffer data, recording hashes of
blocks read by the user to avoid re-sending them during the
stop-and-copy (010b) phase.  A device with a small internal state
representation may choose to forgo providing data in the pre-copy phase
and entirely serialize internal state at stop-and-copy.

>  trigger de-serializing the device state
>    Done when transition from 100b -> 000b ?

100b -> 000b is not a required transition, generally this would be 100b
-> 001b, ie. end state of _RUNNING vs _STOP.

I think the requirement is that de-serialization is complete when the
_RESUMING bit is cleared.  Whether the driver chooses to de-serialize
piece-wise as each block of data is written to the device or in bulk
from a buffer is left to the implementation.  In either case, the
driver can fail the transition to !_RESUMING if the state is incomplete
or otherwise corrupt.  It would again be the driver's discretion if
the device enters the error state or remains in _RESUMING.  If the user
has no valid state with which to exit the _RESUMING phase, a device
reset should return the device to _RUNNING with a default initial state.
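Driver-side, the "fail the transition to !_RESUMING if the state is incomplete" rule might look like the following. This is a sketch under the assumption that the driver knows the expected blob length up front; all names are made up:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ST_RESUMING (1u << 2)

/*
 * Called when userspace writes device_state while the device is in
 * _RESUMING: if the new state clears _RESUMING but the received blob is
 * short, refuse the transition so the device never silently starts with
 * partial state.
 */
static int check_exit_resuming(size_t bytes_expected, size_t bytes_received,
			       unsigned int new_state)
{
	if (new_state & ST_RESUMING)
		return 0;	/* still resuming; accept more data */
	if (bytes_received < bytes_expected)
		return -EINVAL;	/* incomplete state: fail the transition */
	return 0;		/* complete: de-serialize and proceed */
}
```

On the -EINVAL path the driver could, at its discretion, either remain in _RESUMING or enter the error state, as described above.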

> There is a missing state "Stop Active Transactions" which would be
> only "does not issue DMAs". I've seen a proposal to add that.

This would be to get all devices to stop issuing DMA while internal
state can be modified to avoid the synchronization issue of trying to
stop devices concurrently?  For PCI devices we obviously have the bus
master bit to manage that, but I could see how a migration extension
for such support (perhaps even just wired through to BM for PCI) could
be useful.  Thanks,

Alex



* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 18:06               ` Alex Williamson
@ 2021-09-29 18:26                 ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 18:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Doug Ledford, Yishai Hadas, Bjorn Helgaas,
	David S. Miller, Jakub Kicinski, Kirti Wankhede, kvm,
	linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Wed, Sep 29, 2021 at 12:06:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 13:16:02 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> > > On Tue, 28 Sep 2021 16:35:50 -0300
> > > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >   
> > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > > >   
> > > > > In defining the device state, we tried to steer away from defining it
> > > > > in terms of the QEMU migration API, but rather as a set of controls
> > > > > that could be used to support that API to leave us some degree of
> > > > > independence that QEMU implementation might evolve.    
> > > > 
> > > > That is certainly a different perspective, it would have been
> > > > better to not express this idea as a FSM in that case...
> > > > 
> > > > So each state in mlx5vf_pci_set_device_state() should call the correct
> > > > combination of (un)freeze, (un)quiesce and so on so each state
> > > > reflects a defined operation of the device?  
> > > 
> > > I'd expect so, for instance the implementation of entering the _STOP
> > > state presumes a previous state that where the device is apparently
> > > already quiesced.  That doesn't support a direct _RUNNING -> _STOP
> > > transition where I argued in the linked threads that those states
> > > should be reachable from any other state.  Thanks,  
> > 
> > If we focus on mlx5 there are two device 'flags' to manage:
> >  - Device cannot issue DMAs
> >  - Device internal state cannot change (ie cannot receive DMAs)
> > 
> > This is necessary to co-ordinate across multiple devices that might be
> > doing peer to peer DMA between them. The whole multi-device complex
> > should be moved to "cannot issue DMA's" then the whole complex would
> > go to "state cannot change" and be serialized.
> 
> Are you anticipating p2p from outside the VM?  The typical scenario
> here would be that p2p occurs only intra-VM, so all the devices would
> stop issuing DMA (modulo trying to quiesce devices simultaneously).

Inside the VM.

Your 'modulo trying to quiesce devices simultaneously' is correct -
this is a real issue that needs to be solved.

If we put one device in a state where its internal state is immutable
it can no longer accept DMA messages from the other devices. So there
are two states in the HW model - do not generate DMAs and finally the
immutable internal state where even external DMAs are refused.

> > The expected sequence at the device is thus
> > 
> > Resuming
> >  full stop -> does not issue DMAs -> full operation
> > Suspend
> >  full operation -> does not issue DMAs -> full stop
> > 
> > Further the device has two actions
> >  - Trigger serializating the device state
> >  - Trigger de-serializing the device state
> > 
> > So, what is the behavior upon each state:
> > 
> >  *  000b => Device Stopped, not saving or resuming
> >      Does not issue DMAs
> >      Internal state cannot change
> > 
> >  *  001b => Device running, which is the default state
> >      Neither flags
> > 
> >  *  010b => Stop the device & save the device state, stop-and-copy state
> >      Does not issue DMAs
> >      Internal state cannot change
> > 
> >  *  011b => Device running and save the device state, pre-copy state
> >      Neither flags
> >      (future, DMA tracking turned on)
> > 
> >  *  100b => Device stopped and the device state is resuming
> >      Does not issue DMAs
> >      Internal state cannot change
> 
> cannot change... other that as loaded via migration region.

Yes

> The expected protocol is that if the user write to the device_state
> register returns an errno, the user reevaluates the device_state to
> determine if the desired transition is unavailable (previous state
> value is returned) or generated a fault (error state value
> returned).

Hmm, interesting, mlx5 should be doing this as well. E.g. resuming with
corrupt state should fail and cannot be recovered except via reset.

> The 101b state indicates _RUNNING while _RESUMING, which is simply not
> a mode that has been spec'd at this time as it would require some
> mechanism for the device to fault in state on demand.

So let's error on these requests since we don't know what state to put
the device into.

> > The two actions:
> >  trigger serializing the device state
> >    Done when asked to go to 010b ?
> 
> When the _SAVING bit is set.  The exact mechanics depends on the size
> and volatility of the device state.  A GPU might begin in pre-copy
> (011b) to transmit chunks of framebuffer data, recording hashes of
> blocks read by the user to avoid re-sending them during the
> stop-and-copy (010b) phase.  

Here I am talking specifically about mlx5 which does not have a state
capture in pre-copy. So mlx5 should capture state on 010b only, and
the 011b is a NOP.

> >  trigger de-serializing the device state
> >    Done when transition from 100b -> 000b ?
> 
> 100b -> 000b is not a required transition, generally this would be 100b
> -> 001b, ie. end state of _RUNNING vs _STOP.

Sorry, I typo'd it, yes to _RUNNING

> I think the requirement is that de-serialization is complete when the
> _RESUMING bit is cleared.  Whether the driver chooses to de-serialize
> piece-wise as each block of data is written to the device or in bulk
> from a buffer is left to the implementation.  In either case, the
> driver can fail the transition to !_RESUMING if the state is incomplete
> or otherwise corrupt.  It would again be the driver's discretion if
> the device enters the error state or remains in _RESUMING.  If the user
> has no valid state with which to exit the _RESUMING phase, a device
> reset should return the device to _RUNNING with a default initial state.

That makes sense enough.

> > There is a missing state "Stop Active Transactions" which would be
> > only "does not issue DMAs". I've seen a proposal to add that.
> 
> This would be to get all devices to stop issuing DMA while internal
> state can be modified to avoid the synchronization issue of trying to
> stop devices concurrently?  

Yes, as above

> For PCI devices we obviously have the bus master bit to manage that,
> but I could see how a migration extension for such support (perhaps
> even just wired through to BM for PCI) could be useful.  Thanks,

I'm nervous about overriding the BM bit for something like this; the BM bit
isn't a gentle "please coherently stop what you are doing", it is a
handbrake the OS pulls to ensure any PCI device becomes quiet.

Thanks,
Jason


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 16:14                     ` Jason Gunthorpe
@ 2021-09-29 21:48                       ` Max Gurtovoy
  2021-09-29 22:44                         ` Alex Williamson
  2021-09-29 23:21                         ` Jason Gunthorpe
  0 siblings, 2 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-29 21:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/29/2021 7:14 PM, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote:
>
>>> So you have a device that's actively modifying its internal state,
>>> performing I/O, including DMA (thereby dirtying VM memory), all while
>>> in the _STOP state?  And you don't see this as a problem?
>> I don't see how is it different from vfio-pci situation.
> vfio-pci provides no way to observe the migration state. It isn't
> "000b"

Alex said that there is a compatibility problem.

If migration SW is not involved, nobody will read this migration state.

>> Maybe we need to rename STOP state. We can call it READY or LIVE or
>> NON_MIGRATION_STATE.
> It was a poor choice to use 000b as stop, but it doesn't really
> matter. The mlx5 driver should just pre-init this readable to running.

I guess we can do it for this reason. There is no functional problem nor 
compatibility issue here, as was mentioned.

But we still need the kernel to track transitions. We don't want to 
allow moving from the RESUMING to the SAVING state, for example. How can 
this transition be allowed?

In this case we need to fail the request from the migration SW...
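The table-driven check being asked for here is essentially the vfio_from_state_table from the patch quoted earlier in the thread. A standalone sketch follows; since the exact set of permitted transitions is what this thread is debating, the entries below are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

enum {
	S_STOP     = 0,	/* 000b */
	S_RUNNING  = 1,	/* 001b */
	S_SAVING   = 2,	/* 010b, stop-and-copy */
	S_PRECOPY  = 3,	/* 011b */
	S_RESUMING = 4,	/* 100b */
	S_MAX      = 4,
};

/* allowed[old][new]: true when the transition is permitted */
static const bool allowed[S_MAX + 1][S_MAX + 1] = {
	[S_STOP]     = { [S_RUNNING] = true, [S_RESUMING] = true },
	[S_RUNNING]  = { [S_STOP] = true, [S_PRECOPY] = true, [S_SAVING] = true },
	[S_PRECOPY]  = { [S_RUNNING] = true, [S_SAVING] = true },
	[S_SAVING]   = { [S_STOP] = true, [S_RUNNING] = true },
	[S_RESUMING] = { [S_STOP] = true, [S_RUNNING] = true },
};

static bool transition_valid(unsigned int old_state, unsigned int new_state)
{
	if (old_state > S_MAX || new_state > S_MAX)
		return false;	/* 101b/110b/111b are refused outright */
	return allowed[old_state][new_state];
}
```

With such a table, RESUMING -> SAVING simply reads back false and the write from the migration SW is failed with an errno.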


>
> Jason


* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 21:48                       ` Max Gurtovoy
@ 2021-09-29 22:44                         ` Alex Williamson
  2021-09-30  9:25                           ` Max Gurtovoy
  2021-09-29 23:21                         ` Jason Gunthorpe
  1 sibling, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-09-29 22:44 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Thu, 30 Sep 2021 00:48:55 +0300
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 9/29/2021 7:14 PM, Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote:
> >  
> >>> So you have a device that's actively modifying its internal state,
> >>> performing I/O, including DMA (thereby dirtying VM memory), all while
> >>> in the _STOP state?  And you don't see this as a problem?  
> >> I don't see how is it different from vfio-pci situation.  
> > vfio-pci provides no way to observe the migration state. It isn't
> > "000b"  
> 
> Alex said that there is a problem of compatibility.
> 
> If migration SW is not involved, nobody will read this migration state.

The _STOP state has a specific meaning regardless of whether userspace
reads the device state value.  I think what you're suggesting is that
the device reports itself as _STOP'd but it's actually _RUNNING.  Is
that the compatibility workaround, create a self inconsistency?

We cannot impose on userspace to move a device from _STOP to _RUNNING
simply because the device supports the migration region, nor should we
report a device state that is inconsistent with the actual device state.

> >> Maybe we need to rename STOP state. We can call it READY or LIVE or
> >> NON_MIGRATION_STATE.  
> > It was a poor choice to use 000b as stop, but it doesn't really
> > matter. The mlx5 driver should just pre-init this readable to running.  
> 
> I guess we can do it for this reason. There is no functional problem nor 
> compatibility issue here as was mentioned.
> 
> But still we need the kernel to track transitions. We don't want to 
> allow moving from RESUMING to SAVING state for example. How this 
> transition can be allowed ?
> 
> In this case we need to fail the request from the migration SW...

_RESUMING to _SAVING seems like a good way to test round trip migration
without running the device to modify the state.  Potentially it's a
means to update a saved device migration data stream to a newer format
using an intermediate driver version.

If a driver is written such that it simply sees clearing the _RESUME
bit as an indicator to de-serialize the data stream to the device, and
setting the _SAVING flag as an indicator to re-serialize that data
stream from the device, then this is just a means to make use of
existing data paths.

The uAPI specifies a means for drivers to reject a state change, but
that risks failing to support a transition which might find mainstream
use cases.  I don't think common code should be responsible for
filtering out viable transitions.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 21:48                       ` Max Gurtovoy
  2021-09-29 22:44                         ` Alex Williamson
@ 2021-09-29 23:21                         ` Jason Gunthorpe
  2021-09-30  9:34                           ` Max Gurtovoy
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 23:21 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Thu, Sep 30, 2021 at 12:48:55AM +0300, Max Gurtovoy wrote:
> 
> On 9/29/2021 7:14 PM, Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote:
> > 
> > > > So you have a device that's actively modifying its internal state,
> > > > performing I/O, including DMA (thereby dirtying VM memory), all while
> > > > in the _STOP state?  And you don't see this as a problem?
> > > I don't see how is it different from vfio-pci situation.
> > vfio-pci provides no way to observe the migration state. It isn't
> > "000b"
> 
> Alex said that there is a problem of compatibility.

Yes, when a vfio_device first opens it must be running - ie able to do
DMA and otherwise operational.

When we add the migration extension this cannot change, so after
open_device() the device should be operational.

The reported state in the migration region should accurately reflect
what the device is currently doing. If the device is operational then
it must report running, not stopped.

Thus a driver cannot just zero-initialize the migration "registers",
they have to be accurate.

> > > Maybe we need to rename STOP state. We can call it READY or LIVE or
> > > NON_MIGRATION_STATE.
> > It was a poor choice to use 000b as stop, but it doesn't really
> > matter. The mlx5 driver should just pre-init this readable to running.
> 
> I guess we can do it for this reason. There is no functional problem nor
> compatibility issue here as was mentioned.
> 
> But still we need the kernel to track transitions. We don't want to allow
> moving from RESUMING to SAVING state for example. How this transition can be
> allowed ?

It seems semantically fine to me, as per Alex's note what will happen
is defined:

driver will see RESUMING toggle off so it will trigger a
de-serialization

driver will see SAVING toggled on so it will serialize the new state
(either the pre-copy state or the post-copy state depending on the
running bit)

Depending on the running bit the device may or may not be woken up.

If de-serialization fails then the state goes to error and SAVING is
ignored.

The driver logic probably looks something like this:

// Running toggles off
if ((oldstate & RUNNING) != (newstate & RUNNING) && (oldstate & RUNNING))
    quiesce
    freeze

// Resuming toggles off
if ((oldstate & RESUMING) != (newstate & RESUMING) && (oldstate & RESUMING))
   deserialize

// Saving toggles on
if ((oldstate & SAVING) != (newstate & SAVING) && (newstate & SAVING))
   if (!(newstate & RUNNING))
     serialize post copy

// Running toggles on
if ((oldstate & RUNNING) != (newstate & RUNNING) && (newstate & RUNNING))
   unfreeze
   unquiesce

I'd have to check that carefully against the state chart from my last
email though..

And need to check how the "Stop Active Transactions" bit fits in there

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 22:44                         ` Alex Williamson
@ 2021-09-30  9:25                           ` Max Gurtovoy
  2021-09-30 12:41                             ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-30  9:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/30/2021 1:44 AM, Alex Williamson wrote:
> On Thu, 30 Sep 2021 00:48:55 +0300
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>> On 9/29/2021 7:14 PM, Jason Gunthorpe wrote:
>>> On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote:
>>>   
>>>>> So you have a device that's actively modifying its internal state,
>>>>> performing I/O, including DMA (thereby dirtying VM memory), all while
>>>>> in the _STOP state?  And you don't see this as a problem?
>>>> I don't see how is it different from vfio-pci situation.
>>> vfio-pci provides no way to observe the migration state. It isn't
>>> "000b"
>> Alex said that there is a problem of compatibility.
>>
>> If migration SW is not involved, nobody will read this migration state.
> The _STOP state has a specific meaning regardless of whether userspace
> reads the device state value.  I think what you're suggesting is that
> the device reports itself as _STOP'd but it's actually _RUNNING.  Is
> that the compatibility workaround, create a self inconsistency?

From the migration point of view, the device is stopped.

>
> We cannot impose on userspace to move a device from _STOP to _RUNNING
> simply because the device supports the migration region, nor should we
> report a device state that is inconsistent with the actual device state.

In this case we can think about maybe moving to running when enabling 
bus master...


>
>>>> Maybe we need to rename STOP state. We can call it READY or LIVE or
>>>> NON_MIGRATION_STATE.
>>> It was a poor choice to use 000b as stop, but it doesn't really
>>> matter. The mlx5 driver should just pre-init this readable to running.
>> I guess we can do it for this reason. There is no functional problem nor
>> compatibility issue here as was mentioned.
>>
>> But still we need the kernel to track transitions. We don't want to
>> allow moving from RESUMING to SAVING state for example. How this
>> transition can be allowed ?
>>
>> In this case we need to fail the request from the migration SW...
> _RESUMING to _SAVING seems like a good way to test round trip migration
> without running the device to modify the state.  Potentially it's a
> means to update a saved device migration data stream to a newer format
> using an intermediate driver version.

What do you mean by "without running the device to modify the state"?

Did you describe a case where you migrate from source to dst and then 
back to source with a new migration data format?

>
> If a driver is written such that it simply sees clearing the _RESUME
> bit as an indicator to de-serialize the data stream to the device, and
> setting the _SAVING flag as an indicator to re-serialize that data
> stream from the device, then this is just a means to make use of
> existing data paths.
>
> The uAPI specifies a means for drivers to reject a state change, but
> that risks failing to support a transition which might find mainstream
> use cases.  I don't think common code should be responsible for
> filtering out viable transitions.  Thanks,
>
> Alex
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-29 23:21                         ` Jason Gunthorpe
@ 2021-09-30  9:34                           ` Max Gurtovoy
  2021-09-30 14:47                             ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-30  9:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/30/2021 2:21 AM, Jason Gunthorpe wrote:
> On Thu, Sep 30, 2021 at 12:48:55AM +0300, Max Gurtovoy wrote:
>> On 9/29/2021 7:14 PM, Jason Gunthorpe wrote:
>>> On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote:
>>>
>>>>> So you have a device that's actively modifying its internal state,
>>>>> performing I/O, including DMA (thereby dirtying VM memory), all while
>>>>> in the _STOP state?  And you don't see this as a problem?
>>>> I don't see how is it different from vfio-pci situation.
>>> vfio-pci provides no way to observe the migration state. It isn't
>>> "000b"
>> Alex said that there is a problem of compatibility.
> Yes, when a vfio_device first opens it must be running - ie able to do
> DMA and otherwise operational.

How can a non-resumed device do DMA?

Also, bus master is not set.

>
> When we add the migration extension this cannot change, so after
> open_device() the device should be operational.

if it's waiting for an incoming migration blob, it is not running.

>
> The reported state in the migration region should accurately reflect
> what the device is currently doing. If the device is operational then
> it must report running, not stopped.

STOP in the migration sense.

>
> Thus a driver cannot just zero-initialize the migration "registers",
> they have to be accurate.
>
>>>> Maybe we need to rename STOP state. We can call it READY or LIVE or
>>>> NON_MIGRATION_STATE.
>>> It was a poor choice to use 000b as stop, but it doesn't really
>>> matter. The mlx5 driver should just pre-init this readable to running.
>> I guess we can do it for this reason. There is no functional problem nor
>> compatibility issue here as was mentioned.
>>
>> But still we need the kernel to track transitions. We don't want to allow
>> moving from RESUMING to SAVING state for example. How this transition can be
>> allowed ?
> It seems semantically fine to me, as per Alex's note what will happen
> is defined:
>
> driver will see RESUMING toggle off so it will trigger a
> de-serialization

You mean stop serialization?

>
> driver will see SAVING toggled on so it will serialize the new state
> (either the pre-copy state or the post-copy state depending on the
> running bit)

Let's leave the bits and how you implement the state numbering aside.

If you finish resuming you can move to a new state (that we should add) 
=> RESUMED.

Now you suggested moving from RESUMED to SAVING to get the state again 
from the dst device? And sending it back to src? Before starting the VM 
and moving to RUNNING?

Where is this coming from?

>
> Depending on the running bit the device may or may not be woken up.

Let's talk about logic here and not bits.

>
> If de-serialization fails then the state goes to error and SAVING is
> ignored.
>
> The driver logic probably looks something like this:
>
> // Running toggles off
> if ((oldstate & RUNNING) != (newstate & RUNNING) && (oldstate & RUNNING))
>      quiesce
>      freeze
>
> // Resuming toggles off
> if ((oldstate & RESUMING) != (newstate & RESUMING) && (oldstate & RESUMING))
>     deserialize
>
> // Saving toggles on
> if ((oldstate & SAVING) != (newstate & SAVING) && (newstate & SAVING))
>     if (!(newstate & RUNNING))
>       serialize post copy
>
> // Running toggles on
> if ((oldstate & RUNNING) != (newstate & RUNNING) && (newstate & RUNNING))
>     unfreeze
>     unquiesce
>
> I'd have to check that carefully against the state chart from my last
> email though..
>
> And need to check how the "Stop Active Transactions" bit fits in there
>
> Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-30  9:25                           ` Max Gurtovoy
@ 2021-09-30 12:41                             ` Alex Williamson
  0 siblings, 0 replies; 57+ messages in thread
From: Alex Williamson @ 2021-09-30 12:41 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Thu, 30 Sep 2021 12:25:23 +0300
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 9/30/2021 1:44 AM, Alex Williamson wrote:
> > On Thu, 30 Sep 2021 00:48:55 +0300
> > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >  
> >> On 9/29/2021 7:14 PM, Jason Gunthorpe wrote:  
> >>> On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote:
> >>>     
> >>>>> So you have a device that's actively modifying its internal state,
> >>>>> performing I/O, including DMA (thereby dirtying VM memory), all while
> >>>>> in the _STOP state?  And you don't see this as a problem?  
> >>>> I don't see how is it different from vfio-pci situation.  
> >>> vfio-pci provides no way to observe the migration state. It isn't
> >>> "000b"  
> >> Alex said that there is a problem of compatibility.
> >>
> >> If migration SW is not involved, nobody will read this migration state.
> > The _STOP state has a specific meaning regardless of whether userspace
> > reads the device state value.  I think what you're suggesting is that
> > the device reports itself as _STOP'd but it's actually _RUNNING.  Is
> > that the compatibility workaround, create a self inconsistency?  
> 
> From the migration point of view, the device is stopped.

The _RESUMING and _SAVING bits control the migration activity, the
_RUNNING bit controls the ability of the device to modify its internal
state and affect external state.  The initial state of the device is
absolutely not stopped.

> > We cannot impose on userspace to move a device from _STOP to _RUNNING
> > simply because the device supports the migration region, nor should we
> > report a device state that is inconsistent with the actual device state.  
> 
> In this case we can think maybe moving to running during enabling the 
> bus master..

There are no spontaneous state transitions, device_state changes only
via user manipulation of the register.

> >>>> Maybe we need to rename STOP state. We can call it READY or LIVE or
> >>>> NON_MIGRATION_STATE.  
> >>> It was a poor choice to use 000b as stop, but it doesn't really
> >>> matter. The mlx5 driver should just pre-init this readable to running.  
> >> I guess we can do it for this reason. There is no functional problem nor
> >> compatibility issue here as was mentioned.
> >>
> >> But still we need the kernel to track transitions. We don't want to
> >> allow moving from RESUMING to SAVING state for example. How this
> >> transition can be allowed ?
> >>
> >> In this case we need to fail the request from the migration SW...  
> > _RESUMING to _SAVING seems like a good way to test round trip migration
> > without running the device to modify the state.  Potentially it's a
> > means to update a saved device migration data stream to a newer format
> > using an intermediate driver version.  
> 
> what do you mean by "without running the device to modify the state." ?

If a device is !_RUNNING it should not be advancing its internal state,
therefore state-in == state-out.
 
> did you describe a case where you migrate from source to dst and then 
> back to source with a new migration data format ?

I'm speculating that as the driver evolves, the migration data stream
generated from the device's migration region can change.  Hopefully in
compatible ways.  The above sequence of restoring and extracting state
without the complication of the device running could help to validate
compatibility.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-30  9:34                           ` Max Gurtovoy
@ 2021-09-30 14:47                             ` Jason Gunthorpe
  2021-09-30 15:32                               ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 14:47 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Thu, Sep 30, 2021 at 12:34:19PM +0300, Max Gurtovoy wrote:

> > When we add the migration extension this cannot change, so after
> > open_device() the device should be operational.
> 
> if it's waiting for incoming migration blob, it is not running.

It cannot be waiting for a migration blob after open_device, that is
not backwards compatible.

Just prior to open device the vfio pci layer will generate a FLR to
the function so we expect that post open_device has a fresh from reset
fully running device state.

> > The reported state in the migration region should accurately reflect
> > what the device is currently doing. If the device is operational then
> > it must report running, not stopped.
> 
> STOP in migration meaning.

As Alex and I have said several times STOP means the internal state is
not allowed to change.

> > driver will see RESUMING toggle off so it will trigger a
> > de-serialization
> 
> You mean stop serialization ?

No, I mean it will take all the migration data that has been uploaded
through the migration region and de-serialize it into active device
state.

> > driver will see SAVING toggled on so it will serialize the new state
> > (either the pre-copy state or the post-copy state depending on the
> > running bit)
> 
> lets leave the bits and how you implement the state numbering aside.

You've missed the point. This isn't an FSM. It is a series of three
control bits whose combinations we have assigned logical meanings.

The algorithm I gave is a control centric algorithm not a state
centric algorithm and matches the direction Alex thought this was
being designed for.
 
> If you finish resuming you can move to a new state (that we should add) =>
> RESUMED.

It is not a state machine. Once you stop pretending this is
implementing an FSM, Alex's position makes perfect sense.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-30 14:47                             ` Jason Gunthorpe
@ 2021-09-30 15:32                               ` Max Gurtovoy
  2021-09-30 16:24                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-30 15:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/30/2021 5:47 PM, Jason Gunthorpe wrote:
> On Thu, Sep 30, 2021 at 12:34:19PM +0300, Max Gurtovoy wrote:
>
>>> When we add the migration extension this cannot change, so after
>>> open_device() the device should be operational.
>> if it's waiting for incoming migration blob, it is not running.
> It cannot be waiting for a migration blob after open_device, that is
> not backwards compatible.
>
> Just prior to open device the vfio pci layer will generate a FLR to
> the function so we expect that post open_device has a fresh from reset
> fully running device state.

Does running also mean that the device doesn't have a clue about its 
internal state? Or does running mean unfrozen and unquiesced?

>
>>> The reported state in the migration region should accurately reflect
>>> what the device is currently doing. If the device is operational then
>>> it must report running, not stopped.
>> STOP in migration meaning.
> As Alex and I have said several times STOP means the internal state is
> not allowed to change.
>
>>> driver will see RESUMING toggle off so it will trigger a
>>> de-serialization
>> You mean stop serialization ?
> No, I mean it will take all the migration data that has been uploaded
> through the migration region and de-serialize it into active device
> state.

you should feed the device way before that.

>
>>> driver will see SAVING toggled on so it will serialize the new state
>>> (either the pre-copy state or the post-copy state depending on the
>>> running bit)
>> lets leave the bits and how you implement the state numbering aside.
> You've missed the point. This isn't an FSM. It is a series of three
> control bits whose combinations we have assigned logical meanings.
>
> The algorithm I gave is a control centric algorithm not a state
> centric algorithm and matches the direction Alex thought this was
> being designed for.
>   
>> If you finish resuming you can move to a new state (that we should add) =>
>> RESUMED.
> It is not a state machine. Once you stop pretending this is
> implementing an FSM, Alex's position makes perfect sense.

You can look at it any way you want. Three control bits or FSM. And I can 
look at it any way I want.

The point is what bits/state you set during the resume phase:

1. You initialize with the _RUNNING bit == 001b. No problem.

2. The state stream arrives, migration SW raises the _RESUMING bit. Should 
it be 101b or 100b? For now it's 100b. But according to your statement it 
should be 101b (invalid today) since device state can change. Right?

3. Then you should indicate that all the state was serialized to the 
device (actually to all the pci devices). 100b means RESUMING and not 
RUNNING, so maybe this can say RESUMED and state can't change now?

4. All devices move to running 001b only after all devices moved to 100b.

Otherwise, devices will start changing each other's internal states.

-Max.

>
> Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-30 15:32                               ` Max Gurtovoy
@ 2021-09-30 16:24                                 ` Jason Gunthorpe
  2021-09-30 16:51                                   ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 16:24 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Thu, Sep 30, 2021 at 06:32:07PM +0300, Max Gurtovoy wrote:
> > Just prior to open device the vfio pci layer will generate a FLR to
> > the function so we expect that post open_device has a fresh from reset
> > fully running device state.
> 
> running also mean that the device doesn't have a clue on its internal state
> ? or running means unfreezed and unquiesced ?

The device just got FLR'd and it should be in a clean state and
operating. Think the VM is booting for the first time.

> > > > driver will see RESUMING toggle off so it will trigger a
> > > > de-serialization
> > > You mean stop serialization ?
> > No, I mean it will take all the migration data that has been uploaded
> > through the migration region and de-serialize it into active device
> > state.
> 
> you should feed the device way before that.

I don't know what this means. When the resuming bit is set the
migration data buffer is wiped and userspace should begin loading
it. When the resuming bit is cleared whatever is in the migration
buffer is deserialized into the current device internal state.

It is the opposite of saving. When the saving bit is set the current
device state is serialized into the migration buffer and userspace
reads it out.

> 1. you initialize at  _RUNNING bit == 001b. No problem.
> 
> 2. state stream arrives, migration SW raise _RESUMING bit. should it be 101b
> or 100b ? for now it's 100b. But according to your statement is should be
> 101b (invalid today) since device state can change. right ?

Running means the device state changes independently, the controlled
change of the device state via deserializing the migration buffer is
different. Both running and saving commands need running to be zero.

ie commands that are marked invalid in the uapi comment are rejected
at the start - and that is probably the core helper we should provide.

> 3. Then you should indicate that all the state was serialized to the device
> (actually to all the pci devices). 100b mean RESUMING and not RUNNING so
> maybe this can say RESUMED and state can't change now ?

State is not loaded into the device until the resuming bit is
cleared. There is no RESUMED state until we incorporate Artem's
proposal for an additional bit eg 1001b - running with DMA master
disabled.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-30 16:24                                 ` Jason Gunthorpe
@ 2021-09-30 16:51                                   ` Max Gurtovoy
  2021-09-30 17:01                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-09-30 16:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck


On 9/30/2021 7:24 PM, Jason Gunthorpe wrote:
> On Thu, Sep 30, 2021 at 06:32:07PM +0300, Max Gurtovoy wrote:
>>> Just prior to open device the vfio pci layer will generate a FLR to
>>> the function so we expect that post open_device has a fresh from reset
>>> fully running device state.
>> running also mean that the device doesn't have a clue on its internal state
>> ? or running means unfreezed and unquiesced ?
> The device just got FLR'd and it should be in a clean state and
> operating. Think the VM is booting for the first time.

During the resume phase in the dst, the VM is paused and not booting. 
Migration SW is waiting to get memory and state from the SRC. The device 
will start from the exact point it was at in the src.

it's exactly "000b => Device Stopped, not saving or resuming"

>
>>>>> driver will see RESUMING toggle off so it will trigger a
>>>>> de-serialization
>>>> You mean stop serialization ?
>>> No, I mean it will take all the migration data that has been uploaded
>>> through the migration region and de-serialize it into active device
>>> state.
>> you should feed the device way before that.
> I don't know what this means, when the resuming bit is set the
> migration data buffer is wiped and userspace should begin loading
> it. When the resuming bit is cleared whatever is in the migration
> buffer is deserialized into the current device internal state.

Well, this is your design for the driver implementation. Nobody is 
preventing other drivers from starting to deserialize device state into 
the device while the RESUMING bit is on.

Or is this a must?

>
> It is the opposite of saving. When the saving bit is set the current
> device state is serialized into the migration buffer and userspace
> reads it out.

This is not new.

>> 1. you initialize at  _RUNNING bit == 001b. No problem.
>>
>> 2. state stream arrives, migration SW raise _RESUMING bit. should it be 101b
>> or 100b ? for now it's 100b. But according to your statement is should be
>> 101b (invalid today) since device state can change. right ?
> Running means the device state changes independently, the controlled
> change of the device state via deserializing the migration buffer is
> different. Both running and saving commands need running to be zero.
>
> ie commands that are marked invalid in the uapi comment are rejected
> at the start - and that is probably the core helper we should provide.
>
>> 3. Then you should indicate that all the state was serialized to the device
>> (actually to all the pci devices). 100b mean RESUMING and not RUNNING so
>> maybe this can say RESUMED and state can't change now ?
> State is not loaded into the device until the resuming bit is
> cleared. There is no RESUMED state until we incorporate Artem's
> proposal for an additional bit eg 1001b - running with DMA master
> disabled.

So if we moved from 100b to 010b somehow, one should deserialize its 
buffer to the device, and then serialize it to the migration region again?

I guess it's doable since the device is frozen and quiesced. But moving 
from 100b to 011b is not possible, right?

>
> Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
  2021-09-30 16:51                                   ` Max Gurtovoy
@ 2021-09-30 17:01                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 17:01 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alex Williamson, Leon Romanovsky, Doug Ledford, Yishai Hadas,
	Bjorn Helgaas, David S. Miller, Jakub Kicinski, Kirti Wankhede,
	kvm, linux-kernel, linux-pci, linux-rdma, netdev, Saeed Mahameed,
	Cornelia Huck

On Thu, Sep 30, 2021 at 07:51:22PM +0300, Max Gurtovoy wrote:
> 
> On 9/30/2021 7:24 PM, Jason Gunthorpe wrote:
> > On Thu, Sep 30, 2021 at 06:32:07PM +0300, Max Gurtovoy wrote:
> > > > Just prior to open device the vfio pci layer will generate a FLR to
> > > > the function so we expect that post open_device has a fresh from reset
> > > > fully running device state.
> > > running also means that the device doesn't have a clue about its internal
> > > state ? or running means unfrozen and unquiesced ?
> > The device just got FLR'd and it should be in a clean state and
> > operating. Think the VM is booting for the first time.
> 
> During the resume phase in the dst, the VM is paused and not booting.
> Migration SW is waiting to get memory and state from SRC. The device will
> start from the exact point it was at in the src.
> 
> it's exactly "000b => Device Stopped, not saving or resuming"

For this case qemu should open the VFIO device and immediately issue a
command to go to resuming. The kernel cannot know at open_device time
which case userspace is trying to do. Due to backwards compat we
assume userspace is going to boot a fresh VM.

> Well, this is your design for the driver implementation. Nobody is
> preventing other drivers from deserializing device state into the device
> while the RESUMING bit is on.

It is a logical model. Devices can stream the migration data directly
into the internal state if they like. It just creates more conditions
where they have to report an error state.

> So if we moved from 100b to 010b somehow, one should deserialize its buffer
> to the device, and then serialize it to the migration region again ?

Yes.
 
> I guess it's doable since the device is frozen and quiesced. But moving from
> 100b to 011b is not possible, right ?

Why not?

100b to 011b is no different than going indirectly 100b -> 001b -> 011b

The time spent in 001b is just negligible.

Jason


end of thread, other threads:[~2021-09-30 17:01 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1632305919.git.leonro@nvidia.com>
2021-09-22 10:38 ` [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index Leon Romanovsky
2021-09-22 21:59   ` Bjorn Helgaas
2021-09-23  6:35     ` Leon Romanovsky
2021-09-24 13:08       ` Bjorn Helgaas
2021-09-25 10:10         ` Leon Romanovsky
2021-09-25 17:41           ` Bjorn Helgaas
2021-09-26  6:36             ` Leon Romanovsky
2021-09-26 20:23               ` Bjorn Helgaas
2021-09-27 11:55                 ` Leon Romanovsky
2021-09-27 14:47                   ` Bjorn Helgaas
2021-09-22 10:38 ` [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Leon Romanovsky
2021-09-23 10:33   ` Shameerali Kolothum Thodi
2021-09-23 11:17     ` Leon Romanovsky
2021-09-23 13:55       ` Max Gurtovoy
2021-09-24  7:44         ` Shameerali Kolothum Thodi
2021-09-24  9:37           ` Kirti Wankhede
2021-09-26  9:09           ` Max Gurtovoy
2021-09-26 16:17             ` Shameerali Kolothum Thodi
2021-09-27 18:24               ` Max Gurtovoy
2021-09-27 18:29                 ` Shameerali Kolothum Thodi
2021-09-27 22:46   ` Alex Williamson
2021-09-27 23:12     ` Jason Gunthorpe
2021-09-28 19:19       ` Alex Williamson
2021-09-28 19:35         ` Jason Gunthorpe
2021-09-28 20:18           ` Alex Williamson
2021-09-29 16:16             ` Jason Gunthorpe
2021-09-29 18:06               ` Alex Williamson
2021-09-29 18:26                 ` Jason Gunthorpe
2021-09-29 10:57         ` Max Gurtovoy
2021-09-29 10:44       ` Max Gurtovoy
2021-09-29 12:35         ` Alex Williamson
2021-09-29 13:26           ` Max Gurtovoy
2021-09-29 13:50             ` Alex Williamson
2021-09-29 14:36               ` Max Gurtovoy
2021-09-29 15:17                 ` Alex Williamson
2021-09-29 15:28                   ` Max Gurtovoy
2021-09-29 16:14                     ` Jason Gunthorpe
2021-09-29 21:48                       ` Max Gurtovoy
2021-09-29 22:44                         ` Alex Williamson
2021-09-30  9:25                           ` Max Gurtovoy
2021-09-30 12:41                             ` Alex Williamson
2021-09-29 23:21                         ` Jason Gunthorpe
2021-09-30  9:34                           ` Max Gurtovoy
2021-09-30 14:47                             ` Jason Gunthorpe
2021-09-30 15:32                               ` Max Gurtovoy
2021-09-30 16:24                                 ` Jason Gunthorpe
2021-09-30 16:51                                   ` Max Gurtovoy
2021-09-30 17:01                                     ` Jason Gunthorpe
2021-09-22 10:38 ` [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional Leon Romanovsky
2021-09-23 13:57   ` Max Gurtovoy
2021-09-22 10:38 ` [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures Leon Romanovsky
2021-09-24  5:48   ` Mark Zhang
2021-09-22 10:38 ` [PATCH mlx5-next 5/7] net/mlx5: Expose APIs to get/put the mlx5 core device Leon Romanovsky
2021-09-22 10:38 ` [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device Leon Romanovsky
2021-09-28 20:22   ` Alex Williamson
2021-09-29  5:36     ` Leon Romanovsky
2021-09-22 10:38 ` [PATCH mlx5-next 7/7] mlx5_vfio_pci: Implement vfio_pci driver for mlx5 devices Leon Romanovsky
