* [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

This series adds an mlx5 live migration driver for VFs that are
migration capable and includes the v2 migration protocol definition
along with its mlx5 implementation.

The mlx5 driver uses the vfio_pci_core split to create a specific VFIO
PCI driver that matches the mlx5 virtual functions. The driver provides
the same experience as normal vfio-pci with the addition of migration
support.

In HW the migration is controlled by the PF, using its mlx5_core
driver, and the VFIO PCI VF driver coordinates with the PF to execute
the migration actions.

The bulk of the v2 migration protocol is semantically the same as v1;
however, it has been recast as an FSM for the device_state, and the
actual syscall interface uses normal ioctl(), read() and write()
instead of building a syscall interface on top of a device region.
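
For illustration, a minimal userspace sketch of driving the new
interface (hypothetical helper, assuming the uAPI proposed in patch 8
and the usual <linux/vfio.h>, <sys/ioctl.h>, <unistd.h> includes):

static int save_device_state(int device_fd, int dest_fd)
{
        /* Sketch only: 'buf' backs the variable-length feature ioctl */
        char buf[sizeof(struct vfio_device_feature) +
                 sizeof(struct vfio_device_feature_mig_state)] = {};
        struct vfio_device_feature *feature = (void *)buf;
        struct vfio_device_feature_mig_state *mig = (void *)feature->data;
        char chunk[4096];
        ssize_t n;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_SET |
                         VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
        mig->device_state = VFIO_DEVICE_STATE_STOP_COPY;

        /* Request the transition; on success mig->data_fd carries the stream */
        if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
                return -1;

        /* Normal stream FD semantics: read() the saved state until EOF */
        while ((n = read(mig->data_fd, chunk, sizeof(chunk))) > 0)
                if (write(dest_fd, chunk, n) != n)
                        break;
        close(mig->data_fd);
        return n == 0 ? 0 : -1;
}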

Several bits of infrastructure work are included here:
 - pci_iov_vf_id() to help drivers like mlx5 figure out the VF index from
   a BDF
 - pci_iov_get_pf_drvdata() to clarify the tricky locking protocol when a
   VF reaches into its PF's driver
 - mlx5_core uses the normal SRIOV lifecycle and disables SRIOV before
   driver remove, to be compatible with pci_iov_get_pf_drvdata()
 - Lifting VFIO_DEVICE_FEATURE into core VFIO code

This series comes after a lot of discussion. Some major points:
- v1 ABI compatible migration defined using the same FSM approach:
   https://lore.kernel.org/all/0-v1-a4f7cab64938+3f-vfio_mig_states_jgg@nvidia.com/
- Attempts to clarify how the v1 API works:
   Alex's:
     https://lore.kernel.org/kvm/163909282574.728533.7460416142511440919.stgit@omen/
   Jason's:
     https://lore.kernel.org/all/0-v3-184b374ad0a8+24c-vfio_mig_doc_jgg@nvidia.com/
- Etherpad exploring the scope and questions of general VFIO migration:
     https://lore.kernel.org/kvm/87mtm2loml.fsf@redhat.com/

NOTE: As this series touches mlx5_core parts, it needs to be sent as a
pull request to VFIO to avoid conflicts.

Matching qemu changes can be previewed here:
 https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2

Changes from V6: https://lore.kernel.org/netdev/20220130160826.32449-1-yishaih@nvidia.com/
vfio:
- Move to use the FEATURE ioctl for setting/getting the device state.
- Use state_flags_table as part of vfio_mig_get_next_state() and use
  WARN_ON as Alex suggested.
- Leave the V1 definitions in the uAPI header and drop only their
  documentation until V2 is part of Linus's tree.
- Fix errno usage in a few places.
- Improve and adapt the uAPI documentation to match the latest code.
- Put the VFIO_DEVICE_FEATURE_PCI_VF_TOKEN functionality into a separate
  function.
- Fix some rebase notes.
vfio/mlx5:
- Adapt to use the vfio core changes.
- Fix some bad flows upon load state.

Changes from V5: https://lore.kernel.org/kvm/20211027095658.144468-1-yishaih@nvidia.com/
vfio:
- Migration protocol v2:
  + enum for device state, not bitmap
  + ioctl to manipulate device_state, not a region
  + Only STOP_COPY is mandatory, P2P and PRE_COPY are optional, discovered
    via VFIO_DEVICE_FEATURE
  + Migration data transfer is done via dedicated FD
- VFIO core code to implement the migration related ioctls and help
  drivers implement it correctly
- VFIO_DEVICE_FEATURE refactor
- Delete the v1 migration protocol and drop the patches fixing it
- Drop "vfio/pci_core: Make the region->release() function optional"
vfio/mlx5:
- Switch to use migration v2 protocol, with core helpers
- Eliminate the region implementation

Changes from V4: https://lore.kernel.org/kvm/20211026090605.91646-1-yishaih@nvidia.com/
vfio:
- Add some Reviewed-by.
- Rename to vfio_pci_core_aer_err_detected() as Alex asked.
vfio/mlx5:
- Enter the error state only if unquiesce also fails.
- Fix some typos.
- Use the multi-line comment style as in drivers/vfio.

Changes from V3: https://lore.kernel.org/kvm/20211024083019.232813-1-yishaih@nvidia.com/
vfio/mlx5:
- Align with the latest mlx5 specification to create the MKEY with full
  read/write permissions.
- Fix unlock ordering in mlx5vf_state_mutex_unlock() to prevent a
  race.

Changes from V2: https://lore.kernel.org/kvm/20211019105838.227569-1-yishaih@nvidia.com/
vfio:
- Add and use the new macro VFIO_DEVICE_STATE_SET_ERROR as Alex asked.
vfio/mlx5:
- Improve/fix state checking as was asked by Alex & Jason.
- Make things deterministic upon 'reset_done', following the algorithm
  suggested by Jason.
- Align with the latest mlx5 specification when calling the SAVE command.
- Fix some typos.
vdpa/mlx5:
- Drop the patch from the series based on the discussion in the mailing
  list.

Changes from V1: https://lore.kernel.org/kvm/20211013094707.163054-1-yishaih@nvidia.com/
PCI/IOV:
- Name the actual interface in the subject, as asked by Bjorn, and add
  his Acked-by.
- Move to check explicitly for !dev->is_virtfn as was asked by Alex.
vfio:
- Add a separate patch fixing the non-compiling
  VFIO_DEVICE_STATE_SET_ERROR macro.
- Expose vfio_pci_aer_err_detected() to be set by drivers in their own
  PCI error handlers.
- Add a macro for VFIO_DEVICE_STATE_ERROR in the uapi header file as was
  suggested by Alex.
vfio/mlx5:
- Use XOR as part of checking the 'state' change command, as suggested
  by Alex.
- Set the state to VFIO_DEVICE_STATE_ERROR when an error occurs instead
  of VFIO_DEVICE_STATE_INVALID.
- Improve state checking as was suggested by Jason.
- Use its own PCI reset_done error handler as was suggested by Jason and
  fix the locking scheme around the state mutex to work properly.

Changes from V0: https://lore.kernel.org/kvm/cover.1632305919.git.leonro@nvidia.com/
PCI/IOV:
- Add an API (i.e. pci_iov_get_pf_drvdata()) that allows SR-IOV VF
  drivers to reach the drvdata of a PF.
mlx5_core:
- Add an extra patch to disable SRIOV before PF removal.
- Adapt to use the above PCI/IOV API as part of mlx5_vf_get_core_dev().
- Reuse the exported PCI/IOV virtfn index function call
  (i.e. pci_iov_vf_id()).
vfio:
- Add support in pci_core to let a driver be notified upon 'reset_done'
  so it can set its internal state accordingly.
- Add some helper stuff for 'invalid' state handling.
mlx5_vfio_pci:
- Move to the 'command mode' instead of the 'state machine' scheme, as
  discussed in the mailing list.
- Handle the RESET scenario when called by vfio_pci_core and set its
  internal state accordingly.
- Set initial state as RUNNING.
- Put the driver files as sub-folder under drivers/vfio/pci named mlx5
  and update MAINTAINER file as was asked.
vdpa_mlx5:
- Add a new patch to use mlx5_vf_get_core_dev() to get the PF device.

Jason Gunthorpe (7):
  PCI/IOV: Add pci_iov_vf_id() to get VF index
  PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata
    of a PF
  vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
  vfio: Define device migration protocol v2
  vfio: Extend the device migration protocol with RUNNING_P2P
  vfio: Remove migration protocol v1 documentation
  vfio: Extend the device migration protocol with PRE_COPY

Leon Romanovsky (1):
  net/mlx5: Reuse exported virtfn index function call

Yishai Hadas (7):
  net/mlx5: Disable SRIOV before PF removal
  net/mlx5: Expose APIs to get/put the mlx5 core device
  net/mlx5: Introduce migration bits and structures
  vfio/mlx5: Expose migration commands over mlx5 device
  vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  vfio/pci: Expose vfio_pci_core_aer_err_detected()
  vfio/mlx5: Use its own PCI reset_done error handler

 MAINTAINERS                                   |   6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  45 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   1 +
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  17 +-
 drivers/pci/iov.c                             |  43 ++
 drivers/vfio/pci/Kconfig                      |   3 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/mlx5/Kconfig                 |  10 +
 drivers/vfio/pci/mlx5/Makefile                |   4 +
 drivers/vfio/pci/mlx5/cmd.c                   | 259 +++++++
 drivers/vfio/pci/mlx5/cmd.h                   |  36 +
 drivers/vfio/pci/mlx5/main.c                  | 676 ++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c                   |   1 +
 drivers/vfio/pci/vfio_pci_core.c              | 101 ++-
 drivers/vfio/vfio.c                           | 358 +++++++++-
 include/linux/mlx5/driver.h                   |   3 +
 include/linux/mlx5/mlx5_ifc.h                 | 147 +++-
 include/linux/pci.h                           |  15 +-
 include/linux/vfio.h                          |  50 ++
 include/linux/vfio_pci_core.h                 |   4 +
 include/uapi/linux/vfio.h                     | 504 +++++++------
 21 files changed, 1994 insertions(+), 291 deletions(-)
 create mode 100644 drivers/vfio/pci/mlx5/Kconfig
 create mode 100644 drivers/vfio/pci/mlx5/Makefile
 create mode 100644 drivers/vfio/pci/mlx5/cmd.c
 create mode 100644 drivers/vfio/pci/mlx5/cmd.h
 create mode 100644 drivers/vfio/pci/mlx5/main.c

-- 
2.18.1


* [PATCH V7 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Jason Gunthorpe <jgg@nvidia.com>

The PCI core uses the VF index internally, often called the vf_id,
during the setup of the VF, e.g. pci_iov_add_virtfn().

This index is needed for device drivers that implement live migration
for their internal operations that configure/control their VFs.

Specifically, the mlx5_vfio_pci driver introduced in later patches of
this series needs this index, rather than the bus/device/function which
is exposed today.

Add pci_iov_vf_id() which computes the vf_id by reversing the math that
was used to create the bus/device/function.
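
For example, a hedged sketch of a VF driver resolving its own index at
probe() time (driver names hypothetical):

static int my_vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int vf_id = pci_iov_vf_id(pdev);

        if (vf_id < 0)          /* -EINVAL when pdev is not a VF */
                return vf_id;

        /* vf_id now matches the index the PF used in pci_iov_add_virtfn() */
        return 0;
}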

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/pci/iov.c   | 14 ++++++++++++++
 include/linux/pci.h |  8 +++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0267977c9f17..2e9f3d70803a 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 }
 EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
 
+int pci_iov_vf_id(struct pci_dev *dev)
+{
+	struct pci_dev *pf;
+
+	if (!dev->is_virtfn)
+		return -EINVAL;
+
+	pf = pci_physfn(dev);
+	return (((dev->bus->number << 8) + dev->devfn) -
+		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
+	       pf->sriov->stride;
+}
+EXPORT_SYMBOL_GPL(pci_iov_vf_id);
+
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
  * change when NumVFs changes.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8253a5413d7c..3d4ff7b35ad1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2166,7 +2166,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 #ifdef CONFIG_PCI_IOV
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
-
+int pci_iov_vf_id(struct pci_dev *dev);
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 
@@ -2194,6 +2194,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 {
 	return -ENOSYS;
 }
+
+static inline int pci_iov_vf_id(struct pci_dev *dev)
+{
+	return -ENOSYS;
+}
+
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 
-- 
2.18.1


* [PATCH V7 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Leon Romanovsky <leonro@nvidia.com>

Instead of open-coding the iteration that compares the virtfn internal
index, use the newly introduced pci_iov_vf_id() call.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index e8185b69ac6c..24c4b4f05214 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -205,19 +205,8 @@ int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
 			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
 
 	sriov = &dev->priv.sriov;
-
-	/* Reversed translation of PCI VF function number to the internal
-	 * function_id, which exists in the name of virtfn symlink.
-	 */
-	for (id = 0; id < pci_num_vf(pf); id++) {
-		if (!sriov->vfs_ctx[id].enabled)
-			continue;
-
-		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
-			break;
-	}
-
-	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+	id = pci_iov_vf_id(vf);
+	if (id < 0 || !sriov->vfs_ctx[id].enabled)
 		return -EINVAL;
 
 	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
-- 
2.18.1


* [PATCH V7 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

Virtual functions depend on the physical function for device access
(for example firmware host PAGE management), so make sure to disable
SRIOV as part of PF removal.

This also prevents the warning below, emitted if the PF is removed
before SRIOV is disabled:
"driver left SR-IOV enabled after remove"

The next patch in this series relies on this so that the VF can safely
access the PF's 'driver data'.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c      | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c     | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 2c774f367199..5b8958186157 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1620,6 +1620,7 @@ static void remove_one(struct pci_dev *pdev)
 	struct devlink *devlink = priv_to_devlink(dev);
 
 	devlink_unregister(devlink);
+	mlx5_sriov_disable(pdev);
 	mlx5_crdump_disable(dev);
 	mlx5_drain_health_wq(dev);
 	mlx5_uninit_one(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 6f8baa0f2a73..37b2805b3bf3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -164,6 +164,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+void mlx5_sriov_disable(struct pci_dev *pdev);
 int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 24c4b4f05214..887ee0f729d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -161,7 +161,7 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 	return err;
 }
 
-static void mlx5_sriov_disable(struct pci_dev *pdev)
+void mlx5_sriov_disable(struct pci_dev *pdev)
 {
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
 	int num_vfs = pci_num_vf(dev->pdev);
-- 
2.18.1


* [PATCH V7 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Jason Gunthorpe <jgg@nvidia.com>

There are some cases where an SR-IOV VF driver will need to reach into
and interact with the PF driver. This requires accessing the drvdata of
the PF.

Provide a function pci_iov_get_pf_drvdata() to return this PF drvdata in a
safe way. Normally accessing a drvdata of a foreign struct device would be
done using the device_lock() to protect against device driver
probe()/remove() races.

However, due to the design of pci_enable_sriov() this will result in an
ABBA deadlock on the device_lock, as the PF's device_lock is held during
PF sriov_configure() while calling pci_enable_sriov(), which in turn
holds the VF's device_lock while calling VF probe(), and similarly for
remove.

This means the VF driver can never obtain the PF's device_lock.

Instead use the implicit locking created by pci_enable/disable_sriov(). A
VF driver can access its PF drvdata only while its own driver is attached,
and the PF driver can control access to its own drvdata based on when it
calls pci_enable/disable_sriov().

To use this API the PF driver will setup the PF drvdata in the probe()
function. pci_enable_sriov() is only called from sriov_configure() which
cannot happen until probe() completes, ensuring no VF races with drvdata
setup.

For removal, the PF driver must call pci_disable_sriov() in its remove
function before destroying any of the drvdata. This ensures that all VF
drivers are unbound before returning, fencing concurrent access to the
drvdata.

Introducing a new function for this access makes the special locking
scheme clear and documents the requirements on the PF/VF drivers using
it.
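
A sketch of the required pairing, with all driver names hypothetical
and error handling trimmed:

static int pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        struct pf_priv *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

        if (!priv)
                return -ENOMEM;
        /* drvdata is set before sriov_configure() can probe any VF */
        pci_set_drvdata(pdev, priv);
        return 0;
}

static void pf_remove(struct pci_dev *pdev)
{
        /* Fences all VF drivers; none is left in probe()/remove() after */
        pci_disable_sriov(pdev);
        kfree(pci_get_drvdata(pdev));
}

static int vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        struct pf_priv *priv = pci_iov_get_pf_drvdata(pdev, &pf_driver);

        if (IS_ERR(priv))
                return PTR_ERR(priv);
        /* priv stays valid until this driver's own remove() completes */
        return 0;
}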

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/pci/iov.c   | 29 +++++++++++++++++++++++++++++
 include/linux/pci.h |  7 +++++++
 2 files changed, 36 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 2e9f3d70803a..28ec952e1221 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -47,6 +47,35 @@ int pci_iov_vf_id(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_iov_vf_id);
 
+/**
+ * pci_iov_get_pf_drvdata - Return the drvdata of a PF
+ * @dev: VF pci_dev
+ * @pf_driver: Device driver required to own the PF
+ *
+ * This must be called from a context that ensures that a VF driver is attached.
+ * The value returned is invalid once the VF driver completes its remove()
+ * callback.
+ *
+ * Locking is achieved by the driver core. A VF driver cannot be probed until
+ * pci_enable_sriov() is called and pci_disable_sriov() does not return until
+ * all VF drivers have completed their remove().
+ *
+ * The PF driver must call pci_disable_sriov() before it begins to destroy the
+ * drvdata.
+ */
+void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver)
+{
+	struct pci_dev *pf_dev;
+
+	if (!dev->is_virtfn)
+		return ERR_PTR(-EINVAL);
+	pf_dev = dev->physfn;
+	if (pf_dev->driver != pf_driver)
+		return ERR_PTR(-EINVAL);
+	return pci_get_drvdata(pf_dev);
+}
+EXPORT_SYMBOL_GPL(pci_iov_get_pf_drvdata);
+
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
  * change when NumVFs changes.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3d4ff7b35ad1..60d423d8f0c4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2167,6 +2167,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
 int pci_iov_vf_id(struct pci_dev *dev);
+void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver);
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 
@@ -2200,6 +2201,12 @@ static inline int pci_iov_vf_id(struct pci_dev *dev)
 	return -ENOSYS;
 }
 
+static inline void *pci_iov_get_pf_drvdata(struct pci_dev *dev,
+					   struct pci_driver *pf_driver)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 
-- 
2.18.1


* [PATCH V7 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

Expose an API to get the mlx5 core device from a given VF PCI device if
mlx5_core is its driver.

Upon the get API we keep the intf_state_mutex locked to make sure that
the device can't go away or be unloaded until the caller completes its
job with the device; this is expected to be a short period of time for
any flow in which the lock is taken.

Upon the put API we unlock the intf_state_mutex.

The use case for those APIs is the migration flow of a VF over VFIO PCI.
In that case the VF doesn't ride on mlx5_core, because the migration
flow drives *two* different PCI devices: the PF owned by mlx5_core and
the VF owned by the vfio driver.

The mlx5_core of the PF is accessed only during the narrow window of the
VF's ioctl that requires its services.

This allows the PF driver to be more independent of the VF driver, so
long as it doesn't reset the FW.
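
Usage from the VF side then follows a strict get/put bracket; a minimal
sketch (the function name and error code are assumptions):

static int my_vf_operation(struct pci_dev *vf_pdev)
{
        struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(vf_pdev);

        if (!mdev)
                return -ENOTCONN; /* PF driver gone or interface not UP */

        /* ... issue PF-side commands on behalf of the VF ... */

        mlx5_vf_put_core_dev(mdev); /* releases intf_state_mutex */
        return 0;
}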

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 44 +++++++++++++++++++
 include/linux/mlx5/driver.h                   |  3 ++
 2 files changed, 47 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 5b8958186157..e9aeba4267ff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1881,6 +1881,50 @@ static struct pci_driver mlx5_core_driver = {
 	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };
 
+/**
+ * mlx5_vf_get_core_dev - Get the mlx5 core device from a given VF PCI device if
+ *                     mlx5_core is its driver.
+ * @pdev: The associated PCI device.
+ *
+ * Upon return the interface state lock stays held to let the caller use it
+ * safely. The caller must use the returned mlx5 device for a narrow window
+ * and put it back with mlx5_vf_put_core_dev() immediately once usage is over.
+ *
+ * Return: Pointer to the associated mlx5_core_dev or NULL.
+ */
+struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev)
+			__acquires(&mdev->intf_state_mutex)
+{
+	struct mlx5_core_dev *mdev;
+
+	mdev = pci_iov_get_pf_drvdata(pdev, &mlx5_core_driver);
+	if (IS_ERR(mdev))
+		return NULL;
+
+	mutex_lock(&mdev->intf_state_mutex);
+	if (!test_bit(MLX5_INTERFACE_STATE_UP, &mdev->intf_state)) {
+		mutex_unlock(&mdev->intf_state_mutex);
+		return NULL;
+	}
+
+	return mdev;
+}
+EXPORT_SYMBOL(mlx5_vf_get_core_dev);
+
+/**
+ * mlx5_vf_put_core_dev - Put the mlx5 core device back.
+ * @mdev: The mlx5 core device.
+ *
+ * Upon return the interface state lock is unlocked and caller should not
+ * access the mdev any more.
+ */
+void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev)
+			__releases(&mdev->intf_state_mutex)
+{
+	mutex_unlock(&mdev->intf_state_mutex);
+}
+EXPORT_SYMBOL(mlx5_vf_put_core_dev);
+
 static void mlx5_core_verify_params(void)
 {
 	if (prof_sel >= ARRAY_SIZE(profile)) {
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 78655d8d13a7..319322a8ff94 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1143,6 +1143,9 @@ int mlx5_dm_sw_icm_alloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 			   u64 length, u16 uid, phys_addr_t addr, u32 obj_id);
 
+struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev);
+void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev);
+
 #ifdef CONFIG_MLX5_CORE_IPOIB
 struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
 					  struct ib_device *ibdev,
-- 
2.18.1


* [PATCH V7 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

Introduce the migration-related IFC bits and structures to enable the
migration commands.
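
The new commands follow the usual IFC access pattern; for example, a
hedged sketch of issuing SUSPEND_VHCA (the wrapper name is hypothetical):

static int my_suspend_vhca(struct mlx5_core_dev *mdev, u16 vhca_id)
{
        u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {};
        u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {};

        MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA);
        MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id);
        MLX5_SET(suspend_vhca_in, in, op_mod,
                 MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);

        return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
}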

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 147 +++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 598ac3bcc901..45891a75c5ca 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -127,6 +127,11 @@ enum {
 	MLX5_CMD_OP_QUERY_SF_PARTITION            = 0x111,
 	MLX5_CMD_OP_ALLOC_SF                      = 0x113,
 	MLX5_CMD_OP_DEALLOC_SF                    = 0x114,
+	MLX5_CMD_OP_SUSPEND_VHCA                  = 0x115,
+	MLX5_CMD_OP_RESUME_VHCA                   = 0x116,
+	MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE    = 0x117,
+	MLX5_CMD_OP_SAVE_VHCA_STATE               = 0x118,
+	MLX5_CMD_OP_LOAD_VHCA_STATE               = 0x119,
 	MLX5_CMD_OP_CREATE_MKEY                   = 0x200,
 	MLX5_CMD_OP_QUERY_MKEY                    = 0x201,
 	MLX5_CMD_OP_DESTROY_MKEY                  = 0x202,
@@ -1757,7 +1762,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         reserved_at_682[0x1];
 	u8         log_max_sf[0x5];
 	u8         apu[0x1];
-	u8         reserved_at_689[0x7];
+	u8         reserved_at_689[0x4];
+	u8         migration[0x1];
+	u8         reserved_at_68e[0x2];
 	u8         log_min_sf_size[0x8];
 	u8         max_num_sf_partitions[0x8];
 
@@ -11519,4 +11526,142 @@ enum {
 	MLX5_MTT_PERM_RW	= MLX5_MTT_PERM_READ | MLX5_MTT_PERM_WRITE,
 };
 
+enum {
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER  = 0x0,
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE   = 0x1,
+};
+
+struct mlx5_ifc_suspend_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_suspend_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+enum {
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE   = 0x0,
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER  = 0x1,
+};
+
+struct mlx5_ifc_resume_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_resume_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+
+	u8         required_umem_size[0x20];
+
+	u8         reserved_at_a0[0x160];
+};
+
+struct mlx5_ifc_save_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_save_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         actual_image_size[0x20];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_load_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_load_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
 #endif /* MLX5_IFC_H */
-- 
2.18.1


* [PATCH V7 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Jason Gunthorpe <jgg@nvidia.com>

Invoke a new device op 'device_feature' to handle just the data array
portion of the command. This lifts the ioctl validation to the core code
and makes it simpler for either the core code or layered drivers to
implement their own feature values.

Provide vfio_check_feature() to consolidate checking the flags/etc against
what the driver supports.
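
For illustration, a hedged sketch of a driver's device_feature op built
on vfio_check_feature(); the feature number and payload here are
hypothetical:

static int my_device_feature(struct vfio_device *device, u32 flags,
                             void __user *arg, size_t argsz)
{
        u64 val = 0;            /* hypothetical 8-byte feature payload */
        int ret;

        switch (flags & VFIO_DEVICE_FEATURE_MASK) {
        case MY_FEATURE:        /* hypothetical feature number */
                /* Only GET is supported for this feature */
                ret = vfio_check_feature(flags, argsz,
                                         VFIO_DEVICE_FEATURE_GET,
                                         sizeof(val));
                if (ret != 1)   /* 0 on PROBE success, -errno on bad input */
                        return ret;
                if (copy_to_user(arg, &val, sizeof(val)))
                        return -EFAULT;
                return 0;
        default:
                return -ENOTTY;
        }
}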

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/vfio_pci.c      |  1 +
 drivers/vfio/pci/vfio_pci_core.c | 94 +++++++++++++-------------------
 drivers/vfio/vfio.c              | 46 ++++++++++++++--
 include/linux/vfio.h             | 32 +++++++++++
 include/linux/vfio_pci_core.h    |  2 +
 5 files changed, 114 insertions(+), 61 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index a5ce92beb655..2b047469e02f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -130,6 +130,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.open_device	= vfio_pci_open_device,
 	.close_device	= vfio_pci_core_close_device,
 	.ioctl		= vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
 	.read		= vfio_pci_core_read,
 	.write		= vfio_pci_core_write,
 	.mmap		= vfio_pci_core_mmap,
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f948e6cd2993..106e1970d653 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1114,70 +1114,50 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 
 		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
 					  ioeventfd.data, count, ioeventfd.fd);
-	} else if (cmd == VFIO_DEVICE_FEATURE) {
-		struct vfio_device_feature feature;
-		uuid_t uuid;
-
-		minsz = offsetofend(struct vfio_device_feature, flags);
-
-		if (copy_from_user(&feature, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (feature.argsz < minsz)
-			return -EINVAL;
-
-		/* Check unknown flags */
-		if (feature.flags & ~(VFIO_DEVICE_FEATURE_MASK |
-				      VFIO_DEVICE_FEATURE_SET |
-				      VFIO_DEVICE_FEATURE_GET |
-				      VFIO_DEVICE_FEATURE_PROBE))
-			return -EINVAL;
-
-		/* GET & SET are mutually exclusive except with PROBE */
-		if (!(feature.flags & VFIO_DEVICE_FEATURE_PROBE) &&
-		    (feature.flags & VFIO_DEVICE_FEATURE_SET) &&
-		    (feature.flags & VFIO_DEVICE_FEATURE_GET))
-			return -EINVAL;
-
-		switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
-		case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-			if (!vdev->vf_token)
-				return -ENOTTY;
-
-			/*
-			 * We do not support GET of the VF Token UUID as this
-			 * could expose the token of the previous device user.
-			 */
-			if (feature.flags & VFIO_DEVICE_FEATURE_GET)
-				return -EINVAL;
-
-			if (feature.flags & VFIO_DEVICE_FEATURE_PROBE)
-				return 0;
+	}
+	return -ENOTTY;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-			/* Don't SET unless told to do so */
-			if (!(feature.flags & VFIO_DEVICE_FEATURE_SET))
-				return -EINVAL;
+static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
+				       void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	uuid_t uuid;
+	int ret;
 
-			if (feature.argsz < minsz + sizeof(uuid))
-				return -EINVAL;
+	if (!vdev->vf_token)
+		return -ENOTTY;
+	/*
+	 * We do not support GET of the VF Token UUID as this could
+	 * expose the token of the previous device user.
+	 */
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+				 sizeof(uuid));
+	if (ret != 1)
+		return ret;
 
-			if (copy_from_user(&uuid, (void __user *)(arg + minsz),
-					   sizeof(uuid)))
-				return -EFAULT;
+	if (copy_from_user(&uuid, arg, sizeof(uuid)))
+		return -EFAULT;
 
-			mutex_lock(&vdev->vf_token->lock);
-			uuid_copy(&vdev->vf_token->uuid, &uuid);
-			mutex_unlock(&vdev->vf_token->lock);
+	mutex_lock(&vdev->vf_token->lock);
+	uuid_copy(&vdev->vf_token->uuid, &uuid);
+	mutex_unlock(&vdev->vf_token->lock);
+	return 0;
+}
 
-			return 0;
-		default:
-			return -ENOTTY;
-		}
+int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
+				void __user *arg, size_t argsz)
+{
+	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
+	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
+		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	default:
+		return -ENOTTY;
 	}
-
-	return -ENOTTY;
 }
-EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
+EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl_feature);
 
 static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 735d1d344af9..71763e2ac561 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1557,15 +1557,53 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+static int vfio_ioctl_device_feature(struct vfio_device *device,
+				     struct vfio_device_feature __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_device_feature, flags);
+	struct vfio_device_feature feature;
+
+	if (copy_from_user(&feature, arg, minsz))
+		return -EFAULT;
+
+	if (feature.argsz < minsz)
+		return -EINVAL;
+
+	/* Check unknown flags */
+	if (feature.flags &
+	    ~(VFIO_DEVICE_FEATURE_MASK | VFIO_DEVICE_FEATURE_SET |
+	      VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_PROBE))
+		return -EINVAL;
+
+	/* GET & SET are mutually exclusive except with PROBE */
+	if (!(feature.flags & VFIO_DEVICE_FEATURE_PROBE) &&
+	    (feature.flags & VFIO_DEVICE_FEATURE_SET) &&
+	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
+		return -EINVAL;
+
+	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	default:
+		if (unlikely(!device->ops->device_feature))
+			return -EINVAL;
+		return device->ops->device_feature(device, feature.flags,
+						   arg->data,
+						   feature.argsz - minsz);
+	}
+}
+
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
 	struct vfio_device *device = filep->private_data;
 
-	if (unlikely(!device->ops->ioctl))
-		return -EINVAL;
-
-	return device->ops->ioctl(device, cmd, arg);
+	switch (cmd) {
+	case VFIO_DEVICE_FEATURE:
+		return vfio_ioctl_device_feature(device, (void __user *)arg);
+	default:
+		if (unlikely(!device->ops->ioctl))
+			return -EINVAL;
+		return device->ops->ioctl(device, cmd, arg);
+	}
 }
 
 static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 76191d7abed1..ca69516f869d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -55,6 +55,7 @@ struct vfio_device {
  * @match: Optional device name match callback (return: 0 for no-match, >0 for
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
+ * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
  */
 struct vfio_device_ops {
 	char	*name;
@@ -69,8 +70,39 @@ struct vfio_device_ops {
 	int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
 	void	(*request)(struct vfio_device *vdev, unsigned int count);
 	int	(*match)(struct vfio_device *vdev, char *buf);
+	int	(*device_feature)(struct vfio_device *device, u32 flags,
+				  void __user *arg, size_t argsz);
 };
 
+/**
+ * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
+ * @flags: Arg from the device_feature op
+ * @argsz: Arg from the device_feature op
+ * @supported_ops: Combination of VFIO_DEVICE_FEATURE_GET and SET the driver
+ *                 supports
+ * @minsz: Minimum data size the driver accepts
+ *
+ * For use in a driver's device_feature op. Checks that the inputs to the
+ * VFIO_DEVICE_FEATURE ioctl are correct for the driver's feature. Returns 1 if
+ * the driver should execute the get or set, otherwise the relevant
+ * value should be returned.
+ */
+static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops,
+				    size_t minsz)
+{
+	if ((flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)) &
+	    ~supported_ops)
+		return -EINVAL;
+	if (flags & VFIO_DEVICE_FEATURE_PROBE)
+		return 0;
+	/* Without PROBE one of GET or SET must be requested */
+	if (!(flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)))
+		return -EINVAL;
+	if (argsz < minsz)
+		return -EINVAL;
+	return 1;
+}
+
 void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
 			 const struct vfio_device_ops *ops);
 void vfio_uninit_group_dev(struct vfio_device *device);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index ef9a44b6cf5d..beba0b2ed87d 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -220,6 +220,8 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
 extern const struct pci_error_handlers vfio_pci_core_err_handlers;
 long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		unsigned long arg);
+int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
+				void __user *arg, size_t argsz);
 ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
-- 
2.18.1


* [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
From: Yishai Hadas @ 2022-02-07 17:22 UTC
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Jason Gunthorpe <jgg@nvidia.com>

Replace the existing region based migration protocol with an ioctl based
protocol. The two protocols have the same general semantic behaviors, but
the way the data is transported is changed.

This is the STOP_COPY portion of the new protocol; it defines the 5
states for basic stop and copy migration and the protocol to move the
migration data in/out of the kernel.

Compared to the clarification of the v1 protocol Alex proposed:

https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen

This has a few deliberate functional differences:

 - ERROR arcs allow the device function to remain unchanged.

 - The protocol is not required to return to the original state on
   transition failure. Instead userspace can execute an unwind back to
   the original state, reset, or do something else without needing kernel
   support. This simplifies the kernel design and, should userspace choose
   a policy like always-reset, avoids doing useless work in the kernel
   on error handling paths.

 - PRE_COPY is made optional; userspace must discover it before using it.
   This reflects the fact that the majority of drivers we are aware of
   right now will not implement PRE_COPY.

 - Segmentation is not part of the data stream protocol; the receiver
   does not have to reproduce the framing boundaries.

The hybrid FSM for the device_state is described as a Mealy machine by
documenting each of the arcs the driver is required to implement. The
remaining set of old/new device_state transitions is defined as
'combination transitions', which naturally take multiple FSM arcs along
the shortest path within the FSM's digraph; this allows a complete
matrix of transitions.

A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is
defined to replace writing to the device_state field in the region. This
allows returning a brand new FD whenever the requested transition opens
a data transfer session.

The VFIO core code implements the new feature and provides a helper
function to the driver. Using the helper the driver only has to
implement 6 of the FSM arcs and the other combination transitions are
elaborated consistently from those arcs.

A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
report the capability for migration and indicate which set of states and
arcs are supported by the device. The FSM provides a lot of flexibility to
make backwards compatible extensions but the VFIO_DEVICE_FEATURE also
allows for future breaking extensions for scenarios that cannot support
even the basic STOP_COPY requirements.

The VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE with the GET option (i.e.
VFIO_DEVICE_FEATURE_GET) can be used to read the current migration state
of the VFIO device.

Data transfer sessions are now carried over a file descriptor, instead of
the region. The FD functions for the lifetime of the data transfer
session. read() and write() transfer the data with normal Linux stream FD
semantics. This design allows future expansion to support poll(),
io_uring, and other performance optimizations.

The complicated mmap mode for data transfer is discarded as current qemu
doesn't take meaningful advantage of it, and the new qemu implementation
avoids substantially all the performance penalty of using a read() on the
region.
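
For illustration, a hedged sketch of a driver's migration_set_state op
leaning on the helper to step one arc at a time (driver-private names
hypothetical):

static struct file *
my_migration_set_state(struct vfio_device *vdev,
                       enum vfio_device_mig_state new_state)
{
        struct my_device *mydev = container_of(vdev, struct my_device, vdev);
        enum vfio_device_mig_state next_state;
        struct file *res = NULL;
        int ret;

        while (mydev->mig_state != new_state) {
                ret = vfio_mig_get_next_state(vdev, mydev->mig_state,
                                              new_state, &next_state);
                if (ret)
                        return ERR_PTR(ret);
                /* Implements only the six arcs required by the core table */
                res = my_step_device_state(mydev, next_state);
                if (IS_ERR(res))
                        return res;
                mydev->mig_state = next_state;
        }
        /* Non-NULL only when the final arc opened a data transfer session */
        return res;
}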

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       | 198 ++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h      |  17 ++++
 include/uapi/linux/vfio.h | 172 ++++++++++++++++++++++++++++++---
 3 files changed, 374 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 71763e2ac561..e7ab9f2048cd 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1557,6 +1557,196 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+/*
+ * vfio_mig_get_next_state - Compute the next step in the FSM
+ * @cur_fsm - The current state the device is in
+ * @new_fsm - The target state to reach
+ * @next_fsm - Pointer to the next step to get to new_fsm
+ *
+ * Return 0 upon success, otherwise -errno
+ * Upon success the next step in the state progression between cur_fsm and
+ * new_fsm will be set in next_fsm.
+ *
+ * This breaks down requests for combination transitions into smaller steps and
+ * returns the next step to get to new_fsm. The function may need to be called
+ * multiple times before reaching new_fsm.
+ *
+ *
+ */
+int vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm,
+			    enum vfio_device_mig_state *next_fsm)
+{
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	/*
+	 * The coding in this table requires the driver to implement 6
+	 * FSM arcs:
+	 *         RESUMING -> STOP
+	 *         RUNNING -> STOP
+	 *         STOP -> RESUMING
+	 *         STOP -> RUNNING
+	 *         STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP
+	 *
+	 * The coding will step through multiple states for these combination
+	 * transitions:
+	 *         RESUMING -> STOP -> RUNNING
+	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> STOP -> RESUMING
+	 *         RUNNING -> STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP -> RESUMING
+	 *         STOP_COPY -> STOP -> RUNNING
+	 */
+	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
+		[VFIO_DEVICE_STATE_STOP] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_STOP_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RESUMING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_ERROR] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+	};
+
+	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table)))
+		return -EINVAL;
+
+	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
+		return -EINVAL;
+
+	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL;
+}
+EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
+
+/*
+ * Convert the driver's struct file into a FD number and return it to userspace
+ */
+static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg,
+				   struct vfio_device_feature_mig_state *mig)
+{
+	int ret;
+	int fd;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
+		goto out_fput;
+	}
+
+	mig->data_fd = fd;
+	if (copy_to_user(arg, mig, sizeof(*mig))) {
+		ret = -EFAULT;
+		goto out_put_unused;
+	}
+	fd_install(fd, filp);
+	return 0;
+
+out_put_unused:
+	put_unused_fd(fd);
+out_fput:
+	fput(filp);
+	return ret;
+}
+
+static int
+vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device,
+					   u32 flags, void __user *arg,
+					   size_t argsz)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_feature_mig_state, data_fd);
+	struct vfio_device_feature_mig_state mig;
+	struct file *filp = NULL;
+	int ret;
+
+	if (!device->ops->migration_set_state ||
+	    !device->ops->migration_get_state)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET |
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(mig));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&mig, arg, minsz))
+		return -EFAULT;
+
+	if (flags & VFIO_DEVICE_FEATURE_GET) {
+		enum vfio_device_mig_state curr_state;
+
+		ret = device->ops->migration_get_state(device, &curr_state);
+		if (ret)
+			return ret;
+		mig.device_state = curr_state;
+		goto out_copy;
+	}
+
+	/* Handle the VFIO_DEVICE_FEATURE_SET */
+	filp = device->ops->migration_set_state(device, mig.device_state);
+	if (IS_ERR(filp) || !filp)
+		goto out_copy;
+
+	return vfio_ioct_mig_return_fd(filp, arg, &mig);
+out_copy:
+	mig.data_fd = -1;
+	if (copy_to_user(arg, &mig, sizeof(mig)))
+		return -EFAULT;
+	if (IS_ERR(filp))
+		return PTR_ERR(filp);
+	return 0;
+}
+
+static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
+					       u32 flags, void __user *arg,
+					       size_t argsz)
+{
+	struct vfio_device_feature_migration mig = {
+		.flags = VFIO_MIGRATION_STOP_COPY,
+	};
+	int ret;
+
+	if (!device->ops->migration_set_state)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(mig));
+	if (ret != 1)
+		return ret;
+	if (copy_to_user(arg, &mig, sizeof(mig)))
+		return -EFAULT;
+	return 0;
+}
+
 static int vfio_ioctl_device_feature(struct vfio_device *device,
 				     struct vfio_device_feature __user *arg)
 {
@@ -1582,6 +1772,14 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 		return -EINVAL;
 
 	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	case VFIO_DEVICE_FEATURE_MIGRATION:
+		return vfio_ioctl_device_feature_migration(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
+		return vfio_ioctl_device_feature_mig_device_state(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
 	default:
 		if (unlikely(!device->ops->device_feature))
 			return -EINVAL;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ca69516f869d..3f4a1a7c2277 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -56,6 +56,13 @@ struct vfio_device {
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
  * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
+ * @migration_set_state: Optional callback to change the migration state for
+ *         devices that support migration. The returned FD is used for data
+ *         transfer according to the FSM definition. The driver is responsible
+ *         to ensure that FD is isolated whenever the migration FSM leaves a
+ *         data transfer state or before close_device() returns.
+ * @migration_get_state: Optional callback to get the migration state for
+ *         devices that support migration.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -72,6 +79,11 @@ struct vfio_device_ops {
 	int	(*match)(struct vfio_device *vdev, char *buf);
 	int	(*device_feature)(struct vfio_device *device, u32 flags,
 				  void __user *arg, size_t argsz);
+	struct file *(*migration_set_state)(
+		struct vfio_device *device,
+		enum vfio_device_mig_state new_state);
+	int (*migration_get_state)(struct vfio_device *device,
+				   enum vfio_device_mig_state *curr_state);
 };
 
 /**
@@ -114,6 +126,11 @@ extern void vfio_device_put(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 
+int vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm,
+			    enum vfio_device_mig_state *next_fsm);
+
 /*
  * External user API
  */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..89012bc01663 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -605,25 +605,25 @@ struct vfio_region_gfx_edid {
 
 struct vfio_device_migration_info {
 	__u32 device_state;         /* VFIO device state */
-#define VFIO_DEVICE_STATE_STOP      (0)
-#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
-#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
-#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
-				     VFIO_DEVICE_STATE_SAVING |  \
-				     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_V1_STOP      (0)
+#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
+				     VFIO_DEVICE_STATE_V1_SAVING |  \
+				     VFIO_DEVICE_STATE_V1_RESUMING)
 
 #define VFIO_DEVICE_STATE_VALID(state) \
-	(state & VFIO_DEVICE_STATE_RESUMING ? \
-	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+	(state & VFIO_DEVICE_STATE_V1_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_V1_RESUMING : 1)
 
 #define VFIO_DEVICE_STATE_IS_ERROR(state) \
-	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
-					      VFIO_DEVICE_STATE_RESUMING))
+	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_V1_SAVING | \
+					      VFIO_DEVICE_STATE_V1_RESUMING))
 
 #define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_V1_SAVING | \
+					     VFIO_DEVICE_STATE_V1_RESUMING)
 
 	__u32 reserved;
 	__u64 pending_bytes;
@@ -1002,6 +1002,152 @@ struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/*
+ * Indicates the device can support the migration API. See enum
+ * vfio_device_mig_state for details. If present, flags must be non-zero and
+ * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
+ * RESUMING are supported.
+ */
+struct vfio_device_feature_migration {
+	__aligned_u64 flags;
+#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+};
+#define VFIO_DEVICE_FEATURE_MIGRATION 1
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET,
+ * Execute a migration state change on the VFIO device.
+ * The new state is supplied in device_state.
+ *
+ * The kernel migration driver must fully transition the device to the new
+ * state value before the ioctl returns to the user.
+ *
+ * The kernel migration driver must not generate asynchronous device state
+ * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
+ * ioctl as described above.
+ *
+ * If this function fails then current device_state may be the original
+ * operating state or some other state along the combination transition path.
+ * The user can then decide if it should execute a VFIO_DEVICE_RESET, attempt
+ * to return to the original state, or attempt to return to some other state
+ * such as RUNNING or STOP.
+ *
+ * If the new_state starts a new data transfer session then the FD associated
+ * with that session is returned in data_fd. The user is responsible to close
+ * this FD when it is finished. The user must consider the migration data
+ * segments carried over the FD to be opaque and non-fungible. During RESUMING,
+ * the data segments must be written in the same order they came out of the
+ * saving side FD.
+ *
+ * Upon VFIO_DEVICE_FEATURE_GET,
+ * Get the current migration state of the VFIO device, data_fd will be -1.
+ */
+struct vfio_device_feature_mig_state {
+	__u32 device_state; /* From enum vfio_device_mig_state */
+	__s32 data_fd;
+};
+#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2
+
+/*
+ * The device migration Finite State Machine is described by the enum
+ * vfio_device_mig_state. Some of the FSM arcs will create a migration data
+ * transfer session by returning a FD, in this case the migration data will
+ * flow over the FD using read() and write() as discussed below.
+ *
+ * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
+ *  RUNNING - The device is running normally
+ *  STOP - The device does not change the internal or external state
+ *  STOP_COPY - The device internal state can be read out
+ *  RESUMING - The device is stopped and is loading a new internal state
+ *  ERROR - The device has failed and must be reset
+ *
+ * The FSM takes actions on the arcs between FSM states. The driver implements
+ * the following behavior for the FSM arcs:
+ *
+ * RUNNING -> STOP
+ * STOP_COPY -> STOP
+ *   While in STOP the device must stop the operation of the device. The
+ *   device must not generate interrupts, DMA, or advance its internal
+ *   state. When stopped the device and kernel migration driver must accept
+ *   and respond to interaction to support external subsystems in the STOP
+ *   state, for example PCI MSI-X and PCI config space. Failure by the user to
+ *   restrict device access while in STOP must not result in error conditions
+ *   outside the user context (ex. host system faults).
+ *
+ *   The STOP_COPY arc will terminate a data transfer session.
+ *
+ * RESUMING -> STOP
+ *   Leaving RESUMING terminates a data transfer session and indicates the
+ *   device should complete processing of the data delivered by write(). The
+ *   kernel migration driver should complete the incorporation of data written
+ *   to the data transfer FD into the device internal state and perform
+ *   final validity and consistency checking of the new device state. If the
+ *   user provided data is found to be incomplete, inconsistent, or otherwise
+ *   invalid, the migration driver must fail the SET_STATE ioctl and
+ *   optionally go to the ERROR state as described below.
+ *
+ *   While in STOP the device has the same behavior as other STOP states
+ *   described above.
+ *
+ *   To abort a RESUMING session the device must be reset.
+ *
+ * STOP -> RUNNING
+ *   While in RUNNING the device is fully operational, the device may generate
+ *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
+ *   and the device may advance its internal state.
+ *
+ * STOP -> STOP_COPY
+ *   This arc begins the process of saving the device state and will return a
+ *   new data_fd.
+ *
+ *   While in the STOP_COPY state the device has the same behavior as STOP
+ *   with the addition that the data transfer session continues to stream the
+ *   migration state. End of stream on the FD indicates the entire device
+ *   state has been transferred.
+ *
+ *   The user should take steps to restrict access to vfio device regions while
+ *   the device is in STOP_COPY or risk corruption of the device migration data
+ *   stream.
+ *
+ * STOP -> RESUMING
+ *   Entering the RESUMING state starts a process of restoring the device
+ *   state and will return a new data_fd. The data stream fed into the data_fd
+ *   should be taken from the data transfer output of the saving group states
+ *   from a compatible device. The migration driver may alter/reset the
+ *   internal device state for this arc if required to prepare the device to
+ *   receive the migration data.
+ *
+ * any -> ERROR
+ *   ERROR cannot be specified as a device state, however any transition request
+ *   can be failed with an errno return and may then move the device_state into
+ *   ERROR. In this case the device was unable to execute the requested arc and
+ *   was also unable to restore the device to any valid device_state.
+ *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
+ *   device_state back to RUNNING.
+ *
+ * The remaining possible transitions are interpreted as combinations of the
+ * above FSM arcs. As there are multiple paths through the FSM arcs, the path
+ * should be selected based on the following rules:
+ *   - Select the shortest path.
+ * Refer to vfio_mig_get_next_state() for the result of the algorithm.
+ *
+ * The automatic transit through the FSM arcs that make up the combination
+ * transition is invisible to the user. When working with combination arcs the
+ * user may see any step along the path in the device_state if SET_STATE
+ * fails. When handling these types of errors users should anticipate future
+ * revisions of this protocol using new states and those states becoming
+ * visible in this case.
+ */
+enum vfio_device_mig_state {
+	VFIO_DEVICE_STATE_ERROR = 0,
+	VFIO_DEVICE_STATE_STOP = 1,
+	VFIO_DEVICE_STATE_RUNNING = 2,
+	VFIO_DEVICE_STATE_STOP_COPY = 3,
+	VFIO_DEVICE_STATE_RESUMING = 4,
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.18.1
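
A minimal user-space sketch of the save flow this uAPI implies (not part
of the series): it assumes device_fd is an already-open VFIO device FD,
uses only the structures defined above, and elides most error handling.

  #include <errno.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Request a state change; returns the new data_fd (-1 when the arc opens
   * no transfer session) or a negative errno on failure.
   */
  static int mig_set_state(int device_fd, __u32 new_state)
  {
  	__u64 buf[(sizeof(struct vfio_device_feature) +
  		   sizeof(struct vfio_device_feature_mig_state) + 7) / 8] = {};
  	struct vfio_device_feature *hdr = (void *)buf;
  	struct vfio_device_feature_mig_state *mig = (void *)hdr->data;

  	hdr->argsz = sizeof(buf);
  	hdr->flags = VFIO_DEVICE_FEATURE_SET |
  		     VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
  	mig->device_state = new_state;

  	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, hdr))
  		return -errno;
  	return mig->data_fd;
  }

  /* Source side: request STOP_COPY (the core resolves the combination
   * transition from RUNNING) and drain the opaque migration stream.
   */
  static int save_device_state(int device_fd, int out_fd)
  {
  	int data_fd = mig_set_state(device_fd, VFIO_DEVICE_STATE_STOP_COPY);
  	char buf[4096];
  	ssize_t n;

  	if (data_fd < 0)
  		return data_fd;
  	while ((n = read(data_fd, buf, sizeof(buf))) > 0)
  		if (write(out_fd, buf, n) != n)
  			break;	/* segments must be replayed in this order */
  	close(data_fd);
  	return n ? -EIO : 0;
  }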



* [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
From: Yishai Hadas @ 2022-02-07 17:22 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Jason Gunthorpe <jgg@nvidia.com>

The RUNNING_P2P state is designed to support multiple devices in the same
VM that are doing P2P transactions between themselves. When in RUNNING_P2P
the device must be able to accept incoming P2P transactions but should not
generate outgoing transactions.

As an optional extension to the mandatory states it is defined as
in between STOP and RUNNING:
   STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP

For drivers that are unable to support RUNNING_P2P the core code silently
merges RUNNING_P2P and RUNNING together. Drivers that support this will be
required to implement 4 FSM arcs beyond the basic FSM. 2 of the basic FSM
arcs become combination transitions.
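
As an illustration only (this is not code from the series), a driver's
migration_set_state() can resolve a requested combination transition by
looping over vfio_mig_get_next_state() and executing one supported arc
per iteration; struct my_dev and my_step_one_arc() are hypothetical:

  static struct file *
  my_set_state(struct vfio_device *vdev, struct my_dev *mdev,
  	       enum vfio_device_mig_state new_state)
  {
  	enum vfio_device_mig_state next;
  	struct file *res = NULL;
  	int ret;

  	while (mdev->mig_state != new_state) {
  		ret = vfio_mig_get_next_state(vdev, mdev->mig_state,
  					      new_state, &next);
  		if (ret)
  			return ERR_PTR(ret);
  		/* Execute exactly one arc from the documented set */
  		res = my_step_one_arc(mdev, next);
  		if (IS_ERR(res))
  			return res;	/* device_state is left mid-path */
  		mdev->mig_state = next;
  	}
  	return res;	/* FD if the final arc opened a transfer session */
  }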

Compared to the v1 clarification, NDMA is redefined into FSM states and is
described in terms of the desired P2P quiescent behavior, noting that
halting all DMA is an acceptable implementation.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       | 79 ++++++++++++++++++++++++++++++---------
 include/linux/vfio.h      |  1 +
 include/uapi/linux/vfio.h | 34 ++++++++++++++++-
 3 files changed, 95 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index e7ab9f2048cd..8c484593dfe0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1577,39 +1577,55 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
 	/*
-	 * The coding in this table requires the driver to implement 6
+	 * The coding in this table requires the driver to implement
 	 * FSM arcs:
 	 *         RESUMING -> STOP
-	 *         RUNNING -> STOP
 	 *         STOP -> RESUMING
-	 *         STOP -> RUNNING
 	 *         STOP -> STOP_COPY
 	 *         STOP_COPY -> STOP
 	 *
-	 * The coding will step through multiple states for these combination
-	 * transitions:
-	 *         RESUMING -> STOP -> RUNNING
+	 * If P2P is supported then the driver must also implement these FSM
+	 * arcs:
+	 *         RUNNING -> RUNNING_P2P
+	 *         RUNNING_P2P -> RUNNING
+	 *         RUNNING_P2P -> STOP
+	 *         STOP -> RUNNING_P2P
+	 * Without P2P the driver must implement:
+	 *         RUNNING -> STOP
+	 *         STOP -> RUNNING
+	 *
+	 * If all optional features are supported then the coding will step
+	 * through multiple states for these combination transitions:
+	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
 	 *         RESUMING -> STOP -> STOP_COPY
-	 *         RUNNING -> STOP -> RESUMING
-	 *         RUNNING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> STOP
+	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> RUNNING
 	 *         STOP_COPY -> STOP -> RESUMING
-	 *         STOP_COPY -> STOP -> RUNNING
+	 *         STOP_COPY -> STOP -> RUNNING_P2P
+	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RUNNING] = {
-			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
-			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
@@ -1617,6 +1633,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RESUMING] = {
@@ -1624,6 +1641,15 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_ERROR] = {
@@ -1631,17 +1657,36 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 	};
 
-	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table)))
+	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
+		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RUNNING_P2P] =
+			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P,
+		[VFIO_DEVICE_STATE_ERROR] = ~0U,
+	};
+
+	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
+		    (state_flags_table[cur_fsm] & device->migration_flags) !=
+			state_flags_table[cur_fsm]))
 		return -EINVAL;
 
-	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
+	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
+	   (state_flags_table[new_fsm] & device->migration_flags) !=
+			state_flags_table[new_fsm])
 		return -EINVAL;
 
 	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
+			state_flags_table[*next_fsm])
+		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
+
 	return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL;
 }
 EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
@@ -1731,7 +1776,7 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 					       size_t argsz)
 {
 	struct vfio_device_feature_migration mig = {
-		.flags = VFIO_MIGRATION_STOP_COPY,
+		.flags = device->migration_flags,
 	};
 	int ret;
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 3f4a1a7c2277..a173718d2a1b 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -33,6 +33,7 @@ struct vfio_device {
 	struct vfio_group *group;
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
+	unsigned int migration_flags;
 
 	/* Members below here are private, not for driver use */
 	refcount_t refcount;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 89012bc01663..773895988cf1 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1009,10 +1009,16 @@ struct vfio_device_feature {
  *
  * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
  * RESUMING are supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
+ * is supported in addition to the STOP_COPY states.
+ *
+ * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+#define VFIO_MIGRATION_P2P		(1 << 1)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -1063,10 +1069,13 @@ struct vfio_device_feature_mig_state {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
+ * And 1 optional state to support VFIO_MIGRATION_P2P:
+ *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
  *
- * RUNNING -> STOP
+ * RUNNING_P2P -> STOP
  * STOP_COPY -> STOP
  *   While in STOP the device must stop the operation of the device. The
  *   device must not generate interrupts, DMA, or advance its internal
@@ -1093,11 +1102,16 @@ struct vfio_device_feature_mig_state {
  *
  *   To abort a RESUMING session the device must be reset.
  *
- * STOP -> RUNNING
+ * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ * RUNNING -> RUNNING_P2P
+ * STOP -> RUNNING_P2P
+ *   While in RUNNING_P2P the device is partially running in the P2P quiescent
+ *   state defined below.
+ *
  * STOP -> STOP_COPY
 *   This arc begins the process of saving the device state and will return a
  *   new data_fd.
@@ -1127,6 +1141,16 @@ struct vfio_device_feature_mig_state {
  *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
  *   device_state back to RUNNING.
  *
+ * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
+ * state for the device for the purposes of managing multiple devices within a
+ * user context where peer-to-peer DMA between devices may be active. The
+ * RUNNING_P2P states must prevent the device from initiating
+ * any new P2P DMA transactions. If the device can identify P2P transactions
+ * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
+ * driver must complete any such outstanding operations prior to completing the
+ * FSM arc into a P2P state. For the purpose of specification the states
+ * behave as though the device was fully running if not supported.
+ *
  * The remaining possible transitions are interpreted as combinations of the
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
@@ -1139,6 +1163,11 @@ struct vfio_device_feature_mig_state {
  * fails. When handling these types of errors users should anticipate future
  * revisions of this protocol using new states and those states becoming
  * visible in this case.
+ *
+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can disocver if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -1146,6 +1175,7 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING = 2,
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
+	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
 /* -------- API for Type1 VFIO IOMMU -------- */
-- 
2.18.1
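
User space can probe for the new flag before relying on RUNNING_P2P. A
short sketch along the lines of the STOP_COPY example earlier (same
headers, open device_fd assumed, error handling elided):

  __u64 buf[(sizeof(struct vfio_device_feature) +
  	     sizeof(struct vfio_device_feature_migration) + 7) / 8] = {};
  struct vfio_device_feature *hdr = (void *)buf;
  struct vfio_device_feature_migration *mig = (void *)hdr->data;

  hdr->argsz = sizeof(buf);
  hdr->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

  if (ioctl(device_fd, VFIO_DEVICE_FEATURE, hdr) == 0 &&
      (mig->flags & VFIO_MIGRATION_P2P))
  	/* RUNNING_P2P may be named directly in SET_STATE */;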



* [PATCH V7 mlx5-next 10/15] vfio: Remove migration protocol v1 documentation
From: Yishai Hadas @ 2022-02-07 17:22 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Jason Gunthorpe <jgg@nvidia.com>

v1 was never implemented and is replaced by v2.

The old uAPI documentation is removed from the header file.

The old uAPI definitions are still kept in the header file until v2
reaches Linus's tree.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 include/uapi/linux/vfio.h | 200 +-------------------------------------
 1 file changed, 2 insertions(+), 198 deletions(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 773895988cf1..227f55d57e06 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -323,7 +323,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
-#define VFIO_REGION_TYPE_MIGRATION              (3)
+#define VFIO_REGION_TYPE_MIGRATION_DEPRECATED   (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -405,203 +405,7 @@ struct vfio_region_gfx_edid {
 #define VFIO_REGION_SUBTYPE_CCW_CRW		(3)
 
 /* sub-types for VFIO_REGION_TYPE_MIGRATION */
-#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
-
-/*
- * The structure vfio_device_migration_info is placed at the 0th offset of
- * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
- * migration information. Field accesses from this structure are only supported
- * at their native width and alignment. Otherwise, the result is undefined and
- * vendor drivers should return an error.
- *
- * device_state: (read/write)
- *      - The user application writes to this field to inform the vendor driver
- *        about the device state to be transitioned to.
- *      - The vendor driver should take the necessary actions to change the
- *        device state. After successful transition to a given state, the
- *        vendor driver should return success on write(device_state, state)
- *        system call. If the device state transition fails, the vendor driver
- *        should return an appropriate -errno for the fault condition.
- *      - On the user application side, if the device state transition fails,
- *	  that is, if write(device_state, state) returns an error, read
- *	  device_state again to determine the current state of the device from
- *	  the vendor driver.
- *      - The vendor driver should return previous state of the device unless
- *        the vendor driver has encountered an internal error, in which case
- *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
- *      - The user application must use the device reset ioctl to recover the
- *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
- *        indicated to be in a valid device state by reading device_state, the
- *        user application may attempt to transition the device to any valid
- *        state reachable from the current state or terminate itself.
- *
- *      device_state consists of 3 bits:
- *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
- *        it indicates the _STOP state. When the device state is changed to
- *        _STOP, driver should stop the device before write() returns.
- *      - If bit 1 is set, it indicates the _SAVING state, which means that the
- *        driver should start gathering device state information that will be
- *        provided to the VFIO user application to save the device's state.
- *      - If bit 2 is set, it indicates the _RESUMING state, which means that
- *        the driver should prepare to resume the device. Data provided through
- *        the migration region should be used to resume the device.
- *      Bits 3 - 31 are reserved for future use. To preserve them, the user
- *      application should perform a read-modify-write operation on this
- *      field when modifying the specified bits.
- *
- *  +------- _RESUMING
- *  |+------ _SAVING
- *  ||+----- _RUNNING
- *  |||
- *  000b => Device Stopped, not saving or resuming
- *  001b => Device running, which is the default state
- *  010b => Stop the device & save the device state, stop-and-copy state
- *  011b => Device running and save the device state, pre-copy state
- *  100b => Device stopped and the device state is resuming
- *  101b => Invalid state
- *  110b => Error state
- *  111b => Invalid state
- *
- * State transitions:
- *
- *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
- *                (100b)     (001b)     (011b)        (010b)       (000b)
- * 0. Running or default state
- *                             |
- *
- * 1. Normal Shutdown (optional)
- *                             |------------------------------------->|
- *
- * 2. Save the state or suspend
- *                             |------------------------->|---------->|
- *
- * 3. Save the state during live migration
- *                             |----------->|------------>|---------->|
- *
- * 4. Resuming
- *                  |<---------|
- *
- * 5. Resumed
- *                  |--------->|
- *
- * 0. Default state of VFIO device is _RUNNING when the user application starts.
- * 1. During normal shutdown of the user application, the user application may
- *    optionally change the VFIO device state from _RUNNING to _STOP. This
- *    transition is optional. The vendor driver must support this transition but
- *    must not require it.
- * 2. When the user application saves state or suspends the application, the
- *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
- *    On state transition from _RUNNING to stop-and-copy, driver must stop the
- *    device, save the device state and send it to the application through the
- *    migration region. The sequence to be followed for such transition is given
- *    below.
- * 3. In live migration of user application, the state transitions from _RUNNING
- *    to pre-copy, to stop-and-copy, and to _STOP.
- *    On state transition from _RUNNING to pre-copy, the driver should start
- *    gathering the device state while the application is still running and send
- *    the device state data to application through the migration region.
- *    On state transition from pre-copy to stop-and-copy, the driver must stop
- *    the device, save the device state and send it to the user application
- *    through the migration region.
- *    Vendor drivers must support the pre-copy state even for implementations
- *    where no data is provided to the user before the stop-and-copy state. The
- *    user must not be required to consume all migration data before the device
- *    transitions to a new state, including the stop-and-copy state.
- *    The sequence to be followed for above two transitions is given below.
- * 4. To start the resuming phase, the device state should be transitioned from
- *    the _RUNNING to the _RESUMING state.
- *    In the _RESUMING state, the driver should use the device state data
- *    received through the migration region to resume the device.
- * 5. After providing saved device data to the driver, the application should
- *    change the state from _RESUMING to _RUNNING.
- *
- * reserved:
- *      Reads on this field return zero and writes are ignored.
- *
- * pending_bytes: (read only)
- *      The number of pending bytes still to be migrated from the vendor driver.
- *
- * data_offset: (read only)
- *      The user application should read data_offset field from the migration
- *      region. The user application should read the device data from this
- *      offset within the migration region during the _SAVING state or write
- *      the device data during the _RESUMING state. See below for details of
- *      sequence to be followed.
- *
- * data_size: (read/write)
- *      The user application should read data_size to get the size in bytes of
- *      the data copied in the migration region during the _SAVING state and
- *      write the size in bytes of the data copied in the migration region
- *      during the _RESUMING state.
- *
- * The format of the migration region is as follows:
- *  ------------------------------------------------------------------
- * |vfio_device_migration_info|    data section                      |
- * |                          |     ///////////////////////////////  |
- * ------------------------------------------------------------------
- *   ^                              ^
- *  offset 0-trapped part        data_offset
- *
- * The structure vfio_device_migration_info is always followed by the data
- * section in the region, so data_offset will always be nonzero. The offset
- * from where the data is copied is decided by the kernel driver. The data
- * section can be trapped, mmapped, or partitioned, depending on how the kernel
- * driver defines the data section. The data section partition can be defined
- * as mapped by the sparse mmap capability. If mmapped, data_offset must be
- * page aligned, whereas initial section which contains the
- * vfio_device_migration_info structure, might not end at the offset, which is
- * page aligned. The user is not required to access through mmap regardless
- * of the capabilities of the region mmap.
- * The vendor driver should determine whether and how to partition the data
- * section. The vendor driver should return data_offset accordingly.
- *
- * The sequence to be followed while in pre-copy state and stop-and-copy state
- * is as follows:
- * a. Read pending_bytes, indicating the start of a new iteration to get device
- *    data. Repeated read on pending_bytes at this stage should have no side
- *    effects.
- *    If pending_bytes == 0, the user application should not iterate to get data
- *    for that device.
- *    If pending_bytes > 0, perform the following steps.
- * b. Read data_offset, indicating that the vendor driver should make data
- *    available through the data section. The vendor driver should return this
- *    read operation only after data is available from (region + data_offset)
- *    to (region + data_offset + data_size).
- * c. Read data_size, which is the amount of data in bytes available through
- *    the migration region.
- *    Read on data_offset and data_size should return the offset and size of
- *    the current buffer if the user application reads data_offset and
- *    data_size more than once here.
- * d. Read data_size bytes of data from (region + data_offset) from the
- *    migration region.
- * e. Process the data.
- * f. Read pending_bytes, which indicates that the data from the previous
- *    iteration has been read. If pending_bytes > 0, go to step b.
- *
- * The user application can transition from the _SAVING|_RUNNING
- * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
- * number of pending bytes. The user application should iterate in _SAVING
- * (stop-and-copy) until pending_bytes is 0.
- *
- * The sequence to be followed while _RESUMING device state is as follows:
- * While data for this device is available, repeat the following steps:
- * a. Read data_offset from where the user application should write data.
- * b. Write migration data starting at the migration region + data_offset for
- *    the length determined by data_size from the migration source.
- * c. Write data_size, which indicates to the vendor driver that data is
- *    written in the migration region. Vendor driver must return this write
- *    operations on consuming data. Vendor driver should apply the
- *    user-provided migration region data to the device resume state.
- *
- * If an error occurs during the above sequences, the vendor driver can return
- * an error code for next read() or write() operation, which will terminate the
- * loop. The user application should then take the next necessary action, for
- * example, failing migration or terminating the user application.
- *
- * For the user application, data is opaque. The user application should write
- * data in the same order as the data is received and the data should be of
- * same transaction size at the source.
- */
+#define VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED (1)
 
 struct vfio_device_migration_info {
 	__u32 device_state;         /* VFIO device state */
-- 
2.18.1



* [PATCH V7 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device
From: Yishai Hadas @ 2022-02-07 17:22 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

Expose migration commands over the device; these include suspend, resume,
get vhca id, and query/save/load state.

As part of this, add the APIs and data structures needed to manage the
migration data.
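
For orientation, the save-side calling order these commands imply,
condensed from the driver patch later in this series (allocation of the
migf scatter table is elided):

  size_t state_size;
  int err;

  err = mlx5vf_cmd_query_vhca_migration_state(pdev, vhca_id, &state_size);
  if (err)
  	return err;
  /* ... allocate migf pages covering state_size bytes ... */
  err = mlx5vf_cmd_save_vhca_state(pdev, vhca_id, migf);
  /* on success, migf->total_length holds the actual image size */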

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 259 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h |  35 +++++
 2 files changed, 294 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5/cmd.c
 create mode 100644 drivers/vfio/pci/mlx5/cmd.h

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
new file mode 100644
index 000000000000..5c9f9218cc1d
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "cmd.h"
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA);
+	MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(suspend_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, suspend_vhca, in, out);
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(resume_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(resume_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(resume_vhca_in, in, opcode, MLX5_CMD_OP_RESUME_VHCA);
+	MLX5_SET(resume_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(resume_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, resume_vhca, in, out);
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+					  size_t *state_size)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(query_vhca_migration_state_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE);
+	MLX5_SET(query_vhca_migration_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(query_vhca_migration_state_in, in, op_mod, 0);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_vhca_migration_state, in, out);
+	if (ret)
+		goto end;
+
+	*state_size = MLX5_GET(query_vhca_migration_state_out, out,
+			       required_umem_size);
+
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
+	int out_size;
+	void *out;
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	out_size = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	out = kzalloc(out_size, GFP_KERNEL);
+	if (!out) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+	MLX5_SET(query_hca_cap_in, in, other_function, 1);
+	MLX5_SET(query_hca_cap_in, in, function_id, function_id);
+	MLX5_SET(query_hca_cap_in, in, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1 |
+		 HCA_CAP_OPMOD_GET_CUR);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_hca_cap, in, out);
+	if (ret)
+		goto err_exec;
+
+	*vhca_id = MLX5_GET(query_hca_cap_out, out,
+			    capability.cmd_hca_cap.vhca_id);
+
+err_exec:
+	kfree(out);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
+			      struct mlx5_vf_migration_file *migf, u32 *mkey)
+{
+	size_t npages = DIV_ROUND_UP(migf->total_length, PAGE_SIZE);
+	struct sg_dma_page_iter dma_iter;
+	int err = 0, inlen;
+	__be64 *mtt;
+	void *mkc;
+	u32 *in;
+
+	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+		sizeof(*mtt) * round_up(npages, 2);
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
+		 DIV_ROUND_UP(npages, 2));
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
+
+	for_each_sgtable_dma_page(&migf->table.sgt, &dma_iter, 0)
+		*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, rr, 1);
+	MLX5_SET(mkc, mkc, rw, 1);
+	MLX5_SET(mkc, mkc, pd, pdn);
+	MLX5_SET(mkc, mkc, bsf_octword_size, 0);
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
+	MLX5_SET64(mkc, mkc, len, migf->total_length);
+	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
+	kvfree(in);
+	return err;
+}
+
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {};
+	u32 pdn, mkey;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = dma_map_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE,
+			      0);
+	if (err)
+		goto err_dma_map;
+
+	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	if (err)
+		goto err_create_mkey;
+
+	MLX5_SET(save_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_SAVE_VHCA_STATE);
+	MLX5_SET(save_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(save_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(save_vhca_state_in, in, mkey, mkey);
+	MLX5_SET(save_vhca_state_in, in, size, migf->total_length);
+
+	err = mlx5_cmd_exec_inout(mdev, save_vhca_state, in, out);
+	if (err)
+		goto err_exec;
+
+	migf->total_length =
+		MLX5_GET(save_vhca_state_out, out, actual_image_size);
+
+	mlx5_core_destroy_mkey(mdev, mkey);
+	mlx5_core_dealloc_pd(mdev, pdn);
+	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
+	mlx5_vf_put_core_dev(mdev);
+
+	return 0;
+
+err_exec:
+	mlx5_core_destroy_mkey(mdev, mkey);
+err_create_mkey:
+	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
+err_dma_map:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return err;
+}
+
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(load_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(load_vhca_state_in)] = {};
+	u32 pdn, mkey;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	mutex_lock(&migf->lock);
+	if (!migf->total_length) {
+		err = -EINVAL;
+		goto end;
+	}
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = dma_map_sgtable(mdev->device, &migf->table.sgt, DMA_TO_DEVICE, 0);
+	if (err)
+		goto err_reg;
+
+	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	if (err)
+		goto err_mkey;
+
+	MLX5_SET(load_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_LOAD_VHCA_STATE);
+	MLX5_SET(load_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(load_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(load_vhca_state_in, in, mkey, mkey);
+	MLX5_SET(load_vhca_state_in, in, size, migf->total_length);
+
+	err = mlx5_cmd_exec_inout(mdev, load_vhca_state, in, out);
+
+	mlx5_core_destroy_mkey(mdev, mkey);
+err_mkey:
+	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_TO_DEVICE, 0);
+err_reg:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	mutex_unlock(&migf->lock);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
new file mode 100644
index 000000000000..69a1481ed953
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#ifndef MLX5_VFIO_CMD_H
+#define MLX5_VFIO_CMD_H
+
+#include <linux/kernel.h>
+#include <linux/mlx5/driver.h>
+
+struct mlx5_vf_migration_file {
+	struct file *filp;
+	struct mutex lock;
+
+	struct sg_append_table table;
+	size_t total_length;
+	size_t allocated_length;
+
+	/* Optimize mlx5vf_get_migration_page() for sequential access */
+	struct scatterlist *last_offset_sg;
+	unsigned int sg_last_entry;
+	unsigned long last_offset;
+};
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+					  size_t *state_size);
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id);
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf);
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf);
+#endif /* MLX5_VFIO_CMD_H */
-- 
2.18.1
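
For context, a sketch of the expected probe-time use of
mlx5vf_cmd_get_vhca_id(), assuming the VF index comes from
pci_iov_vf_id() added earlier in this series and that function_id is the
1-based VF number as seen by the PF:

  int vf_id = pci_iov_vf_id(pdev);
  int err;

  if (vf_id < 0)
  	return vf_id;
  err = mlx5vf_cmd_get_vhca_id(pdev, vf_id + 1, &mvdev->vhca_id);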



* [PATCH V7 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
From: Yishai Hadas @ 2022-02-07 17:22 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

This patch adds a device-specific vfio_pci driver for mlx5 devices.

It uses vfio_pci_core to register to the VFIO subsystem and then
implements the mlx5 specific logic in the migration area.

The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.

This patch implements the suspend/resume flows.
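
The arc-to-command mapping implemented below in
mlx5vf_pci_step_device_state_locked() can be summarized as follows (the
MASTER/SLAVE op_mod split presumably distinguishes quiescing outgoing
DMA from a full stop):

  RUNNING     -> RUNNING_P2P : mlx5vf_cmd_suspend_vhca(OP_MOD_SUSPEND_MASTER)
  RUNNING_P2P -> STOP        : mlx5vf_cmd_suspend_vhca(OP_MOD_SUSPEND_SLAVE)
  STOP        -> RUNNING_P2P : mlx5vf_cmd_resume_vhca(OP_MOD_RESUME_SLAVE)
  RUNNING_P2P -> RUNNING     : mlx5vf_cmd_resume_vhca(OP_MOD_RESUME_MASTER)
  STOP        -> STOP_COPY   : mlx5vf_pci_save_device_data(), returns data_fd
  STOP_COPY   -> STOP        : mlx5vf_disable_fds()
  STOP        -> RESUMING    : mlx5vf_pci_resume_device_data(), returns data_fd
  RESUMING    -> STOP        : mlx5vf_cmd_load_vhca_state()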

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 MAINTAINERS                    |   6 +
 drivers/vfio/pci/Kconfig       |   3 +
 drivers/vfio/pci/Makefile      |   2 +
 drivers/vfio/pci/mlx5/Kconfig  |  10 +
 drivers/vfio/pci/mlx5/Makefile |   4 +
 drivers/vfio/pci/mlx5/cmd.h    |   1 +
 drivers/vfio/pci/mlx5/main.c   | 623 +++++++++++++++++++++++++++++++++
 7 files changed, 649 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5/Kconfig
 create mode 100644 drivers/vfio/pci/mlx5/Makefile
 create mode 100644 drivers/vfio/pci/mlx5/main.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ea3e6c914384..5c5216f5e43d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20260,6 +20260,12 @@ L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/platform/
 
+VFIO MLX5 PCI DRIVER
+M:	Yishai Hadas <yishaih@nvidia.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	drivers/vfio/pci/mlx5/
+
 VGA_SWITCHEROO
 R:	Lukas Wunner <lukas@wunner.de>
 S:	Maintained
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 860424ccda1b..187b9c259944 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -43,4 +43,7 @@ config VFIO_PCI_IGD
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
 endif
+
+source "drivers/vfio/pci/mlx5/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 349d68d242b4..ed9d6f2e0555 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
new file mode 100644
index 000000000000..29ba9c504a75
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config MLX5_VFIO_PCI
+	tristate "VFIO support for MLX5 PCI devices"
+	depends on MLX5_CORE
+	depends on VFIO_PCI_CORE
+	help
+	  This provides migration support for MLX5 devices using the VFIO
+	  framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
new file mode 100644
index 000000000000..689627da7ff5
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
+mlx5-vfio-pci-y := main.o cmd.o
+
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 69a1481ed953..1392a11a9cc0 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -12,6 +12,7 @@
 struct mlx5_vf_migration_file {
 	struct file *filp;
 	struct mutex lock;
+	bool disabled;
 
 	struct sg_append_table table;
 	size_t total_length;
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
new file mode 100644
index 000000000000..acd205bcff70
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -0,0 +1,623 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/anon_inodes.h>
+
+#include "cmd.h"
+
+/* Arbitrary to prevent userspace from consuming endless memory */
+#define MAX_MIGRATION_SIZE (512*1024*1024)
+
+struct mlx5vf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	u8 migrate_cap:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	u16 vhca_id;
+	struct mlx5_vf_migration_file *resuming_migf;
+	struct mlx5_vf_migration_file *saving_migf;
+};
+
+static struct page *
+mlx5vf_get_migration_page(struct mlx5_vf_migration_file *migf,
+			  unsigned long offset)
+{
+	unsigned long cur_offset = 0;
+	struct scatterlist *sg;
+	unsigned int i;
+
+	/* All accesses are sequential */
+	if (offset < migf->last_offset || !migf->last_offset_sg) {
+		migf->last_offset = 0;
+		migf->last_offset_sg = migf->table.sgt.sgl;
+		migf->sg_last_entry = 0;
+	}
+
+	cur_offset = migf->last_offset;
+
+	for_each_sg(migf->last_offset_sg, sg,
+			migf->table.sgt.orig_nents - migf->sg_last_entry, i) {
+		if (offset < sg->length + cur_offset) {
+			migf->last_offset_sg = sg;
+			migf->sg_last_entry += i;
+			migf->last_offset = cur_offset;
+			return nth_page(sg_page(sg),
+					(offset - cur_offset) / PAGE_SIZE);
+		}
+		cur_offset += sg->length;
+	}
+	return NULL;
+}
+
+static int mlx5vf_add_migration_pages(struct mlx5_vf_migration_file *migf,
+				      unsigned int npages)
+{
+	unsigned int to_alloc = npages;
+	struct page **page_list;
+	unsigned long filled;
+	unsigned int to_fill;
+	int ret;
+
+	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
+	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL);
+	if (!page_list)
+		return -ENOMEM;
+
+	do {
+		filled = alloc_pages_bulk_array(GFP_KERNEL, to_fill, page_list);
+		if (!filled) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		to_alloc -= filled;
+		ret = sg_alloc_append_table_from_pages(
+			&migf->table, page_list, filled, 0,
+			filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
+			GFP_KERNEL);
+
+		if (ret)
+			goto err;
+		migf->allocated_length += filled * PAGE_SIZE;
+		/* clean input for another bulk allocation */
+		memset(page_list, 0, filled * sizeof(*page_list));
+		to_fill = min_t(unsigned int, to_alloc,
+				PAGE_SIZE / sizeof(*page_list));
+	} while (to_alloc > 0);
+
+	kvfree(page_list);
+	return 0;
+
+err:
+	kvfree(page_list);
+	return ret;
+}
+
+static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf)
+{
+	struct sg_page_iter sg_iter;
+
+	mutex_lock(&migf->lock);
+	/* Undo alloc_pages_bulk_array() */
+	for_each_sgtable_page(&migf->table.sgt, &sg_iter, 0)
+		__free_page(sg_page_iter_page(&sg_iter));
+	sg_free_append_table(&migf->table);
+	migf->disabled = true;
+	migf->total_length = 0;
+	migf->allocated_length = 0;
+	migf->filp->f_pos = 0;
+	mutex_unlock(&migf->lock);
+}
+
+static int mlx5vf_release_file(struct inode *inode, struct file *filp)
+{
+	struct mlx5_vf_migration_file *migf = filp->private_data;
+
+	mlx5vf_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+	return 0;
+}
+
+static ssize_t mlx5vf_save_read(struct file *filp, char __user *buf, size_t len,
+			       loff_t *pos)
+{
+	struct mlx5_vf_migration_file *migf = filp->private_data;
+	ssize_t done = 0;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	mutex_lock(&migf->lock);
+	if (*pos > migf->total_length) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, migf->total_length - *pos, len);
+	while (len) {
+		size_t page_offset;
+		struct page *page;
+		size_t page_len;
+		u8 *from_buff;
+		int ret;
+
+		page_offset = (*pos) % PAGE_SIZE;
+		page = mlx5vf_get_migration_page(migf, *pos - page_offset);
+		if (!page) {
+			if (done == 0)
+				done = -EINVAL;
+			goto out_unlock;
+		}
+
+		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
+		from_buff = kmap_local_page(page);
+		ret = copy_to_user(buf, from_buff + page_offset, page_len);
+		kunmap_local(from_buff);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += page_len;
+		len -= page_len;
+		done += page_len;
+		buf += page_len;
+	}
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations mlx5vf_save_fops = {
+	.owner = THIS_MODULE,
+	.read = mlx5vf_save_read,
+	.release = mlx5vf_release_file,
+	.llseek = no_llseek,
+};
+
+static struct mlx5_vf_migration_file *
+mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("mlx5vf_mig", &mlx5vf_save_fops, migf,
+					O_RDONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	ret = mlx5vf_cmd_query_vhca_migration_state(
+		mvdev->core_device.pdev, mvdev->vhca_id, &migf->total_length);
+	if (ret)
+		goto out_free;
+
+	ret = mlx5vf_add_migration_pages(
+		migf, DIV_ROUND_UP_ULL(migf->total_length, PAGE_SIZE));
+	if (ret)
+		goto out_free;
+
+	ret = mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
+					 mvdev->vhca_id, migf);
+	if (ret)
+		goto out_free;
+	return migf;
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
+				   size_t len, loff_t *pos)
+{
+	struct mlx5_vf_migration_file *migf = filp->private_data;
+	loff_t requested_length;
+	ssize_t done = 0;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	if (*pos < 0 ||
+	    check_add_overflow((loff_t)len, *pos, &requested_length))
+		return -EINVAL;
+
+	if (requested_length > MAX_MIGRATION_SIZE)
+		return -ENOMEM;
+
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	if (migf->allocated_length < requested_length) {
+		done = mlx5vf_add_migration_pages(
+			migf,
+			DIV_ROUND_UP(requested_length - migf->allocated_length,
+				     PAGE_SIZE));
+		if (done)
+			goto out_unlock;
+	}
+
+	while (len) {
+		size_t page_offset;
+		struct page *page;
+		size_t page_len;
+		u8 *to_buff;
+		int ret;
+
+		page_offset = (*pos) % PAGE_SIZE;
+		page = mlx5vf_get_migration_page(migf, *pos - page_offset);
+		if (!page) {
+			if (done == 0)
+				done = -EINVAL;
+			goto out_unlock;
+		}
+
+		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
+		to_buff = kmap_local_page(page);
+		ret = copy_from_user(to_buff + page_offset, buf, page_len);
+		kunmap_local(to_buff);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += page_len;
+		len -= page_len;
+		done += page_len;
+		buf += page_len;
+		migf->total_length += page_len;
+	}
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations mlx5vf_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = mlx5vf_resume_write,
+	.release = mlx5vf_release_file,
+	.llseek = no_llseek,
+};
+
+static struct mlx5_vf_migration_file *
+mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vf_migration_file *migf;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("mlx5vf_mig", &mlx5vf_resume_fops, migf,
+					O_WRONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+	return migf;
+}
+
+static void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev)
+{
+	if (mvdev->resuming_migf) {
+		mlx5vf_disable_fd(mvdev->resuming_migf);
+		fput(mvdev->resuming_migf->filp);
+		mvdev->resuming_migf = NULL;
+	}
+	if (mvdev->saving_migf) {
+		mlx5vf_disable_fd(mvdev->saving_migf);
+		fput(mvdev->saving_migf->filp);
+		mvdev->saving_migf = NULL;
+	}
+}
+
+static struct file *
+mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
+				    u32 new)
+{
+	u32 cur = mvdev->mig_state;
+	int ret;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_STOP) {
+		ret = mlx5vf_cmd_suspend_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
+		ret = mlx5vf_cmd_resume_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
+		ret = mlx5vf_cmd_suspend_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) {
+		ret = mlx5vf_cmd_resume_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct mlx5_vf_migration_file *migf;
+
+		migf = mlx5vf_pci_save_device_data(mvdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		mvdev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
+		mlx5vf_disable_fds(mvdev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
+		struct mlx5_vf_migration_file *migf;
+
+		migf = mlx5vf_pci_resume_device_data(mvdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		mvdev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
+						 mvdev->vhca_id,
+						 mvdev->resuming_migf);
+		if (ret)
+			return ERR_PTR(ret);
+		mlx5vf_disable_fds(mvdev);
+		return NULL;
+	}
+
+	/*
+	 * vfio_mig_get_next_state() does not use arcs other than the above
+	 */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct file *
+mlx5vf_pci_set_device_state(struct vfio_device *vdev,
+			    enum vfio_device_mig_state new_state)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+	int ret;
+
+	mutex_lock(&mvdev->state_mutex);
+	while (new_state != mvdev->mig_state) {
+		ret = vfio_mig_get_next_state(vdev, mvdev->mig_state,
+					      new_state, &next_state);
+		if (ret) {
+			res = ERR_PTR(ret);
+			break;
+		}
+		res = mlx5vf_pci_step_device_state_locked(mvdev, next_state);
+		if (IS_ERR(res))
+			break;
+		mvdev->mig_state = next_state;
+		if (WARN_ON(res && new_state != mvdev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	mutex_unlock(&mvdev->state_mutex);
+	return res;
+}
+
+static int mlx5vf_pci_get_device_state(struct vfio_device *vdev,
+				       enum vfio_device_mig_state *curr_state)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+
+	mutex_lock(&mvdev->state_mutex);
+	*curr_state = mvdev->mig_state;
+	mutex_unlock(&mvdev->state_mutex);
+	return 0;
+}
+
+static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &mvdev->core_device;
+	int vf_id;
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	if (!mvdev->migrate_cap) {
+		vfio_pci_core_finish_enable(vdev);
+		return 0;
+	}
+
+	vf_id = pci_iov_vf_id(vdev->pdev);
+	if (vf_id < 0) {
+		ret = vf_id;
+		goto out_disable;
+	}
+
+	ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1, &mvdev->vhca_id);
+	if (ret)
+		goto out_disable;
+
+	mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+	vfio_pci_core_finish_enable(vdev);
+	return 0;
+out_disable:
+	vfio_pci_core_disable(vdev);
+	return ret;
+}
+
+static void mlx5vf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+
+	mlx5vf_disable_fds(mvdev);
+	vfio_pci_core_close_device(core_vdev);
+}
+
+static const struct vfio_device_ops mlx5vf_pci_ops = {
+	.name = "mlx5-vfio-pci",
+	.open_device = mlx5vf_pci_open_device,
+	.close_device = mlx5vf_pci_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.migration_set_state = mlx5vf_pci_set_device_state,
+	.migration_get_state = mlx5vf_pci_get_device_state,
+};
+
+static int mlx5vf_pci_probe(struct pci_dev *pdev,
+			    const struct pci_device_id *id)
+{
+	struct mlx5vf_pci_core_device *mvdev;
+	int ret;
+
+	mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL);
+	if (!mvdev)
+		return -ENOMEM;
+	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
+
+	if (pdev->is_virtfn) {
+		struct mlx5_core_dev *mdev =
+			mlx5_vf_get_core_dev(pdev);
+
+		if (mdev) {
+			if (MLX5_CAP_GEN(mdev, migration)) {
+				mvdev->migrate_cap = 1;
+				mvdev->core_device.vdev.migration_flags =
+					VFIO_MIGRATION_STOP_COPY |
+					VFIO_MIGRATION_P2P;
+				mutex_init(&mvdev->state_mutex);
+			}
+			mlx5_vf_put_core_dev(mdev);
+		}
+	}
+
+	ret = vfio_pci_core_register_device(&mvdev->core_device);
+	if (ret)
+		goto out_free;
+
+	dev_set_drvdata(&pdev->dev, mvdev);
+	return 0;
+
+out_free:
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+	return ret;
+}
+
+static void mlx5vf_pci_remove(struct pci_dev *pdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(&mvdev->core_device);
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+}
+
+static const struct pci_device_id mlx5vf_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_MELLANOX, 0x101e) }, /* ConnectX Family mlx5Gen Virtual Function */
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
+
+static struct pci_driver mlx5vf_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = mlx5vf_pci_table,
+	.probe = mlx5vf_pci_probe,
+	.remove = mlx5vf_pci_remove,
+};
+
+static void __exit mlx5vf_pci_cleanup(void)
+{
+	pci_unregister_driver(&mlx5vf_pci_driver);
+}
+
+static int __init mlx5vf_pci_init(void)
+{
+	return pci_register_driver(&mlx5vf_pci_driver);
+}
+
+module_init(mlx5vf_pci_init);
+module_exit(mlx5vf_pci_cleanup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
+MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
+MODULE_DESCRIPTION(
+	"MLX5 VFIO PCI - User Level meta-driver for MLX5 device family");
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V7 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected()
  2022-02-07 17:22 [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (11 preceding siblings ...)
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
@ 2022-02-07 17:22 ` Yishai Hadas
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Yishai Hadas @ 2022-02-07 17:22 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

Expose vfio_pci_core_aer_err_detected() to be used by drivers as part of
their pci_error_handlers structure.

The next patch for the mlx5 driver will use it.
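
As a rough illustration (the my_vfio_* names below are placeholders;
the real wiring for mlx5 appears in the next patch), a variant driver
plugs the exported helper into its pci_error_handlers like so:

	static void my_vfio_pci_reset_done(struct pci_dev *pdev)
	{
		/* driver specific state fixup after the device was reset */
	}

	static const struct pci_error_handlers my_vfio_err_handlers = {
		.reset_done = my_vfio_pci_reset_done,
		.error_detected = vfio_pci_core_aer_err_detected,
	};

	static struct pci_driver my_vfio_pci_driver = {
		.name = KBUILD_MODNAME,
		/* .id_table, .probe, .remove as usual */
		.err_handler = &my_vfio_err_handlers,
	};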

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 7 ++++---
 include/linux/vfio_pci_core.h    | 2 ++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 106e1970d653..e301092e94ef 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1871,8 +1871,8 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+						pci_channel_state_t state)
 {
 	struct vfio_pci_core_device *vdev;
 	struct vfio_device *device;
@@ -1894,6 +1894,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
 
 int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 {
@@ -1916,7 +1917,7 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
 const struct pci_error_handlers vfio_pci_core_err_handlers = {
-	.error_detected = vfio_pci_aer_err_detected,
+	.error_detected = vfio_pci_core_aer_err_detected,
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index beba0b2ed87d..9f1bf8e49d43 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -232,6 +232,8 @@ int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
 void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
 void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+						pci_channel_state_t state);
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler
  2022-02-07 17:22 [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (12 preceding siblings ...)
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
@ 2022-02-07 17:22 ` Yishai Hadas
  2022-02-09  0:08   ` Alex Williamson
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
  2022-02-18  8:11 ` [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Tarun Gupta (SW-GPU)
  15 siblings, 1 reply; 50+ messages in thread
From: Yishai Hadas @ 2022-02-07 17:22 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

Register its own handler for pci_error_handlers.reset_done and update
the migration state accordingly.

As the higher VFIO layers may hold locks across reset, the handler
defers the cleanup to the state_mutex unlock path whenever that mutex
is already taken ('deferred_reset'), avoiding an ABBA deadlock with
mm_lock.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/mlx5/main.c | 57 ++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index acd205bcff70..63a889210ef3 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -28,9 +28,12 @@
 struct mlx5vf_pci_core_device {
 	struct vfio_pci_core_device core_device;
 	u8 migrate_cap:1;
+	u8 deferred_reset:1;
 	/* protect migration state */
 	struct mutex state_mutex;
 	enum vfio_device_mig_state mig_state;
+	/* protect the reset_done flow */
+	spinlock_t reset_lock;
 	u16 vhca_id;
 	struct mlx5_vf_migration_file *resuming_migf;
 	struct mlx5_vf_migration_file *saving_migf;
@@ -437,6 +440,25 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 	return ERR_PTR(-EINVAL);
 }
 
+/*
+ * This function is called in all state_mutex unlock cases to
+ * handle a 'deferred_reset' if one exists.
+ */
+static void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev)
+{
+again:
+	spin_lock(&mvdev->reset_lock);
+	if (mvdev->deferred_reset) {
+		mvdev->deferred_reset = false;
+		spin_unlock(&mvdev->reset_lock);
+		mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+		mlx5vf_disable_fds(mvdev);
+		goto again;
+	}
+	mutex_unlock(&mvdev->state_mutex);
+	spin_unlock(&mvdev->reset_lock);
+}
+
 static struct file *
 mlx5vf_pci_set_device_state(struct vfio_device *vdev,
 			    enum vfio_device_mig_state new_state)
@@ -465,7 +487,7 @@ mlx5vf_pci_set_device_state(struct vfio_device *vdev,
 			break;
 		}
 	}
-	mutex_unlock(&mvdev->state_mutex);
+	mlx5vf_state_mutex_unlock(mvdev);
 	return res;
 }
 
@@ -477,10 +499,34 @@ static int mlx5vf_pci_get_device_state(struct vfio_device *vdev,
 
 	mutex_lock(&mvdev->state_mutex);
 	*curr_state = mvdev->mig_state;
-	mutex_unlock(&mvdev->state_mutex);
+	mlx5vf_state_mutex_unlock(mvdev);
 	return 0;
 }
 
+static void mlx5vf_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+
+	if (!mvdev->migrate_cap)
+		return;
+
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock, we need to prevent an ABBA
+	 * deadlock between the state_mutex and the mm_lock.
+	 * In case the state_mutex was already taken, we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&mvdev->reset_lock);
+	mvdev->deferred_reset = true;
+	if (!mutex_trylock(&mvdev->state_mutex)) {
+		spin_unlock(&mvdev->reset_lock);
+		return;
+	}
+	spin_unlock(&mvdev->reset_lock);
+	mlx5vf_state_mutex_unlock(mvdev);
+}
+
 static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
 {
 	struct mlx5vf_pci_core_device *mvdev = container_of(
@@ -562,6 +608,7 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 					VFIO_MIGRATION_STOP_COPY |
 					VFIO_MIGRATION_P2P;
 				mutex_init(&mvdev->state_mutex);
+				spin_lock_init(&mvdev->reset_lock);
 			}
 			mlx5_vf_put_core_dev(mdev);
 		}
@@ -596,11 +643,17 @@ static const struct pci_device_id mlx5vf_pci_table[] = {
 
 MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
 
+static const struct pci_error_handlers mlx5vf_err_handlers = {
+	.reset_done = mlx5vf_pci_aer_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
 static struct pci_driver mlx5vf_pci_driver = {
 	.name = KBUILD_MODNAME,
 	.id_table = mlx5vf_pci_table,
 	.probe = mlx5vf_pci_probe,
 	.remove = mlx5vf_pci_remove,
+	.err_handler = &mlx5vf_err_handlers,
 };
 
 static void __exit mlx5vf_pci_cleanup(void)
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-07 17:22 [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (13 preceding siblings ...)
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
@ 2022-02-07 17:22 ` Yishai Hadas
  2022-02-17 17:15   ` Alex Williamson
  2022-02-18  8:01   ` Tian, Kevin
  2022-02-18  8:11 ` [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Tarun Gupta (SW-GPU)
  15 siblings, 2 replies; 50+ messages in thread
From: Yishai Hadas @ 2022-02-07 17:22 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

From: Jason Gunthorpe <jgg@nvidia.com>

The optional PRE_COPY states open the saving data transfer FD before
reaching STOP_COPY and allow the device to dirty-track internal state
changes, with the general idea of reducing the volume of data
transferred in the STOP_COPY stage.

While in PRE_COPY the device remains RUNNING, but the saving FD is open.

Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
which halts P2P transfers while continuing the saving FD.

PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
and exists as an optional FSM branch between RUNNING and STOP_COPY:
    RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY

A new ioctl, VFIO_DEVICE_MIG_PRECOPY, is provided to allow userspace to
query the progress of the precopy operation in the driver, the idea
being that it will judge when to move to STOP_COPY: at least once the
initial data set is transferred, and possibly after the dirty size has
shrunk appropriately.

We think there may also be merit in future extensions to the
VFIO_DEVICE_MIG_PRECOPY ioctl to command the device to throttle the
rate at which it generates internal dirty state.

Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made
optional and left to be defined in the future. While making the whole
PRE_COPY feature optional eliminates the concern from mlx5, this is
still a complicated arc to implement, and it seems prudent to leave it
closed until a proper use case is developed. We also split the
pending_bytes report into initial and sustaining values, and define
the protocol to get an event via poll() for new dirty data during
PRE_COPY.
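
To illustrate the intended userspace flow, a precopy loop over the new
ioctl could look roughly like the sketch below. This is not part of the
kernel patch: send_to_destination() and dirty_is_acceptable() are
hypothetical application helpers, and device state changes themselves
happen separately via the VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE ioctl:

	#include <errno.h>
	#include <poll.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	/* Returns 0 once it is reasonable to move the device to STOP_COPY */
	static int precopy_loop(int data_fd)
	{
		struct vfio_device_mig_precopy precopy = {
			.argsz = sizeof(precopy),
		};
		char buf[4096];

		for (;;) {
			ssize_t n = read(data_fd, buf, sizeof(buf));

			if (n < 0) {
				if (errno != ENOMSG)
					return -1;
				/* Temporary end of stream: wait for dirty data */
				struct pollfd pfd = {
					.fd = data_fd,
					.events = POLLIN,
				};
				if (poll(&pfd, 1, -1) < 0)
					return -1;
				continue;
			}
			if (n > 0)
				send_to_destination(buf, n);

			if (ioctl(data_fd, VFIO_DEVICE_MIG_PRECOPY, &precopy))
				return -1;
			if (!precopy.initial_bytes &&
			    dirty_is_acceptable(precopy.dirty_bytes))
				return 0;
		}
	}

Once this returns 0, userspace would set the device_state to STOP_COPY
and drain the rest of the stream until read() returns 0.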

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       |  71 +++++++++++++++++++++++-
 include/uapi/linux/vfio.h | 110 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 176 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 8c484593dfe0..b4c585114ef3 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1577,7 +1577,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_PRE_COPY_P2P + 1 };
 	/*
 	 * The coding in this table requires the driver to implement
 	 * FSM arcs:
@@ -1596,25 +1596,59 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 	 *         RUNNING -> STOP
 	 *         STOP -> RUNNING
 	 *
+	 * If precopy is supported then the driver must support these additional
+	 * FSM arcs:
+	 *         RUNNING -> PRE_COPY
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> STOP_COPY
+	 * However, if precopy and P2P are supported together then the driver
+	 * must support these additional arcs beyond the P2P arcs above:
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> PRE_COPY_P2P
+	 *         PRE_COPY_P2P -> PRE_COPY
+	 *         PRE_COPY_P2P -> RUNNING_P2P
+	 *         PRE_COPY_P2P -> STOP_COPY
+	 *         RUNNING -> PRE_COPY
+	 *         RUNNING_P2P -> PRE_COPY_P2P
+	 *
 	 * If all optional features are supported then the coding will step
 	 * through multiple states for these combination transitions:
+	 *         PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RUNNING -> RUNNING_P2P -> STOP
 	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         STOP -> RUNNING_P2P -> RUNNING
+	 *         STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         STOP_COPY -> STOP -> RESUMING
 	 *         STOP_COPY -> STOP -> RUNNING_P2P
 	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
+	 *
+	 *  The following transitions are blocked:
+	 *         STOP_COPY -> PRE_COPY
+	 *         STOP_COPY -> PRE_COPY_P2P
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1623,14 +1657,38 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
+		[VFIO_DEVICE_STATE_PRE_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1639,6 +1697,8 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RESUMING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1647,6 +1707,8 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1655,6 +1717,8 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_ERROR] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
@@ -1665,6 +1729,11 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY] =
+			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_MIGRATION_STOP_COPY |
+						   VFIO_MIGRATION_P2P |
+						   VFIO_MIGRATION_PRE_COPY,
 		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] =
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 227f55d57e06..6424c5b3415b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -817,12 +817,20 @@ struct vfio_device_feature {
  * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
  * is supported in addition to the STOP_COPY states.
  *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
+ * PRE_COPY is supported in addition to the STOP_COPY states.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
+ * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
+ * in addition to the STOP_COPY states.
+ *
  * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
 #define VFIO_MIGRATION_P2P		(1 << 1)
+#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -873,8 +881,13 @@ struct vfio_device_feature_mig_state {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
- * And 1 optional state to support VFIO_MIGRATION_P2P:
+ * And optional states to support VFIO_MIGRATION_P2P:
  *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ * And VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY - The device is running normally but tracking internal state
+ *             changes
+ * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
  *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
@@ -906,20 +919,48 @@ struct vfio_device_feature_mig_state {
  *
  *   To abort a RESUMING session the device must be reset.
  *
+ * PRE_COPY -> RUNNING
  * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * PRE_COPY_P2P -> RUNNING_P2P
  * RUNNING -> RUNNING_P2P
  * STOP -> RUNNING_P2P
  *   While in RUNNING_P2P the device is partially running in the P2P quiescent
  *   state defined below.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * RUNNING -> PRE_COPY
+ * RUNNING_P2P -> PRE_COPY_P2P
  * STOP -> STOP_COPY
- *   This arc begin the process of saving the device state and will return a
- *   new data_fd.
+ *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ *   which share a data transfer session. Moving between these states alters
+ *   what is streamed in session, but does not terminate or otherwise affect
+ *   the associated fd.
+ *
+ *   These arcs begin the process of saving the device state and will return a
+ *   new data_fd. The migration driver may perform actions such as enabling
+ *   dirty logging of device state when entering PRE_COPY or PRE_COPY_P2P.
  *
+ *   Each arc does not change the device operation; the device remains
+ *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
+ *   in PRE_COPY_P2P -> STOP_COPY.
+ *
+ * PRE_COPY -> PRE_COPY_P2P
+ *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ *   However, while in the PRE_COPY_P2P state, the device is partially running
+ *   in the P2P quiescent state defined below, like RUNNING_P2P.
+ *
+ * PRE_COPY_P2P -> PRE_COPY
+ *   This arc allows returning the device to a full RUNNING behavior while
+ *   continuing all the behaviors of PRE_COPY.
+ *
+ * PRE_COPY_P2P -> STOP_COPY
  *   While in the STOP_COPY state the device has the same behavior as STOP
  *   with the addition that the data transfers session continues to stream the
  *   migration state. End of stream on the FD indicates the entire device
@@ -937,6 +978,13 @@ struct vfio_device_feature_mig_state {
  *   internal device state for this arc if required to prepare the device to
  *   receive the migration data.
  *
+ * STOP_COPY -> PRE_COPY
+ * STOP_COPY -> PRE_COPY_P2P
+ *   These arcs are not permitted and return error if requested. Future
+ *   revisions of this API may define behaviors for these arcs, in this case
+ *   support will be discoverable by a new flag in
+ *   VFIO_DEVICE_FEATURE_MIGRATION.
+ *
  * any -> ERROR
  *   ERROR cannot be specified as a device state, however any transition request
  *   can be failed with an errno return and may then move the device_state into
@@ -948,7 +996,7 @@ struct vfio_device_feature_mig_state {
  * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
  * state for the device for the purposes of managing multiple devices within a
  * user context where peer-to-peer DMA between devices may be active. The
- * RUNNING_P2P states must prevent the device from initiating
+ * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
  * any new P2P DMA transactions. If the device can identify P2P transactions
  * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
  * driver must complete any such outstanding operations prior to completing the
@@ -959,6 +1007,8 @@ struct vfio_device_feature_mig_state {
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
  *   - Select the shortest path.
+ *   - The path cannot have saving group states as interior arcs, only
+ *     starting/end states.
  * Refer to vfio_mig_get_next_state() for the result of the algorithm.
  *
  * The automatic transit through the FSM arcs that make up the combination
@@ -972,6 +1022,9 @@ struct vfio_device_feature_mig_state {
 * support them. The user can discover if these states are supported by using
  * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
  * avoid knowing about these optional states if the kernel driver supports them.
+ *
+ * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
+ * is not present.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -980,8 +1033,57 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+	VFIO_DEVICE_STATE_PRE_COPY = 6,
+	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
+};
+
+/**
+ * VFIO_DEVICE_MIG_PRECOPY - _IO(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * This ioctl is used on the migration data FD in the precopy phase of the
+ * migration data transfer. It returns an estimate of the current data sizes
+ * remaining to be transferred. It allows the user to judge when it is
+ * appropriate to leave PRE_COPY for STOP_COPY.
+ *
+ * initial_bytes reflects the estimated remaining size of any initial mandatory
+ * precopy data transfer. When initial_bytes returns as zero then the initial
+ * phase of the precopy data is completed. Generally initial_bytes should start
+ * out as approximately the entire device state.
+ *
+ * dirty_bytes reflects an estimate for how much more data needs to be
+ * transferred to complete the migration. Generally it should start as zero
+ * and increase as internal state is dirtied.
+ *
+ * Drivers should attempt to return estimates so that initial_bytes +
+ * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
+ * will require to be streamed.
+ *
+ * Drivers have a lot of flexibility in when and what they transfer during the
+ * PRE_COPY phase, and how they report this from VFIO_DEVICE_MIG_PRECOPY.
+ *
+ * During pre-copy the migration data FD has a temporary "end of stream" that is
+ * reached when both initial_bytes and dirty_bytes are zero. For instance, this
+ * may indicate that the device is idle and not currently dirtying any internal
+ * state. When read() is done on this temporary end of stream the kernel driver
+ * should return ENOMSG from read(). Userspace can wait for more data (which may
+ * never come) by using poll.
+ *
+ * Once in STOP_COPY the migration data FD has a permanent end of stream
+ * signaled in the usual way by read() always returning 0 and poll always
+ * returning readable. ENOMSG may not be returned in STOP_COPY. Support
+ * for this ioctl is optional.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_device_mig_precopy {
+	__u32 argsz;
+	__u32 flags;
+	__aligned_u64 initial_bytes;
+	__aligned_u64 dirty_bytes;
 };
 
+#define VFIO_DEVICE_MIG_PRECOPY _IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
@ 2022-02-09  0:07   ` Alex Williamson
  2022-02-09  2:36     ` Jason Gunthorpe
  2022-02-15  8:04   ` Tian, Kevin
  1 sibling, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2022-02-09  0:07 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Mon, 7 Feb 2022 19:22:09 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:
> +static int
> +vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device,
> +					   u32 flags, void __user *arg,
> +					   size_t argsz)
> +{
> +	size_t minsz =
> +		offsetofend(struct vfio_device_feature_mig_state, data_fd);
> +	struct vfio_device_feature_mig_state mig;

Perhaps set default data_fd here?  ie.

  struct vfio_device_feature_mig_state mig = { .data_fd = -1 };

> +	struct file *filp = NULL;
> +	int ret;
> +
> +	if (!device->ops->migration_set_state ||
> +	    !device->ops->migration_get_state)
> +		return -ENOTTY;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(mig));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&mig, arg, minsz))
> +		return -EFAULT;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		enum vfio_device_mig_state curr_state;
> +
> +		ret = device->ops->migration_get_state(device, &curr_state);
> +		if (ret)
> +			return ret;
> +		mig.device_state = curr_state;
> +		goto out_copy;
> +	}
> +
> +	/* Handle the VFIO_DEVICE_FEATURE_SET */
> +	filp = device->ops->migration_set_state(device, mig.device_state);
> +	if (IS_ERR(filp) || !filp)
> +		goto out_copy;
> +
> +	return vfio_ioct_mig_return_fd(filp, arg, &mig);
> +out_copy:
> +	mig.data_fd = -1;
> +	if (copy_to_user(arg, &mig, sizeof(mig)))
> +		return -EFAULT;
> +	if (IS_ERR(filp))
> +		return PTR_ERR(filp);
> +	return 0;
> +}
...
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ca69516f869d..3f4a1a7c2277 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -56,6 +56,13 @@ struct vfio_device {
>   *         match, -errno for abort (ex. match with insufficient or incorrect
>   *         additional args)
>   * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
> + * @migration_set_state: Optional callback to change the migration state for
> + *         devices that support migration. The returned FD is used for data
> + *         transfer according to the FSM definition. The driver is responsible
> + *         to ensure that FD is isolated whenever the migration FSM leaves a
> + *         data transfer state or before close_device() returns.
> + @migration_get_state: Optional callback to get the migration state for

Fix formatting, " * @mig..."

> + *         devices that support migration.
>   */
>  struct vfio_device_ops {
>  	char	*name;
...
> +/*
> + * Indicates the device can support the migration API. See enum
> + * vfio_device_mig_state for details. If present flags must be non-zero and
> + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
> + *
> + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
> + * RESUMING are supported.
> + */
> +struct vfio_device_feature_migration {
> +	__aligned_u64 flags;
> +#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
> +};
> +#define VFIO_DEVICE_FEATURE_MIGRATION 1
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET,
> + * Execute a migration state change on the VFIO device.
> + * The new state is supplied in device_state.
> + *
> + * The kernel migration driver must fully transition the device to the new state
> + * value before the write(2) operation returns to the user.

Stale comment, there's no write(2) anymore.

> + *
> + * The kernel migration driver must not generate asynchronous device state
> + * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
> + * ioctl as described above.
> + *
> + * If this function fails then current device_state may be the original
> + * operating state or some other state along the combination transition path.
> + * The user can then decide if it should execute a VFIO_DEVICE_RESET, attempt
> + * to return to the original state, or attempt to return to some other state
> + * such as RUNNING or STOP.
> + *
> + * If the new_state starts a new data transfer session then the FD associated
> + * with that session is returned in data_fd. The user is responsible to close
> + * this FD when it is finished. The user must consider the migration data
> + * segments carried over the FD to be opaque and non-fungible. During RESUMING,
> + * the data segments must be written in the same order they came out of the
> + * saving side FD.
> + *
> + * Upon VFIO_DEVICE_FEATURE_GET,
> + * Get the current migration state of the VFIO device, data_fd will be -1.
> + */
> +struct vfio_device_feature_mig_state {
> +	__u32 device_state; /* From enum vfio_device_mig_state */
> +	__s32 data_fd;
> +};
> +#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2
> +
> +/*
> + * The device migration Finite State Machine is described by the enum
> + * vfio_device_mig_state. Some of the FSM arcs will create a migration data
> + * transfer session by returning a FD, in this case the migration data will
> + * flow over the FD using read() and write() as discussed below.
> + *
> + * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
> + *  RUNNING - The device is running normally
> + *  STOP - The device does not change the internal or external state
> + *  STOP_COPY - The device internal state can be read out
> + *  RESUMING - The device is stopped and is loading a new internal state
> + *  ERROR - The device has failed and must be reset
> + *
> + * The FSM takes actions on the arcs between FSM states. The driver implements
> + * the following behavior for the FSM arcs:
> + *
> + * RUNNING -> STOP
> + * STOP_COPY -> STOP
> + *   While in STOP the device must stop the operation of the device. The
> + *   device must not generate interrupts, DMA, or advance its internal
> + *   state. When stopped the device and kernel migration driver must accept
> + *   and respond to interaction to support external subsystems in the STOP
> + *   state, for example PCI MSI-X and PCI config space. Failure by the user to
> + *   restrict device access while in STOP must not result in error conditions
> + *   outside the user context (ex. host system faults).
> + *
> + *   The STOP_COPY arc will terminate a data transfer session.
> + *
> + * RESUMING -> STOP
> + *   Leaving RESUMING terminates a data transfer session and indicates the
> + *   device should complete processing of the data delivered by write(). The
> + *   kernel migration driver should complete the incorporation of data written
> + *   to the data transfer FD into the device internal state and perform
> + *   final validity and consistency checking of the new device state. If the
> + *   user provided data is found to be incomplete, inconsistent, or otherwise
> + *   invalid, the migration driver must fail the SET_STATE ioctl and
> + *   optionally go to the ERROR state as described below.
> + *
> + *   While in STOP the device has the same behavior as other STOP states
> + *   described above.
> + *
> + *   To abort a RESUMING session the device must be reset.
> + *
> + * STOP -> RUNNING
> + *   While in RUNNING the device is fully operational, the device may generate
> + *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
> + *   and the device may advance its internal state.
> + *
> + * STOP -> STOP_COPY
> + *   This arc begin the process of saving the device state and will return a
> + *   new data_fd.
> + *
> + *   While in the STOP_COPY state the device has the same behavior as STOP
> + *   with the addition that the data transfers session continues to stream the
> + *   migration state. End of stream on the FD indicates the entire device
> + *   state has been transferred.
> + *
> + *   The user should take steps to restrict access to vfio device regions while
> + *   the device is in STOP_COPY or risk corruption of the device migration data
> + *   stream.
> + *
> + * STOP -> RESUMING
> + *   Entering the RESUMING state starts a process of restoring the device
> + *   state and will return a new data_fd. The data stream fed into the data_fd
> + *   should be taken from the data transfer output of the saving group states
> + *   from a compatible device. The migration driver may alter/reset the
> + *   internal device state for this arc if required to prepare the device to
> + *   receive the migration data.
> + *
> + * any -> ERROR
> + *   ERROR cannot be specified as a device state, however any transition request
> + *   can be failed with an errno return and may then move the device_state into
> + *   ERROR. In this case the device was unable to execute the requested arc and
> + *   was also unable to restore the device to any valid device_state.
> + *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
> + *   device_state back to RUNNING.
> + *
> + * The remaining possible transitions are interpreted as combinations of the
> + * above FSM arcs. As there are multiple paths through the FSM arcs the path
> + * should be selected based on the following rules:
> + *   - Select the shortest path.
> + * Refer to vfio_mig_get_next_state() for the result of the algorithm.
> + *
> + * The automatic transit through the FSM arcs that make up the combination
> + * transition is invisible to the user. When working with combination arcs the
> + * user may see any step along the path in the device_state if SET_STATE
> + * fails. When handling these types of errors users should anticipate future
> + * revisions of this protocol using new states and those states becoming
> + * visible in this case.
> + */
> +enum vfio_device_mig_state {
> +	VFIO_DEVICE_STATE_ERROR = 0,
> +	VFIO_DEVICE_STATE_STOP = 1,
> +	VFIO_DEVICE_STATE_RUNNING = 2,

I'm a little surprised we're not using RUNNING = 0 given all the
objections in the v1 protocol that the default state was non-zero.

> +	VFIO_DEVICE_STATE_STOP_COPY = 3,
> +	VFIO_DEVICE_STATE_RESUMING = 4,
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**

Otherwise, I'm still not sure how userspace handles the fact that it
can't know how much data will be read from the device and how important
that is.  There's no replacement of that feature from the v1 protocol
here.

As you noted, it's not only the size of the migration data, but also
the rate the device can generate it.  However, I also expect that it's
generally the external rate that's the limiting factor.  I've not
previously seen any evidence that the device rate is taken into account.

I also think we're still waiting for confirmation from owners of
devices with extremely large device states (vGPUs) whether they consider
the stream FD sufficient versus their ability to directly mmap regions
of the device in the previous protocol.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
@ 2022-02-09  0:07   ` Alex Williamson
  0 siblings, 0 replies; 50+ messages in thread
From: Alex Williamson @ 2022-02-09  0:07 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Mon, 7 Feb 2022 19:22:13 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> This patch adds support for vfio_pci driver for mlx5 devices.
> 
> It uses vfio_pci_core to register to the VFIO subsystem and then
> implements the mlx5 specific logic in the migration area.
> 
> The migration implementation follows the definition from uapi/vfio.h and
> uses the mlx5 VF->PF command channel to achieve it.
> 
> This patch implements the suspend/resume flows.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  MAINTAINERS                    |   6 +
>  drivers/vfio/pci/Kconfig       |   3 +
>  drivers/vfio/pci/Makefile      |   2 +
>  drivers/vfio/pci/mlx5/Kconfig  |  10 +
>  drivers/vfio/pci/mlx5/Makefile |   4 +
>  drivers/vfio/pci/mlx5/cmd.h    |   1 +
>  drivers/vfio/pci/mlx5/main.c   | 623 +++++++++++++++++++++++++++++++++
>  7 files changed, 649 insertions(+)
>  create mode 100644 drivers/vfio/pci/mlx5/Kconfig
>  create mode 100644 drivers/vfio/pci/mlx5/Makefile
>  create mode 100644 drivers/vfio/pci/mlx5/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea3e6c914384..5c5216f5e43d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -20260,6 +20260,12 @@ L:	kvm@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vfio/platform/
>  
> +VFIO MLX5 PCI DRIVER
> +M:	Yishai Hadas <yishaih@nvidia.com>
> +L:	kvm@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vfio/pci/mlx5/
> +
>  VGA_SWITCHEROO
>  R:	Lukas Wunner <lukas@wunner.de>
>  S:	Maintained
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 860424ccda1b..187b9c259944 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -43,4 +43,7 @@ config VFIO_PCI_IGD
>  
>  	  To enable Intel IGD assignment through vfio-pci, say Y.
>  endif
> +
> +source "drivers/vfio/pci/mlx5/Kconfig"
> +
>  endif
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 349d68d242b4..ed9d6f2e0555 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>  vfio-pci-y := vfio_pci.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> +
> +obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
> diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
> new file mode 100644
> index 000000000000..29ba9c504a75
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config MLX5_VFIO_PCI
> +	tristate "VFIO support for MLX5 PCI devices"
> +	depends on MLX5_CORE
> +	depends on VFIO_PCI_CORE
> +	help
> +	  This provides migration support for MLX5 devices using the VFIO
> +	  framework.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
> new file mode 100644
> index 000000000000..689627da7ff5
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
> +mlx5-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
> index 69a1481ed953..1392a11a9cc0 100644
> --- a/drivers/vfio/pci/mlx5/cmd.h
> +++ b/drivers/vfio/pci/mlx5/cmd.h
> @@ -12,6 +12,7 @@
>  struct mlx5_vf_migration_file {
>  	struct file *filp;
>  	struct mutex lock;
> +	bool disabled;
>  
>  	struct sg_append_table table;
>  	size_t total_length;
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> new file mode 100644
> index 000000000000..acd205bcff70
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -0,0 +1,623 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/anon_inodes.h>
> +
> +#include "cmd.h"
> +
> +/* Arbitrary to prevent userspace from consuming endless memory */
> +#define MAX_MIGRATION_SIZE (512*1024*1024)
> +
> +struct mlx5vf_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	u8 migrate_cap:1;
> +	/* protect migration state */
> +	struct mutex state_mutex;
> +	enum vfio_device_mig_state mig_state;
> +	u16 vhca_id;
> +	struct mlx5_vf_migration_file *resuming_migf;
> +	struct mlx5_vf_migration_file *saving_migf;
> +};

Nit, migrate_cap and vhca_id could minimally be contiguous for better
packing of this struct.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
@ 2022-02-09  0:08   ` Alex Williamson
  2022-02-09  2:39     ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2022-02-09  0:08 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Mon, 7 Feb 2022 19:22:15 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> Register its own handler for pci_error_handlers.reset_done and update
> state accordingly.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/vfio/pci/mlx5/main.c | 57 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 55 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> index acd205bcff70..63a889210ef3 100644
> --- a/drivers/vfio/pci/mlx5/main.c
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -28,9 +28,12 @@
>  struct mlx5vf_pci_core_device {
>  	struct vfio_pci_core_device core_device;
>  	u8 migrate_cap:1;
> +	u8 deferred_reset:1;
>  	/* protect migration state */
>  	struct mutex state_mutex;
>  	enum vfio_device_mig_state mig_state;
> +	/* protect the reset_done flow */
> +	spinlock_t reset_lock;
>  	u16 vhca_id;
>  	struct mlx5_vf_migration_file *resuming_migf;
>  	struct mlx5_vf_migration_file *saving_migf;
> @@ -437,6 +440,25 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
>  	return ERR_PTR(-EINVAL);
>  }
>  
> +/*
> + * This function is called in all state_mutex unlock cases to
> + * handle a 'deferred_reset' if exists.
> + */
> +static void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev)
> +{
> +again:
> +	spin_lock(&mvdev->reset_lock);
> +	if (mvdev->deferred_reset) {
> +		mvdev->deferred_reset = false;
> +		spin_unlock(&mvdev->reset_lock);
> +		mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
> +		mlx5vf_disable_fds(mvdev);
> +		goto again;
> +	}
> +	mutex_unlock(&mvdev->state_mutex);
> +	spin_unlock(&mvdev->reset_lock);
> +}
> +
>  static struct file *
>  mlx5vf_pci_set_device_state(struct vfio_device *vdev,
>  			    enum vfio_device_mig_state new_state)
> @@ -465,7 +487,7 @@ mlx5vf_pci_set_device_state(struct vfio_device *vdev,
>  			break;
>  		}
>  	}
> -	mutex_unlock(&mvdev->state_mutex);
> +	mlx5vf_state_mutex_unlock(mvdev);
>  	return res;
>  }
>  
> @@ -477,10 +499,34 @@ static int mlx5vf_pci_get_device_state(struct vfio_device *vdev,
>  
>  	mutex_lock(&mvdev->state_mutex);
>  	*curr_state = mvdev->mig_state;
> -	mutex_unlock(&mvdev->state_mutex);
> +	mlx5vf_state_mutex_unlock(mvdev);
>  	return 0;

I still can't see why it wouldn't be both fairly trivial to implement
and a usability improvement if the unlock wrapper returned -EAGAIN on a
deferred reset so we could avoid returning a stale state to the user
and a dead fd in the former case.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-09  0:07   ` Alex Williamson
@ 2022-02-09  2:36     ` Jason Gunthorpe
  2022-02-15 10:41       ` Tian, Kevin
  2022-02-15 10:58       ` Tian, Kevin
  0 siblings, 2 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-09  2:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Tue, Feb 08, 2022 at 05:07:54PM -0700, Alex Williamson wrote:
> On Mon, 7 Feb 2022 19:22:09 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> > +static int
> > +vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device,
> > +					   u32 flags, void __user *arg,
> > +					   size_t argsz)
> > +{
> > +	size_t minsz =
> > +		offsetofend(struct vfio_device_feature_mig_state, data_fd);
> > +	struct vfio_device_feature_mig_state mig;
> 
> Perhaps set default data_fd here?  ie.
> 
>   struct vfio_device_feature_mig_state mig = { .data_fd = -1 };

Why? there is no path where this variable is read before set.

> > +	struct file *filp = NULL;
> > +	int ret;
> > +
> > +	if (!device->ops->migration_set_state ||
> > +	    !device->ops->migration_get_state)
> > +		return -ENOTTY;
> > +
> > +	ret = vfio_check_feature(flags, argsz,
> > +				 VFIO_DEVICE_FEATURE_SET |
> > +				 VFIO_DEVICE_FEATURE_GET,
> > +				 sizeof(mig));
> > +	if (ret != 1)
> > +		return ret;
> > +
> > +	if (copy_from_user(&mig, arg, minsz))
> > +		return -EFAULT;

                   ^^^^^^^^^^^^^^

Is before all gotos.

> > +enum vfio_device_mig_state {
> > +	VFIO_DEVICE_STATE_ERROR = 0,
> > +	VFIO_DEVICE_STATE_STOP = 1,
> > +	VFIO_DEVICE_STATE_RUNNING = 2,
> 
> I'm a little surprised we're not using RUNNING = 0 given all the
> objection in the v1 protocol that the default state was non-zero.

Making ERROR 0 ensures that errors, eg in the FSM table due to a
backport or something, still work properly.

I think we corrected that confusion by explicitly calling out RUNNING
as the default and removing the register-like region API.
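
e.g. with designated initializers in the FSM table (sketch;
VFIO_DEVICE_NUM_STATES is an assumed helper constant):

	static const u8 fsm_table[VFIO_DEVICE_NUM_STATES]
				 [VFIO_DEVICE_NUM_STATES] = {
		[VFIO_DEVICE_STATE_RUNNING] = {
			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
			/* Any arc a backport fails to fill in stays 0, which
			 * is ERROR, so it is rejected instead of silently
			 * aliasing a valid state. */
		},
	};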

> >  /* -------- API for Type1 VFIO IOMMU -------- */
> >  
> >  /**
> 
> Otherwise, I'm still not sure how userspace handles the fact that it
> can't know how much data will be read from the device and how important
> that is.  There's no replacement of that feature from the v1 protocol
> here.

I'm not sure this was part of the v1 protocol either. Yes it had a
pending_bytes, but I don't think it was actually expected to be 100%
accurate. Computing this value accurately is potentially quite
expensive, I would prefer we not enforce this on an implementation
without a reason, and qemu currently doesn't make use of it.

The ioctl from the precopy patch is probably the best approach, I
think it would be fine to allow that for stop copy as well, but also
don't see a usage right now.

It is not something that needs decision now, it is very easy to detect
if an ioctl is supported on the data_fd at runtime to add new things
here when needed.
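
e.g. userspace needs nothing more than (VFIO_MIG_GET_PRECOPY_INFO and
struct vfio_precopy_info are made-up names here, the probing pattern is
the point):

	struct vfio_precopy_info info = {};

	if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info) == 0)
		plan_with_estimate(&info);	/* newer kernel/driver */
	else if (errno == ENOTTY)
		plan_without_estimate();	/* ioctl not implemented */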

> I also think we're still waiting for confirmation from owners of
> devices with extremely large device states (vGPUs) whether they consider
> the stream FD sufficient versus their ability to directly mmap regions
> of the device in the previous protocol.  Thanks,

As is this.

I think the mlx5 and Huawei patches show that without a doubt the
stream fd is the correct choice for these drivers.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler
  2022-02-09  0:08   ` Alex Williamson
@ 2022-02-09  2:39     ` Jason Gunthorpe
  2022-02-10 16:48       ` Alex Williamson
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-09  2:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Tue, Feb 08, 2022 at 05:08:01PM -0700, Alex Williamson wrote:
> > @@ -477,10 +499,34 @@ static int mlx5vf_pci_get_device_state(struct vfio_device *vdev,
> >  
> >  	mutex_lock(&mvdev->state_mutex);
> >  	*curr_state = mvdev->mig_state;
> > -	mutex_unlock(&mvdev->state_mutex);
> > +	mlx5vf_state_mutex_unlock(mvdev);
> >  	return 0;
> 
> > I still can't see why it wouldn't be both fairly trivial to implement
> > and a usability improvement if the unlock wrapper returned -EAGAIN on a
> > deferred reset, so we could avoid returning a stale state to the user
> > and a dead fd in the former case.  Thanks,

It simply is not useful - again, we always resolve this race, which
should never happen, as though the two events happened consecutively,
which is what would normally happen if we could use a simple mutex. We
do not need to add any more complexity to deal with this already
troublesome thing.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler
  2022-02-09  2:39     ` Jason Gunthorpe
@ 2022-02-10 16:48       ` Alex Williamson
  2022-02-10 17:27         ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2022-02-10 16:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Tue, 8 Feb 2022 22:39:18 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 08, 2022 at 05:08:01PM -0700, Alex Williamson wrote:
> > > @@ -477,10 +499,34 @@ static int mlx5vf_pci_get_device_state(struct vfio_device *vdev,
> > >  
> > >  	mutex_lock(&mvdev->state_mutex);
> > >  	*curr_state = mvdev->mig_state;
> > > -	mutex_unlock(&mvdev->state_mutex);
> > > +	mlx5vf_state_mutex_unlock(mvdev);
> > >  	return 0;  
> > 
> > I still can't see why it wouldn't be both fairly trivial to implement
> > and a usability improvement if the unlock wrapper returned -EAGAIN on a
> > deferred reset, so we could avoid returning a stale state to the user
> > and a dead fd in the former case.  Thanks,
> 
> It simply is not useful - again, we always resolve this race, which
> should never happen, as though the two events happened consecutively,
> which is what would normally happen if we could use a simple mutex. We
> do not need to add any more complexity to deal with this already
> troublesome thing.

So walk me through how this works with QEMU; it's easy to hand-wave a
userspace race and move on, but device reset can be triggered by guest
behavior while migration is supposed to be transparent to the guest.
These are essentially asynchronous threads where we're imposing a
synchronization point or lots of double checking in userspace whether
the device actually entered the state we think it did and if the
returned FD is usable.

Specifically, I suspect we can trigger this race if the VM reboots as
we're initiating a migration in the STOP_COPY phase, but that's maybe
less interesting if we expect the VM to be halted before the device
state is stepped.  More interesting might be how a PRE_COPY transition
works relative to asynchronous VM resets triggering device resets.  Are
we serializing all access to reset vs this DEVICE_FEATURE op or are we
resorting to double checking the device state, and how do we plan to
re-initiate migration states if a VM reset occurs during migration?
Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler
  2022-02-10 16:48       ` Alex Williamson
@ 2022-02-10 17:27         ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-10 17:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Thu, Feb 10, 2022 at 09:48:11AM -0700, Alex Williamson wrote:

> Specifically, I suspect we can trigger this race if the VM reboots as
> we're initiating a migration in the STOP_COPY phase, but that's maybe
> less interesting if we expect the VM to be halted before the device
> state is stepped.  

Yes, STOP_COPY drivers like mlx5/acc are fine here inherently.

We have already restricted what device touches are allowed in
STOP_COPY, and this must include reset too. Neither of the two drivers
posted can tolerate a reset during the serialization step.

mlx5 will fail the STOP_COPY FW command and I guess acc will 'tear'
its register reads and produce a corrupted state.

> More interesting might be how a PRE_COPY transition works relative
> to asynchronous VM resets triggering device resets.  Are we
> serializing all access to reset vs this DEVICE_FEATURE op or are we
> resorting to double checking the device state, and how do we plan to
> re-initiate migration states if a VM reset occurs during migration?
> Thanks,

The device will be in PRE_COPY with VCPUs running. An async reset will
be triggered in the guest, so the device returns to RUNNING and the
data_fd's immediately return an errno.

There are three ways qemu can observe this:

 1) it is actively using the data_fds, so it immediately gets an
    error and propagates it up, aborting the migration
    eg it is doing read(), poll(), iouring, etc.

 2) it is done with the PRE_COPY phase of the data_fd and is moving
    toward STOP_COPY.
    In this case the vCPU is halted and the SET_STATE to STOP_COPY
    will execute, without any race, either:
      PRE_COPY -> STOP_COPY (data_fd == -1)
      RUNNING -> STOP_COPY (data_fd != -1)
    The expected data_fd is detected in the WIP qemu patch; however, it
    mishandles the error, which we will fix.

 3) it is aborting the PRE_COPY migration, closing the data_fd and
    doing SET_STATE to RUNNING. In which case it doesn't know the
    device was reset. close() succeeds and SET_STATE RUNNING -> RUNNING
    is a nop.
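
For case (1) the detection is just the ordinary error path on the
stream, roughly (abort_migration() is a stand-in):

	ssize_t n = read(data_fd, buf, sizeof(buf));

	if (n < 0)
		/* The device reset under us; this PRE_COPY session is
		 * dead, propagate the error up. */
		return abort_migration(-errno);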

Today's qemu must abort the migration at this point and fully restart
it because it has no mechanism to serialize a 'discard all of this
device's PRE_COPY state up to here' tag.

Some future qemu could learn to do this and then the receiver would
discard already sent device state - by triggering reset and a new
RUNNING -> RESUMING on the receiving device. In this case qemu would
have a choice of:
  abort the entire migration
  restart just this device back to PRE_COPY
  stop the vCPUs and use STOP_COPY

In any case, qemu fully detects this race as a natural part of its
operations and knows with certainty when it commands to go to
STOP_COPY, with vCPUs halted, if the preceding PRE_COPY state is
correct or not.

It is interesting you bring this up, I'm not sure this worked properly
with v1. It seems we have solved it, inadvertently even, by using the
basic logic of the FSM and FD.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 10/15] vfio: Remove migration protocol v1 documentation
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 10/15] vfio: Remove migration protocol v1 documentation Yishai Hadas
@ 2022-02-11 11:03   ` Cornelia Huck
  0 siblings, 0 replies; 50+ messages in thread
From: Cornelia Huck @ 2022-02-11 11:03 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi

On Mon, Feb 07 2022, Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
>
> v1 was never implemented and is replaced by v2.
>
> The old uAPI documentation is removed from the header file.
>
> The old uAPI definitions are still kept in the header file till v2 will
> reach Linus's tree.

That sentence is a bit weird: If this file has reached Linus' tree,
obviously v2 has reached Linus' tree. Maybe replace with:

"The old uAPI definitions are still kept in the header file to ease
transition for userspace copying these headers."

>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 200 +-------------------------------------
>  1 file changed, 2 insertions(+), 198 deletions(-)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 773895988cf1..227f55d57e06 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -323,7 +323,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> -#define VFIO_REGION_TYPE_MIGRATION              (3)
> +#define VFIO_REGION_TYPE_MIGRATION_DEPRECATED   (3)

This will still break QEMU compilation after a headers update (although
it's not hard to fix). I think we can live with that if needed.
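
e.g. a compat shim on the QEMU side until it is converted to v2
(sketch):

#ifndef VFIO_REGION_TYPE_MIGRATION
#define VFIO_REGION_TYPE_MIGRATION VFIO_REGION_TYPE_MIGRATION_DEPRECATED
#endif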

>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
  2022-02-09  0:07   ` Alex Williamson
@ 2022-02-15  8:04   ` Tian, Kevin
  2022-02-15 15:33     ` Jason Gunthorpe
  1 sibling, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-15  8:04 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	maorg, Raj, Ashok, shameerali.kolothum.thodi

> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Tuesday, February 8, 2022 1:22 AM
> 
> +static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
> +					       u32 flags, void __user *arg,
> +					       size_t argsz)
> +{
> +	struct vfio_device_feature_migration mig = {
> +		.flags = VFIO_MIGRATION_STOP_COPY,
> +	};
> +	int ret;
> +
> +	if (!device->ops->migration_set_state)
> +		return -ENOTTY;

Missing a check on migration_get_state, as done in the last function.
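
i.e. presumably mirroring the check in
vfio_ioctl_device_feature_mig_device_state():

	if (!device->ops->migration_set_state ||
	    !device->ops->migration_get_state)
		return -ENOTTY;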

> + * @migration_set_state: Optional callback to change the migration state for
> + *         devices that support migration. The returned FD is used for data
> + *         transfer according to the FSM definition. The driver is responsible
> + *         to ensure that FD is isolated whenever the migration FSM leaves a
> + *         data transfer state or before close_device() returns.

didn't understand the meaning of 'isolated' here.

> +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
> > +				     VFIO_DEVICE_STATE_V1_SAVING |  \
> > +				     VFIO_DEVICE_STATE_V1_RESUMING)

Does it make sense to also add 'V1' to MASK and also following macros
given their names are general?

  #define VFIO_DEVICE_STATE_VALID(state) \
  #define VFIO_DEVICE_STATE_IS_ERROR(state) \
  #define VFIO_DEVICE_STATE_SET_ERROR(state) \

It certainly implies more changes to v1 code but readability can be
slightly improved.

> +/*
> + * Indicates the device can support the migration API. See enum

call it V2? Not necessary to add V2 in code but worthy of a clarification
in comment.

> + * vfio_device_mig_state for details. If present flags must be non-zero and
> + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
> + *
> + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
> + * RESUMING are supported.
> + */

Not aligned with other places where 5 states are mentioned. Better add
ERROR here.

> + *
> + * RUNNING -> STOP
> + * STOP_COPY -> STOP
> + *   While in STOP the device must stop the operation of the device. The
> + *   device must not generate interrupts, DMA, or advance its internal
> + *   state. When stopped the device and kernel migration driver must accept
> + *   and respond to interaction to support external subsystems in the STOP
> + *   state, for example PCI MSI-X and PCI config pace. Failure by the user to
> + *   restrict device access while in STOP must not result in error conditions
> + *   outside the user context (ex. host system faults).

Right above the STOP state is defined as:

       *  STOP - The device does not change the internal or external state

'external state' I assume means P2P activities. For consistency it is clearer
to also say something about external state in above paragraph.

> + *
> + *   The STOP_COPY arc will terminate a data transfer session.

remove 'will'

> + *
> + * STOP -> STOP_COPY
> + *   This arc begin the process of saving the device state and will return a
> + *   new data_fd.
> + *
> + *   While in the STOP_COPY state the device has the same behavior as STOP
> + *   with the addition that the data transfers session continues to stream the
> + *   migration state. End of stream on the FD indicates the entire device
> + *   state has been transferred.
> + *
> + *   The user should take steps to restrict access to vfio device regions while
> + *   the device is in STOP_COPY or risk corruption of the device
> + *   migration data stream.

Restricting access has been explained in the to-STOP arcs and it is stated 
that while in STOP_COPY the device has the same behavior as STOP. So 
I think the last paragraph is possibly not required.

> + *
> + * STOP -> RESUMING
> + *   Entering the RESUMING state starts a process of restoring the device
> + *   state and will return a new data_fd. The data stream fed into the
> + *   data_fd should be taken from the data transfer output of the saving
> + *   group states

No definition of 'group state' (maybe introduced in a later patch?)

> + *   from a compatible device. The migration driver may alter/reset the
> + *   internal device state for this arc if required to prepare the device to
> + *   receive the migration data.
> + *

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
@ 2022-02-15 10:18   ` Tian, Kevin
  2022-02-15 15:56     ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-15 10:18 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	maorg, Raj, Ashok, shameerali.kolothum.thodi

> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Tuesday, February 8, 2022 1:22 AM
> 
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The RUNNING_P2P state is designed to support multiple devices in the same
> VM that are doing P2P transactions between themselves. When in
> RUNNING_P2P
> the device must be able to accept incoming P2P transactions but should not
> generate outgoing transactions.

outgoing 'P2P' transactions.

> 
> As an optional extension to the mandatory states it is defined as
> inbetween STOP and RUNNING:
>    STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
> 
> For drivers that are unable to support RUNNING_P2P the core code silently
> merges RUNNING_P2P and RUNNING together. Drivers that support this will

It would be clearer if the following message could also be reflected here:

  + * The optional states cannot be used with SET_STATE if the device does not
  + * support them. The user can discover if these states are supported by using
  + * VFIO_DEVICE_FEATURE_MIGRATION. 

Otherwise the original context reads like RUNNING_P2P can be used as an
end state even if the underlying driver doesn't support it, which makes
me wonder what the point of the new capability bit is.

> be
> required to implement 4 FSM arcs beyond the basic FSM. 2 of the basic FSM
> arcs become combination transitions.
> 
> Compared to the v1 clarification, NDMA is redefined into FSM states and is
> described in terms of the desired P2P quiescent behavior, noting that
> halting all DMA is an acceptable implementation.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/vfio.c       | 79 ++++++++++++++++++++++++++++++---------
>  include/linux/vfio.h      |  1 +
>  include/uapi/linux/vfio.h | 34 ++++++++++++++++-
>  3 files changed, 95 insertions(+), 19 deletions(-)
> 
> @@ -1631,17 +1657,36 @@ int vfio_mig_get_next_state(struct vfio_device

[...]

>  	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> +	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
> +			state_flags_table[*next_fsm])
> +		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
> +

A comment highlighting the silent merging of unsupported states would
be informative here.

and I have a puzzle about the following messages:

>   *
> + * And 1 optional state to support VFIO_MIGRATION_P2P:
> + *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
>   *

and

> + * RUNNING_P2P -> RUNNING
>   *   While in RUNNING the device is fully operational, the device may generate
>   *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
>   *   and the device may advance its internal state.
>   *

and below

> + * The optional peer to peer (P2P) quiescent state is intended to be a
> + * quiescent state for the device for the purposes of managing multiple
> + * devices within a user context where peer-to-peer DMA between devices may
> + * be active. The RUNNING_P2P states must prevent the device from initiating
> + * any new P2P DMA transactions. If the device can identify P2P transactions
> + * then it can stop only P2P DMA, otherwise it must stop all DMA. The
> + * migration driver must complete any such outstanding operations prior to
> + * completing the FSM arc into a P2P state. For the purpose of specification
> + * the states behave as though the device was fully running if not supported.

Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits 
all behaviors in RUNNING except blocking outbound P2P:
	* generate interrupts and DMAs
	* respond to MMIO
	* all vfio regions are functional
	* device may advance its internal state
	* drain and block outstanding P2P requests

I think this is not the intended behavior when NDMA was being discussed
in previous threads, as above definition suggests the user could continue
to submit new requests after outstanding P2P requests are completed given
all vfio regions are functional when the device is in RUNNING_P2P.

Though just a naming thing, possibly what we really require is a STOPPING_P2P
state which indicates the device is moving to the STOP (or STOPPED) state.
In this state the device is functional but vfio regions are not so the user still
needs to restrict device access:
	* generate interrupts and DMAs
	* respond to MMIO
	* all vfio regions are NOT functional (no user access)
	* device may advance its internal state
	* drain and block outstanding P2P requests

In virtualization this means Qemu must stop vCPU first before entering
STOPPING_P2P for a device.

Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI 
usage via a new capability bit [1]:

    "A cap like "running_p2p returns an event fd, doesn't finish until the
    VCPU does stuff, and stops pri as well as p2p" might be all that is
    required here (and not an actual new state)"

vPRI requires a RUNNING semantics. A new capability bit can change 
the behaviors listed above for STOPPING_P2P to below:
	* both P2P and vPRI requests should be drained and blocked;
	* all vfio regions are functional (with a RUNNING behavior) so
	  vCPUs can continue running to help drain vPRI requests;
	* an eventfd is returned for the user to poll-wait the completion
	  of state transition;

and in this regard possibly it makes more sense to call this state
STOPPING, to encapsulate any optional preparation work before
the device can be transitioned to STOP (with default as defined for
STOPPING_P2P above and actual behavior changeable by future
capability bits)? 

One additional requirement on the driver side is to dynamically mediate the
fast path and queue any new request which may trigger vPRI or P2P
before moving out of RUNNING_P2P. If moving to STOP_COPY, the
queued requests will also be included as device state to be replayed
in the resuming path.

Does the above sound like a reasonable understanding of this FSM mechanism?

> + *
> + * The optional states cannot be used with SET_STATE if the device does not
> + * support them. The user can disocver if these states are supported by

'disocver' -> 'discover'

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-09  2:36     ` Jason Gunthorpe
@ 2022-02-15 10:41       ` Tian, Kevin
  2022-02-15 16:04         ` Jason Gunthorpe
  2022-02-15 10:58       ` Tian, Kevin
  1 sibling, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-15 10:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 9, 2022 10:37 AM
> 
> > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > >
> > >  /**
> >
> > Otherwise, I'm still not sure how userspace handles the fact that it
> > can't know how much data will be read from the device and how important
> > that is.  There's no replacement of that feature from the v1 protocol
> > here.
> 
> I'm not sure this was part of the v1 protocol either. Yes it had a
> pending_bytes, but I don't think it was actually expected to be 100%
> accurate. Computing this value accurately is potentially quite
> expensive, I would prefer we not enforce this on an implementation
> without a reason, and qemu currently doesn't make use of it.
> 
> The ioctl from the precopy patch is probably the best approach, I
> think it would be fine to allow that for stop copy as well, but also
> don't see a usage right now.
> 
> It is not something that needs decision now, it is very easy to detect
> if an ioctl is supported on the data_fd at runtime to add new things
> here when needed.
> 

Another interesting thing (not an immediate concern on this series)
is how to handle devices which may take a long time (e.g. due to
draining outstanding requests, even w/o vPRI) to enter the STOP
state. That time is not as deterministic as pending bytes, and thus cannot
be reported back to the user before the operation is actually done.

Similar to what we discussed for vPRI, an eventfd will be beneficial
so the user can timeout-wait on it, but it also needs an arc to create
the eventfd between RUNNING->STOP...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-09  2:36     ` Jason Gunthorpe
  2022-02-15 10:41       ` Tian, Kevin
@ 2022-02-15 10:58       ` Tian, Kevin
  2022-02-15 13:13         ` Jason Gunthorpe
  1 sibling, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-15 10:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Tian, Kevin
> Sent: Tuesday, February 15, 2022 6:42 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, February 9, 2022 10:37 AM
> >
> > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > >
> > > >  /**
> > >
> > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > can't know how much data will be read from the device and how
> important
> > > that is.  There's no replacement of that feature from the v1 protocol
> > > here.
> >
> > I'm not sure this was part of the v1 protocol either. Yes it had a
> > pending_bytes, but I don't think it was actually expected to be 100%
> > accurate. Computing this value accurately is potentially quite
> > expensive, I would prefer we not enforce this on an implementation
> > without a reason, and qemu currently doesn't make use of it.
> >
> > The ioctl from the precopy patch is probably the best approach, I
> > think it would be fine to allow that for stop copy as well, but also
> > don't see a usage right now.
> >
> > It is not something that needs decision now, it is very easy to detect
> > if an ioctl is supported on the data_fd at runtime to add new things
> > here when needed.
> >
> 
> Another interesting thing (not an immediate concern on this series)
> is how to handle devices which may take a long time (e.g. due to
> draining outstanding requests, even w/o vPRI) to enter the STOP
> state. That time is not as deterministic as pending bytes, and thus cannot
> be reported back to the user before the operation is actually done.
> 
> Similar to what we discussed for vPRI, an eventfd will be beneficial
> so the user can timeout-wait on it, but it also needs an arc to create
> the eventfd between RUNNING->STOP...
> 

Typed too fast - it doesn't need a new arc, just a new capability to say
that STOP returns an event fd for the user to wait for completion,
when supporting such devices is required. 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-15 10:58       ` Tian, Kevin
@ 2022-02-15 13:13         ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-15 13:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Tue, Feb 15, 2022 at 10:58:58AM +0000, Tian, Kevin wrote:

> > Another interesting thing (not an immediate concern on this series)
> > is how to handle devices which may take a long time (e.g. due to
> > draining outstanding requests, even w/o vPRI) to enter the STOP
> > state. That time is not as deterministic as pending bytes, and thus cannot
> > be reported back to the user before the operation is actually done.
> > 
> > Similar to what we discussed for vPRI, an eventfd will be beneficial
> > so the user can timeout-wait on it, but it also needs an arc to create
> > the eventfd between RUNNING->STOP...
> > 
> 
> Typed too fast - it doesn't need a new arc, just a new capability to say
> that STOP returns an event fd for the user to wait for completion,
> when supporting such devices is required. 😊

I think it is better to add a new arc rather than radically redefine
the behavior of existing ones:

  RUNNING -> RUNNING_PRI_DRAIN

Should return the event fd and allow the async sleep. Then you
alter the FSM so that RUNNING -> STOP is not allowed anymore and
userspace has to accommodate this new behavior.

Overall it will be some extension like the PRE_COPY and P2P, though
probably not transparently backwards compatible.
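
Roughly, from userspace (set_state() stands in for the SET_STATE
feature ioctl; the eventfd return is the hypothetical part):

	int efd = set_state(dev, VFIO_DEVICE_STATE_RUNNING_PRI_DRAIN);
	struct pollfd pfd = { .fd = efd, .events = POLLIN };

	if (poll(&pfd, 1, sla_timeout_ms) > 0)
		set_state(dev, VFIO_DEVICE_STATE_STOP);
	else
		/* SLA timer hit before the drain finished, back off */
		set_state(dev, VFIO_DEVICE_STATE_RUNNING);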

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-15  8:04   ` Tian, Kevin
@ 2022-02-15 15:33     ` Jason Gunthorpe
  2022-02-16  3:04       ` Tian, Kevin
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-15 15:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Tue, Feb 15, 2022 at 08:04:27AM +0000, Tian, Kevin wrote:
> > From: Yishai Hadas <yishaih@nvidia.com>
> > Sent: Tuesday, February 8, 2022 1:22 AM
> > 
> > +static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
> > +					       u32 flags, void __user *arg,
> > +					       size_t argsz)
> > +{
> > +	struct vfio_device_feature_migration mig = {
> > +		.flags = VFIO_MIGRATION_STOP_COPY,
> > +	};
> > +	int ret;
> > +
> > +	if (!device->ops->migration_set_state)
> > +		return -ENOTTY;
> 
> Missing a check on migration_get_state, as done in the last function.

Yep

> > + * @migration_set_state: Optional callback to change the migration state for
> > + *         devices that support migration. The returned FD is used for data
> > + *         transfer according to the FSM definition. The driver is responsible
> > + *         to ensure that FD is isolated whenever the migration FSM leaves a
> > + *         data transfer state or before close_device() returns.
> 
> didn't understand the meaning of 'isolated' here.

It is not a good word. Let's say 'that FD reaches end of stream or
error whenever'
 
> > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
> > +				     VFIO_DEVICE_STATE_V1_SAVING |  \
> > +				     VFIO_DEVICE_STATE_V1_RESUMING)
> 
> Does it make sense to also add 'V1' to MASK and also following macros
> given their names are general?

No, the point of this exercise is to avoid trouble for qemu - the
fewer changes we can get away with, the better.

Once qemu is updated we'll delete this old stuff from the kernel.

> > +/*
> > + * Indicates the device can support the migration API. See enum
> 
> call it V2? Not necessary to add V2 in code but worthy of a clarification
> in comment.

We've only called it 'v2' for discussions.

If you think it is unclear lets say 'support the migration API through
VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE'

> 
> > + * vfio_device_mig_state for details. If present flags must be non-zero and
> > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
> > + *
> > + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
> > + * RESUMING are supported.
> > + */
> 
> Not aligned with other places where 5 states are mentioned. Better add
> ERROR here.

ERROR is not a state that is 'supported'. It could be clarified that
ERROR and RUNNING are always supported.

> 
> > + *
> > + * RUNNING -> STOP
> > + * STOP_COPY -> STOP
> > + *   While in STOP the device must stop the operation of the device. The
> > + *   device must not generate interrupts, DMA, or advance its internal
> > + *   state. When stopped the device and kernel migration driver must accept
> > + *   and respond to interaction to support external subsystems in the STOP
> > + *   state, for example PCI MSI-X and PCI config pace. Failure by the user to
> > + *   restrict device access while in STOP must not result in error conditions
> > + *   outside the user context (ex. host system faults).
> 
> Right above the STOP state is defined as:
> 
>        *  STOP - The device does not change the internal or external state
> 
> 'external state' I assume means P2P activities. For consistency it is clearer
> to also say something about external state in above paragraph.

No, STOP is defined to halt all DMA. I tidied it a bit like this:

 *   While in STOP the device must stop the operation of the device. The device
 *   must not generate interrupts, DMA, or any other change to external state.
 *   It must not change its internal state.


> > + *
> > + *   The STOP_COPY arc will terminate a data transfer session.
>
> remove 'will'

'will' is correct grammar. It could be 'arc terminates'.

> > + *
> > + * STOP -> STOP_COPY
> > + *   This arc begin the process of saving the device state and will return a
> > + *   new data_fd.
> > + *
> > + *   While in the STOP_COPY state the device has the same behavior as STOP
> > + *   with the addition that the data transfers session continues to stream the
> > + *   migration state. End of stream on the FD indicates the entire device
> > + *   state has been transferred.
> > + *
> > + *   The user should take steps to restrict access to vfio device regions while
> > + *   the device is in STOP_COPY or risk corruption of the device
> > + *   migration data stream.
> 
> Restricting access has been explained in the to-STOP arcs and it is stated 
> that while in STOP_COPY the device has the same behavior as STOP. So 
> I think the last paragraph is possibly not required.

It is not the same: the language in STOP is saying that the device
must tolerate external touches without breaking the kernel.

This language is saying if external touches happen then the device is
free to corrupt the migration stream.

In both cases we expect good userspace to not have device
touches, the guidance here is for driver authors about what kind of
steps they need to take to protect against hostile userspace.

> > + * STOP -> RESUMING
> > + *   Entering the RESUMING state starts a process of restoring the device
> > + *   state and will return a new data_fd. The data stream fed into the
> > + *   data_fd should be taken from the data transfer output of the saving
> > + *   group states
> 
> No definition of 'group state' (maybe introduced in a later patch?)

Yes, it was added in the P2P patch

We can avoid talking about saving group here entirely, it really just
means a single FD.

 *   The data stream fed into the data_fd should
 *   be taken from the data transfer output of a single FD during saving
 *   on a compatible device.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-15 10:18   ` Tian, Kevin
@ 2022-02-15 15:56     ` Jason Gunthorpe
  2022-02-16  2:52       ` Tian, Kevin
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-15 15:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Tue, Feb 15, 2022 at 10:18:11AM +0000, Tian, Kevin wrote:
> > From: Yishai Hadas <yishaih@nvidia.com>
> > Sent: Tuesday, February 8, 2022 1:22 AM
> > 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > The RUNNING_P2P state is designed to support multiple devices in the same
> > VM that are doing P2P transactions between themselves. When in
> > RUNNING_P2P
> > the device must be able to accept incoming P2P transactions but should not
> > generate outgoing transactions.
> 
> outgoing 'P2P' transactions.

Yes

> > As an optional extension to the mandatory states it is defined as
> > inbetween STOP and RUNNING:
> >    STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
> > 
> > For drivers that are unable to support RUNNING_P2P the core code silently
> > merges RUNNING_P2P and RUNNING together. Drivers that support this will
> 
> It would be clearer if the following message could also be reflected here:
> 
>   + * The optional states cannot be used with SET_STATE if the device does not
>   + * support them. The user can discover if these states are supported by using
>   + * VFIO_DEVICE_FEATURE_MIGRATION. 
> 
> Otherwise the original context reads like RUNNING_P2P can be used as an
> end state even if the underlying driver doesn't support it, which makes
> me wonder what the point of the new capability bit is.

You've read it right. Let's just add a simple "Unless driver support is
present the new state cannot be used in SET_STATE"

> >  	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
> > +			state_flags_table[*next_fsm])
> > +		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
> > +
> 
> A comment highlighting the silent merging of unsupported states would
> be informative here.

	/*
	 * Arcs touching optional and unsupported states are skipped over. The
	 * driver will instead see an arc from the original state to the next
	 * logical state, as per the above comment.
	 */

> Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits 
> all behaviors in RUNNING except blocking outbound P2P:
> 	* generate interrupts and DMAs
> 	* respond to MMIO
> 	* all vfio regions are functional
> 	* device may advance its internal state
> 	* drain and block outstanding P2P requests

Correct.

The device must be able to receive and process any MMIO P2P
transaction during this state.

We discussed and left interrupts as allowed behavior.

> I think this is not the intended behavior when NDMA was being discussed
> in previous threads, as above definition suggests the user could continue
> to submit new requests after outstanding P2P requests are completed given
> all vfio regions are functional when the device is in RUNNING_P2P.

It is the desired behavior. The device must internally stop generating
DMA from new work, it cannot rely on external things not poking it
with MMIO, because the whole point of the state is that MMIO P2P is
still allowed to happen.

What gets confusing is that in normal cases I wouldn't expect any P2P
activity to trigger a new work submission.

Probably, since many devices can't implement this, we will end up with
devices providing a weaker version where they do RUNNING_P2P but this
relies on the VM operating the device "sanely" without programming P2P
work submission. It is similar to your notion that migration requires
guest co-operation in the vPRI case.

I don't like it, and better devices really should avoid requiring
guest co-operation, but it seems like where things are going.

> Though just a naming thing, possibly what we really require is a STOPPING_P2P
> state which indicates the device is moving to the STOP (or STOPPED)
> state.

No, I've deliberately avoided STOP because this isn't anything like
STOP. It is RUNNING with one restriction.

> In this state the device is functional but vfio regions are not so the user still
> needs to restrict device access:

The device is not functional in STOP. STOP means the device does not
provide working MMIO. Ie mlx5 devices will discard all writes and
read all 0's when in STOP.

The point of RUNNING_P2P is to allow the device to continue to receive
all MMIO while halting generation of MMIO to other devices.

> In virtualization this means Qemu must stop vCPU first before entering
> STOPPING_P2P for a device.

This is already the case. RUNNING/STOP here does not refer to the
vCPU, it refers to this device.

> Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI 
> usage via a new capability bit [1]:
> 
>     "A cap like "running_p2p returns an event fd, doesn't finish until the
>     VCPU does stuff, and stops pri as well as p2p" might be all that is
>     required here (and not an actual new state)"
> 
> vPRI requires a RUNNING semantics. A new capability bit can change 
> the behaviors listed above for STOPPING_P2P to below:
> 	* both P2P and vPRI requests should be drained and blocked;
> 	* all vfio regions are functional (with a RUNNING behavior) so
> 	  vCPUs can continue running to help drain vPRI requests;
> 	* an eventfd is returned for the user to poll-wait the completion
> 	  of state transition;

vPRI draining is not STOP either. If the device is expected to provide
working MMIO it is not STOP by definition.

> One additional requirement on the driver side is to dynamically mediate the
> fast path and queue any new request which may trigger vPRI or P2P
> before moving out of RUNNING_P2P. If moving to STOP_COPY, the
> queued requests will also be included as device state to be replayed
> in the resuming path.

This could make sense. I don't know how you dynamically mediate
though, or how you will trap ENQCMD..

> Does the above sound like a reasonable understanding of this FSM mechanism?

Other than mis-using the STOP label, it is close yes.

> > + * The optional states cannot be used with SET_STATE if the device does not
> > + * support them. The user can disocver if these states are supported by
> 
> 'disocver' -> 'discover'

Yep, thanks

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-15 10:41       ` Tian, Kevin
@ 2022-02-15 16:04         ` Jason Gunthorpe
  2022-02-15 23:32           ` Alex Williamson
  2022-02-16  3:17           ` Tian, Kevin
  0 siblings, 2 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-15 16:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Tue, Feb 15, 2022 at 10:41:56AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, February 9, 2022 10:37 AM
> > 
> > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > >
> > > >  /**
> > >
> > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > can't know how much data will be read from the device and how important
> > > that is.  There's no replacement of that feature from the v1 protocol
> > > here.
> > 
> > I'm not sure this was part of the v1 protocol either. Yes it had a
> > pending_bytes, but I don't think it was actually expected to be 100%
> > accurate. Computing this value accurately is potentially quite
> > expensive, I would prefer we not enforce this on an implementation
> > without a reason, and qemu currently doesn't make use of it.
> > 
> > The ioctl from the precopy patch is probably the best approach, I
> > think it would be fine to allow that for stop copy as well, but also
> > don't see a usage right now.
> > 
> > It is not something that needs decision now, it is very easy to detect
> > if an ioctl is supported on the data_fd at runtime to add new things
> > here when needed.
> > 
> 
> Another interesting thing (not an immediate concern on this series)
> is how to handle devices which may take a long time (e.g. due to
> draining outstanding requests, even w/o vPRI) to enter the STOP
> state. That time is not as deterministic as pending bytes, and thus cannot
> be reported back to the user before the operation is actually done.

Well, it is not deterministic at all..

I suppose you have to do as Alex says and try to estimate how much
time the stop phase of migration will take and grant only the
remaining time from the SLA to the guest to finish its PRI flushing,
otherwise go back to PRE_COPY and try again later if the timer hits.

This suggests to me the right interface from the driver is some
estimate of time to enter STOP_COPY and resulting required transfer
size.

Still, I just don't see how SLAs can really be feasible with this kind
of HW that requires guest co-operation..

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-15 16:04         ` Jason Gunthorpe
@ 2022-02-15 23:32           ` Alex Williamson
  2022-02-16  1:17             ` Jason Gunthorpe
  2022-02-16  3:17           ` Tian, Kevin
  1 sibling, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2022-02-15 23:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Tue, 15 Feb 2022 12:04:19 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 15, 2022 at 10:41:56AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, February 9, 2022 10:37 AM
> > >   
> > > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > > >
> > > > >  /**  
> > > >
> > > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > > can't know how much data will be read from the device and how important
> > > > that is.  There's no replacement of that feature from the v1 protocol
> > > > here.  
> > > 
> > > I'm not sure this was part of the v1 protocol either. Yes it had a
> > > pending_bytes, but I don't think it was actually expected to be 100%
> > > accurate. Computing this value accurately is potentially quite
> > > expensive, I would prefer we not enforce this on an implementation
> > > without a reason, and qemu currently doesn't make use of it.
> > > 
> > > The ioctl from the precopy patch is probably the best approach, I
> > > think it would be fine to allow that for stop copy as well, but also
> > > don't see a usage right now.
> > > 
> > > It is not something that needs decision now, it is very easy to detect
> > > if an ioctl is supported on the data_fd at runtime to add new things
> > > here when needed.
> > >   
> > 
> > Another interesting thing (not an immediate concern on this series)
> > is how to handle devices which may take a long time (e.g. due to
> > draining outstanding requests, even w/o vPRI) to enter the STOP
> > state. That time is not as deterministic as pending bytes, and thus cannot
> > be reported back to the user before the operation is actually done.
> 
> Well, it is not deterministic at all..
> 
> I suppose you have to do as Alex says and try to estimate how much
> time the stop phase of migration will take and grant only the
> remaining time from the SLA to the guest to finish its PRI flushing,
> otherwise go back to PRE_COPY and try again later if the timer hits.
> 
> This suggests to me the right interface from the driver is some
> estimate of time to enter STOP_COPY and resulting required transfer
> size.
> 
> Still, I just don't see how SLAs can really be feasible with this kind
> of HW that requires guest co-operation..

Devil's advocate, does this discussion raise any concerns whether a
synchronous vs asynchronous arc transition ioctl is still the right
solution here?  I can imagine for instance that posting a state change
and being able to poll for pending transactions or completion of the
saved state generation and ultimate size could be very useful for
managing migration SLAs, not to mention trivial userspace support to
parallelize state changes.

Reporting a maximum device state size hint also seems relatively
trivial since this should just be the sum of on-device memory, asics,
and processors.  The mlx5 driver already places an upper bound on
migration data size internally.

Maybe some of these can come as DEVICE_FEATURES as we go, but for any
sort of cloud vendor SLA, I'm afraid we're only enabling migration of
devices with negligible transition latencies and negligible device
states, with some hand waving how to determine that either of those are
the case without device specific knowledge in the orchestration.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-15 23:32           ` Alex Williamson
@ 2022-02-16  1:17             ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-16  1:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Tue, Feb 15, 2022 at 04:32:31PM -0700, Alex Williamson wrote:

> > I suppose you have to do as Alex says and try to estimate how much
> > time the stop phase of migration will take and grant only the
> > remaining time from the SLA to the guest to finish its PRI flushing,
> > otherwise go back to PRE_COPY and try again later if the timer hits.
> > 
> > This suggests to me the right interface from the driver is some
> > estimate of time to enter STOP_COPY and resulting required transfer
> > size.
> > 
> > Still, I just don't see how SLAs can really be feasible with this kind
> > of HW that requires guest co-operation..
> 
> Devil's advocate, does this discussion raise any concerns whether a
> synchronous vs asynchronous arc transition ioctl is still the right
> solution here?  

v2 switched to the data_fd which allows almost everything important to
be async, assuming someone wants to implement it in qemu and a driver.

It allows RUNNING -> STOP_COPY to be made async because the driver can
return SET_STATE immediately, background the state save and indicate
completion/progress/error via poll(readable) on the data_fd. However
the device does still have to suspend DMA synchronously.

RESUMING -> STOP can also be async. The driver will make the data_fd
not writable before the last byte using its internal knowledge of the
data framing. Once the driver allows the last byte to be delivered
qemu will immediately do SET_STATE which will be low latency.

The entire data transfer flow itself is now async event driven and can
be run in parallel across devices with an epoll or iouring type
scheme.
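
i.e. the save side of several devices can be driven from one event
loop, sketch only (all_done() and forward_chunk() are stand-ins):

	struct pollfd pfds[ndev];
	int i;

	for (i = 0; i < ndev; i++)
		pfds[i] = (struct pollfd){ .fd = dev[i].data_fd,
					   .events = POLLIN };

	while (!all_done(dev, ndev)) {
		poll(pfds, ndev, -1);
		for (i = 0; i < ndev; i++)
			if (pfds[i].revents & POLLIN)
				forward_chunk(&dev[i]);	/* read() + ship */
	}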

STOP->RUNNING should be low latency for any reasonable device design.

For the P2P extension, the RUNNING -> RUNNING_P2P arc happens with
stopped vCPUs, but I think a reasonable implementation must make this
low latency,
just like suspending DMA to get to STOP_COPY must be low latency.
Making it async won't make it faster, though I would like to see it
run in parallel for all P2P devices.

The other arcs have the vCPU running, so don't matter to this.

In essence, compared to v1, we already made it almost fully async.

Also, at least with the mlx5 design, we can run all the commands async
(though there is a blocker preventing this right now); however, we
cannot abort commands in progress. So as far as an SLA is concerned I
don't think async necessarily helps much.

I also think acc and several other drivers we are looking at would not
implement, or gain any advantage from async arcs.

Are there more arcs that benefit from async? PRI draining has come
up.

Keep in mind, qemu can still userspace thread SET_STATE. There has
also been talk about a generic iouring based kernel threaded
ioctl: https://lwn.net/Articles/844875/

What I suggested to Kevin is also something to look at, userspace
provides an event FD to SET_STATE and the event FD is triggered when
the background action is done.

So, I'm not worried about this. There are more than enough options to
address any async requirements down the road.

> and processors.  The mlx5 driver already places an upper bound on
> migration data size internally.

We did that because it seemed unreasonable to allow userspace to
allocate unlimited kernel memory during resuming. Ideally we'd limit
it to the device's max capability but the device doesn't know how to
do that today.

> Maybe some of these can come as DEVICE_FEATURES as we go, but for any
> sort of cloud vendor SLA, I'm afraid we're only enabling migration of
> devices with negligible transition latencies and negligible device
> states

Even if this is true, it is not a failure! Most of the migration
drivers we foresee are of this class.

My feeling is that more complex devices would benefit from some stuff,
eg like estimating times, but I'd rather collect actual field data and
understand where things lie, and what device changes are needed,
before we design something.

> with some hand waving how to determine that either of those are
> the case without device specific knowledge in the orchestration.

I don't think the orchestration necessarily needs special
knowledge. Certainly when the cloud operator designs the VMs and sets
the SLA parameters they need to do it with understanding of what the
mix of devices are and what kind of migration performance they get out
of the entire system.

More than anything system migration performance is going to be
impacted by the network for devices like mlx5 that have a non-trivial
STOP_COPY data blob.

Basically, I think it is worth thinking about, but not worth acting on
right now.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-15 15:56     ` Jason Gunthorpe
@ 2022-02-16  2:52       ` Tian, Kevin
  2022-02-16 12:11         ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-16  2:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 15, 2022 11:56 PM
> 
> > Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits
> > all behaviors in RUNNING except blocking outbound P2P:
> > 	* generate interrupts and DMAs
> > 	* respond to MMIO
> > 	* all vfio regions are functional
> > 	* device may advance its internal state
> > 	* drain and block outstanding P2P requests
> 
> Correct.
> 
> The device must be able to receive and process any MMIO P2P
> transaction during this state.
> 
> We discussed and left interrupts as allowed behavior.
> 
> > I think this is not the intended behavior when NDMA was being discussed
> > in previous threads, as above definition suggests the user could continue
> > to submit new requests after outstanding P2P requests are completed
> given
> > all vfio regions are functional when the device is in RUNNING_P2P.
> 
> It is the desired behavior. The device must internally stop generating
> DMA from new work, it cannot rely on external things not poking it
> with MMIO, because the whole point of the state is that MMIO P2P is
> still allowed to happen.
> 
> What gets confusing is that in normal cases I wouldn't expect any P2P
> activity to trigger a new work submission.
> 
> Probably, since many devices can't implement this, we will end up with
> devices providing a weaker version where they do RUNNING_P2P but this
> relies on the VM operating the device "sanely" without programming P2P
> work submission. It is similar to your notion that migration requires
> guest co-operation in the vPRI case.
> 
> I don't like it, and better devices really should avoid requiring
> guest co-operation, but it seems like where things are going.

Makes sense to me now.

btw can disabling PCI bus master be a general means for devices which
don't have a way of blocking P2P to implement RUNNING_P2P? 

> 
> > Though just a naming thing, possibly what we really require is a
> STOPPING_P2P
> > state which indicates the device is moving to the STOP (or STOPPED)
> > state.
> 
> No, I've deliberately avoided STOP because this isn't anything like
> STOP. It is RUNNING with one restriction.

With above explanation I'm fine with it.

> 
> > In this state the device is functional but vfio regions are not so the user still
> > needs to restrict device access:
> 
> The device is not functional in STOP. STOP means the device does not
> provide working MMIO. Ie mlx5 devices will discard all writes and
> read all 0's when in STOP.

btw I used 'STOPPING' to indicate a transitional state between RUNNING
and STOP, so its definition could be separate from STOP's. But
it doesn't matter now.

> 
> The point of RUNNING_P2P is to allow the device to continue to receive
> all MMIO while halting generation of MMIO to other devices.
> 
> > In virtualization this means Qemu must stop vCPU first before entering
> > STOPPING_P2P for a device.
> 
> This is already the case. RUNNING/STOP here does not refer to the
> vCPU, it refers to this device.

I know that point. Originally I thought that having 'RUNNING' in RUNNING_P2P
implies that the vCPU doesn't need to be stopped first given all vfio regions
are functional. But now I think the rationale is clear. If guest co-operation
exists then the vCPU can be active when entering RUNNING_P2P since the guest
will guarantee no new P2P work submission (via vCPU or via P2P). Otherwise the
vCPU must be stopped first to block potential P2P work submissions as a
brute-force measure.

> 
> > Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI
> > usage via a new capability bit [1]:
> >
> >     "A cap like "running_p2p returns an event fd, doesn't finish until the
> >     VCPU does stuff, and stops pri as well as p2p" might be all that is
> >     required here (and not an actual new state)"
> >
> > vPRI requires a RUNNING semantics. A new capability bit can change
> > the behaviors listed above for STOPPING_P2P to below:
> > 	* both P2P and vPRI requests should be drained and blocked;
> > 	* all vfio regions are functional (with a RUNNING behavior) so
> > 	  vCPUs can continue running to help drain vPRI requests;
> > 	* an eventfd is returned for the user to poll-wait the completion
> > 	  of state transition;
> 
> vPRI draining is not STOP either. If the device is expected to provide
> working MMIO it is not STOP by definition.
> 
> > One additional requirement in driver side is to dynamically mediate the
> > fast path and queue any new request which may trigger vPRI or P2P
> > before moving out of RUNNING_P2P. If moving to STOP_COPY, then
> > queued requests will also be included as device state to be replayed
> > in the resuming path.
> 
> This could make sense. I don't know how you dynamically mediate
> though, or how you will trap ENQCMD..

Qemu can ask KVM to temporarily clear the EPT mapping of the cmd portal
to enable mediation on the src and then restore the mapping before resuming
vCPUs on the dest. In our internal POC the cmd portal address is hard coded
in Qemu, which is not good. Possibly we need a general mechanism so a
migration driver which supports vPRI and the extended RUNNING_P2P behavior
can report to the user a list of pages which must be accessed via read()/
write() instead of mmap when the device is in RUNNING_P2P and vCPUs
are active. Based on that information Qemu can zap the related EPT mappings
before moving the device to RUNNING_P2P.

> 
> > Does above sound a reasonable understanding of this FSM mechanism?
> 
> Other than mis-using the STOP label, it is close yes.
> 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-15 15:33     ` Jason Gunthorpe
@ 2022-02-16  3:04       ` Tian, Kevin
  0 siblings, 0 replies; 50+ messages in thread
From: Tian, Kevin @ 2022-02-16  3:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 15, 2022 11:34 PM
> 
> > > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> > > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
> > > +				     VFIO_DEVICE_STATE_V1_SAVING |  \
> > > +				     VFIO_DEVICE_STATE_V1_RESUMING)
> >
> > Does it make sense to also add 'V1' to MASK and also following macros
> > given their names are general?
> 
> No, the point of this exercise is to avoid trouble for qemu - the
> fewer changes we can get away with, the better.
> 
> Once qemu is updated we'll delete this old stuff from the kernel.

sounds good.

> 
> > > +/*
> > > + * Indicates the device can support the migration API. See enum
> >
> > call it V2? Not necessary to add V2 in code but worthy of a clarification
> > in comment.
> 
> We've only called it 'v2' for discussions.
> 
> If you think it is unclear let's say 'support the migration API through
> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE'

yes, that's clearer.
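
For what it's worth, a minimal userspace sketch of driving the FSM
through the FEATURE ioctl, as I read the uAPI in this series (error
handling trimmed):

  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static int mig_set_state(int device_fd, __u32 new_state)
  {
          char buf[sizeof(struct vfio_device_feature) +
                   sizeof(struct vfio_device_feature_mig_state)] = {};
          struct vfio_device_feature *feature = (void *)buf;
          struct vfio_device_feature_mig_state *mig =
                  (void *)feature->data;

          feature->argsz = sizeof(buf);
          feature->flags = VFIO_DEVICE_FEATURE_SET |
                           VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
          mig->device_state = new_state;
          if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
                  return -1;
          /* data_fd is valid when the arc opened a transfer session */
          return mig->data_fd;
  }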

> > > + *
> > > + * STOP -> STOP_COPY
> > > + *   This arc begins the process of saving the device state and will return a
> > > + *   new data_fd.
> > > + *
> > > + *   While in the STOP_COPY state the device has the same behavior as STOP
> > > + *   with the addition that the data transfer session continues to stream the
> > > + *   migration state. End of stream on the FD indicates the entire device
> > > + *   state has been transferred.
> > > + *
> > > + *   The user should take steps to restrict access to vfio device regions while
> > > + *   the device is in STOP_COPY or risk corruption of the device migration data
> > > + *   stream.
> >
> > Restricting access has been explained in the to-STOP arcs and it is stated
> > that while in STOP_COPY the device has the same behavior as STOP. So
> > I think the last paragraph is possibly not required.
> 
> It is not the same, the language in STOP is saying that the device
> must tolerate external touches without breaking the kernel
> 
> This language is saying if external touches happen then the device is
> free to corrupt the migration stream.
> 
> In both cases we expect good userspace to not have device
> touches, the guidance here is for driver authors about what kind of
> steps they need to take to protect against hostile userspace.

fair enough.

> 
> > > + * STOP -> RESUMING
> > > + *   Entering the RESUMING state starts a process of restoring the device
> > > + *   state and will return a new data_fd. The data stream fed into the data_fd
> > > + *   should be taken from the data transfer output of the saving group states
> >
> > No definition of 'group state' (maybe introduced in a later patch?)
> 
> Yes, it was added in the P2P patch
> 
> We can avoid talking about saving group here entirely, it really just
> means a single FD.
> 
>  *   The data stream fed into the data_fd should
>  *   be taken from the data transfer output of a single FD during saving
>  *   from a compatible device.
> 

Yes.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-15 16:04         ` Jason Gunthorpe
  2022-02-15 23:32           ` Alex Williamson
@ 2022-02-16  3:17           ` Tian, Kevin
  2022-02-16 12:14             ` Jason Gunthorpe
  1 sibling, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-16  3:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 16, 2022 12:04 AM
> 
> On Tue, Feb 15, 2022 at 10:41:56AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, February 9, 2022 10:37 AM
> > >
> > > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > > >
> > > > >  /**
> > > >
> > > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > > can't know how much data will be read from the device and how important
> > > > that is.  There's no replacement of that feature from the v1 protocol
> > > > here.
> > >
> > > I'm not sure this was part of the v1 protocol either. Yes it had a
> > > pending_bytes, but I don't think it was actually expected to be 100%
> > > accurate. Computing this value accurately is potentially quite
> > > expensive, I would prefer we not enforce this on an implementation
> > > without a reason, and qemu currently doesn't make use of it.
> > >
> > > The ioctl from the precopy patch is probably the best approach, I
> > > think it would be fine to allow that for stop copy as well, but also
> > > don't see a usage right now.
> > >
> > > It is not something that needs decision now, it is very easy to detect
> > > if an ioctl is supported on the data_fd at runtime to add new things
> > > here when needed.
> > >
> >
> > Another interesting thing (not an immediate concern for this series)
> > is how to handle devices which may take a long time (e.g. due to
> > draining outstanding requests, even w/o vPRI) to enter the STOP
> > state. That time is not as deterministic as pending bytes and thus
> > cannot be reported back to the user before the operation is actually done.
> 
> Well, it is not deterministic at all..
> 
> I suppose you have to do as Alex says and try to estimate how much
> time the stop phase of migration will take and grant only the
> remaining time from the SLA to the guest to finish its PRI flushing,

Let's separate it from the PRI stuff, so there is no guest operation involved.

It's a simple story: vCPUs have been stopped and Qemu requests a
state transition from RUNNING to STOP on a device which needs the
migration driver to drain outstanding requests before being stopped.

Those requests don't rely on vCPUs but still take time to complete
(thus may break the SLA) and are invisible to the migration driver
(directly submitted by the guest, thus they cannot be estimated). So the
only means is for the user to wait on an fd with a timeout (based on
whatever SLA) and, if it expires, abort the migration (it may retry later).

I'm not sure whether we want to leverage the new arc for vPRI or
just allow changing the STOP behavior to return an eventfd for an
async transition.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-16  2:52       ` Tian, Kevin
@ 2022-02-16 12:11         ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-16 12:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Wed, Feb 16, 2022 at 02:52:55AM +0000, Tian, Kevin wrote:

> btw can disabling PCI bus master be a general means for devices which
> don't have a way of blocking P2P to implement RUNNING_P2P? 

I think if it works for a specific device then that device's driver
can use it.

I wouldn't make something general, too likely a device will blow up if
you do this to it.
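
For a device where it is known safe, the driver-local version could be
as small as this sketch (hypothetical driver, not something the core
should do):

  #include <linux/pci.h>
  #include <linux/vfio_pci_core.h>

  /* Entering RUNNING_P2P: stop the function from mastering the bus.
   * This halts *all* DMA, which is stronger than the state requires,
   * and the device must be known to tolerate it. */
  static int mydev_enter_running_p2p(struct vfio_pci_core_device *vdev)
  {
          pci_clear_master(vdev->pdev);
          /* also wait for outstanding DMA to drain, per the device's
           * own rules */
          return 0;
  }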

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-16  3:17           ` Tian, Kevin
@ 2022-02-16 12:14             ` Jason Gunthorpe
  2022-02-17  2:29               ` Tian, Kevin
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-16 12:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Wed, Feb 16, 2022 at 03:17:36AM +0000, Tian, Kevin wrote:

> those requests don't rely on vCPUs but still take time to complete
> (thus may break SLA) and are invisible to migration driver (directly
> submitted by the guest thus cannot be estimated). So the only means 
> is for user to wait on a fd with a timeout (based on whatever SLA) and
> if expires then aborts migration (may retry later).

I think I explained in my other email how this can be implemented
today with v2 for STOP_COPY without an event fd.

Such a device might even be able to implement an abort..

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-16 12:14             ` Jason Gunthorpe
@ 2022-02-17  2:29               ` Tian, Kevin
  0 siblings, 0 replies; 50+ messages in thread
From: Tian, Kevin @ 2022-02-17  2:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 16, 2022 8:14 PM
> 
> On Wed, Feb 16, 2022 at 03:17:36AM +0000, Tian, Kevin wrote:
> 
> > those requests don't rely on vCPUs but still take time to complete
> > (thus may break SLA) and are invisible to migration driver (directly
> > submitted by the guest thus cannot be estimated). So the only means
> > is for user to wait on a fd with a timeout (based on whatever SLA) and
> > if expires then aborts migration (may retry later).
> 
> I think I explained in my other email how this can be implemented
> today with v2 for STOP_COPY without an event fd.
> 

I suppose you meant this part:

"It allows RUNNING -> STOP_COPY to be made async because the driver can
return SET_STATE immediately, backround the state save and indicate
completion/progress/error via poll(readable) on the data_fd."

Yes, it could work if the user directly requests STOP_COPY as the end state
(with STOP as an implicit/immediate step). In that case polling on the data_fd
with a timeout can cover the requirement described for STOP here.
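
Roughly, the user side of that STOP_COPY scheme could look like this
sketch (sla_ms and dst_fd are illustrative):

  #include <poll.h>
  #include <unistd.h>

  static int save_with_deadline(int data_fd, int dst_fd, int sla_ms)
  {
          char buf[65536];

          for (;;) {
                  struct pollfd pfd = { .fd = data_fd, .events = POLLIN };
                  ssize_t n;

                  if (poll(&pfd, 1, sla_ms) <= 0)
                          return -1;      /* SLA expired: abort migration */
                  n = read(data_fd, buf, sizeof(buf));
                  if (n == 0)
                          return 0;       /* end of stream: state saved */
                  if (n < 0)
                          return -1;
                  write(dst_fd, buf, n);  /* forward to the destination */
          }
  }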

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
@ 2022-02-17 17:15   ` Alex Williamson
  2022-02-18  0:03     ` Jason Gunthorpe
  2022-02-18  8:01   ` Tian, Kevin
  1 sibling, 1 reply; 50+ messages in thread
From: Alex Williamson @ 2022-02-17 17:15 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Mon, 7 Feb 2022 19:22:16 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The optional PRE_COPY states open the saving data transfer FD before
> reaching STOP_COPY and allows the device to dirty track internal state
> changes with the general idea to reduce the volume of data transferred
> in the STOP_COPY stage.
> 
> While in PRE_COPY the device remains RUNNING, but the saving FD is open.
> 
> Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
> which halts P2P transfers while continuing the saving FD.
> 
> PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
> and exists as an optional FSM branch between RUNNING and STOP_COPY:
>     RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> 
> A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
> query the progress of the precopy operation in the driver with the idea it
> will judge to move to STOP_COPY at least once the initial data set is
> transferred, and possibly after the dirty size has shrunk appropriately.
> 
> We think there may also be merit in future extensions to the
> VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
> rate it generates internal dirty state.
> 
> Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
> and to be defined in future. While making the whole PRE_COPY feature
> optional eliminates the concern from mlx5, this is still a complicated arc
> to implement and seems prudent to leave it closed until a proper use case
> is developed. We also split the pending_bytes report into the initial and
> sustaining values, and define the protocol to get an event via poll() for
> new dirty data during PRE_COPY.

I feel obligated to ask, is PRE_COPY support essentially RFC at this
point since we have no proposed in-kernel users?

It seems like we're winding down comments on the remainder of the
series and I feel ok with where it's headed and the options we have
available for future extensions.  Pre-copy seems like an important gap
to fill and I think this patch shows that a future extension could
allow it, but with the scrutiny not to add unused code to the kernel,
I'm not sure there's a valid justification to add it now.  Thanks,

Alex

PS - Why is this a stand-alone ioctl rather than a DEVICE_FEATURE?


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-17 17:15   ` Alex Williamson
@ 2022-02-18  0:03     ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-18  0:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, ashok.raj, kevin.tian,
	shameerali.kolothum.thodi

On Thu, Feb 17, 2022 at 10:15:54AM -0700, Alex Williamson wrote:

> I feel obligated to ask, is PRE_COPY support essentially RFC at this
> point since we have no proposed in-kernel users?

Yes, it is included here because the kernel in v1 had PRE_COPY, so it
seemed essential to show how this could continue to look to evaluate
v2.

NVIDIA has an out of tree driver that implemented PRE_COPY in the v1
protocol, and we have some future plan to use it in an in-tree driver.

> It seems like we're winding down comments on the remainder of the
> series and I feel ok with where it's headed and the options we have
> available for future extensions.  

Thanks, it was a lot of work for everyone to get here!

Yishai has all the revisions from Kevin included; he will send it on
Sunday. Based on this Leon will make a formal PR next week so it can
go into linux-next through your tree. We have to stay co-ordinated
with our netdev driver branch..

I will ping the acc team and make it a priority to review their next
version. Let's try to include their driver as well.

We'll start to make a more review ready qemu series.

> PS - Why is this a stand-alone ioctl rather than a DEVICE_FEATURE?

You asked for the ioctl to be on the data_fd, so there is no
DEVICE_FEATURE infrastructure and I think it doesn't make sense to put
a multiplexor there. We have lots of ioctl numbers and don't want this
to be complicated for performance.
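
i.e. the call sites differ - the FEATURE multiplexor stays on the device
FD while this lands directly on the migration data FD (sketch):

  /* feature get/set go to the vfio device FD */
  ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
  /* precopy progress queries go to the data FD from SET_STATE */
  ioctl(data_fd, VFIO_DEVICE_MIG_PRECOPY, &precopy);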

Thanks,
Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
  2022-02-17 17:15   ` Alex Williamson
@ 2022-02-18  8:01   ` Tian, Kevin
  2022-02-18 14:06     ` Jason Gunthorpe
  1 sibling, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-18  8:01 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	maorg, Raj, Ashok, shameerali.kolothum.thodi

Some comments, though this may not be in the next version per Alex's suggestion.

> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Tuesday, February 8, 2022 1:22 AM
> 
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The optional PRE_COPY states open the saving data transfer FD before
> reaching STOP_COPY and allows the device to dirty track internal state
> changes with the general idea to reduce the volume of data transferred
> in the STOP_COPY stage.
> 
> While in PRE_COPY the device remains RUNNING, but the saving FD is open.
> 
> Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
> which halts P2P transfers while continuing the saving FD.
> 
> PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
> and exists as an optional FSM branch between RUNNING and STOP_COPY:
>     RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> 

That branch includes only 5 new arcs. The other two are between RUNNING_P2P
and PRE_COPY_P2P.

I drew a figure to help me understand the final FSM. Putting it here in case
others are interested 😊

RUNNING <--------------------> RUNNING_P2P <-------> STOP <-----> RESUMING
   ^                               ^                  ^
   |                               |                  |
   |                               |                  |
   v                               v                  v
PRECOPY <--------------------> PRECOPY_P2P -----> STOP_COPY

> A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
> query the progress of the precopy operation in the driver with the idea it
> will judge to move to STOP_COPY at least once the initial data set is
> transferred, and possibly after the dirty size has shrunk appropriately.
> 
> We think there may also be merit in future extensions to the
> VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
> rate it generates internal dirty state.
> 
> Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional

essentially it's *BLOCKED* per the following context.

> and to be defined in future. While making the whole PRE_COPY feature
> optional eliminates the concern from mlx5, this is still a complicated arc
> to implement and seems prudent to leave it closed until a proper use case

Can you shed some light on the complexity here?

Could a driver pretend to support PRE_COPY by simply returning both
initial_bytes and dirty_bytes as ZERO?

and even if the driver doesn't support the base arc (STOP_COPY->
PRE_COPY_P2P), what about the combination arc (STOP_COPY->STOP->
RUNNING_P2P->PRE_COPY_P2P)? The current FSM already allows
STOP->RUNNING_P2P->PRE_COPY_P2P and in concept STOP_COPY
and STOP have the exact same device behavior.

With that combination arc the interim transition from STOP_COPY to
STOP will terminate the current data stream and RUNNING_P2P to
PRE_COPY_P2P will return a new data fd. This does violate the definition
of transitions between the three 'saving group' states, which says
moving between them does not terminate or otherwise affect the
associated fd.

Is this one of the complexities that worry you?

> is developed. We also split the pending_bytes report into the initial and
> sustaining values, and define the protocol to get an event via poll() for

I guess this split must have been aligned in earlier discussion but it's still
useful if some words can be put here about the motivation. Otherwise one
could easily ask why not treat the 1st read of pending_bytes as the
initial size.

> new dirty data during PRE_COPY.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/vfio.c       |  71 +++++++++++++++++++++++-
>  include/uapi/linux/vfio.h | 110 ++++++++++++++++++++++++++++++++++++--
>  2 files changed, 176 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 8c484593dfe0..b4c585114ef3 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1577,7 +1577,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>  			    enum vfio_device_mig_state new_fsm,
>  			    enum vfio_device_mig_state *next_fsm)
>  {
> -	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
> +	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_PRE_COPY_P2P + 1 };
>  	/*
>  	 * The coding in this table requires the driver to implement
>  	 * FSM arcs:
> @@ -1596,25 +1596,59 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>  	 *         RUNNING -> STOP
>  	 *         STOP -> RUNNING

The comment for above should be updated too, which currently says:

	 * Without P2P the driver must implement:

and also move it to the end as it talks about the arcs when neither
P2P nor PRECOPY is supported.

>  	 *
>  	 * If precopy is supported then the driver must support these additional
>  	 * FSM arcs:
> +	 *         RUNNING -> PRE_COPY
> +	 *         PRE_COPY -> RUNNING
> +	 *         PRE_COPY -> STOP_COPY
> +	 * However, if precopy and P2P are supported together then the driver
> +	 * must support these additional arcs beyond the P2P arcs above:
> +	 *         PRE_COPY -> RUNNING
> +	 *         PRE_COPY -> PRE_COPY_P2P
> +	 *         PRE_COPY_P2P -> PRE_COPY
> +	 *         PRE_COPY_P2P -> RUNNING_P2P
> +	 *         PRE_COPY_P2P -> STOP_COPY
> +	 *         RUNNING -> PRE_COPY
> +	 *         RUNNING_P2P -> PRE_COPY_P2P
> +	 *
>  	 * If all optional features are supported then the coding will step
>  	 * through multiple states for these combination transitions:
> +	 *         PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> +	 *         PRE_COPY -> RUNNING -> RUNNING_P2P
> +	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
> +	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
> +	 *         PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
> +	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP
> +	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
>  	 *         RESUMING -> STOP -> RUNNING_P2P
> +	 *         RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
>  	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
> +	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
>  	 *         RESUMING -> STOP -> STOP_COPY
> +	 *         RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
>  	 *         RUNNING -> RUNNING_P2P -> STOP
>  	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
>  	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
> +	 *         RUNNING_P2P -> RUNNING -> PRE_COPY
>  	 *         RUNNING_P2P -> STOP -> RESUMING
>  	 *         RUNNING_P2P -> STOP -> STOP_COPY
> +	 *         STOP -> RUNNING_P2P -> PRE_COPY_P2P
>  	 *         STOP -> RUNNING_P2P -> RUNNING
> +	 *         STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
>  	 *         STOP_COPY -> STOP -> RESUMING
>  	 *         STOP_COPY -> STOP -> RUNNING_P2P
>  	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
> +	 *
> +	 *  The following transitions are blocked:
> +	 *         STOP_COPY -> PRE_COPY
> +	 *         STOP_COPY -> PRE_COPY_P2P
>  	 */
>  	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
>  		[VFIO_DEVICE_STATE_STOP] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
>  			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> @@ -1623,14 +1657,38 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>  		[VFIO_DEVICE_STATE_RUNNING] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
> +		[VFIO_DEVICE_STATE_PRE_COPY] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
>  		[VFIO_DEVICE_STATE_STOP_COPY] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> @@ -1639,6 +1697,8 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>  		[VFIO_DEVICE_STATE_RESUMING] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
>  			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> @@ -1647,6 +1707,8 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>  		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> @@ -1655,6 +1717,8 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>  		[VFIO_DEVICE_STATE_ERROR] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
> @@ -1665,6 +1729,11 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>  	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
>  		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
>  		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
> +		[VFIO_DEVICE_STATE_PRE_COPY] =
> +			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY,
> +		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_MIGRATION_STOP_COPY |
> +						   VFIO_MIGRATION_P2P |
> +						   VFIO_MIGRATION_PRE_COPY,
>  		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
>  		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
>  		[VFIO_DEVICE_STATE_RUNNING_P2P] =
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 227f55d57e06..6424c5b3415b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -817,12 +817,20 @@ struct vfio_device_feature {
>   * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
>   * is supported in addition to the STOP_COPY states.
>   *
> + * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
> + * PRE_COPY is supported in addition to the STOP_COPY states.
> + *
> + * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
> + * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
> + * in addition to the STOP_COPY states.
> + *
>   * Other combinations of flags have behavior to be defined in the future.
>   */
>  struct vfio_device_feature_migration {
>  	__aligned_u64 flags;
>  #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
>  #define VFIO_MIGRATION_P2P		(1 << 1)
> +#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
>  };
>  #define VFIO_DEVICE_FEATURE_MIGRATION 1
> 
> @@ -873,8 +881,13 @@ struct vfio_device_feature_mig_state {
>   *  RESUMING - The device is stopped and is loading a new internal state
>   *  ERROR - The device has failed and must be reset
>   *
> - * And 1 optional state to support VFIO_MIGRATION_P2P:
> + * And optional states to support VFIO_MIGRATION_P2P:
>   *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
> + * And VFIO_MIGRATION_PRE_COPY:
> + *  PRE_COPY - The device is running normally but tracking internal state
> + *             changes
> + * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
> + *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
>   *
>   * The FSM takes actions on the arcs between FSM states. The driver implements
>   * the following behavior for the FSM arcs:
> @@ -906,20 +919,48 @@ struct vfio_device_feature_mig_state {
>   *
>   *   To abort a RESUMING session the device must be reset.
>   *
> + * PRE_COPY -> RUNNING
>   * RUNNING_P2P -> RUNNING
>   *   While in RUNNING the device is fully operational, the device may generate
>   *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
>   *   and the device may advance its internal state.
>   *
> + *   The PRE_COPY arc will terminate a data transfer session.
> + *
> + * PRE_COPY_P2P -> RUNNING_P2P
>   * RUNNING -> RUNNING_P2P
>   * STOP -> RUNNING_P2P
>   *   While in RUNNING_P2P the device is partially running in the P2P quiescent
>   *   state defined below.
>   *
> + *   The PRE_COPY arc will terminate a data transfer session.

PRE_COPY_P2P

> + *
> + * RUNNING -> PRE_COPY
> + * RUNNING_P2P -> PRE_COPY_P2P
>   * STOP -> STOP_COPY
> - *   This arc begin the process of saving the device state and will return a
> - *   new data_fd.
> + *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
> + *   which share a data transfer session. Moving between these states alters
> + *   what is streamed in session, but does not terminate or otherwise effect

'effect' -> 'affect'?

> + *   the associated fd.
> + *
> + *   These arcs begin the process of saving the device state and will return a
> + *   new data_fd. The migration driver may perform actions such as enabling
> + *   dirty logging of device state when entering PRE_COPY or PRE_COPY_P2P.
>   *
> + *   Each arc does not change the device operation, the device remains
> + *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
> + *   in PRE_COPY_P2P -> STOP_COPY.
> + *
> + * PRE_COPY -> PRE_COPY_P2P
> + *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
> + *   However, while in the PRE_COPY_P2P state, the device is partially running
> + *   in the P2P quiescent state defined below, like RUNNING_P2P.
> + *
> + * PRE_COPY_P2P -> PRE_COPY
> + *   This arc allows returning the device to a full RUNNING behavior while
> + *   continuing all the behaviors of PRE_COPY.
> + *
> + * PRE_COPY_P2P -> STOP_COPY
>   *   While in the STOP_COPY state the device has the same behavior as STOP
>   *   with the addition that the data transfer session continues to stream the
>   *   migration state. End of stream on the FD indicates the entire device
> @@ -937,6 +978,13 @@ struct vfio_device_feature_mig_state {
>   *   internal device state for this arc if required to prepare the device to
>   *   receive the migration data.
>   *
> + * STOP_COPY -> PRE_COPY
> + * STOP_COPY -> PRE_COPY_P2P
> + *   These arcs are not permitted and return error if requested. Future
> + *   revisions of this API may define behaviors for these arcs, in this case
> + *   support will be discoverable by a new flag in
> + *   VFIO_DEVICE_FEATURE_MIGRATION.
> + *
>   * any -> ERROR
>   *   ERROR cannot be specified as a device state, however any transition request
>   *   can be failed with an errno return and may then move the device_state into
> @@ -948,7 +996,7 @@ struct vfio_device_feature_mig_state {
>   * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
>   * state for the device for the purposes of managing multiple devices within a
>   * user context where peer-to-peer DMA between devices may be active. The
> - * RUNNING_P2P states must prevent the device from initiating
> + * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
>   * any new P2P DMA transactions. If the device can identify P2P transactions
>   * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
>   * driver must complete any such outstanding operations prior to completing the
> @@ -959,6 +1007,8 @@ struct vfio_device_feature_mig_state {
>   * above FSM arcs. As there are multiple paths through the FSM arcs the path
>   * should be selected based on the following rules:
>   *   - Select the shortest path.
> + *   - The path cannot have saving group states as interior arcs, only
> + *     starting/end states.

what about PRE_COPY->PRE_COPY_P2P->STOP_COPY? In this case
PRE_COPY_P2P is used as an interior arc.

and if we disallow a non-saving-group state as an interior arc when both
start and end states are saving-group states (e.g.
STOP_COPY->STOP->RUNNING_P2P->PRE_COPY_P2P as I asked at
the start) then it might be another rule to be specified...


>   * Refer to vfio_mig_get_next_state() for the result of the algorithm.
>   *
>   * The automatic transit through the FSM arcs that make up the combination
> @@ -972,6 +1022,9 @@ struct vfio_device_feature_mig_state {
>   * support them. The user can discover if these states are supported by using
>   * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
>   * avoid knowing about these optional states if the kernel driver supports them.
> + *
> + * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
> + * is not present.

why add this sentence particularly for PRE_COPY? Isn't it already
explained by the last paragraph about optional states?

>   */
>  enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_ERROR = 0,
> @@ -980,8 +1033,57 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_STOP_COPY = 3,
>  	VFIO_DEVICE_STATE_RESUMING = 4,
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> +	VFIO_DEVICE_STATE_PRE_COPY = 6,
> +	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
> +};
> +
> +/**
> + * VFIO_DEVICE_MIG_PRECOPY - _IO(VFIO_TYPE, VFIO_BASE + 21)
> + *
> + * This ioctl is used on the migration data FD in the precopy phase of the
> + * migration data transfer. It returns an estimate of the current data sizes
> + * remaining to be transferred. It allows the user to judge when it is
> + * appropriate to leave PRE_COPY for STOP_COPY.
> + *
> + * initial_bytes reflects the estimated remaining size of any initial mandatory
> + * precopy data transfer. When initial_bytes returns as zero then the initial
> + * phase of the precopy data is completed. Generally initial_bytes should start
> + * out as approximately the entire device state.
> + *
> + * dirty_bytes reflects an estimate for how much more data needs to be
> + * transferred to complete the migration. Generally it should start as zero
> + * and increase as internal state is dirtied.
> + *
> + * Drivers should attempt to return estimates so that initial_bytes +
> + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> + * will require to be streamed.

I didn't understand this requirement. In an immediate transition to
STOP_COPY I expect the amount of data to cover the entire device
state, i.e. initial_bytes. dirty_bytes is dynamic and iteratively returned,
so why do we need to set some expectation on the sum of
initial+round1_dirty+round2_dirty+...

> + *
> + * Drivers have alot of flexibility in when and what they transfer during the

'alot' -> 'a lot'

> + * PRE_COPY phase, and how they report this from VFIO_DEVICE_MIG_PRECOPY.
> + *
> + * During pre-copy the migration data FD has a temporary "end of stream" that is
> + * reached when both initial_bytes and dirty_bytes are zero. For instance, this
> + * may indicate that the device is idle and not currently dirtying any internal
> + * state. When read() is done on this temporary end of stream the kernel driver
> + * should return ENOMSG from read(). Userspace can wait for more data (which may
> + * never come) by using poll.
> + *
> + * Once in STOP_COPY the migration data FD has a permanent end of stream
> + * signaled in the usual way by read() always returning 0 and poll always
> + * returning readable. ENOMSG may not be returned in STOP_COPY. Support
> + * for this ioctl is optional.
> + *
> + * Return: 0 on success, -1 and errno set on failure.
> + */
> +struct vfio_device_mig_precopy {
> +	__u32 argsz;
> +	__u32 flags;
> +	__aligned_u64 initial_bytes;
> +	__aligned_u64 dirty_bytes;
>  };
> 
> +#define VFIO_DEVICE_MIG_PRECOPY _IO(VFIO_TYPE, VFIO_BASE + 21)
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.18.1

Thanks
Kevin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol
  2022-02-07 17:22 [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (14 preceding siblings ...)
  2022-02-07 17:22 ` [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
@ 2022-02-18  8:11 ` Tarun Gupta (SW-GPU)
  15 siblings, 0 replies; 50+ messages in thread
From: Tarun Gupta (SW-GPU) @ 2022-02-18  8:11 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	maorg, ashok.raj, kevin.tian, shameerali.kolothum.thodi, cjia



On 2/7/2022 10:52 PM, Yishai Hadas wrote:
> [full cover letter quoted in the original; snipped here]

We've tested NVIDIA vGPU live migration functionality with the current
v7 proposal and, functionally, it works fine.
We're thinking about further performance optimizations to migrate large
amounts of data, and will propose them later on after working out the details.

Thanks,
Tarun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-18  8:01   ` Tian, Kevin
@ 2022-02-18 14:06     ` Jason Gunthorpe
  2022-02-22  1:43       ` Tian, Kevin
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-18 14:06 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Fri, Feb 18, 2022 at 08:01:47AM +0000, Tian, Kevin wrote:
 
> > A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
> > query the progress of the precopy operation in the driver with the idea it
> > will judge to move to STOP_COPY at least once the initial data set is
> > transferred, and possibly after the dirty size has shrunk appropriately.
> > 
> > We think there may also be merit in future extensions to the
> > VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
> > rate it generates internal dirty state.
> > 
> > Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
> 
> essentially it's *BLOCKED* per following context.

Yes, I suppose now that we have the cap bits rather than the arc discovery
this isn't worded well.
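
For reference, discovering those cap bits is a plain FEATURE GET; a
rough userspace sketch against this series' uAPI:

  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static __u64 mig_get_caps(int device_fd)
  {
          char buf[sizeof(struct vfio_device_feature) +
                   sizeof(struct vfio_device_feature_migration)] = {};
          struct vfio_device_feature *feature = (void *)buf;
          struct vfio_device_feature_migration *mig =
                  (void *)feature->data;

          feature->argsz = sizeof(buf);
          feature->flags = VFIO_DEVICE_FEATURE_GET |
                           VFIO_DEVICE_FEATURE_MIGRATION;
          if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
                  return 0;       /* no migration support at all */
          /* VFIO_MIGRATION_STOP_COPY / _P2P / _PRE_COPY bits */
          return mig->flags;
  }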

> > and to be defined in future. While making the whole PRE_COPY feature
> > optional eliminates the concern from mlx5, this is still a complicated arc
> > to implement and seems prudent to leave it closed until a proper use case
> 
> Can you shed some light on the complexity here?

It is with the data_fd: once a driver enters STOP_COPY it should stuff
its final state into the data_fd. If this is aborted back to PRE_COPY
then the data_fd needs to return to streaming changes. Managing this
transition is not trivial - it is something that has to be signaled to
the receiver.

There is also something of a race here where the data_fd can reach
end-of-stream and then the user can do STOP_COPY->PRE_COPY and
continue stuffing data. This makes the construction of the data stream
framing "interesting" as there is no longer a possible in-band end of
stream marker. See the other discussion about async operation for why
this is not ideal.

Basically, it is behavior current qemu doesn't trigger, and it requires
significant complexity and testing in any driver to support
properly. No driver has proposed to implement it.

> Could a driver pretend supporting PRE_COPY by simply returning both 
> initial_bytes and dirty_bytes as ZERO?

I think so, yes.

> and even if the driver doesn't support the base arc (STOP_COPY->
> PRE_COPY_P2P) what about the combination arc (STOP_COPY->STOP->
> RUNNING_P2P->PRE_COPY_P2P)?

Userspace can walk through this sequence on its own, but it cannot be
part of the FSM because it violates the construction rules. The
data_fd is open in two places.

> the current FSM already allows STOP->RUNNING_P2P->PRE_COPY_P2P and in
> concept STOP_COPY and STOP have the exact same device behavior.

This is allowed because it follows the FSM rules. The data_fd is the
key difference.

> with that combination arc the interim transition from STOP_COPY to
> STOP will terminate the current data stream and RUNNING_P2P to
> PRE_COPY_P2P will return a new data fd. This does violate the definition
> of transitions between the three 'saving group' states, which says
> moving between them does not terminate or otherwise affect the
> associated fd.

Right, and because this happens the VMM would have to terminate the
resuming session as well. Remember, the output of a single saving
data_fd can be sent to a single receiving resuming data_fd - they
cannot be spliced.
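
A sketch of that 1:1 pairing with the transport between the two hosts
elided (invented glue, not uAPI; includes as in the sketch above plus
unistd.h):

  /* Move the entire saving stream into its paired resuming stream */
  static int shuttle(int saving_fd, int resuming_fd)
  {
          char buf[65536];
          ssize_t n;

          while ((n = read(saving_fd, buf, sizeof(buf))) > 0) {
                  char *p = buf;

                  while (n > 0) {
                          ssize_t w = write(resuming_fd, p, n);

                          if (w < 0)
                                  return -errno;
                          p += w;
                          n -= w;
                  }
          }
          return n < 0 ? -errno : 0;
  }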

> > is developed. We also split the pending_bytes report into the initial and
> > sustaining values, and define the protocol to get an event via poll() for
> 
> I guess this split must have been aligned in an earlier discussion, but it's
> still useful if some words can be put here for the motivation. Otherwise one
> could easily ask why not treat the 1st read of pending_bytes as the
> initial size.

As everything here is an estimate, the approach allows the estimates to
be refined as we go along. PRE_COPY can stop at any time, but knowing
that some initial mandatory stage has passed is somewhat consistent with
how qemu seems to treat this.
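
As a sketch of the judgment described above, reusing mig_set_state()
from the earlier sketch (the precopy struct name/layout is assumed from
the patch text, the ioctl is assumed to be issued on the data_fd, and
'threshold' is the VMM's own downtime budget, not part of the uAPI):

  struct vfio_device_mig_precopy precopy = {
          .argsz = sizeof(precopy),
  };

  if (ioctl(data_fd, VFIO_DEVICE_MIG_PRECOPY, &precopy))
          return -errno;
  /* both fields estimate bytes still to be read, never bytes already sent */
  if (precopy.initial_bytes == 0 && precopy.dirty_bytes <= threshold)
          /* initial set transferred, dirty residue small enough: stop */
          mig_set_state(device_fd, VFIO_DEVICE_STATE_STOP_COPY);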

> > @@ -1596,25 +1596,59 @@ int vfio_mig_get_next_state(struct vfio_device *device,
> >  	 *         RUNNING -> STOP
> >  	 *         STOP -> RUNNING
> 
> The comment above should be updated too, which currently says:
> 
> 	 * Without P2P the driver must implement:
> 
> and also move it to the end as it talks about the arcs when neither
> P2P nor PRECOPY is supported.

Yes

> > + * PRE_COPY_P2P -> RUNNING_P2P
> >   * RUNNING -> RUNNING_P2P
> >   * STOP -> RUNNING_P2P
> >   *   While in RUNNING_P2P the device is partially running in the P2P quiescent
> >   *   state defined below.
> >   *
> > + *   The PRE_COPY arc will terminate a data transfer session.
> 
> PRE_COPY_P2P

Yes

> 
> > + *
> > + * RUNNING -> PRE_COPY
> > + * RUNNING_P2P -> PRE_COPY_P2P
> >   * STOP -> STOP_COPY
> > - *   This arc begin the process of saving the device state and will return a
> > - *   new data_fd.
> > + *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
> > + *   which share a data transfer session. Moving between these states alters
> > + *   what is streamed in session, but does not terminate or otherwise effect
> 
> 'effect' -> 'affect'?

yes

> > @@ -959,6 +1007,8 @@ struct vfio_device_feature_mig_state {
> >   * above FSM arcs. As there are multiple paths through the FSM arcs the path
> >   * should be selected based on the following rules:
> >   *   - Select the shortest path.
> > + *   - The path cannot have saving group states as interior arcs, only
> > + *     starting/end states.
> 
> what about PRECOPY->PRECOPY_P2P->STOP_COPY? In this case
> PRECOPY_P2P is used as an interior arc.

It isn't an interior arc because there are only two arcs :) But yes,
it is a bit unclear.

> and if we disallow a non-saving-group state as an interior arc when both 
> start and end states are saving-group states (e.g. 
> STOP_COPY->STOP->RUNNING_P2P->PRE_COPY_P2P as I asked at
> the start) then it might be another rule to be specified...

This isn't a shortest path.

> > @@ -972,6 +1022,9 @@ struct vfio_device_feature_mig_state {
> >   * support them. The user can disocver if these states are supported by using
> >   * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
> >   * avoid knowing about these optional states if the kernel driver supports them.
> > + *
> > + * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
> > + * is not present.
> 
> why add this sentence particularly for PRE_COPY? Isn't it already
> explained by the last paragraph about optional states?

Well, I thought it was clarifying how the optionality is
constructed.

> > + * Drivers should attempt to return estimates so that initial_bytes +
> > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > + * will require to be streamed.
> 
> I didn't understand this requirement. In an immediate transition to
> STOP_COPY I expect the amount of data to cover the entire device
> state, i.e. initial_bytes. dirty_bytes is dynamic and iteratively returned,
> so why do we need to set some expectation on the sum of
> initial+round1_dirty+round2_dirty+...

"will require to be streamed" means additional data from this point
forward, not including anything already sent.

It turns into the estimate of how long STOP_COPY will take.

Jason


* RE: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-18 14:06     ` Jason Gunthorpe
@ 2022-02-22  1:43       ` Tian, Kevin
  2022-02-22 15:50         ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-22  1:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, February 18, 2022 10:06 PM
> 
> 
> > > and to be defined in future. While making the whole PRE_COPY feature
> > > optional eliminates the concern from mlx5, this is still a complicated arc
> > > to implement and seems prudent to leave it closed until a proper use case
> >
> > Can you shed some light on the complexity here?
> 
> It is with the data_fd: once a driver enters STOP_COPY it should stuff
> its final state into the data_fd. If this is aborted back to PRE_COPY
> then the data_fd needs to return to streaming changes. Managing this
> transition is not trivial - it is something that has to be signaled to
> the receiver.
> 
> There is also something of a race here where the data_fd can reach
> end-of-stream and then the user can do STOP_COPY->PRE_COPY and
> continue stuffing data. This makes the construction of the data stream
> framing "interesting" as there is no longer a possible in-band end of
> stream marker. See the other discussion about async operation for why
> this is not ideal.
> 
> Basically, it is behavior that current qemu doesn't trigger and that
> requires significant complexity and testing in any driver to support
> properly. No driver has proposed to support it.

Makes sense.

> 
> > > @@ -959,6 +1007,8 @@ struct vfio_device_feature_mig_state {
> > >   * above FSM arcs. As there are multiple paths through the FSM arcs the path
> > >   * should be selected based on the following rules:
> > >   *   - Select the shortest path.
> > > + *   - The path cannot have saving group states as interior arcs, only
> > > + *     starting/end states.
> >
> > what about PRECOPY->PRECOPY_P2P->STOP_COPY? In this case
> > PRECOPY_P2P is used as an interior arc.
> 
> It isn't an interior arc because there are only two arcs :) But yes,
> it is a bit unclear.
> 
> > and if we disallow a non-saving-group state as an interior arc when both
> > start and end states are saving-group states (e.g.
> > STOP_COPY->STOP->RUNNING_P2P->PRE_COPY_P2P as I asked at
> > the start) then it might be another rule to be specified...
> 
> This isn't a shortest path.

It is the shortest path when the STOP_COPY->PRE_COPY_P2P base arc is
not supported. I guess your earlier explanation about the data_fd
should be the 3rd rule for why that combination arc is not allowed
in the FSM.

> 
> > > @@ -972,6 +1022,9 @@ struct vfio_device_feature_mig_state {
> > >   * support them. The user can disocver if these states are supported by using
> > >   * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
> > >   * avoid knowing about these optional states if the kernel driver supports them.
> > > + *
> > > + * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
> > > + * is not present.
> >
> > why add this sentence particularly for PRE_COPY? Isn't it already
> > explained by the last paragraph about optional states?
> 
> Well, I thought it was clarifying about how the optionality is
> constructed.

The last paragraph already says:

+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can disocver if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.

> 
> > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > > + * will require to be streamed.
> >
> > I didn't understand this requirement. In an immediate transition to
> > STOP_COPY I expect the amount of data to cover the entire device
> > state, i.e. initial_bytes. dirty_bytes is dynamic and iteratively returned,
> > so why do we need to set some expectation on the sum of
> > initial+round1_dirty+round2_dirty+...
> 
> "will require to be streamed" means additional data from this point
> forward, not including anything already sent.
> 
> It turns into the estimate of how long STOP_COPY will take.
> 

I still didn't get the 'match' part. Why should the amount of data which
has already been sent match the additional data to be sent in STOP_COPY?

Thanks
Kevin


* Re: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-22  1:43       ` Tian, Kevin
@ 2022-02-22 15:50         ` Jason Gunthorpe
  2022-02-23  0:40           ` Tian, Kevin
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-22 15:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:

> > > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > > > + * will require to be streamed.
> > >
> > > I didn't understand this requirement. In an immediate transition to
> > > STOP_COPY I expect the amount of data to cover the entire device
> > > state, i.e. initial_bytes. dirty_bytes is dynamic and iteratively returned,
> > > so why do we need to set some expectation on the sum of
> > > initial+round1_dirty+round2_dirty+...
> > 
> > "will require to be streamed" means additional data from this point
> > forward, not including anything already sent.
> > 
> > It turns into the estimate of how long STOP_COPY will take.
> 
> I still didn't get the 'match' part. Why should the amount of data which
> has already been sent match the additional data to be sent in STOP_COPY?

None of it is 'already been sent'; the return values are always 'still
to be sent'.

Jason


* RE: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-22 15:50         ` Jason Gunthorpe
@ 2022-02-23  0:40           ` Tian, Kevin
  2022-02-23  0:44             ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Tian, Kevin @ 2022-02-23  0:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 22, 2022 11:51 PM
> 
> On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:
> 
> > > > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > > > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > > > > + * will require to be streamed.
> > > >
> > > > I didn't understand this requirement. In an immediate transition to
> > > > STOP_COPY I expect the amount of data to cover the entire device
> > > > state, i.e. initial_bytes. dirty_bytes is dynamic and iteratively returned,
> > > > so why do we need to set some expectation on the sum of
> > > > initial+round1_dirty+round2_dirty+...
> > >
> > > "will require to be streamed" means additional data from this point
> > > forward, not including anything already sent.
> > >
> > > It turns into the estimate of how long STOP_COPY will take.
> >
> > I still didn't get the 'match' part. Why should the amount of data which
> > has already been sent match the additional data to be sent in STOP_COPY?
> 
> None of it is 'already been sent'; the return values are always 'still
> to be sent'.
> 

Reread the description:

+ * Drivers should attempt to return estimates so that initial_bytes +
+ * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
+ * will require to be streamed.

I guess you intended to mean that when EITHER initial_bytes OR
dirty_bytes is read, the returned value should match the amount 
of data as described above. It is the "+" which confused me into 
thinking of it as a sum of both numbers...

Thanks
Kevin


* Re: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-23  0:40           ` Tian, Kevin
@ 2022-02-23  0:44             ` Jason Gunthorpe
  2022-02-23  1:46               ` Tian, Kevin
  0 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2022-02-23  0:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

On Wed, Feb 23, 2022 at 12:40:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, February 22, 2022 11:51 PM
> > 
> > On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:
> > 
> > > > > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > > > > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > > > > > + * will require to be streamed.
> > > > >
> > > > > I didn't understand this requirement. In an immediate transition to
> > > > > STOP_COPY I expect the amount of data to cover the entire device
> > > > > state, i.e. initial_bytes. dirty_bytes is dynamic and iteratively returned,
> > > > > so why do we need to set some expectation on the sum of
> > > > > initial+round1_dirty+round2_dirty+...
> > > >
> > > > "will require to be streamed" means additional data from this point
> > > > forward, not including anything already sent.
> > > >
> > > > It turns into the estimate of how long STOP_COPY will take.
> > >
> > > I still didn't get the 'match' part. Why should the amount of data which
> > > has already been sent match the additional data to be sent in STOP_COPY?
> > 
> > None of it is 'already been sent'; the return values are always 'still
> > to be sent'.
> > 
> 
> Reread the description:
> 
> + * Drivers should attempt to return estimates so that initial_bytes +
> + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> + * will require to be streamed.
> 
> > I guess you intended to mean that when EITHER initial_bytes OR
> > dirty_bytes is read, the returned value should match the amount
> > of data as described above. It is the "+" which confused me into
> > thinking of it as a sum of both numbers...

It is the sum.

initial_bytes declines as the data is transferred. Once everything is
read out the sum will be 0.
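
A worked example with invented numbers: at the start of PRE_COPY the
driver might report initial_bytes = 100MB and dirty_bytes = 0, i.e. an
immediate STOP_COPY would stream about 100MB. After the user reads 80MB
while the device re-dirties 5MB of state that was already read out, a
fresh query would report roughly initial_bytes = 20MB and dirty_bytes =
5MB; the sum, 25MB, is still the estimated cost of an immediate
STOP_COPY from that point. Keep draining and the sum trends to 0.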

Jason


* RE: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-02-23  0:44             ` Jason Gunthorpe
@ 2022-02-23  1:46               ` Tian, Kevin
  0 siblings, 0 replies; 50+ messages in thread
From: Tian, Kevin @ 2022-02-23  1:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg, Raj, Ashok,
	shameerali.kolothum.thodi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 23, 2022 8:45 AM
> 
> On Wed, Feb 23, 2022 at 12:40:58AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, February 22, 2022 11:51 PM
> > >
> > > On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:
> > >
> > > > > > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > > > > > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > > > > > > + * will require to be streamed.
> > > > > >
> > > > > > I didn't understand this requirement. In an immediate transition to
> > > > > > STOP_COPY I expect the amount of data to cover the entire device
> > > > > > state, i.e. initial_bytes. dirty_bytes is dynamic and iteratively returned,
> > > > > > so why do we need to set some expectation on the sum of
> > > > > > initial+round1_dirty+round2_dirty+...
> > > > >
> > > > > "will require to be streamed" means additional data from this point
> > > > > forward, not including anything already sent.
> > > > >
> > > > > It turns into the estimate of how long STOP_COPY will take.
> > > >
> > > > I still didn't get the 'match' part. Why should the amount of data which
> > > > has already been sent match the additional data to be sent in
> STOP_COPY?
> > >
> > > None of it is 'already been sent'; the return values are always 'still
> > > to be sent'.
> > >
> >
> > Reread the description:
> >
> > + * Drivers should attempt to return estimates so that initial_bytes +
> > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > + * will require to be streamed.
> >
> > I guess you intended to mean that when EITHER initial_bytes OR
> > dirty_bytes is read, the returned value should match the amount
> > of data as described above. It is the "+" which confused me into
> > thinking of it as a sum of both numbers...
> 
> It is the sum.
> 
> initial_bytes declines as the data is transferred. Once everything is
> read out the sum will be 0.
> 

That is the point which I overlooked (I was under the impression that
initial_bytes is static). As explained in the code comment, 'initial' here
means the initial phase of precopy rather than a static number for the
entire device state. During the initial precopy phase dirty_bytes should
not count any state which hasn't been transmitted yet, so the sum of both
numbers reflects the accurate size of the remaining bytes to be
transmitted. Once the initial phase is completed initial_bytes is always
ZERO and dirty_bytes alone represents the remaining bytes. 😊

Thanks
Kevin


end of thread

Thread overview: 50+ messages
2022-02-07 17:22 [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
2022-02-09  0:07   ` Alex Williamson
2022-02-09  2:36     ` Jason Gunthorpe
2022-02-15 10:41       ` Tian, Kevin
2022-02-15 16:04         ` Jason Gunthorpe
2022-02-15 23:32           ` Alex Williamson
2022-02-16  1:17             ` Jason Gunthorpe
2022-02-16  3:17           ` Tian, Kevin
2022-02-16 12:14             ` Jason Gunthorpe
2022-02-17  2:29               ` Tian, Kevin
2022-02-15 10:58       ` Tian, Kevin
2022-02-15 13:13         ` Jason Gunthorpe
2022-02-15  8:04   ` Tian, Kevin
2022-02-15 15:33     ` Jason Gunthorpe
2022-02-16  3:04       ` Tian, Kevin
2022-02-07 17:22 ` [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
2022-02-15 10:18   ` Tian, Kevin
2022-02-15 15:56     ` Jason Gunthorpe
2022-02-16  2:52       ` Tian, Kevin
2022-02-16 12:11         ` Jason Gunthorpe
2022-02-07 17:22 ` [PATCH V7 mlx5-next 10/15] vfio: Remove migration protocol v1 documentation Yishai Hadas
2022-02-11 11:03   ` Cornelia Huck
2022-02-07 17:22 ` [PATCH V7 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2022-02-09  0:07   ` Alex Williamson
2022-02-07 17:22 ` [PATCH V7 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
2022-02-09  0:08   ` Alex Williamson
2022-02-09  2:39     ` Jason Gunthorpe
2022-02-10 16:48       ` Alex Williamson
2022-02-10 17:27         ` Jason Gunthorpe
2022-02-07 17:22 ` [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
2022-02-17 17:15   ` Alex Williamson
2022-02-18  0:03     ` Jason Gunthorpe
2022-02-18  8:01   ` Tian, Kevin
2022-02-18 14:06     ` Jason Gunthorpe
2022-02-22  1:43       ` Tian, Kevin
2022-02-22 15:50         ` Jason Gunthorpe
2022-02-23  0:40           ` Tian, Kevin
2022-02-23  0:44             ` Jason Gunthorpe
2022-02-23  1:46               ` Tian, Kevin
2022-02-18  8:11 ` [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Tarun Gupta (SW-GPU)
