* [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol
@ 2022-01-30 16:08 Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
                   ` (14 more replies)
  0 siblings, 15 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

This series adds an mlx5 live migration driver for VFs that are migration
capable, and includes the v2 migration protocol definition along with its
mlx5 implementation.

The mlx5 driver uses the vfio_pci_core split to create a specific VFIO
PCI driver that matches the mlx5 virtual functions. The driver provides
the same experience as normal vfio-pci with the addition of migration
support.

In HW the migration is controlled by the PF function, using its
mlx5_core driver, and the VFIO PCI VF driver co-ordinates with the PF to
execute the migration actions.

The bulk of the v2 migration protocol is semantically the same as v1,
however it has been recast into an FSM for the device_state, and the
actual syscall interface uses normal ioctl(), read() and write() instead
of building a syscall interface using the region.
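
For illustration only, a rough userspace sketch of the new save flow,
assuming the uapi introduced later in this series (device_fd is an open
vfio device FD and save_to_stream() is a hypothetical helper):

	struct vfio_device_mig_set_state set_state = {
		.argsz = sizeof(set_state),
		.device_state = VFIO_DEVICE_STATE_STOP_COPY,
	};
	char buf[4096];
	ssize_t n;

	/* The core elaborates RUNNING -> STOP -> STOP_COPY as needed */
	if (ioctl(device_fd, VFIO_DEVICE_MIG_SET_STATE, &set_state))
		return -1;	/* set_state.device_state holds the state reached */

	/* The arc into STOP_COPY opened a data transfer session FD */
	while ((n = read(set_state.data_fd, buf, sizeof(buf))) > 0)
		save_to_stream(buf, n);		/* hypothetical helper */
	close(set_state.data_fd);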

Several bits of infrastructure work are included here:
 - pci_iov_vf_id() to help drivers like mlx5 figure out the VF index from
   a BDF
 - pci_iov_get_pf_drvdata() to clarify the tricky locking protocol when a
   VF reaches into its PF's driver
 - mlx5_core uses the normal SRIOV lifecycle and disables SRIOV before
   driver remove, to be compatible with pci_iov_get_pf_drvdata()
 - Lifting VFIO_DEVICE_FEATURE into core VFIO code

This series comes after a lot of discussion. Some major points:
- v1 ABI compatible migration defined using the same FSM approach:
   https://lore.kernel.org/all/0-v1-a4f7cab64938+3f-vfio_mig_states_jgg@nvidia.com/
- Attempts to clarify how the v1 API works:
   Alex's:
     https://lore.kernel.org/kvm/163909282574.728533.7460416142511440919.stgit@omen/
   Jason's:
     https://lore.kernel.org/all/0-v3-184b374ad0a8+24c-vfio_mig_doc_jgg@nvidia.com/
- Etherpad exploring the scope and questions of general VFIO migration:
     https://lore.kernel.org/kvm/87mtm2loml.fsf@redhat.com/

NOTE: As this series touches mlx5_core parts, we need to send it in a
pull request format to VFIO to avoid conflicts.

Matching qemu changes can be previewed here:
 https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2

Changes from V5: https://lore.kernel.org/kvm/20211027095658.144468-1-yishaih@nvidia.com/
vfio:
- Migration protocol v2:
  + enum for device state, not bitmap
  + ioctl to manipulate device_state, not a region
  + Only STOP_COPY is mandatory, P2P and PRE_COPY are optional, discovered
    via VFIO_DEVICE_FEATURE
  + Migration data transfer is done via dedicated FD
- VFIO core code to implement the migration related ioctls and help
  drivers implement it correctly
- VFIO_DEVICE_FEATURE refactor
- Delete migration protocol v1, drop the patches fixing it
- Drop "vfio/pci_core: Make the region->release() function optional"
vfio/mlx5:
- Switch to use migration v2 protocol, with core helpers
- Eliminate the region implementation

Changes from V4: https://lore.kernel.org/kvm/20211026090605.91646-1-yishaih@nvidia.com/
vfio:
- Add some Reviewed-by.
- Rename to vfio_pci_core_aer_err_detected() as Alex asked.
vfio/mlx5:
- Enter the error state only if unquiesce also fails.
- Fix some typos.
- Use the multi-line comment style as in drivers/vfio.

Changes from V3: https://lore.kernel.org/kvm/20211024083019.232813-1-yishaih@nvidia.com/
vfio/mlx5:
- Align with mlx5 latest specification to create the MKEY with full read
  write permissions.
- Fix unlock ordering in mlx5vf_state_mutex_unlock() to prevent some
  race.

Changes from V2: https://lore.kernel.org/kvm/20211019105838.227569-1-yishaih@nvidia.com/
vfio:
- Add and use the new macro VFIO_DEVICE_STATE_SET_ERROR as Alex asked.
vfio/mlx5:
- Improve/fix state checking as was asked by Alex & Jason.
- Let things be done in a deterministic way upon 'reset_done', following
  the algorithm suggested by Jason.
- Align with mlx5 latest specification when calling the SAVE command.
- Fix some typos.
vdpa/mlx5:
- Drop the patch from the series based on the discussion in the mailing
  list.

Changes from V1: https://lore.kernel.org/kvm/20211013094707.163054-1-yishaih@nvidia.com/
PCI/IOV:
- Name the actual interface in the subject, as Bjorn asked, and add his
  Acked-by.
- Move to check explicitly for !dev->is_virtfn as was asked by Alex.
vfio:
- Add a separate patch to fix the non-compiling
  VFIO_DEVICE_STATE_SET_ERROR macro.
- Expose vfio_pci_aer_err_detected() to be set by drivers in their own
  PCI error handlers.
- Add a macro for VFIO_DEVICE_STATE_ERROR in the uapi header file as was
  suggested by Alex.
vfio/mlx5:
- Use XOR as part of checking the 'state' change command, as suggested
  by Alex.
- Set the state to VFIO_DEVICE_STATE_ERROR when an error occurs, instead
  of VFIO_DEVICE_STATE_INVALID.
- Improve state checking as suggested by Jason.
- Use its own PCI reset_done error handler as suggested by Jason, and fix
  the locking scheme around the state mutex to work properly.

Changes from V0: https://lore.kernel.org/kvm/cover.1632305919.git.leonro@nvidia.com/
PCI/IOV:
- Add an API (i.e. pci_iov_get_pf_drvdata()) that allows SR-IOV VF
  drivers to reach the drvdata of a PF.
mlx5_core:
- Add an extra patch to disable SRIOV before PF removal.
- Adapt to use the above PCI/IOV API as part of mlx5_vf_get_core_dev().
- Reuse the exported PCI/IOV virtfn index function call (i.e.
  pci_iov_vf_id()).
vfio:
- Add support in pci_core to let a driver be notified upon 'reset_done'
  so it can set its internal state accordingly.
- Add some helper stuff for 'invalid' state handling.
mlx5_vfio_pci:
- Move to use the 'command mode' instead of the 'state machine' scheme,
  as discussed in the mailing list.
- Handle the RESET scenario when called by vfio_pci_core, setting its
  internal state accordingly.
- Set initial state as RUNNING.
- Put the driver files as sub-folder under drivers/vfio/pci named mlx5
  and update MAINTAINER file as was asked.
vdpa_mlx5:
Add a new patch to use mlx5_vf_get_core_dev() to get the PF device.

Jason Gunthorpe (7):
  PCI/IOV: Add pci_iov_vf_id() to get VF index
  PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata
    of a PF
  vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
  vfio: Define device migration protocol v2
  vfio: Extend the device migration protocol with RUNNING_P2P
  vfio: Remove migration protocol v1
  vfio: Extend the device migration protocol with PRE_COPY

Leon Romanovsky (1):
  net/mlx5: Reuse exported virtfn index function call

Yishai Hadas (7):
  net/mlx5: Disable SRIOV before PF removal
  net/mlx5: Expose APIs to get/put the mlx5 core device
  net/mlx5: Introduce migration bits and structures
  vfio/mlx5: Expose migration commands over mlx5 device
  vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  vfio/pci: Expose vfio_pci_core_aer_err_detected()
  vfio/mlx5: Use its own PCI reset_done error handler

 MAINTAINERS                                   |   6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  45 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   1 +
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  17 +-
 drivers/pci/iov.c                             |  43 ++
 drivers/vfio/pci/Kconfig                      |   3 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/mlx5/Kconfig                 |  10 +
 drivers/vfio/pci/mlx5/Makefile                |   4 +
 drivers/vfio/pci/mlx5/cmd.c                   | 252 +++++++
 drivers/vfio/pci/mlx5/cmd.h                   |  36 +
 drivers/vfio/pci/mlx5/main.c                  | 664 ++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c                   |   1 +
 drivers/vfio/pci/vfio_pci_core.c              |  97 +--
 drivers/vfio/vfio.c                           | 346 ++++++++-
 include/linux/mlx5/driver.h                   |   3 +
 include/linux/mlx5/mlx5_ifc.h                 | 147 +++-
 include/linux/pci.h                           |  15 +-
 include/linux/vfio.h                          |  43 ++
 include/linux/vfio_pci_core.h                 |   4 +
 include/uapi/linux/vfio.h                     | 516 ++++++++------
 21 files changed, 1946 insertions(+), 309 deletions(-)
 create mode 100644 drivers/vfio/pci/mlx5/Kconfig
 create mode 100644 drivers/vfio/pci/mlx5/Makefile
 create mode 100644 drivers/vfio/pci/mlx5/cmd.c
 create mode 100644 drivers/vfio/pci/mlx5/cmd.h
 create mode 100644 drivers/vfio/pci/mlx5/main.c

-- 
2.18.1



* [PATCH V6 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

The PCI core uses the VF index internally, often called the vf_id,
during the setup of the VF, e.g. in pci_iov_add_virtfn().

This index is needed for device drivers that implement live migration
for their internal operations that configure/control their VFs.

Specifically, the mlx5_vfio_pci driver that is introduced in coming
patches of this series needs it, rather than the bus/device/function
which is exposed today.

Add pci_iov_vf_id() which computes the vf_id by reversing the math that
was used to create the bus/device/function.
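
For example (hypothetical values): with the PF at bus 0x3b, devfn 0x0,
First VF Offset 0x1 and VF Stride 0x1, the VF with routing ID 0x3b03
gets vf_id = ((0x3b << 8) + 0x03 - ((0x3b << 8) + 0x0 + 0x1)) / 0x1 = 2.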

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/pci/iov.c   | 14 ++++++++++++++
 include/linux/pci.h |  8 +++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0267977c9f17..2e9f3d70803a 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 }
 EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
 
+int pci_iov_vf_id(struct pci_dev *dev)
+{
+	struct pci_dev *pf;
+
+	if (!dev->is_virtfn)
+		return -EINVAL;
+
+	pf = pci_physfn(dev);
+	return (((dev->bus->number << 8) + dev->devfn) -
+		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
+	       pf->sriov->stride;
+}
+EXPORT_SYMBOL_GPL(pci_iov_vf_id);
+
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
  * change when NumVFs changes.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8253a5413d7c..3d4ff7b35ad1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2166,7 +2166,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 #ifdef CONFIG_PCI_IOV
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
-
+int pci_iov_vf_id(struct pci_dev *dev);
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 
@@ -2194,6 +2194,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 {
 	return -ENOSYS;
 }
+
+static inline int pci_iov_vf_id(struct pci_dev *dev)
+{
+	return -ENOSYS;
+}
+
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 
-- 
2.18.1



* [PATCH V6 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Leon Romanovsky <leonro@nvidia.com>

Instead of open-coded iteration to compare the virtfn internal index,
use the newly introduced pci_iov_vf_id() call.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index e8185b69ac6c..24c4b4f05214 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -205,19 +205,8 @@ int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
 			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
 
 	sriov = &dev->priv.sriov;
-
-	/* Reversed translation of PCI VF function number to the internal
-	 * function_id, which exists in the name of virtfn symlink.
-	 */
-	for (id = 0; id < pci_num_vf(pf); id++) {
-		if (!sriov->vfs_ctx[id].enabled)
-			continue;
-
-		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
-			break;
-	}
-
-	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+	id = pci_iov_vf_id(vf);
+	if (id < 0 || !sriov->vfs_ctx[id].enabled)
 		return -EINVAL;
 
 	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
-- 
2.18.1



* [PATCH V6 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Virtual functions depend on the physical function for device access (for
example firmware host PAGE management), so make sure to disable SRIOV
before the PF is gone.

This also prevents the warning below if the PF is removed before SRIOV
is disabled:
"driver left SR-IOV enabled after remove"

The next patch in this series relies on this when the VF may need to
safely access the PF 'driver data'.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c      | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c     | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 2c774f367199..5b8958186157 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1620,6 +1620,7 @@ static void remove_one(struct pci_dev *pdev)
 	struct devlink *devlink = priv_to_devlink(dev);
 
 	devlink_unregister(devlink);
+	mlx5_sriov_disable(pdev);
 	mlx5_crdump_disable(dev);
 	mlx5_drain_health_wq(dev);
 	mlx5_uninit_one(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 6f8baa0f2a73..37b2805b3bf3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -164,6 +164,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+void mlx5_sriov_disable(struct pci_dev *pdev);
 int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 24c4b4f05214..887ee0f729d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -161,7 +161,7 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 	return err;
 }
 
-static void mlx5_sriov_disable(struct pci_dev *pdev)
+void mlx5_sriov_disable(struct pci_dev *pdev)
 {
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
 	int num_vfs = pci_num_vf(dev->pdev);
-- 
2.18.1



* [PATCH V6 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (2 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

There are some cases where an SR-IOV VF driver will need to reach into and
interact with the PF driver. This requires accessing the drvdata of the PF.

Provide a function pci_iov_get_pf_drvdata() to return this PF drvdata in a
safe way. Normally accessing a drvdata of a foreign struct device would be
done using the device_lock() to protect against device driver
probe()/remove() races.

However, due to the design of pci_enable_sriov() this will result in an
ABBA deadlock on the device_lock, as the PF's device_lock is held during
PF sriov_configure() while calling pci_enable_sriov(), which in turn holds
the VF's device_lock while calling VF probe(), and similarly for remove.

This means the VF driver can never obtain the PF's device_lock.

Instead use the implicit locking created by pci_enable/disable_sriov(). A
VF driver can access its PF drvdata only while its own driver is attached,
and the PF driver can control access to its own drvdata based on when it
calls pci_enable/disable_sriov().

To use this API the PF driver will setup the PF drvdata in the probe()
function. pci_enable_sriov() is only called from sriov_configure() which
cannot happen until probe() completes, ensuring no VF races with drvdata
setup.

For removal, the PF driver must call pci_disable_sriov() in its remove
function before destroying any of the drvdata. This ensures that all VF
drivers are unbound before returning, fencing concurrent access to the
drvdata.

The introduction of a new function to do this access makes the special
locking scheme clear and documents the requirements on the PF/VF drivers
using it.
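
A minimal sketch of the contract described above (the driver names and
the pf_priv structure are hypothetical):

	static int pf_probe(struct pci_dev *pdev)
	{
		struct pf_priv *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

		if (!priv)
			return -ENOMEM;
		/* drvdata is valid before sriov_configure() can run */
		pci_set_drvdata(pdev, priv);
		return 0;
	}

	static void pf_remove(struct pci_dev *pdev)
	{
		/* Unbinds all VF drivers before returning, fencing the drvdata */
		pci_disable_sriov(pdev);
		kfree(pci_get_drvdata(pdev));
	}

	/* VF side, only valid while the VF driver is attached: */
	static int vf_touch_pf(struct pci_dev *vf_pdev)
	{
		struct pf_priv *priv =
			pci_iov_get_pf_drvdata(vf_pdev, &pf_driver);

		if (IS_ERR(priv))
			return PTR_ERR(priv);
		/* safe to use priv until this driver's remove() completes */
		return 0;
	}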

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/pci/iov.c   | 29 +++++++++++++++++++++++++++++
 include/linux/pci.h |  7 +++++++
 2 files changed, 36 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 2e9f3d70803a..28ec952e1221 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -47,6 +47,35 @@ int pci_iov_vf_id(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_iov_vf_id);
 
+/**
+ * pci_iov_get_pf_drvdata - Return the drvdata of a PF
+ * @dev - VF pci_dev
+ * @pf_driver - Device driver required to own the PF
+ *
+ * This must be called from a context that ensures that a VF driver is attached.
+ * The value returned is invalid once the VF driver completes its remove()
+ * callback.
+ *
+ * Locking is achieved by the driver core. A VF driver cannot be probed until
+ * pci_enable_sriov() is called and pci_disable_sriov() does not return until
+ * all VF drivers have completed their remove().
+ *
+ * The PF driver must call pci_disable_sriov() before it begins to destroy the
+ * drvdata.
+ */
+void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver)
+{
+	struct pci_dev *pf_dev;
+
+	if (!dev->is_virtfn)
+		return ERR_PTR(-EINVAL);
+	pf_dev = dev->physfn;
+	if (pf_dev->driver != pf_driver)
+		return ERR_PTR(-EINVAL);
+	return pci_get_drvdata(pf_dev);
+}
+EXPORT_SYMBOL_GPL(pci_iov_get_pf_drvdata);
+
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
  * change when NumVFs changes.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3d4ff7b35ad1..60d423d8f0c4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2167,6 +2167,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
 int pci_iov_vf_id(struct pci_dev *dev);
+void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver);
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 
@@ -2200,6 +2201,12 @@ static inline int pci_iov_vf_id(struct pci_dev *dev)
 	return -ENOSYS;
 }
 
+static inline void *pci_iov_get_pf_drvdata(struct pci_dev *dev,
+					   struct pci_driver *pf_driver)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 
-- 
2.18.1



* [PATCH V6 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (3 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Expose an API to get the mlx5 core device from a given VF PCI device if
mlx5_core is its driver.

The get API leaves the intf_state_mutex locked to make sure that the
device can't be gone/unloaded until the caller completes its job over
the device; any flow that takes the lock is expected to hold it only
for a short period of time.

The put API unlocks the intf_state_mutex.

The use case for these APIs is the migration flow of a VF over VFIO PCI.
In that case the VF doesn't ride on mlx5_core, because the migration
involves *two* different PCI devices: the PF owned by mlx5_core and the
VF owned by the vfio driver.

The mlx5_core of the PF is accessed only during the narrow window of the
VF's ioctl that requires its services.

This allows the PF driver to be more independent of the VF driver, so
long as it doesn't reset the FW.
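
A minimal sketch of the intended calling pattern from the VF side
(error handling abbreviated; issue_migration_command() is a placeholder):

	struct mlx5_core_dev *mdev;
	int err;

	mdev = mlx5_vf_get_core_dev(vf_pdev);
	if (!mdev)
		return -ENOTCONN;

	/* intf_state_mutex is held here; keep this window short */
	err = issue_migration_command(mdev);	/* placeholder */

	mlx5_vf_put_core_dev(mdev);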

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 44 +++++++++++++++++++
 include/linux/mlx5/driver.h                   |  3 ++
 2 files changed, 47 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 5b8958186157..e9aeba4267ff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1881,6 +1881,50 @@ static struct pci_driver mlx5_core_driver = {
 	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };
 
+/**
+ * mlx5_vf_get_core_dev - Get the mlx5 core device from a given VF PCI device if
+ *                     mlx5_core is its driver.
+ * @pdev: The associated PCI device.
+ *
+ * Upon return the interface state lock stays held to let the caller use it
+ * safely. The caller must use the returned mlx5 device only for a narrow
+ * window and must put it back with mlx5_vf_put_core_dev() immediately once
+ * usage is over.
+ *
+ * Return: Pointer to the associated mlx5_core_dev or NULL.
+ */
+struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev)
+			__acquires(&mdev->intf_state_mutex)
+{
+	struct mlx5_core_dev *mdev;
+
+	mdev = pci_iov_get_pf_drvdata(pdev, &mlx5_core_driver);
+	if (IS_ERR(mdev))
+		return NULL;
+
+	mutex_lock(&mdev->intf_state_mutex);
+	if (!test_bit(MLX5_INTERFACE_STATE_UP, &mdev->intf_state)) {
+		mutex_unlock(&mdev->intf_state_mutex);
+		return NULL;
+	}
+
+	return mdev;
+}
+EXPORT_SYMBOL(mlx5_vf_get_core_dev);
+
+/**
+ * mlx5_vf_put_core_dev - Put the mlx5 core device back.
+ * @mdev: The mlx5 core device.
+ *
+ * Upon return the interface state lock is unlocked and caller should not
+ * access the mdev any more.
+ */
+void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev)
+			__releases(&mdev->intf_state_mutex)
+{
+	mutex_unlock(&mdev->intf_state_mutex);
+}
+EXPORT_SYMBOL(mlx5_vf_put_core_dev);
+
 static void mlx5_core_verify_params(void)
 {
 	if (prof_sel >= ARRAY_SIZE(profile)) {
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 78655d8d13a7..319322a8ff94 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1143,6 +1143,9 @@ int mlx5_dm_sw_icm_alloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 			   u64 length, u16 uid, phys_addr_t addr, u32 obj_id);
 
+struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev);
+void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev);
+
 #ifdef CONFIG_MLX5_CORE_IPOIB
 struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
 					  struct ib_device *ibdev,
-- 
2.18.1



* [PATCH V6 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (4 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Introduce migration IFC related stuff to enable migration commands.
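
As a hedged illustration of how these IFC structures are expected to be
driven (following the common MLX5_SET()/mlx5_cmd_exec() pattern; this
exact helper is not part of this patch):

	static int suspend_vhca(struct mlx5_core_dev *mdev, u16 vhca_id,
				u16 op_mod)
	{
		u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {};
		u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {};

		MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA);
		MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id);
		MLX5_SET(suspend_vhca_in, in, op_mod, op_mod);

		return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
	}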

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 147 +++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 598ac3bcc901..45891a75c5ca 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -127,6 +127,11 @@ enum {
 	MLX5_CMD_OP_QUERY_SF_PARTITION            = 0x111,
 	MLX5_CMD_OP_ALLOC_SF                      = 0x113,
 	MLX5_CMD_OP_DEALLOC_SF                    = 0x114,
+	MLX5_CMD_OP_SUSPEND_VHCA                  = 0x115,
+	MLX5_CMD_OP_RESUME_VHCA                   = 0x116,
+	MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE    = 0x117,
+	MLX5_CMD_OP_SAVE_VHCA_STATE               = 0x118,
+	MLX5_CMD_OP_LOAD_VHCA_STATE               = 0x119,
 	MLX5_CMD_OP_CREATE_MKEY                   = 0x200,
 	MLX5_CMD_OP_QUERY_MKEY                    = 0x201,
 	MLX5_CMD_OP_DESTROY_MKEY                  = 0x202,
@@ -1757,7 +1762,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         reserved_at_682[0x1];
 	u8         log_max_sf[0x5];
 	u8         apu[0x1];
-	u8         reserved_at_689[0x7];
+	u8         reserved_at_689[0x4];
+	u8         migration[0x1];
+	u8         reserved_at_68e[0x2];
 	u8         log_min_sf_size[0x8];
 	u8         max_num_sf_partitions[0x8];
 
@@ -11519,4 +11526,142 @@ enum {
 	MLX5_MTT_PERM_RW	= MLX5_MTT_PERM_READ | MLX5_MTT_PERM_WRITE,
 };
 
+enum {
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER  = 0x0,
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE   = 0x1,
+};
+
+struct mlx5_ifc_suspend_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_suspend_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+enum {
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE   = 0x0,
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER  = 0x1,
+};
+
+struct mlx5_ifc_resume_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_resume_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+
+	u8         required_umem_size[0x20];
+
+	u8         reserved_at_a0[0x160];
+};
+
+struct mlx5_ifc_save_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_save_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         actual_image_size[0x20];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_load_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_load_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
 #endif /* MLX5_IFC_H */
-- 
2.18.1



* [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (5 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-31 23:41   ` Alex Williamson
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

Invoke a new device op 'device_feature' to handle just the data array
portion of the command. This lifts the ioctl validation to the core code
and makes it simpler for either the core code, or layered drivers, to
implement their own feature values.

Provide vfio_check_feature() to consolidate checking the flags/etc against
what the driver supports.
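
A minimal sketch of a driver's device_feature op using the helper
(MY_FEATURE and the returned value are hypothetical):

	static int my_device_feature(struct vfio_device *device, u32 flags,
				     void __user *arg, size_t argsz)
	{
		__u64 val = 0;		/* hypothetical feature data */
		int ret;

		switch (flags & VFIO_DEVICE_FEATURE_MASK) {
		case MY_FEATURE:	/* hypothetical feature number */
			ret = vfio_check_feature(flags, argsz,
						 VFIO_DEVICE_FEATURE_GET,
						 sizeof(val));
			if (ret != 1)	/* 0 for PROBE, -errno on bad input */
				return ret;
			if (copy_to_user(arg, &val, sizeof(val)))
				return -EFAULT;
			return 0;
		default:
			return -ENOTTY;
		}
	}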

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/vfio_pci.c      |  1 +
 drivers/vfio/pci/vfio_pci_core.c | 90 ++++++++++++--------------------
 drivers/vfio/vfio.c              | 46 ++++++++++++++--
 include/linux/vfio.h             | 32 ++++++++++++
 include/linux/vfio_pci_core.h    |  2 +
 5 files changed, 109 insertions(+), 62 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index a5ce92beb655..2b047469e02f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -130,6 +130,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.open_device	= vfio_pci_open_device,
 	.close_device	= vfio_pci_core_close_device,
 	.ioctl		= vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
 	.read		= vfio_pci_core_read,
 	.write		= vfio_pci_core_write,
 	.mmap		= vfio_pci_core_mmap,
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f948e6cd2993..14a22ff20ef8 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1114,70 +1114,44 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 
 		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
 					  ioeventfd.data, count, ioeventfd.fd);
-	} else if (cmd == VFIO_DEVICE_FEATURE) {
-		struct vfio_device_feature feature;
-		uuid_t uuid;
-
-		minsz = offsetofend(struct vfio_device_feature, flags);
-
-		if (copy_from_user(&feature, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (feature.argsz < minsz)
-			return -EINVAL;
-
-		/* Check unknown flags */
-		if (feature.flags & ~(VFIO_DEVICE_FEATURE_MASK |
-				      VFIO_DEVICE_FEATURE_SET |
-				      VFIO_DEVICE_FEATURE_GET |
-				      VFIO_DEVICE_FEATURE_PROBE))
-			return -EINVAL;
-
-		/* GET & SET are mutually exclusive except with PROBE */
-		if (!(feature.flags & VFIO_DEVICE_FEATURE_PROBE) &&
-		    (feature.flags & VFIO_DEVICE_FEATURE_SET) &&
-		    (feature.flags & VFIO_DEVICE_FEATURE_GET))
-			return -EINVAL;
-
-		switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
-		case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-			if (!vdev->vf_token)
-				return -ENOTTY;
-
-			/*
-			 * We do not support GET of the VF Token UUID as this
-			 * could expose the token of the previous device user.
-			 */
-			if (feature.flags & VFIO_DEVICE_FEATURE_GET)
-				return -EINVAL;
-
-			if (feature.flags & VFIO_DEVICE_FEATURE_PROBE)
-				return 0;
-
-			/* Don't SET unless told to do so */
-			if (!(feature.flags & VFIO_DEVICE_FEATURE_SET))
-				return -EINVAL;
+	}
+	return -ENOTTY;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-			if (feature.argsz < minsz + sizeof(uuid))
-				return -EINVAL;
+int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
+				void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	uuid_t uuid;
+	int ret;
 
-			if (copy_from_user(&uuid, (void __user *)(arg + minsz),
-					   sizeof(uuid)))
-				return -EFAULT;
+	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
+	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
+		if (!vdev->vf_token)
+			return -ENOTTY;
+		/*
+		 * We do not support GET of the VF Token UUID as this could
+		 * expose the token of the previous device user.
+		 */
+		ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+					sizeof(uuid));
+		if (ret != 1)
+			return ret;
 
-			mutex_lock(&vdev->vf_token->lock);
-			uuid_copy(&vdev->vf_token->uuid, &uuid);
-			mutex_unlock(&vdev->vf_token->lock);
+		if (copy_from_user(&uuid, arg, sizeof(uuid)))
+			return -EFAULT;
 
-			return 0;
-		default:
-			return -ENOTTY;
-		}
+		mutex_lock(&vdev->vf_token->lock);
+		uuid_copy(&vdev->vf_token->uuid, &uuid);
+		mutex_unlock(&vdev->vf_token->lock);
+		return 0;
+	default:
+		return -ENOTTY;
 	}
-
-	return -ENOTTY;
 }
-EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
+EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl_feature);
 
 static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 735d1d344af9..71763e2ac561 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1557,15 +1557,53 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+static int vfio_ioctl_device_feature(struct vfio_device *device,
+				     struct vfio_device_feature __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_device_feature, flags);
+	struct vfio_device_feature feature;
+
+	if (copy_from_user(&feature, arg, minsz))
+		return -EFAULT;
+
+	if (feature.argsz < minsz)
+		return -EINVAL;
+
+	/* Check unknown flags */
+	if (feature.flags &
+	    ~(VFIO_DEVICE_FEATURE_MASK | VFIO_DEVICE_FEATURE_SET |
+	      VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_PROBE))
+		return -EINVAL;
+
+	/* GET & SET are mutually exclusive except with PROBE */
+	if (!(feature.flags & VFIO_DEVICE_FEATURE_PROBE) &&
+	    (feature.flags & VFIO_DEVICE_FEATURE_SET) &&
+	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
+		return -EINVAL;
+
+	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	default:
+		if (unlikely(!device->ops->device_feature))
+			return -EINVAL;
+		return device->ops->device_feature(device, feature.flags,
+						   arg->data,
+						   feature.argsz - minsz);
+	}
+}
+
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
 	struct vfio_device *device = filep->private_data;
 
-	if (unlikely(!device->ops->ioctl))
-		return -EINVAL;
-
-	return device->ops->ioctl(device, cmd, arg);
+	switch (cmd) {
+	case VFIO_DEVICE_FEATURE:
+		return vfio_ioctl_device_feature(device, (void __user *)arg);
+	default:
+		if (unlikely(!device->ops->ioctl))
+			return -EINVAL;
+		return device->ops->ioctl(device, cmd, arg);
+	}
 }
 
 static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 76191d7abed1..ca69516f869d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -55,6 +55,7 @@ struct vfio_device {
  * @match: Optional device name match callback (return: 0 for no-match, >0 for
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
+ * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
  */
 struct vfio_device_ops {
 	char	*name;
@@ -69,8 +70,39 @@ struct vfio_device_ops {
 	int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
 	void	(*request)(struct vfio_device *vdev, unsigned int count);
 	int	(*match)(struct vfio_device *vdev, char *buf);
+	int	(*device_feature)(struct vfio_device *device, u32 flags,
+				  void __user *arg, size_t argsz);
 };
 
+/**
+ * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
+ * @flags: Arg from the device_feature op
+ * @argsz: Arg from the device_feature op
+ * @supported_ops: Combination of VFIO_DEVICE_FEATURE_GET and SET the driver
+ *                 supports
+ * @minsz: Minimum data size the driver accepts
+ *
+ * For use in a driver's device_feature op. Checks that the inputs to the
+ * VFIO_DEVICE_FEATURE ioctl are correct for the driver's feature. Returns 1 if
+ * the driver should execute the get or set, otherwise the relevant
+ * value should be returned.
+ */
+static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops,
+				    size_t minsz)
+{
+	if ((flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)) &
+	    ~supported_ops)
+		return -EINVAL;
+	if (flags & VFIO_DEVICE_FEATURE_PROBE)
+		return 0;
+	/* Without PROBE one of GET or SET must be requested */
+	if (!(flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)))
+		return -EINVAL;
+	if (argsz < minsz)
+		return -EINVAL;
+	return 1;
+}
+
 void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
 			 const struct vfio_device_ops *ops);
 void vfio_uninit_group_dev(struct vfio_device *device);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index ef9a44b6cf5d..beba0b2ed87d 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -220,6 +220,8 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
 extern const struct pci_error_handlers vfio_pci_core_err_handlers;
 long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		unsigned long arg);
+int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
+				void __user *arg, size_t argsz);
 ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
-- 
2.18.1



* [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (6 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-31 23:43   ` Alex Williamson
  2022-02-01 12:06   ` Cornelia Huck
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

Replace the existing region-based migration protocol with an ioctl-based
protocol. The two protocols have the same general semantic behaviors, but
the way the data is transported is changed.

This is the STOP_COPY portion of the new protocol, it defines the 5 states
for basic stop and copy migration and the protocol to move the migration
data in/out of the kernel.

Compared to the clarification of the v1 protocol Alex proposed:

https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen

This has a few deliberate functional differences:

 - ERROR arcs allow the device function to remain unchanged.

 - The protocol is not required to return to the original state on
   transition failure. Instead we directly return the current state,
   whatever it may be. Userspace can execute an unwind back to the
   original state, reset, or do something else without needing kernel
   support. This simplifies the kernel design and, should userspace choose
   a policy like always reset, avoids doing useless work in the kernel on
   error handling paths.

 - PRE_COPY is made optional, userspace must discover it before using it.
   This reflects the fact that the majority of drivers we are aware of
   right now will not implement PRE_COPY.

 - segmentation is not part of the data stream protocol; the receiver
   does not have to reproduce the framing boundaries.

The hybrid FSM for the device_state is described as a Mealy machine by
documenting each of the arcs the driver is required to implement. Defining
the remaining set of old/new device_state transitions as 'combination
transitions' which are naturally defined as taking multiple FSM arcs along
the shortest path within the FSM's digraph allows a complete matrix of
transitions.

A new IOCTL VFIO_DEVICE_MIG_SET_STATE is defined to replace writing to the
device_state field in the region. This allows returning more information
in the case of failure, and includes returning a brand new FD whenever the
requested transition opens a data transfer session.

The VFIO core code implements the new ioctl and provides a helper function
to the driver. Using the helper the driver only has to implement 6 of the
FSM arcs and the other combination transitions are elaborated consistently
from those arcs.
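
In a driver this elaboration loop might look roughly like the following
sketch (step_one_arc() stands in for the driver's handling of the 6 arcs):

	static struct file *
	my_set_device_state(struct vfio_device *vdev,
			    enum vfio_device_mig_state cur,
			    enum vfio_device_mig_state new_state)
	{
		struct file *filp = NULL;

		while (cur != new_state) {
			enum vfio_device_mig_state next;

			next = vfio_mig_get_next_state(vdev, cur, new_state);
			if (next == VFIO_DEVICE_STATE_ERROR)
				return ERR_PTR(-EINVAL);
			/* execute the single arc cur -> next */
			filp = step_one_arc(vdev, cur, next);
			if (IS_ERR(filp))
				return filp;
			cur = next;
		}
		return filp;
	}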

A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
report the capability for migration and indicate which set of states and
arcs are supported by the device. The FSM provides a lot of flexibility to
make backwards-compatible extensions, but the VFIO_DEVICE_FEATURE also
allows for future breaking extensions for scenarios that cannot support
even the basic STOP_COPY requirements.

Data transfer sessions are now carried over a file descriptor, instead of
the region. The FD functions for the lifetime of the data transfer
session. read() and write() transfer the data with normal Linux stream FD
semantics. This design allows future expansion to support poll(),
io_uring, and other performance optimizations.

The complicated mmap mode for data transfer is discarded as current qemu
doesn't take meaningful advantage of it, and the new qemu implementation
avoids substantially all the performance penalty of using a read() on the
region.
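
On the resume side the same FD mechanics run in the opposite direction;
a rough userspace sketch (untested, error handling omitted, image/image_len
hold a previously saved stream):

	struct vfio_device_mig_set_state set_state = {
		.argsz = sizeof(set_state),
		.device_state = VFIO_DEVICE_STATE_RESUMING,
	};

	ioctl(device_fd, VFIO_DEVICE_MIG_SET_STATE, &set_state);
	write(set_state.data_fd, image, image_len);	/* feed back the stream */

	/* Leaving RESUMING ends the session and triggers final validation */
	set_state.device_state = VFIO_DEVICE_STATE_STOP;
	ioctl(device_fd, VFIO_DEVICE_MIG_SET_STATE, &set_state);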

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       | 184 ++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h      |  10 +++
 include/uapi/linux/vfio.h | 164 ++++++++++++++++++++++++++++++++-
 3 files changed, 354 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 71763e2ac561..b12be212d048 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1557,6 +1557,184 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+/*
+ * vfio_mig_get_next_state - Compute the next step in the FSM
+ * @cur_fsm - The current state the device is in
+ * @new_fsm - The target state to reach
+ *
+ * Return the next step in the state progression between cur_fsm and new_fsm.
+ * This breaks down requests for combination transitions into smaller steps and
+ * returns the next step to get to new_fsm. The function may need to be called
+ * multiple times before reaching new_fsm.
+ *
+ * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed.
+ */
+u32 vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm)
+{
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	/*
+	 * The coding in this table requires the driver to implement 6
+	 * FSM arcs:
+	 *         RESUMING -> STOP
+	 *         RUNNING -> STOP
+	 *         STOP -> RESUMING
+	 *         STOP -> RUNNING
+	 *         STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP
+	 *
+	 * The coding will step through multiple states for these combination
+	 * transitions:
+	 *         RESUMING -> STOP -> RUNNING
+	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> STOP -> RESUMING
+	 *         RUNNING -> STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP -> RESUMING
+	 *         STOP_COPY -> STOP -> RUNNING
+	 */
+	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
+		[VFIO_DEVICE_STATE_STOP] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_STOP_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RESUMING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_ERROR] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+	};
+	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
+	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
+		return VFIO_DEVICE_STATE_ERROR;
+
+	return vfio_from_fsm_table[cur_fsm][new_fsm];
+}
+EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
+
+/*
+ * Convert the driver's struct file into an FD number and return it to userspace
+ */
+static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg,
+				   struct vfio_device_mig_set_state *set_state)
+{
+	int ret;
+	int fd;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
+		goto out_fput;
+	}
+
+	set_state->data_fd = fd;
+	if (copy_to_user(arg, set_state, sizeof(*set_state))) {
+		ret = -EFAULT;
+		goto out_put_unused;
+	}
+	fd_install(fd, filp);
+	return 0;
+
+out_put_unused:
+	put_unused_fd(fd);
+out_fput:
+	fput(filp);
+	return ret;
+}
+
+static int vfio_ioctl_mig_set_state(struct vfio_device *device,
+				    void __user *arg)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_mig_set_state, flags);
+	enum vfio_device_mig_state final_state = VFIO_DEVICE_STATE_ERROR;
+	struct vfio_device_mig_set_state set_state;
+	struct file *filp;
+
+	if (!device->ops->migration_set_state)
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&set_state, arg, minsz))
+		return -EFAULT;
+
+	if (set_state.argsz < minsz || set_state.flags)
+		return -EOPNOTSUPP;
+
+	/*
+	 * It is tempting to try to validate set_state.device_state here, but
+	 * then we can't return final_state. The validation is done in
+	 * vfio_mig_get_next_state().
+	 */
+	filp = device->ops->migration_set_state(device, set_state.device_state,
+						&final_state);
+	set_state.device_state = final_state;
+	if (IS_ERR(filp)) {
+		if (WARN_ON(PTR_ERR(filp) == -EOPNOTSUPP ||
+			    PTR_ERR(filp) == -ENOTTY ||
+			    PTR_ERR(filp) == -EFAULT))
+			filp = ERR_PTR(-EINVAL);
+		goto out_copy;
+	}
+
+	if (!filp)
+		goto out_copy;
+	return vfio_ioct_mig_return_fd(filp, arg, &set_state);
+out_copy:
+	set_state.data_fd = -1;
+	if (copy_to_user(arg, &set_state, sizeof(set_state)))
+		return -EFAULT;
+	if (IS_ERR(filp))
+		return PTR_ERR(filp);
+	return 0;
+}
+
+static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
+					       u32 flags, void __user *arg,
+					       size_t argsz)
+{
+	struct vfio_device_feature_migration mig = {
+		.flags = VFIO_MIGRATION_STOP_COPY,
+	};
+	int ret;
+
+	if (!device->ops->migration_set_state)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(mig));
+	if (ret != 1)
+		return ret;
+	if (copy_to_user(arg, &mig, sizeof(mig)))
+		return -EFAULT;
+	return 0;
+}
+
 static int vfio_ioctl_device_feature(struct vfio_device *device,
 				     struct vfio_device_feature __user *arg)
 {
@@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 		return -EINVAL;
 
 	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	case VFIO_DEVICE_FEATURE_MIGRATION:
+		return vfio_ioctl_device_feature_migration(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
 	default:
 		if (unlikely(!device->ops->device_feature))
 			return -EINVAL;
@@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 	struct vfio_device *device = filep->private_data;
 
 	switch (cmd) {
+	case VFIO_DEVICE_MIG_SET_STATE:
+		return vfio_ioctl_mig_set_state(device, (void __user *)arg);
 	case VFIO_DEVICE_FEATURE:
 		return vfio_ioctl_device_feature(device, (void __user *)arg);
 	default:
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ca69516f869d..697790ec4065 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -56,6 +56,8 @@ struct vfio_device {
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
  * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
+ * @migration_set_state: Optional callback to change the migration
+ *         state for devices that support migration.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -72,6 +74,10 @@ struct vfio_device_ops {
 	int	(*match)(struct vfio_device *vdev, char *buf);
 	int	(*device_feature)(struct vfio_device *device, u32 flags,
 				  void __user *arg, size_t argsz);
+	struct file *(*migration_set_state)(
+		struct vfio_device *device,
+		enum vfio_device_mig_state new_state,
+		enum vfio_device_mig_state *final_state);
 };
 
 /**
@@ -114,6 +120,10 @@ extern void vfio_device_put(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 
+u32 vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm);
+
 /*
  * External user API
  */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..d9162702973a 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -605,10 +605,10 @@ struct vfio_region_gfx_edid {
 
 struct vfio_device_migration_info {
 	__u32 device_state;         /* VFIO device state */
-#define VFIO_DEVICE_STATE_STOP      (0)
-#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
-#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
-#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_V1_STOP      (0)
+#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
-				     VFIO_DEVICE_STATE_SAVING |  \
-				     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
+				     VFIO_DEVICE_STATE_V1_SAVING |  \
+				     VFIO_DEVICE_STATE_V1_RESUMING)
@@ -1002,6 +1002,162 @@ struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/*
+ * Indicates the device can support the migration API. See enum
+ * vfio_device_mig_state for details. If present flags must be non-zero and
+ * VFIO_DEVICE_MIG_SET_STATE is supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
+ * RESUMING are supported.
+ */
+struct vfio_device_feature_migration {
+	__aligned_u64 flags;
+#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+};
+#define VFIO_DEVICE_FEATURE_MIGRATION 1
+
+/*
+ * The device migration Finite State Machine is described by the enum
+ * vfio_device_mig_state. Some of the FSM arcs will create a migration data
+ * transfer session by returning a FD, in this case the migration data will
+ * flow over the FD using read() and write() as discussed below.
+ *
+ * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
+ *  RUNNING - The device is running normally
+ *  STOP - The device does not change the internal or external state
+ *  STOP_COPY - The device internal state can be read out
+ *  RESUMING - The device is stopped and is loading a new internal state
+ *  ERROR - The device has failed and must be reset
+ *
+ * The FSM takes actions on the arcs between FSM states. The driver implements
+ * the following behavior for the FSM arcs:
+ *
+ * RUNNING -> STOP
+ * STOP_COPY -> STOP
+ *   While in STOP the device must stop operation. The device must not
+ *   generate interrupts, DMA, or advance its internal state. When stopped
+ *   the device and kernel migration driver must accept and respond to
+ *   interaction to support external subsystems in the STOP state, for
+ *   example PCI MSI-X and PCI config space. Failure by the user to restrict
+ *   device access while in STOP must not result in error conditions outside
+ *   the user context (ex. host system faults).
+ *
+ *   The STOP_COPY arc will terminate a data transfer session.
+ *
+ * RESUMING -> STOP
+ *   Leaving RESUMING terminates a data transfer session and indicates the
+ *   device should complete processing of the data delivered by write(). The
+ *   kernel migration driver should complete the incorporation of data written
+ *   to the data transfer FD into the device internal state and perform
+ *   final validity and consistency checking of the new device state. If the
+ *   user provided data is found to be incomplete, inconsistent, or otherwise
+ *   invalid, the migration driver must fail the SET_STATE ioctl and
+ *   optionally go to the ERROR state as described below.
+ *
+ *   While in STOP the device has the same behavior as other STOP states
+ *   described above.
+ *
+ *   To abort a RESUMING session the device must be reset.
+ *
+ * STOP -> RUNNING
+ *   While in RUNNING the device is fully operational, the device may generate
+ *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
+ *   and the device may advance its internal state.
+ *
+ * STOP -> STOP_COPY
+ *   This arc begins the process of saving the device state and will return a
+ *   new data_fd.
+ *
+ *   While in the STOP_COPY state the device has the same behavior as STOP
+ *   with the addition that the data transfer session continues to stream the
+ *   migration state. End of stream on the FD indicates the entire device
+ *   state has been transferred.
+ *
+ *   The user should take steps to restrict access to vfio device regions while
+ *   the device is in STOP_COPY or risk corruption of the device migration data
+ *   stream.
+ *
+ * STOP -> RESUMING
+ *   Entering the RESUMING state starts a process of restoring the device
+ *   state and will return a new data_fd. The data stream fed into the data_fd
+ *   should be taken from the data transfer output of the saving group states
+ *   from a compatible device. The migration driver may alter/reset the
+ *   internal device state for this arc if required to prepare the device to
+ *   receive the migration data.
+ *
+ * any -> ERROR
+ *   ERROR cannot be specified as a device state, however any transition request
+ *   can be failed with an errno return and may then move the device_state into
+ *   ERROR. In this case the device was unable to execute the requested arc and
+ *   was also unable to restore the device to any valid device_state. The ERROR
+ *   state will be returned as described below in VFIO_DEVICE_MIG_SET_STATE. To
+ *   recover from ERROR VFIO_DEVICE_RESET must be used to return the
+ *   device_state back to RUNNING.
+ *
+ * The remaining possible transitions are interpreted as combinations of the
+ * above FSM arcs. As there are multiple paths through the FSM arcs the path
+ * should be selected based on the following rules:
+ *   - Select the shortest path.
+ * Refer to vfio_mig_get_next_state() for the result of the algorithm.
+ *
+ * The automatic transit through the FSM arcs that make up the combination
+ * transition is invisible to the user. When working with combination arcs the
+ * user may see any step along the path in the device_state if SET_STATE
+ * fails. When handling these types of errors users should anticipate future
+ * revisions of this protocol using new states and those states becoming
+ * visible in this case.
+ */
+enum vfio_device_mig_state {
+	VFIO_DEVICE_STATE_ERROR = 0,
+	VFIO_DEVICE_STATE_STOP = 1,
+	VFIO_DEVICE_STATE_RUNNING = 2,
+	VFIO_DEVICE_STATE_STOP_COPY = 3,
+	VFIO_DEVICE_STATE_RESUMING = 4,
+};
+
+/**
+ * VFIO_DEVICE_MIG_SET_STATE - _IO(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * Execute a migration state change command on the VFIO device. The new state is
+ * supplied in device_state.
+ *
+ * The kernel migration driver must fully transition the device to the new state
+ * value before the ioctl returns to the user.
+ *
+ * The kernel migration driver must not generate asynchronous device state
+ * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
+ * ioctl as described above.
+ *
+ * If this function fails and returns -1 then the device_state is updated with
+ * the current state the device is in. This may be the original operating state
+ * or some other state along the combination transition path. The user can then
+ * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the
+ * original state, or attempt to return to some other state such as RUNNING or
+ * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state
+ * output is not reliable.
+ *
+ * If the new_state starts a new data transfer session then the FD associated
+ * with that session is returned in data_fd. The user is responsible for
+ * closing this FD when it is finished. The user must consider the migration data
+ * segments carried over the FD to be opaque and non-fungible. During RESUMING,
+ * the data segments must be written in the same order they came out of the
+ * saving side FD.
+ *
+ * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL,
+ * and take no action. However the device_state will be updated with the current
+ * value.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_device_mig_set_state {
+	__u32 argsz;
+	__u32 device_state;
+	__s32 data_fd;
+	__u32 flags;
+};
+
+#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.18.1
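
As a reading aid, here is a minimal userspace sketch (an illustration only,
not part of the patch) of the STOP_COPY save flow under this uAPI;
'device_fd' is assumed to be an already-open VFIO device FD, error handling
is abbreviated, and it needs <sys/ioctl.h>, <unistd.h> and <linux/vfio.h>:

	static int save_device_state(int device_fd, int save_to_fd)
	{
		struct vfio_device_mig_set_state set = {
			.argsz = sizeof(set),
			.device_state = VFIO_DEVICE_STATE_STOP_COPY,
		};
		char buf[4096];
		ssize_t n;

		/* Combination transition; the kernel steps RUNNING -> STOP -> STOP_COPY */
		if (ioctl(device_fd, VFIO_DEVICE_MIG_SET_STATE, &set))
			return -1;

		/* Stream the opaque migration data until end of stream */
		while ((n = read(set.data_fd, buf, sizeof(buf))) > 0)
			if (write(save_to_fd, buf, n) != n)
				break;
		close(set.data_fd);
		if (n)
			return -1;

		/* Leaving STOP_COPY terminates the data transfer session */
		set.device_state = VFIO_DEVICE_STATE_STOP;
		return ioctl(device_fd, VFIO_DEVICE_MIG_SET_STATE, &set);
	}

The RESUMING direction mirrors this: enter RESUMING from STOP, write() the
saved segments into the returned data_fd in the same order they were read,
then move back to STOP to have the driver validate and apply them.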


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (7 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-02-01 11:54   ` Cornelia Huck
  2022-02-01 18:31   ` Alex Williamson
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1 Yishai Hadas
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

The RUNNING_P2P state is designed to support multiple devices in the same
VM that are doing P2P transactions between themselves. When in RUNNING_P2P
the device must be able to accept incoming P2P transactions but should not
generate outgoing P2P transactions.

As an optional extension to the mandatory states, it is defined to be in
between STOP and RUNNING:
   STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP

For drivers that are unable to support RUNNING_P2P the core code silently
merges RUNNING_P2P and RUNNING together. Drivers that do support it are
required to implement 4 FSM arcs beyond the basic FSM, and 2 of the basic
FSM arcs (RUNNING -> STOP and STOP -> RUNNING) become combination
transitions.

Compared to the v1 clarification, NDMA is redefined into FSM states and is
described in terms of the desired P2P quiescent behavior, noting that
halting all DMA is an acceptable implementation.
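
As an illustration (not part of the patch), a core-style walk of the
combination transition RUNNING -> STOP using the exported helper; 'device'
is an assumed struct vfio_device pointer:

	enum vfio_device_mig_state cur = VFIO_DEVICE_STATE_RUNNING;

	while (cur != VFIO_DEVICE_STATE_STOP) {
		cur = vfio_mig_get_next_state(device, cur,
					      VFIO_DEVICE_STATE_STOP);
		if (cur == VFIO_DEVICE_STATE_ERROR)
			break;	/* no valid path */
		/* execute the single FSM arc to 'cur' here */
	}

With VFIO_MIGRATION_P2P set in device->migration_flags the loop sees
RUNNING -> RUNNING_P2P -> STOP; without it the helper collapses the
RUNNING_P2P step internally and returns STOP directly.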

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       | 70 ++++++++++++++++++++++++++++++---------
 include/linux/vfio.h      |  2 ++
 include/uapi/linux/vfio.h | 34 +++++++++++++++++--
 3 files changed, 88 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index b12be212d048..a722a1a8a48a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1573,39 +1573,55 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state cur_fsm,
 			    enum vfio_device_mig_state new_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
 	/*
-	 * The coding in this table requires the driver to implement 6
+	 * The coding in this table requires the driver to implement the following
 	 * FSM arcs:
 	 *         RESUMING -> STOP
-	 *         RUNNING -> STOP
 	 *         STOP -> RESUMING
-	 *         STOP -> RUNNING
 	 *         STOP -> STOP_COPY
 	 *         STOP_COPY -> STOP
 	 *
-	 * The coding will step through multiple states for these combination
-	 * transitions:
-	 *         RESUMING -> STOP -> RUNNING
+	 * If P2P is supported then the driver must also implement these FSM
+	 * arcs:
+	 *         RUNNING -> RUNNING_P2P
+	 *         RUNNING_P2P -> RUNNING
+	 *         RUNNING_P2P -> STOP
+	 *         STOP -> RUNNING_P2P
+	 * Without P2P the driver must implement:
+	 *         RUNNING -> STOP
+	 *         STOP -> RUNNING
+	 *
+	 * If all optional features are supported then the coding will step
+	 * through multiple states for these combination transitions:
+	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
 	 *         RESUMING -> STOP -> STOP_COPY
-	 *         RUNNING -> STOP -> RESUMING
-	 *         RUNNING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> STOP
+	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> RUNNING
 	 *         STOP_COPY -> STOP -> RESUMING
-	 *         STOP_COPY -> STOP -> RUNNING
+	 *         STOP_COPY -> STOP -> RUNNING_P2P
+	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RUNNING] = {
-			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
-			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
@@ -1613,6 +1629,7 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RESUMING] = {
@@ -1620,6 +1637,15 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_ERROR] = {
@@ -1627,14 +1653,26 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 	};
+	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
+
 	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
 	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
 		return VFIO_DEVICE_STATE_ERROR;
 
-	return vfio_from_fsm_table[cur_fsm][new_fsm];
+	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
+			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
+		return VFIO_DEVICE_STATE_ERROR;
+
+	cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	if (!have_p2p) {
+		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P)
+			cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	}
+	return cur_fsm;
 }
 EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
 
@@ -1719,7 +1757,7 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 					       size_t argsz)
 {
 	struct vfio_device_feature_migration mig = {
-		.flags = VFIO_MIGRATION_STOP_COPY,
+		.flags = device->migration_flags,
 	};
 	int ret;
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 697790ec4065..69a574ba085e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -33,6 +33,7 @@ struct vfio_device {
 	struct vfio_group *group;
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
+	unsigned int migration_flags;
 
 	/* Members below here are private, not for driver use */
 	refcount_t refcount;
@@ -44,6 +45,7 @@ struct vfio_device {
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
+ * @flags: Global flags from enum vfio_device_ops_flags
  * @open_device: Called when the first file descriptor is opened for this device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d9162702973a..9efc35535b29 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1009,10 +1009,16 @@ struct vfio_device_feature {
  *
  * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
  * RESUMING are supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
+ * is supported in addition to the STOP_COPY states.
+ *
+ * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+#define VFIO_MIGRATION_P2P		(1 << 1)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -1029,10 +1035,13 @@ struct vfio_device_feature_migration {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
+ * And 1 optional state to support VFIO_MIGRATION_P2P:
+ *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
  *
- * RUNNING -> STOP
+ * RUNNING_P2P -> STOP
  * STOP_COPY -> STOP
  *   While in STOP the device must stop the operation of the device. The
  *   device must not generate interrupts, DMA, or advance its internal
@@ -1059,11 +1068,16 @@ struct vfio_device_feature_migration {
  *
  *   To abort a RESUMING session the device must be reset.
  *
- * STOP -> RUNNING
+ * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ * RUNNING -> RUNNING_P2P
+ * STOP -> RUNNING_P2P
+ *   While in RUNNING_P2P the device is partially running in the P2P quiescent
+ *   state defined below.
+ *
  * STOP -> STOP_COPY
 *   This arc begins the process of saving the device state and will return a
  *   new data_fd.
@@ -1094,6 +1108,16 @@ struct vfio_device_feature_migration {
  *   recover from ERROR VFIO_DEVICE_RESET must be used to return the
  *   device_state back to RUNNING.
  *
+ * The optional peer to peer (P2P) quiescent state is intended to be a
+ * quiescent state for the device for the purposes of managing multiple
+ * devices within a user context where peer-to-peer DMA between devices may
+ * be active. The RUNNING_P2P state must prevent the device from initiating
+ * any new P2P DMA transactions. If the device can identify P2P transactions
+ * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
+ * driver must complete any such outstanding operations prior to completing
+ * the FSM arc into a P2P state. For the purpose of specification, if not
+ * supported the state behaves as though the device were fully running.
+ *
  * The remaining possible transitions are interpreted as combinations of the
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
@@ -1106,6 +1130,11 @@ struct vfio_device_feature_migration {
  * fails. When handling these types of errors users should anticipate future
  * revisions of this protocol using new states and those states becoming
  * visible in this case.
+ *
+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can discover if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -1113,6 +1142,7 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING = 2,
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
+	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
 /**
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (8 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-02-01 11:23   ` Cornelia Huck
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

v1 was never implemented and is replaced by v2.

The old uAPI definitions are removed from the header file. As per Linus's
past remarks we do not have a hard requirement to retain compilation
compatibility in uapi headers and qemu is already following Linus's
preferred model of copying the kernel headers.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 include/uapi/linux/vfio.h | 228 --------------------------------------
 1 file changed, 228 deletions(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9efc35535b29..70c77da5812d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -323,7 +323,6 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
-#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -404,233 +403,6 @@ struct vfio_region_gfx_edid {
 #define VFIO_REGION_SUBTYPE_CCW_SCHIB		(2)
 #define VFIO_REGION_SUBTYPE_CCW_CRW		(3)
 
-/* sub-types for VFIO_REGION_TYPE_MIGRATION */
-#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
-
-/*
- * The structure vfio_device_migration_info is placed at the 0th offset of
- * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
- * migration information. Field accesses from this structure are only supported
- * at their native width and alignment. Otherwise, the result is undefined and
- * vendor drivers should return an error.
- *
- * device_state: (read/write)
- *      - The user application writes to this field to inform the vendor driver
- *        about the device state to be transitioned to.
- *      - The vendor driver should take the necessary actions to change the
- *        device state. After successful transition to a given state, the
- *        vendor driver should return success on write(device_state, state)
- *        system call. If the device state transition fails, the vendor driver
- *        should return an appropriate -errno for the fault condition.
- *      - On the user application side, if the device state transition fails,
- *	  that is, if write(device_state, state) returns an error, read
- *	  device_state again to determine the current state of the device from
- *	  the vendor driver.
- *      - The vendor driver should return previous state of the device unless
- *        the vendor driver has encountered an internal error, in which case
- *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
- *      - The user application must use the device reset ioctl to recover the
- *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
- *        indicated to be in a valid device state by reading device_state, the
- *        user application may attempt to transition the device to any valid
- *        state reachable from the current state or terminate itself.
- *
- *      device_state consists of 3 bits:
- *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
- *        it indicates the _STOP state. When the device state is changed to
- *        _STOP, driver should stop the device before write() returns.
- *      - If bit 1 is set, it indicates the _SAVING state, which means that the
- *        driver should start gathering device state information that will be
- *        provided to the VFIO user application to save the device's state.
- *      - If bit 2 is set, it indicates the _RESUMING state, which means that
- *        the driver should prepare to resume the device. Data provided through
- *        the migration region should be used to resume the device.
- *      Bits 3 - 31 are reserved for future use. To preserve them, the user
- *      application should perform a read-modify-write operation on this
- *      field when modifying the specified bits.
- *
- *  +------- _RESUMING
- *  |+------ _SAVING
- *  ||+----- _RUNNING
- *  |||
- *  000b => Device Stopped, not saving or resuming
- *  001b => Device running, which is the default state
- *  010b => Stop the device & save the device state, stop-and-copy state
- *  011b => Device running and save the device state, pre-copy state
- *  100b => Device stopped and the device state is resuming
- *  101b => Invalid state
- *  110b => Error state
- *  111b => Invalid state
- *
- * State transitions:
- *
- *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
- *                (100b)     (001b)     (011b)        (010b)       (000b)
- * 0. Running or default state
- *                             |
- *
- * 1. Normal Shutdown (optional)
- *                             |------------------------------------->|
- *
- * 2. Save the state or suspend
- *                             |------------------------->|---------->|
- *
- * 3. Save the state during live migration
- *                             |----------->|------------>|---------->|
- *
- * 4. Resuming
- *                  |<---------|
- *
- * 5. Resumed
- *                  |--------->|
- *
- * 0. Default state of VFIO device is _RUNNING when the user application starts.
- * 1. During normal shutdown of the user application, the user application may
- *    optionally change the VFIO device state from _RUNNING to _STOP. This
- *    transition is optional. The vendor driver must support this transition but
- *    must not require it.
- * 2. When the user application saves state or suspends the application, the
- *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
- *    On state transition from _RUNNING to stop-and-copy, driver must stop the
- *    device, save the device state and send it to the application through the
- *    migration region. The sequence to be followed for such transition is given
- *    below.
- * 3. In live migration of user application, the state transitions from _RUNNING
- *    to pre-copy, to stop-and-copy, and to _STOP.
- *    On state transition from _RUNNING to pre-copy, the driver should start
- *    gathering the device state while the application is still running and send
- *    the device state data to application through the migration region.
- *    On state transition from pre-copy to stop-and-copy, the driver must stop
- *    the device, save the device state and send it to the user application
- *    through the migration region.
- *    Vendor drivers must support the pre-copy state even for implementations
- *    where no data is provided to the user before the stop-and-copy state. The
- *    user must not be required to consume all migration data before the device
- *    transitions to a new state, including the stop-and-copy state.
- *    The sequence to be followed for above two transitions is given below.
- * 4. To start the resuming phase, the device state should be transitioned from
- *    the _RUNNING to the _RESUMING state.
- *    In the _RESUMING state, the driver should use the device state data
- *    received through the migration region to resume the device.
- * 5. After providing saved device data to the driver, the application should
- *    change the state from _RESUMING to _RUNNING.
- *
- * reserved:
- *      Reads on this field return zero and writes are ignored.
- *
- * pending_bytes: (read only)
- *      The number of pending bytes still to be migrated from the vendor driver.
- *
- * data_offset: (read only)
- *      The user application should read data_offset field from the migration
- *      region. The user application should read the device data from this
- *      offset within the migration region during the _SAVING state or write
- *      the device data during the _RESUMING state. See below for details of
- *      sequence to be followed.
- *
- * data_size: (read/write)
- *      The user application should read data_size to get the size in bytes of
- *      the data copied in the migration region during the _SAVING state and
- *      write the size in bytes of the data copied in the migration region
- *      during the _RESUMING state.
- *
- * The format of the migration region is as follows:
- *  ------------------------------------------------------------------
- * |vfio_device_migration_info|    data section                      |
- * |                          |     ///////////////////////////////  |
- * ------------------------------------------------------------------
- *   ^                              ^
- *  offset 0-trapped part        data_offset
- *
- * The structure vfio_device_migration_info is always followed by the data
- * section in the region, so data_offset will always be nonzero. The offset
- * from where the data is copied is decided by the kernel driver. The data
- * section can be trapped, mmapped, or partitioned, depending on how the kernel
- * driver defines the data section. The data section partition can be defined
- * as mapped by the sparse mmap capability. If mmapped, data_offset must be
- * page aligned, whereas initial section which contains the
- * vfio_device_migration_info structure, might not end at the offset, which is
- * page aligned. The user is not required to access through mmap regardless
- * of the capabilities of the region mmap.
- * The vendor driver should determine whether and how to partition the data
- * section. The vendor driver should return data_offset accordingly.
- *
- * The sequence to be followed while in pre-copy state and stop-and-copy state
- * is as follows:
- * a. Read pending_bytes, indicating the start of a new iteration to get device
- *    data. Repeated read on pending_bytes at this stage should have no side
- *    effects.
- *    If pending_bytes == 0, the user application should not iterate to get data
- *    for that device.
- *    If pending_bytes > 0, perform the following steps.
- * b. Read data_offset, indicating that the vendor driver should make data
- *    available through the data section. The vendor driver should return this
- *    read operation only after data is available from (region + data_offset)
- *    to (region + data_offset + data_size).
- * c. Read data_size, which is the amount of data in bytes available through
- *    the migration region.
- *    Read on data_offset and data_size should return the offset and size of
- *    the current buffer if the user application reads data_offset and
- *    data_size more than once here.
- * d. Read data_size bytes of data from (region + data_offset) from the
- *    migration region.
- * e. Process the data.
- * f. Read pending_bytes, which indicates that the data from the previous
- *    iteration has been read. If pending_bytes > 0, go to step b.
- *
- * The user application can transition from the _SAVING|_RUNNING
- * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
- * number of pending bytes. The user application should iterate in _SAVING
- * (stop-and-copy) until pending_bytes is 0.
- *
- * The sequence to be followed while _RESUMING device state is as follows:
- * While data for this device is available, repeat the following steps:
- * a. Read data_offset from where the user application should write data.
- * b. Write migration data starting at the migration region + data_offset for
- *    the length determined by data_size from the migration source.
- * c. Write data_size, which indicates to the vendor driver that data is
- *    written in the migration region. Vendor driver must return this write
- *    operations on consuming data. Vendor driver should apply the
- *    user-provided migration region data to the device resume state.
- *
- * If an error occurs during the above sequences, the vendor driver can return
- * an error code for next read() or write() operation, which will terminate the
- * loop. The user application should then take the next necessary action, for
- * example, failing migration or terminating the user application.
- *
- * For the user application, data is opaque. The user application should write
- * data in the same order as the data is received and the data should be of
- * same transaction size at the source.
- */
-
-struct vfio_device_migration_info {
-	__u32 device_state;         /* VFIO device state */
-#define VFIO_DEVICE_STATE_V1_STOP      (0)
-#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
-#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
-#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
-				     VFIO_DEVICE_STATE_V1_SAVING |  \
-				     VFIO_DEVICE_STATE_V1_RESUMING)
-
-#define VFIO_DEVICE_STATE_VALID(state) \
-	(state & VFIO_DEVICE_STATE_RESUMING ? \
-	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
-
-#define VFIO_DEVICE_STATE_IS_ERROR(state) \
-	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
-					      VFIO_DEVICE_STATE_RESUMING))
-
-#define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
-
-	__u32 reserved;
-	__u64 pending_bytes;
-	__u64 data_offset;
-	__u64 data_size;
-};
-
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V6 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (9 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1 Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Expose migration commands over the device; these include: suspend, resume,
get vhca_id, and query/save/load state.

As part of this, add the APIs and data structures needed to manage the
migration data.
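
For orientation (an illustration only, not part of the patch), the save
side is expected to chain these helpers roughly as follows; 'pdev',
'vhca_id' and an allocated 'migf' with a populated sg table are assumed,
and error handling is omitted:

	/* Ask the device how large the saved state can be */
	mlx5vf_cmd_query_vhca_migration_state(pdev, vhca_id,
					      &migf->total_length);

	/* ... allocate migf->table pages covering total_length ... */

	/* DMA-maps the table, creates the mkey and runs SAVE_VHCA_STATE */
	mlx5vf_cmd_save_vhca_state(pdev, vhca_id, migf);
	/* migf->total_length now holds the actual image size */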

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h |  35 +++++
 2 files changed, 287 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5/cmd.c
 create mode 100644 drivers/vfio/pci/mlx5/cmd.h

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
new file mode 100644
index 000000000000..fcf2dab0541a
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -0,0 +1,252 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "cmd.h"
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA);
+	MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(suspend_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, suspend_vhca, in, out);
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(resume_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(resume_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(resume_vhca_in, in, opcode, MLX5_CMD_OP_RESUME_VHCA);
+	MLX5_SET(resume_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(resume_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, resume_vhca, in, out);
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+					  size_t *state_size)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(query_vhca_migration_state_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE);
+	MLX5_SET(query_vhca_migration_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(query_vhca_migration_state_in, in, op_mod, 0);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_vhca_migration_state, in, out);
+	if (ret)
+		goto end;
+
+	*state_size = MLX5_GET(query_vhca_migration_state_out, out,
+			       required_umem_size);
+
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
+	int out_size;
+	void *out;
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	out_size = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	out = kzalloc(out_size, GFP_KERNEL);
+	if (!out) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+	MLX5_SET(query_hca_cap_in, in, other_function, 1);
+	MLX5_SET(query_hca_cap_in, in, function_id, function_id);
+	MLX5_SET(query_hca_cap_in, in, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1 |
+		 HCA_CAP_OPMOD_GET_CUR);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_hca_cap, in, out);
+	if (ret)
+		goto err_exec;
+
+	*vhca_id = MLX5_GET(query_hca_cap_out, out,
+			    capability.cmd_hca_cap.vhca_id);
+
+err_exec:
+	kfree(out);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
+			      struct mlx5_vf_migration_file *migf, u32 *mkey)
+{
+	size_t npages = DIV_ROUND_UP(migf->total_length, PAGE_SIZE);
+	struct sg_dma_page_iter dma_iter;
+	int err = 0, inlen;
+	__be64 *mtt;
+	void *mkc;
+	u32 *in;
+
+	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+		sizeof(*mtt) * round_up(npages, 2);
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
+		 DIV_ROUND_UP(npages, 2));
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
+
+	for_each_sgtable_dma_page (&migf->table.sgt, &dma_iter, 0)
+		*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, rr, 1);
+	MLX5_SET(mkc, mkc, rw, 1);
+	MLX5_SET(mkc, mkc, pd, pdn);
+	MLX5_SET(mkc, mkc, bsf_octword_size, 0);
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
+	MLX5_SET64(mkc, mkc, len, migf->total_length);
+	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
+	kvfree(in);
+	return err;
+}
+
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {};
+	u32 pdn, mkey;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = dma_map_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE,
+			      0);
+	if (err)
+		goto err_dma_map;
+
+	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	if (err)
+		goto err_create_mkey;
+
+	MLX5_SET(save_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_SAVE_VHCA_STATE);
+	MLX5_SET(save_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(save_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(save_vhca_state_in, in, mkey, mkey);
+	MLX5_SET(save_vhca_state_in, in, size, migf->total_length);
+
+	err = mlx5_cmd_exec_inout(mdev, save_vhca_state, in, out);
+	if (err)
+		goto err_exec;
+
+	migf->total_length =
+		MLX5_GET(save_vhca_state_out, out, actual_image_size);
+
+	mlx5_core_destroy_mkey(mdev, mkey);
+	mlx5_core_dealloc_pd(mdev, pdn);
+	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
+	mlx5_vf_put_core_dev(mdev);
+
+	return 0;
+
+err_exec:
+	mlx5_core_destroy_mkey(mdev, mkey);
+err_create_mkey:
+	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
+err_dma_map:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return err;
+}
+
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(load_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(load_vhca_state_in)] = {};
+	u32 pdn, mkey;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = dma_map_sgtable(mdev->device, &migf->table.sgt, DMA_TO_DEVICE, 0);
+	if (err)
+		goto err_reg;
+
+	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	if (err)
+		goto err_mkey;
+
+	MLX5_SET(load_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_LOAD_VHCA_STATE);
+	MLX5_SET(load_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(load_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(load_vhca_state_in, in, mkey, mkey);
+	MLX5_SET(load_vhca_state_in, in, size, migf->total_length);
+
+	err = mlx5_cmd_exec_inout(mdev, load_vhca_state, in, out);
+
+	mlx5_core_destroy_mkey(mdev, mkey);
+err_mkey:
+	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_TO_DEVICE, 0);
+err_reg:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
new file mode 100644
index 000000000000..69a1481ed953
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#ifndef MLX5_VFIO_CMD_H
+#define MLX5_VFIO_CMD_H
+
+#include <linux/kernel.h>
+#include <linux/mlx5/driver.h>
+
+struct mlx5_vf_migration_file {
+	struct file *filp;
+	struct mutex lock;
+
+	struct sg_append_table table;
+	size_t total_length;
+	size_t allocated_length;
+
+	/* Optimize mlx5vf_get_migration_page() for sequential access */
+	struct scatterlist *last_offset_sg;
+	unsigned int sg_last_entry;
+	unsigned long last_offset;
+};
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+					  size_t *state_size);
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id);
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf);
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vf_migration_file *migf);
+#endif /* MLX5_VFIO_CMD_H */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V6 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (10 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

This patch adds vfio_pci driver support for mlx5 devices.

It uses vfio_pci_core to register with the VFIO subsystem and then
implements the mlx5 specific logic in the migration area.

The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.

This patch implements the suspend/resume flows.
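
As a reading aid (not part of the patch text, op_mod names abbreviated),
the v2 FSM arcs map onto the commands from the previous patch roughly as
follows:

  RUNNING     -> RUNNING_P2P : mlx5vf_cmd_suspend_vhca(SUSPEND_MASTER)
  RUNNING_P2P -> STOP        : mlx5vf_cmd_suspend_vhca(SUSPEND_SLAVE)
  STOP        -> RUNNING_P2P : mlx5vf_cmd_resume_vhca(RESUME_SLAVE)
  RUNNING_P2P -> RUNNING     : mlx5vf_cmd_resume_vhca(RESUME_MASTER)
  STOP        -> STOP_COPY   : mlx5vf_pci_save_device_data() returns the save FD
  STOP        -> RESUMING    : mlx5vf_pci_resume_device_data() returns the resume FD
  RESUMING    -> STOP        : mlx5vf_cmd_load_vhca_state() applies the written data
  STOP_COPY   -> STOP        : the data transfer FDs are disabled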

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 MAINTAINERS                    |   6 +
 drivers/vfio/pci/Kconfig       |   3 +
 drivers/vfio/pci/Makefile      |   2 +
 drivers/vfio/pci/mlx5/Kconfig  |  10 +
 drivers/vfio/pci/mlx5/Makefile |   4 +
 drivers/vfio/pci/mlx5/cmd.h    |   1 +
 drivers/vfio/pci/mlx5/main.c   | 611 +++++++++++++++++++++++++++++++++
 7 files changed, 637 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5/Kconfig
 create mode 100644 drivers/vfio/pci/mlx5/Makefile
 create mode 100644 drivers/vfio/pci/mlx5/main.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ea3e6c914384..5c5216f5e43d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20260,6 +20260,12 @@ L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/platform/
 
+VFIO MLX5 PCI DRIVER
+M:	Yishai Hadas <yishaih@nvidia.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	drivers/vfio/pci/mlx5/
+
 VGA_SWITCHEROO
 R:	Lukas Wunner <lukas@wunner.de>
 S:	Maintained
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 860424ccda1b..187b9c259944 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -43,4 +43,7 @@ config VFIO_PCI_IGD
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
 endif
+
+source "drivers/vfio/pci/mlx5/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 349d68d242b4..ed9d6f2e0555 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
new file mode 100644
index 000000000000..29ba9c504a75
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config MLX5_VFIO_PCI
+	tristate "VFIO support for MLX5 PCI devices"
+	depends on MLX5_CORE
+	depends on VFIO_PCI_CORE
+	help
+	  This provides migration support for MLX5 devices using the VFIO
+	  framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
new file mode 100644
index 000000000000..689627da7ff5
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
+mlx5-vfio-pci-y := main.o cmd.o
+
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 69a1481ed953..1392a11a9cc0 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -12,6 +12,7 @@
 struct mlx5_vf_migration_file {
 	struct file *filp;
 	struct mutex lock;
+	bool disabled;
 
 	struct sg_append_table table;
 	size_t total_length;
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
new file mode 100644
index 000000000000..c15c8eed85d3
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -0,0 +1,611 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/anon_inodes.h>
+
+#include "cmd.h"
+
+/* Arbitrary to prevent userspace from consuming endless memory */
+#define MAX_MIGRATION_SIZE (512*1024*1024)
+
+struct mlx5vf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	u8 migrate_cap:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	u16 vhca_id;
+	struct mlx5_vf_migration_file *resuming_migf;
+	struct mlx5_vf_migration_file *saving_migf;
+};
+
+static struct page *
+mlx5vf_get_migration_page(struct mlx5_vf_migration_file *migf,
+			  unsigned long offset)
+{
+	unsigned long cur_offset = 0;
+	struct scatterlist *sg;
+	unsigned int i;
+
+	/* All accesses are sequential */
+	if (offset < migf->last_offset || !migf->last_offset_sg) {
+		migf->last_offset = 0;
+		migf->last_offset_sg = migf->table.sgt.sgl;
+		migf->sg_last_entry = 0;
+	}
+
+	cur_offset = migf->last_offset;
+
+	for_each_sg(migf->last_offset_sg, sg,
+			migf->table.sgt.orig_nents - migf->sg_last_entry, i) {
+		if (offset < sg->length + cur_offset) {
+			migf->last_offset_sg = sg;
+			migf->sg_last_entry += i;
+			migf->last_offset = cur_offset;
+			return nth_page(sg_page(sg),
+					(offset - cur_offset) / PAGE_SIZE);
+		}
+		cur_offset += sg->length;
+	}
+	return NULL;
+}
+
+static int mlx5vf_add_migration_pages(struct mlx5_vf_migration_file *migf,
+				      unsigned int npages)
+{
+	unsigned int to_alloc = npages;
+	struct page **page_list;
+	unsigned long filled;
+	unsigned int to_fill;
+	int ret;
+
+	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
+	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL);
+	if (!page_list)
+		return -ENOMEM;
+
+	do {
+		filled = alloc_pages_bulk_array(GFP_KERNEL, to_fill, page_list);
+		if (!filled) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		to_alloc -= filled;
+		ret = sg_alloc_append_table_from_pages(
+			&migf->table, page_list, filled, 0,
+			filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
+			GFP_KERNEL);
+
+		if (ret)
+			goto err;
+		migf->allocated_length += filled * PAGE_SIZE;
+		/* clean input for another bulk allocation */
+		memset(page_list, 0, filled * sizeof(*page_list));
+		to_fill = min_t(unsigned int, to_alloc,
+				PAGE_SIZE / sizeof(*page_list));
+	} while (to_alloc > 0);
+
+	kvfree(page_list);
+	return 0;
+
+err:
+	kvfree(page_list);
+	return ret;
+}
+
+static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf)
+{
+	struct sg_page_iter sg_iter;
+
+	mutex_lock(&migf->lock);
+	/* Undo alloc_pages_bulk_array() */
+	for_each_sgtable_page(&migf->table.sgt, &sg_iter, 0)
+		__free_page(sg_page_iter_page(&sg_iter));
+	sg_free_append_table(&migf->table);
+	migf->disabled = true;
+	migf->total_length = 0;
+	migf->allocated_length = 0;
+	migf->filp->f_pos = 0;
+	mutex_unlock(&migf->lock);
+}
+
+static int mlx5vf_release_file(struct inode *inode, struct file *filp)
+{
+	struct mlx5_vf_migration_file *migf = filp->private_data;
+
+	mlx5vf_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+	return 0;
+}
+
+static ssize_t mlx5vf_save_read(struct file *filp, char __user *buf, size_t len,
+			       loff_t *pos)
+{
+	struct mlx5_vf_migration_file *migf = filp->private_data;
+	ssize_t done = 0;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	mutex_lock(&migf->lock);
+	if (*pos > migf->total_length) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, migf->total_length - *pos, len);
+	while (len) {
+		size_t page_offset;
+		struct page *page;
+		size_t page_len;
+		u8 *from_buff;
+		int ret;
+
+		page_offset = (*pos) % PAGE_SIZE;
+		page = mlx5vf_get_migration_page(migf, *pos - page_offset);
+		if (!page) {
+			if (done == 0)
+				done = -EINVAL;
+			goto out_unlock;
+		}
+
+		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
+		from_buff = kmap_local_page(page);
+		ret = copy_to_user(buf, from_buff + page_offset, page_len);
+		kunmap_local(from_buff);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += page_len;
+		len -= page_len;
+		done += page_len;
+		buf += page_len;
+	}
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations mlx5vf_save_fops = {
+	.owner = THIS_MODULE,
+	.read = mlx5vf_save_read,
+	.release = mlx5vf_release_file,
+	.llseek = no_llseek,
+};
+
+static struct mlx5_vf_migration_file *
+mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("mlx5vf_mig", &mlx5vf_save_fops, migf,
+					O_RDONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	ret = mlx5vf_cmd_query_vhca_migration_state(
+		mvdev->core_device.pdev, mvdev->vhca_id, &migf->total_length);
+	if (ret)
+		goto out_free;
+
+	ret = mlx5vf_add_migration_pages(
+		migf, DIV_ROUND_UP_ULL(migf->total_length, PAGE_SIZE));
+	if (ret)
+		goto out_free;
+
+	ret = mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
+					 mvdev->vhca_id, migf);
+	if (ret)
+		goto out_free;
+	return migf;
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
+				   size_t len, loff_t *pos)
+{
+	struct mlx5_vf_migration_file *migf = filp->private_data;
+	loff_t requested_length;
+	ssize_t done = 0;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	if (*pos < 0 ||
+	    check_add_overflow((loff_t)len, *pos, &requested_length))
+		return -EINVAL;
+
+	if (requested_length > MAX_MIGRATION_SIZE)
+		return -ENOMEM;
+
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	if (migf->allocated_length < requested_length) {
+		done = mlx5vf_add_migration_pages(
+			migf,
+			DIV_ROUND_UP(requested_length - migf->allocated_length,
+				     PAGE_SIZE));
+		if (done)
+			goto out_unlock;
+	}
+
+	while (len) {
+		size_t page_offset;
+		struct page *page;
+		size_t page_len;
+		u8 *to_buff;
+		int ret;
+
+		page_offset = (*pos) % PAGE_SIZE;
+		page = mlx5vf_get_migration_page(migf, *pos - page_offset);
+		if (!page) {
+			if (done == 0)
+				done = -EINVAL;
+			goto out_unlock;
+		}
+
+		page_len = min_t(size_t, len, PAGE_SIZE - page_offset);
+		to_buff = kmap_local_page(page);
+		ret = copy_from_user(to_buff + page_offset, buf, page_len);
+		kunmap_local(to_buff);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += page_len;
+		len -= page_len;
+		done += page_len;
+		buf += page_len;
+		migf->total_length += page_len;
+	}
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations mlx5vf_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = mlx5vf_resume_write,
+	.release = mlx5vf_release_file,
+	.llseek = no_llseek,
+};
+
+static struct mlx5_vf_migration_file *
+mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vf_migration_file *migf;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("mlx5vf_mig", &mlx5vf_resume_fops, migf,
+					O_WRONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+	return migf;
+}
+
+static void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev)
+{
+	if (mvdev->resuming_migf) {
+		mlx5vf_disable_fd(mvdev->resuming_migf);
+		fput(mvdev->resuming_migf->filp);
+		mvdev->resuming_migf = NULL;
+	}
+	if (mvdev->saving_migf) {
+		mlx5vf_disable_fd(mvdev->saving_migf);
+		fput(mvdev->saving_migf->filp);
+		mvdev->saving_migf = NULL;
+	}
+}
+
+static struct file *
+mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
+				    u32 new)
+{
+	u32 cur = mvdev->mig_state;
+	int ret;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_STOP) {
+		ret = mlx5vf_cmd_suspend_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
+		ret = mlx5vf_cmd_resume_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
+		ret = mlx5vf_cmd_suspend_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) {
+		ret = mlx5vf_cmd_resume_vhca(
+			mvdev->core_device.pdev, mvdev->vhca_id,
+			MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct mlx5_vf_migration_file *migf;
+
+		migf = mlx5vf_pci_save_device_data(mvdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		mvdev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
+		mlx5vf_disable_fds(mvdev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
+		struct mlx5_vf_migration_file *migf;
+
+		migf = mlx5vf_pci_resume_device_data(mvdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		mvdev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
+						 mvdev->vhca_id,
+						 mvdev->resuming_migf);
+		if (ret)
+			return ERR_PTR(ret);
+		mlx5vf_disable_fds(mvdev);
+		return NULL;
+	}
+
+	/*
+	 * vfio_mig_get_next_state() does not use arcs other than the above
+	 */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct file *
+mlx5vf_pci_set_device_state(struct vfio_device *vdev,
+			    enum vfio_device_mig_state new_state,
+			    enum vfio_device_mig_state *final_state)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+
+	mutex_lock(&mvdev->state_mutex);
+	while (new_state != mvdev->mig_state) {
+		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
+						     new_state);
+		if (next_state == VFIO_DEVICE_STATE_ERROR) {
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+		res = mlx5vf_pci_step_device_state_locked(mvdev, next_state);
+		if (IS_ERR(res))
+			break;
+		mvdev->mig_state = next_state;
+		if (WARN_ON(res && new_state != mvdev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	*final_state = mvdev->mig_state;
+	mutex_unlock(&mvdev->state_mutex);
+	return res;
+}
+
+static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &mvdev->core_device;
+	int vf_id;
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	if (!mvdev->migrate_cap) {
+		vfio_pci_core_finish_enable(vdev);
+		return 0;
+	}
+
+	vf_id = pci_iov_vf_id(vdev->pdev);
+	if (vf_id < 0) {
+		ret = vf_id;
+		goto out_disable;
+	}
+
+	ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1, &mvdev->vhca_id);
+	if (ret)
+		goto out_disable;
+
+	mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+	vfio_pci_core_finish_enable(vdev);
+	return 0;
+out_disable:
+	vfio_pci_core_disable(vdev);
+	return ret;
+}
+
+static void mlx5vf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+
+	mlx5vf_disable_fds(mvdev);
+	vfio_pci_core_close_device(core_vdev);
+}
+
+static const struct vfio_device_ops mlx5vf_pci_ops = {
+	.name = "mlx5-vfio-pci",
+	.open_device = mlx5vf_pci_open_device,
+	.close_device = mlx5vf_pci_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.migration_set_state = mlx5vf_pci_set_device_state,
+};
+
+static int mlx5vf_pci_probe(struct pci_dev *pdev,
+			    const struct pci_device_id *id)
+{
+	struct mlx5vf_pci_core_device *mvdev;
+	int ret;
+
+	mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL);
+	if (!mvdev)
+		return -ENOMEM;
+	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
+
+	if (pdev->is_virtfn) {
+		struct mlx5_core_dev *mdev =
+			mlx5_vf_get_core_dev(pdev);
+
+		if (mdev) {
+			if (MLX5_CAP_GEN(mdev, migration)) {
+				mvdev->migrate_cap = 1;
+				mvdev->core_device.vdev.migration_flags =
+					VFIO_MIGRATION_STOP_COPY |
+					VFIO_MIGRATION_P2P;
+				mutex_init(&mvdev->state_mutex);
+			}
+			mlx5_vf_put_core_dev(mdev);
+		}
+	}
+
+	ret = vfio_pci_core_register_device(&mvdev->core_device);
+	if (ret)
+		goto out_free;
+
+	dev_set_drvdata(&pdev->dev, mvdev);
+	return 0;
+
+out_free:
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+	return ret;
+}
+
+static void mlx5vf_pci_remove(struct pci_dev *pdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(&mvdev->core_device);
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+}
+
+static const struct pci_device_id mlx5vf_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_MELLANOX, 0x101e) }, /* ConnectX Family mlx5Gen Virtual Function */
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
+
+static struct pci_driver mlx5vf_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = mlx5vf_pci_table,
+	.probe = mlx5vf_pci_probe,
+	.remove = mlx5vf_pci_remove,
+};
+
+static void __exit mlx5vf_pci_cleanup(void)
+{
+	pci_unregister_driver(&mlx5vf_pci_driver);
+}
+
+static int __init mlx5vf_pci_init(void)
+{
+	return pci_register_driver(&mlx5vf_pci_driver);
+}
+
+module_init(mlx5vf_pci_init);
+module_exit(mlx5vf_pci_cleanup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
+MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
+MODULE_DESCRIPTION(
+	"MLX5 VFIO PCI - User Level meta-driver for MLX5 device family");
-- 
2.18.1



* [PATCH V6 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected()
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (11 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Expose vfio_pci_core_aer_err_detected() to be used by drivers as part of
their pci_error_handlers structure.

The next patch for the mlx5 driver will use it.
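
As a sketch (not part of this patch), a variant driver can then wire the
exported handler up like so; my_reset_done() is a placeholder name:

	static const struct pci_error_handlers my_err_handlers = {
		.reset_done = my_reset_done,	/* driver specific handler */
		.error_detected = vfio_pci_core_aer_err_detected,
	};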

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 7 ++++---
 include/linux/vfio_pci_core.h    | 2 ++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 14a22ff20ef8..69e6d22ae815 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1865,8 +1865,8 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+						pci_channel_state_t state)
 {
 	struct vfio_pci_core_device *vdev;
 	struct vfio_device *device;
@@ -1888,6 +1888,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
 
 int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 {
@@ -1910,7 +1911,7 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
 const struct pci_error_handlers vfio_pci_core_err_handlers = {
-	.error_detected = vfio_pci_aer_err_detected,
+	.error_detected = vfio_pci_core_aer_err_detected,
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index beba0b2ed87d..9f1bf8e49d43 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -232,6 +232,8 @@ int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
 void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
 void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+						pci_channel_state_t state);
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
-- 
2.18.1



* [PATCH V6 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (12 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Register its own handler for pci_error_handlers.reset_done and update
state accordingly.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/mlx5/main.c | 55 +++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index c15c8eed85d3..4d65a5c2d3b3 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -28,9 +28,12 @@
 struct mlx5vf_pci_core_device {
 	struct vfio_pci_core_device core_device;
 	u8 migrate_cap:1;
+	u8 deferred_reset:1;
 	/* protect migration state */
 	struct mutex state_mutex;
 	enum vfio_device_mig_state mig_state;
+	/* protect the reset_done flow */
+	spinlock_t reset_lock;
 	u16 vhca_id;
 	struct mlx5_vf_migration_file *resuming_migf;
 	struct mlx5_vf_migration_file *saving_migf;
@@ -437,6 +440,25 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 	return ERR_PTR(-EINVAL);
 }
 
+/*
+ * This function is called in all state_mutex unlock cases to
+ * handle a 'deferred_reset' if one exists.
+ */
+static void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev)
+{
+again:
+	spin_lock(&mvdev->reset_lock);
+	if (mvdev->deferred_reset) {
+		mvdev->deferred_reset = false;
+		spin_unlock(&mvdev->reset_lock);
+		mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+		mlx5vf_disable_fds(mvdev);
+		goto again;
+	}
+	mutex_unlock(&mvdev->state_mutex);
+	spin_unlock(&mvdev->reset_lock);
+}
+
 static struct file *
 mlx5vf_pci_set_device_state(struct vfio_device *vdev,
 			    enum vfio_device_mig_state new_state,
@@ -466,10 +488,34 @@ mlx5vf_pci_set_device_state(struct vfio_device *vdev,
 		}
 	}
 	*final_state = mvdev->mig_state;
-	mutex_unlock(&mvdev->state_mutex);
+	mlx5vf_state_mutex_unlock(mvdev);
 	return res;
 }
 
+static void mlx5vf_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+
+	if (!mvdev->migrate_cap)
+		return;
+
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock we need to prevent ABBA deadlock
+	 * with the state_mutex and mm_lock.
+	 * In case the state_mutex was taken already we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&mvdev->reset_lock);
+	mvdev->deferred_reset = true;
+	if (!mutex_trylock(&mvdev->state_mutex)) {
+		spin_unlock(&mvdev->reset_lock);
+		return;
+	}
+	spin_unlock(&mvdev->reset_lock);
+	mlx5vf_state_mutex_unlock(mvdev);
+}
+
 static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
 {
 	struct mlx5vf_pci_core_device *mvdev = container_of(
@@ -550,6 +596,7 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 					VFIO_MIGRATION_STOP_COPY |
 					VFIO_MIGRATION_P2P;
 				mutex_init(&mvdev->state_mutex);
+				spin_lock_init(&mvdev->reset_lock);
 			}
 			mlx5_vf_put_core_dev(mdev);
 		}
@@ -584,11 +631,17 @@ static const struct pci_device_id mlx5vf_pci_table[] = {
 
 MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
 
+static const struct pci_error_handlers mlx5vf_err_handlers = {
+	.reset_done = mlx5vf_pci_aer_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
 static struct pci_driver mlx5vf_pci_driver = {
 	.name = KBUILD_MODNAME,
 	.id_table = mlx5vf_pci_table,
 	.probe = mlx5vf_pci_probe,
 	.remove = mlx5vf_pci_remove,
+	.err_handler = &mlx5vf_err_handlers,
 };
 
 static void __exit mlx5vf_pci_cleanup(void)
-- 
2.18.1



* [PATCH V6 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
  2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
                   ` (13 preceding siblings ...)
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
@ 2022-01-30 16:08 ` Yishai Hadas
  14 siblings, 0 replies; 55+ messages in thread
From: Yishai Hadas @ 2022-01-30 16:08 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

The optional PRE_COPY states open the saving data transfer FD before
reaching STOP_COPY and allow the device to dirty track internal state
changes, with the general idea of reducing the volume of data transferred
in the STOP_COPY stage.

While in PRE_COPY the device remains RUNNING, but the saving FD is open.

Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
which halts P2P transfers while continuing the saving FD.

PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
and exists as an optional FSM branch between RUNNING and STOP_COPY:
    RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY

A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
query the progress of the precopy operation in the driver, with the idea
that userspace will judge when to move to STOP_COPY: at the earliest once
the initial data set is transferred, and possibly after the dirty size has
shrunk appropriately.

We think there may also be merit in future extensions to the
VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
rate at which it generates internal dirty state.

Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
and left to be defined in a future revision. While making the whole
PRE_COPY feature optional eliminates the concern from mlx5, this is still
a complicated arc to implement, and it seems prudent to leave it closed
until a proper use case is developed. We also split the pending_bytes
report into the initial and
sustaining values, and define the protocol to get an event via poll() for
new dirty data during PRE_COPY.
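
As a sketch (not part of this patch), a userspace pre-copy loop built on
this ioctl could look roughly as follows; device_fd, data_fd, threshold
and stream_one_chunk() are hypothetical names used only for illustration:

	struct vfio_device_mig_precopy precopy = { .argsz = sizeof(precopy) };

	/* Stay in PRE_COPY until the estimates look good enough */
	for (;;) {
		if (ioctl(device_fd, VFIO_DEVICE_MIG_PRECOPY, &precopy))
			break;		/* no estimates, just go to STOP_COPY */
		if (!precopy.initial_bytes && precopy.dirty_bytes < threshold)
			break;		/* initial set sent, dirty residue small */
		stream_one_chunk(data_fd);	/* read() the saving FD */
	}
	/* then move to STOP_COPY via VFIO_DEVICE_MIG_SET_STATE */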

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       |  84 +++++++++++++++++++++++++++--
 include/linux/vfio.h      |   1 -
 include/uapi/linux/vfio.h | 110 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 187 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index a722a1a8a48a..264daffedb09 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1573,7 +1573,7 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state cur_fsm,
 			    enum vfio_device_mig_state new_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_PRE_COPY_P2P + 1 };
 	/*
 	 * The coding in this table requires the driver to implement
 	 * FSM arcs:
@@ -1592,25 +1592,59 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 	 *         RUNNING -> STOP
 	 *         STOP -> RUNNING
 	 *
+	 * If precopy is supported then the driver must support these additional
+	 * FSM arcs:
+	 *         RUNNING -> PRE_COPY
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> STOP_COPY
+	 * However, if precopy and P2P are supported together then the driver
+	 * must support these additional arcs beyond the P2P arcs above:
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> PRE_COPY_P2P
+	 *         PRE_COPY_P2P -> PRE_COPY
+	 *         PRE_COPY_P2P -> RUNNING_P2P
+	 *         PRE_COPY_P2P -> STOP_COPY
+	 *         RUNNING -> PRE_COPY
+	 *         RUNNING_P2P -> PRE_COPY_P2P
+	 *
 	 * If all optional features are supported then the coding will step
 	 * through multiple states for these combination transitions:
+	 *         PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RUNNING -> RUNNING_P2P -> STOP
 	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         STOP -> RUNNING_P2P -> RUNNING
+	 *         STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         STOP_COPY -> STOP -> RESUMING
 	 *         STOP_COPY -> STOP -> RUNNING_P2P
 	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
+	 *
+	 *  The following transitions are blocked:
+	 *         STOP_COPY -> PRE_COPY
+	 *         STOP_COPY -> PRE_COPY_P2P
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1619,14 +1653,38 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
+		[VFIO_DEVICE_STATE_PRE_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1635,6 +1693,8 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RESUMING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1643,6 +1703,8 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1651,6 +1713,8 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_ERROR] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
@@ -1658,18 +1722,32 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
 		},
 	};
 	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
+	bool have_pre_copy = device->migration_flags & VFIO_MIGRATION_PRE_COPY;
 
 	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
 	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
 		return VFIO_DEVICE_STATE_ERROR;
 
 	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
-			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
+			  new_fsm == VFIO_DEVICE_STATE_PRE_COPY_P2P ||
+			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
+			  cur_fsm == VFIO_DEVICE_STATE_PRE_COPY_P2P))
+		return VFIO_DEVICE_STATE_ERROR;
+
+	/*
+	 * PRE_COPY states do not appear as an interior next state, so rejecting
+	 * them here is sufficient.
+	 */
+	if (!have_pre_copy && (new_fsm == VFIO_DEVICE_STATE_PRE_COPY ||
+			       new_fsm == VFIO_DEVICE_STATE_PRE_COPY_P2P ||
+			       cur_fsm == VFIO_DEVICE_STATE_PRE_COPY ||
+			       cur_fsm == VFIO_DEVICE_STATE_PRE_COPY_P2P))
 		return VFIO_DEVICE_STATE_ERROR;
 
 	cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
 	if (!have_p2p) {
-		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P)
+		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
+		       cur_fsm == VFIO_DEVICE_STATE_PRE_COPY_P2P)
 			cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
 	}
 	return cur_fsm;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 69a574ba085e..f853a3539b8b 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -45,7 +45,6 @@ struct vfio_device {
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
- * @flags: Global flags from enum vfio_device_ops_flags
  * @open_device: Called when the first file descriptor is opened for this device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 70c77da5812d..7837fe2ca8de 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -785,12 +785,20 @@ struct vfio_device_feature {
  * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
  * is supported in addition to the STOP_COPY states.
  *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
+ * PRE_COPY is supported in addition to the STOP_COPY states.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
+ * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
+ * in addition to the STOP_COPY states.
+ *
  * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
 #define VFIO_MIGRATION_P2P		(1 << 1)
+#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -807,8 +815,13 @@ struct vfio_device_feature_migration {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
- * And 1 optional state to support VFIO_MIGRATION_P2P:
+ * And optional states to support VFIO_MIGRATION_P2P:
  *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ * And VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY - The device is running normally but tracking internal state
+ *             changes
+ * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
  *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
@@ -840,20 +853,48 @@ struct vfio_device_feature_migration {
  *
  *   To abort a RESUMING session the device must be reset.
  *
+ * PRE_COPY -> RUNNING
  * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * PRE_COPY_P2P -> RUNNING_P2P
  * RUNNING -> RUNNING_P2P
  * STOP -> RUNNING_P2P
  *   While in RUNNING_P2P the device is partially running in the P2P quiescent
  *   state defined below.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * RUNNING -> PRE_COPY
+ * RUNNING_P2P -> PRE_COPY_P2P
  * STOP -> STOP_COPY
- *   This arc begin the process of saving the device state and will return a
- *   new data_fd.
+ *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ *   which share a data transfer session. Moving between these states alters
+ *   what is streamed in the session, but does not terminate or otherwise affect
+ *   the associated fd.
+ *
+ *   These arcs begin the process of saving the device state and will return a
+ *   new data_fd. The migration driver may perform actions such as enabling
+ *   dirty logging of device state when entering PRE_COPY or PRE_COPY_P2P.
+ *
+ *   Each arc does not change the device operation; the device remains
+ *   RUNNING, P2P quiesced, or in STOP. The STOP_COPY state is described below
+ *   in PRE_COPY_P2P -> STOP_COPY.
+ *
+ * PRE_COPY -> PRE_COPY_P2P
+ *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ *   However, while in the PRE_COPY_P2P state, the device is partially running
+ *   in the P2P quiescent state defined below, like RUNNING_P2P.
  *
+ * PRE_COPY_P2P -> PRE_COPY
+ *   This arc allows returning the device to a full RUNNING behavior while
+ *   continuing all the behaviors of PRE_COPY.
+ *
+ * PRE_COPY_P2P -> STOP_COPY
  *   While in the STOP_COPY state the device has the same behavior as STOP
+ *   with the addition that the data transfer session continues to stream the
  *   migration state. End of stream on the FD indicates the entire device
@@ -871,6 +912,13 @@ struct vfio_device_feature_migration {
  *   internal device state for this arc if required to prepare the device to
  *   receive the migration data.
  *
+ * STOP_COPY -> PRE_COPY
+ * STOP_COPY -> PRE_COPY_P2P
+ *   These arcs are not permitted and return error if requested. Future
+ *   revisions of this API may define behaviors for these arcs, in this case
+ *   support will be discoverable by a new flag in
+ *   VFIO_DEVICE_FEATURE_MIGRATION.
+ *
  * any -> ERROR
  *   ERROR cannot be specified as a device state, however any transition request
  *   can be failed with an errno return and may then move the device_state into
@@ -883,7 +931,7 @@ struct vfio_device_feature_migration {
  * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
  * state for the device for the purposes of managing multiple devices within a
  * user context where peer-to-peer DMA between devices may be active. The
- * RUNNING_P2P states must prevent the device from initiating
+ * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
  * any new P2P DMA transactions. If the device can identify P2P transactions
  * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
  * driver must complete any such outstanding operations prior to completing the
@@ -894,6 +942,8 @@ struct vfio_device_feature_migration {
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
  *   - Select the shortest path.
+ *   - The path cannot have saving group states as interior states, only
+ *     starting/end states.
  * Refer to vfio_mig_get_next_state() for the result of the algorithm.
  *
  * The automatic transit through the FSM arcs that make up the combination
@@ -907,6 +957,9 @@ struct vfio_device_feature_migration {
 * support them. The user can discover if these states are supported by using
  * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
  * avoid knowing about these optional states if the kernel driver supports them.
+ *
+ * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
+ * is not present.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -915,6 +968,8 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+	VFIO_DEVICE_STATE_PRE_COPY = 6,
+	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
 };
 
 /**
@@ -960,6 +1015,53 @@ struct vfio_device_mig_set_state {
 
 #define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+/**
+ * VFIO_DEVICE_MIG_PRECOPY - _IO(VFIO_TYPE, VFIO_BASE + 22)
+ *
+ * This ioctl is used in the precopy phase of the migration data transfer. It
+ * returns an estimate of the current data sizes remaining to be transferred.
+ * It allows the user to judge when it is appropriate to leave PRE_COPY for
+ * STOP_COPY.
+ *
+ * initial_bytes reflects the estimated remaining size of any initial mandatory
+ * precopy data transfer. When initial_bytes returns as zero then the initial
+ * phase of the precopy data is completed. Generally initial_bytes should start
+ * out as approximately the entire device state.
+ *
+ * dirty_bytes reflects an estimate for how much more data needs to be
+ * transferred to complete the migration. Generally it should start as zero
+ * and increase as internal state is dirtied.
+ *
+ * Drivers should attempt to return estimates so that initial_bytes +
+ * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
+ * will require to be streamed.
+ *
+ * Drivers have a lot of flexibility in when and what they transfer during the
+ * PRE_COPY phase, and how they report this from VFIO_DEVICE_MIG_PRECOPY.
+ *
+ * During pre-copy the migration data FD has a temporary "end of stream" that is
+ * reached when both initial_bytes and dirty_bytes are zero. For instance, this
+ * may indicate that the device is idle and not currently dirtying any internal
+ * state. When read() is done on this temporary end of stream the kernel driver
+ * should return ENOMSG from read(). Userspace can wait for more data (which may
+ * never come) by using poll().
+ *
+ * Once in STOP_COPY the migration data FD has a permanent end of stream
+ * signaled in the usual way by read() always returning 0 and poll always
+ * returning readable. ENOMSG may not be returned in STOP_COPY. Support
+ * for this ioctl is optional.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_device_mig_precopy {
+	__u32 argsz;
+	__u32 flags;
+	__aligned_u64 initial_bytes;
+	__aligned_u64 dirty_bytes;
+};
+
+#define VFIO_DEVICE_MIG_PRECOPY _IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.18.1



* Re: [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
@ 2022-01-31 23:41   ` Alex Williamson
  2022-02-01  0:11     ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-01-31 23:41 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On Sun, 30 Jan 2022 18:08:18 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> Invoke a new device op 'device_feature' to handle just the data array
> portion of the command. This lifts the ioctl validation to the core code
> and makes it simpler for either the core code, or layered drivers, to
> implement their own feature values.
> 
> Provide vfio_check_feature() to consolidate checking the flags/etc against
> what the driver supports.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci.c      |  1 +
>  drivers/vfio/pci/vfio_pci_core.c | 90 ++++++++++++--------------------
>  drivers/vfio/vfio.c              | 46 ++++++++++++++--
>  include/linux/vfio.h             | 32 ++++++++++++
>  include/linux/vfio_pci_core.h    |  2 +
>  5 files changed, 109 insertions(+), 62 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index a5ce92beb655..2b047469e02f 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -130,6 +130,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
>  	.open_device	= vfio_pci_open_device,
>  	.close_device	= vfio_pci_core_close_device,
>  	.ioctl		= vfio_pci_core_ioctl,
> +	.device_feature = vfio_pci_core_ioctl_feature,
>  	.read		= vfio_pci_core_read,
>  	.write		= vfio_pci_core_write,
>  	.mmap		= vfio_pci_core_mmap,
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f948e6cd2993..14a22ff20ef8 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1114,70 +1114,44 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  
>  		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
>  					  ioeventfd.data, count, ioeventfd.fd);
> -	} else if (cmd == VFIO_DEVICE_FEATURE) {
> -		struct vfio_device_feature feature;
> -		uuid_t uuid;
> -
> -		minsz = offsetofend(struct vfio_device_feature, flags);
> -
> -		if (copy_from_user(&feature, (void __user *)arg, minsz))
> -			return -EFAULT;
> -
> -		if (feature.argsz < minsz)
> -			return -EINVAL;
> -
> -		/* Check unknown flags */
> -		if (feature.flags & ~(VFIO_DEVICE_FEATURE_MASK |
> -				      VFIO_DEVICE_FEATURE_SET |
> -				      VFIO_DEVICE_FEATURE_GET |
> -				      VFIO_DEVICE_FEATURE_PROBE))
> -			return -EINVAL;
> -
> -		/* GET & SET are mutually exclusive except with PROBE */
> -		if (!(feature.flags & VFIO_DEVICE_FEATURE_PROBE) &&
> -		    (feature.flags & VFIO_DEVICE_FEATURE_SET) &&
> -		    (feature.flags & VFIO_DEVICE_FEATURE_GET))
> -			return -EINVAL;
> -
> -		switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
> -		case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> -			if (!vdev->vf_token)
> -				return -ENOTTY;
> -
> -			/*
> -			 * We do not support GET of the VF Token UUID as this
> -			 * could expose the token of the previous device user.
> -			 */
> -			if (feature.flags & VFIO_DEVICE_FEATURE_GET)
> -				return -EINVAL;
> -
> -			if (feature.flags & VFIO_DEVICE_FEATURE_PROBE)
> -				return 0;
> -
> -			/* Don't SET unless told to do so */
> -			if (!(feature.flags & VFIO_DEVICE_FEATURE_SET))
> -				return -EINVAL;
> +	}
> +	return -ENOTTY;
> +}
> +EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> -			if (feature.argsz < minsz + sizeof(uuid))
> -				return -EINVAL;
> +int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> +				void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	uuid_t uuid;
> +	int ret;

Nit, should uuid at least be scoped within the token code?  Or token
code pushed to a separate function?

>  
> -			if (copy_from_user(&uuid, (void __user *)(arg + minsz),
> -					   sizeof(uuid)))
> -				return -EFAULT;
> +	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
> +	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> +		if (!vdev->vf_token)
> +			return -ENOTTY;
> +		/*
> +		 * We do not support GET of the VF Token UUID as this could
> +		 * expose the token of the previous device user.
> +		 */
> +		ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +					sizeof(uuid));
> +		if (ret != 1)
> +			return ret;
>  
> -			mutex_lock(&vdev->vf_token->lock);
> -			uuid_copy(&vdev->vf_token->uuid, &uuid);
> -			mutex_unlock(&vdev->vf_token->lock);
> +		if (copy_from_user(&uuid, arg, sizeof(uuid)))
> +			return -EFAULT;
>  
> -			return 0;
> -		default:
> -			return -ENOTTY;
> -		}
> +		mutex_lock(&vdev->vf_token->lock);
> +		uuid_copy(&vdev->vf_token->uuid, &uuid);
> +		mutex_unlock(&vdev->vf_token->lock);
> +		return 0;
> +	default:
> +		return -ENOTTY;
>  	}
> -
> -	return -ENOTTY;
>  }
> -EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
> +EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl_feature);
...
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 76191d7abed1..ca69516f869d 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -55,6 +55,7 @@ struct vfio_device {
>   * @match: Optional device name match callback (return: 0 for no-match, >0 for
>   *         match, -errno for abort (ex. match with insufficient or incorrect
>   *         additional args)
> + * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
>   */
>  struct vfio_device_ops {
>  	char	*name;
> @@ -69,8 +70,39 @@ struct vfio_device_ops {
>  	int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
>  	void	(*request)(struct vfio_device *vdev, unsigned int count);
>  	int	(*match)(struct vfio_device *vdev, char *buf);
> +	int	(*device_feature)(struct vfio_device *device, u32 flags,
> +				  void __user *arg, size_t argsz);
>  };
>  
> +/**
> + * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
> + * @flags: Arg from the device_feature op
> + * @argsz: Arg from the device_feature op
> + * @supported_ops: Combination of VFIO_DEVICE_FEATURE_GET and SET the driver
> + *                 supports
> + * @minsz: Minimum data size the driver accepts
> + *
> + * For use in a driver's device_feature op. Checks that the inputs to the
> + * VFIO_DEVICE_FEATURE ioctl are correct for the driver's feature. Returns 1 if
> + * the driver should execute the get or set, otherwise the relevant
> + * value should be returned.
> + */
> +static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops,
> +				    size_t minsz)
> +{
> +	if ((flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)) &
> +	    ~supported_ops)
> +		return -EINVAL;

These look like cases where it would be useful for userspace debugging
to differentiate errnos.

-EOPNOTSUPP?

> +	if (flags & VFIO_DEVICE_FEATURE_PROBE)
> +		return 0;
> +	/* Without PROBE one of GET or SET must be requested */
> +	if (!(flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)))
> +		return -EINVAL;
> +	if (argsz < minsz)
> +		return -EINVAL;

-ENOSPC?

Thanks,
Alex



* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
@ 2022-01-31 23:43   ` Alex Williamson
  2022-02-01  0:31     ` Jason Gunthorpe
  2022-02-01 12:06   ` Cornelia Huck
  1 sibling, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-01-31 23:43 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On Sun, 30 Jan 2022 18:08:19 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..d9162702973a 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid {
>  
>  struct vfio_device_migration_info {
>  	__u32 device_state;         /* VFIO device state */
> -#define VFIO_DEVICE_STATE_STOP      (0)
> -#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> -#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> -#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)

I assume the below is kept until we rip out all the references, but I'm
not sure why we're bothering to define V1 that's not used anywhere
versus just deleting the above to avoid collision with the new enum.

>  #define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>  				     VFIO_DEVICE_STATE_SAVING |  \
>  				     VFIO_DEVICE_STATE_RESUMING)
> @@ -1002,6 +1002,162 @@ struct vfio_device_feature {
>   */
>  #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
>  
> +/*
> + * Indicates the device can support the migration API. See enum
> + * vfio_device_mig_state for details. If present flags must be non-zero and
> + * VFIO_DEVICE_MIG_SET_STATE is supported.
> + *
> + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
> + * RESUMING are supported.
> + */
> +struct vfio_device_feature_migration {
> +	__aligned_u64 flags;
> +#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
> +};
> +#define VFIO_DEVICE_FEATURE_MIGRATION 1
> +
> +/*
> + * The device migration Finite State Machine is described by the enum
> + * vfio_device_mig_state. Some of the FSM arcs will create a migration data
> + * transfer session by returning a FD, in this case the migration data will
> + * flow over the FD using read() and write() as discussed below.
> + *
> + * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
> + *  RUNNING - The device is running normally
> + *  STOP - The device does not change the internal or external state
> + *  STOP_COPY - The device internal state can be read out
> + *  RESUMING - The device is stopped and is loading a new internal state
> + *  ERROR - The device has failed and must be reset
> + *
> + * The FSM takes actions on the arcs between FSM states. The driver implements
> + * the following behavior for the FSM arcs:
> + *
> + * RUNNING -> STOP
> + * STOP_COPY -> STOP
> + *   While in STOP the device must stop the operation of the device. The
> + *   device must not generate interrupts, DMA, or advance its internal
> + *   state. When stopped the device and kernel migration driver must accept
> + *   and respond to interaction to support external subsystems in the STOP
> > + *   state, for example PCI MSI-X and PCI config space. Failure by the user to
> + *   restrict device access while in STOP must not result in error conditions
> + *   outside the user context (ex. host system faults).
> + *
> + *   The STOP_COPY arc will terminate a data transfer session.
> + *
> + * RESUMING -> STOP
> + *   Leaving RESUMING terminates a data transfer session and indicates the
> + *   device should complete processing of the data delivered by write(). The
> + *   kernel migration driver should complete the incorporation of data written
> + *   to the data transfer FD into the device internal state and perform
> + *   final validity and consistency checking of the new device state. If the
> + *   user provided data is found to be incomplete, inconsistent, or otherwise
> + *   invalid, the migration driver must fail the SET_STATE ioctl and
> + *   optionally go to the ERROR state as described below.
> + *
> + *   While in STOP the device has the same behavior as other STOP states
> + *   described above.
> + *
> + *   To abort a RESUMING session the device must be reset.
> + *
> + * STOP -> RUNNING
> + *   While in RUNNING the device is fully operational, the device may generate
> + *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
> + *   and the device may advance its internal state.
> + *
> + * STOP -> STOP_COPY
> > + *   This arc begins the process of saving the device state and will return a
> + *   new data_fd.
> + *
> + *   While in the STOP_COPY state the device has the same behavior as STOP
> > + *   with the addition that the data transfer session continues to stream the
> + *   migration state. End of stream on the FD indicates the entire device
> + *   state has been transferred.
> + *
> + *   The user should take steps to restrict access to vfio device regions while
> + *   the device is in STOP_COPY or risk corruption of the device migration data
> + *   stream.
> + *
> + * STOP -> RESUMING
> + *   Entering the RESUMING state starts a process of restoring the device
> + *   state and will return a new data_fd. The data stream fed into the data_fd
> + *   should be taken from the data transfer output of the saving group states
> + *   from a compatible device. The migration driver may alter/reset the
> + *   internal device state for this arc if required to prepare the device to
> + *   receive the migration data.
> + *
> + * any -> ERROR
> + *   ERROR cannot be specified as a device state, however any transition request
> + *   can be failed with an errno return and may then move the device_state into
> + *   ERROR. In this case the device was unable to execute the requested arc and
> + *   was also unable to restore the device to any valid device_state. The ERROR
> + *   state will be returned as described below in VFIO_DEVICE_MIG_SET_STATE. To
> + *   recover from ERROR VFIO_DEVICE_RESET must be used to return the
> + *   device_state back to RUNNING.
> + *
> + * The remaining possible transitions are interpreted as combinations of the
> + * above FSM arcs. As there are multiple paths through the FSM arcs the path
> + * should be selected based on the following rules:
> + *   - Select the shortest path.
> + * Refer to vfio_mig_get_next_state() for the result of the algorithm.
> + *
> + * The automatic transit through the FSM arcs that make up the combination
> + * transition is invisible to the user. When working with combination arcs the
> + * user may see any step along the path in the device_state if SET_STATE
> + * fails. When handling these types of errors users should anticipate future
> + * revisions of this protocol using new states and those states becoming
> + * visible in this case.
> + */
> +enum vfio_device_mig_state {
> +	VFIO_DEVICE_STATE_ERROR = 0,
> +	VFIO_DEVICE_STATE_STOP = 1,
> +	VFIO_DEVICE_STATE_RUNNING = 2,
> +	VFIO_DEVICE_STATE_STOP_COPY = 3,
> +	VFIO_DEVICE_STATE_RESUMING = 4,
> +};
> +
> +/**
> + * VFIO_DEVICE_MIG_SET_STATE - _IO(VFIO_TYPE, VFIO_BASE + 21)
> + *
> + * Execute a migration state change command on the VFIO device. The new state is
> + * supplied in device_state.
> + *
> + * The kernel migration driver must fully transition the device to the new state
> + * value before the write(2) operation returns to the user.
> + *
> + * The kernel migration driver must not generate asynchronous device state
> + * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
> + * ioctl as described above.
> + *
> + * If this function fails and returns -1 then the device_state is updated with
> + * the current state the device is in. This may be the original operating state
> + * or some other state along the combination transition path. The user can then
> + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the
> + * original state, or attempt to return to some other state such as RUNNING or
> + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state
> + * output is not reliable.

I haven't made it through the full series yet, but it's not clear to me
why these specific errnos are being masked above.

> + *
> + * If the new_state starts a new data transfer session then the FD associated
> + * with that session is returned in data_fd. The user is responsible to close
> + * this FD when it is finished. The user must consider the migration data
> + * segments carried over the FD to be opaque and non-fungible. During RESUMING,
> + * the data segments must be written in the same order they came out of the
> + * saving side FD.

The lifecycle of this FD is a little sketchy.  The user is responsible
for closing the FD, but are they required to?  ie. should the migration driver
fail transitions if there's an outstanding FD?  Should the core code
mangle the f_ops or force and EOF or in some other way disconnect the FD
to avoid driver bugs/exploits with users poking stale FDs?  Should we
be bumping a reference on the device FD such that we can't have
outstanding migration FDs with the device closed (and re-assigned to a
new user)?

> + *
> + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL,
> + * and take no action. However the device_state will be updated with the current
> + * value.
> + *
> + * Return: 0 on success, -1 and errno set on failure.
> + */
> +struct vfio_device_mig_set_state {
> +	__u32 argsz;
> +	__u32 device_state;
> +	__s32 data_fd;
> +	__u32 flags;
> +};

argsz and flags layout is inconsistent with all other vfio ioctls.

> +
> +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21)

Did you consider whether this could also be implemented as a
VFIO_DEVICE_FEATURE?  Seems the feature struct would just be
device_state and data_fd.  Perhaps there's a use case for GET as well.
Thanks,

Alex



* Re: [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
  2022-01-31 23:41   ` Alex Williamson
@ 2022-02-01  0:11     ` Jason Gunthorpe
  2022-02-01 15:47       ` Alex Williamson
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01  0:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Mon, Jan 31, 2022 at 04:41:43PM -0700, Alex Williamson wrote:
> > +int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> > +				void __user *arg, size_t argsz)
> > +{
> > +	struct vfio_pci_core_device *vdev =
> > +		container_of(device, struct vfio_pci_core_device, vdev);
> > +	uuid_t uuid;
> > +	int ret;
> 
> Nit, should uuid at least be scoped within the token code?  Or token
> code pushed to a separate function?

Sure, it wasn't done before, but it would be nicer.

> > +static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops,
> > +				    size_t minsz)
> > +{
> > +	if ((flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)) &
> > +	    ~supported_ops)
> > +		return -EINVAL;
> 
> These look like cases where it would be useful for userspace debugging
> to differentiate errnos.

I tried to keep it unchanged from what it was today.

> -EOPNOTSUPP?

This would be my preference, but it would also be the first use in
vfio.

> > +	if (flags & VFIO_DEVICE_FEATURE_PROBE)
> > +		return 0;
> > +	/* Without PROBE one of GET or SET must be requested */
> > +	if (!(flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)))
> > +		return -EINVAL;
> > +	if (argsz < minsz)
> > +		return -EINVAL;
>
> -ENOSPC?

Do you want to do all of these minsz then? There are lots..

Jason


* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-01-31 23:43   ` Alex Williamson
@ 2022-02-01  0:31     ` Jason Gunthorpe
  2022-02-01 17:04       ` Alex Williamson
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01  0:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Mon, Jan 31, 2022 at 04:43:18PM -0700, Alex Williamson wrote:
> On Sun, 30 Jan 2022 18:08:19 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ef33ea002b0b..d9162702973a 100644
> > +++ b/include/uapi/linux/vfio.h
> > @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid {
> >  
> >  struct vfio_device_migration_info {
> >  	__u32 device_state;         /* VFIO device state */
> > -#define VFIO_DEVICE_STATE_STOP      (0)
> > -#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > -#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > -#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> 
> I assume the below is kept until we rip out all the references, but I'm
> not sure why we're bothering to define V1 that's not used anywhere
> versus just deleting the above to avoid collision with the new enum.

I felt adding the deletion made this patch too big so I shoved it into
its own patch after the v2 stuff is described. The rename here is only
because we end up with a naming conflict with the enum below.

> > + * If this function fails and returns -1 then the device_state is updated with
> > + * the current state the device is in. This may be the original operating state
> > + * or some other state along the combination transition path. The user can then
> > + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the
> > + * original state, or attempt to return to some other state such as RUNNING or
> > + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state
> > + * output is not reliable.
> 
> I haven't made it through the full series yet, but it's not clear to me
> why these specific errnos are being masked above.

Basically, we can't return the device_state unless we properly process
the ioctl. Eg old kernels that do not support this will return ENOTTY
and will not update it. If userspace messed up the pointer, EFAULT will
be returned and it will not be updated; finally, EOPNOTSUPP is a generic
escape for any future reason the kernel might not want to update it.

In practice, I found no use for using the device_state in the error
path in qemu, but it seemed useful for debugging.
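
For example, a userspace sketch following that rule might look like
this (device_fd is a placeholder; the struct is the one proposed in
this patch):

	struct vfio_device_mig_set_state set_state = {
		.argsz = sizeof(set_state),
		.device_state = VFIO_DEVICE_STATE_STOP_COPY,
		.data_fd = -1,
	};

	if (ioctl(device_fd, VFIO_DEVICE_MIG_SET_STATE, &set_state)) {
		/* device_state is only valid for "processed" failures */
		if (errno != ENOTTY && errno != EFAULT &&
		    errno != EOPNOTSUPP)
			fprintf(stderr, "FSM stopped at state %u\n",
				set_state.device_state);
		/* recover via VFIO_DEVICE_RESET or another transition */
	}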

> > + * If the new_state starts a new data transfer session then the FD associated
> > + * with that session is returned in data_fd. The user is responsible to close
> > + * this FD when it is finished. The user must consider the migration data
> > + * segments carried over the FD to be opaque and non-fungible. During RESUMING,
> > + * the data segments must be written in the same order they came out of the
> > + * saving side FD.
> 
> The lifecycle of this FD is a little sketchy.  The user is responsible
> to close the FD, are they required to?

No. Detecting this in the kernel would add notable complexity to the
drivers.

Let's clarify it:

 "close this FD when it no longer has data to
 read/write. data_fds are not re-used, every data transfer session gets
 a new FD."

?

> ie. should the migration driver fail transitions if there's an
> outstanding FD?

No, the driver should orphan that FD and use a fresh one the next
cycle. mlx5 will sanitize the FD, free all the memory, and render it
inoperable, which I'd view as best practice.

> Should the core code mangle the f_ops or force and EOF or in some
> other way disconnect the FD to avoid driver bugs/exploits with users
> poking stale FDs?  

We looked at swapping f_ops of a running fd for the iommufd project
and decided it was not allowed/desired. It needs locking.

Here the driver should piggyback the forced EOF on its own existing
locking that protects concurrent read/write, like mlx5 did. It is
straightforward.
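
A minimal sketch of the pattern (struct and function names are invented
here, not the actual mlx5 ones):

	struct mig_file {
		struct mutex lock;
		bool disabled;
		/* buffered migration data lives here */
	};

	static void mig_file_disable(struct mig_file *migf)
	{
		mutex_lock(&migf->lock);
		migf->disabled = true;	/* later read()/write() fail */
		/* free any buffered migration data here */
		mutex_unlock(&migf->lock);
	}

	static ssize_t mig_file_read(struct file *filp, char __user *buf,
				     size_t len, loff_t *pos)
	{
		struct mig_file *migf = filp->private_data;
		ssize_t done = 0;

		mutex_lock(&migf->lock);
		if (migf->disabled) {
			mutex_unlock(&migf->lock);
			return -ENODEV;
		}
		/* ... copy_to_user() from the saved device state ... */
		mutex_unlock(&migf->lock);
		return done;
	}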

> Should we be bumping a reference on the device FD such that we can't
> have outstanding migration FDs with the device closed (and
> re-assigned to a new user)?

The driver must ensure any activity triggered by the migration FD
against the vfio_device is halted before close_device() returns, just
like basically everything else connected to open/close_device(). mlx5
does this by using the same EOF sanitizing the FSM logic uses.

Once sanitized the f_ops should not be touching the vfio_device, or
even have a pointer to it, so there is no reason to connect the two
FDs together. I'd say it is a red flag if a driver proposes to do
this; likely it means it has a problem with the open/close_device()
lifetime model.

> > + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL,
> > + * and take no action. However the device_state will be updated with the current
> > + * value.
> > + *
> > + * Return: 0 on success, -1 and errno set on failure.
> > + */
> > +struct vfio_device_mig_set_state {
> > +	__u32 argsz;
> > +	__u32 device_state;
> > +	__s32 data_fd;
> > +	__u32 flags;
> > +};
> 
> argsz and flags layout is inconsistent with all other vfio ioctls.

OK

> 
> > +
> > +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21)
> 
> Did you consider whether this could also be implemented as a
> VFIO_DEVICE_FEATURE?  Seems the feature struct would just be
> device_state and data_fd.  Perhaps there's a use case for GET as well.
> Thanks,

Only briefly..

I'm not sure what the overall VFIO vision is here.. Are we abandoning
traditional ioctls in favour of a multiplexer? Calling the multiplexer
ioctl "feature" is a bit odd..

It complicates the user code a bit; invoking VFIO_DEVICE_FEATURE is
more involved than a plain ioctl (check the qemu patch to see the
difference).
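
E.g. a dedicated ioctl takes a single struct, while the feature path
means composing a header plus a trailing payload, roughly how the
existing VF token feature is driven from userspace (device_fd and
uuid_bytes are placeholders):

	uint8_t buf[sizeof(struct vfio_device_feature) + 16];
	struct vfio_device_feature *feat = (void *)buf;

	feat->argsz = sizeof(buf);
	feat->flags = VFIO_DEVICE_FEATURE_SET |
		      VFIO_DEVICE_FEATURE_PCI_VF_TOKEN;
	memcpy(feat->data, uuid_bytes, 16);	/* 16 byte UUID payload */
	ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);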

Either way I don't have a strong opinion; please have a think and let
us know which you'd like to follow.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1 Yishai Hadas
@ 2022-02-01 11:23   ` Cornelia Huck
  2022-02-01 12:13     ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2022-02-01 11:23 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
>
> v1 was never implemented and is replaced by v2.
>
> The old uAPI definitions are removed from the header file. As per Linus's
> past remarks we do not have a hard requirement to retain compilation
> compatibility in uapi headers and qemu is already following Linus's
> preferred model of copying the kernel headers.

If we are all in agreement that we will replace v1 with v2 (and I think
we are), we probably should remove the x-enable-migration stuff in QEMU
sooner rather than later, to avoid leaving a trap for the next
unsuspecting person trying to update the headers.

>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 228 --------------------------------------
>  1 file changed, 228 deletions(-)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9efc35535b29..70c77da5812d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -323,7 +323,6 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> -#define VFIO_REGION_TYPE_MIGRATION              (3)

Do we want to keep region type 3 reserved? Probably not really needed,
but would put us on the safe side.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
@ 2022-02-01 11:54   ` Cornelia Huck
  2022-02-01 12:13     ` Jason Gunthorpe
  2022-02-01 18:31   ` Alex Williamson
  1 sibling, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2022-02-01 11:54 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:

> @@ -44,6 +45,7 @@ struct vfio_device {
>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
>   *
> + * @flags: Global flags from enum vfio_device_ops_flags

You add this here, only to remove it in patch 15 again. Leftover from
some refactoring?

>   * @open_device: Called when the first file descriptor is opened for this device
>   * @close_device: Opposite of open_device
>   * @read: Perform read(2) on device file descriptor


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
  2022-01-31 23:43   ` Alex Williamson
@ 2022-02-01 12:06   ` Cornelia Huck
  2022-02-01 12:10     ` Jason Gunthorpe
  1 sibling, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2022-02-01 12:06 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:

> @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  		return -EINVAL;
>  
>  	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
> +	case VFIO_DEVICE_FEATURE_MIGRATION:
> +		return vfio_ioctl_device_feature_migration(
> +			device, feature.flags, arg->data,
> +			feature.argsz - minsz);
>  	default:
>  		if (unlikely(!device->ops->device_feature))
>  			return -EINVAL;
> @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>  	struct vfio_device *device = filep->private_data;
>  
>  	switch (cmd) {
> +	case VFIO_DEVICE_MIG_SET_STATE:
> +		return vfio_ioctl_mig_set_state(device, (void __user *)arg);
>  	case VFIO_DEVICE_FEATURE:
>  		return vfio_ioctl_device_feature(device, (void __user *)arg);
>  	default:

Not really a critique of this patch, but have we considered how mediated
devices will implement migration?

I.e. what parts of the ops will need to be looped through the mdev ops?
Do we need to consider the scope of some queries/operations (whole
device vs subdivisions etc.)? Not trying to distract from the whole new
interface here, but I think we should have at least an idea.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-01 12:06   ` Cornelia Huck
@ 2022-02-01 12:10     ` Jason Gunthorpe
  2022-02-01 12:18       ` Cornelia Huck
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 12:10 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 01:06:51PM +0100, Cornelia Huck wrote:
> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
> >  		return -EINVAL;
> >  
> >  	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
> > +	case VFIO_DEVICE_FEATURE_MIGRATION:
> > +		return vfio_ioctl_device_feature_migration(
> > +			device, feature.flags, arg->data,
> > +			feature.argsz - minsz);
> >  	default:
> >  		if (unlikely(!device->ops->device_feature))
> >  			return -EINVAL;
> > @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
> >  	struct vfio_device *device = filep->private_data;
> >  
> >  	switch (cmd) {
> > +	case VFIO_DEVICE_MIG_SET_STATE:
> > +		return vfio_ioctl_mig_set_state(device, (void __user *)arg);
> >  	case VFIO_DEVICE_FEATURE:
> >  		return vfio_ioctl_device_feature(device, (void __user *)arg);
> >  	default:
> 
> Not really a critique of this patch, but have we considered how mediated
> devices will implement migration?

Yes

> I.e. what parts of the ops will need to be looped through the mdev
> ops?

I've deleted mdev ops in every driver except the intel vgpu; once
Christoph's patch there is merged mdev ops will be almost gone
completely.

mdev drivers now implement normal vfio_device_ops and require nothing
special for migration.

> Do we need to consider the scope of some queries/operations (whole
> device vs subdivisions etc.)? Not trying to distract from the whole new
> interface here, but I think we should have at least an idea.

All vfio operations on the device FD operate on whatever the struct
vfio_device is.
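
I.e. a migration-capable mdev driver just fills in the same
vfio_device_ops as any other vfio driver (sketch, names invented):

	static const struct vfio_device_ops my_mdev_ops = {
		.name = "my-mdev",
		.open_device = my_open_device,
		.close_device = my_close_device,
		.read = my_read,
		.write = my_write,
		.ioctl = my_ioctl,
		/* migration is just another vfio_device_ops callback */
		.migration_set_state = my_migration_set_state,
	};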

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 11:23   ` Cornelia Huck
@ 2022-02-01 12:13     ` Jason Gunthorpe
  2022-02-01 12:39       ` Cornelia Huck
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 12:13 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 12:23:05PM +0100, Cornelia Huck wrote:
> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> >
> > v1 was never implemented and is replaced by v2.
> >
> > The old uAPI definitions are removed from the header file. As per Linus's
> > past remarks we do not have a hard requirement to retain compilation
> > compatibility in uapi headers and qemu is already following Linus's
> > preferred model of copying the kernel headers.
> 
> If we are all in agreement that we will replace v1 with v2 (and I think
> we are), we probably should remove the x-enable-migration stuff in QEMU
> sooner rather than later, to avoid leaving a trap for the next
> unsuspecting person trying to update the headers.

Once we have agreement on the kernel patch we plan to send a QEMU
patch making it support the v2 interface and making the migration
non-experimental. We are also working on fixing the error paths, at
least within the limitations of the current qemu design.

The v1 support should remain in old releases as it is being used in
the field "experimentally".

> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9efc35535b29..70c77da5812d 100644
> > +++ b/include/uapi/linux/vfio.h
> > @@ -323,7 +323,6 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >  #define VFIO_REGION_TYPE_GFX                    (1)
> >  #define VFIO_REGION_TYPE_CCW			(2)
> > -#define VFIO_REGION_TYPE_MIGRATION              (3)
> 
> Do we want to keep region type 3 reserved? Probably not really needed,
> but would put us on the safe side.

Yes, thanks, dropping this was too zealous.

Jason 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-01 11:54   ` Cornelia Huck
@ 2022-02-01 12:13     ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 12:13 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 12:54:10PM +0100, Cornelia Huck wrote:
> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > @@ -44,6 +45,7 @@ struct vfio_device {
> >  /**
> >   * struct vfio_device_ops - VFIO bus driver device callbacks
> >   *
> > + * @flags: Global flags from enum vfio_device_ops_flags
> 
> You add this here, only to remove it in patch 15 again. Leftover from
> some refactoring?

Yes, thanks, it is a rebasing error :\

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-01 12:10     ` Jason Gunthorpe
@ 2022-02-01 12:18       ` Cornelia Huck
  2022-02-01 12:27         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2022-02-01 12:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 01:06:51PM +0100, Cornelia Huck wrote:
>> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
>> 
>> > @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>> >  		return -EINVAL;
>> >  
>> >  	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
>> > +	case VFIO_DEVICE_FEATURE_MIGRATION:
>> > +		return vfio_ioctl_device_feature_migration(
>> > +			device, feature.flags, arg->data,
>> > +			feature.argsz - minsz);
>> >  	default:
>> >  		if (unlikely(!device->ops->device_feature))
>> >  			return -EINVAL;
>> > @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>> >  	struct vfio_device *device = filep->private_data;
>> >  
>> >  	switch (cmd) {
>> > +	case VFIO_DEVICE_MIG_SET_STATE:
>> > +		return vfio_ioctl_mig_set_state(device, (void __user *)arg);
>> >  	case VFIO_DEVICE_FEATURE:
>> >  		return vfio_ioctl_device_feature(device, (void __user *)arg);
>> >  	default:
>> 
>> Not really a critique of this patch, but have we considered how mediated
>> devices will implement migration?
>
> Yes
>
>> I.e. what parts of the ops will need to be looped through the mdev
>> ops?
>
> I've deleted mdev ops in every driver except the intel vgpu; once
> Christoph's patch there is merged mdev ops will be almost gone
> completely.

Ok, if there's nothing left to do, that's fine. (I'm assuming that the
Intel vgpu patch is on its way in? I usually don't keep track of things
I'm not directly involved with.)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-01 12:18       ` Cornelia Huck
@ 2022-02-01 12:27         ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 12:27 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 01:18:29PM +0100, Cornelia Huck wrote:
> On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 01, 2022 at 01:06:51PM +0100, Cornelia Huck wrote:
> >> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> >> 
> >> > @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
> >> >  		return -EINVAL;
> >> >  
> >> >  	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
> >> > +	case VFIO_DEVICE_FEATURE_MIGRATION:
> >> > +		return vfio_ioctl_device_feature_migration(
> >> > +			device, feature.flags, arg->data,
> >> > +			feature.argsz - minsz);
> >> >  	default:
> >> >  		if (unlikely(!device->ops->device_feature))
> >> >  			return -EINVAL;
> >> > @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
> >> >  	struct vfio_device *device = filep->private_data;
> >> >  
> >> >  	switch (cmd) {
> >> > +	case VFIO_DEVICE_MIG_SET_STATE:
> >> > +		return vfio_ioctl_mig_set_state(device, (void __user *)arg);
> >> >  	case VFIO_DEVICE_FEATURE:
> >> >  		return vfio_ioctl_device_feature(device, (void __user *)arg);
> >> >  	default:
> >> 
> >> Not really a critique of this patch, but have we considered how mediated
> >> devices will implement migration?
> >
> > Yes
> >
> >> I.e. what parts of the ops will need to be looped through the mdev
> >> ops?
> >
> > I've deleted mdev ops in every driver except the intel vgpu; once
> > Christoph's patch there is merged mdev ops will be almost gone
> > completely.
> 
> Ok, if there's nothing left to do, that's fine. (I'm assuming that the
> Intel vgpu patch is on its way in? I usually don't keep track of things
> I'm not directly involved with.)

It is awaiting some infrastructure patches Intel is working on, but
progressing slowly.

In any event, it doesn't block other mdev drivers from using the new
ops scheme; it only blocks us from deleting the core code supporting
it.

Jason 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 12:13     ` Jason Gunthorpe
@ 2022-02-01 12:39       ` Cornelia Huck
  2022-02-01 12:54         ` Jason Gunthorpe
  2022-02-01 23:01         ` Alex Williamson
  0 siblings, 2 replies; 55+ messages in thread
From: Cornelia Huck @ 2022-02-01 12:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 12:23:05PM +0100, Cornelia Huck wrote:
>> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
>> 
>> > From: Jason Gunthorpe <jgg@nvidia.com>
>> >
>> > v1 was never implemented and is replaced by v2.
>> >
>> > The old uAPI definitions are removed from the header file. As per Linus's
>> > past remarks we do not have a hard requirement to retain compilation
>> > compatibility in uapi headers and qemu is already following Linus's
>> > preferred model of copying the kernel headers.
>> 
>> If we are all in agreement that we will replace v1 with v2 (and I think
>> we are), we probably should remove the x-enable-migration stuff in QEMU
>> sooner rather than later, to avoid leaving a trap for the next
>> unsuspecting person trying to update the headers.
>
> Once we have agreement on the kernel patch we plan to send a QEMU
> patch making it support the v2 interface and making the migration
> non-experimental. We are also working on fixing the error paths, at
> least within the limitations of the current qemu design.

I'd argue that just ripping out the old interface first would be easier,
as it does not require us to synchronize with a headers sync (and does
not require synchronizing a headers sync with ripping it out...)

> The v1 support should remain in old releases as it is being used in
> the field "experimentally".

Of course; it would be hard to rip it out retroactively :)

But it should really be gone in QEMU 7.0.

Considering adding the v2 uapi, we might get unlucky: The Linux 5.18
merge window will likely be in mid-late March (and we cannot run a
headers sync before the patches hit Linus' tree), while QEMU 7.0 will
likely enter freeze in mid-late March as well. So there's a non-zero
chance that the new uapi will need to be deferred to 7.1.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 12:39       ` Cornelia Huck
@ 2022-02-01 12:54         ` Jason Gunthorpe
  2022-02-01 13:26           ` Cornelia Huck
  2022-02-01 23:01         ` Alex Williamson
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 12:54 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 01:39:23PM +0100, Cornelia Huck wrote:
> On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 01, 2022 at 12:23:05PM +0100, Cornelia Huck wrote:
> >> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> >> 
> >> > From: Jason Gunthorpe <jgg@nvidia.com>
> >> >
> >> > v1 was never implemented and is replaced by v2.
> >> >
> >> > The old uAPI definitions are removed from the header file. As per Linus's
> >> > past remarks we do not have a hard requirement to retain compilation
> >> > compatibility in uapi headers and qemu is already following Linus's
> >> > preferred model of copying the kernel headers.
> >> 
> >> If we are all in agreement that we will replace v1 with v2 (and I think
> >> we are), we probably should remove the x-enable-migration stuff in QEMU
> >> sooner rather than later, to avoid leaving a trap for the next
> >> unsuspecting person trying to update the headers.
> >
> > Once we have agreement on the kernel patch we plan to send a QEMU
> > patch making it support the v2 interface and making the migration
> > non-experimental. We are also working on fixing the error paths, at
> > least within the limitations of the current qemu design.
> 
> I'd argue that just ripping out the old interface first would be easier,
> as it does not require us to synchronize with a headers sync (and does
> not require synchronizing a headers sync with ripping it out...)

We haven't worked out the best way to organize the qemu patch series;
currently it is just one patch that updates everything together, but
that is perhaps a bit too big...

I have thought that a 3 patch series deleting the existing v1 code and
then re-adding it is a potential option, but we don't change
everything, just almost everything...

> > The v1 support should remain in old releases as it is being used in
> > the field "experimentally".
> 
> Of course; it would be hard to rip it out retroactively :)
> 
> But it should really be gone in QEMU 7.0.

Seems like you are arguing from both sides: we can't put the v2 into
7.0 because Linus has not accepted it, but we have to rip the v1 out
even though Linus hasn't accepted that?

We can certainly defer the kernel's removal patch for a release if it
makes qemu's life easier?

> Considering adding the v2 uapi, we might get unlucky: The Linux 5.18
> merge window will likely be in mid-late March (and we cannot run a
> headers sync before the patches hit Linus' tree), while QEMU 7.0 will
> likely enter freeze in mid-late March as well. So there's a non-zero
> chance that the new uapi will need to be deferred to 7.1.

Usually in rdma land we start advancing the user side once the kernel
patches hit the kernel maintainer tree, not Linus's. I run a
non-rebasing tree so that gives a permanent git hash. It works well
enough and avoids these kinds of artificial delays.

Anyhow, it doesn't matter much for the kernel series, but the sooner
we can agree on this the better, I suppose.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 12:54         ` Jason Gunthorpe
@ 2022-02-01 13:26           ` Cornelia Huck
  2022-02-01 13:52             ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2022-02-01 13:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 01:39:23PM +0100, Cornelia Huck wrote:
>> On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> 
>> > On Tue, Feb 01, 2022 at 12:23:05PM +0100, Cornelia Huck wrote:
>> >> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
>> >> 
>> >> > From: Jason Gunthorpe <jgg@nvidia.com>
>> >> >
>> >> > v1 was never implemented and is replaced by v2.
>> >> >
>> >> > The old uAPI definitions are removed from the header file. As per Linus's
>> >> > past remarks we do not have a hard requirement to retain compilation
>> >> > compatibility in uapi headers and qemu is already following Linus's
>> >> > preferred model of copying the kernel headers.
>> >> 
>> >> If we are all in agreement that we will replace v1 with v2 (and I think
>> >> we are), we probably should remove the x-enable-migration stuff in QEMU
>> >> sooner rather than later, to avoid leaving a trap for the next
>> >> unsuspecting person trying to update the headers.
>> >
>> > Once we have agreement on the kernel patch we plan to send a QEMU
>> > patch making it support the v2 interface and making the migration
>> > non-experimental. We are also working on fixing the error paths, at
>> > least within the limitations of the current qemu design.
>> 
>> I'd argue that just ripping out the old interface first would be easier,
>> as it does not require us to synchronize with a headers sync (and does
>> not require synchronizing a headers sync with ripping it out...)
>
> We haven't worked out the best way to organize the qemu patch series;
> currently it is just one patch that updates everything together, but
> that is perhaps a bit too big...
>
> I have thought that a 3 patch series deleting the existing v1 code and
> then re-adding it is a potential option, but we don't change
> everything, just almost everything...

Even in that case, removing the old code and adding the new one is
probably much easier to review. (Also, you obviously need to have the
header update in between those two stages.)

>
>> > The v1 support should remain in old releases as it is being used in
>> > the field "experimentally".
>> 
>> Of course; it would be hard to rip it out retroactively :)
>> 
>> But it should really be gone in QEMU 7.0.
>
> Seems like you are arguing from both sides: we can't put the v2 into
> 7.0 because Linus has not accepted it, but we have to rip the v1 out
> even though Linus hasn't accepted that?
>
> We can certainly defer the kernel's removal patch for a release if it
> makes qemu's life easier?

No, I'm only talking about the QEMU implementation (i.e. the code that
uses the v1 definitions and exposes x-enable-migration). Any change in
the headers needs to be done via a sync with upstream Linux.

>
>> Considering adding the v2 uapi, we might get unlucky: The Linux 5.18
>> merge window will likely be in mid-late March (and we cannot run a
>> headers sync before the patches hit Linus' tree), while QEMU 7.0 will
>> likely enter freeze in mid-late March as well. So there's a non-zero
>> chance that the new uapi will need to be deferred to 7.1.
>
> Usually in rdma land we start advancing the user side once the kernel
> patches hit the kernel maintainer tree, not Linus's. I run a
> non-rebasing tree so that gives a permanent git hash. It works well
> enough and avoids these kinds of artificial delays.

QEMU policy is "it must be in Linus' tree [*]", because we run a full
header sync. We have been bitten by premature updates in the
past. Updates of only parts of the headers are only acceptable during
development of a patch series, and must be marked as "will be replaced
with a proper header sync".

[*] Preferably a (full or -rc) release, but the very minimum is a git
hash from his tree.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 13:26           ` Cornelia Huck
@ 2022-02-01 13:52             ` Jason Gunthorpe
  2022-02-01 14:19               ` Cornelia Huck
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 13:52 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 02:26:29PM +0100, Cornelia Huck wrote:

> > We can certainly defer the kernel's removal patch for a release if it
> > makes qemu's life easier?
> 
> No, I'm only talking about the QEMU implementation (i.e. the code that
> uses the v1 definitions and exposes x-enable-migration). Any change in
> the headers needs to be done via a sync with upstream Linux.

If we leave the v1 and v2 defs in the kernel header then qemu can sync
and do the trivial rename and keep going as-is.

Then we can come with the patches to qemu update to v2, however that
looks.

We'll clean the kernel header in the next cycle.

OK?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 13:52             ` Jason Gunthorpe
@ 2022-02-01 14:19               ` Cornelia Huck
  2022-02-01 14:29                 ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2022-02-01 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 02:26:29PM +0100, Cornelia Huck wrote:
>
>> > We can certainly defer the kernel's removal patch for a release if it
>> > makes qemu's life easier?
>> 
>> No, I'm only talking about the QEMU implementation (i.e. the code that
>> uses the v1 definitions and exposes x-enable-migration). Any change in
>> the headers needs to be done via a sync with upstream Linux.
>
> If we leave the v1 and v2 defs in the kernel header then qemu can sync
> and do the trivial rename and keep going as-is.
>
> Then we can come with the patches to qemu update to v2, however that
> looks.
>
> We'll clean the kernel header in the next cylce.

I'm not sure we're talking about the same things here...

My proposal is:

- remove the current QEMU implementation of vfio migration for 7.0 (it's
  experimental, and if there's anybody experimenting with that, they can
  stay on 6.2)
- continue with getting this proposal for the kernel into good shape, so
  that it can hopefully make the next merge window
(- also continue to get the documentation into good shape)
- have an RFC for QEMU that contains a provisional update of the
  relevant vfio headers so that we can discuss the QEMU side (and maybe
  shoot down any potential problems in the uapi before they are merged
  in the kernel)

I don't think a "dual version header" would really help here. If we
don't want to rip out the old QEMU implementation yet, I can certainly
also live with that. We just need to be mindful once the changes hit
Linus' tree, but it is quite likely that QEMU would be in freeze by
then. As long as updating the headers leads to an obvious failure, it's
manageable (although the removal would still be my preferred approach).

Alex, what do you think?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 14:19               ` Cornelia Huck
@ 2022-02-01 14:29                 ` Jason Gunthorpe
  2022-02-02 11:34                   ` Cornelia Huck
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 14:29 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 03:19:18PM +0100, Cornelia Huck wrote:

> - remove the current QEMU implementation of vfio migration for 7.0 (it's
>   experimental, and if there's anybody experimenting with that, they can
>   stay on 6.2)

I think we went from "we clarified how the ABI works and made
something ABI compatible with qemu" to "let's delete the whole thing
from a released qemu" rather quickly...

To be clear this is still all logically compatible with the v1
interface, and we might, or might not, want to use the ABI compatible
version we already built out of tree to support the existing installed
base of qemu.

Dropping the whole thing seems to only make things worse for this
ecosystem, IMHO.

> (- also continue to get the documentation into good shape)

Which items do you see here?

> - have an RFC for QEMU that contains a provisional update of the
>   relevant vfio headers so that we can discuss the QEMU side (and maybe
>   shoot down any potential problems in the uapi before they are merged
>   in the kernel)

This qemu patch is linked in the cover letter.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
  2022-02-01  0:11     ` Jason Gunthorpe
@ 2022-02-01 15:47       ` Alex Williamson
  2022-02-01 15:49         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-02-01 15:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Mon, 31 Jan 2022 20:11:48 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Jan 31, 2022 at 04:41:43PM -0700, Alex Williamson wrote:
> > > +int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> > > +				void __user *arg, size_t argsz)
> > > +{
> > > +	struct vfio_pci_core_device *vdev =
> > > +		container_of(device, struct vfio_pci_core_device, vdev);
> > > +	uuid_t uuid;
> > > +	int ret;  
> > 
> > Nit, should uuid at least be scoped within the token code?  Or token
> > code pushed to a separate function?  
> 
> Sure, it wasn't done before, but it would be nicer.
> 
> > > +static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops,
> > > +				    size_t minsz)
> > > +{
> > > +	if ((flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)) &
> > > +	    ~supported_ops)
> > > +		return -EINVAL;  
> > 
> > These look like cases where it would be useful for userspace debugging
> > to differentiate errnos.  
> 
> I tried to keep it unchanged from what it was today.
> 
> > -EOPNOTSUPP?  
> 
> This would be my preference, but it would also be the first use in
> vfio
> 
> > > +	if (flags & VFIO_DEVICE_FEATURE_PROBE)
> > > +		return 0;
> > > +	/* Without PROBE one of GET or SET must be requested */
> > > +	if (!(flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)))
> > > +		return -EINVAL;
> > > +	if (argsz < minsz)
> > > +		return -EINVAL;  
> >
> > -ENOSPC?  
> 
> Do you want to do all of these minsz then? There are lots..

Hmm, maybe this one is more correct as EINVAL.  In the existing use
cases the structure associated with the feature is a fixed size, so
it's not a matter that we don't have space for a return like
HOT_RESET_INFO; it's simply invalid arguments by the caller.  I guess
keep this one as EINVAL, but EOPNOTSUPP seems useful for the previous.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl
  2022-02-01 15:47       ` Alex Williamson
@ 2022-02-01 15:49         ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 15:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 08:47:58AM -0700, Alex Williamson wrote:
> On Mon, 31 Jan 2022 20:11:48 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Jan 31, 2022 at 04:41:43PM -0700, Alex Williamson wrote:
> > > > +int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> > > > +				void __user *arg, size_t argsz)
> > > > +{
> > > > +	struct vfio_pci_core_device *vdev =
> > > > +		container_of(device, struct vfio_pci_core_device, vdev);
> > > > +	uuid_t uuid;
> > > > +	int ret;  
> > > 
> > > Nit, should uuid at least be scoped within the token code?  Or token
> > > code pushed to a separate function?  
> > 
> > Sure, it wasn't done before, but it would be nicer.
> > 
> > > > +static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops,
> > > > +				    size_t minsz)
> > > > +{
> > > > +	if ((flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)) &
> > > > +	    ~supported_ops)
> > > > +		return -EINVAL;  
> > > 
> > > These look like cases where it would be useful for userspace debugging
> > > to differentiate errnos.  
> > 
> > I tried to keep it unchanged from what it was today.
> > 
> > > -EOPNOTSUPP?  
> > 
> > This would be my preference, but it would also be the first use in
> > vfio
> > 
> > > > +	if (flags & VFIO_DEVICE_FEATURE_PROBE)
> > > > +		return 0;
> > > > +	/* Without PROBE one of GET or SET must be requested */
> > > > +	if (!(flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)))
> > > > +		return -EINVAL;
> > > > +	if (argsz < minsz)
> > > > +		return -EINVAL;  
> > >
> > > -ENOSPC?  
> > 
> > Do you want to do all of these minsz then? There are lots..
> 
> Hmm, maybe this one is more correct as EINVAL.  In the existing use
> cases the structure associated with the feature is a fixed size, so
> it's not a matter that we don't have space for a return like
> HOT_RESET_INFO; it's simply invalid arguments by the caller.  I guess
> keep this one as EINVAL, but EOPNOTSUPP seems useful for the previous.

Do you want EOPNOTSUPP or ENOTTY like most other places in vfio?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-01  0:31     ` Jason Gunthorpe
@ 2022-02-01 17:04       ` Alex Williamson
  2022-02-01 18:36         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-02-01 17:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Mon, 31 Jan 2022 20:31:24 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Jan 31, 2022 at 04:43:18PM -0700, Alex Williamson wrote:
> > On Sun, 30 Jan 2022 18:08:19 +0200
> > Yishai Hadas <yishaih@nvidia.com> wrote:  
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ef33ea002b0b..d9162702973a 100644
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid {
> > >  
> > >  struct vfio_device_migration_info {
> > >  	__u32 device_state;         /* VFIO device state */
> > > -#define VFIO_DEVICE_STATE_STOP      (0)
> > > -#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > -#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > -#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)  
> > 
> > I assume the below is kept until we rip out all the references, but I'm
> > not sure why we're bothering to define V1 that's not used anywhere
> > versus just deleting the above to avoid collision with the new enum.  
> 
> I felt adding the deletion made this patch too big so I shoved it into
> its own patch after the v2 stuff is described. The rename here is only
> because we end up with a naming conflict with the enum below.

Right, but we could just as easily delete the above 4 lines here to
avoid the conflict rather than renaming them to V1.

> > > + * If this function fails and returns -1 then the device_state is updated with
> > > + * the current state the device is in. This may be the original operating state
> > > + * or some other state along the combination transition path. The user can then
> > > + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the
> > > + * original state, or attempt to return to some other state such as RUNNING or
> > > + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state
> > > + * output is not reliable.  
> > 
> > I haven't made it through the full series yet, but it's not clear to me
> > why these specific errnos are being masked above.  
> 
> Basically, we can't return the device_state unless we properly process
> the ioctl. E.g. old kernels that do not support this will return ENOTTY
> and will not update it. If userspace messed up the pointer then EFAULT
> will be returned and it will not be updated. Finally, EOPNOTSUPP is a
> generic escape for any future reason the kernel might not want to
> update it.
> 
> In practice, I found no use for using the device_state in the error
> path in qemu, but it seemed useful for debugging.


Ok, let me parrot back to see if I understand.  -ENOTTY will be
returned if the ioctl doesn't exist, in which case device_state is
untouched and cannot be trusted.  At the same time, we expect the user
to use the feature ioctl to make sure the ioctl exists, so it would
seem that we've reclaimed that errno if we believe the user should
follow the protocol.

-EOPNOTSUPP is returned both if the driver doesn't support migration
(which should be invalid based on the protocol), ie. this:

+       if (!device->ops->migration_set_state)
+               return -EOPNOTSUPP;

Should return -ENOTTY, just as the feature does.  But it's also used
for future unsupported ops; couldn't we also specify that the driver
must fill final_state with the current device state for any such case?
We also have this:

+       if (set_state.argsz < minsz || set_state.flags)
+               return -EOPNOTSUPP;

Which I think should be -EINVAL.

That leaves -EFAULT, for example:

+       if (copy_from_user(&set_state, arg, minsz))
+               return -EFAULT;

Should we be able to know the current device state in core code such
that we can fill in device state here?

I think those changes would go a ways towards fully specified behavior
instead of these wishy washy unreliable return values.  Then we could
also get rid of this paranoia protection of those errnos:

+       if (IS_ERR(filp)) {
+               if (WARN_ON(PTR_ERR(filp) == -EOPNOTSUPP ||
+                           PTR_ERR(filp) == -ENOTTY ||
+                           PTR_ERR(filp) == -EFAULT))
+                       filp = ERR_PTR(-EINVAL);
+               goto out_copy;
+       }

Also, the original text of this uapi paragraph reads:

 "If this function fails and returns -1 then..."

Could we clarify that to s/function/ioctl/?  It caused me a moment of
confusion for the returned -errnos.

> > > + * If the new_state starts a new data transfer session then the FD associated
> > > + * with that session is returned in data_fd. The user is responsible to close
> > > + * this FD when it is finished. The user must consider the migration data
> > > + * segments carried over the FD to be opaque and non-fungible. During RESUMING,
> > > + * the data segments must be written in the same order they came out of the
> > > + * saving side FD.  
> > 
> > The lifecycle of this FD is a little sketchy.  The user is responsible
> > to close the FD, are they required to?  
> 
> No. Detecting this in the kernel would add notable complexity to the
> drivers.
> 
> Let's clarify it:
> 
>  "close this FD when it no longer has data to
>  read/write. data_fds are not re-used, every data transfer session gets
>  a new FD."
> 
> ?


Better


> > ie. should the migration driver fail transitions if there's an
> > outstanding FD?  
> 
> No, the driver should orphan that FD and use a fresh one the next
> cycle. mlx5 will sanitize the FD, free all the memory, and render it
> inoperable, which I'd view as best practice.

Agreed, can we add a second sentence to the above clarification to
outline those driver responsibilities?


> > Should the core code mangle the f_ops or force and EOF or in some
> > other way disconnect the FD to avoid driver bugs/exploits with users
> > poking stale FDs?    
> 
> We looked at swapping f_ops of a running fd for the iommufd project
> and decided it was not allowed/desired. It needs locking.
> 
> Here the driver should piggyback the forced EOF on its own existing
> locking that protects concurrent read/write, like mlx5 did. It is
> straightforward.

Right, sounded ugly but I thought I'd toss it out.  If we define it as
the driver's responsibility, I think I'm ok.

> > Should we be bumping a reference on the device FD such that we can't
> > have outstanding migration FDs with the device closed (and
> > re-assigned to a new user)?  
> 
> The driver must ensure any activity triggered by the migration FD
> against the vfio_device is halted before close_device() returns, just
> like basically everything else connected to open/close_device(). mlx5
> does this by using the same EOF sanitizing the FSM logic uses.
> 
> Once sanitized the f_ops should not be touching the vfio_device, or
> even have a pointer to it, so there is no reason to connect the two
> FDs together. I'd say it is a red flag if a driver proposes to do
> this; likely it means it has a problem with the open/close_device()
> lifetime model.

Maybe we just need a paragraph somewhere to describe the driver
responsibilities and expectations in managing the migration FD,
including disconnecting it after end of stream and access relative to
the open state of the vfio_device.  Seems an expanded description
somewhere near the declaration in vfio_device_ops would be appropriate.
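
Strawman wording (illustrative only) next to the op declaration:

/*
 * @migration_set_state: ...
 *	The data transfer FD returned by this op may outlive the state
 *	that created it.  The driver must render an orphaned FD
 *	inoperable (force EOF/error on read/write and free its buffers)
 *	rather than fail later transitions, and must halt any activity
 *	the FD triggers against the vfio_device before close_device()
 *	returns.
 */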

> > > + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL,
> > > + * and take no action. However the device_state will be updated with the current
> > > + * value.
> > > + *
> > > + * Return: 0 on success, -1 and errno set on failure.
> > > + */
> > > +struct vfio_device_mig_set_state {
> > > +	__u32 argsz;
> > > +	__u32 device_state;
> > > +	__s32 data_fd;
> > > +	__u32 flags;
> > > +};  
> > 
> > argsz and flags layout is inconsistent with all other vfio ioctls.  
> 
> OK
> 
> >   
> > > +
> > > +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21)  
> > 
> > Did you consider whether this could also be implemented as a
> > VFIO_DEVICE_FEATURE?  Seems the feature struct would just be
> > device_state and data_fd.  Perhaps there's a use case for GET as well.
> > Thanks,  
> 
> Only briefly..
> 
> I'm not sure what the overall VFIO vision is here.. Are we abandoning
> traditional ioctls in favour of a multiplexer? Calling the multiplexer
> ioctl "feature" is a bit odd..

Is it really?  VF Token support is a feature that a device might have
and we can use the same interface to probe that it exists as well as
set the UUID token.  We're using it to manipulate the state of a device
feature.

If we're only looking for a means to expose that a device has support
for something, our options are a flag bit on the vfio_device_info or a
capability on that ioctl.  It's arguable that the latter might be a
better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is
only to return a flags field, ie. we're not interacting with a feature,
we're exposing a capability with fixed properties.

However as we move to MIG_SET_STATE, well now we are interacting with a
feature of the device and there's really nothing unique about the
calling convention that would demand that we define a stand alone ioctl.
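
For illustration, the feature payload could then be as small as this
(a hypothetical layout, not something this series defines; argsz and
flags would be carried by the enclosing struct vfio_device_feature):

	struct vfio_device_feature_mig_state {
		__u32 device_state;	/* FSM state to set */
		__s32 data_fd;		/* returned data transfer FD */
	};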

> It complicates the user code a bit; invoking VFIO_DEVICE_FEATURE is
> more involved than a plain ioctl (check the qemu patch to see the
> difference).

Is it really any more than some wrapper code?  Are there objections to
this sort of multiplexer?  As I was working on the VF Token support, it
felt like a fairly small device feature and I didn't want to set a
precedent of cluttering our ioctl space with every niche little
feature.  The s390 folks have some proposals on list for using features
and I'm tempted to suggest it to Abhishek as well for their
implementation of D3cold support.
 
> Either way I don't have a strong opinion; please have a think and let
> us know which you'd like to follow.

I'm leaning towards a capability for migration support flags and a
feature for setting the state, but let me know if this looks like a bad
idea for some reason.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-01-30 16:08 ` [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
  2022-02-01 11:54   ` Cornelia Huck
@ 2022-02-01 18:31   ` Alex Williamson
  2022-02-01 18:53     ` Jason Gunthorpe
  1 sibling, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-02-01 18:31 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On Sun, 30 Jan 2022 18:08:20 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The RUNNING_P2P state is designed to support multiple devices in the same
> VM that are doing P2P transactions between themselves. When in RUNNING_P2P
> the device must be able to accept incoming P2P transactions but should not
> generate outgoing transactions.
> 
> As an optional extension to the mandatory states it is defined as
> inbetween STOP and RUNNING:
>    STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
> 
> For drivers that are unable to support RUNNING_P2P the core code silently
> merges RUNNING_P2P and RUNNING together. Drivers that support this will be
> required to implement 4 FSM arcs beyond the basic FSM. 2 of the basic FSM
> arcs become combination transitions.
> 
> Compared to the v1 clarification, NDMA is redefined into FSM states and is
> described in terms of the desired P2P quiescent behavior, noting that
> halting all DMA is an acceptable implementation.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/vfio.c       | 70 ++++++++++++++++++++++++++++++---------
>  include/linux/vfio.h      |  2 ++
>  include/uapi/linux/vfio.h | 34 +++++++++++++++++--
>  3 files changed, 88 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index b12be212d048..a722a1a8a48a 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1573,39 +1573,55 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			    enum vfio_device_mig_state cur_fsm,
>  			    enum vfio_device_mig_state new_fsm)
>  {
> -	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
> +	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
>  	/*
> -	 * The coding in this table requires the driver to implement 6
> +	 * The coding in this table requires the driver to implement
>  	 * FSM arcs:
>  	 *         RESUMING -> STOP
> -	 *         RUNNING -> STOP
>  	 *         STOP -> RESUMING
> -	 *         STOP -> RUNNING
>  	 *         STOP -> STOP_COPY
>  	 *         STOP_COPY -> STOP
>  	 *
> -	 * The coding will step through multiple states for these combination
> -	 * transitions:
> -	 *         RESUMING -> STOP -> RUNNING
> +	 * If P2P is supported then the driver must also implement these FSM
> +	 * arcs:
> +	 *         RUNNING -> RUNNING_P2P
> +	 *         RUNNING_P2P -> RUNNING
> +	 *         RUNNING_P2P -> STOP
> +	 *         STOP -> RUNNING_P2P
> +	 * Without P2P the driver must implement:
> +	 *         RUNNING -> STOP
> +	 *         STOP -> RUNNING
> +	 *
> +	 * If all optional features are supported then the coding will step
> +	 * through multiple states for these combination transitions:
> +	 *         RESUMING -> STOP -> RUNNING_P2P
> +	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
>  	 *         RESUMING -> STOP -> STOP_COPY
> -	 *         RUNNING -> STOP -> RESUMING
> -	 *         RUNNING -> STOP -> STOP_COPY
> +	 *         RUNNING -> RUNNING_P2P -> STOP
> +	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
> +	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
> +	 *         RUNNING_P2P -> STOP -> RESUMING
> +	 *         RUNNING_P2P -> STOP -> STOP_COPY
> +	 *         STOP -> RUNNING_P2P -> RUNNING
>  	 *         STOP_COPY -> STOP -> RESUMING
> -	 *         STOP_COPY -> STOP -> RUNNING
> +	 *         STOP_COPY -> STOP -> RUNNING_P2P
> +	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
>  	 */
>  	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
>  		[VFIO_DEVICE_STATE_STOP] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> -			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_RUNNING] = {
> -			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> -			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> -			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_STOP_COPY] = {
> @@ -1613,6 +1629,7 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_RESUMING] = {
> @@ -1620,6 +1637,15 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_ERROR] = {
> @@ -1627,14 +1653,26 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  	};
> +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> +
>  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
>  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
>  		return VFIO_DEVICE_STATE_ERROR;
>  
> -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> +		return VFIO_DEVICE_STATE_ERROR;

new_fsm is provided by the user, we pass set_state.device_state
directly to .migration_set_state.  We should do bounds checking and
compatibility testing on the end state in the core so that we can
return an appropriate -EINVAL and -ENOSUPP respectively, otherwise
we're giving userspace a path to put the device into ERROR state, which
we claim is not allowed.
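
Ie something like this early in vfio_ioctl_mig_set_state() (rough
sketch, exact errnos debatable):

	/* hypothetical core-side validation before calling the driver */
	if (set_state.device_state >= VFIO_DEVICE_NUM_STATES)
		return -EINVAL;
	if (set_state.device_state == VFIO_DEVICE_STATE_RUNNING_P2P &&
	    !(device->migration_flags & VFIO_MIGRATION_P2P))
		return -EOPNOTSUPP;	/* or whichever "unsupported" errno */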

Testing cur_fsm is more an internal consistency check, maybe those
should be WARN_ON.

> +
> +	cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> +	if (!have_p2p) {
> +		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P)
> +			cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> +	}

Perhaps this could be generalized with something like:

	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_MIGRATION_P2P,
		[VFIO_DEVICE_STATE_ERROR] = ~0U,
	};

	while (!(state_flags_table[cur_fsm] & device->migration_flags))
		cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];

Thanks,
Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-01 17:04       ` Alex Williamson
@ 2022-02-01 18:36         ` Jason Gunthorpe
  2022-02-01 21:49           ` Alex Williamson
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 18:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:

> Ok, let me parrot back to see if I understand.  -ENOTTY will be
> returned if the ioctl doesn't exist, in which case device_state is
> untouched and cannot be trusted.  At the same time, we expect the user
> to use the feature ioctl to make sure the ioctl exists, so it would
> seem that we've reclaimed that errno if we believe the user should
> follow the protocol.

I don't follow - the documentation says what the code does, if you get
ENOTTY returned then you don't get the device_state too. Saying the
user shouldn't have called it in the first place is completely
correct, but doesn't change the device_state output.

> +       if (!device->ops->migration_set_state)
> +               return -EOPNOTSUPP;
> 
> Should return -ENOTTY, just as the feature does.  

As far as I know the kernel 'standard' is:
 - ENOTTY if the ioctl cmd # itself is not understood
 - E2BIG if the ioctl arg is longer than the kernel understands
 - EOPNOTSUPP if the ioctl arg contains data the kernel doesn't
   understand (eg flags the kernel doesn't know about), or the
   kernel understands the request but cannot support it for some
   reason.
 - EINVAL if the ioctl arg contains data the kernel knows about but
   rejects (ie invalid combinations of flags)
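
In a hypothetical handler (invented names, just to illustrate the
convention) that maps to:

	static long my_ioctl(struct file *filp, unsigned int cmd,
			     unsigned long arg)
	{
		struct my_args args;

		if (cmd != MY_IOCTL_CMD)
			return -ENOTTY;		/* ioctl # not understood */
		if (copy_from_user(&args, (void __user *)arg, sizeof(args)))
			return -EFAULT;
		if (args.argsz > sizeof(args))
			return -E2BIG;		/* arg longer than we know */
		if (args.flags & ~MY_KNOWN_FLAGS)
			return -EOPNOTSUPP;	/* data we don't understand */
		if ((args.flags & MY_FLAG_A) && (args.flags & MY_FLAG_B))
			return -EINVAL;		/* known but invalid combo */
		return 0;
	}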

VFIO kind of has its own thing, but I'm not entirely sure what the
rules are, eg you asked for EOPNOTSUPP in the other patch, and here we
are asking for ENOTTY?

But sure, lets make it ENOTTY.

> But it's also for future unsupported ops, but couldn't we also
> specify that the driver must fill final_state with the current
> device state for any such case.  We also have this:
> 
> +       if (set_state.argsz < minsz || set_state.flags)
> +               return -EOPNOTSUPP;
> 
> Which I think should be -EINVAL.

That would match the majority of other VFIO tests.

> That leaves -EFAULT, for example:
> 
> +       if (copy_from_user(&set_state, arg, minsz))
> +               return -EFAULT;
> 
> Should we be able to know the current device state in core code such
> that we can fill in device state here?

There is no point in doing a copy_to_user() to the same memory if a
copy_from_user() failed, so device_state will still not be returned.

We don't know the device_state in the core code because it can only be
read under locking that is controlled by the driver. I hope when we
get another driver merged that we can hoist the locking, but right now
I'm not really sure - it is a complicated lock.

> I think those changes would go a ways towards fully specified behavior
> instead of these wishy washy unreliable return values.  Then we could

Huh? It is fully specified already. These changes just removed
EOPNOTSUPP from the list where device_state isn't filled in. It is OK,
but it is not really different...

>  "If this function fails and returns -1 then..."
> 
> Could we clarify that to s/function/ioctl/?  It caused me a moment of
> confusion for the returned -errnos.

Sure.

> > > Should we be bumping a reference on the device FD such that we can't
> > > have outstanding migration FDs with the device closed (and
> > > re-assigned to a new user)?  
> > 
> > The driver must ensure any activity triggered by the migration FD
> > against the vfio_device is halted before close_device() returns, just
> > like basically everything else connected to open/close_device(). mlx5
> > does this by using the same EOF sanitizing the FSM logic uses.
> > 
> > Once sanitized the f_ops should not be touching the vfio_device, or
> > even have a pointer to it, so there is no reason to connect the two
> > FDs together. I'd say it is a red flag if a driver proposes to do
> > this, likely it means it has a problem with the open/close_device()
> > lifetime model.
> 
> Maybe we just need a paragraph somewhere to describe the driver
> responsibilities and expectations in managing the migration FD,
> including disconnecting it after end of stream and access relative to
> the open state of the vfio_device.  Seems an expanded descriptions
> somewhere near the declaration in vfio_device_ops would be appropriate.

Yes that is probably better than in the uapi header.

> > I'm not sure what the overall VFIO vision is here.. Are we abandoning
> > traditional ioctls in favour of a multiplexer? Calling the multiplexer
> > ioctl "feature" is a bit odd..
> 
> Is it really?  VF Token support is a feature that a device might have
> and we can use the same interface to probe that it exists as well as
> set the UUID token.  We're using it to manipulate the state of a device
> feature.
> 
> If we're only looking for a means to expose that a device has support
> for something, our options are a flag bit on the vfio_device_info or a
> capability on that ioctl.  It's arguable that the latter might be a
> better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is
> only to return a flags field, ie. we're not interacting with a feature,
> we're exposing a capability with fixed properties.

I looked at this, and decided against it on practical reasons.

I've organized this so the core code can do more work for the driver,
which means the core code supplies the support info back to
userspace. VFIO_DEVICE_INFO is currently open coded in every single
driver and lifting that to get the same support looks like a huge
pain. Even if we try to work it backwards somehow, we'd need to
re-organize vfio-pci so other drivers can contribute to the cap chain -
which is another ugly looking thing.

On top of that, qemu becomes much less straightforward as we have to
piggy back on the existing vfio code instead of just doing a simple
ioctl to get the small support info back. There is even an
unpleasing mandatory user/kernel memory allocation and double ioctl in
the caps path.

The feature approach is much better, it has a much cleaner
implementation in user/kernel. I think we should focus on it going
forward and freeze caps.

> > It complicates the user code a bit, it is more complicated to invoke the
> > VFIO_DEVICE_FEATURE (check the qemu patch to see the difference).
> 
> Is it really any more than some wrapper code?  Are there objections to
> this sort of multiplexer?

There isn't too much reason to do this kind of stuff. Each subsystem
gets something like 4 million ioctl numbers within its type, we will
never run out of unique ioctls.

Normal ioctls have a nice simplicity to them, adding layers creates
complexity, feature is definitely more complex to implement, and cap
is a whole other level of more complex. None of this is necessary.

I don't know what "cluttering" means here; I'd prefer we focus on
things that give clean code and simple implementations rather than
arbitrary aesthetics.

> > Either way I don't have a strong opinion, please have a think and let
> > us know which you'd like to follow.
> 
> I'm leaning towards a capability for migration support flags and a
> feature for setting the state, but let me know if this looks like a bad
> idea for some reason.  Thanks,

I don't want to touch capabilities, but we can try to use feature for
set state. Please confirm this is what you want.

You'll want the same for the PRE_COPY related information too?

If we are into these very minor nitpicks does this mean you are OK
with all the big topics now?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-01 18:31   ` Alex Williamson
@ 2022-02-01 18:53     ` Jason Gunthorpe
  2022-02-01 19:13       ` Alex Williamson
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 18:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:
> > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > +
> >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> >  		return VFIO_DEVICE_STATE_ERROR;
> >  
> > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > +		return VFIO_DEVICE_STATE_ERROR;
> 
> new_fsm is provided by the user, we pass set_state.device_state
> directly to .migration_set_state.  We should do bounds checking and
> compatibility testing on the end state in the core so that we can

This is the core :)

> return an appropriate -EINVAL and -ENOSUPP respectively, otherwise
> we're giving userspace a path to put the device into ERROR state, which
> we claim is not allowed.

Userspace can never put the device into error. As the function comment
says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
permitted. The driver is required to reflect that back as an errno
like mlx5 shows:

+		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
+						     new_state);
+		if (next_state == VFIO_DEVICE_STATE_ERROR) {
+			res = ERR_PTR(-EINVAL);
+			break;
+		}

We never get the driver into error, userspaces gets an EINVAL and no
change to the device state.

It is organized this way because the driver controls the locking for
its current state and thus the core code caller along the ioctl path
cannot validate the arc before passing it to the driver. The code is
shared by having the driver callback to the core to validate the
entire fsm arc under its lock.

The driver ends up with a small while loop that will probably be copied
and pasted to each driver. As I said, I'm interested to lift this up
as well but I need to better understand the locking needs of the other
driver implementations first, or we need your patch series to use the
inode for zap to land to eliminate the complicated locking in the
first place..
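
Roughly, that loop looks like this (sketch following the mlx5 pattern
above; the step helper is mlx5's, each driver would supply its own):

	/* run under the driver's state lock */
	struct file *res = NULL;
	u32 next_state;

	while (mvdev->mig_state != new_state) {
		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
						     new_state);
		if (next_state == VFIO_DEVICE_STATE_ERROR) {
			res = ERR_PTR(-EINVAL);
			break;
		}
		res = mlx5vf_pci_step_device_state_locked(mvdev, next_state);
		if (IS_ERR(res))
			break;
		mvdev->mig_state = next_state;
	}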

> Testing cur_fsm is more an internal consistency check, maybe those
> should be WARN_ON.

Sure
 
> > +
> > +	cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	if (!have_p2p) {
> > +		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P)
> > +			cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	}
> 
> Perhaps this could be generalized with something like:

Oh, that table could probably do both tests, if the bit isn't set it
is an invalid cur/next_fsm as well..

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-01 18:53     ` Jason Gunthorpe
@ 2022-02-01 19:13       ` Alex Williamson
  2022-02-01 19:50         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-02-01 19:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, 1 Feb 2022 14:53:21 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:
> > > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > > +
> > >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> > >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> > >  		return VFIO_DEVICE_STATE_ERROR;
> > >  
> > > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > > +		return VFIO_DEVICE_STATE_ERROR;  
> > 
> > new_fsm is provided by the user, we pass set_state.device_state
> > directly to .migration_set_state.  We should do bounds checking and
> > compatibility testing on the end state in the core so that we can  
> 
> This is the core :)

But this is the wrong place, we need to do it earlier rather than when
we're already iterating next states.  I only mention core to avoid that
I'm suggesting a per driver responsibility.

> 
> > return an appropriate -EINVAL and -ENOSUPP respectively, otherwise
> > we're giving userspace a path to put the device into ERROR state, which
> > we claim is not allowed.  
> 
> Userspace can never put the device into error. As the function comment
> says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
> permitted. The driver is required to reflect that back as an errno
> like mlx5 shows:
> 
> +		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
> +						     new_state);
> +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> +			res = ERR_PTR(-EINVAL);
> +			break;
> +		}
> 
> We never get the driver into error, userspaces gets an EINVAL and no
> change to the device state.

Hmm, subtle.  I'd argue that if we do a bounds and support check of the
end state in vfio_ioctl_mig_set_state() before calling
.migration_set_state() then we could remove ERROR from
vfio_from_fsm_table[] altogether and simply begin
vfio_mig_get_next_state() with:

	if (cur_fsm == ERROR)
		return ERROR;

Then we only get to ERROR by the driver placing us in ERROR and things
feel a bit more sane to me.

> It is organized this way because the driver controls the locking for
> its current state and thus the core code caller along the ioctl path
> cannot validate the arc before passing it to the driver. The code is
> shared by having the driver callback to the core to validate the
> entire fsm arc under its lock.

P2P is defined in a way that if the endpoint is valid then the arc is
valid.  We skip intermediate unsupported states.  We need to do that
for compatibility.  So why do we care about driver locking to do that?
Thanks,

Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-01 19:13       ` Alex Williamson
@ 2022-02-01 19:50         ` Jason Gunthorpe
  2022-02-02 23:54           ` Alex Williamson
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-01 19:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 12:13:22PM -0700, Alex Williamson wrote:
> On Tue, 1 Feb 2022 14:53:21 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:
> > > > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > > > +
> > > >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> > > >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> > > >  		return VFIO_DEVICE_STATE_ERROR;
> > > >  
> > > > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > > > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > > > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > > > +		return VFIO_DEVICE_STATE_ERROR;  
> > > 
> > > new_fsm is provided by the user, we pass set_state.device_state
> > > directly to .migration_set_state.  We should do bounds checking and
> > > compatibility testing on the end state in the core so that we can  
> > 
> > This is the core :)
> 
> But this is the wrong place, we need to do it earlier rather than when
> we're already iterating next states.  I only mention core to avoid that
> I'm suggesting a per driver responsibility.

Only the first vfio_mig_get_next_state() can return ERROR; once it
succeeds, the subsequent ones must also succeed.

This is the earliest it can be. It is done directly after taking the
lock that allows us to read the current state, at which point we call
this function to determine whether the requested transition is
acceptable.

> > Userspace can never put the device into error. As the function comment
> > says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
> > permitted. The driver is required to reflect that back as an errno
> > like mlx5 shows:
> > 
> > +		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
> > +						     new_state);
> > +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> > +			res = ERR_PTR(-EINVAL);
> > +			break;
> > +		}
> > 
> > We never get the driver into error, userspaces gets an EINVAL and no
> > change to the device state.
> 
> Hmm, subtle.  I'd argue that if we do a bounds and support check of the
> end state in vfio_ioctl_mig_set_state() before calling
> .migration_set_state() then we could remove ERROR from
> vfio_from_fsm_table[] altogether and simply begin
> vfio_mig_get_next_state() with:

Then we can't reject blocked arcs like STOP_COPY -> PRE_COPY.

It is set up this way to allow the core code to assert all policy, not
just a simple validation of the next_fsm.

> Then we only get to ERROR by the driver placing us in ERROR and things
> feel a bit more sane to me.

This is already true.

Perhaps it is confusing using ERROR to indicate that
vfio_mig_get_next_state() failed. Would you be happier with a -errno
return?

> > It is organized this way because the driver controls the locking for
> > its current state and thus the core code caller along the ioctl path
> > cannot validate the arc before passing it to the driver. The code is
> > shared by having the driver callback to the core to validate the
> > entire fsm arc under its lock.
> 
> P2P is defined in a way that if the endpoint is valid then the arc is
> valid.  We skip intermediate unsupported states.  We need to do that
> for compatibility.  So why do we care about driver locking to do
> that?

Without the driver locking we can't identify the arc because we don't
know the current state the driver is in. We only know the target
state.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-01 18:36         ` Jason Gunthorpe
@ 2022-02-01 21:49           ` Alex Williamson
  2022-02-02  0:24             ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-02-01 21:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, 1 Feb 2022 14:36:20 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:
> 
> > Ok, let me parrot back to see if I understand.  -ENOTTY will be
> > returned if the ioctl doesn't exist, in which case device_state is
> > untouched and cannot be trusted.  At the same time, we expect the user
> > to use the feature ioctl to make sure the ioctl exists, so it would
> > seem that we've reclaimed that errno if we believe the user should
> > follow the protocol.  
> 
> I don't follow - the documentation says what the code does, if you get
> ENOTTY returned then you don't get the device_state too. Saying the
> user shouldn't have called it in the first place is completely
> correct, but doesn't change the device_state output.

The documentation says "...the device state output is not reliable", and
I have to question whether this qualifies as a well specified,
interoperable spec with such language.  We're essentially asking users
to keep track that certain errnos result in certain fields of the
structure _maybe_ being invalid.

> > +       if (!device->ops->migration_set_state)
> > +               return -EOPNOTSUPP;
> > 
> > Should return -ENOTTY, just as the feature does.    
> 
> As far as I know the kernel 'standard' is:
>  - ENOTTY if the ioctl cmd # itself is not understood
>  - E2BIG if the ioctl arg is longer than the kernel understands
>  - EOPNOTSUPP if the ioctl arg contains data the kernel doesn't
>    understand (eg flags the kernel doesn't know about), or the
>    kernel understands the request but cannot support it for some
>    reason.
>  - EINVAL if the ioctl arg contains data the kernel knows about but
>    rejects (ie invalid combinations of flags)
> 
> VFIO kind of has its own thing, but I'm not entirely sure what the
> rules are, eg you asked for EOPNOTSUPP in the other patch, and here we
> are asking for ENOTTY?
> 
> But sure, lets make it ENOTTY.

I'd move your first example of EOPNOTSUPP to EINVAL.  To me, the user
providing bits/fields/values that are undefined is an invalid argument.
I've typically steered away from the extended errnos in favor of things
in the base set, so as you noted, there are currently no instances of
EOPNOTSUPP in vfio.  The case we discussed of a user trying to do
SET/GET on a feature that only supports GET/SET could go either way;
it's an invalid argument for the feature, and in this case the user can
determine the supported arguments via the PROBE interface.  But when I
start seeing multiple tests that all result in an EINVAL return, then I
wonder if a different errno might help user debugging.  EINVAL is
acceptable in the case I noted, but maybe another errno could be more
descriptive.

In the immediate example here, userspace really has no reason to see a
difference in the ioctl between lack of kernel support for migration
altogether and lack of device support for migration.  So I'd fall back
to the ioctl is not known "for this device", -ENOTTY.

Now you're making me wonder how much I care to invest in semantic
arguments over extended errnos :-\
 
> > But it's also for future unsupported ops, but couldn't we also
> > specify that the driver must fill final_state with the current
> > device state for any such case.  We also have this:
> > 
> > +       if (set_state.argsz < minsz || set_state.flags)
> > +               return -EOPNOTSUPP;
> > 
> > Which I think should be -EINVAL.  
> 
> That would match the majority of other VFIO tests.
> 
> > That leaves -EFAULT, for example:
> > 
> > +       if (copy_from_user(&set_state, arg, minsz))
> > +               return -EFAULT;
> > 
> > Should we be able to know the current device state in core code such
> > that we can fill in device state here?  
> 
> There is no point in doing a copy_to_user() to the same memory if a
> copy_from_user() failed, so device_state will still not be returned.

Duh, good point.
 
> We don't know the device_state in the core code because it can only be
> read under locking that is controlled by the driver. I hope when we
> get another driver merged that we can hoist the locking, but right now
> I'm not really sure - it is a complicated lock.

The device cannot self transition to a new state, so if the core were
to serialize this ioctl then the device_state provided by the driver is
valid, regardless of its internal locking.

Whether this ioctl should be serialized anyway is probably another good
topic to breach.  Should a user be able to have concurrent ioctls
setting conflicting states?

> > I think those changes would go a ways towards fully specified behavior
> > instead of these wishy washy unreliable return values.  Then we could  
> 
> Huh? It is fully specified already. These changes just removed
> EOPNOTSUPP from the list where device_state isn't filled in. It is OK,
> but it is not really different...

Hmm, "output is not reliable" is fully specified?  We can't really make
use of return flags to identify valid fields either since the copy-out
might fault.  I'd suggest that ioctl return structure is only valid at
all on success and we add a GET interface to return the current device
state on errno given the argument above that driver locking is
irrelevant because the device cannot self transition.

> >  "If this function fails and returns -1 then..."
> > 
> > Could we clarify that to s/function/ioctl/?  It caused me a moment of
> > confusion for the returned -errnos.  
> 
> Sure.
> 
> > > > Should we be bumping a reference on the device FD such that we can't
> > > > have outstanding migration FDs with the device closed (and
> > > > re-assigned to a new user)?    
> > > 
> > > The driver must ensure any activity triggered by the migration FD
> > > against the vfio_device is halted before close_device() returns, just
> > > like basically everything else connected to open/close_device(). mlx5
> > > does this by using the same EOF sanitizing the FSM logic uses.
> > > 
> > > Once sanitized the f_ops should not be touching the vfio_device, or
> > > even have a pointer to it, so there is no reason to connect the two
> > > FDs together. I'd say it is a red flag if a driver proposes to do
> > > this, likely it means it has a problem with the open/close_device()
> > > lifetime model.  
> > 
> > Maybe we just need a paragraph somewhere to describe the driver
> > responsibilities and expectations in managing the migration FD,
> > including disconnecting it after end of stream and access relative to
> > the open state of the vfio_device.  Seems an expanded descriptions
> > somewhere near the declaration in vfio_device_ops would be appropriate.  
> 
> Yes that is probably better than in the uapi header.
> 
> > > I'm not sure what the overall VFIO vision is here.. Are we abandoning
> > > traditional ioctls in favour of a multiplexer? Calling the multiplexer
> > > ioctl "feature" is a bit odd..  
> > 
> > Is it really?  VF Token support is a feature that a device might have
> > and we can use the same interface to probe that it exists as well as
> > set the UUID token.  We're using it to manipulate the state of a device
> > feature.
> > 
> > If we're only looking for a means to expose that a device has support
> > for something, our options are a flag bit on the vfio_device_info or a
> > capability on that ioctl.  It's arguable that the latter might be a
> > better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is
> > only to return a flags field, ie. we're not interacting with a feature,
> > we're exposing a capability with fixed properties.  
> 
> I looked at this, and decided against it on practical reasons.
> 
> I've organized this so the core code can do more work for the driver,
> which means the core code supplies the support info back to
> userspace. VFIO_DEVICE_INFO is currently open coded in every single
> driver and lifting that to get the same support looks like a huge
> pain. Even if we try to work it backwards somehow, we'd need to
> re-organize vfio-pci so other drivers can contribute to the cap chain -
> which is another ugly looking thing.
> 
> On top of that, qemu becomes much less straightforward as we have to
> piggy back on the existing vfio code instead of just doing a simple
> ioctl to get the small support info back. There is even an
> unpleasing mandatory user/kernel memory allocation and double ioctl in
> the caps path.
> 
> The feature approach is much better, it has a much cleaner
> implementation in user/kernel. I think we should focus on it going
> forward and freeze caps.

Ok, I'm not demanding a capability interface.
 
> > > It complicates the user code a bit, it is more complicated to invoke the
> > > VFIO_DEVICE_FEATURE (check the qemu patch to see the difference).  
> > 
> > Is it really any more than some wrapper code?  Are there objections to
> > this sort of multiplexer?  
> 
> There isn't too much reason to do this kind of stuff. Each subsystem
> gets something like 4 million ioctl numbers within its type, we will
> never run out of unique ioctls.
> 
> Normal ioctls have a nice simplicity to them, adding layers creates
> complexity, feature is definitely more complex to implement, and cap
> is a whole other level of more complex. None of this is necessary.
> 
> I don't know what "cluttering" means here; I'd prefer we focus on
> things that give clean code and simple implementations rather than
> arbitrary aesthetics.

It's entirely possible that I'm overly averse to ioctl proliferation,
but for every new ioctl we need to take a critical look at the proposed
API, use case, applicability, and extensibility.  That isn't entirely
removed when we use something like this generic feature ioctl, but I
consider it substantially reduced since we're working within an
existing framework.  A direct ioctl might be able to slightly
streamline the interface (I don't think that significantly matters in
this case), but on the other hand, defining this as a feature within an
existing interface provides consistency and compartmentalization.

> > > Either way I don't have a strong opinion, please have a think and let
> > > us know which you'd like to follow.  
> > 
> > I'm leaning towards a capability for migration support flags and a
> > feature for setting the state, but let me know if this looks like a bad
> > idea for some reason.  Thanks,  
> 
> I don't want to touch capabilities, but we can try to use feature for
> set state. Please confirm this is what you want.

It's a team sport, but to me it seems like it fits well in my
mental model of interacting with a device feature, without
significantly altering the uAPI you're defining anyway.
 
> You'll want the same for the PRE_COPY related information too?

I hadn't gotten there yet.  It seems like a discontinuity to me that
we're handing out new FDs for data transfer sessions, but then we
require the user to come back to the device to query about the data its
reading through that other FD.  Should that be an ioctl on the data
stream FD itself?  Is there a use case for also having it on the
STOP_COPY FD?
 
> If we are into these very minor nitpicks does this mean you are OK
> with all the big topics now?

I'm not hating it, but I'd like to see buy-in from others who have a
vested interest in supporting migration.  I don't see Intel or Huawei
on the Cc list and the original collaborators of the v1 interface from
NVIDIA have been silent through this redesign.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 12:39       ` Cornelia Huck
  2022-02-01 12:54         ` Jason Gunthorpe
@ 2022-02-01 23:01         ` Alex Williamson
  2022-02-02  0:28           ` Jason Gunthorpe
  2022-02-02 11:38           ` Cornelia Huck
  1 sibling, 2 replies; 55+ messages in thread
From: Alex Williamson @ 2022-02-01 23:01 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Jason Gunthorpe, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, 01 Feb 2022 13:39:23 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 01, 2022 at 12:23:05PM +0100, Cornelia Huck wrote:  
> >> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> >>   
> >> > From: Jason Gunthorpe <jgg@nvidia.com>
> >> >
> >> > v1 was never implemented and is replaced by v2.
> >> >
> >> > The old uAPI definitions are removed from the header file. As per Linus's
> >> > past remarks we do not have a hard requirement to retain compilation
> >> > compatibility in uapi headers and qemu is already following Linus's
> >> > preferred model of copying the kernel headers.  
> >> 
> >> If we are all in agreement that we will replace v1 with v2 (and I think
> >> we are), we probably should remove the x-enable-migration stuff in QEMU
> >> sooner rather than later, to avoid leaving a trap for the next
> >> unsuspecting person trying to update the headers.  
> >
> > Once we have agreement on the kernel patch we plan to send a QEMU
> > patch making it support the v2 interface and the migration
> > non-experimental. We are also working on fixing the error paths, at
> > least within the limitations of the current qemu design.  
> 
> I'd argue that just ripping out the old interface first would be easier,
> as it does not require us to synchronize with a headers sync (and does
> not require to synchronize a headers sync with ripping it out...)
> 
> > The v1 support should remain in old releases as it is being used in
> > the field "experimentally".  
> 
> Of course; it would be hard to rip it out retroactively :)
> 
> But it should really be gone in QEMU 7.0.
> 
> Considering adding the v2 uapi, we might get unlucky: The Linux 5.18
> merge window will likely be in mid-late March (and we cannot run a
> headers sync before the patches hit Linus' tree), while QEMU 7.0 will
> likely enter freeze in mid-late March as well. So there's a non-zero
> chance that the new uapi will need to be deferred to 7.1.


Agreed that v1 migration TYPE/SUBTYPE should live in infamy as
reserved, but I'm not sure why we need to make the rest of it a big
complicated problem.  On one hand, leaving stubs for the necessary
structure and macros until QEMU gets updated doesn't seem so terrible.
Nor actually does letting the next QEMU header update cause build
breakages, which would probably frustrate the person submitting that
update, but it's not like QEMU hasn't done selective header updates in
the past.  The former is probably the more friendly approach if we
don't outrage someone in the kernel community in the meantime.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-01 21:49           ` Alex Williamson
@ 2022-02-02  0:24             ` Jason Gunthorpe
  2022-02-02 23:36               ` Alex Williamson
  2022-02-03 15:51               ` Tarun Gupta (SW-GPU)
  0 siblings, 2 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-02  0:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote:
> On Tue, 1 Feb 2022 14:36:20 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:
> > 
> > > Ok, let me parrot back to see if I understand.  -ENOTTY will be
> > > returned if the ioctl doesn't exist, in which case device_state is
> > > untouched and cannot be trusted.  At the same time, we expect the user
> > > to use the feature ioctl to make sure the ioctl exists, so it would
> > > seem that we've reclaimed that errno if we believe the user should
> > > follow the protocol.  
> > 
> > I don't follow - the documentation says what the code does, if you get
> > ENOTTY returned then you don't get the device_state too. Saying the
> > user shouldn't have called it in the first place is completely
> > correct, but doesn't change the device_state output.
> 
> The documentation says "...the device state output is not reliable", and
> I have to question whether this qualifies as a well specified,
> interoperable spec with such language.  We're essentially asking users
> to keep track that certain errnos result in certain fields of the
> structure _maybe_ being invalid.

So you are asking to remove "is not reliable" and just phrase it as:

"device_state is updated to the current value when -1 is returned,
except when these XXX errnos are returned"?

(actually userspace can tell directly without checking the errno - as
if -1 is returned the device_state cannot be the requested target
state anyhow)

> Now you're making me wonder how much I care to invest in semantic
> arguments over extended errnos :-\

Well, I know I don't :) We don't have consistency in the kernel and
userspace is hard pressed to make any sense of it most of the time,
IMHO. It just doesn't practically matter..

> > We don't know the device_state in the core code because it can only be
> > read under locking that is controlled by the driver. I hope when we
> > get another driver merged that we can hoist the locking, but right now
> > I'm not really sure - it is a complicated lock.
> 
> The device cannot self transition to a new state, so if the core were
> to serialize this ioctl then the device_state provided by the driver is
> valid, regardless of its internal locking.

It is allowed to transition to RUNNING due to reset events it captures,
and since we capture the reset through the PCI hook, not from VFIO,
the core code doesn't synchronize well. See patch 14.

> Whether this ioctl should be serialized anyway is probably another good
> topic to breach.  Should a user be able to have concurrent ioctls
> setting conflicting states?

The driver is required to serialize, the core code doesn't touch any
global state and doesn't need serializing.

> I'd suggest that ioctl return structure is only valid at all on
> success and we add a GET interface to return the current device

We can do this too, but it is a bunch of code to achieve this and I
don't have any use case to read back the device_state beyond debugging
and debugging is fine with this. IMHO

> It's entirely possible that I'm overly averse to ioctl proliferation,
> but for every new ioctl we need to take a critical look at the proposed
> API, use case, applicability, and extensibility.  

This is all basically the same no matter where it is put; the feature
multiplexer is just an ioctl in some semi-standard format, but the
vfio pattern of argsz/flags is also a standard format that is
basically the same thing.

We still need to think about extensibility, alignment, etc..

The problem I usually see with ioctls is not proliferation, but ending
up with too many choices and a big ?? when it comes to adding
something new.

Clear rules where things should go and why is the best, it matters
less what the rules actually are IMHO.

> > I don't want to touch capabilities, but we can try to use feature for
> > set state. Please confirm this is what you want.
> 
> It's a team sport, but to me it seems like it fits well in my
> mental model of interacting with a device feature, without
> significantly altering the uAPI you're defining anyway.  

Well, my advice is that ioctls are fine, and a bit easier all around.
eg strace and syzkaller are a bit easier if everything neatly maps
into one struct per ioctl - their generator tools are optimized for
this common case.

Simple multiplexors are next-best-fine, but there should be a clear
idea when to use the multiplexer, or not.

Things like the cap chains enter a whole world of adventure for
strace/syzkaller :)

> > You'll want the same for the PRE_COPY related information too?
> 
> I hadn't gotten there yet.  It seems like a discontinuity to me that
> we're handing out new FDs for data transfer sessions, but then we
> require the user to come back to the device to query about the data its
> reading through that other FD.  

An earlier draft of this put it on the data FD, but v6 made it fully
optional with no functional impact on the data FD. The values decrease
as the data FD progresses and increase as the VM dirties data - ie it
is 50/50 data_fd/device behavior.

It doesn't matter which way, but it feels quite weird to have the main
state function be a FEATURE and the precopy query be an ioctl.

> Should that be an ioctl on the data stream FD itself?  

It can be. Implementation-wise it is about a wash.

> Is there a use case for also having it on the STOP_COPY FD?

I didn't think of one worthwhile enough to mandate implementing it in
every driver.

> > If we are into these very minor nitpicks does this mean you are OK
> > with all the big topics now?
> 
> I'm not hating it, but I'd like to see buy-in from others who have a
> vested interest in supporting migration.  I don't see Intel or Huawei
> on the Cc list and the original collaborators of the v1 interface
> from

That is an oversight, I'll ping them. I think people have been staying
away until the dust settles.

> NVIDIA have been silent through this redesign.

We've reviewed this internally with them. They reserve judgement on
the data transfer performance until they work on it, but functionally
it has all the necessary semantics.

They have the same P2P issue mlx5 does, and are happy with the
solution under the same general provisions as already discussed for
the Huawei device - RUNNING_P2P is sustainable only while the device
is not touched - ie the VCPU is halted.

The f_ops implementation we used for mlx5 is basic; the full
performance version would want to use the read/write_iter() fops with
async completions to support the modern zero-copy io_uring based data
motion in userspace. This is all part of the standard FD abstraction
and why it is appealing to use it.
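
For mlx5 that could eventually look something like (hypothetical
shape only, invented names):

	/* hypothetical full-performance data FD f_ops */
	static const struct file_operations my_migf_fops = {
		.owner = THIS_MODULE,
		.read_iter = my_migf_read_iter,	  /* async, iov based */
		.write_iter = my_migf_write_iter, /* async, iov based */
		.poll = my_migf_poll,
		.release = my_migf_release,
	};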

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 23:01         ` Alex Williamson
@ 2022-02-02  0:28           ` Jason Gunthorpe
  2022-02-02 11:38           ` Cornelia Huck
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-02  0:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01, 2022 at 04:01:06PM -0700, Alex Williamson wrote:

> Agreed that v1 migration TYPE/SUBTYPE should live in infamy as
> reserved, but I'm not sure why we need to make the rest of it a big
> complicated problem.  On one hand, leaving stubs for the necessary
> structure and macros until QEMU gets updated doesn't seem so terrible.
> Nor actually does letting the next QEMU header update cause build
> breakages, which would probably frustrate the person submitting that
> update, but it's not like QEMU hasn't done selective header updates in
> the past.  The former is probably the more friendly approach if we
> don't outrage someone in the kernel community in the meantime.

So let's drop the removal patch and keep the V1 rename; it is easy for
qemu to follow along with this.

Sometime later we can purge all the dead things from the header, eg
the POWERNV stuff we left behind last year as well.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 14:29                 ` Jason Gunthorpe
@ 2022-02-02 11:34                   ` Cornelia Huck
  2022-02-02 12:22                     ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Cornelia Huck @ 2022-02-02 11:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 03:19:18PM +0100, Cornelia Huck wrote:
>> (- also continue to get the documentation into good shape)
>
> Which items do you see here?

Well, it still needs to be updated, no?

>
>> - have an RFC for QEMU that contains a provisional update of the
>>   relevant vfio headers so that we can discuss the QEMU side (and maybe
>>   shoot down any potential problems in the uapi before they are merged
>>   in the kernel)
>
> This qemu patch is linked in the cover letter.

The QEMU changes need to be discussed on qemu-devel, a link to a git
tree with work in progress only goes so far.

(From my quick look there, this needs to have any headers changes split
out into a separate patch. The changes in migration.c are hard to
review; is there any chance to split the error path cleanups from the
interface changes?)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-01 23:01         ` Alex Williamson
  2022-02-02  0:28           ` Jason Gunthorpe
@ 2022-02-02 11:38           ` Cornelia Huck
  1 sibling, 0 replies; 55+ messages in thread
From: Cornelia Huck @ 2022-02-02 11:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Feb 01 2022, Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 01 Feb 2022 13:39:23 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
>
>> On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> 
>> > On Tue, Feb 01, 2022 at 12:23:05PM +0100, Cornelia Huck wrote:  
>> >> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
>> >>   
>> >> > From: Jason Gunthorpe <jgg@nvidia.com>
>> >> >
>> >> > v1 was never implemented and is replaced by v2.
>> >> >
>> >> > The old uAPI definitions are removed from the header file. As per Linus's
>> >> > past remarks we do not have a hard requirement to retain compilation
>> >> > compatibility in uapi headers and qemu is already following Linus's
>> >> > preferred model of copying the kernel headers.  
>> >> 
>> >> If we are all in agreement that we will replace v1 with v2 (and I think
>> >> we are), we probably should remove the x-enable-migration stuff in QEMU
>> >> sooner rather than later, to avoid leaving a trap for the next
>> >> unsuspecting person trying to update the headers.  
>> >
>> > Once we have agreement on the kernel patch we plan to send a QEMU
>> > patch making it support the v2 interface and the migration
>> > non-experimental. We are also working on fixing the error paths, at
>> > least within the limitations of the current qemu design.  
>> 
>> I'd argue that just ripping out the old interface first would be easier,
>> as it does not require us to synchronize with a headers sync (and does
>> not require to synchronize a headers sync with ripping it out...)
>> 
>> > The v1 support should remain in old releases as it is being used in
>> > the field "experimentally".  
>> 
>> Of course; it would be hard to rip it out retroactively :)
>> 
>> But it should really be gone in QEMU 7.0.
>> 
>> Considering adding the v2 uapi, we might get unlucky: The Linux 5.18
>> merge window will likely be in mid-late March (and we cannot run a
>> headers sync before the patches hit Linus' tree), while QEMU 7.0 will
>> likely enter freeze in mid-late March as well. So there's a non-zero
>> chance that the new uapi will need to be deferred to 7.1.
>
>
> Agreed that v1 migration TYPE/SUBTYPE should live in infamy as
> reserved, but I'm not sure why we need to make the rest of it a big
> complicated problem.  On one hand, leaving stubs for the necessary
> structure and macros until QEMU gets updated doesn't seem so terrible.
> Nor actually does letting the next QEMU header update cause build
> breakages, which would probably frustrate the person submitting that
> update, but it's not like QEMU hasn't done selective header updates in
> the past.  The former is probably the more friendly approach if we
> don't outrage someone in the kernel community in the meantime.

Leaving stubs in (while making it clear that v1 is not something you
should use) seems like a good compromise. While we have done selective
headers updates in QEMU in the past, I always found them painful, so I'd
like to avoid that.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1
  2022-02-02 11:34                   ` Cornelia Huck
@ 2022-02-02 12:22                     ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-02 12:22 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Wed, Feb 02, 2022 at 12:34:31PM +0100, Cornelia Huck wrote:
> On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 01, 2022 at 03:19:18PM +0100, Cornelia Huck wrote:
> >> - have an RFC for QEMU that contains a provisional update of the
> >>   relevant vfio headers so that we can discuss the QEMU side (and maybe
> >>   shoot down any potential problems in the uapi before they are merged
> >>   in the kernel)
> >
> > This qemu patch is linked in the cover letter.
>
> The QEMU changes need to be discussed on qemu-devel, a link to a git
> tree with work in progress only goes so far.

Of course, but we are not going to bother the qemu community until the
kernel side is settled.

> (From my quick look there, this needs to have any headers changes split
> out into a separate patch. The changes in migration.c are hard to
> review; is there any chance to split the error path cleanups from the
> interface changes?)

We can do whatever, once we figure out what it actually needs to look
like. Rip and replace might be the best option.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-02  0:24             ` Jason Gunthorpe
@ 2022-02-02 23:36               ` Alex Williamson
  2022-02-03 14:17                 ` Jason Gunthorpe
  2022-02-04 12:12                 ` Cornelia Huck
  2022-02-03 15:51               ` Tarun Gupta (SW-GPU)
  1 sibling, 2 replies; 55+ messages in thread
From: Alex Williamson @ 2022-02-02 23:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, 1 Feb 2022 20:24:59 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote:
> > On Tue, 1 Feb 2022 14:36:20 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:
> > >   
> > > > Ok, let me parrot back to see if I understand.  -ENOTTY will be
> > > > returned if the ioctl doesn't exist, in which case device_state is
> > > > untouched and cannot be trusted.  At the same time, we expect the user
> > > > to use the feature ioctl to make sure the ioctl exists, so it would
> > > > seem that we've reclaimed that errno if we believe the user should
> > > > follow the protocol.    
> > > 
> > > I don't follow - the documentation says what the code does, if you get
> > > ENOTTY returned then you don't get the device_state too. Saying the
> > > user shouldn't have called it in the first place is completely
> > > correct, but doesn't change the device_state output.  
> > 
> > The documentation says "...the device state output is not reliable", and
> > I have to question whether this qualifies as a well specified,
> > interoperable spec with such language.  We're essentially asking users
> > to keep track that certain errnos result in certain fields of the
> > structure _maybe_ being invalid.  
> 
> So you are asking to remove "is not reliable" and just phrase it as:
> 
> "device_state is updated to the current value when -1 is returned,
> except when these XXX errnos are returned"?
> 
> (actually userspace can tell directly without checking the errno - as
> if -1 is returned the device_state cannot be the requested target
> state anyhow)

If we decide to keep the existing code, then yes the spec should
indicate the device_state is invalid, not just unreliable for those
errnos, but I'm also of the opinion that returning an error condition
AND providing valid data in the return structure for all but a few
errnos and expecting userspace to get this correct is not a good API.
 
> > Now you're making me wonder how much I care to invest in semantic
> > arguments over extended errnos :-\  
> 
> Well, I know I don't :) We don't have consistency in the kernel and
> userspace is hard pressed to make any sense of it most of the time,
> IMHO. It just doesn't practically matter..
> 
> > > We don't know the device_state in the core code because it can only be
> > > read under locking that is controlled by the driver. I hope when we
> > > get another driver merged that we can hoist the locking, but right now
> > > I'm not really sure - it is a complicated lock.  
> > 
> > The device cannot self transition to a new state, so if the core were
> > to serialize this ioctl then the device_state provided by the driver is
> > valid, regardless of its internal locking.  
> 
> It is allowed to transition to RUNNING due to reset events it captures,
> and since we capture the reset through the PCI hook, not from VFIO,
> the core code doesn't synchronize well. See patch 14.

Looking... your .reset_done() function sets a deferred_reset flag and
attempts to grab the state_mutex.  If there's contention on that mutex,
it exits, since the lock holder will perform the state transition when
dropping that mutex; otherwise reset_done will itself drop the mutex to
do that state change.  The reset_lock assures that we cannot race as the
state_mutex is being released.
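
Ie the unlock side replays any reset that raced in; paraphrasing patch
14 (simplified):

	static void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev)
	{
	again:
		spin_lock(&mvdev->reset_lock);
		if (mvdev->deferred_reset) {
			mvdev->deferred_reset = false;
			spin_unlock(&mvdev->reset_lock);
			/* reset put the device back to RUNNING */
			mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
			goto again;
		}
		mutex_unlock(&mvdev->state_mutex);
		spin_unlock(&mvdev->reset_lock);
	}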

So the scenario is that the user MUST be performing a reset coincident
with accessing the device_state, and the solution is that the user's
SET_STATE returns success and a new device state that's already bogus
due to the reset.  Why wouldn't the solution here be to return -EAGAIN
to the user or reattempt the SET_STATE since the user is clearly now
disconnected from the actual device_state?

> > Whether this ioctl should be serialized anyway is probably another good
> > topic to breach.  Should a user be able to have concurrent ioctls
> > setting conflicting states?  
> 
> The driver is required to serialize, the core code doesn't touch any
> global state and doesn't need serializing.
> 
> > I'd suggest that ioctl return structure is only valid at all on
> > success and we add a GET interface to return the current device  
> 
> We can do this too, but it is a bunch of code to achieve this and I
> don't have any use case to read back the device_state beyond debugging
> and debugging is fine with this. IMHO

A bunch of code?  If we use a FEATURE ioctl, it just extends the
existing implementation to add GET support.  That looks rather trivial.
That seems like a selling point for using the FEATURE ioctl TBH.
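
Eg the handler just grows a branch like this (hypothetical sketch;
the struct and variable names are invented, only
VFIO_DEVICE_FEATURE_GET already exists):

	if (feature.flags & VFIO_DEVICE_FEATURE_GET) {
		struct vfio_device_feature_mig_state mig = {
			.device_state = cur_state,
		};

		if (copy_to_user(uarg, &mig, sizeof(mig)))
			return -EFAULT;
		return 0;
	}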
 
> > It's entirely possible that I'm overly averse to ioctl proliferation,
> > but for every new ioctl we need to take a critical look at the proposed
> > API, use case, applicability, and extensibility.    
> 
> This is all basically the same no matter where it is put; the feature
> multiplexer is just an ioctl in some semi-standard format, but the
> vfio pattern of argsz/flags is also a standard format that is
> basically the same thing.
> 
> We still need to think about extensibility, alignment, etc..
> 
> The problem I usually see with ioctls is not proliferation, but ending
> up with too many choices and a big ?? when it comes to adding
> something new.
> 
> Clear rules where things should go and why is the best, it matters
> less what the rules actually are IMHO.
> 
> > > I don't want to touch capabilities, but we can try to use feature for
> > > set state. Please confirm this is what you want.  
> > 
> > It's a team sport, but to me it seems like it fits well both in my
> > mental model of interacting with a device feature, without
> > significantly altering the uAPI you're defining anyway.  
> 
> Well, my advice is that ioctls are fine, and a bit easier all around.
> eg strace and syzkaller are a bit easier if everything neatly maps
> into one struct per ioctl - their generator tools are optimized for
> this common case.
> 
> Simple multiplexors are next-best-fine, but there should be a clear
> idea when to use the multiplexer, or not.
> 
> Things like the cap chains enter a whole world of adventure for
> strace/syzkaller :)

vfio's argsz/flags is not only a standard framework, but it's one that
promotes extensions.  We were able to add capability chains with
backwards compatibility because of this design.  IMO, that's avoided
ioctl sprawl; we've been able to maintain a fairly small set of core
ioctls rather than add a new ioctl every time we want to describe
some new property of a device or region or IOMMU.  I think that
improves the usability of the uAPI.  I certainly wouldn't want to
program to a uAPI with a million ioctls.  A counter argument is that
we're making the interface more complex, but at the same time we're
adding shared infrastructure for dealing with that complexity.
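
For anyone following along, the chain is built from the existing
vfio_info_cap_header in the uAPI; each capability embeds one of these
and points at the next:

	struct vfio_info_cap_header {
		__u16	id;		/* identifies capability */
		__u16	version;	/* version specific to the capability ID */
		__u32	next;		/* offset of next capability, 0 if last */
	};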

Of course we do continue to add new ioctls as necessary, including this
FEATURE ioctl, and I recognize that with such a generic multiplexer we
run the risk of overusing it, i.e. everything looks like a nail.  You
initially did not see the fit for setting device state as interacting
with a device feature, but it doesn't seem like you had a strong
objection to my explanation of it in that context.

So I think if the FEATURE ioctl has an ongoing place in our uAPI (using
it to expose migration flags would seem to be a point in that
direction) and it doesn't require too many contortions to think of the
operation we're trying to perform on the device as interacting with a
device FEATURE, and there are no functional or performance implications
of it, I would think we should use it.  To do otherwise would suggest
that we should consider the FEATURE ioctl a failed experiment and not
continue to expand its use.

I'd be interested to hear more input on this from the community.
 
> > > You'll want the same for the PRE_COPY related information too?  
> > 
> > I hadn't gotten there yet.  It seems like a discontinuity to me that
> > we're handing out new FDs for data transfer sessions, but then we
> > require the user to come back to the device to query about the data it's
> > reading through that other FD.    
> 
> An earlier draft of this put it on the data FD, but v6 made it fully
> optional with no functional impact on the data FD. The values decrease
> as the data FD progresses and increase as the VM dirties data - i.e. it
> is 50/50 data_fd/device behavior.
> 
> It doesn't matter which way, but it feels quite weird to have the main
> state function be a FEATURE while the precopy query is an ioctl.

If the main state function were a FEATURE ioctl on the device and the
data transfer query was an ioctl on the FD returned from that feature
ioctl, I don't see how that's weird at all.  Different FDs, different
interfaces.

To me, the device has provided a separate FD for data transfer, so the
fact that we consume the data via that FD, but monitor our progress in
consuming that data back on the device FD is a bit strange.
 
> > Should that be an ioctl on the data stream FD itself?    
> 
> It can be. Implementation-wise it is about a wash.
> 
> > Is there a use case for also having it on the STOP_COPY FD?  
> 
> I didn't think of one worthwhile enough to mandate implementing it in
> every driver.

Can the user perform an lseek(2) on the migration FD?  Maybe that would
be the difference between what we need for PRE_COPY vs STOP_COPY.  In
the latter case the data should be a fixed size and perhaps we don't
need another interface to know how much data to expect.
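
Something like this from userspace, hypothetically (only if we chose
to make the FD seekable; data_fd is a placeholder name):

	/* probe the total STOP_COPY size, then rewind and read it all */
	off_t total = lseek(data_fd, 0, SEEK_END);
	if (total >= 0)
		lseek(data_fd, 0, SEEK_SET);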

One use case would be that we want to be able to detect whether we can
meet service guarantees as quickly as possible with the minimum
resource consumption and downtime.  If we can determine from the device
that we can't possibly transfer its state in the required time, we can
abort immediately without waiting for a downtime exception or flooding
the migration link.  Thanks,

Alex



* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-01 19:50         ` Jason Gunthorpe
@ 2022-02-02 23:54           ` Alex Williamson
  2022-02-03 14:22             ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-02-02 23:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Tue, 1 Feb 2022 15:50:03 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 12:13:22PM -0700, Alex Williamson wrote:
> > On Tue, 1 Feb 2022 14:53:21 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:  
> > > > > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > > > > +
> > > > >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> > > > >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> > > > >  		return VFIO_DEVICE_STATE_ERROR;
> > > > >  
> > > > > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > > > > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > > > > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > > > > +		return VFIO_DEVICE_STATE_ERROR;    
> > > > 
> > > > new_fsm is provided by the user, we pass set_state.device_state
> > > > directly to .migration_set_state.  We should do bounds checking and
> > > > compatibility testing on the end state in the core so that we can    
> > > 
> > > This is the core :)  
> > 
> > But this is the wrong place, we need to do it earlier rather than when
> > we're already iterating next states.  I only mention core to avoid that
> > I'm suggesting a per driver responsibility.  
> 
> Only the first vfio_mig_get_next_state() can return ERROR; once it
> succeeds the subsequent ones must also succeed.

Yes, I see that.
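
i.e. the driver-side loop looks roughly like this (condensed from the
mlx5 patch, from memory, not the exact code):

	while (mvdev->mig_state != new_state) {
		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
						     new_state);
		if (next_state == VFIO_DEVICE_STATE_ERROR) {
			res = ERR_PTR(-EINVAL);
			break;
		}
		/* the driver executes one arc at a time */
		res = mlx5vf_pci_step_device_state_locked(mvdev, next_state);
		if (IS_ERR(res))
			break;
		mvdev->mig_state = next_state;
	}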

> This is the earliest it can be. It is done directly after taking the lock
> that allows us to read the current state to call this function to
> determine if the requested transition is acceptable.

I think the argument here is that there's no value in validating or
bounds-checking the end state in the core ioctl before calling the
driver, since the driver's first iteration will already fail for both
the end state and the full path validation.

> > > Userspace can never put the device into error. As the function comment
> > > says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
> > > permitted. The driver is required to reflect that back as an errno
> > > like mlx5 shows:
> > > 
> > > +		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
> > > +						     new_state);
> > > +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> > > +			res = ERR_PTR(-EINVAL);
> > > +			break;
> > > +		}
> > > 
> > > We never get the driver into error, userspaces gets an EINVAL and no
> > > change to the device state.  
> > 
> > Hmm, subtle.  I'd argue that if we do a bounds and support check of the
> > end state in vfio_ioctl_mig_set_state() before calling
> > .migration_set_state() then we could remove ERROR from
> > vfio_from_fsm_table[] altogether and simply begin
> > vfio_mig_get_next_state() with:  
> 
> Then we can't reject blocked arcs like STOP_COPY -> PRE_COPY.

Right, I hadn't made it through to 15/, which helps to clarify how the
cur_fsm + new_fsm validate the full arc.
 
> It is setup this way to allow the core code to assert all policy, not
> just a simple validation of the next_fsm.
> 
> > Then we only get to ERROR by the driver placing us in ERROR and things
> > feel a bit more sane to me.  
> 
> This is already true.
> 
> Perhaps it is confusing using ERROR to indicate that
> vfio_mig_get_next_state() failed. Would you be happier with a -errno
> return?

Yes, it's confusing to me that next_state() returns states that don't
become the device_state.  Stuffing the next step back into cur_fsm and
using an errno for a bounds/validity/blocked-arc test would be a better
API.  Thanks,

Alex



* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-02 23:36               ` Alex Williamson
@ 2022-02-03 14:17                 ` Jason Gunthorpe
  2022-02-04 12:12                 ` Cornelia Huck
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-03 14:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Wed, Feb 02, 2022 at 04:36:56PM -0700, Alex Williamson wrote:

> > So you are asking to remove "is not reliable" and just phrase it as:
> > 
> > "device_state is updated to the current value when -1 is returned,
> > except when these XXX errnos are returned?"
> > 
> > (actually userspace can tell directly without checking the errno - as
> > if -1 is returned the device_state cannot be the requested target
> > state anyhow)
> 
> If we decide to keep the existing code, then yes the spec should
> indicate the device_state is invalid, not just unreliable for those
> errnos, but I'm also of the opinion that returning an error condition
> AND providing valid data in the return structure for all but a few
> errnos and expecting userspace to get this correct is not a good API.

It was done this way because we didn't see any use case for reading
the device_state except debugging, and adding another ioctl
and driver op just to get the device_state without a real user looked
like overkill.

As you already analyzed, despite the scary label in the comment, this
return is actually fully reliable so long as userspace is operating
the API correctly - e.g. checking the feature flag and so on.

So, it is not as scary as you are making it out to be - and yes maybe
GET on FEATURE is cleaner.

> > It is allowed to transition to RUNNING due to reset events it captures
> > and since we capture the reset through the PCI hook, not from VFIO,
> > the core code doesn't synchronize well. See patch 14
> 
> Looking... your .reset_done() function sets a deferred_reset flag and
> attempts to grab the state_mutex.  If there's contention on that mutex,
> exit since the lock holder will perform the state transition when
> dropping that mutex, otherwise reset_done will itself drop the mutex to
> do that state change.  The reset_lock assures that we cannot race as the
> state_mutex is being released.
> 
> So the scenario is that the user MUST be performing a reset coincident
> with accessing the device_state, and the solution is that the user's
> SET_STATE returns success and a new device state that's already bogus
> due to the reset.

Er, no, you suggested the core code could just cache the return since
it cannot change and then use that cached value as though it is
correct. Since the reset happens outside the core call chain's view,
any core cache can become out of sync. It is not a race, it just means
we can't cache the value in the core.

> Why wouldn't the solution here be to return -EAGAIN to the user or
> reattempt the SET_STATE since the user is clearly now disconnected
> from the actual device_state?

This is just a race that the user inflicted on themselves. We protect
kernel integrity and choose to resolve the race as though the
set_state happened first in time and the reset happened second in
time.

The API is not designed to be used concurrently, so it is a user error
if they hit this.

> > We can do this too, but it is a bunch of code to achieve this and I
> > don't have any use case to read back the device_state beyond debugging
> > and debugging is fine with this. IMHO
> 
> A bunch of code?  If we use a FEATURE ioctl, it just extends the
> existing implementation to add GET support.  That looks rather trivial.
> That seems like a selling point for using the FEATURE ioctl TBH.

We didn't even define a driver op to return the current state; trivial
code, yes, but code nonetheless.

> > Things like the cap chains enter a whole world of adventure for
> > strace/syzkaller :)
> 
> vfio's argsz/flags is not only a standard framework, but it's one that
> promotes extensions.  We were able to add capability chains with
> backwards compatibility because of this design.  

IMHO the formal cap chains in the INFO ioctls were a mistake. The
argsz/flags already provide enough extension capability to return the
few extra fields directly by growing the main struct through argsz and
that handles most of what is in the caps.

The few variable-size caps, like iova ranges, would have been simpler
as system calls that return only that data. This saves userspace from
having to do all the memory allocation stuff just to read a single u32
when it has no interest in, say, ranges.
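
The struct-growth pattern is just this (a hypothetical example struct,
not a real uAPI one):

	struct vfio_example_info {
		__u32 argsz;		/* userspace sets to sizeof() its copy */
		__u32 flags;
		__u32 old_field;
		/* appended in a later kernel; filled only when the
		 * user-supplied argsz is large enough to cover it */
		__u32 new_field;
	};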

> initially did not see the fit for setting device state as interacting
> with a device feature, but it doesn't seem like you had a strong
> objection to my explanation of it in that context.

I don't have a strong feeling here. I think as the maintainer you
should just set a clear philosophy for ioctls in VFIO and communicate
it. There are many choices, most are reasonable.

We tried the FEATURE path, and it is OK of course, but it looks weird
as set_state is in/out due to the data_fd but is being used with
SET. I can't say that it is any better, and diffstat says it is more
code.

> > > Should that be an ioctl on the data stream FD itself?    
> > 
> > It can be. Implementation-wise it is about a wash.
> > 
> > > Is there a use case for also having it on the STOP_COPY FD?  
> > 
> > I didn't think of one worthwhile enough to mandate implementing it in
> > every driver.
> 
> Can the user perform an lseek(2) on the migration FD?  Maybe that would
> be the difference between what we need for PRE_COPY vs STOP_COPY.  In
> the latter case the data should be a fixed size and perhaps we don't
> need another interface to know how much data to expect.

I'm leery of abusing the FD interface this way; we set up the FD as
noseek, like a pipe, and the core fd code has some understanding of
this.
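
For reference, the FD comes from anon_inode and is marked stream-like
at creation, roughly like this (a sketch; save_fops and migf are
placeholder names):

	filp = anon_inode_getfile("vfio_mig", &save_fops, migf, O_RDONLY);
	if (IS_ERR(filp))
		return ERR_CAST(filp);
	/* no llseek, no pread/pwrite - the FD behaves like a pipe */
	stream_open(filp->f_inode, filp);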

> One use case would be that we want to be able to detect whether we can
> meet service guarantees as quickly as possible with the minimum
> resource consumption and downtime.  If we can determine from the device
> that we can't possibly transfer its state in the required time, we can
> abort immediately without waiting for a downtime exception or flooding
> the migration link.  Thanks,

It is an idea, but I don't know how to translate bytes to time; we
don't know, for instance, how fast the device can generate the data.

Jason 


* Re: [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
  2022-02-02 23:54           ` Alex Williamson
@ 2022-02-03 14:22             ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-03 14:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Wed, Feb 02, 2022 at 04:54:44PM -0700, Alex Williamson wrote:

> I think the argument here is that there's no value in validating or
> bounds-checking the end state in the core ioctl before calling the
> driver, since the driver's first iteration will already fail for both
> the end state and the full path validation.

Yes, I had a version like this in an internal draft, something like:

if (vfio_mig_get_next_state(vdev, set_state.device_state,
			    set_state.device_state) == VFIO_DEVICE_STATE_ERROR)
	return -EINVAL;

Which is fully redundant with the driver, only does half the check and
looks weird.

> > Perhaps it is confusing using ERROR to indicate that
> > vfio_mig_get_next_state() failed. Would you be happier with a -errno
> > return?
> 
> Yes, it's confusing to me that next_state() returns states that don't
> become the device_state.  Stuffing the next step back into cur_fsm and
> using an errno for a bounds/validity/blocked-arc test would be a better
> API.  Thanks,

OK
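
Something like this, then (an untested sketch of the revised
signature; the exact form can differ):

	/*
	 * Compute the next step from cur_fsm toward new_fsm into *next_fsm,
	 * or return -errno if the arc is out of bounds, unsupported or
	 * blocked.
	 */
	int vfio_mig_get_next_state(struct vfio_device *device,
				    enum vfio_device_mig_state cur_fsm,
				    enum vfio_device_mig_state new_fsm,
				    enum vfio_device_mig_state *next_fsm);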

Jason


* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-02  0:24             ` Jason Gunthorpe
  2022-02-02 23:36               ` Alex Williamson
@ 2022-02-03 15:51               ` Tarun Gupta (SW-GPU)
  1 sibling, 0 replies; 55+ messages in thread
From: Tarun Gupta (SW-GPU) @ 2022-02-03 15:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg, cjia



On 2/2/2022 5:54 AM, Jason Gunthorpe wrote:
> On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote:
>> On Tue, 1 Feb 2022 14:36:20 -0400
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:
>>>
>>>> Ok, let me parrot back to see if I understand.  -ENOTTY will be
>>>> returned if the ioctl doesn't exist, in which case device_state is
>>>> untouched and cannot be trusted.  At the same time, we expect the user
>>>> to use the feature ioctl to make sure the ioctl exists, so it would
>>>> seem that we've reclaimed that errno if we believe the user should
>>>> follow the protocol.
>>>
>>> I don't follow - the documentation says what the code does, if you get
>>> ENOTTY returned then you don't get the device_state too. Saying the
>>> user shouldn't have called it in the first place is completely
>>> correct, but doesn't change the device_state output.
>>
>> The documentation says "...the device state output is not reliable", and
>> I have to question whether this qualifies as a well specified,
>> interoperable spec with such language.  We're essentially asking users
>> to keep track that certain errnos result in certain fields of the
>> structure _maybe_ being invalid.
> 
> So you are asking to remove "is not reliable" and just phrase it as:
> 
> "device_state is updated to the current value when -1 is returned,
> except when these XXX errnos are returned?"
> 
> (actually userspace can tell directly without checking the errno - as
> if -1 is returned the device_state cannot be the requested target
> state anyhow)
> 
>> Now you're making me wonder how much I care to invest in semantic
>> arguments over extended errnos :-\
> 
> Well, I know I don't :) We don't have consistency in the kernel and
> userspace is hard pressed to make any sense of it most of the time,
> IMHO. It just doesn't practically matter.
> 
>>> We don't know the device_state in the core code because it can only be
>>> read under locking that is controlled by the driver. I hope when we
>>> get another driver merged that we can hoist the locking, but right now
>>> I'm not really sure - it is a complicated lock.
>>
>> The device cannot self-transition to a new state, so if the core were
>> to serialize this ioctl then the device_state provided by the driver is
>> valid, regardless of its internal locking.
> 
> It is allowed to transition to RUNNING due to reset events it captures
> and since we capture the reset through the PCI hook, not from VFIO,
> the core code doesn't synchronize well. See patch 14
> 
>> Whether this ioctl should be serialized anyway is probably another good
>> topic to breach.  Should a user be able to have concurrent ioctls
>> setting conflicting states?
> 
> The driver is required to serialize, the core code doesn't touch any
> global state and doesn't need serializing.
> 
>> I'd suggest that the ioctl return structure is only valid at all on
>> success and we add a GET interface to return the current device
> 
> We can do this too, but it is a bunch of code to achieve this and I
> don't have any use case to read back the device_state beyond debugging
> and debugging is fine with this. IMHO
> 
>> It's entirely possible that I'm overly averse to ioctl proliferation,
>> but for every new ioctl we need to take a critical look at the proposed
>> API, use case, applicability, and extensibility.
> 
> This is all basically the same no matter where it is put, the feature
> multiplexer is just an ioctl in some semi-standard format, but the
> vfio pattern of argsz/flags is also a standard format that is
> basically the same thing.
> 
> We still need to think about extensibility, alignment, etc..
> 
> The problem I usually see with ioctls is not proliferation, but ending
> up with too many choices and a big ?? when it comes to adding
> something new.
> 
> Clear rules for where things should go and why are best; it matters
> less what the rules actually are, IMHO.
> 
>>> I don't want to touch capabilities, but we can try to use feature for
>>> set state. Please confirm this is what you want.
>>
>> It's a team sport, but to me it seems like it fits well both in my
>> mental model of interacting with a device feature, without
>> significantly altering the uAPI you're defining anyway.
> 
> Well, my advice is that ioctls are fine, and a bit easier all around.
> eg strace and syzkaller are a bit easier if everything neatly maps
> into one struct per ioctl - their generator tools are optimized for
> this common case.
> 
> Simple multiplexors are next-best-fine, but there should be a clear
> idea when to use the multiplexer, or not.
> 
> Things like the cap chains enter a whole world of adventure for
> strace/syzkaller :)
> 
>>> You'll want the same for the PRE_COPY related information too?
>>
>> I hadn't gotten there yet.  It seems like a discontinuity to me that
>> we're handing out new FDs for data transfer sessions, but then we
>> require the user to come back to the device to query about the data it's
>> reading through that other FD.
> 
> An earlier draft of this put it on the data FD, but v6 made it fully
> optional with no functional impact on the data FD. The values decrease
> as the data FD progresses and increase as the VM dirties data - i.e. it
> is 50/50 data_fd/device behavior.
> 
> It doesn't matter which way, but it feels quite weird to have the main
> state function be a FEATURE while the precopy query is an ioctl.
> 
>> Should that be an ioctl on the data stream FD itself?
> 
> It can be. Implementation-wise it is about a wash.
> 
>> Is there a use case for also having it on the STOP_COPY FD?
> 
> I didn't think of one worthwhile enough to mandate implementing it in
> every driver.
> 
>>> If we are into these very minor nitpicks does this mean you are OK
>>> with all the big topics now?
>>
>> I'm not hating it, but I'd like to see buy-in from others who have a
>> vested interest in supporting migration.  I don't see Intel or Huawei
>> on the Cc list and the original collaborators of the v1 interface
>> from
> 
> That is an oversight, I'll ping them. I think people have been staying
> away until the dust settles.
> 
>> NVIDIA have been silent through this redesign.
> 
> We've reviewed this internally with them. They reserve judgement on
> the data transfer performance until they work on it, but functionally
> it has all the necessary semantics.
> 

Yes, we're reviewing the proposal from a vGPU point of view and will
update here once we have it figured out for vGPU.

Thanks,
Tarun

> They have the same P2P issue mlx5 does, and are happy with the
> solution under the same general provisions as already discussed for
> the Huawei device - RUNNING_P2P is sustainable only while the device
> is not touched - i.e. the VCPU is halted.
> 
> The f_ops implementation we used for mlx5 is basic; the full
> performance version would want to use the read/write_iter() fop with
> async completions to support the modern zero-copy io_uring based data
> motion in userspace. This is all part of the standard FD abstraction
> and why it is appealing to use it.
> 
> Thanks,
> Jason


* Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
  2022-02-02 23:36               ` Alex Williamson
  2022-02-03 14:17                 ` Jason Gunthorpe
@ 2022-02-04 12:12                 ` Cornelia Huck
  1 sibling, 0 replies; 55+ messages in thread
From: Cornelia Huck @ 2022-02-04 12:12 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Wed, Feb 02 2022, Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 1 Feb 2022 20:24:59 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote:
>> > On Tue, 1 Feb 2022 14:36:20 -0400
>> > Jason Gunthorpe <jgg@nvidia.com> wrote:

>> > > I don't want to touch capabilities, but we can try to use feature for
>> > > set state. Please confirm this is what you want.  
>> > 
>> > It's a team sport, but to me it seems like it fits well both in my
>> > mental model of interacting with a device feature, without
>> > significantly altering the uAPI you're defining anyway.  
>> 
>> Well, my advice is that ioctls are fine, and a bit easier all around.
>> eg strace and syzkaller are a bit easier if everything neatly maps
>> into one struct per ioctl - their generator tools are optimized for
>> this common case.
>> 
>> Simple multiplexors are next-best-fine, but there should be a clear
>> idea when to use the multiplexer, or not.
>> 
>> Things like the cap chains enter a whole world of adventure for
>> strace/syzkaller :)
>
> vfio's argsz/flags is not only a standard framework, but it's one that
> promotes extensions.  We were able to add capability chains with
> backwards compatibility because of this design.  IMO, that's avoided
> ioctl sprawl; we've been able to maintain a fairly small set of core
> ioctls rather than add a new ioctl every time we want to describe
> some new property of a device or region or IOMMU.  I think that
> improves the usability of the uAPI.  I certainly wouldn't want to
> program to a uAPI with a million ioctls.  A counter argument is that
> we're making the interface more complex, but at the same time we're
> adding shared infrastructure for dealing with that complexity.
>
> Of course we do continue to add new ioctls as necessary, including this
> FEATURE ioctl, and I recognize that with such a generic multiplexer we
> run the risk of overusing it, i.e. everything looks like a nail.  You
> initially did not see the fit for setting device state as interacting
> with a device feature, but it doesn't seem like you had a strong
> objection to my explanation of it in that context.
>
> So I think if the FEATURE ioctl has an ongoing place in our uAPI (using
> it to expose migration flags would seem to be a point in that
> direction) and it doesn't require too many contortions to think of the
> operation we're trying to perform on the device as interacting with a
> device FEATURE, and there are no functional or performance implications
> of it, I would think we should use it.  To do otherwise would suggest
> that we should consider the FEATURE ioctl a failed experiment and not
> continue to expand its use.
>
> I'd be interested to hear more input on this from the community.

My personal take would be: a new ioctl is more suitable for things that
may be implemented by different backends, but in a non-generic way, and
for mandatory functionality; the FEATURE ioctl is more suitable for
things that either are very specific to a certain backend (i.e. don't
reserve an ioctl for something that will only ever be used on one
platform), or for things that have a lot of commonality for the backends
that implement them (i.e. you are using a familiar scheme to interact
with them.)

From staring at the code and the discussion here for a bit (I have not
yet made my way through all of this except in a superficial way), I'd
lean more towards using FEATURE here.



end of thread

Thread overview: 55+ messages
2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
2022-01-31 23:41   ` Alex Williamson
2022-02-01  0:11     ` Jason Gunthorpe
2022-02-01 15:47       ` Alex Williamson
2022-02-01 15:49         ` Jason Gunthorpe
2022-01-30 16:08 ` [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
2022-01-31 23:43   ` Alex Williamson
2022-02-01  0:31     ` Jason Gunthorpe
2022-02-01 17:04       ` Alex Williamson
2022-02-01 18:36         ` Jason Gunthorpe
2022-02-01 21:49           ` Alex Williamson
2022-02-02  0:24             ` Jason Gunthorpe
2022-02-02 23:36               ` Alex Williamson
2022-02-03 14:17                 ` Jason Gunthorpe
2022-02-04 12:12                 ` Cornelia Huck
2022-02-03 15:51               ` Tarun Gupta (SW-GPU)
2022-02-01 12:06   ` Cornelia Huck
2022-02-01 12:10     ` Jason Gunthorpe
2022-02-01 12:18       ` Cornelia Huck
2022-02-01 12:27         ` Jason Gunthorpe
2022-01-30 16:08 ` [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
2022-02-01 11:54   ` Cornelia Huck
2022-02-01 12:13     ` Jason Gunthorpe
2022-02-01 18:31   ` Alex Williamson
2022-02-01 18:53     ` Jason Gunthorpe
2022-02-01 19:13       ` Alex Williamson
2022-02-01 19:50         ` Jason Gunthorpe
2022-02-02 23:54           ` Alex Williamson
2022-02-03 14:22             ` Jason Gunthorpe
2022-01-30 16:08 ` [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1 Yishai Hadas
2022-02-01 11:23   ` Cornelia Huck
2022-02-01 12:13     ` Jason Gunthorpe
2022-02-01 12:39       ` Cornelia Huck
2022-02-01 12:54         ` Jason Gunthorpe
2022-02-01 13:26           ` Cornelia Huck
2022-02-01 13:52             ` Jason Gunthorpe
2022-02-01 14:19               ` Cornelia Huck
2022-02-01 14:29                 ` Jason Gunthorpe
2022-02-02 11:34                   ` Cornelia Huck
2022-02-02 12:22                     ` Jason Gunthorpe
2022-02-01 23:01         ` Alex Williamson
2022-02-02  0:28           ` Jason Gunthorpe
2022-02-02 11:38           ` Cornelia Huck
2022-01-30 16:08 ` [PATCH V6 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
