* [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver
@ 2021-10-13  9:46 Yishai Hadas
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index Yishai Hadas
                   ` (12 more replies)
  0 siblings, 13 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:46 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

This series adds the mlx5 live migration driver for VFs that are
migration capable.

It uses vfio_pci_core to register with the VFIO subsystem and then
implements the mlx5-specific logic in the migration area.

The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.

The series adds an option in the vfio core layer that lets a registered
driver receive a 'device RESET done' notification. This is needed so the
driver can maintain its state accordingly.

As part of the migration process the VF doesn't ride on mlx5_core; the
device is driving *two* different PCI devices, the PF owned by mlx5_core
and the VF owned by the mlx5 vfio driver.

The mlx5_core of the PF is accessed only during the narrow window of the
VF's ioctl that requires its services.

To let that work properly, a new API was added in the PCI layer (i.e.
pci_iov_get_pf_drvdata) that lets the VF safely access the PF's
drvdata. It is used in this series by mlx5_core and mlx5_vdpa when a VF
needs that functionality.

In addition, mlx5_core was aligned with other drivers to disable SRIOV
before the PF goes away, as part of the remove_one() callback.

This enables proper usage of the above new PCI API and prevents a
warning message that is emitted today when it's not done.

The series also exposes an API from the PCI subsystem named
pci_iov_vf_id() to get the index of a VF. The PCI core uses this index
internally, often called the vf_id, during the setup of the VF, e.g. in
pci_iov_add_virtfn().

The returned VF index is needed by the mlx5 vfio driver for its internal
operations to configure/control its VFs as part of the migration
process.

With the above functionality in place, the driver implements the
suspend/resume flows so that migration works over QEMU.

Changes from V0:
PCI/IOV:
- Add an API (i.e. pci_iov_get_pf_drvdata()) that allows SR-IOV VF
  drivers to reach the drvdata of a PF.
net/mlx5:
- Add an extra patch to disable SRIOV before PF removal.
- Adapt to use the above PCI/IOV API as part of mlx5_vf_get_core_dev().
- Reuse the exported PCI/IOV virtfn index function call (i.e.
  pci_iov_vf_id()).
vfio:
- Add support in the pci_core to let a driver be notified on
  'reset_done' so it can set its internal state accordingly.
- Add some helper stuff for 'invalid' state handling.
vfio/mlx5:
- Move to use the 'command mode' instead of the 'state machine'
  scheme, as was discussed on the mailing list.
- Handle the RESET scenario when called by vfio_pci_core and set the
  internal state accordingly.
- Set the initial state to RUNNING.
- Put the driver files in a sub-folder under drivers/vfio/pci named
  mlx5 and update the MAINTAINERS file as was asked.
vdpa/mlx5:
- Add a new patch to use mlx5_vf_get_core_dev() to get the PF device.

---------------------------------------------------------------
Alex,

This series touches our ethernet and RDMA drivers, so we will need to
route the patches through a separate shared branch (mlx5-next) in order
to eliminate the chance of merge conflicts between different subsystems.

Thanks,
Yishai

Jason Gunthorpe (2):
  PCI/IOV: Provide internal VF index
  PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF

Leon Romanovsky (1):
  net/mlx5: Reuse exported virtfn index function call

Yishai Hadas (10):
  net/mlx5: Disable SRIOV before PF removal
  net/mlx5: Expose APIs to get/put the mlx5 core device
  vdpa/mlx5: Use mlx5_vf_get_core_dev() to get PF device
  vfio: Add 'invalid' state definitions
  vfio/pci_core: Make the region->release() function optional
  net/mlx5: Introduce migration bits and structures
  vfio/mlx5: Expose migration commands over mlx5 device
  vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device
    RESET
  vfio/mlx5: Trap device RESET and update state accordingly

 MAINTAINERS                                   |   6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  44 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   1 +
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  17 +-
 drivers/pci/iov.c                             |  43 ++
 drivers/vdpa/mlx5/net/mlx5_vnet.c             |  27 +-
 drivers/vfio/pci/Kconfig                      |   3 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/mlx5/Kconfig                 |  11 +
 drivers/vfio/pci/mlx5/Makefile                |   4 +
 drivers/vfio/pci/mlx5/cmd.c                   | 353 +++++++++
 drivers/vfio/pci/mlx5/cmd.h                   |  43 ++
 drivers/vfio/pci/mlx5/main.c                  | 707 ++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c            |   8 +-
 drivers/vfio/pci/vfio_pci_core.c              |   5 +-
 include/linux/mlx5/driver.h                   |   3 +
 include/linux/mlx5/mlx5_ifc.h                 | 145 +++-
 include/linux/pci.h                           |  15 +-
 include/linux/vfio.h                          |   5 +
 include/linux/vfio_pci_core.h                 |  10 +
 include/uapi/linux/vfio.h                     |   4 +-
 21 files changed, 1428 insertions(+), 28 deletions(-)
 create mode 100644 drivers/vfio/pci/mlx5/Kconfig
 create mode 100644 drivers/vfio/pci/mlx5/Makefile
 create mode 100644 drivers/vfio/pci/mlx5/cmd.c
 create mode 100644 drivers/vfio/pci/mlx5/cmd.h
 create mode 100644 drivers/vfio/pci/mlx5/main.c

-- 
2.18.1



* [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
@ 2021-10-13  9:46 ` Yishai Hadas
  2021-10-13 18:14   ` Bjorn Helgaas
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 02/13] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:46 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

The PCI core uses the VF index internally, often called the vf_id,
during the setup of the VF, eg pci_iov_add_virtfn().

This index is needed by device drivers that implement live migration
for their internal operations that configure/control their VFs.

Specifically, the mlx5_vfio_pci driver introduced in coming patches of
this series needs it, rather than the bus/device/function which is
exposed today.

Add pci_iov_vf_id() which computes the vf_id by reversing the math that
was used to create the bus/device/function.
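
For illustration, with hypothetical SR-IOV parameters (PF at 3b:00.0,
First VF Offset 1, VF Stride 1), VF number 2 lands at routing ID
(0x3b << 8) + 0x0 + 1 + 2 * 1 = 0x3b03, i.e. 3b:00.3, and
pci_iov_vf_id() reverses this as (0x3b03 - (0x3b00 + 1)) / 1 = 2.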

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/iov.c   | 14 ++++++++++++++
 include/linux/pci.h |  8 +++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index dafdc652fcd0..e7751fa3fe0b 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 }
 EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
 
+int pci_iov_vf_id(struct pci_dev *dev)
+{
+	struct pci_dev *pf;
+
+	if (!dev->is_virtfn)
+		return -EINVAL;
+
+	pf = pci_physfn(dev);
+	return (((dev->bus->number << 8) + dev->devfn) -
+		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
+	       pf->sriov->stride;
+}
+EXPORT_SYMBOL_GPL(pci_iov_vf_id);
+
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
  * change when NumVFs changes.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cd8aa6fce204..2337512e67f0 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2153,7 +2153,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 #ifdef CONFIG_PCI_IOV
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
-
+int pci_iov_vf_id(struct pci_dev *dev);
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 
@@ -2181,6 +2181,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 {
 	return -ENOSYS;
 }
+
+static inline int pci_iov_vf_id(struct pci_dev *dev)
+{
+	return -ENOSYS;
+}
+
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 
-- 
2.18.1



* [PATCH V1 mlx5-next 02/13] net/mlx5: Reuse exported virtfn index function call
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index Yishai Hadas
@ 2021-10-13  9:46 ` Yishai Hadas
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 03/13] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:46 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Leon Romanovsky <leonro@nvidia.com>

Instead of open-coding the iteration that compares the virtfn internal
index, use the newly introduced pci_iov_vf_id() call.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index e8185b69ac6c..24c4b4f05214 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -205,19 +205,8 @@ int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
 			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
 
 	sriov = &dev->priv.sriov;
-
-	/* Reversed translation of PCI VF function number to the internal
-	 * function_id, which exists in the name of virtfn symlink.
-	 */
-	for (id = 0; id < pci_num_vf(pf); id++) {
-		if (!sriov->vfs_ctx[id].enabled)
-			continue;
-
-		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
-			break;
-	}
-
-	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+	id = pci_iov_vf_id(vf);
+	if (id < 0 || !sriov->vfs_ctx[id].enabled)
 		return -EINVAL;
 
 	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
-- 
2.18.1



* [PATCH V1 mlx5-next 03/13] net/mlx5: Disable SRIOV before PF removal
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index Yishai Hadas
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 02/13] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
@ 2021-10-13  9:46 ` Yishai Hadas
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF Yishai Hadas
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:46 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Virtual functions depend on the physical function for device access (for
example, firmware host PAGE management), so make sure to disable SRIOV
before the PF is gone.

This also prevents the below warning, which is emitted if the PF goes
away before SRIOV is disabled:
"driver left SR-IOV enabled after remove"

The next patch in this series relies on this, as the VF may need to
safely access the PF 'driver data'.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c      | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c     | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 79482824c64f..0b9a911acfc1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1558,6 +1558,7 @@ static void remove_one(struct pci_dev *pdev)
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
 	struct devlink *devlink = priv_to_devlink(dev);
 
+	mlx5_sriov_disable(pdev);
 	devlink_reload_disable(devlink);
 	mlx5_crdump_disable(dev);
 	mlx5_drain_health_wq(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 230eab7e3bc9..f21d64416f7f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -140,6 +140,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+void mlx5_sriov_disable(struct pci_dev *pdev);
 int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 24c4b4f05214..887ee0f729d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -161,7 +161,7 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 	return err;
 }
 
-static void mlx5_sriov_disable(struct pci_dev *pdev)
+void mlx5_sriov_disable(struct pci_dev *pdev)
 {
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
 	int num_vfs = pci_num_vf(dev->pdev);
-- 
2.18.1



* [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (2 preceding siblings ...)
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 03/13] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
@ 2021-10-13  9:46 ` Yishai Hadas
  2021-10-13 18:27   ` Bjorn Helgaas
  2021-10-14 22:11   ` Alex Williamson
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 05/13] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:46 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

From: Jason Gunthorpe <jgg@nvidia.com>

There are some cases where an SRIOV VF driver will need to reach into and
interact with the PF driver. This requires accessing the drvdata of the PF.

Provide a function pci_iov_get_pf_drvdata() to return this PF drvdata in a
safe way. Normally accessing a drvdata of a foreign struct device would be
done using the device_lock() to protect against device driver
probe()/remove() races.

However, due to the design of pci_enable_sriov() this will result in an
ABBA deadlock on the device_lock: the PF's device_lock is held during the
PF's sriov_configure() while calling pci_enable_sriov(), which in turn
holds the VF's device_lock while calling the VF's probe(), and similarly
for remove.

This means the VF driver can never obtain the PF's device_lock.

Instead use the implicit locking created by pci_enable/disable_sriov(). A
VF driver can access its PF drvdata only while its own driver is attached,
and the PF driver can control access to its own drvdata based on when it
calls pci_enable/disable_sriov().

To use this API the PF driver will setup the PF drvdata in the probe()
function. pci_enable_sriov() is only called from sriov_configure() which
cannot happen until probe() completes, ensuring no VF races with drvdata
setup.

For removal, the PF driver must call pci_disable_sriov() in its remove
function before destroying any of the drvdata. This ensures that all VF
drivers are unbound before returning, fencing concurrent access to the
drvdata.

The introduction of a new function to do this access makes the special
locking scheme clear and documents the requirements on the PF/VF drivers
using it.
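
For illustration, a hypothetical PF/VF driver pair could use the API as
below (a minimal sketch, not part of this patch; my_pf_driver,
struct my_pf_priv and the probe/remove names are made up):

	/* PF driver: drvdata is set before sriov_configure() can run */
	static int my_pf_probe(struct pci_dev *pdev,
			       const struct pci_device_id *id)
	{
		struct my_pf_priv *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

		if (!priv)
			return -ENOMEM;
		pci_set_drvdata(pdev, priv);
		return 0;
	}

	static void my_pf_remove(struct pci_dev *pdev)
	{
		/* Unbinds all VF drivers before the drvdata is destroyed */
		pci_disable_sriov(pdev);
		kfree(pci_get_drvdata(pdev));
	}

	/* VF driver: the returned pointer is valid only while this
	 * driver itself is bound.
	 */
	static int my_vf_probe(struct pci_dev *pdev,
			       const struct pci_device_id *id)
	{
		struct my_pf_priv *pf_priv =
			pci_iov_get_pf_drvdata(pdev, &my_pf_driver);

		if (IS_ERR(pf_priv))
			return PTR_ERR(pf_priv);
		return 0;
	}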

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/pci/iov.c   | 29 +++++++++++++++++++++++++++++
 include/linux/pci.h |  7 +++++++
 2 files changed, 36 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index e7751fa3fe0b..ca696730f761 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -47,6 +47,35 @@ int pci_iov_vf_id(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_iov_vf_id);
 
+/**
+ * pci_iov_get_pf_drvdata - Return the drvdata of a PF
+ * @dev: VF pci_dev
+ * @pf_driver: Device driver required to own the PF
+ *
+ * This must be called from a context that ensures that a VF driver is attached.
+ * The value returned is invalid once the VF driver completes its remove()
+ * callback.
+ *
+ * Locking is achieved by the driver core. A VF driver cannot be probed until
+ * pci_enable_sriov() is called and pci_disable_sriov() does not return until
+ * all VF drivers have completed their remove().
+ *
+ * The PF driver must call pci_disable_sriov() before it begins to destroy the
+ * drvdata.
+ */
+void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver)
+{
+	struct pci_dev *pf_dev;
+
+	if (dev->is_physfn)
+		return ERR_PTR(-EINVAL);
+	pf_dev = dev->physfn;
+	if (pf_dev->driver != pf_driver)
+		return ERR_PTR(-EINVAL);
+	return pci_get_drvdata(pf_dev);
+}
+EXPORT_SYMBOL_GPL(pci_iov_get_pf_drvdata);
+
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
  * change when NumVFs changes.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2337512e67f0..639a0a239774 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2154,6 +2154,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
 int pci_iov_vf_id(struct pci_dev *dev);
+void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver);
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 
@@ -2187,6 +2188,12 @@ static inline int pci_iov_vf_id(struct pci_dev *dev)
 	return -ENOSYS;
 }
 
+static inline void *pci_iov_get_pf_drvdata(struct pci_dev *dev,
+					   struct pci_driver *pf_driver)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 
-- 
2.18.1



* [PATCH V1 mlx5-next 05/13] net/mlx5: Expose APIs to get/put the mlx5 core device
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (3 preceding siblings ...)
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF Yishai Hadas
@ 2021-10-13  9:46 ` Yishai Hadas
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 06/13] vdpa/mlx5: Use mlx5_vf_get_core_dev() to get PF device Yishai Hadas
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:46 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Expose an API to get the mlx5 core device from a given VF PCI device if
mlx5_core is its driver.

Upon the get API we stay with the intf_state_mutex locked to make sure
that the device can't be gone/unloaded until the caller completes its
job over the device; this is expected to be a short period of time for
any flow that takes the lock.

Upon the put API we unlock the intf_state_mutex.

The use case for these APIs is the migration flow of a VF over VFIO PCI.
In that case the VF doesn't ride on mlx5_core, because the device is
driving *two* different PCI devices, the PF owned by mlx5_core and the
VF owned by the vfio driver.

The mlx5_core of the PF is accessed only during the narrow window of the
VF's ioctl that requires its services.

This allows the PF driver to be more independent of the VF driver, so
long as it doesn't reset the FW.
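
The expected calling pattern is roughly as follows (a sketch; vf_pdev is
the VF's struct pci_dev and do_vf_command() is a placeholder for the
short, bounded work done under the lock - the real callers appear in
later patches of this series):

	struct mlx5_core_dev *mdev;
	int err;

	mdev = mlx5_vf_get_core_dev(vf_pdev);
	if (!mdev)
		return -ENOTCONN;

	/* short, bounded work while intf_state_mutex is held */
	err = do_vf_command(mdev);

	mlx5_vf_put_core_dev(mdev);
	return err;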

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 43 +++++++++++++++++++
 include/linux/mlx5/driver.h                   |  3 ++
 2 files changed, 46 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 0b9a911acfc1..38e7c692e733 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1796,6 +1796,49 @@ static struct pci_driver mlx5_core_driver = {
 	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };
 
+/**
+ * mlx5_vf_get_core_dev - Get the mlx5 core device from a given VF PCI device if
+ *                     mlx5_core is its driver.
+ * @pdev: The associated PCI device.
+ *
+ * Upon return the interface state lock stays held so the caller can use it
+ * safely. The caller must use the returned mlx5 device only for a narrow
+ * window and put it back via mlx5_vf_put_core_dev() once usage is over.
+ *
+ * Return: Pointer to the associated mlx5_core_dev or NULL.
+ */
+struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev)
+			__acquires(&mdev->intf_state_mutex)
+{
+	struct mlx5_core_dev *mdev;
+
+	mdev = pci_iov_get_pf_drvdata(pdev, &mlx5_core_driver);
+	if (IS_ERR(mdev))
+		return NULL;
+
+	mutex_lock(&mdev->intf_state_mutex);
+	if (!test_bit(MLX5_INTERFACE_STATE_UP, &mdev->intf_state)) {
+		mutex_unlock(&mdev->intf_state_mutex);
+		return NULL;
+	}
+
+	return mdev;
+}
+EXPORT_SYMBOL(mlx5_vf_get_core_dev);
+
+/**
+ * mlx5_vf_put_core_dev - Put the mlx5 core device back.
+ * @mdev: The mlx5 core device.
+ *
+ * Upon return the interface state lock is unlocked and caller should not
+ * access the mdev any more.
+ */
+void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev)
+{
+	mutex_unlock(&mdev->intf_state_mutex);
+}
+EXPORT_SYMBOL(mlx5_vf_put_core_dev);
+
 static void mlx5_core_verify_params(void)
 {
 	if (prof_sel >= ARRAY_SIZE(profile)) {
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 441a2f8715f8..197a76ea3f0f 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1138,6 +1138,9 @@ int mlx5_dm_sw_icm_alloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type,
 			   u64 length, u16 uid, phys_addr_t addr, u32 obj_id);
 
+struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev);
+void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev);
+
 #ifdef CONFIG_MLX5_CORE_IPOIB
 struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
 					  struct ib_device *ibdev,
-- 
2.18.1



* [PATCH V1 mlx5-next 06/13] vdpa/mlx5: Use mlx5_vf_get_core_dev() to get PF device
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (4 preceding siblings ...)
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 05/13] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 07/13] vfio: Add 'invalid' state definitions Yishai Hadas
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Use mlx5_vf_get_core_dev() to get the PF device instead of directly
accessing the PF data structure from the VF one.

The mlx5_vf_get_core_dev() API in turn uses the generic PCI API (i.e.
pci_iov_get_pf_drvdata) to get it.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 5c7d2a953dbd..97b8917bc34d 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1445,7 +1445,10 @@ static virtio_net_ctrl_ack handle_ctrl_mac(struct mlx5_vdpa_dev *mvdev, u8 cmd)
 	size_t read;
 	u8 mac[ETH_ALEN];
 
-	pfmdev = pci_get_drvdata(pci_physfn(mvdev->mdev->pdev));
+	pfmdev = mlx5_vf_get_core_dev(mvdev->mdev->pdev);
+	if (!pfmdev)
+		return status;
+
 	switch (cmd) {
 	case VIRTIO_NET_CTRL_MAC_ADDR_SET:
 		read = vringh_iov_pull_iotlb(&cvq->vring, &cvq->riov, (void *)mac, ETH_ALEN);
@@ -1479,6 +1482,7 @@ static virtio_net_ctrl_ack handle_ctrl_mac(struct mlx5_vdpa_dev *mvdev, u8 cmd)
 		break;
 	}
 
+	mlx5_vf_put_core_dev(pfmdev);
 	return status;
 }
 
@@ -2261,8 +2265,11 @@ static void mlx5_vdpa_free(struct vdpa_device *vdev)
 	free_resources(ndev);
 	mlx5_vdpa_destroy_mr(mvdev);
 	if (!is_zero_ether_addr(ndev->config.mac)) {
-		pfmdev = pci_get_drvdata(pci_physfn(mvdev->mdev->pdev));
-		mlx5_mpfs_del_mac(pfmdev, ndev->config.mac);
+		pfmdev = mlx5_vf_get_core_dev(mvdev->mdev->pdev);
+		if (pfmdev) {
+			mlx5_mpfs_del_mac(pfmdev, ndev->config.mac);
+			mlx5_vf_put_core_dev(pfmdev);
+		}
 	}
 	mlx5_vdpa_free_resources(&ndev->mvdev);
 	mutex_destroy(&ndev->reslock);
@@ -2449,8 +2456,11 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name)
 		goto err_mtu;
 
 	if (!is_zero_ether_addr(config->mac)) {
-		pfmdev = pci_get_drvdata(pci_physfn(mdev->pdev));
+		pfmdev = mlx5_vf_get_core_dev(mdev->pdev);
+		if (!pfmdev)
+			goto err_mtu;
 		err = mlx5_mpfs_add_mac(pfmdev, config->mac);
+		mlx5_vf_put_core_dev(pfmdev);
 		if (err)
 			goto err_mtu;
 
@@ -2497,8 +2507,13 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev *v_mdev, const char *name)
 err_res:
 	mlx5_vdpa_free_resources(&ndev->mvdev);
 err_mpfs:
-	if (!is_zero_ether_addr(config->mac))
-		mlx5_mpfs_del_mac(pfmdev, config->mac);
+	if (!is_zero_ether_addr(config->mac)) {
+		pfmdev = mlx5_vf_get_core_dev(mdev->pdev);
+		if (pfmdev) {
+			mlx5_mpfs_del_mac(pfmdev, config->mac);
+			mlx5_vf_put_core_dev(pfmdev);
+		}
+	}
 err_mtu:
 	mutex_destroy(&ndev->reslock);
 	put_device(&mvdev->vdev.dev);
-- 
2.18.1



* [PATCH V1 mlx5-next 07/13] vfio: Add 'invalid' state definitions
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (5 preceding siblings ...)
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 06/13] vdpa/mlx5: Use mlx5_vf_get_core_dev() to get PF device Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-15 16:38   ` Alex Williamson
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 08/13] vfio/pci_core: Make the region->release() function optional Yishai Hadas
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Add an 'invalid' state definition to be used by drivers to set/check an
invalid state.

In addition, drop the non-compiling macro VFIO_DEVICE_STATE_SET_ERROR
(i.e. SATE instead of STATE), which seems unusable.
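
A driver-side check could look like this (a sketch; new_state stands
for whatever state value the driver is about to act on):

	/* reject a transition to an undefined device state */
	if (vfio_is_state_invalid(new_state))
		return -EINVAL;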

Fixes: a8a24f3f6e38 ("vfio: UAPI for migration interface for device state")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/vfio.h      | 5 +++++
 include/uapi/linux/vfio.h | 4 +---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b53a9557884a..6a8cf6637333 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -252,4 +252,9 @@ extern int vfio_virqfd_enable(void *opaque,
 			      void *data, struct virqfd **pvirqfd, int fd);
 extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
+static inline bool vfio_is_state_invalid(u32 state)
+{
+	return state >= VFIO_DEVICE_STATE_INVALID;
+}
+
 #endif /* VFIO_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..7f8fdada5eb3 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -609,6 +609,7 @@ struct vfio_device_migration_info {
 #define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
 #define VFIO_DEVICE_STATE_SAVING    (1 << 1)
 #define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_RESUMING + 1)
 #define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
 				     VFIO_DEVICE_STATE_SAVING |  \
 				     VFIO_DEVICE_STATE_RESUMING)
@@ -621,9 +622,6 @@ struct vfio_device_migration_info {
 	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
 					      VFIO_DEVICE_STATE_RESUMING))
 
-#define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
 
 	__u32 reserved;
 	__u64 pending_bytes;
-- 
2.18.1



* [PATCH V1 mlx5-next 08/13] vfio/pci_core: Make the region->release() function optional
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (6 preceding siblings ...)
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 07/13] vfio: Add 'invalid' state definitions Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 09/13] net/mlx5: Introduce migration bits and structures Yishai Hadas
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Make the region->release() function optional, as in some cases the
driver has nothing to do there.

This is needed for a coming patch in this series, where the added
mlx5_vfio_pci driver supports live migration but doesn't need a
migration release function.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a03b5a99c2da..e581a327f90d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -341,7 +341,8 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	vdev->virq_disabled = false;
 
 	for (i = 0; i < vdev->num_regions; i++)
-		vdev->region[i].ops->release(vdev, &vdev->region[i]);
+		if (vdev->region[i].ops->release)
+			vdev->region[i].ops->release(vdev, &vdev->region[i]);
 
 	vdev->num_regions = 0;
 	kfree(vdev->region);
-- 
2.18.1



* [PATCH V1 mlx5-next 09/13] net/mlx5: Introduce migration bits and structures
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (7 preceding siblings ...)
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 08/13] vfio/pci_core: Make the region->release() function optional Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 10/13] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Introduce the migration-related IFC bits and structures needed to enable
the migration commands.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 145 +++++++++++++++++++++++++++++++++-
 1 file changed, 144 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 399ea52171fe..f7bad4ccc24f 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -126,6 +126,11 @@ enum {
 	MLX5_CMD_OP_QUERY_SF_PARTITION            = 0x111,
 	MLX5_CMD_OP_ALLOC_SF                      = 0x113,
 	MLX5_CMD_OP_DEALLOC_SF                    = 0x114,
+	MLX5_CMD_OP_SUSPEND_VHCA                  = 0x115,
+	MLX5_CMD_OP_RESUME_VHCA                   = 0x116,
+	MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE    = 0x117,
+	MLX5_CMD_OP_SAVE_VHCA_STATE               = 0x118,
+	MLX5_CMD_OP_LOAD_VHCA_STATE               = 0x119,
 	MLX5_CMD_OP_CREATE_MKEY                   = 0x200,
 	MLX5_CMD_OP_QUERY_MKEY                    = 0x201,
 	MLX5_CMD_OP_DESTROY_MKEY                  = 0x202,
@@ -1719,7 +1724,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         reserved_at_682[0x1];
 	u8         log_max_sf[0x5];
 	u8         apu[0x1];
-	u8         reserved_at_689[0x7];
+	u8         reserved_at_689[0x4];
+	u8         migration[0x1];
+	u8         reserved_at_68e[0x2];
 	u8         log_min_sf_size[0x8];
 	u8         max_num_sf_partitions[0x8];
 
@@ -11146,4 +11153,140 @@ enum {
 	MLX5_MTT_PERM_RW	= MLX5_MTT_PERM_READ | MLX5_MTT_PERM_WRITE,
 };
 
+enum {
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER  = 0x0,
+	MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE   = 0x1,
+};
+
+struct mlx5_ifc_suspend_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_suspend_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+enum {
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE   = 0x0,
+	MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER  = 0x1,
+};
+
+struct mlx5_ifc_resume_vhca_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_resume_vhca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_query_vhca_migration_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+
+	u8         required_umem_size[0x20];
+
+	u8         reserved_at_a0[0x160];
+};
+
+struct mlx5_ifc_save_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_save_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_load_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         va[0x40];
+
+	u8         mkey[0x20];
+
+	u8         size[0x20];
+};
+
+struct mlx5_ifc_load_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
 #endif /* MLX5_IFC_H */
-- 
2.18.1



* [PATCH V1 mlx5-next 10/13] vfio/mlx5: Expose migration commands over mlx5 device
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (8 preceding siblings ...)
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 09/13] net/mlx5: Introduce migration bits and structures Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Expose migration commands over the device; this includes: suspend,
resume, get vhca id, and query/save/load state.

As part of this, add the APIs and data structures that are needed to
manage the migration data.
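
For example, saving the device state roughly chains the commands like
this (a sketch; pdev, vhca_id and state - a caller-provided
struct mlx5_vhca_state_data - are assumed to come from the caller, as
done in the later vfio/mlx5 driver patch):

	u32 state_size = 0;
	int ret;

	ret = mlx5vf_cmd_query_vhca_migration_state(pdev, vhca_id,
						    &state_size);
	if (ret)
		return ret;

	return mlx5vf_cmd_save_vhca_state(pdev, vhca_id, state_size,
					  &state);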

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 353 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h |  43 +++++
 2 files changed, 396 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5/cmd.c
 create mode 100644 drivers/vfio/pci/mlx5/cmd.h

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
new file mode 100644
index 000000000000..5b24a7625b8a
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -0,0 +1,353 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "cmd.h"
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA);
+	MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(suspend_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, suspend_vhca, in, out);
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(resume_vhca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(resume_vhca_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(resume_vhca_in, in, opcode, MLX5_CMD_OP_RESUME_VHCA);
+	MLX5_SET(resume_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(resume_vhca_in, in, op_mod, op_mod);
+
+	ret = mlx5_cmd_exec_inout(mdev, resume_vhca, in, out);
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+					  u32 *state_size)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {};
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	MLX5_SET(query_vhca_migration_state_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE);
+	MLX5_SET(query_vhca_migration_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(query_vhca_migration_state_in, in, op_mod, 0);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_vhca_migration_state, in, out);
+	if (ret)
+		goto end;
+
+	*state_size = MLX5_GET(query_vhca_migration_state_out, out,
+			       required_umem_size);
+
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
+	int out_size;
+	void *out;
+	int ret;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	out_size = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	out = kzalloc(out_size, GFP_KERNEL);
+	if (!out) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+	MLX5_SET(query_hca_cap_in, in, other_function, 1);
+	MLX5_SET(query_hca_cap_in, in, function_id, function_id);
+	MLX5_SET(query_hca_cap_in, in, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1 |
+		 HCA_CAP_OPMOD_GET_CUR);
+
+	ret = mlx5_cmd_exec_inout(mdev, query_hca_cap, in, out);
+	if (ret)
+		goto err_exec;
+
+	*vhca_id = MLX5_GET(query_hca_cap_out, out,
+			    capability.cmd_hca_cap.vhca_id);
+
+err_exec:
+	kfree(out);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return ret;
+}
+
+static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
+			      struct mlx5_vhca_state_data *state, u32 *mkey)
+{
+	struct sg_dma_page_iter dma_iter;
+	int err = 0, inlen;
+	__be64 *mtt;
+	void *mkc;
+	u32 *in;
+
+	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+			sizeof(*mtt) * round_up(state->num_pages, 2);
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
+		 DIV_ROUND_UP(state->num_pages, 2));
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
+
+	for_each_sgtable_dma_page(&state->mig_data.table.sgt, &dma_iter, 0)
+		*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, pd, pdn);
+	MLX5_SET(mkc, mkc, bsf_octword_size, 0);
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, translations_octword_size,
+		 DIV_ROUND_UP(state->num_pages, 2));
+	MLX5_SET64(mkc, mkc, len, state->num_pages * PAGE_SIZE);
+	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
+
+	kvfree(in);
+
+	return err;
+}
+
+struct page *mlx5vf_get_migration_page(struct migration_data *data,
+				       unsigned long offset)
+{
+	unsigned long cur_offset = 0;
+	struct scatterlist *sg;
+	unsigned int i;
+
+	if (offset < data->last_offset || !data->last_offset_sg) {
+		data->last_offset = 0;
+		data->last_offset_sg = data->table.sgt.sgl;
+		data->sg_last_entry = 0;
+	}
+
+	cur_offset = data->last_offset;
+
+	for_each_sg(data->last_offset_sg, sg,
+			data->table.sgt.orig_nents - data->sg_last_entry, i) {
+		if (offset < sg->length + cur_offset) {
+			data->last_offset_sg = sg;
+			data->sg_last_entry += i;
+			data->last_offset = cur_offset;
+			return nth_page(sg_page(sg),
+					(offset - cur_offset) / PAGE_SIZE);
+		}
+		cur_offset += sg->length;
+	}
+	return NULL;
+}
+
+void mlx5vf_reset_vhca_state(struct mlx5_vhca_state_data *state)
+{
+	struct migration_data *data = &state->mig_data;
+	struct sg_page_iter sg_iter;
+
+	if (!data->table.prv)
+		goto end;
+
+	/* Undo alloc_pages_bulk_array() */
+	for_each_sgtable_page(&data->table.sgt, &sg_iter, 0)
+		__free_page(sg_page_iter_page(&sg_iter));
+	sg_free_append_table(&data->table);
+end:
+	memset(state, 0, sizeof(*state));
+}
+
+int mlx5vf_add_migration_pages(struct mlx5_vhca_state_data *state,
+			       unsigned int npages)
+{
+	unsigned int to_alloc = npages;
+	struct page **page_list;
+	unsigned long filled;
+	unsigned int to_fill;
+	int ret = 0;
+
+	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
+	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL);
+	if (!page_list)
+		return -ENOMEM;
+
+	do {
+		filled = alloc_pages_bulk_array(GFP_KERNEL, to_fill,
+						page_list);
+		if (!filled) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		to_alloc -= filled;
+		ret = sg_alloc_append_table_from_pages(
+			&state->mig_data.table, page_list, filled, 0,
+			filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
+			GFP_KERNEL);
+
+		if (ret)
+			goto err;
+		/* clean input for another bulk allocation */
+		memset(page_list, 0, filled * sizeof(*page_list));
+		to_fill = min_t(unsigned int, to_alloc,
+				PAGE_SIZE / sizeof(*page_list));
+	} while (to_alloc > 0);
+
+	kvfree(page_list);
+	state->num_pages += npages;
+
+	return 0;
+
+err:
+	kvfree(page_list);
+	return ret;
+}
+
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       u64 state_size,
+			       struct mlx5_vhca_state_data *state)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {};
+	u32 pdn, mkey;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = mlx5vf_add_migration_pages(state,
+				DIV_ROUND_UP_ULL(state_size, PAGE_SIZE));
+	if (err < 0)
+		goto err_alloc_pages;
+
+	err = dma_map_sgtable(mdev->device, &state->mig_data.table.sgt,
+			      DMA_FROM_DEVICE, 0);
+	if (err)
+		goto err_reg_dma;
+
+	err = _create_state_mkey(mdev, pdn, state, &mkey);
+	if (err)
+		goto err_create_mkey;
+
+	MLX5_SET(save_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_SAVE_VHCA_STATE);
+	MLX5_SET(save_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(save_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(save_vhca_state_in, in, mkey, mkey);
+	MLX5_SET(save_vhca_state_in, in, size, state_size);
+
+	err = mlx5_cmd_exec_inout(mdev, save_vhca_state, in, out);
+	if (err)
+		goto err_exec;
+
+	state->state_size = state_size;
+
+	mlx5_core_destroy_mkey(mdev, mkey);
+	mlx5_core_dealloc_pd(mdev, pdn);
+	dma_unmap_sgtable(mdev->device, &state->mig_data.table.sgt,
+			  DMA_FROM_DEVICE, 0);
+	mlx5_vf_put_core_dev(mdev);
+
+	return 0;
+
+err_exec:
+	mlx5_core_destroy_mkey(mdev, mkey);
+err_create_mkey:
+	dma_unmap_sgtable(mdev->device, &state->mig_data.table.sgt,
+			  DMA_FROM_DEVICE, 0);
+err_reg_dma:
+	mlx5vf_reset_vhca_state(state);
+err_alloc_pages:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return err;
+}
+
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vhca_state_data *state)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	u32 out[MLX5_ST_SZ_DW(load_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(load_vhca_state_in)] = {};
+	u32 pdn, mkey;
+	int err;
+
+	if (!mdev)
+		return -ENOTCONN;
+
+	err = mlx5_core_alloc_pd(mdev, &pdn);
+	if (err)
+		goto end;
+
+	err = dma_map_sgtable(mdev->device, &state->mig_data.table.sgt,
+			      DMA_TO_DEVICE, 0);
+	if (err)
+		goto err_reg;
+
+	err = _create_state_mkey(mdev, pdn, state, &mkey);
+	if (err)
+		goto err_mkey;
+
+	MLX5_SET(load_vhca_state_in, in, opcode,
+		 MLX5_CMD_OP_LOAD_VHCA_STATE);
+	MLX5_SET(load_vhca_state_in, in, op_mod, 0);
+	MLX5_SET(load_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(load_vhca_state_in, in, mkey, mkey);
+	MLX5_SET(load_vhca_state_in, in, size, state->state_size);
+
+	err = mlx5_cmd_exec_inout(mdev, load_vhca_state, in, out);
+
+	mlx5_core_destroy_mkey(mdev, mkey);
+err_mkey:
+	dma_unmap_sgtable(mdev->device, &state->mig_data.table.sgt,
+			  DMA_TO_DEVICE, 0);
+err_reg:
+	mlx5_core_dealloc_pd(mdev, pdn);
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
new file mode 100644
index 000000000000..66221df24b19
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#ifndef MLX5_VFIO_CMD_H
+#define MLX5_VFIO_CMD_H
+
+#include <linux/kernel.h>
+#include <linux/mlx5/driver.h>
+
+struct migration_data {
+	struct sg_append_table table;
+
+	struct scatterlist *last_offset_sg;
+	unsigned int sg_last_entry;
+	unsigned long last_offset;
+};
+
+/* state data of vhca to be used as part of migration flow */
+struct mlx5_vhca_state_data {
+	u64 state_size;
+	u64 num_pages;
+	u32 win_start_offset;
+	struct migration_data mig_data;
+};
+
+int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
+int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+					  u32 *state_size);
+int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id);
+int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       u64 state_size,
+			       struct mlx5_vhca_state_data *state);
+void mlx5vf_reset_vhca_state(struct mlx5_vhca_state_data *state);
+int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+			       struct mlx5_vhca_state_data *state);
+int mlx5vf_add_migration_pages(struct mlx5_vhca_state_data *state,
+			       unsigned int npages);
+struct page *mlx5vf_get_migration_page(struct migration_data *data,
+				       unsigned long offset);
+#endif /* MLX5_VFIO_CMD_H */
-- 
2.18.1



* [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (9 preceding siblings ...)
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 10/13] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-15 19:48   ` Alex Williamson
  2021-10-19  9:59   ` Shameerali Kolothum Thodi
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET Yishai Hadas
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly Yishai Hadas
  12 siblings, 2 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

This patch adds a vfio_pci driver for mlx5 devices.

It uses vfio_pci_core to register with the VFIO subsystem and then
implements the mlx5-specific logic in the migration area.

The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.

This patch implements the suspend/resume flows.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 MAINTAINERS                    |   6 +
 drivers/vfio/pci/Kconfig       |   3 +
 drivers/vfio/pci/Makefile      |   2 +
 drivers/vfio/pci/mlx5/Kconfig  |  11 +
 drivers/vfio/pci/mlx5/Makefile |   4 +
 drivers/vfio/pci/mlx5/main.c   | 692 +++++++++++++++++++++++++++++++++
 6 files changed, 718 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5/Kconfig
 create mode 100644 drivers/vfio/pci/mlx5/Makefile
 create mode 100644 drivers/vfio/pci/mlx5/main.c

diff --git a/MAINTAINERS b/MAINTAINERS
index abdcbcfef73d..e824bfab4a01 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19699,6 +19699,12 @@ L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/platform/
 
+VFIO MLX5 PCI DRIVER
+M:	Yishai Hadas <yishaih@nvidia.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	drivers/vfio/pci/mlx5/
+
 VGA_SWITCHEROO
 R:	Lukas Wunner <lukas@wunner.de>
 S:	Maintained
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 860424ccda1b..187b9c259944 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -43,4 +43,7 @@ config VFIO_PCI_IGD
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
 endif
+
+source "drivers/vfio/pci/mlx5/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 349d68d242b4..ed9d6f2e0555 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
new file mode 100644
index 000000000000..a3ce00add4fe
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config MLX5_VFIO_PCI
+	tristate "VFIO support for MLX5 PCI devices"
+	depends on MLX5_CORE
+	select VFIO_PCI_CORE
+	help
+	  This provides PCI support for MLX5 devices using the VFIO
+	  framework. The device-specific driver supports suspend/resume
+	  of the MLX5 device.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
new file mode 100644
index 000000000000..689627da7ff5
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
+mlx5-vfio-pci-y := main.o cmd.o
+
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
new file mode 100644
index 000000000000..e36302b444a6
--- /dev/null
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -0,0 +1,692 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/vfio_pci_core.h>
+
+#include "cmd.h"
+
+enum {
+	MLX5VF_PCI_FREEZED = 1 << 0,
+};
+
+enum {
+	MLX5VF_REGION_PENDING_BYTES = 1 << 0,
+	MLX5VF_REGION_DATA_SIZE = 1 << 1,
+};
+
+#define MLX5VF_MIG_REGION_DATA_SIZE SZ_128K
+/* Data section offset from migration region */
+#define MLX5VF_MIG_REGION_DATA_OFFSET                                          \
+	(sizeof(struct vfio_device_migration_info))
+
+#define VFIO_DEVICE_MIGRATION_OFFSET(x)                                        \
+	(offsetof(struct vfio_device_migration_info, x))
+
+struct mlx5vf_pci_migration_info {
+	u32 vfio_dev_state; /* VFIO_DEVICE_STATE_XXX */
+	u32 dev_state; /* device migration state */
+	u32 region_state; /* Use MLX5VF_REGION_XXX */
+	u16 vhca_id;
+	struct mlx5_vhca_state_data vhca_state_data;
+};
+
+struct mlx5vf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	u8 migrate_cap:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	struct mlx5vf_pci_migration_info vmig;
+};
+
+static int mlx5vf_pci_unquiesce_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	return mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
+				      mvdev->vmig.vhca_id,
+				      MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
+}
+
+static int mlx5vf_pci_quiesce_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	return mlx5vf_cmd_suspend_vhca(
+		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
+		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
+}
+
+static int mlx5vf_pci_unfreeze_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	int ret;
+
+	ret = mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
+				     mvdev->vmig.vhca_id,
+				     MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
+	if (ret)
+		return ret;
+
+	mvdev->vmig.dev_state &= ~MLX5VF_PCI_FREEZED;
+	return 0;
+}
+
+static int mlx5vf_pci_freeze_device(struct mlx5vf_pci_core_device *mvdev)
+{
+	int ret;
+
+	ret = mlx5vf_cmd_suspend_vhca(
+		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
+		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
+	if (ret)
+		return ret;
+
+	mvdev->vmig.dev_state |= MLX5VF_PCI_FREEZED;
+	return 0;
+}
+
+static int mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
+{
+	u32 state_size = 0;
+	int ret;
+
+	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED))
+		return -EFAULT;
+
+	/* If we already read state no reason to re-read */
+	if (mvdev->vmig.vhca_state_data.state_size)
+		return 0;
+
+	ret = mlx5vf_cmd_query_vhca_migration_state(
+		mvdev->core_device.pdev, mvdev->vmig.vhca_id, &state_size);
+	if (ret)
+		return ret;
+
+	return mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
+					  mvdev->vmig.vhca_id, state_size,
+					  &mvdev->vmig.vhca_state_data);
+}
+
+static int mlx5vf_pci_new_write_window(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
+	u32 num_pages_needed;
+	u64 allocated_ready;
+	u32 bytes_needed;
+
+	/* Check how many bytes are available from previous flows */
+	WARN_ON(state_data->num_pages * PAGE_SIZE <
+		state_data->win_start_offset);
+	allocated_ready = (state_data->num_pages * PAGE_SIZE) -
+			  state_data->win_start_offset;
+	WARN_ON(allocated_ready > MLX5VF_MIG_REGION_DATA_SIZE);
+
+	bytes_needed = MLX5VF_MIG_REGION_DATA_SIZE - allocated_ready;
+	if (!bytes_needed)
+		return 0;
+
+	num_pages_needed = DIV_ROUND_UP_ULL(bytes_needed, PAGE_SIZE);
+	return mlx5vf_add_migration_pages(state_data, num_pages_needed);
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_data_size(struct mlx5vf_pci_core_device *mvdev,
+				      char __user *buf, bool iswrite)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+	u64 data_size;
+	int ret;
+
+	if (iswrite) {
+		/* data_size is writable only during resuming state */
+		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_RESUMING)
+			return -EINVAL;
+
+		ret = copy_from_user(&data_size, buf, sizeof(data_size));
+		if (ret)
+			return -EFAULT;
+
+		vmig->vhca_state_data.state_size += data_size;
+		vmig->vhca_state_data.win_start_offset += data_size;
+		ret = mlx5vf_pci_new_write_window(mvdev);
+		if (ret)
+			return ret;
+
+	} else {
+		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_SAVING)
+			return -EINVAL;
+
+		data_size = min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
+				  vmig->vhca_state_data.state_size -
+				  vmig->vhca_state_data.win_start_offset);
+		ret = copy_to_user(buf, &data_size, sizeof(data_size));
+		if (ret)
+			return -EFAULT;
+	}
+
+	vmig->region_state |= MLX5VF_REGION_DATA_SIZE;
+	return sizeof(data_size);
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_data_offset(struct mlx5vf_pci_core_device *mvdev,
+					char __user *buf, bool iswrite)
+{
+	static const u64 data_offset = MLX5VF_MIG_REGION_DATA_OFFSET;
+	int ret;
+
+	/* RO field */
+	if (iswrite)
+		return -EFAULT;
+
+	ret = copy_to_user(buf, &data_offset, sizeof(data_offset));
+	if (ret)
+		return -EFAULT;
+
+	return sizeof(data_offset);
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_pending_bytes(struct mlx5vf_pci_core_device *mvdev,
+					  char __user *buf, bool iswrite)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+	u64 pending_bytes;
+	int ret;
+
+	/* RO field */
+	if (iswrite)
+		return -EFAULT;
+
+	if (vmig->vfio_dev_state == (VFIO_DEVICE_STATE_SAVING |
+				     VFIO_DEVICE_STATE_RUNNING)) {
+		/* In pre-copy state we have no data to return for now,
+		 * return 0 pending bytes
+		 */
+		pending_bytes = 0;
+	} else {
+		if (!vmig->vhca_state_data.state_size)
+			return 0;
+		pending_bytes = vmig->vhca_state_data.state_size -
+				vmig->vhca_state_data.win_start_offset;
+	}
+
+	ret = copy_to_user(buf, &pending_bytes, sizeof(pending_bytes));
+	if (ret)
+		return -EFAULT;
+
+	/* Window moves forward once data from previous iteration was read */
+	if (vmig->region_state & MLX5VF_REGION_DATA_SIZE)
+		vmig->vhca_state_data.win_start_offset +=
+			min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE, pending_bytes);
+
+	WARN_ON(vmig->vhca_state_data.win_start_offset >
+		vmig->vhca_state_data.state_size);
+
+	/* New iteration started */
+	vmig->region_state = MLX5VF_REGION_PENDING_BYTES;
+	return sizeof(pending_bytes);
+}
+
+static int mlx5vf_load_state(struct mlx5vf_pci_core_device *mvdev)
+{
+	if (!mvdev->vmig.vhca_state_data.state_size)
+		return 0;
+
+	return mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
+					  mvdev->vmig.vhca_id,
+					  &mvdev->vmig.vhca_state_data);
+}
+
+static void mlx5vf_reset_mig_state(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+
+	vmig->region_state = 0;
+	mlx5vf_reset_vhca_state(&vmig->vhca_state_data);
+}
+
+static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
+				       u32 state)
+{
+	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
+	u32 old_state = vmig->vfio_dev_state;
+	int ret = 0;
+
+	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
+		return -EINVAL;
+
+	/* Running switches off */
+	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
+	    (state & VFIO_DEVICE_STATE_RUNNING) &&
+	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
+		ret = mlx5vf_pci_quiesce_device(mvdev);
+		if (ret)
+			return ret;
+		ret = mlx5vf_pci_freeze_device(mvdev);
+		if (ret) {
+			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
+			return ret;
+		}
+	}
+
+	/* Resuming switches off */
+	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
+	    (state & VFIO_DEVICE_STATE_RESUMING) &&
+	    (old_state & VFIO_DEVICE_STATE_RESUMING)) {
+		/* deserialize state into the device */
+		ret = mlx5vf_load_state(mvdev);
+		if (ret) {
+			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
+			return ret;
+		}
+	}
+
+	/* Resuming switches on */
+	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
+	    (state & VFIO_DEVICE_STATE_RESUMING) &&
+	    (state & VFIO_DEVICE_STATE_RESUMING)) {
+		mlx5vf_reset_mig_state(mvdev);
+		ret = mlx5vf_pci_new_write_window(mvdev);
+		if (ret)
+			return ret;
+	}
+
+	/* Saving switches on */
+	if ((old_state & VFIO_DEVICE_STATE_SAVING) !=
+	    (state & VFIO_DEVICE_STATE_SAVING) &&
+	    (state & VFIO_DEVICE_STATE_SAVING)) {
+		if (!(state & VFIO_DEVICE_STATE_RUNNING)) {
+			/* serialize post copy */
+			ret = mlx5vf_pci_save_device_data(mvdev);
+			if (ret)
+				return ret;
+		}
+	}
+
+	/* Running switches on */
+	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
+	    (state & VFIO_DEVICE_STATE_RUNNING) &&
+	    (state & VFIO_DEVICE_STATE_RUNNING)) {
+		ret = mlx5vf_pci_unfreeze_device(mvdev);
+		if (ret)
+			return ret;
+		ret = mlx5vf_pci_unquiesce_device(mvdev);
+		if (ret) {
+			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
+			return ret;
+		}
+	}
+
+	vmig->vfio_dev_state = state;
+	return 0;
+}
+
+static ssize_t
+mlx5vf_pci_handle_migration_device_state(struct mlx5vf_pci_core_device *mvdev,
+					 char __user *buf, bool iswrite)
+{
+	size_t count = sizeof(mvdev->vmig.vfio_dev_state);
+	int ret;
+
+	if (iswrite) {
+		u32 device_state;
+
+		ret = copy_from_user(&device_state, buf, count);
+		if (ret)
+			return -EFAULT;
+
+		ret = mlx5vf_pci_set_device_state(mvdev, device_state);
+		if (ret)
+			return ret;
+	} else {
+		ret = copy_to_user(buf, &mvdev->vmig.vfio_dev_state, count);
+		if (ret)
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+static ssize_t
+mlx5vf_pci_copy_user_data_to_device_state(struct mlx5vf_pci_core_device *mvdev,
+					  char __user *buf, size_t count,
+					  u64 offset)
+{
+	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
+	char __user *from_buff = buf;
+	u32 curr_offset;
+	u32 win_page_offset;
+	u32 copy_count;
+	struct page *page;
+	char *to_buff;
+	int ret;
+
+	curr_offset = state_data->win_start_offset + offset;
+
+	do {
+		page = mlx5vf_get_migration_page(&state_data->mig_data,
+						 curr_offset);
+		if (!page)
+			return -EINVAL;
+
+		win_page_offset = curr_offset % PAGE_SIZE;
+		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
+
+		to_buff = kmap_local_page(page);
+		ret = copy_from_user(to_buff + win_page_offset, from_buff,
+				     copy_count);
+		kunmap_local(to_buff);
+		if (ret)
+			return -EFAULT;
+
+		from_buff += copy_count;
+		curr_offset += copy_count;
+		count -= copy_count;
+	} while (count > 0);
+
+	return 0;
+}
+
+static ssize_t
+mlx5vf_pci_copy_device_state_to_user(struct mlx5vf_pci_core_device *mvdev,
+				     char __user *buf, u64 offset, size_t count)
+{
+	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
+	char __user *to_buff = buf;
+	u32 win_available_bytes;
+	u32 win_page_offset;
+	u32 copy_count;
+	u32 curr_offset;
+	char *from_buff;
+	struct page *page;
+	int ret;
+
+	win_available_bytes =
+		min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
+		      mvdev->vmig.vhca_state_data.state_size -
+			      mvdev->vmig.vhca_state_data.win_start_offset);
+
+	if (count + offset > win_available_bytes)
+		return -EINVAL;
+
+	curr_offset = state_data->win_start_offset + offset;
+
+	do {
+		page = mlx5vf_get_migration_page(&state_data->mig_data,
+						 curr_offset);
+		if (!page)
+			return -EINVAL;
+
+		win_page_offset = curr_offset % PAGE_SIZE;
+		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
+
+		from_buff = kmap_local_page(page);
+		ret = copy_to_user(to_buff, from_buff + win_page_offset,
+				   copy_count);
+		kunmap_local(from_buff);
+		if (ret)
+			return -EFAULT;
+
+		curr_offset += copy_count;
+		count -= copy_count;
+		to_buff += copy_count;
+	} while (count);
+
+	return 0;
+}
+
+static ssize_t
+mlx5vf_pci_migration_data_rw(struct mlx5vf_pci_core_device *mvdev,
+			     char __user *buf, size_t count, u64 offset,
+			     bool iswrite)
+{
+	int ret;
+
+	if (offset + count > MLX5VF_MIG_REGION_DATA_SIZE)
+		return -EINVAL;
+
+	if (iswrite)
+		ret = mlx5vf_pci_copy_user_data_to_device_state(mvdev, buf,
+								count, offset);
+	else
+		ret = mlx5vf_pci_copy_device_state_to_user(mvdev, buf, offset,
+							   count);
+	if (ret)
+		return ret;
+	return count;
+}
+
+static ssize_t mlx5vf_pci_mig_rw(struct vfio_pci_core_device *vdev,
+				 char __user *buf, size_t count, loff_t *ppos,
+				 bool iswrite)
+{
+	struct mlx5vf_pci_core_device *mvdev =
+		container_of(vdev, struct mlx5vf_pci_core_device, core_device);
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int ret;
+
+	mutex_lock(&mvdev->state_mutex);
+	/* Copy to/from the migration region data section */
+	if (pos >= MLX5VF_MIG_REGION_DATA_OFFSET) {
+		ret = mlx5vf_pci_migration_data_rw(
+			mvdev, buf, count, pos - MLX5VF_MIG_REGION_DATA_OFFSET,
+			iswrite);
+		goto end;
+	}
+
+	switch (pos) {
+	case VFIO_DEVICE_MIGRATION_OFFSET(device_state):
+		/* This is a RW field. */
+		if (count != sizeof(mvdev->vmig.vfio_dev_state)) {
+			ret = -EINVAL;
+			break;
+		}
+		ret = mlx5vf_pci_handle_migration_device_state(mvdev, buf,
+							       iswrite);
+		break;
+	case VFIO_DEVICE_MIGRATION_OFFSET(pending_bytes):
+		/*
+		 * The number of pending bytes still to be migrated from the
+		 * vendor driver. This is a RO field.
+		 *
+		 * Reading this field indicates the start of a new iteration
+		 * to get device data.
+		 */
+		ret = mlx5vf_pci_handle_migration_pending_bytes(mvdev, buf,
+								iswrite);
+		break;
+	case VFIO_DEVICE_MIGRATION_OFFSET(data_offset):
+		/*
+		 * The user application should read data_offset field from the
+		 * migration region. The user application should read the
+		 * device data from this offset within the migration region
+		 * during the _SAVING mode or write the device data during the
+		 * _RESUMING mode. This is a RO field.
+		 */
+		ret = mlx5vf_pci_handle_migration_data_offset(mvdev, buf,
+							      iswrite);
+		break;
+	case VFIO_DEVICE_MIGRATION_OFFSET(data_size):
+		/*
+		 * The user application should read data_size to get the size
+		 * in bytes of the data copied to the migration region during
+		 * the _SAVING state by the device. The user application should
+		 * write the size in bytes of the data that was copied to
+		 * the migration region during the _RESUMING state by the user.
+		 * This is a RW field.
+		 */
+		ret = mlx5vf_pci_handle_migration_data_size(mvdev, buf,
+							    iswrite);
+		break;
+	default:
+		ret = -EFAULT;
+		break;
+	}
+
+end:
+	mutex_unlock(&mvdev->state_mutex);
+	return ret;
+}
+
+static struct vfio_pci_regops migration_ops = {
+	.rw = mlx5vf_pci_mig_rw,
+};
+
+static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &mvdev->core_device;
+	int vf_id;
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	if (!mvdev->migrate_cap) {
+		vfio_pci_core_finish_enable(vdev);
+		return 0;
+	}
+
+	vf_id = pci_iov_vf_id(vdev->pdev);
+	if (vf_id < 0) {
+		ret = vf_id;
+		goto out_disable;
+	}
+
+	ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1,
+				     &mvdev->vmig.vhca_id);
+	if (ret)
+		goto out_disable;
+
+	ret = vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
+					   VFIO_REGION_SUBTYPE_MIGRATION,
+					   &migration_ops,
+					   MLX5VF_MIG_REGION_DATA_OFFSET +
+					   MLX5VF_MIG_REGION_DATA_SIZE,
+					   VFIO_REGION_INFO_FLAG_READ |
+					   VFIO_REGION_INFO_FLAG_WRITE,
+					   NULL);
+	if (ret)
+		goto out_disable;
+
+	mutex_init(&mvdev->state_mutex);
+	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;
+	vfio_pci_core_finish_enable(vdev);
+	return 0;
+out_disable:
+	vfio_pci_core_disable(vdev);
+	return ret;
+}
+
+static void mlx5vf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+
+	vfio_pci_core_close_device(core_vdev);
+	mlx5vf_reset_mig_state(mvdev);
+}
+
+static const struct vfio_device_ops mlx5vf_pci_ops = {
+	.name = "mlx5-vfio-pci",
+	.open_device = mlx5vf_pci_open_device,
+	.close_device = mlx5vf_pci_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+};
+
+static int mlx5vf_pci_probe(struct pci_dev *pdev,
+			    const struct pci_device_id *id)
+{
+	struct mlx5vf_pci_core_device *mvdev;
+	int ret;
+
+	mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL);
+	if (!mvdev)
+		return -ENOMEM;
+	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
+
+	if (pdev->is_virtfn) {
+		struct mlx5_core_dev *mdev =
+			mlx5_vf_get_core_dev(pdev);
+
+		if (mdev) {
+			if (MLX5_CAP_GEN(mdev, migration))
+				mvdev->migrate_cap = 1;
+			mlx5_vf_put_core_dev(mdev);
+		}
+	}
+
+	ret = vfio_pci_core_register_device(&mvdev->core_device);
+	if (ret)
+		goto out_free;
+
+	dev_set_drvdata(&pdev->dev, mvdev);
+	return 0;
+
+out_free:
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+	return ret;
+}
+
+static void mlx5vf_pci_remove(struct pci_dev *pdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(&mvdev->core_device);
+	vfio_pci_core_uninit_device(&mvdev->core_device);
+	kfree(mvdev);
+}
+
+static const struct pci_device_id mlx5vf_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_MELLANOX, 0x101e) }, /* ConnectX Family mlx5Gen Virtual Function */
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
+
+static struct pci_driver mlx5vf_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = mlx5vf_pci_table,
+	.probe = mlx5vf_pci_probe,
+	.remove = mlx5vf_pci_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+};
+
+static void __exit mlx5vf_pci_cleanup(void)
+{
+	pci_unregister_driver(&mlx5vf_pci_driver);
+}
+
+static int __init mlx5vf_pci_init(void)
+{
+	return pci_register_driver(&mlx5vf_pci_driver);
+}
+
+module_init(mlx5vf_pci_init);
+module_exit(mlx5vf_pci_cleanup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
+MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
+MODULE_DESCRIPTION(
+	"MLX5 VFIO PCI - User Level meta-driver for MLX5 device family");
-- 
2.18.1

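For readers following the region protocol implemented above, a rough
userspace-side sketch of the SAVING read loop the driver expects may
help. It is not part of this series; device_fd, region_off and
save_iteration are hypothetical names, region_off is the migration
region offset learned via VFIO_DEVICE_GET_REGION_INFO, and error
handling is minimal.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * One SAVING iteration: reading pending_bytes starts the iteration,
 * then data_offset and data_size are read, then the data section is
 * copied out. Returns 1 when data was consumed, 0 once pending_bytes
 * reaches zero, -1 on error. The caller loops until 0.
 */
static int save_iteration(int device_fd, off_t region_off, FILE *out)
{
	uint64_t pending, data_off, data_size, done = 0;
	char chunk[4096];

	if (pread(device_fd, &pending, sizeof(pending), region_off +
		  offsetof(struct vfio_device_migration_info, pending_bytes)) !=
	    sizeof(pending))
		return -1;
	if (!pending)
		return 0;
	if (pread(device_fd, &data_off, sizeof(data_off), region_off +
		  offsetof(struct vfio_device_migration_info, data_offset)) !=
	    sizeof(data_off))
		return -1;
	if (pread(device_fd, &data_size, sizeof(data_size), region_off +
		  offsetof(struct vfio_device_migration_info, data_size)) !=
	    sizeof(data_size))
		return -1;

	while (done < data_size) {
		size_t len = data_size - done < sizeof(chunk) ?
			     data_size - done : sizeof(chunk);
		ssize_t n = pread(device_fd, chunk, len,
				  region_off + data_off + done);

		if (n <= 0 || fwrite(chunk, 1, n, out) != (size_t)n)
			return -1;
		done += n;
	}
	return 1;
}

As the comments in mlx5vf_pci_handle_migration_pending_bytes() note,
the driver advances its window when pending_bytes is read after
data_size, so repeating this loop walks the entire saved state.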


* [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (10 preceding siblings ...)
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-15 19:52   ` Alex Williamson
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly Yishai Hadas
  12 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Add infrastructure to let vfio_pci_core drivers trap device RESET.

The motivation for this is to let the underlying driver be aware that a
reset was done so it can set its internal state accordingly.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c |  8 ++++++--
 drivers/vfio/pci/vfio_pci_core.c   |  2 ++
 include/linux/vfio_pci_core.h      | 10 ++++++++++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 6e58b4bf7a60..002198376f43 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -859,7 +859,9 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-			pci_try_reset_function(vdev->pdev);
+			ret = pci_try_reset_function(vdev->pdev);
+			if (!ret && vdev->ops && vdev->ops->reset_done)
+				vdev->ops->reset_done(vdev);
 			up_write(&vdev->memory_lock);
 		}
 	}
@@ -941,7 +943,9 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-			pci_try_reset_function(vdev->pdev);
+			ret = pci_try_reset_function(vdev->pdev);
+			if (!ret && vdev->ops && vdev->ops->reset_done)
+				vdev->ops->reset_done(vdev);
 			up_write(&vdev->memory_lock);
 		}
 	}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index e581a327f90d..d2497a8ed7f1 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -923,6 +923,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 
 		vfio_pci_zap_and_down_write_memory_lock(vdev);
 		ret = pci_try_reset_function(vdev->pdev);
+		if (!ret && vdev->ops && vdev->ops->reset_done)
+			vdev->ops->reset_done(vdev);
 		up_write(&vdev->memory_lock);
 
 		return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index ef9a44b6cf5d..6ccf5824f098 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -95,6 +95,15 @@ struct vfio_pci_mmap_vma {
 	struct list_head	vma_next;
 };
 
+/**
+ * struct vfio_pci_core_device_ops - VFIO PCI driver device callbacks
+ *
+ * @reset_done: Called when the device was reset
+ */
+struct vfio_pci_core_device_ops {
+	void	(*reset_done)(struct vfio_pci_core_device *vdev);
+};
+
 struct vfio_pci_core_device {
 	struct vfio_device	vdev;
 	struct pci_dev		*pdev;
@@ -137,6 +146,7 @@ struct vfio_pci_core_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	const struct vfio_pci_core_device_ops *ops;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
-- 
2.18.1



* [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly
  2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
                   ` (11 preceding siblings ...)
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET Yishai Hadas
@ 2021-10-13  9:47 ` Yishai Hadas
  2021-10-13 18:06   ` Jason Gunthorpe
  12 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-13  9:47 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy,
	yishaih, maorg

Trap device RESET and update the state accordingly. This is done by
registering the matching callbacks.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/main.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index e36302b444a6..8fe44ed13552 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -613,6 +613,19 @@ static const struct vfio_device_ops mlx5vf_pci_ops = {
 	.match = vfio_pci_core_match,
 };
 
+static void mlx5vf_reset_done(struct vfio_pci_core_device *core_vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+			core_vdev, struct mlx5vf_pci_core_device,
+			core_device);
+
+	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;
+}
+
+static const struct vfio_pci_core_device_ops mlx5vf_pci_core_ops = {
+	.reset_done = mlx5vf_reset_done,
+};
+
 static int mlx5vf_pci_probe(struct pci_dev *pdev,
 			    const struct pci_device_id *id)
 {
@@ -629,8 +642,10 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 			mlx5_vf_get_core_dev(pdev);
 
 		if (mdev) {
-			if (MLX5_CAP_GEN(mdev, migration))
+			if (MLX5_CAP_GEN(mdev, migration)) {
 				mvdev->migrate_cap = 1;
+				mvdev->core_device.ops = &mlx5vf_pci_core_ops;
+			}
 			mlx5_vf_put_core_dev(mdev);
 		}
 	}
-- 
2.18.1



* Re: [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly Yishai Hadas
@ 2021-10-13 18:06   ` Jason Gunthorpe
  2021-10-14  9:18     ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2021-10-13 18:06 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: alex.williamson, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Wed, Oct 13, 2021 at 12:47:07PM +0300, Yishai Hadas wrote:
> Trap device RESET and update state accordingly, it's done by registering
> the matching callbacks.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>  drivers/vfio/pci/mlx5/main.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> index e36302b444a6..8fe44ed13552 100644
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -613,6 +613,19 @@ static const struct vfio_device_ops mlx5vf_pci_ops = {
>  	.match = vfio_pci_core_match,
>  };
>  
> +static void mlx5vf_reset_done(struct vfio_pci_core_device *core_vdev)
> +{
> +	struct mlx5vf_pci_core_device *mvdev = container_of(
> +			core_vdev, struct mlx5vf_pci_core_device,
> +			core_device);
> +
> +	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;

This should hold the state mutex too

Jason

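A minimal sketch of the callback with the locking Jason asks for,
reusing the state_mutex this series already defines:

static void mlx5vf_reset_done(struct vfio_pci_core_device *core_vdev)
{
	struct mlx5vf_pci_core_device *mvdev = container_of(
			core_vdev, struct mlx5vf_pci_core_device,
			core_device);

	/* Serialize against concurrent migration region access */
	mutex_lock(&mvdev->state_mutex);
	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;
	mutex_unlock(&mvdev->state_mutex);
}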

* Re: [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index Yishai Hadas
@ 2021-10-13 18:14   ` Bjorn Helgaas
  2021-10-14  9:08     ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2021-10-13 18:14 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: alex.williamson, bhelgaas, jgg, saeedm, linux-pci, kvm, netdev,
	kuba, leonro, kwankhede, mgurtovoy, maorg

On Wed, Oct 13, 2021 at 12:46:55PM +0300, Yishai Hadas wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The PCI core uses the VF index internally, often called the vf_id,
> during the setup of the VF, eg pci_iov_add_virtfn().
> 
> This index is needed for device drivers that implement live migration
> for their internal operations that configure/control their VFs.
> 
> Specifically, mlx5_vfio_pci driver that is introduced in coming patches
> from this series needs it and not the bus/device/function which is
> exposed today.
> 
> Add pci_iov_vf_id() which computes the vf_id by reversing the math that
> was used to create the bus/device/function.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

I already acked this:

  https://lore.kernel.org/r/20210922215930.GA231505@bhelgaas

Saves me time if you carry the ack so I don't have to look at this
again.  But since I *am* looking at it again, I think it's nice if the
subject line includes the actual interface you're adding, e.g.,

  PCI/IOV: Add pci_iov_vf_id() to get VF index

> ---
>  drivers/pci/iov.c   | 14 ++++++++++++++
>  include/linux/pci.h |  8 +++++++-
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index dafdc652fcd0..e7751fa3fe0b 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>  }
>  EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
>  
> +int pci_iov_vf_id(struct pci_dev *dev)
> +{
> +	struct pci_dev *pf;
> +
> +	if (!dev->is_virtfn)
> +		return -EINVAL;
> +
> +	pf = pci_physfn(dev);
> +	return (((dev->bus->number << 8) + dev->devfn) -
> +		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
> +	       pf->sriov->stride;
> +}
> +EXPORT_SYMBOL_GPL(pci_iov_vf_id);
> +
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>   * change when NumVFs changes.
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index cd8aa6fce204..2337512e67f0 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2153,7 +2153,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>  #ifdef CONFIG_PCI_IOV
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> -
> +int pci_iov_vf_id(struct pci_dev *dev);
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
>  
> @@ -2181,6 +2181,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
>  {
>  	return -ENOSYS;
>  }
> +
> +static inline int pci_iov_vf_id(struct pci_dev *dev)
> +{
> +	return -ENOSYS;
> +}
> +
>  static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>  { return -ENODEV; }
>  
> -- 
> 2.18.1
> 

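As a worked example of this math with hypothetical values: a PF at
03:00.0 whose SR-IOV capability reports First VF Offset 1 and VF
Stride 1, with a VF at 03:00.2, gives

	vf_id = (((0x03 << 8) + 0x02) -
		 ((0x03 << 8) + 0x00 + 1)) / 1
	      = (0x302 - 0x301) / 1
	      = 1;	/* the second VF, zero-based */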

* Re: [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF Yishai Hadas
@ 2021-10-13 18:27   ` Bjorn Helgaas
  2021-10-14 22:11   ` Alex Williamson
  1 sibling, 0 replies; 44+ messages in thread
From: Bjorn Helgaas @ 2021-10-13 18:27 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: alex.williamson, bhelgaas, jgg, saeedm, linux-pci, kvm, netdev,
	kuba, leonro, kwankhede, mgurtovoy, maorg

On Wed, Oct 13, 2021 at 12:46:58PM +0300, Yishai Hadas wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> There are some cases where a SRIOV VF driver will need to reach into and
> interact with the PF driver. This requires accessing the drvdata of the PF.
> 
> Provide a function pci_iov_get_pf_drvdata() to return this PF drvdata in a
> safe way. Normally accessing a drvdata of a foreign struct device would be
> done using the device_lock() to protect against device driver
> probe()/remove() races.
> 
> However, due to the design of pci_enable_sriov() this will result in a
> ABBA deadlock on the device_lock as the PF's device_lock is held during PF
> sriov_configure() while calling pci_enable_sriov() which in turn holds the
> VF's device_lock while calling VF probe(), and similarly for remove.
> 
> This means the VF driver can never obtain the PF's device_lock.
> 
> Instead use the implicit locking created by pci_enable/disable_sriov(). A
> VF driver can access its PF drvdata only while its own driver is attached,
> and the PF driver can control access to its own drvdata based on when it
> calls pci_enable/disable_sriov().
> 
> To use this API the PF driver will setup the PF drvdata in the probe()
> function. pci_enable_sriov() is only called from sriov_configure() which
> cannot happen until probe() completes, ensuring no VF races with drvdata
> setup.
> 
> For removal, the PF driver must call pci_disable_sriov() in its remove
> function before destroying any of the drvdata. This ensures that all VF
> drivers are unbound before returning, fencing concurrent access to the
> drvdata.
> 
> The introduction of a new function to do this access makes clear the
> special locking scheme and documents the requirements on the PF/VF
> drivers using this.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Nit: s/SRIOV/SR-IOV/ above so it matches usage in the spec.

I think it's nice to include the actual interface in the subject when
practical.

> ---
>  drivers/pci/iov.c   | 29 +++++++++++++++++++++++++++++
>  include/linux/pci.h |  7 +++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index e7751fa3fe0b..ca696730f761 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -47,6 +47,35 @@ int pci_iov_vf_id(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_iov_vf_id);
>  
> +/**
> + * pci_iov_get_pf_drvdata - Return the drvdata of a PF
> + * @dev - VF pci_dev
> + * @pf_driver - Device driver required to own the PF
> + *
> + * This must be called from a context that ensures that a VF driver is attached.
> + * The value returned is invalid once the VF driver completes its remove()
> + * callback.
> + *
> + * Locking is achieved by the driver core. A VF driver cannot be probed until
> + * pci_enable_sriov() is called and pci_disable_sriov() does not return until
> + * all VF drivers have completed their remove().
> + *
> + * The PF driver must call pci_disable_sriov() before it begins to destroy the
> + * drvdata.
> + */
> +void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver)
> +{
> +	struct pci_dev *pf_dev;
> +
> +	if (dev->is_physfn)
> +		return ERR_PTR(-EINVAL);
> +	pf_dev = dev->physfn;
> +	if (pf_dev->driver != pf_driver)
> +		return ERR_PTR(-EINVAL);
> +	return pci_get_drvdata(pf_dev);
> +}
> +EXPORT_SYMBOL_GPL(pci_iov_get_pf_drvdata);
> +
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>   * change when NumVFs changes.
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 2337512e67f0..639a0a239774 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2154,6 +2154,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
>  int pci_iov_vf_id(struct pci_dev *dev);
> +void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver);
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
>  
> @@ -2187,6 +2188,12 @@ static inline int pci_iov_vf_id(struct pci_dev *dev)
>  	return -ENOSYS;
>  }
>  
> +static inline void *pci_iov_get_pf_drvdata(struct pci_dev *dev,
> +					   struct pci_driver *pf_driver)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
>  static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>  { return -ENODEV; }
>  
> -- 
> 2.18.1
> 

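The contract described above, sketched from the PF driver's side with
hypothetical pf_priv/pf_probe/pf_remove names:

static int pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct pf_priv *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

	if (!priv)
		return -ENOMEM;
	/* drvdata is set before sriov_configure() can bind any VF */
	pci_set_drvdata(pdev, priv);
	return 0;
}

static void pf_remove(struct pci_dev *pdev)
{
	struct pf_priv *priv = pci_get_drvdata(pdev);

	/* Unbinds all VF drivers first, fencing their drvdata access */
	pci_disable_sriov(pdev);
	kfree(priv);
}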

* Re: [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index
  2021-10-13 18:14   ` Bjorn Helgaas
@ 2021-10-14  9:08     ` Yishai Hadas
  0 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-14  9:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: alex.williamson, bhelgaas, jgg, saeedm, linux-pci, kvm, netdev,
	kuba, leonro, kwankhede, mgurtovoy, maorg

On 10/13/2021 9:14 PM, Bjorn Helgaas wrote:
> On Wed, Oct 13, 2021 at 12:46:55PM +0300, Yishai Hadas wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>>
>> The PCI core uses the VF index internally, often called the vf_id,
>> during the setup of the VF, eg pci_iov_add_virtfn().
>>
>> This index is needed for device drivers that implement live migration
>> for their internal operations that configure/control their VFs.
>>
>> Specifically, mlx5_vfio_pci driver that is introduced in coming patches
>> from this series needs it and not the bus/device/function which is
>> exposed today.
>>
>> Add pci_iov_vf_id() which computes the vf_id by reversing the math that
>> was used to create the bus/device/function.
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> I already acked this:
>
>    https://lore.kernel.org/r/20210922215930.GA231505@bhelgaas
>
> Saves me time if you carry the ack so I don't have to look at this
> again.  But since I *am* looking at it again, I think it's nice if the
> subject line includes the actual interface you're adding, e.g.,
>
>    PCI/IOV: Add pci_iov_vf_id() to get VF index


Sure, will change as part of V2 and add your Acked-by.

>> ---
>>   drivers/pci/iov.c   | 14 ++++++++++++++
>>   include/linux/pci.h |  8 +++++++-
>>   2 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index dafdc652fcd0..e7751fa3fe0b 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -33,6 +33,20 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>>   }
>>   EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
>>   
>> +int pci_iov_vf_id(struct pci_dev *dev)
>> +{
>> +	struct pci_dev *pf;
>> +
>> +	if (!dev->is_virtfn)
>> +		return -EINVAL;
>> +
>> +	pf = pci_physfn(dev);
>> +	return (((dev->bus->number << 8) + dev->devfn) -
>> +		((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) /
>> +	       pf->sriov->stride;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_iov_vf_id);
>> +
>>   /*
>>    * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>>    * change when NumVFs changes.
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index cd8aa6fce204..2337512e67f0 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -2153,7 +2153,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>>   #ifdef CONFIG_PCI_IOV
>>   int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>>   int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
>> -
>> +int pci_iov_vf_id(struct pci_dev *dev);
>>   int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>>   void pci_disable_sriov(struct pci_dev *dev);
>>   
>> @@ -2181,6 +2181,12 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
>>   {
>>   	return -ENOSYS;
>>   }
>> +
>> +static inline int pci_iov_vf_id(struct pci_dev *dev)
>> +{
>> +	return -ENOSYS;
>> +}
>> +
>>   static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>>   { return -ENODEV; }
>>   
>> -- 
>> 2.18.1
>>



* Re: [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly
  2021-10-13 18:06   ` Jason Gunthorpe
@ 2021-10-14  9:18     ` Yishai Hadas
  2021-10-15 19:54       ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-14  9:18 UTC (permalink / raw)
  To: Jason Gunthorpe, alex.williamson
  Cc: bhelgaas, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On 10/13/2021 9:06 PM, Jason Gunthorpe wrote:
> On Wed, Oct 13, 2021 at 12:47:07PM +0300, Yishai Hadas wrote:
>> Trap device RESET and update state accordingly, it's done by registering
>> the matching callbacks.
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>   drivers/vfio/pci/mlx5/main.c | 17 ++++++++++++++++-
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
>> index e36302b444a6..8fe44ed13552 100644
>> +++ b/drivers/vfio/pci/mlx5/main.c
>> @@ -613,6 +613,19 @@ static const struct vfio_device_ops mlx5vf_pci_ops = {
>>   	.match = vfio_pci_core_match,
>>   };
>>   
>> +static void mlx5vf_reset_done(struct vfio_pci_core_device *core_vdev)
>> +{
>> +	struct mlx5vf_pci_core_device *mvdev = container_of(
>> +			core_vdev, struct mlx5vf_pci_core_device,
>> +			core_device);
>> +
>> +	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;
> This should hold the state mutex too
>
Thanks Jason, I'll add it as part of V2.

Alex,

Any feedback from your side before that we'll send V2 ?

We already got an ACK for the PCI patches, and only some minor changes
remain to be done.

Thanks,

Yishai



* Re: [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF
  2021-10-13  9:46 ` [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF Yishai Hadas
  2021-10-13 18:27   ` Bjorn Helgaas
@ 2021-10-14 22:11   ` Alex Williamson
  2021-10-17 13:43     ` Yishai Hadas
  1 sibling, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-14 22:11 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On Wed, 13 Oct 2021 12:46:58 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> There are some cases where a SRIOV VF driver will need to reach into and
> interact with the PF driver. This requires accessing the drvdata of the PF.
> 
> Provide a function pci_iov_get_pf_drvdata() to return this PF drvdata in a
> safe way. Normally accessing a drvdata of a foreign struct device would be
> done using the device_lock() to protect against device driver
> probe()/remove() races.
> 
> However, due to the design of pci_enable_sriov() this will result in a
> ABBA deadlock on the device_lock as the PF's device_lock is held during PF
> sriov_configure() while calling pci_enable_sriov() which in turn holds the
> VF's device_lock while calling VF probe(), and similarly for remove.
> 
> This means the VF driver can never obtain the PF's device_lock.
> 
> Instead use the implicit locking created by pci_enable/disable_sriov(). A
> VF driver can access its PF drvdata only while its own driver is attached,
> and the PF driver can control access to its own drvdata based on when it
> calls pci_enable/disable_sriov().
> 
> To use this API the PF driver will setup the PF drvdata in the probe()
> function. pci_enable_sriov() is only called from sriov_configure() which
> cannot happen until probe() completes, ensuring no VF races with drvdata
> setup.
> 
> For removal, the PF driver must call pci_disable_sriov() in its remove
> function before destroying any of the drvdata. This ensures that all VF
> drivers are unbound before returning, fencing concurrent access to the
> drvdata.
> 
> The introduction of a new function to do this access makes clear the
> special locking scheme and documents the requirements on the PF/VF
> drivers using this.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/pci/iov.c   | 29 +++++++++++++++++++++++++++++
>  include/linux/pci.h |  7 +++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index e7751fa3fe0b..ca696730f761 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -47,6 +47,35 @@ int pci_iov_vf_id(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_iov_vf_id);
>  
> +/**
> + * pci_iov_get_pf_drvdata - Return the drvdata of a PF
> + * @dev - VF pci_dev
> + * @pf_driver - Device driver required to own the PF
> + *
> + * This must be called from a context that ensures that a VF driver is attached.
> + * The value returned is invalid once the VF driver completes its remove()
> + * callback.
> + *
> + * Locking is achieved by the driver core. A VF driver cannot be probed until
> + * pci_enable_sriov() is called and pci_disable_sriov() does not return until
> + * all VF drivers have completed their remove().
> + *
> + * The PF driver must call pci_disable_sriov() before it begins to destroy the
> + * drvdata.
> + */
> +void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver)
> +{
> +	struct pci_dev *pf_dev;
> +
> +	if (dev->is_physfn)
> +		return ERR_PTR(-EINVAL);

I think we're trying to make this only accessible to VFs, so shouldn't
we test (!dev->is_virtfn)?  is_physfn will be zero for either a PF with
failed SR-IOV configuration or for a non-SR-IOV device afaict.  Thanks,

Alex

> +	pf_dev = dev->physfn;
> +	if (pf_dev->driver != pf_driver)
> +		return ERR_PTR(-EINVAL);
> +	return pci_get_drvdata(pf_dev);
> +}
> +EXPORT_SYMBOL_GPL(pci_iov_get_pf_drvdata);
> +
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>   * change when NumVFs changes.
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 2337512e67f0..639a0a239774 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2154,6 +2154,7 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
>  int pci_iov_vf_id(struct pci_dev *dev);
> +void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver);
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
>  
> @@ -2187,6 +2188,12 @@ static inline int pci_iov_vf_id(struct pci_dev *dev)
>  	return -ENOSYS;
>  }
>  
> +static inline void *pci_iov_get_pf_drvdata(struct pci_dev *dev,
> +					   struct pci_driver *pf_driver)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
>  static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>  { return -ENODEV; }
>  


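The check Alex suggests, rejecting PFs and plain non-SR-IOV devices
alike, would read (sketch):

	if (!dev->is_virtfn)
		return ERR_PTR(-EINVAL);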

* Re: [PATCH V1 mlx5-next 07/13] vfio: Add 'invalid' state definitions
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 07/13] vfio: Add 'invalid' state definitions Yishai Hadas
@ 2021-10-15 16:38   ` Alex Williamson
  2021-10-17 14:07     ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-15 16:38 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On Wed, 13 Oct 2021 12:47:01 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> Add 'invalid' state definition to be used by drivers to set/check
> invalid state.
> 
> In addition dropped the non complied macro VFIO_DEVICE_STATE_SET_ERROR
> (i.e SATE instead of STATE) which seems unusable.

s/non complied/non-compiled/

We can certainly assume it's unused based on the typo, but removing it
or fixing it should be a separate patch.

> Fixes: a8a24f3f6e38 ("vfio: UAPI for migration interface for device state")
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  include/linux/vfio.h      | 5 +++++
>  include/uapi/linux/vfio.h | 4 +---
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index b53a9557884a..6a8cf6637333 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -252,4 +252,9 @@ extern int vfio_virqfd_enable(void *opaque,
>  			      void *data, struct virqfd **pvirqfd, int fd);
>  extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
>  
> +static inline bool vfio_is_state_invalid(u32 state)
> +{
> +	return state >= VFIO_DEVICE_STATE_INVALID;
> +}


Redundant, we already have !VFIO_DEVICE_STATE_VALID(state)

> +
>  #endif /* VFIO_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..7f8fdada5eb3 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -609,6 +609,7 @@ struct vfio_device_migration_info {
>  #define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>  #define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>  #define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_RESUMING + 1)

Nak, device_state is not an enum, this is only one of the states we
currently define as invalid and usage such as the inline above ignores
the device state mask below, which induces future limits on how we can
expand the device_state field.  Thanks,

Alex

>  #define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>  				     VFIO_DEVICE_STATE_SAVING |  \
>  				     VFIO_DEVICE_STATE_RESUMING)
> @@ -621,9 +622,6 @@ struct vfio_device_migration_info {
>  	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
>  					      VFIO_DEVICE_STATE_RESUMING))
>  
> -#define VFIO_DEVICE_STATE_SET_ERROR(state) \
> -	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> -					     VFIO_DEVICE_STATE_RESUMING)
>  
>  	__u32 reserved;
>  	__u64 pending_bytes;


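For reference, the validity check Alex points to already exists in
include/uapi/linux/vfio.h:

#define VFIO_DEVICE_STATE_VALID(state) \
	(state & VFIO_DEVICE_STATE_RESUMING ? \
	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)

i.e. any value that combines _RESUMING with _SAVING or _RUNNING is
invalid, and everything else is considered valid.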

* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
@ 2021-10-15 19:48   ` Alex Williamson
  2021-10-15 19:59     ` Jason Gunthorpe
  2021-10-19  9:59   ` Shameerali Kolothum Thodi
  1 sibling, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-15 19:48 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On Wed, 13 Oct 2021 12:47:05 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> This patch adds support for vfio_pci driver for mlx5 devices.
> 
> It uses vfio_pci_core to register to the VFIO subsystem and then
> implements the mlx5 specific logic in the migration area.
> 
> The migration implementation follows the definition from uapi/vfio.h and
> uses the mlx5 VF->PF command channel to achieve it.
> 
> This patch implements the suspend/resume flows.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  MAINTAINERS                    |   6 +
>  drivers/vfio/pci/Kconfig       |   3 +
>  drivers/vfio/pci/Makefile      |   2 +
>  drivers/vfio/pci/mlx5/Kconfig  |  11 +
>  drivers/vfio/pci/mlx5/Makefile |   4 +
>  drivers/vfio/pci/mlx5/main.c   | 692 +++++++++++++++++++++++++++++++++
>  6 files changed, 718 insertions(+)
>  create mode 100644 drivers/vfio/pci/mlx5/Kconfig
>  create mode 100644 drivers/vfio/pci/mlx5/Makefile
>  create mode 100644 drivers/vfio/pci/mlx5/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index abdcbcfef73d..e824bfab4a01 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -19699,6 +19699,12 @@ L:	kvm@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vfio/platform/
>  
> +VFIO MLX5 PCI DRIVER
> +M:	Yishai Hadas <yishaih@nvidia.com>
> +L:	kvm@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vfio/pci/mlx5/
> +
>  VGA_SWITCHEROO
>  R:	Lukas Wunner <lukas@wunner.de>
>  S:	Maintained
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 860424ccda1b..187b9c259944 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -43,4 +43,7 @@ config VFIO_PCI_IGD
>  
>  	  To enable Intel IGD assignment through vfio-pci, say Y.
>  endif
> +
> +source "drivers/vfio/pci/mlx5/Kconfig"
> +
>  endif
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 349d68d242b4..ed9d6f2e0555 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>  vfio-pci-y := vfio_pci.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> +
> +obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
> diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
> new file mode 100644
> index 000000000000..a3ce00add4fe
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config MLX5_VFIO_PCI
> +	tristate "VFIO support for MLX5 PCI devices"
> +	depends on MLX5_CORE
> +	select VFIO_PCI_CORE
> +	help
> +	  This provides a PCI support for MLX5 devices using the VFIO
> +	  framework. The device specific driver supports suspend/resume
> +	  of the MLX5 device.


Why are we doing everything except describing this as migration
support?  First sentence also needs some grammar help.


> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
> new file mode 100644
> index 000000000000..689627da7ff5
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
> +mlx5-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> new file mode 100644
> index 000000000000..e36302b444a6
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -0,0 +1,692 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "cmd.h"
> +
> +enum {
> +	MLX5VF_PCI_FREEZED = 1 << 0,
> +};
> +
> +enum {
> +	MLX5VF_REGION_PENDING_BYTES = 1 << 0,
> +	MLX5VF_REGION_DATA_SIZE = 1 << 1,
> +};
> +
> +#define MLX5VF_MIG_REGION_DATA_SIZE SZ_128K
> +/* Data section offset from migration region */
> +#define MLX5VF_MIG_REGION_DATA_OFFSET                                          \
> +	(sizeof(struct vfio_device_migration_info))
> +
> +#define VFIO_DEVICE_MIGRATION_OFFSET(x)                                        \
> +	(offsetof(struct vfio_device_migration_info, x))
> +
> +struct mlx5vf_pci_migration_info {
> +	u32 vfio_dev_state; /* VFIO_DEVICE_STATE_XXX */
> +	u32 dev_state; /* device migration state */
> +	u32 region_state; /* Use MLX5VF_REGION_XXX */
> +	u16 vhca_id;
> +	struct mlx5_vhca_state_data vhca_state_data;
> +};
> +
> +struct mlx5vf_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	u8 migrate_cap:1;
> +	/* protect migartion state */
> +	struct mutex state_mutex;
> +	struct mlx5vf_pci_migration_info vmig;
> +};
> +
> +static int mlx5vf_pci_unquiesce_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	return mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
> +				      mvdev->vmig.vhca_id,
> +				      MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
> +}
> +
> +static int mlx5vf_pci_quiesce_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	return mlx5vf_cmd_suspend_vhca(
> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
> +}
> +
> +static int mlx5vf_pci_unfreeze_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	int ret;
> +
> +	ret = mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
> +				     mvdev->vmig.vhca_id,
> +				     MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
> +	if (ret)
> +		return ret;
> +
> +	mvdev->vmig.dev_state &= ~MLX5VF_PCI_FREEZED;
> +	return 0;
> +}
> +
> +static int mlx5vf_pci_freeze_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	int ret;
> +
> +	ret = mlx5vf_cmd_suspend_vhca(
> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
> +	if (ret)
> +		return ret;
> +
> +	mvdev->vmig.dev_state |= MLX5VF_PCI_FREEZED;
> +	return 0;
> +}
> +
> +static int mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	u32 state_size = 0;
> +	int ret;
> +
> +	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED))
> +		return -EFAULT;
> +
> +	/* If we already read state no reason to re-read */
> +	if (mvdev->vmig.vhca_state_data.state_size)
> +		return 0;
> +
> +	ret = mlx5vf_cmd_query_vhca_migration_state(
> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id, &state_size);
> +	if (ret)
> +		return ret;
> +
> +	return mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
> +					  mvdev->vmig.vhca_id, state_size,
> +					  &mvdev->vmig.vhca_state_data);
> +}
> +
> +static int mlx5vf_pci_new_write_window(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
> +	u32 num_pages_needed;
> +	u64 allocated_ready;
> +	u32 bytes_needed;
> +
> +	/* Check how many bytes are available from previous flows */
> +	WARN_ON(state_data->num_pages * PAGE_SIZE <
> +		state_data->win_start_offset);
> +	allocated_ready = (state_data->num_pages * PAGE_SIZE) -
> +			  state_data->win_start_offset;
> +	WARN_ON(allocated_ready > MLX5VF_MIG_REGION_DATA_SIZE);
> +
> +	bytes_needed = MLX5VF_MIG_REGION_DATA_SIZE - allocated_ready;
> +	if (!bytes_needed)
> +		return 0;
> +
> +	num_pages_needed = DIV_ROUND_UP_ULL(bytes_needed, PAGE_SIZE);
> +	return mlx5vf_add_migration_pages(state_data, num_pages_needed);
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_data_size(struct mlx5vf_pci_core_device *mvdev,
> +				      char __user *buf, bool iswrite)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +	u64 data_size;
> +	int ret;
> +
> +	if (iswrite) {
> +		/* data_size is writable only during resuming state */
> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_RESUMING)
> +			return -EINVAL;
> +
> +		ret = copy_from_user(&data_size, buf, sizeof(data_size));
> +		if (ret)
> +			return -EFAULT;
> +
> +		vmig->vhca_state_data.state_size += data_size;
> +		vmig->vhca_state_data.win_start_offset += data_size;
> +		ret = mlx5vf_pci_new_write_window(mvdev);
> +		if (ret)
> +			return ret;
> +
> +	} else {
> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_SAVING)
> +			return -EINVAL;
> +
> +		data_size = min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
> +				  vmig->vhca_state_data.state_size -
> +				  vmig->vhca_state_data.win_start_offset);
> +		ret = copy_to_user(buf, &data_size, sizeof(data_size));
> +		if (ret)
> +			return -EFAULT;
> +	}
> +
> +	vmig->region_state |= MLX5VF_REGION_DATA_SIZE;
> +	return sizeof(data_size);
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_data_offset(struct mlx5vf_pci_core_device *mvdev,
> +					char __user *buf, bool iswrite)
> +{
> +	static const u64 data_offset = MLX5VF_MIG_REGION_DATA_OFFSET;
> +	int ret;
> +
> +	/* RO field */
> +	if (iswrite)
> +		return -EFAULT;
> +
> +	ret = copy_to_user(buf, &data_offset, sizeof(data_offset));
> +	if (ret)
> +		return -EFAULT;
> +
> +	return sizeof(data_offset);
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_pending_bytes(struct mlx5vf_pci_core_device *mvdev,
> +					  char __user *buf, bool iswrite)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +	u64 pending_bytes;
> +	int ret;
> +
> +	/* RO field */
> +	if (iswrite)
> +		return -EFAULT;
> +
> +	if (vmig->vfio_dev_state == (VFIO_DEVICE_STATE_SAVING |
> +				     VFIO_DEVICE_STATE_RUNNING)) {
> +		/* In pre-copy state we have no data to return for now,
> +		 * return 0 pending bytes
> +		 */
> +		pending_bytes = 0;
> +	} else {
> +		if (!vmig->vhca_state_data.state_size)
> +			return 0;
> +		pending_bytes = vmig->vhca_state_data.state_size -
> +				vmig->vhca_state_data.win_start_offset;
> +	}
> +
> +	ret = copy_to_user(buf, &pending_bytes, sizeof(pending_bytes));
> +	if (ret)
> +		return -EFAULT;
> +
> +	/* Window moves forward once data from previous iteration was read */
> +	if (vmig->region_state & MLX5VF_REGION_DATA_SIZE)
> +		vmig->vhca_state_data.win_start_offset +=
> +			min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE, pending_bytes);
> +
> +	WARN_ON(vmig->vhca_state_data.win_start_offset >
> +		vmig->vhca_state_data.state_size);
> +
> +	/* New iteration started */
> +	vmig->region_state = MLX5VF_REGION_PENDING_BYTES;
> +	return sizeof(pending_bytes);
> +}
> +
> +static int mlx5vf_load_state(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	if (!mvdev->vmig.vhca_state_data.state_size)
> +		return 0;
> +
> +	return mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
> +					  mvdev->vmig.vhca_id,
> +					  &mvdev->vmig.vhca_state_data);
> +}
> +
> +static void mlx5vf_reset_mig_state(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +
> +	vmig->region_state = 0;
> +	mlx5vf_reset_vhca_state(&vmig->vhca_state_data);
> +}
> +
> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> +				       u32 state)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +	u32 old_state = vmig->vfio_dev_state;
> +	int ret = 0;
> +
> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> +		return -EINVAL;

if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))


> +
> +	/* Running switches off */
> +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> +	    (state & VFIO_DEVICE_STATE_RUNNING) &&

((old_state ^ state) & VFIO_DEVICE_STATE_RUNNING) ?


> +	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
> +		ret = mlx5vf_pci_quiesce_device(mvdev);
> +		if (ret)
> +			return ret;
> +		ret = mlx5vf_pci_freeze_device(mvdev);
> +		if (ret) {
> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;


No, the invalid states are specifically unreachable, the uAPI defines
the error state for this purpose.  The states noted as invalid in the
uAPI should be considered reserved at this point.  If only there was a
macro to set an error state... ;)
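
For reference, the helpers the uAPI carried at this point in
include/uapi/linux/vfio.h are below; note the "SATE" typo that keeps
SET_ERROR from compiling, which patch 07/13 of this series removes:

#define VFIO_DEVICE_STATE_IS_ERROR(state) \
	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
					      VFIO_DEVICE_STATE_RESUMING))

#define VFIO_DEVICE_STATE_SET_ERROR(state) \
	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
					     VFIO_DEVICE_STATE_RESUMING)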


> +			return ret;
> +		}
> +	}
> +
> +	/* Resuming switches off */
> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&

A single xor before all of these cases might be worthwhile.  Thanks,

Alex
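
(Illustratively, hoisting the xor once up front would look like:

	u32 changed = old_state ^ state;

	/* Running switches off */
	if ((changed & VFIO_DEVICE_STATE_RUNNING) &&
	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
		...
	}

with the same pattern repeated for the other transition checks.)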

> +	    (old_state & VFIO_DEVICE_STATE_RESUMING)) {
> +		/* deserialize state into the device */
> +		ret = mlx5vf_load_state(mvdev);
> +		if (ret) {
> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> +			return ret;
> +		}
> +	}
> +
> +	/* Resuming switches on */
> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&
> +	    (state & VFIO_DEVICE_STATE_RESUMING)) {
> +		mlx5vf_reset_mig_state(mvdev);
> +		ret = mlx5vf_pci_new_write_window(mvdev);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/* Saving switches on */
> +	if ((old_state & VFIO_DEVICE_STATE_SAVING) !=
> +	    (state & VFIO_DEVICE_STATE_SAVING) &&
> +	    (state & VFIO_DEVICE_STATE_SAVING)) {
> +		if (!(state & VFIO_DEVICE_STATE_RUNNING)) {
> +			/* serialize post copy */
> +			ret = mlx5vf_pci_save_device_data(mvdev);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	/* Running switches on */
> +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> +	    (state & VFIO_DEVICE_STATE_RUNNING) &&
> +	    (state & VFIO_DEVICE_STATE_RUNNING)) {
> +		ret = mlx5vf_pci_unfreeze_device(mvdev);
> +		if (ret)
> +			return ret;
> +		ret = mlx5vf_pci_unquiesce_device(mvdev);
> +		if (ret) {
> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> +			return ret;
> +		}
> +	}
> +
> +	vmig->vfio_dev_state = state;
> +	return 0;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_device_state(struct mlx5vf_pci_core_device *mvdev,
> +					 char __user *buf, bool iswrite)
> +{
> +	size_t count = sizeof(mvdev->vmig.vfio_dev_state);
> +	int ret;
> +
> +	if (iswrite) {
> +		u32 device_state;
> +
> +		ret = copy_from_user(&device_state, buf, count);
> +		if (ret)
> +			return -EFAULT;
> +
> +		ret = mlx5vf_pci_set_device_state(mvdev, device_state);
> +		if (ret)
> +			return ret;
> +	} else {
> +		ret = copy_to_user(buf, &mvdev->vmig.vfio_dev_state, count);
> +		if (ret)
> +			return -EFAULT;
> +	}
> +
> +	return count;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_copy_user_data_to_device_state(struct mlx5vf_pci_core_device *mvdev,
> +					  char __user *buf, size_t count,
> +					  u64 offset)
> +{
> +	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
> +	char __user *from_buff = buf;
> +	u32 curr_offset;
> +	u32 win_page_offset;
> +	u32 copy_count;
> +	struct page *page;
> +	char *to_buff;
> +	int ret;
> +
> +	curr_offset = state_data->win_start_offset + offset;
> +
> +	do {
> +		page = mlx5vf_get_migration_page(&state_data->mig_data,
> +						 curr_offset);
> +		if (!page)
> +			return -EINVAL;
> +
> +		win_page_offset = curr_offset % PAGE_SIZE;
> +		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
> +
> +		to_buff = kmap_local_page(page);
> +		ret = copy_from_user(to_buff + win_page_offset, from_buff,
> +				     copy_count);
> +		kunmap_local(to_buff);
> +		if (ret)
> +			return -EFAULT;
> +
> +		from_buff += copy_count;
> +		curr_offset += copy_count;
> +		count -= copy_count;
> +	} while (count > 0);
> +
> +	return 0;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_copy_device_state_to_user(struct mlx5vf_pci_core_device *mvdev,
> +				     char __user *buf, u64 offset, size_t count)
> +{
> +	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
> +	char __user *to_buff = buf;
> +	u32 win_available_bytes;
> +	u32 win_page_offset;
> +	u32 copy_count;
> +	u32 curr_offset;
> +	char *from_buff;
> +	struct page *page;
> +	int ret;
> +
> +	win_available_bytes =
> +		min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
> +		      mvdev->vmig.vhca_state_data.state_size -
> +			      mvdev->vmig.vhca_state_data.win_start_offset);
> +
> +	if (count + offset > win_available_bytes)
> +		return -EINVAL;
> +
> +	curr_offset = state_data->win_start_offset + offset;
> +
> +	do {
> +		page = mlx5vf_get_migration_page(&state_data->mig_data,
> +						 curr_offset);
> +		if (!page)
> +			return -EINVAL;
> +
> +		win_page_offset = curr_offset % PAGE_SIZE;
> +		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
> +
> +		from_buff = kmap_local_page(page);
> +		ret = copy_to_user(to_buff, from_buff + win_page_offset,
> +				   copy_count);
> +		kunmap_local(from_buff);
> +		if (ret)
> +			return -EFAULT;
> +
> +		curr_offset += copy_count;
> +		count -= copy_count;
> +		to_buff += copy_count;
> +	} while (count);
> +
> +	return 0;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_migration_data_rw(struct mlx5vf_pci_core_device *mvdev,
> +			     char __user *buf, size_t count, u64 offset,
> +			     bool iswrite)
> +{
> +	int ret;
> +
> +	if (offset + count > MLX5VF_MIG_REGION_DATA_SIZE)
> +		return -EINVAL;
> +
> +	if (iswrite)
> +		ret = mlx5vf_pci_copy_user_data_to_device_state(mvdev, buf,
> +								count, offset);
> +	else
> +		ret = mlx5vf_pci_copy_device_state_to_user(mvdev, buf, offset,
> +							   count);
> +	if (ret)
> +		return ret;
> +	return count;
> +}
> +
> +static ssize_t mlx5vf_pci_mig_rw(struct vfio_pci_core_device *vdev,
> +				 char __user *buf, size_t count, loff_t *ppos,
> +				 bool iswrite)
> +{
> +	struct mlx5vf_pci_core_device *mvdev =
> +		container_of(vdev, struct mlx5vf_pci_core_device, core_device);
> +	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	int ret;
> +
> +	mutex_lock(&mvdev->state_mutex);
> +	/* Copy to/from the migration region data section */
> +	if (pos >= MLX5VF_MIG_REGION_DATA_OFFSET) {
> +		ret = mlx5vf_pci_migration_data_rw(
> +			mvdev, buf, count, pos - MLX5VF_MIG_REGION_DATA_OFFSET,
> +			iswrite);
> +		goto end;
> +	}
> +
> +	switch (pos) {
> +	case VFIO_DEVICE_MIGRATION_OFFSET(device_state):
> +		/* This is an RW field. */
> +		if (count != sizeof(mvdev->vmig.vfio_dev_state)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		ret = mlx5vf_pci_handle_migration_device_state(mvdev, buf,
> +							       iswrite);
> +		break;
> +	case VFIO_DEVICE_MIGRATION_OFFSET(pending_bytes):
> +		/*
> +		 * The number of pending bytes still to be migrated from the
> +		 * vendor driver. This is an RO field.
> +		 * Reading this field indicates the start of a new iteration
> +		 * to get device data.
> +		 */
> +		ret = mlx5vf_pci_handle_migration_pending_bytes(mvdev, buf,
> +								iswrite);
> +		break;
> +	case VFIO_DEVICE_MIGRATION_OFFSET(data_offset):
> +		/*
> +		 * The user application should read the data_offset field from
> +		 * the migration region. The user application should read the
> +		 * device data from this offset within the migration region
> +		 * during the _SAVING mode or write the device data during the
> +		 * _RESUMING mode. This is an RO field.
> +		 */
> +		ret = mlx5vf_pci_handle_migration_data_offset(mvdev, buf,
> +							      iswrite);
> +		break;
> +	case VFIO_DEVICE_MIGRATION_OFFSET(data_size):
> +		/*
> +		 * The user application should read data_size to get the size
> +		 * in bytes of the data copied to the migration region during
> +		 * the _SAVING state by the device. The user application should
> +		 * write the size in bytes of the data that was copied to
> +		 * the migration region during the _RESUMING state by the user.
> +		 * This is an RW field.
> +		 */
> +		ret = mlx5vf_pci_handle_migration_data_size(mvdev, buf,
> +							    iswrite);
> +		break;
> +	default:
> +		ret = -EFAULT;
> +		break;
> +	}
> +
> +end:
> +	mutex_unlock(&mvdev->state_mutex);
> +	return ret;
> +}
> +
> +static struct vfio_pci_regops migration_ops = {
> +	.rw = mlx5vf_pci_mig_rw,
> +};
> +
> +static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct mlx5vf_pci_core_device *mvdev = container_of(
> +		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
> +	struct vfio_pci_core_device *vdev = &mvdev->core_device;
> +	int vf_id;
> +	int ret;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	if (!mvdev->migrate_cap) {
> +		vfio_pci_core_finish_enable(vdev);
> +		return 0;
> +	}
> +
> +	vf_id = pci_iov_vf_id(vdev->pdev);
> +	if (vf_id < 0) {
> +		ret = vf_id;
> +		goto out_disable;
> +	}
> +
> +	ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1,
> +				     &mvdev->vmig.vhca_id);
> +	if (ret)
> +		goto out_disable;
> +
> +	ret = vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
> +					   VFIO_REGION_SUBTYPE_MIGRATION,
> +					   &migration_ops,
> +					   MLX5VF_MIG_REGION_DATA_OFFSET +
> +					   MLX5VF_MIG_REGION_DATA_SIZE,
> +					   VFIO_REGION_INFO_FLAG_READ |
> +					   VFIO_REGION_INFO_FLAG_WRITE,
> +					   NULL);
> +	if (ret)
> +		goto out_disable;
> +
> +	mutex_init(&mvdev->state_mutex);
> +	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;
> +	vfio_pci_core_finish_enable(vdev);
> +	return 0;
> +out_disable:
> +	vfio_pci_core_disable(vdev);
> +	return ret;
> +}
> +
> +static void mlx5vf_pci_close_device(struct vfio_device *core_vdev)
> +{
> +	struct mlx5vf_pci_core_device *mvdev = container_of(
> +		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
> +
> +	vfio_pci_core_close_device(core_vdev);
> +	mlx5vf_reset_mig_state(mvdev);
> +}
> +
> +static const struct vfio_device_ops mlx5vf_pci_ops = {
> +	.name = "mlx5-vfio-pci",
> +	.open_device = mlx5vf_pci_open_device,
> +	.close_device = mlx5vf_pci_close_device,
> +	.ioctl = vfio_pci_core_ioctl,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +};
> +
> +static int mlx5vf_pci_probe(struct pci_dev *pdev,
> +			    const struct pci_device_id *id)
> +{
> +	struct mlx5vf_pci_core_device *mvdev;
> +	int ret;
> +
> +	mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL);
> +	if (!mvdev)
> +		return -ENOMEM;
> +	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
> +
> +	if (pdev->is_virtfn) {
> +		struct mlx5_core_dev *mdev =
> +			mlx5_vf_get_core_dev(pdev);
> +
> +		if (mdev) {
> +			if (MLX5_CAP_GEN(mdev, migration))
> +				mvdev->migrate_cap = 1;
> +			mlx5_vf_put_core_dev(mdev);
> +		}
> +	}
> +
> +	ret = vfio_pci_core_register_device(&mvdev->core_device);
> +	if (ret)
> +		goto out_free;
> +
> +	dev_set_drvdata(&pdev->dev, mvdev);
> +	return 0;
> +
> +out_free:
> +	vfio_pci_core_uninit_device(&mvdev->core_device);
> +	kfree(mvdev);
> +	return ret;
> +}
> +
> +static void mlx5vf_pci_remove(struct pci_dev *pdev)
> +{
> +	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
> +
> +	vfio_pci_core_unregister_device(&mvdev->core_device);
> +	vfio_pci_core_uninit_device(&mvdev->core_device);
> +	kfree(mvdev);
> +}
> +
> +static const struct pci_device_id mlx5vf_pci_table[] = {
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_MELLANOX, 0x101e) }, /* ConnectX Family mlx5Gen Virtual Function */
> +	{}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
> +
> +static struct pci_driver mlx5vf_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = mlx5vf_pci_table,
> +	.probe = mlx5vf_pci_probe,
> +	.remove = mlx5vf_pci_remove,
> +	.err_handler = &vfio_pci_core_err_handlers,
> +};
> +
> +static void __exit mlx5vf_pci_cleanup(void)
> +{
> +	pci_unregister_driver(&mlx5vf_pci_driver);
> +}
> +
> +static int __init mlx5vf_pci_init(void)
> +{
> +	return pci_register_driver(&mlx5vf_pci_driver);
> +}
> +
> +module_init(mlx5vf_pci_init);
> +module_exit(mlx5vf_pci_cleanup);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"MLX5 VFIO PCI - User Level meta-driver for MLX5 device family");
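
For reference, the userspace half of the _SAVING flow implemented above
looks roughly like the following (an illustrative fragment; device_fd,
region_offset and buf are assumed to be set up already, and error
handling is omitted):

	__u64 pending, data_offset, data_size;

	/* reading pending_bytes starts a new iteration */
	pread(device_fd, &pending, sizeof(pending), region_offset +
	      offsetof(struct vfio_device_migration_info, pending_bytes));
	while (pending) {
		pread(device_fd, &data_offset, sizeof(data_offset), region_offset +
		      offsetof(struct vfio_device_migration_info, data_offset));
		pread(device_fd, &data_size, sizeof(data_size), region_offset +
		      offsetof(struct vfio_device_migration_info, data_size));
		pread(device_fd, buf, data_size, region_offset + data_offset);
		/* the driver advances its window on the next pending_bytes read */
		pread(device_fd, &pending, sizeof(pending), region_offset +
		      offsetof(struct vfio_device_migration_info, pending_bytes));
	}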



* Re: [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET Yishai Hadas
@ 2021-10-15 19:52   ` Alex Williamson
  2021-10-15 20:03     ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-15 19:52 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On Wed, 13 Oct 2021 12:47:06 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> Add infrastructure to let vfio_pci_core drivers trap device RESET.
> 
> The motivation for this is to let the underlying driver be aware that
> a reset was done and set its internal state accordingly.

I think the intention of the uAPI here is that the migration error
state is exited specifically via the reset ioctl.  Maybe that should be
made more clear, but variant drivers can already wrap the core ioctl
for the purpose of determining that mechanism of reset has occurred.
Thanks,

Alex

 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c |  8 ++++++--
>  drivers/vfio/pci/vfio_pci_core.c   |  2 ++
>  include/linux/vfio_pci_core.h      | 10 ++++++++++
>  3 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 6e58b4bf7a60..002198376f43 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -859,7 +859,9 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
>  
>  		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
>  			vfio_pci_zap_and_down_write_memory_lock(vdev);
> -			pci_try_reset_function(vdev->pdev);
> +			ret = pci_try_reset_function(vdev->pdev);
> +			if (!ret && vdev->ops && vdev->ops->reset_done)
> +				vdev->ops->reset_done(vdev);
>  			up_write(&vdev->memory_lock);
>  		}
>  	}
> @@ -941,7 +943,9 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
>  
>  		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
>  			vfio_pci_zap_and_down_write_memory_lock(vdev);
> -			pci_try_reset_function(vdev->pdev);
> +			ret = pci_try_reset_function(vdev->pdev);
> +			if (!ret && vdev->ops && vdev->ops->reset_done)
> +				vdev->ops->reset_done(vdev);
>  			up_write(&vdev->memory_lock);
>  		}
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index e581a327f90d..d2497a8ed7f1 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -923,6 +923,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  
>  		vfio_pci_zap_and_down_write_memory_lock(vdev);
>  		ret = pci_try_reset_function(vdev->pdev);
> +		if (!ret && vdev->ops && vdev->ops->reset_done)
> +			vdev->ops->reset_done(vdev);
>  		up_write(&vdev->memory_lock);
>  
>  		return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index ef9a44b6cf5d..6ccf5824f098 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -95,6 +95,15 @@ struct vfio_pci_mmap_vma {
>  	struct list_head	vma_next;
>  };
>  
> +/**
> + * struct vfio_pci_core_device_ops - VFIO PCI driver device callbacks
> + *
> + * @reset_done: Called when the device was reset
> + */
> +struct vfio_pci_core_device_ops {
> +	void	(*reset_done)(struct vfio_pci_core_device *vdev);
> +};
> +
>  struct vfio_pci_core_device {
>  	struct vfio_device	vdev;
>  	struct pci_dev		*pdev;
> @@ -137,6 +146,7 @@ struct vfio_pci_core_device {
>  	struct mutex		vma_lock;
>  	struct list_head	vma_list;
>  	struct rw_semaphore	memory_lock;
> +	const struct vfio_pci_core_device_ops *ops;
>  };
>  
>  #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)



* Re: [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly
  2021-10-14  9:18     ` Yishai Hadas
@ 2021-10-15 19:54       ` Alex Williamson
  0 siblings, 0 replies; 44+ messages in thread
From: Alex Williamson @ 2021-10-15 19:54 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Jason Gunthorpe, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Thu, 14 Oct 2021 12:18:30 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 10/13/2021 9:06 PM, Jason Gunthorpe wrote:
> > On Wed, Oct 13, 2021 at 12:47:07PM +0300, Yishai Hadas wrote:  
> >> Trap device RESET and update state accordingly, it's done by registering
> >> the matching callbacks.
> >>
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >>   drivers/vfio/pci/mlx5/main.c | 17 ++++++++++++++++-
> >>   1 file changed, 16 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> >> index e36302b444a6..8fe44ed13552 100644
> >> +++ b/drivers/vfio/pci/mlx5/main.c
> >> @@ -613,6 +613,19 @@ static const struct vfio_device_ops mlx5vf_pci_ops = {
> >>   	.match = vfio_pci_core_match,
> >>   };
> >>   
> >> +static void mlx5vf_reset_done(struct vfio_pci_core_device *core_vdev)
> >> +{
> >> +	struct mlx5vf_pci_core_device *mvdev = container_of(
> >> +			core_vdev, struct mlx5vf_pci_core_device,
> >> +			core_device);
> >> +
> >> +	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;  
> > This should hold the state mutex too
> >  
> Thanks Jason, I'll add it as part of V2.
> 
> Alex,
> 
> Any feedback from your side before we send V2?
> 
> We already got an ACK for the PCI patches; there are only some minor
> changes left to be done.

Provided.  Thanks,

Alex



* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-15 19:48   ` Alex Williamson
@ 2021-10-15 19:59     ` Jason Gunthorpe
  2021-10-15 20:12       ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2021-10-15 19:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:
> > +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> > +				       u32 state)
> > +{
> > +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> > +	u32 old_state = vmig->vfio_dev_state;
> > +	int ret = 0;
> > +
> > +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> > +		return -EINVAL;
> 
> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))

AFAICT this macro doesn't do what is needed, eg

VFIO_DEVICE_STATE_VALID(0xF000) == true

What Yishai implemented is at least functionally correct - states this
driver does not support are rejected.
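
For reference, the uAPI macro only validates the SAVING/RESUMING
combination and ignores undefined high bits (quoting
include/uapi/linux/vfio.h):

#define VFIO_DEVICE_STATE_VALID(state) \
	(state & VFIO_DEVICE_STATE_RESUMING ? \
	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)

With 0xF000 the _RESUMING bit is clear, so the ternary falls through to 1.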

> > +	/* Running switches off */
> > +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> > +	    (state & VFIO_DEVICE_STATE_RUNNING) &&
> 
> ((old_state ^ state) & VFIO_DEVICE_STATE_RUNNING) ?

It is not functionally the same; xor only tells if the bit changed, it
doesn't tell what the current value is, and this needs to know that it
changed to 1.

> > +	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
> > +		ret = mlx5vf_pci_quiesce_device(mvdev);
> > +		if (ret)
> > +			return ret;
> > +		ret = mlx5vf_pci_freeze_device(mvdev);
> > +		if (ret) {
> > +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> 
> 
> No, the invalid states are specifically unreachable, the uAPI defines
> the error state for this purpose.

Indeed

> The states noted as invalid in the
> uAPI should be considered reserved at this point.  If only there was a
> macro to set an error state... ;)

It should just assign a constant value; there is only one error state.
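
I.e. something like this at each failure point (an illustrative
one-liner; the uAPI defines the error state as exactly this
combination):

	vmig->vfio_dev_state = VFIO_DEVICE_STATE_SAVING |
			       VFIO_DEVICE_STATE_RESUMING; /* error state */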

Jason


* Re: [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET
  2021-10-15 19:52   ` Alex Williamson
@ 2021-10-15 20:03     ` Jason Gunthorpe
  2021-10-15 21:12       ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2021-10-15 20:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Fri, Oct 15, 2021 at 01:52:37PM -0600, Alex Williamson wrote:
> On Wed, 13 Oct 2021 12:47:06 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > Add infrastructure to let vfio_pci_core drivers trap device RESET.
> > 
> > The motivation for this is to let the underlying driver be aware that
> > a reset was done and set its internal state accordingly.
> 
> I think the intention of the uAPI here is that the migration error
> state is exited specifically via the reset ioctl.  Maybe that should be
> made more clear, but variant drivers can already wrap the core ioctl
> for the purpose of determining that mechanism of reset has occurred.

It is not just recovering the error state.

Any transition to reset changes the firmware state. E.g. if userspace
uses one of the other emulation paths to trigger the reset after
taking the device out of the running state, then the driver state and
FW state become desynchronized.

So all the reset paths need to be synchronized somehow, either blocked
while in non-running states or by aligning the SW state with the new
post-reset FW state.

Jason


* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-15 19:59     ` Jason Gunthorpe
@ 2021-10-15 20:12       ` Alex Williamson
  2021-10-15 20:16         ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-15 20:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Fri, 15 Oct 2021 16:59:37 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:
> > > +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> > > +				       u32 state)
> > > +{
> > > +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> > > +	u32 old_state = vmig->vfio_dev_state;
> > > +	int ret = 0;
> > > +
> > > +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> > > +		return -EINVAL;  
> > 
> > if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))  
> 
> AFAICT this macro doesn't do what is needed, eg
> 
> VFIO_DEVICE_STATE_VALID(0xF000) == true
> 
> What Yishai implemented is at least functionally correct - states this
> driver does not support are rejected.


if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state) || (state & ~VFIO_DEVICE_STATE_MASK))

old_state is controlled by the driver and can never have random bits
set; user state should be sanitized to prevent setting undefined bits.


> > > +	/* Running switches off */
> > > +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> > > +	    (state & VFIO_DEVICE_STATE_RUNNING) &&  
> > 
> > ((old_state ^ state) & VFIO_DEVICE_STATE_RUNNING) ?  
> 
> It is not functionally the same; xor only tells if the bit changed, it
> doesn't tell what the current value is, and this needs to know that it
> changed to 1.

That's why I inserted my comment after the "it changed" test and not
after the "and the old value was..." test below.

> > > +	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
> > > +		ret = mlx5vf_pci_quiesce_device(mvdev);
> > > +		if (ret)
> > > +			return ret;
> > > +		ret = mlx5vf_pci_freeze_device(mvdev);
> > > +		if (ret) {
> > > +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;  
> > 
> > 
> > No, the invalid states are specifically unreachable, the uAPI defines
> > the error state for this purpose.  
> 
> Indeed
> 
> > The states noted as invalid in the
> > uAPI should be considered reserved at this point.  If only there was a
> > macro to set an error state... ;)  
> 
> It should just assign a constant value; there is only one error state.

Fair enough.  Thanks,

Alex



* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-15 20:12       ` Alex Williamson
@ 2021-10-15 20:16         ` Jason Gunthorpe
  2021-10-15 20:59           ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2021-10-15 20:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Fri, Oct 15, 2021 at 02:12:01PM -0600, Alex Williamson wrote:
> On Fri, 15 Oct 2021 16:59:37 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:
> > > > +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> > > > +				       u32 state)
> > > > +{
> > > > +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> > > > +	u32 old_state = vmig->vfio_dev_state;
> > > > +	int ret = 0;
> > > > +
> > > > +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> > > > +		return -EINVAL;  
> > > 
> > > if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))  
> > 
> > AFAICT this macro doesn't do what is needed, eg
> > 
> > VFIO_DEVICE_STATE_VALID(0xF000) == true
> > 
> > What Yishai implemented is at least functionally correct - states this
> > driver does not support are rejected.
> 
> 
> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state) || (state & ~VFIO_DEVICE_STATE_MASK))
> 
> old_state is controlled by the driver and can never have random bits
> set; user state should be sanitized to prevent setting undefined bits.

In that instance let's just write

old_state != VFIO_DEVICE_STATE_ERROR

?

I'm happy to see some device specific mask selecting the bits it
supports.

> > > > +	/* Running switches off */
> > > > +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> > > > +	    (state & VFIO_DEVICE_STATE_RUNNING) &&  
> > > 
> > > ((old_state ^ state) & VFIO_DEVICE_STATE_RUNNING) ?  
> > 
> > It is not functionally the same; xor only tells if the bit changed, it
> > doesn't tell what the current value is, and this needs to know that it
> > changed to 1.
> 
> That's why I inserted my comment after the "it changed" test and not
> after the "and the old value was..." test below.

Oh, I see, it was not clear to me

Thanks,
Jason


* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-15 20:16         ` Jason Gunthorpe
@ 2021-10-15 20:59           ` Alex Williamson
  2021-10-17 14:03             ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-15 20:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Fri, 15 Oct 2021 17:16:54 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Oct 15, 2021 at 02:12:01PM -0600, Alex Williamson wrote:
> > On Fri, 15 Oct 2021 16:59:37 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:  
> > > > > +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> > > > > +				       u32 state)
> > > > > +{
> > > > > +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> > > > > +	u32 old_state = vmig->vfio_dev_state;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> > > > > +		return -EINVAL;    
> > > > 
> > > > if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))    
> > > 
> > > AFAICT this macro doesn't do what is needed, eg
> > > 
> > > VFIO_DEVICE_STATE_VALID(0xF000) == true
> > > 
> > > What Yishai implemented is at least functionally correct - states this
> > > driver does not support are rejected.  
> > 
> > 
> > if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state) || (state & ~VFIO_DEVICE_STATE_MASK))
> > 
> > old_state is controlled by the driver and can never have random bits
> > set; user state should be sanitized to prevent setting undefined bits.
> 
> In that instance let's just write
> 
> old_state != VFIO_DEVICE_STATE_ERROR
> 
> ?

Not quite, the user can't set any of the other invalid states
either.

> 
> I'm happy to see some device specific mask selecting the bits it
> supports.

There are currently no optional bits within the mask, but the
RESUME|RUNNING state is rather TBD.  I figured we'd use flags in the
region info to advertise additional feature bits when it comes to that.
Thanks,

Alex



* Re: [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET
  2021-10-15 20:03     ` Jason Gunthorpe
@ 2021-10-15 21:12       ` Alex Williamson
  2021-10-17 14:29         ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-15 21:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Fri, 15 Oct 2021 17:03:28 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Oct 15, 2021 at 01:52:37PM -0600, Alex Williamson wrote:
> > On Wed, 13 Oct 2021 12:47:06 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >   
> > > Add infrastructure to let vfio_pci_core drivers trap device RESET.
> > > 
> > > The motivation for this is to let the underlying driver be aware that
> > > a reset was done and set its internal state accordingly.
> > 
> > I think the intention of the uAPI here is that the migration error
> > state is exited specifically via the reset ioctl.  Maybe that should be
> > made more clear, but variant drivers can already wrap the core ioctl
> > for the purpose of determining that mechanism of reset has occurred.  
> 
> It is not just recovering the error state.
> 
> Any transition to reset changes the firmware state. E.g. if userspace
> uses one of the other emulation paths to trigger the reset after
> taking the device out of the running state, then the driver state and
> FW state become desynchronized.
> 
> So all the reset paths need to be synchronized somehow, either blocked
> while in non-running states or by aligning the SW state with the new
> post-reset FW state.

This only catches the two flavors of FLR and the RESET ioctl itself, so
we've got gaps relative to "all the reset paths" anyway.  I'm also
concerned about adding arbitrary callbacks for every case where it gets
too cumbersome to write a wrapper for the existing callbacks.

However, why is this a vfio thing when we have the
pci_error_handlers.reset_done callback.  At best this ought to be
redundant to that.  Thanks,

Alex
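
(For reference, the hook referred to above is the per-driver callback
declared in include/linux/pci.h; shown here elided, with a paraphrased
comment:

	struct pci_error_handlers {
		...
		/* called once a function reset has completed */
		void (*reset_done)(struct pci_dev *dev);
		...
	};

so a variant driver can observe resets without new vfio plumbing.)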



* Re: [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF
  2021-10-14 22:11   ` Alex Williamson
@ 2021-10-17 13:43     ` Yishai Hadas
  0 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-17 13:43 UTC (permalink / raw)
  To: Alex Williamson, bhelgaas
  Cc: jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro, kwankhede,
	mgurtovoy, maorg

On 10/15/2021 1:11 AM, Alex Williamson wrote:
> On Wed, 13 Oct 2021 12:46:58 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> From: Jason Gunthorpe <jgg@nvidia.com>
>>
>> There are some cases where a SRIOV VF driver will need to reach into and
>> interact with the PF driver. This requires accessing the drvdata of the PF.
>>
>> Provide a function pci_iov_get_pf_drvdata() to return this PF drvdata in a
>> safe way. Normally accessing a drvdata of a foreign struct device would be
>> done using the device_lock() to protect against device driver
>> probe()/remove() races.
>>
>> However, due to the design of pci_enable_sriov() this will result in an
>> ABBA deadlock on the device_lock as the PF's device_lock is held during PF
>> sriov_configure() while calling pci_enable_sriov() which in turn holds the
>> VF's device_lock while calling VF probe(), and similarly for remove.
>>
>> This means the VF driver can never obtain the PF's device_lock.
>>
>> Instead use the implicit locking created by pci_enable/disable_sriov(). A
>> VF driver can access its PF drvdata only while its own driver is attached,
>> and the PF driver can control access to its own drvdata based on when it
>> calls pci_enable/disable_sriov().
>>
>> To use this API the PF driver will setup the PF drvdata in the probe()
>> function. pci_enable_sriov() is only called from sriov_configure() which
>> cannot happen until probe() completes, ensuring no VF races with drvdata
>> setup.
>>
>> For removal, the PF driver must call pci_disable_sriov() in its remove
>> function before destroying any of the drvdata. This ensures that all VF
>> drivers are unbound before returning, fencing concurrent access to the
>> drvdata.
>>
>> The introduction of a new function to do this access makes clear the
>> special locking scheme and documents the requirements on the PF/VF
>> drivers using this.
>>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>   drivers/pci/iov.c   | 29 +++++++++++++++++++++++++++++
>>   include/linux/pci.h |  7 +++++++
>>   2 files changed, 36 insertions(+)
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index e7751fa3fe0b..ca696730f761 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -47,6 +47,35 @@ int pci_iov_vf_id(struct pci_dev *dev)
>>   }
>>   EXPORT_SYMBOL_GPL(pci_iov_vf_id);
>>   
>> +/**
>> + * pci_iov_get_pf_drvdata - Return the drvdata of a PF
>> + * @dev - VF pci_dev
>> + * @pf_driver - Device driver required to own the PF
>> + *
>> + * This must be called from a context that ensures that a VF driver is attached.
>> + * The value returned is invalid once the VF driver completes its remove()
>> + * callback.
>> + *
>> + * Locking is achieved by the driver core. A VF driver cannot be probed until
>> + * pci_enable_sriov() is called and pci_disable_sriov() does not return until
>> + * all VF drivers have completed their remove().
>> + *
>> + * The PF driver must call pci_disable_sriov() before it begins to destroy the
>> + * drvdata.
>> + */
>> +void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver)
>> +{
>> +	struct pci_dev *pf_dev;
>> +
>> +	if (dev->is_physfn)
>> +		return ERR_PTR(-EINVAL);
> I think we're trying to make this only accessible to VFs, so shouldn't
> we test (!dev->is_virtfn)?  is_physfn will be zero for either a PF with
> failed SR-IOV configuration or for a non-SR-IOV device afaict.  Thanks,
>
> Alex
>

Yes, this should be accessible only for VFs.

We can go with your suggestion to explicitly check (!dev->is_virtfn),
which seems cleaner and safer, as you mentioned.

We already got an ACK on this patch from Bjorn, but as your suggestion
seems straightforward I may keep the Acked-by as part of V2 in any case.

Yishai
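
For context, the pairing the commit message describes looks roughly like
this (an illustrative sketch with hypothetical pf_priv/init_priv/
destroy_priv helpers, not code from the patch):

/* PF driver: drvdata is set in probe(), before sriov_configure() can run */
static int pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct pf_priv *priv = init_priv(pdev);

	if (IS_ERR(priv))
		return PTR_ERR(priv);
	pci_set_drvdata(pdev, priv);
	return 0;
}

static void pf_remove(struct pci_dev *pdev)
{
	pci_disable_sriov(pdev);	/* fences all bound VF drivers */
	destroy_priv(pci_get_drvdata(pdev));
}

/* VF driver: the PF drvdata is valid only while this driver is bound */
static int vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct pf_priv *pf = pci_iov_get_pf_drvdata(pdev, &pf_driver);

	if (IS_ERR(pf))
		return PTR_ERR(pf);
	/* ... use pf safely until this driver's remove() ... */
	return 0;
}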


* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-15 20:59           ` Alex Williamson
@ 2021-10-17 14:03             ` Yishai Hadas
  2021-10-18 11:51               ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-17 14:03 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: bhelgaas, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On 10/15/2021 11:59 PM, Alex Williamson wrote:
> On Fri, 15 Oct 2021 17:16:54 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Fri, Oct 15, 2021 at 02:12:01PM -0600, Alex Williamson wrote:
>>> On Fri, 15 Oct 2021 16:59:37 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>    
>>>> On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:
>>>>>> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
>>>>>> +				       u32 state)
>>>>>> +{
>>>>>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
>>>>>> +	u32 old_state = vmig->vfio_dev_state;
>>>>>> +	int ret = 0;
>>>>>> +
>>>>>> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
>>>>>> +		return -EINVAL;
>>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))
>>>> AFAICT this macro doesn't do what is needed, eg
>>>>
>>>> VFIO_DEVICE_STATE_VALID(0xF000) == true
>>>>
>>>> What Yishai implemented is at least functionally correct - states this
>>>> driver does not support are rejected.
>>>
>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state) || (state & ~VFIO_DEVICE_STATE_MASK))
>>>
>>> old_state is controlled by the driver and can never have random bits
>>> set; user state should be sanitized to prevent setting undefined bits.
>> In that instance let's just write
>>
>> old_state != VFIO_DEVICE_STATE_ERROR
>>
>> ?
> Not quite, the user can't set any of the other invalid states
> either.


OK, so let's go with the below as you suggested.
if (!VFIO_DEVICE_STATE_VALID(old_state) ||
      !VFIO_DEVICE_STATE_VALID(state) ||
       (state & ~VFIO_DEVICE_STATE_MASK))
            return -EINVAL;

As it was suggested to have a new const for the ERROR state, to be used
by drivers when the state becomes an error, I may come in V2 with the
extra patch below.

Any comments on it?

commit cc7cb23773c70b998aaee5bfc2434da86c80b600
Author: Yishai Hadas <yishaih@nvidia.com>
Date:   Sun Oct 17 11:34:06 2021 +0300

     vfio: Add a const value for VFIO_DEVICE_STATE_ERROR

     Add a const value for VFIO_DEVICE_STATE_ERROR to be used by drivers to
     set an error state.

     Signed-off-by: Yishai Hadas <yishaih@nvidia.com>

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b53a9557884a..37376dadca5a 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -15,6 +15,8 @@
  #include <linux/poll.h>
  #include <uapi/linux/vfio.h>

+static const int VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
+ VFIO_DEVICE_STATE_RESUMING;


Yishai



* Re: [PATCH V1 mlx5-next 07/13] vfio: Add 'invalid' state definitions
  2021-10-15 16:38   ` Alex Williamson
@ 2021-10-17 14:07     ` Yishai Hadas
  0 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-17 14:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: bhelgaas, jgg, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On 10/15/2021 7:38 PM, Alex Williamson wrote:
> On Wed, 13 Oct 2021 12:47:01 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> Add 'invalid' state definition to be used by drivers to set/check
>> invalid state.
>>
>> In addition dropped the non complied macro VFIO_DEVICE_STATE_SET_ERROR
>> (i.e SATE instead of STATE) which seems unusable.
> s/non complied/non-compiled/
>
> We can certainly assume it's unused based on the typo, but removing it
> or fixing it should be a separate patch.


OK, I'll send a separate patch to fix the typo and leave it as-is for now.

>> Fixes: a8a24f3f6e38 ("vfio: UAPI for migration interface for device state")
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>> ---
>>   include/linux/vfio.h      | 5 +++++
>>   include/uapi/linux/vfio.h | 4 +---
>>   2 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index b53a9557884a..6a8cf6637333 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -252,4 +252,9 @@ extern int vfio_virqfd_enable(void *opaque,
>>   			      void *data, struct virqfd **pvirqfd, int fd);
>>   extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
>>   
>> +static inline bool vfio_is_state_invalid(u32 state)
>> +{
>> +	return state >= VFIO_DEVICE_STATE_INVALID;
>> +}
>
> Redundant, we already have !VFIO_DEVICE_STATE_VALID(state)
>
>

OK, I may drop this part.

Yishai




* Re: [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET
  2021-10-15 21:12       ` Alex Williamson
@ 2021-10-17 14:29         ` Yishai Hadas
  2021-10-18 12:02           ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-17 14:29 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: bhelgaas, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On 10/16/2021 12:12 AM, Alex Williamson wrote:
> On Fri, 15 Oct 2021 17:03:28 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Fri, Oct 15, 2021 at 01:52:37PM -0600, Alex Williamson wrote:
>>> On Wed, 13 Oct 2021 12:47:06 +0300
>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>    
>>>> Add infrastructure to let vfio_pci_core drivers trap device RESET.
>>>>
>>>> The motivation for this is to let the underlying driver be aware that
>>>> a reset was done and set its internal state accordingly.
>>> I think the intention of the uAPI here is that the migration error
>>> state is exited specifically via the reset ioctl.  Maybe that should be
>>> made more clear, but variant drivers can already wrap the core ioctl
>>> for the purpose of determining that mechanism of reset has occurred.
>> It is not just recovering the error state.
>>
>> Any transition to reset changes the firmware state. E.g. if userspace
>> uses one of the other emulation paths to trigger the reset after
>> taking the device out of the running state, then the driver state and
>> FW state become desynchronized.
>>
>> So all the reset paths need to be synchronized somehow, either blocked
>> while in non-running states or by aligning the SW state with the new
>> post-reset FW state.
> This only catches the two flavors of FLR and the RESET ioctl itself, so
> we've got gaps relative to "all the reset paths" anyway.  I'm also
> concerned about adding arbitrary callbacks for every case where it gets
> too cumbersome to write a wrapper for the existing callbacks.
>
> However, why is this a vfio thing when we have the
> pci_error_handlers.reset_done callback.  At best this ought to be
> redundant to that.  Thanks,
>
> Alex
>
Alex,

How about the below patch instead?

This will centralize the 'reset_done' notifications for drivers to one
place (i.e. pci_error_handlers.reset_done) and may close the gap that
you pointed out.

I just followed the logic in vfio_pci_aer_err_detected() from a usage
and locking point of view.

Do we really need to take the &vdev->igate mutex as was done there?

The next mlx5 patch in the series will stay as in V1; it may just set
its ops and be called upon PCI 'reset_done'.


diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index e581a327f90d..20bf37c00fb6 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1925,6 +1925,27 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
         return PCI_ERS_RESULT_CAN_RECOVER;
  }

+static void vfio_pci_aer_err_reset_done(struct pci_dev *pdev)
+{
+       struct vfio_pci_core_device *vdev;
+       struct vfio_device *device;
+
+       device = vfio_device_get_from_dev(&pdev->dev);
+       if (device == NULL)
+               return;
+
+       vdev = container_of(device, struct vfio_pci_core_device, vdev);
+
+       mutex_lock(&vdev->igate);
+       if (vdev->ops && vdev->ops->reset_done)
+               vdev->ops->reset_done(vdev);
+       mutex_unlock(&vdev->igate);
+
+       vfio_device_put(device);
+
+       return;
+}
+
  int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
  {
         struct vfio_device *device;
@@ -1947,6 +1968,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);

  const struct pci_error_handlers vfio_pci_core_err_handlers = {
         .error_detected = vfio_pci_aer_err_detected,
+       .reset_done = vfio_pci_aer_err_reset_done,
  };
  EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);

diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index ef9a44b6cf5d..6ccf5824f098 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -95,6 +95,15 @@ struct vfio_pci_mmap_vma {
         struct list_head        vma_next;
  };

+/**
+ * struct vfio_pci_core_device_ops - VFIO PCI driver device callbacks
+ *
+ * @reset_done: Called when the device was reset
+ */
+struct vfio_pci_core_device_ops {
+       void    (*reset_done)(struct vfio_pci_core_device *vdev);
+};
+
  struct vfio_pci_core_device {
         struct vfio_device      vdev;
         struct pci_dev          *pdev;
@@ -137,6 +146,7 @@ struct vfio_pci_core_device {
         struct mutex            vma_lock;
         struct list_head        vma_list;
         struct rw_semaphore     memory_lock;
+       const struct vfio_pci_core_device_ops *ops;
  };






* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-17 14:03             ` Yishai Hadas
@ 2021-10-18 11:51               ` Jason Gunthorpe
  2021-10-18 13:26                 ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2021-10-18 11:51 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Alex Williamson, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Sun, Oct 17, 2021 at 05:03:28PM +0300, Yishai Hadas wrote:
> On 10/15/2021 11:59 PM, Alex Williamson wrote:
> > On Fri, 15 Oct 2021 17:16:54 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Fri, Oct 15, 2021 at 02:12:01PM -0600, Alex Williamson wrote:
> > > > On Fri, 15 Oct 2021 16:59:37 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > > > On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:
> > > > > > > +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> > > > > > > +				       u32 state)
> > > > > > > +{
> > > > > > > +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> > > > > > > +	u32 old_state = vmig->vfio_dev_state;
> > > > > > > +	int ret = 0;
> > > > > > > +
> > > > > > > +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> > > > > > > +		return -EINVAL;
> > > > > > if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))
> > > > > AFAICT this macro doesn't do what is needed, eg
> > > > > 
> > > > > VFIO_DEVICE_STATE_VALID(0xF000) == true
> > > > > 
> > > > > What Yishai implemented is at least functionally correct - states this
> > > > > driver does not support are rejected.
> > > > 
> > > > if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state) || (state & ~VFIO_DEVICE_STATE_MASK))
> > > > 
> > > > old_state is controlled by the driver and can never have random bits
> > > > set; user state should be sanitized to prevent setting undefined bits.
> > > In that instance let's just write
> > > 
> > > old_state != VFIO_DEVICE_STATE_ERROR
> > > 
> > > ?
> > Not quite, the user can't set either of the other invalid states
> > either.
> 
> 
> OK, so let's go with the below as you suggested.
> if (!VFIO_DEVICE_STATE_VALID(old_state) ||
>      !VFIO_DEVICE_STATE_VALID(state) ||
>       (state & ~VFIO_DEVICE_STATE_MASK))
>            return -EINVAL;

This is my preference:

if (vmig->vfio_dev_state == VFIO_DEVICE_STATE_ERROR ||
    !vfio_device_state_valid(state) ||
    (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))


> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index b53a9557884a..37376dadca5a 100644
> +++ b/include/linux/vfio.h
> @@ -15,6 +15,8 @@
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> 
> +static const int VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
> + VFIO_DEVICE_STATE_RESUMING;

Do not put static variables in header files

Jason


* Re: [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET
  2021-10-17 14:29         ` Yishai Hadas
@ 2021-10-18 12:02           ` Jason Gunthorpe
  2021-10-18 13:41             ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2021-10-18 12:02 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Alex Williamson, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Sun, Oct 17, 2021 at 05:29:39PM +0300, Yishai Hadas wrote:
> On 10/16/2021 12:12 AM, Alex Williamson wrote:
> > On Fri, 15 Oct 2021 17:03:28 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Fri, Oct 15, 2021 at 01:52:37PM -0600, Alex Williamson wrote:
> > > > On Wed, 13 Oct 2021 12:47:06 +0300
> > > > Yishai Hadas <yishaih@nvidia.com> wrote:
> > > > > Add infrastructure to let vfio_pci_core drivers trap device RESET.
> > > > > 
> > > > > The motivation for this is to let the underlying driver be aware that
> > > > > a reset was done and set its internal state accordingly.
> > > > I think the intention of the uAPI here is that the migration error
> > > > state is exited specifically via the reset ioctl.  Maybe that should be
> > > > made more clear, but variant drivers can already wrap the core ioctl
> > > > for the purpose of determining that mechanism of reset has occurred.
> > > It is not just recovering the error state.
> > > 
> > > Any transition to reset changes the firmware state. E.g. if userspace
> > > uses one of the other emulation paths to trigger the reset after
> > > taking the device out of the running state, then the driver state and
> > > FW state become desynchronized.
> > > 
> > > So all the reset paths need to be synchronized somehow, either blocked
> > > while in non-running states or by aligning the SW state with the new
> > > post-reset FW state.
> > This only catches the two flavors of FLR and the RESET ioctl itself, so
> > we've got gaps relative to "all the reset paths" anyway.  I'm also
> > concerned about adding arbitrary callbacks for every case where it gets
> > too cumbersome to write a wrapper for the existing callbacks.
> > 
> > However, why is this a vfio thing when we have the
> > pci_error_handlers.reset_done callback.  At best this ought to be
> > redundant to that.  Thanks,
> > 
> > Alex
> > 
> Alex,
> 
> How about the below patch instead?
> 
> This will centralize the 'reset_done' notifications for drivers to one
> place (i.e. pci_error_handlers.reset_done) and may close the gap that
> you pointed out.
> 
> I just followed the logic in vfio_pci_aer_err_detected() from a usage
> and locking point of view.
> 
> Do we really need to take the &vdev->igate mutex as was done there?
> 
> The next mlx5 patch in the series will stay as in V1; it may just set
> its ops and be called upon PCI 'reset_done'.
> 
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index e581a327f90d..20bf37c00fb6 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1925,6 +1925,27 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>         return PCI_ERS_RESULT_CAN_RECOVER;
>  }
> 
> +static void vfio_pci_aer_err_reset_done(struct pci_dev *pdev)
> +{
> +       struct vfio_pci_core_device *vdev;
> +       struct vfio_device *device;
> +
> +       device = vfio_device_get_from_dev(&pdev->dev);
> +       if (device == NULL)
> +               return;

Do not add new vfio_device_get_from_dev() calls, this should extract
it from the pci_get_drvdata.
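
I.e. roughly (a sketch; this assumes the variant driver stores the
vfio_pci_core_device, or a structure embedding it as its first member,
as drvdata in probe(), the way mlx5vf does):

static void vfio_pci_core_reset_done(struct pci_dev *pdev)
{
	struct vfio_pci_core_device *vdev = pci_get_drvdata(pdev);

	if (vdev->ops && vdev->ops->reset_done)
		vdev->ops->reset_done(vdev);
}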

> +
> +       vdev = container_of(device, struct vfio_pci_core_device, vdev);
> +
> +       mutex_lock(&vdev->igate);
> +       if (vdev->ops && vdev->ops->reset_done)
> +               vdev->ops->reset_done(vdev);
> +       mutex_unlock(&vdev->igate);
> +
> +       vfio_device_put(device);
> +
> +       return;
> +}
> +
>  int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
>  {
>         struct vfio_device *device;
> @@ -1947,6 +1968,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
> 
>  const struct pci_error_handlers vfio_pci_core_err_handlers = {
>         .error_detected = vfio_pci_aer_err_detected,
> +       .reset_done = vfio_pci_aer_err_reset_done,
>  };
>  EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);

Most likely mlx5vf should just implement a pci_error_handlers struct
and install vfio_pci_aer_err_detected in it.
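
Something along these lines (illustrative; it assumes the core exports
its error_detected handler for reuse):

static const struct pci_error_handlers mlx5vf_err_handlers = {
	.reset_done = mlx5vf_reset_done,
	.error_detected = vfio_pci_aer_err_detected,
};

static struct pci_driver mlx5vf_pci_driver = {
	...
	.err_handler = &mlx5vf_err_handlers,
};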

Jason


* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-18 11:51               ` Jason Gunthorpe
@ 2021-10-18 13:26                 ` Yishai Hadas
  2021-10-18 13:42                   ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-18 13:26 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: bhelgaas, saeedm, linux-pci, kvm, netdev, kuba, leonro,
	kwankhede, mgurtovoy, maorg

On 10/18/2021 2:51 PM, Jason Gunthorpe wrote:
> On Sun, Oct 17, 2021 at 05:03:28PM +0300, Yishai Hadas wrote:
>> On 10/15/2021 11:59 PM, Alex Williamson wrote:
>>> On Fri, 15 Oct 2021 17:16:54 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>
>>>> On Fri, Oct 15, 2021 at 02:12:01PM -0600, Alex Williamson wrote:
>>>>> On Fri, 15 Oct 2021 16:59:37 -0300
>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>>>> On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:
>>>>>>>> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
>>>>>>>> +				       u32 state)
>>>>>>>> +{
>>>>>>>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
>>>>>>>> +	u32 old_state = vmig->vfio_dev_state;
>>>>>>>> +	int ret = 0;
>>>>>>>> +
>>>>>>>> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
>>>>>>>> +		return -EINVAL;
>>>>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))
>>>>>> AFAICT this macro doesn't do what is needed, eg
>>>>>>
>>>>>> VFIO_DEVICE_STATE_VALID(0xF000) == true
>>>>>>
>>>>>> What Yishai implemented is at least functionally correct - states this
>>>>>> driver does not support are rejected.
>>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state) || (state & ~VFIO_DEVICE_STATE_MASK))
>>>>>
>>>>> old_state is controlled by the driver and can never have random bits
>>>>> set; user state should be sanitized to prevent setting undefined bits.
>>>> In that instance let's just write
>>>>
>>>> old_state != VFIO_DEVICE_STATE_ERROR
>>>>
>>>> ?
>>> Not quite, the user can't set any of the other invalid states
>>> either.
>>
>> OK, so let's go with the below as you suggested.
>> if (!VFIO_DEVICE_STATE_VALID(old_state) ||
>>       !VFIO_DEVICE_STATE_VALID(state) ||
>>        (state & ~VFIO_DEVICE_STATE_MASK))
>>             return -EINVAL;
> This is my preference:
>
> if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_ERROR ||
>      !vfio_device_state_valid(state) ||
>      (state & !MLX5VF_SUPPORTED_DEVICE_STATES))
>

OK, let's go with this approach, which also enforces what the driver 
supports.

We can go with the below, after making it accurate and complete.

enum {
     MLX5VF_SUPPORTED_DEVICE_STATES = VFIO_DEVICE_STATE_RUNNING |
                                      VFIO_DEVICE_STATE_SAVING |
                                      VFIO_DEVICE_STATE_RESUMING,
};

if (old_state == VFIO_DEVICE_STATE_ERROR ||
     !vfio_device_state_valid(state) ||
     (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))
           return -EINVAL;

>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index b53a9557884a..37376dadca5a 100644
>> +++ b/include/linux/vfio.h
>> @@ -15,6 +15,8 @@
>>   #include <linux/poll.h>
>>   #include <uapi/linux/vfio.h>
>>
>> +static const int VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
>> + VFIO_DEVICE_STATE_RESUMING;
> Do not put static variables in header files
>
> Jason

OK, we can come up with an enum instead.

enum {
	VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
				  VFIO_DEVICE_STATE_RESUMING,
};

Alex,

Do you prefer to put it under include/uapi/vfio.h, or can it go 
under include/linux/vfio.h for internal driver usage?

Yishai


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET
  2021-10-18 12:02           ` Jason Gunthorpe
@ 2021-10-18 13:41             ` Yishai Hadas
  0 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-18 13:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On 10/18/2021 3:02 PM, Jason Gunthorpe wrote:
> On Sun, Oct 17, 2021 at 05:29:39PM +0300, Yishai Hadas wrote:
>> On 10/16/2021 12:12 AM, Alex Williamson wrote:
>>> On Fri, 15 Oct 2021 17:03:28 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>
>>>> On Fri, Oct 15, 2021 at 01:52:37PM -0600, Alex Williamson wrote:
>>>>> On Wed, 13 Oct 2021 12:47:06 +0300
>>>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>>>> Add infrastructure to let vfio_pci_core drivers trap device RESET.
>>>>>>
>>>>>> The motivation for this is to let the underlay driver be aware that
>>>>>> reset was done and set its internal state accordingly.
>>>>> I think the intention of the uAPI here is that the migration error
>>>>> state is exited specifically via the reset ioctl.  Maybe that should be
>>>>> made more clear, but variant drivers can already wrap the core ioctl
>>>>> for the purpose of determining that mechanism of reset has occurred.
>>>> It is not just recovering the error state.
>>>>
>>>> Any transition to reset changes the firmware state. Eg if userspace
>>>> uses one of the other emulation paths to trigger the reset after
>>>> putting the device off running then the driver state and FW state
>>>> become desynchronized.
>>>>
>>>> So all the reset paths need to be synchronized some how, either
>>>> blocked while in non-running states or aligning the SW state with the
>>>> new post-reset FW state.
>>> This only catches the two flavors of FLR and the RESET ioctl itself, so
>>> we've got gaps relative to "all the reset paths" anyway.  I'm also
>>> concerned about adding arbitrary callbacks for every case that it gets
>>> too cumbersome to write a wrapper for the existing callbacks.
>>>
>>> However, why is this a vfio thing when we have the
>>> pci_error_handlers.reset_done callback.  At best this ought to be
>>> redundant to that.  Thanks,
>>>
>>> Alex
>>>
>> Alex,
>>
>> How about the below patch instead ?
>>
>> This will centralize the 'reset_done' notifications for drivers in one place
>> (i.e. pci_error_handlers.reset_done) and may close the gap that you pointed
>> out.
>>
>> I just followed the logic in vfio_pci_aer_err_detected() from a usage and
>> locking point of view.
>>
>> Do we really need to take the &vdev->igate mutex as was done there?
>>
>> The next patch in the series, for mlx5, will stay as in V1; it may just
>> set its ops and be called upon PCI 'reset_done'.
>>
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c
>> b/drivers/vfio/pci/vfio_pci_core.c
>> index e581a327f90d..20bf37c00fb6 100644
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1925,6 +1925,27 @@ static pci_ers_result_t
>> vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>          return PCI_ERS_RESULT_CAN_RECOVER;
>>   }
>>
>> +static void vfio_pci_aer_err_reset_done(struct pci_dev *pdev)
>> +{
>> +       struct vfio_pci_core_device *vdev;
>> +       struct vfio_device *device;
>> +
>> +       device = vfio_device_get_from_dev(&pdev->dev);
>> +       if (device == NULL)
>> +               return;
> Do not add new vfio_device_get_from_dev() calls, this should extract
> it from the pci_get_drvdata.
>
>> +
>> +       vdev = container_of(device, struct vfio_pci_core_device, vdev);
>> +
>> +       mutex_lock(&vdev->igate);
>> +       if (vdev->ops && vdev->ops->reset_done)
>> +               vdev->ops->reset_done(vdev);
>> +       mutex_unlock(&vdev->igate);
>> +
>> +       vfio_device_put(device);
>> +
>> +       return;
>> +}
>> +
>>   int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
>>   {
>>          struct vfio_device *device;
>> @@ -1947,6 +1968,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
>>
>>   const struct pci_error_handlers vfio_pci_core_err_handlers = {
>>          .error_detected = vfio_pci_aer_err_detected,
>> +       .reset_done = vfio_pci_aer_err_reset_done,
>>   };
>>   EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
> Most likely mlx5vf should just implement a pci_error_handlers struct
> and install vfio_pci_aer_err_detected in it.
>
> Jason

This can work as well.

It removes the need to set extra ops on vfio_pci_core_device; the 
reset will go directly to the mlx5 driver.
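
A rough sketch of that direction, reusing the helpers from this patch
series (function and struct names here are hypothetical, and it assumes
vfio_pci_core exports its error_detected handler for reuse):

	static void mlx5vf_pci_aer_reset_done(struct pci_dev *pdev)
	{
		struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);

		if (!mvdev->migrate_cap)
			return;

		/* FW was reset, align the SW state with it */
		mutex_lock(&mvdev->state_mutex);
		mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;
		mlx5vf_reset_mig_state(mvdev);
		mutex_unlock(&mvdev->state_mutex);
	}

	static const struct pci_error_handlers mlx5vf_err_handlers = {
		.reset_done = mlx5vf_pci_aer_reset_done,
		.error_detected = vfio_pci_aer_err_detected,
	};

and then mlx5vf_pci_driver.err_handler would point at mlx5vf_err_handlers
instead of vfio_pci_core_err_handlers.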

I plan to follow that in the coming V2.

Yishai


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-18 13:26                 ` Yishai Hadas
@ 2021-10-18 13:42                   ` Alex Williamson
  2021-10-18 13:46                     ` Yishai Hadas
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-10-18 13:42 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Jason Gunthorpe, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On Mon, 18 Oct 2021 16:26:16 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 10/18/2021 2:51 PM, Jason Gunthorpe wrote:
> > On Sun, Oct 17, 2021 at 05:03:28PM +0300, Yishai Hadas wrote:  
> >> On 10/15/2021 11:59 PM, Alex Williamson wrote:  
> >>> On Fri, 15 Oct 2021 17:16:54 -0300
> >>> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>>  
> >>>> On Fri, Oct 15, 2021 at 02:12:01PM -0600, Alex Williamson wrote:  
> >>>>> On Fri, 15 Oct 2021 16:59:37 -0300
> >>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:  
> >>>>>> On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:  
> >>>>>>>> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> >>>>>>>> +				       u32 state)
> >>>>>>>> +{
> >>>>>>>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> >>>>>>>> +	u32 old_state = vmig->vfio_dev_state;
> >>>>>>>> +	int ret = 0;
> >>>>>>>> +
> >>>>>>>> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> >>>>>>>> +		return -EINVAL;  
> >>>>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))  
> >>>>>> AFAICT this macro doesn't do what is needed, eg
> >>>>>>
> >>>>>> VFIO_DEVICE_STATE_VALID(0xF000) == true
> >>>>>>
> >>>>>> What Yishai implemented is at least functionally correct - states this
> >>>>>> driver does not support are rejected.  
> >>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state)) || (state & ~VFIO_DEVICE_STATE_MASK))
> >>>>>
> >>>>> old_state is controlled by the driver and can never have random bits
> >>>>> set, user state should be sanitized to prevent setting undefined bits.  
> >>>> In that instance let's just write
> >>>>
> >>>> old_state != VFIO_DEVICE_STATE_ERROR
> >>>>
> >>>> ?  
> >>> Not quite, the user can't set either of the other invalid states
> >>> either.  
> >>
> >> OK so let's go with below as you suggested.
> >> if (!VFIO_DEVICE_STATE_VALID(old_state) ||
> >>       !VFIO_DEVICE_STATE_VALID(state) ||
> >>        (state & ~VFIO_DEVICE_STATE_MASK))
> >>               
> > This is my preference:
> >
> > if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_ERROR ||
> >      !vfio_device_state_valid(state) ||
> >      (state & !MLX5VF_SUPPORTED_DEVICE_STATES))
> >  
> 
> OK, let's go with this approach, which also enforces what the driver
> supports.
> 
> We can go with the below, after making it accurate and complete.
> 
> enum {
>      MLX5VF_SUPPORTED_DEVICE_STATES = VFIO_DEVICE_STATE_RUNNING |
>                                       VFIO_DEVICE_STATE_SAVING |
>                                       VFIO_DEVICE_STATE_RESUMING,
> };
> 
> if (old_state == VFIO_DEVICE_STATE_ERROR ||
>      !vfio_device_state_valid(state) ||
>      (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))
>            return -EINVAL;
> 
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index b53a9557884a..37376dadca5a 100644
> >> +++ b/include/linux/vfio.h
> >> @@ -15,6 +15,8 @@
> >>   #include <linux/poll.h>
> >>   #include <uapi/linux/vfio.h>
> >>
> >> +static const int VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
> >> + VFIO_DEVICE_STATE_RESUMING;  
> > Do not put static variables in header files
> >
> > Jason  
> 
> OK, we can come up with an enum instead.
> 
> enum {
> 	VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
> 				  VFIO_DEVICE_STATE_RESUMING,
> };
> 
> Alex,
> 
> Do you prefer to put it under include/uapi/vfio.h, or can it go
> under include/linux/vfio.h for internal driver usage?

I don't understand why this wouldn't just be a continuation of the
#defines in the uapi header.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-18 13:42                   ` Alex Williamson
@ 2021-10-18 13:46                     ` Yishai Hadas
  0 siblings, 0 replies; 44+ messages in thread
From: Yishai Hadas @ 2021-10-18 13:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, bhelgaas, saeedm, linux-pci, kvm, netdev, kuba,
	leonro, kwankhede, mgurtovoy, maorg

On 10/18/2021 4:42 PM, Alex Williamson wrote:
> On Mon, 18 Oct 2021 16:26:16 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> On 10/18/2021 2:51 PM, Jason Gunthorpe wrote:
>>> On Sun, Oct 17, 2021 at 05:03:28PM +0300, Yishai Hadas wrote:
>>>> On 10/15/2021 11:59 PM, Alex Williamson wrote:
>>>>> On Fri, 15 Oct 2021 17:16:54 -0300
>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>>>   
>>>>>> On Fri, Oct 15, 2021 at 02:12:01PM -0600, Alex Williamson wrote:
>>>>>>> On Fri, 15 Oct 2021 16:59:37 -0300
>>>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>>>>>> On Fri, Oct 15, 2021 at 01:48:20PM -0600, Alex Williamson wrote:
>>>>>>>>>> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
>>>>>>>>>> +				       u32 state)
>>>>>>>>>> +{
>>>>>>>>>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
>>>>>>>>>> +	u32 old_state = vmig->vfio_dev_state;
>>>>>>>>>> +	int ret = 0;
>>>>>>>>>> +
>>>>>>>>>> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
>>>>>>>>>> +		return -EINVAL;
>>>>>>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state))
>>>>>>>> AFAICT this macro doesn't do what is needed, eg
>>>>>>>>
>>>>>>>> VFIO_DEVICE_STATE_VALID(0xF000) == true
>>>>>>>>
>>>>>>>> What Yishai implemented is at least functionally correct - states this
>>>>>>>> driver does not support are rejected.
>>>>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) || !VFIO_DEVICE_STATE_VALID(state)) || (state & ~VFIO_DEVICE_STATE_MASK))
>>>>>>>
>>>>>>> old_state is controlled by the driver and can never have random bits
>>>>>>> set, user state should be sanitized to prevent setting undefined bits.
>>>>>> In that instance let's just write
>>>>>>
>>>>>> old_state != VFIO_DEVICE_STATE_ERROR
>>>>>>
>>>>>> ?
>>>>> Not quite, the user can't set either of the other invalid states
>>>>> either.
>>>> OK so let's go with below as you suggested.
>>>> if (!VFIO_DEVICE_STATE_VALID(old_state) ||
>>>>        !VFIO_DEVICE_STATE_VALID(state) ||
>>>>         (state & ~VFIO_DEVICE_STATE_MASK))
>>>>                
>>> This is my preference:
>>>
>>> if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_ERROR ||
>>>       !vfio_device_state_valid(state) ||
>>>       (state & !MLX5VF_SUPPORTED_DEVICE_STATES))
>>>   
>> OK, let's go with this approach, which also enforces what the driver
>> supports.
>>
>> We can go with the below, after making it accurate and complete.
>>
>> enum {
>>       MLX5VF_SUPPORTED_DEVICE_STATES = VFIO_DEVICE_STATE_RUNNING |
>>                                        VFIO_DEVICE_STATE_SAVING |
>>                                        VFIO_DEVICE_STATE_RESUMING,
>> };
>>
>> if (old_state == VFIO_DEVICE_STATE_ERROR ||
>>       !vfio_device_state_valid(state) ||
>>       (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))
>>             return -EINVAL;
>>
>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>> index b53a9557884a..37376dadca5a 100644
>>>> +++ b/include/linux/vfio.h
>>>> @@ -15,6 +15,8 @@
>>>>    #include <linux/poll.h>
>>>>    #include <uapi/linux/vfio.h>
>>>>
>>>> +static const int VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
>>>> + VFIO_DEVICE_STATE_RESUMING;
>>> Do not put static variables in header files
>>>
>>> Jason
>> OK, we can come up with an enum instead.
>>
>> enum {
>> 	VFIO_DEVICE_STATE_ERROR = VFIO_DEVICE_STATE_SAVING |
>> 				  VFIO_DEVICE_STATE_RESUMING,
>> };
>>
>> Alex,
>>
>> Do you prefer to put it under include/uapi/vfio.h, or can it go
>> under include/linux/vfio.h for internal driver usage?
> I don't understand why this wouldn't just be a continuation of the
> #defines in the uapi header.  Thanks,
>
> Alex
>

Sure, let's go with this.
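
i.e., roughly the following (a sketch), next to the existing
VFIO_DEVICE_STATE_* defines in include/uapi/linux/vfio.h:

	#define VFIO_DEVICE_STATE_ERROR	(VFIO_DEVICE_STATE_SAVING | \
						 VFIO_DEVICE_STATE_RESUMING)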

Thanks,
Yishai

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-13  9:47 ` [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
  2021-10-15 19:48   ` Alex Williamson
@ 2021-10-19  9:59   ` Shameerali Kolothum Thodi
  2021-10-19 10:30     ` Yishai Hadas
  2021-10-19 11:24     ` Jason Gunthorpe
  1 sibling, 2 replies; 44+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-10-19  9:59 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy, maorg



> -----Original Message-----
> From: Yishai Hadas [mailto:yishaih@nvidia.com]
> Sent: 13 October 2021 10:47
> To: alex.williamson@redhat.com; bhelgaas@google.com; jgg@nvidia.com;
> saeedm@nvidia.com
> Cc: linux-pci@vger.kernel.org; kvm@vger.kernel.org; netdev@vger.kernel.org;
> kuba@kernel.org; leonro@nvidia.com; kwankhede@nvidia.com;
> mgurtovoy@nvidia.com; yishaih@nvidia.com; maorg@nvidia.com
> Subject: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for
> mlx5 devices
> 
> This patch adds support for vfio_pci driver for mlx5 devices.
> 
> It uses vfio_pci_core to register to the VFIO subsystem and then
> implements the mlx5 specific logic in the migration area.
> 
> The migration implementation follows the definition from uapi/vfio.h and
> uses the mlx5 VF->PF command channel to achieve it.
> 
> This patch implements the suspend/resume flows.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  MAINTAINERS                    |   6 +
>  drivers/vfio/pci/Kconfig       |   3 +
>  drivers/vfio/pci/Makefile      |   2 +
>  drivers/vfio/pci/mlx5/Kconfig  |  11 +
>  drivers/vfio/pci/mlx5/Makefile |   4 +
>  drivers/vfio/pci/mlx5/main.c   | 692 +++++++++++++++++++++++++++++++++
>  6 files changed, 718 insertions(+)
>  create mode 100644 drivers/vfio/pci/mlx5/Kconfig
>  create mode 100644 drivers/vfio/pci/mlx5/Makefile
>  create mode 100644 drivers/vfio/pci/mlx5/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index abdcbcfef73d..e824bfab4a01 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -19699,6 +19699,12 @@ L:	kvm@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vfio/platform/
> 
> +VFIO MLX5 PCI DRIVER
> +M:	Yishai Hadas <yishaih@nvidia.com>
> +L:	kvm@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vfio/pci/mlx5/
> +
>  VGA_SWITCHEROO
>  R:	Lukas Wunner <lukas@wunner.de>
>  S:	Maintained
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 860424ccda1b..187b9c259944 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -43,4 +43,7 @@ config VFIO_PCI_IGD
> 
>  	  To enable Intel IGD assignment through vfio-pci, say Y.
>  endif
> +
> +source "drivers/vfio/pci/mlx5/Kconfig"
> +
>  endif
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 349d68d242b4..ed9d6f2e0555 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>  vfio-pci-y := vfio_pci.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> +
> +obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
> diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
> new file mode 100644
> index 000000000000..a3ce00add4fe
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config MLX5_VFIO_PCI
> +	tristate "VFIO support for MLX5 PCI devices"
> +	depends on MLX5_CORE
> +	select VFIO_PCI_CORE
> +	help
> +	  This provides PCI support for MLX5 devices using the VFIO
> +	  framework. The device specific driver supports suspend/resume
> +	  of the MLX5 device.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
> new file mode 100644
> index 000000000000..689627da7ff5
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
> +mlx5-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> new file mode 100644
> index 000000000000..e36302b444a6
> --- /dev/null
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -0,0 +1,692 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#include "cmd.h"
> +
> +enum {
> +	MLX5VF_PCI_FREEZED = 1 << 0,
> +};
> +
> +enum {
> +	MLX5VF_REGION_PENDING_BYTES = 1 << 0,
> +	MLX5VF_REGION_DATA_SIZE = 1 << 1,
> +};
> +
> +#define MLX5VF_MIG_REGION_DATA_SIZE SZ_128K
> +/* Data section offset from migration region */
> +#define MLX5VF_MIG_REGION_DATA_OFFSET \
> +	(sizeof(struct vfio_device_migration_info))
> +
> +#define VFIO_DEVICE_MIGRATION_OFFSET(x) \
> +	(offsetof(struct vfio_device_migration_info, x))
> +
> +struct mlx5vf_pci_migration_info {
> +	u32 vfio_dev_state; /* VFIO_DEVICE_STATE_XXX */
> +	u32 dev_state; /* device migration state */
> +	u32 region_state; /* Use MLX5VF_REGION_XXX */
> +	u16 vhca_id;
> +	struct mlx5_vhca_state_data vhca_state_data;
> +};
> +
> +struct mlx5vf_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	u8 migrate_cap:1;
> +	/* protect migration state */
> +	struct mutex state_mutex;
> +	struct mlx5vf_pci_migration_info vmig;
> +};
> +
> +static int mlx5vf_pci_unquiesce_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	return mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
> +				      mvdev->vmig.vhca_id,
> +				      MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
> +}
> +
> +static int mlx5vf_pci_quiesce_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	return mlx5vf_cmd_suspend_vhca(
> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
> +}
> +
> +static int mlx5vf_pci_unfreeze_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	int ret;
> +
> +	ret = mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
> +				     mvdev->vmig.vhca_id,
> +				     MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
> +	if (ret)
> +		return ret;
> +
> +	mvdev->vmig.dev_state &= ~MLX5VF_PCI_FREEZED;
> +	return 0;
> +}
> +
> +static int mlx5vf_pci_freeze_device(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	int ret;
> +
> +	ret = mlx5vf_cmd_suspend_vhca(
> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
> +	if (ret)
> +		return ret;
> +
> +	mvdev->vmig.dev_state |= MLX5VF_PCI_FREEZED;
> +	return 0;
> +}
> +
> +static int mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	u32 state_size = 0;
> +	int ret;
> +
> +	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED))
> +		return -EFAULT;
> +
> +	/* If we already read state no reason to re-read */
> +	if (mvdev->vmig.vhca_state_data.state_size)
> +		return 0;
> +
> +	ret = mlx5vf_cmd_query_vhca_migration_state(
> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id, &state_size);
> +	if (ret)
> +		return ret;
> +
> +	return mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
> +					  mvdev->vmig.vhca_id, state_size,
> +					  &mvdev->vmig.vhca_state_data);
> +}
> +
> +static int mlx5vf_pci_new_write_window(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
> +	u32 num_pages_needed;
> +	u64 allocated_ready;
> +	u32 bytes_needed;
> +
> +	/* Check how many bytes are available from previous flows */
> +	WARN_ON(state_data->num_pages * PAGE_SIZE <
> +		state_data->win_start_offset);
> +	allocated_ready = (state_data->num_pages * PAGE_SIZE) -
> +			  state_data->win_start_offset;
> +	WARN_ON(allocated_ready > MLX5VF_MIG_REGION_DATA_SIZE);
> +
> +	bytes_needed = MLX5VF_MIG_REGION_DATA_SIZE - allocated_ready;
> +	if (!bytes_needed)
> +		return 0;
> +
> +	num_pages_needed = DIV_ROUND_UP_ULL(bytes_needed, PAGE_SIZE);
> +	return mlx5vf_add_migration_pages(state_data, num_pages_needed);
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_data_size(struct mlx5vf_pci_core_device *mvdev,
> +				      char __user *buf, bool iswrite)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +	u64 data_size;
> +	int ret;
> +
> +	if (iswrite) {
> +		/* data_size is writable only during resuming state */
> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_RESUMING)
> +			return -EINVAL;
> +
> +		ret = copy_from_user(&data_size, buf, sizeof(data_size));
> +		if (ret)
> +			return -EFAULT;
> +
> +		vmig->vhca_state_data.state_size += data_size;
> +		vmig->vhca_state_data.win_start_offset += data_size;
> +		ret = mlx5vf_pci_new_write_window(mvdev);
> +		if (ret)
> +			return ret;
> +
> +	} else {
> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_SAVING)
> +			return -EINVAL;
> +
> +		data_size = min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
> +				  vmig->vhca_state_data.state_size -
> +				  vmig->vhca_state_data.win_start_offset);
> +		ret = copy_to_user(buf, &data_size, sizeof(data_size));
> +		if (ret)
> +			return -EFAULT;
> +	}
> +
> +	vmig->region_state |= MLX5VF_REGION_DATA_SIZE;
> +	return sizeof(data_size);
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_data_offset(struct mlx5vf_pci_core_device *mvdev,
> +					char __user *buf, bool iswrite)
> +{
> +	static const u64 data_offset = MLX5VF_MIG_REGION_DATA_OFFSET;
> +	int ret;
> +
> +	/* RO field */
> +	if (iswrite)
> +		return -EFAULT;
> +
> +	ret = copy_to_user(buf, &data_offset, sizeof(data_offset));
> +	if (ret)
> +		return -EFAULT;
> +
> +	return sizeof(data_offset);
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_pending_bytes(struct mlx5vf_pci_core_device *mvdev,
> +					  char __user *buf, bool iswrite)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +	u64 pending_bytes;
> +	int ret;
> +
> +	/* RO field */
> +	if (iswrite)
> +		return -EFAULT;
> +
> +	if (vmig->vfio_dev_state == (VFIO_DEVICE_STATE_SAVING |
> +				     VFIO_DEVICE_STATE_RUNNING)) {
> +		/* In pre-copy state we have no data to return for now,
> +		 * return 0 pending bytes
> +		 */
> +		pending_bytes = 0;
> +	} else {
> +		if (!vmig->vhca_state_data.state_size)
> +			return 0;
> +		pending_bytes = vmig->vhca_state_data.state_size -
> +				vmig->vhca_state_data.win_start_offset;
> +	}
> +
> +	ret = copy_to_user(buf, &pending_bytes, sizeof(pending_bytes));
> +	if (ret)
> +		return -EFAULT;
> +
> +	/* Window moves forward once data from previous iteration was read */
> +	if (vmig->region_state & MLX5VF_REGION_DATA_SIZE)
> +		vmig->vhca_state_data.win_start_offset +=
> +			min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE, pending_bytes);
> +
> +	WARN_ON(vmig->vhca_state_data.win_start_offset >
> +		vmig->vhca_state_data.state_size);
> +
> +	/* New iteration started */
> +	vmig->region_state = MLX5VF_REGION_PENDING_BYTES;
> +	return sizeof(pending_bytes);
> +}
> +
> +static int mlx5vf_load_state(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	if (!mvdev->vmig.vhca_state_data.state_size)
> +		return 0;
> +
> +	return mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
> +					  mvdev->vmig.vhca_id,
> +					  &mvdev->vmig.vhca_state_data);
> +}
> +
> +static void mlx5vf_reset_mig_state(struct mlx5vf_pci_core_device *mvdev)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +
> +	vmig->region_state = 0;
> +	mlx5vf_reset_vhca_state(&vmig->vhca_state_data);
> +}
> +
> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> +				       u32 state)
> +{
> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> +	u32 old_state = vmig->vfio_dev_state;
> +	int ret = 0;
> +
> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> +		return -EINVAL;
> +
> +	/* Running switches off */
> +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> +	    (state & VFIO_DEVICE_STATE_RUNNING) &&
> +	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
> +		ret = mlx5vf_pci_quiesce_device(mvdev);
> +		if (ret)
> +			return ret;
> +		ret = mlx5vf_pci_freeze_device(mvdev);
> +		if (ret) {
> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> +			return ret;
> +		}
> +	}
> +
> +	/* Resuming switches off */
> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&
> +	    (old_state & VFIO_DEVICE_STATE_RESUMING)) {
> +		/* deserialize state into the device */
> +		ret = mlx5vf_load_state(mvdev);
> +		if (ret) {
> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> +			return ret;
> +		}
> +	}
> +
> +	/* Resuming switches on */
> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&
> +	    (state & VFIO_DEVICE_STATE_RESUMING)) {
> +		mlx5vf_reset_mig_state(mvdev);
> +		ret = mlx5vf_pci_new_write_window(mvdev);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/* Saving switches on */
> +	if ((old_state & VFIO_DEVICE_STATE_SAVING) !=
> +	    (state & VFIO_DEVICE_STATE_SAVING) &&
> +	    (state & VFIO_DEVICE_STATE_SAVING)) {
> +		if (!(state & VFIO_DEVICE_STATE_RUNNING)) {
> +			/* serialize post copy */
> +			ret = mlx5vf_pci_save_device_data(mvdev);

Does it actually get into post-copy here? The pre-copy state (old_state) 
has the _SAVING bit set already and the post-copy state (new state) also
has _SAVING set, so the nested check never sees _SAVING toggle on that
transition. It looks like we need to handle post-copy in the above
"Running switches off" block instead and check for (state & _SAVING).

Or am I missing something?

Thanks,
Shameer

> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	/* Running switches on */
> +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> +	    (state & VFIO_DEVICE_STATE_RUNNING) &&
> +	    (state & VFIO_DEVICE_STATE_RUNNING)) {
> +		ret = mlx5vf_pci_unfreeze_device(mvdev);
> +		if (ret)
> +			return ret;
> +		ret = mlx5vf_pci_unquiesce_device(mvdev);
> +		if (ret) {
> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> +			return ret;
> +		}
> +	}
> +
> +	vmig->vfio_dev_state = state;
> +	return 0;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_handle_migration_device_state(struct mlx5vf_pci_core_device *mvdev,
> +					 char __user *buf, bool iswrite)
> +{
> +	size_t count = sizeof(mvdev->vmig.vfio_dev_state);
> +	int ret;
> +
> +	if (iswrite) {
> +		u32 device_state;
> +
> +		ret = copy_from_user(&device_state, buf, count);
> +		if (ret)
> +			return -EFAULT;
> +
> +		ret = mlx5vf_pci_set_device_state(mvdev, device_state);
> +		if (ret)
> +			return ret;
> +	} else {
> +		ret = copy_to_user(buf, &mvdev->vmig.vfio_dev_state, count);
> +		if (ret)
> +			return -EFAULT;
> +	}
> +
> +	return count;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_copy_user_data_to_device_state(struct mlx5vf_pci_core_device *mvdev,
> +					  char __user *buf, size_t count,
> +					  u64 offset)
> +{
> +	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
> +	char __user *from_buff = buf;
> +	u32 curr_offset;
> +	u32 win_page_offset;
> +	u32 copy_count;
> +	struct page *page;
> +	char *to_buff;
> +	int ret;
> +
> +	curr_offset = state_data->win_start_offset + offset;
> +
> +	do {
> +		page = mlx5vf_get_migration_page(&state_data->mig_data,
> +						 curr_offset);
> +		if (!page)
> +			return -EINVAL;
> +
> +		win_page_offset = curr_offset % PAGE_SIZE;
> +		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
> +
> +		to_buff = kmap_local_page(page);
> +		ret = copy_from_user(to_buff + win_page_offset, from_buff,
> +				     copy_count);
> +		kunmap_local(to_buff);
> +		if (ret)
> +			return -EFAULT;
> +
> +		from_buff += copy_count;
> +		curr_offset += copy_count;
> +		count -= copy_count;
> +	} while (count > 0);
> +
> +	return 0;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_copy_device_state_to_user(struct mlx5vf_pci_core_device *mvdev,
> +				     char __user *buf, u64 offset, size_t count)
> +{
> +	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
> +	char __user *to_buff = buf;
> +	u32 win_available_bytes;
> +	u32 win_page_offset;
> +	u32 copy_count;
> +	u32 curr_offset;
> +	char *from_buff;
> +	struct page *page;
> +	int ret;
> +
> +	win_available_bytes =
> +		min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
> +		      mvdev->vmig.vhca_state_data.state_size -
> +			      mvdev->vmig.vhca_state_data.win_start_offset);
> +
> +	if (count + offset > win_available_bytes)
> +		return -EINVAL;
> +
> +	curr_offset = state_data->win_start_offset + offset;
> +
> +	do {
> +		page = mlx5vf_get_migration_page(&state_data->mig_data,
> +						 curr_offset);
> +		if (!page)
> +			return -EINVAL;
> +
> +		win_page_offset = curr_offset % PAGE_SIZE;
> +		copy_count = min_t(u32, PAGE_SIZE - win_page_offset, count);
> +
> +		from_buff = kmap_local_page(page);
> +		ret = copy_to_user(buf, from_buff + win_page_offset,
> +				   copy_count);
> +		kunmap_local(from_buff);
> +		if (ret)
> +			return -EFAULT;
> +
> +		curr_offset += copy_count;
> +		count -= copy_count;
> +		to_buff += copy_count;
> +	} while (count);
> +
> +	return 0;
> +}
> +
> +static ssize_t
> +mlx5vf_pci_migration_data_rw(struct mlx5vf_pci_core_device *mvdev,
> +			     char __user *buf, size_t count, u64 offset,
> +			     bool iswrite)
> +{
> +	int ret;
> +
> +	if (offset + count > MLX5VF_MIG_REGION_DATA_SIZE)
> +		return -EINVAL;
> +
> +	if (iswrite)
> +		ret = mlx5vf_pci_copy_user_data_to_device_state(mvdev, buf,
> +								count, offset);
> +	else
> +		ret = mlx5vf_pci_copy_device_state_to_user(mvdev, buf, offset,
> +							   count);
> +	if (ret)
> +		return ret;
> +	return count;
> +}
> +
> +static ssize_t mlx5vf_pci_mig_rw(struct vfio_pci_core_device *vdev,
> +				 char __user *buf, size_t count, loff_t *ppos,
> +				 bool iswrite)
> +{
> +	struct mlx5vf_pci_core_device *mvdev =
> +		container_of(vdev, struct mlx5vf_pci_core_device, core_device);
> +	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	int ret;
> +
> +	mutex_lock(&mvdev->state_mutex);
> +	/* Copy to/from the migration region data section */
> +	if (pos >= MLX5VF_MIG_REGION_DATA_OFFSET) {
> +		ret = mlx5vf_pci_migration_data_rw(
> +			mvdev, buf, count, pos - MLX5VF_MIG_REGION_DATA_OFFSET,
> +			iswrite);
> +		goto end;
> +	}
> +
> +	switch (pos) {
> +	case VFIO_DEVICE_MIGRATION_OFFSET(device_state):
> +		/* This is an RW field. */
> +		if (count != sizeof(mvdev->vmig.vfio_dev_state)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		ret = mlx5vf_pci_handle_migration_device_state(mvdev, buf,
> +							       iswrite);
> +		break;
> +	case VFIO_DEVICE_MIGRATION_OFFSET(pending_bytes):
> +		/*
> +		 * The number of pending bytes still to be migrated from the
> +		 * vendor driver. This is an RO field.
> +		 * Reading this field indicates the start of a new iteration
> +		 * to get device data.
> +		 *
> +		 */
> +		ret = mlx5vf_pci_handle_migration_pending_bytes(mvdev, buf,
> +								iswrite);
> +		break;
> +	case VFIO_DEVICE_MIGRATION_OFFSET(data_offset):
> +		/*
> +		 * The user application should read data_offset field from the
> +		 * migration region. The user application should read the
> +		 * device data from this offset within the migration region
> +		 * during the _SAVING mode or write the device data during the
> +		 * _RESUMING mode. This is an RO field.
> +		 */
> +		ret = mlx5vf_pci_handle_migration_data_offset(mvdev, buf,
> +							      iswrite);
> +		break;
> +	case VFIO_DEVICE_MIGRATION_OFFSET(data_size):
> +		/*
> +		 * The user application should read data_size to get the size
> +		 * in bytes of the data copied to the migration region during
> +		 * the _SAVING state by the device. The user application should
> +		 * write the size in bytes of the data that was copied to
> +		 * the migration region during the _RESUMING state by the user.
> +		 * This is an RW field.
> +		 */
> +		ret = mlx5vf_pci_handle_migration_data_size(mvdev, buf,
> +							    iswrite);
> +		break;
> +	default:
> +		ret = -EFAULT;
> +		break;
> +	}
> +
> +end:
> +	mutex_unlock(&mvdev->state_mutex);
> +	return ret;
> +}
> +
> +static struct vfio_pci_regops migration_ops = {
> +	.rw = mlx5vf_pci_mig_rw,
> +};
> +
> +static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct mlx5vf_pci_core_device *mvdev = container_of(
> +		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
> +	struct vfio_pci_core_device *vdev = &mvdev->core_device;
> +	int vf_id;
> +	int ret;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	if (!mvdev->migrate_cap) {
> +		vfio_pci_core_finish_enable(vdev);
> +		return 0;
> +	}
> +
> +	vf_id = pci_iov_vf_id(vdev->pdev);
> +	if (vf_id < 0) {
> +		ret = vf_id;
> +		goto out_disable;
> +	}
> +
> +	ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1,
> +				     &mvdev->vmig.vhca_id);
> +	if (ret)
> +		goto out_disable;
> +
> +	ret = vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
> +					   VFIO_REGION_SUBTYPE_MIGRATION,
> +					   &migration_ops,
> +					   MLX5VF_MIG_REGION_DATA_OFFSET +
> +					   MLX5VF_MIG_REGION_DATA_SIZE,
> +					   VFIO_REGION_INFO_FLAG_READ |
> +					   VFIO_REGION_INFO_FLAG_WRITE,
> +					   NULL);
> +	if (ret)
> +		goto out_disable;
> +
> +	mutex_init(&mvdev->state_mutex);
> +	mvdev->vmig.vfio_dev_state = VFIO_DEVICE_STATE_RUNNING;
> +	vfio_pci_core_finish_enable(vdev);
> +	return 0;
> +out_disable:
> +	vfio_pci_core_disable(vdev);
> +	return ret;
> +}
> +
> +static void mlx5vf_pci_close_device(struct vfio_device *core_vdev)
> +{
> +	struct mlx5vf_pci_core_device *mvdev = container_of(
> +		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
> +
> +	vfio_pci_core_close_device(core_vdev);
> +	mlx5vf_reset_mig_state(mvdev);
> +}
> +
> +static const struct vfio_device_ops mlx5vf_pci_ops = {
> +	.name = "mlx5-vfio-pci",
> +	.open_device = mlx5vf_pci_open_device,
> +	.close_device = mlx5vf_pci_close_device,
> +	.ioctl = vfio_pci_core_ioctl,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +};
> +
> +static int mlx5vf_pci_probe(struct pci_dev *pdev,
> +			    const struct pci_device_id *id)
> +{
> +	struct mlx5vf_pci_core_device *mvdev;
> +	int ret;
> +
> +	mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL);
> +	if (!mvdev)
> +		return -ENOMEM;
> +	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
> +
> +	if (pdev->is_virtfn) {
> +		struct mlx5_core_dev *mdev =
> +			mlx5_vf_get_core_dev(pdev);
> +
> +		if (mdev) {
> +			if (MLX5_CAP_GEN(mdev, migration))
> +				mvdev->migrate_cap = 1;
> +			mlx5_vf_put_core_dev(mdev);
> +		}
> +	}
> +
> +	ret = vfio_pci_core_register_device(&mvdev->core_device);
> +	if (ret)
> +		goto out_free;
> +
> +	dev_set_drvdata(&pdev->dev, mvdev);
> +	return 0;
> +
> +out_free:
> +	vfio_pci_core_uninit_device(&mvdev->core_device);
> +	kfree(mvdev);
> +	return ret;
> +}
> +
> +static void mlx5vf_pci_remove(struct pci_dev *pdev)
> +{
> +	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
> +
> +	vfio_pci_core_unregister_device(&mvdev->core_device);
> +	vfio_pci_core_uninit_device(&mvdev->core_device);
> +	kfree(mvdev);
> +}
> +
> +static const struct pci_device_id mlx5vf_pci_table[] = {
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_MELLANOX, 0x101e) }, /* ConnectX Family mlx5Gen Virtual Function */
> +	{}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table);
> +
> +static struct pci_driver mlx5vf_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = mlx5vf_pci_table,
> +	.probe = mlx5vf_pci_probe,
> +	.remove = mlx5vf_pci_remove,
> +	.err_handler = &vfio_pci_core_err_handlers,
> +};
> +
> +static void __exit mlx5vf_pci_cleanup(void)
> +{
> +	pci_unregister_driver(&mlx5vf_pci_driver);
> +}
> +
> +static int __init mlx5vf_pci_init(void)
> +{
> +	return pci_register_driver(&mlx5vf_pci_driver);
> +}
> +
> +module_init(mlx5vf_pci_init);
> +module_exit(mlx5vf_pci_cleanup);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"MLX5 VFIO PCI - User Level meta-driver for MLX5 device family");
> --
> 2.18.1


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-19  9:59   ` Shameerali Kolothum Thodi
@ 2021-10-19 10:30     ` Yishai Hadas
  2021-10-19 11:26       ` Shameerali Kolothum Thodi
  2021-10-19 11:24     ` Jason Gunthorpe
  1 sibling, 1 reply; 44+ messages in thread
From: Yishai Hadas @ 2021-10-19 10:30 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On 10/19/2021 12:59 PM, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Yishai Hadas [mailto:yishaih@nvidia.com]
>> Sent: 13 October 2021 10:47
>> To: alex.williamson@redhat.com; bhelgaas@google.com; jgg@nvidia.com;
>> saeedm@nvidia.com
>> Cc: linux-pci@vger.kernel.org; kvm@vger.kernel.org; netdev@vger.kernel.org;
>> kuba@kernel.org; leonro@nvidia.com; kwankhede@nvidia.com;
>> mgurtovoy@nvidia.com; yishaih@nvidia.com; maorg@nvidia.com
>> Subject: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for
>> mlx5 devices
>>
>> This patch adds support for vfio_pci driver for mlx5 devices.
>>
>> It uses vfio_pci_core to register to the VFIO subsystem and then
>> implements the mlx5 specific logic in the migration area.
>>
>> The migration implementation follows the definition from uapi/vfio.h and
>> uses the mlx5 VF->PF command channel to achieve it.
>>
>> This patch implements the suspend/resume flows.
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>> ---
>>   MAINTAINERS                    |   6 +
>>   drivers/vfio/pci/Kconfig       |   3 +
>>   drivers/vfio/pci/Makefile      |   2 +
>>   drivers/vfio/pci/mlx5/Kconfig  |  11 +
>>   drivers/vfio/pci/mlx5/Makefile |   4 +
>>   drivers/vfio/pci/mlx5/main.c   | 692 +++++++++++++++++++++++++++++++++
>>   6 files changed, 718 insertions(+)
>>   create mode 100644 drivers/vfio/pci/mlx5/Kconfig
>>   create mode 100644 drivers/vfio/pci/mlx5/Makefile
>>   create mode 100644 drivers/vfio/pci/mlx5/main.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index abdcbcfef73d..e824bfab4a01 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -19699,6 +19699,12 @@ L:	kvm@vger.kernel.org
>>   S:	Maintained
>>   F:	drivers/vfio/platform/
>>
>> +VFIO MLX5 PCI DRIVER
>> +M:	Yishai Hadas <yishaih@nvidia.com>
>> +L:	kvm@vger.kernel.org
>> +S:	Maintained
>> +F:	drivers/vfio/pci/mlx5/
>> +
>>   VGA_SWITCHEROO
>>   R:	Lukas Wunner <lukas@wunner.de>
>>   S:	Maintained
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 860424ccda1b..187b9c259944 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -43,4 +43,7 @@ config VFIO_PCI_IGD
>>
>>   	  To enable Intel IGD assignment through vfio-pci, say Y.
>>   endif
>> +
>> +source "drivers/vfio/pci/mlx5/Kconfig"
>> +
>>   endif
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 349d68d242b4..ed9d6f2e0555 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>>   vfio-pci-y := vfio_pci.o
>>   vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>>   obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> +
>> +obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>> diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
>> new file mode 100644
>> index 000000000000..a3ce00add4fe
>> --- /dev/null
>> +++ b/drivers/vfio/pci/mlx5/Kconfig
>> @@ -0,0 +1,11 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +config MLX5_VFIO_PCI
>> +	tristate "VFIO support for MLX5 PCI devices"
>> +	depends on MLX5_CORE
>> +	select VFIO_PCI_CORE
>> +	help
>> +	  This provides PCI support for MLX5 devices using the VFIO
>> +	  framework. The device specific driver supports suspend/resume
>> +	  of the MLX5 device.
>> +
>> +	  If you don't know what to do here, say N.
>> diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
>> new file mode 100644
>> index 000000000000..689627da7ff5
>> --- /dev/null
>> +++ b/drivers/vfio/pci/mlx5/Makefile
>> @@ -0,0 +1,4 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
>> +mlx5-vfio-pci-y := main.o cmd.o
>> +
>> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
>> new file mode 100644
>> index 000000000000..e36302b444a6
>> --- /dev/null
>> +++ b/drivers/vfio/pci/mlx5/main.c
>> @@ -0,0 +1,692 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +
>> +#include <linux/device.h>
>> +#include <linux/eventfd.h>
>> +#include <linux/file.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/iommu.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/notifier.h>
>> +#include <linux/pci.h>
>> +#include <linux/pm_runtime.h>
>> +#include <linux/types.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/sched/mm.h>
>> +#include <linux/vfio_pci_core.h>
>> +
>> +#include "cmd.h"
>> +
>> +enum {
>> +	MLX5VF_PCI_FREEZED = 1 << 0,
>> +};
>> +
>> +enum {
>> +	MLX5VF_REGION_PENDING_BYTES = 1 << 0,
>> +	MLX5VF_REGION_DATA_SIZE = 1 << 1,
>> +};
>> +
>> +#define MLX5VF_MIG_REGION_DATA_SIZE SZ_128K
>> +/* Data section offset from migration region */
>> +#define MLX5VF_MIG_REGION_DATA_OFFSET
>> \
>> +	(sizeof(struct vfio_device_migration_info))
>> +
>> +#define VFIO_DEVICE_MIGRATION_OFFSET(x)
>> \
>> +	(offsetof(struct vfio_device_migration_info, x))
>> +
>> +struct mlx5vf_pci_migration_info {
>> +	u32 vfio_dev_state; /* VFIO_DEVICE_STATE_XXX */
>> +	u32 dev_state; /* device migration state */
>> +	u32 region_state; /* Use MLX5VF_REGION_XXX */
>> +	u16 vhca_id;
>> +	struct mlx5_vhca_state_data vhca_state_data;
>> +};
>> +
>> +struct mlx5vf_pci_core_device {
>> +	struct vfio_pci_core_device core_device;
>> +	u8 migrate_cap:1;
>> +	/* protect migration state */
>> +	struct mutex state_mutex;
>> +	struct mlx5vf_pci_migration_info vmig;
>> +};
>> +
>> +static int mlx5vf_pci_unquiesce_device(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	return mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
>> +				      mvdev->vmig.vhca_id,
>> +				      MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
>> +}
>> +
>> +static int mlx5vf_pci_quiesce_device(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	return mlx5vf_cmd_suspend_vhca(
>> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
>> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
>> +}
>> +
>> +static int mlx5vf_pci_unfreeze_device(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	int ret;
>> +
>> +	ret = mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
>> +				     mvdev->vmig.vhca_id,
>> +				     MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mvdev->vmig.dev_state &= ~MLX5VF_PCI_FREEZED;
>> +	return 0;
>> +}
>> +
>> +static int mlx5vf_pci_freeze_device(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	int ret;
>> +
>> +	ret = mlx5vf_cmd_suspend_vhca(
>> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
>> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mvdev->vmig.dev_state |= MLX5VF_PCI_FREEZED;
>> +	return 0;
>> +}
>> +
>> +static int mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	u32 state_size = 0;
>> +	int ret;
>> +
>> +	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED))
>> +		return -EFAULT;
>> +
>> +	/* If we already read state no reason to re-read */
>> +	if (mvdev->vmig.vhca_state_data.state_size)
>> +		return 0;
>> +
>> +	ret = mlx5vf_cmd_query_vhca_migration_state(
>> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id, &state_size);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
>> +					  mvdev->vmig.vhca_id, state_size,
>> +					  &mvdev->vmig.vhca_state_data);
>> +}
>> +
>> +static int mlx5vf_pci_new_write_window(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	struct mlx5_vhca_state_data *state_data = &mvdev->vmig.vhca_state_data;
>> +	u32 num_pages_needed;
>> +	u64 allocated_ready;
>> +	u32 bytes_needed;
>> +
>> +	/* Check how many bytes are available from previous flows */
>> +	WARN_ON(state_data->num_pages * PAGE_SIZE <
>> +		state_data->win_start_offset);
>> +	allocated_ready = (state_data->num_pages * PAGE_SIZE) -
>> +			  state_data->win_start_offset;
>> +	WARN_ON(allocated_ready > MLX5VF_MIG_REGION_DATA_SIZE);
>> +
>> +	bytes_needed = MLX5VF_MIG_REGION_DATA_SIZE - allocated_ready;
>> +	if (!bytes_needed)
>> +		return 0;
>> +
>> +	num_pages_needed = DIV_ROUND_UP_ULL(bytes_needed, PAGE_SIZE);
>> +	return mlx5vf_add_migration_pages(state_data, num_pages_needed);
>> +}
>> +
>> +static ssize_t
>> +mlx5vf_pci_handle_migration_data_size(struct mlx5vf_pci_core_device *mvdev,
>> +				      char __user *buf, bool iswrite)
>> +{
>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
>> +	u64 data_size;
>> +	int ret;
>> +
>> +	if (iswrite) {
>> +		/* data_size is writable only during resuming state */
>> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_RESUMING)
>> +			return -EINVAL;
>> +
>> +		ret = copy_from_user(&data_size, buf, sizeof(data_size));
>> +		if (ret)
>> +			return -EFAULT;
>> +
>> +		vmig->vhca_state_data.state_size += data_size;
>> +		vmig->vhca_state_data.win_start_offset += data_size;
>> +		ret = mlx5vf_pci_new_write_window(mvdev);
>> +		if (ret)
>> +			return ret;
>> +
>> +	} else {
>> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_SAVING)
>> +			return -EINVAL;
>> +
>> +		data_size = min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
>> +				  vmig->vhca_state_data.state_size -
>> +				  vmig->vhca_state_data.win_start_offset);
>> +		ret = copy_to_user(buf, &data_size, sizeof(data_size));
>> +		if (ret)
>> +			return -EFAULT;
>> +	}
>> +
>> +	vmig->region_state |= MLX5VF_REGION_DATA_SIZE;
>> +	return sizeof(data_size);
>> +}
>> +
>> +static ssize_t
>> +mlx5vf_pci_handle_migration_data_offset(struct mlx5vf_pci_core_device *mvdev,
>> +					char __user *buf, bool iswrite)
>> +{
>> +	static const u64 data_offset = MLX5VF_MIG_REGION_DATA_OFFSET;
>> +	int ret;
>> +
>> +	/* RO field */
>> +	if (iswrite)
>> +		return -EFAULT;
>> +
>> +	ret = copy_to_user(buf, &data_offset, sizeof(data_offset));
>> +	if (ret)
>> +		return -EFAULT;
>> +
>> +	return sizeof(data_offset);
>> +}
>> +
>> +static ssize_t
>> +mlx5vf_pci_handle_migration_pending_bytes(struct mlx5vf_pci_core_device *mvdev,
>> +					  char __user *buf, bool iswrite)
>> +{
>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
>> +	u64 pending_bytes;
>> +	int ret;
>> +
>> +	/* RO field */
>> +	if (iswrite)
>> +		return -EFAULT;
>> +
>> +	if (vmig->vfio_dev_state == (VFIO_DEVICE_STATE_SAVING |
>> +				     VFIO_DEVICE_STATE_RUNNING)) {
>> +		/* In pre-copy state we have no data to return for now,
>> +		 * return 0 pending bytes
>> +		 */
>> +		pending_bytes = 0;
>> +	} else {
>> +		if (!vmig->vhca_state_data.state_size)
>> +			return 0;
>> +		pending_bytes = vmig->vhca_state_data.state_size -
>> +				vmig->vhca_state_data.win_start_offset;
>> +	}
>> +
>> +	ret = copy_to_user(buf, &pending_bytes, sizeof(pending_bytes));
>> +	if (ret)
>> +		return -EFAULT;
>> +
>> +	/* Window moves forward once data from previous iteration was read */
>> +	if (vmig->region_state & MLX5VF_REGION_DATA_SIZE)
>> +		vmig->vhca_state_data.win_start_offset +=
>> +			min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE, pending_bytes);
>> +
>> +	WARN_ON(vmig->vhca_state_data.win_start_offset >
>> +		vmig->vhca_state_data.state_size);
>> +
>> +	/* New iteration started */
>> +	vmig->region_state = MLX5VF_REGION_PENDING_BYTES;
>> +	return sizeof(pending_bytes);
>> +}
>> +
>> +static int mlx5vf_load_state(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	if (!mvdev->vmig.vhca_state_data.state_size)
>> +		return 0;
>> +
>> +	return mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
>> +					  mvdev->vmig.vhca_id,
>> +					  &mvdev->vmig.vhca_state_data);
>> +}
>> +
>> +static void mlx5vf_reset_mig_state(struct mlx5vf_pci_core_device *mvdev)
>> +{
>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
>> +
>> +	vmig->region_state = 0;
>> +	mlx5vf_reset_vhca_state(&vmig->vhca_state_data);
>> +}
>> +
>> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
>> +				       u32 state)
>> +{
>> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
>> +	u32 old_state = vmig->vfio_dev_state;
>> +	int ret = 0;
>> +
>> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
>> +		return -EINVAL;
>> +
>> +	/* Running switches off */
>> +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
>> +	    (state & VFIO_DEVICE_STATE_RUNNING) &&
>> +	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
>> +		ret = mlx5vf_pci_quiesce_device(mvdev);
>> +		if (ret)
>> +			return ret;
>> +		ret = mlx5vf_pci_freeze_device(mvdev);
>> +		if (ret) {
>> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	/* Resuming switches off */
>> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
>> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&
>> +	    (old_state & VFIO_DEVICE_STATE_RESUMING)) {
>> +		/* deserialize state into the device */
>> +		ret = mlx5vf_load_state(mvdev);
>> +		if (ret) {
>> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	/* Resuming switches on */
>> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
>> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&
>> +	    (state & VFIO_DEVICE_STATE_RESUMING)) {
>> +		mlx5vf_reset_mig_state(mvdev);
>> +		ret = mlx5vf_pci_new_write_window(mvdev);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	/* Saving switches on */
>> +	if ((old_state & VFIO_DEVICE_STATE_SAVING) !=
>> +	    (state & VFIO_DEVICE_STATE_SAVING) &&
>> +	    (state & VFIO_DEVICE_STATE_SAVING)) {
>> +		if (!(state & VFIO_DEVICE_STATE_RUNNING)) {
>> +			/* serialize post copy */
>> +			ret = mlx5vf_pci_save_device_data(mvdev);
> Does it actually get into post-copy here? The pre-copy state (old_state)
> has the _SAVING bit set already and the post-copy state (new state) also
> has _SAVING set. It looks like we need to handle post-copy in the above
> "Running switches off" and check for (state & _SAVING).
>
> Or am I missing something?
>

The above checks for a change in the SAVING bit: if it was turned on and 
we are not RUNNING, it means post-copy.

Turning on SAVING while we are RUNNING will end up returning zero pending 
bytes, as we don't support dirty pages for now.

See mlx5vf_pci_handle_migration_pending_bytes().

Yishai


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-19  9:59   ` Shameerali Kolothum Thodi
  2021-10-19 10:30     ` Yishai Hadas
@ 2021-10-19 11:24     ` Jason Gunthorpe
  1 sibling, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2021-10-19 11:24 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Yishai Hadas, alex.williamson, bhelgaas, saeedm, linux-pci, kvm,
	netdev, kuba, leonro, kwankhede, mgurtovoy, maorg

On Tue, Oct 19, 2021 at 09:59:03AM +0000, Shameerali Kolothum Thodi wrote:

> > +	/* Saving switches on */
> > +	if ((old_state & VFIO_DEVICE_STATE_SAVING) !=
> > +	    (state & VFIO_DEVICE_STATE_SAVING) &&
> > +	    (state & VFIO_DEVICE_STATE_SAVING)) {
> > +		if (!(state & VFIO_DEVICE_STATE_RUNNING)) {
> > +			/* serialize post copy */
> > +			ret = mlx5vf_pci_save_device_data(mvdev);
> 
> Does it actually get into post-copy here? The pre-copy state(old_state) 
> has the _SAVING bit set already and post-copy state( new state) also
> has _SAVING set. It looks like we need to handle the post copy in the above
> "Running switches off" and check for (state & _SAVING). 

Right, if statements cannot be nested like this. Probably like this:

if ((new_state ^ old_state) & (VFIO_DEVICE_STATE_SAVING|VFIO_DEVICE_STATE_RUNNING) !=
    (new_state & (VFIO_DEVICE_STATE_SAVING|VFIO_DEVICE_STATE_RUNNING)) == (VFIO_DEVICE_STATE_SAVING)
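
That is, fully parenthesized (a sketch; it tests that the SAVING/RUNNING
bits changed and that the new state has SAVING set without RUNNING):

/* SAVING or RUNNING flipped, and the new state is SAVING without RUNNING */
if (((new_state ^ old_state) &
     (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)) &&
    ((new_state & (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)) ==
     VFIO_DEVICE_STATE_SAVING))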

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  2021-10-19 10:30     ` Yishai Hadas
@ 2021-10-19 11:26       ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 44+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-10-19 11:26 UTC (permalink / raw)
  To: Yishai Hadas, alex.williamson, bhelgaas, jgg, saeedm
  Cc: linux-pci, kvm, netdev, kuba, leonro, kwankhede, mgurtovoy, maorg



> -----Original Message-----
> From: Yishai Hadas [mailto:yishaih@nvidia.com]
> Sent: 19 October 2021 11:30
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> alex.williamson@redhat.com; bhelgaas@google.com; jgg@nvidia.com;
> saeedm@nvidia.com
> Cc: linux-pci@vger.kernel.org; kvm@vger.kernel.org; netdev@vger.kernel.org;
> kuba@kernel.org; leonro@nvidia.com; kwankhede@nvidia.com;
> mgurtovoy@nvidia.com; maorg@nvidia.com
> Subject: Re: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver
> for mlx5 devices
> 
> On 10/19/2021 12:59 PM, Shameerali Kolothum Thodi wrote:
> >
> >> -----Original Message-----
> >> From: Yishai Hadas [mailto:yishaih@nvidia.com]
> >> Sent: 13 October 2021 10:47
> >> To: alex.williamson@redhat.com; bhelgaas@google.com; jgg@nvidia.com;
> >> saeedm@nvidia.com
> >> Cc: linux-pci@vger.kernel.org; kvm@vger.kernel.org; netdev@vger.kernel.org;
> >> kuba@kernel.org; leonro@nvidia.com; kwankhede@nvidia.com;
> >> mgurtovoy@nvidia.com; yishaih@nvidia.com; maorg@nvidia.com
> >> Subject: [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
> >>
> >> This patch adds a vfio_pci driver for mlx5 devices.
> >>
> >> It uses vfio_pci_core to register to the VFIO subsystem and then
> >> implements the mlx5 specific logic in the migration area.
> >>
> >> The migration implementation follows the definition from uapi/vfio.h and
> >> uses the mlx5 VF->PF command channel to achieve it.
> >>
> >> This patch implements the suspend/resume flows.
> >>
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >> ---
> >>   MAINTAINERS                    |   6 +
> >>   drivers/vfio/pci/Kconfig       |   3 +
> >>   drivers/vfio/pci/Makefile      |   2 +
> >>   drivers/vfio/pci/mlx5/Kconfig  |  11 +
> >>   drivers/vfio/pci/mlx5/Makefile |   4 +
> >>   drivers/vfio/pci/mlx5/main.c   | 692 +++++++++++++++++++++++++++++++++
> >>   6 files changed, 718 insertions(+)
> >>   create mode 100644 drivers/vfio/pci/mlx5/Kconfig
> >>   create mode 100644 drivers/vfio/pci/mlx5/Makefile
> >>   create mode 100644 drivers/vfio/pci/mlx5/main.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index abdcbcfef73d..e824bfab4a01 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -19699,6 +19699,12 @@ L:	kvm@vger.kernel.org
> >>   S:	Maintained
> >>   F:	drivers/vfio/platform/
> >>
> >> +VFIO MLX5 PCI DRIVER
> >> +M:	Yishai Hadas <yishaih@nvidia.com>
> >> +L:	kvm@vger.kernel.org
> >> +S:	Maintained
> >> +F:	drivers/vfio/pci/mlx5/
> >> +
> >>   VGA_SWITCHEROO
> >>   R:	Lukas Wunner <lukas@wunner.de>
> >>   S:	Maintained
> >> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >> index 860424ccda1b..187b9c259944 100644
> >> --- a/drivers/vfio/pci/Kconfig
> >> +++ b/drivers/vfio/pci/Kconfig
> >> @@ -43,4 +43,7 @@ config VFIO_PCI_IGD
> >>
> >>   	  To enable Intel IGD assignment through vfio-pci, say Y.
> >>   endif
> >> +
> >> +source "drivers/vfio/pci/mlx5/Kconfig"
> >> +
> >>   endif
> >> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> >> index 349d68d242b4..ed9d6f2e0555 100644
> >> --- a/drivers/vfio/pci/Makefile
> >> +++ b/drivers/vfio/pci/Makefile
> >> @@ -7,3 +7,5 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
> >>   vfio-pci-y := vfio_pci.o
> >>   vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> >>   obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> >> +
> >> +obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
> >> diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
> >> new file mode 100644
> >> index 000000000000..a3ce00add4fe
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/mlx5/Kconfig
> >> @@ -0,0 +1,11 @@
> >> +# SPDX-License-Identifier: GPL-2.0-only
> >> +config MLX5_VFIO_PCI
> >> +	tristate "VFIO support for MLX5 PCI devices"
> >> +	depends on MLX5_CORE
> >> +	select VFIO_PCI_CORE
> >> +	help
> >> +	  This provides PCI support for MLX5 devices using the VFIO
> >> +	  framework. The device-specific driver supports suspend/resume
> >> +	  of the MLX5 device.
> >> +
> >> +	  If you don't know what to do here, say N.
> >> diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile
> >> new file mode 100644
> >> index 000000000000..689627da7ff5
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/mlx5/Makefile
> >> @@ -0,0 +1,4 @@
> >> +# SPDX-License-Identifier: GPL-2.0-only
> >> +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
> >> +mlx5-vfio-pci-y := main.o cmd.o
> >> +
> >> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> >> new file mode 100644
> >> index 000000000000..e36302b444a6
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/mlx5/main.c
> >> @@ -0,0 +1,692 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> >> + */
> >> +
> >> +#include <linux/device.h>
> >> +#include <linux/eventfd.h>
> >> +#include <linux/file.h>
> >> +#include <linux/interrupt.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/module.h>
> >> +#include <linux/mutex.h>
> >> +#include <linux/notifier.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/pm_runtime.h>
> >> +#include <linux/types.h>
> >> +#include <linux/uaccess.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/sched/mm.h>
> >> +#include <linux/vfio_pci_core.h>
> >> +
> >> +#include "cmd.h"
> >> +
> >> +enum {
> >> +	MLX5VF_PCI_FREEZED = 1 << 0,
> >> +};
> >> +
> >> +enum {
> >> +	MLX5VF_REGION_PENDING_BYTES = 1 << 0,
> >> +	MLX5VF_REGION_DATA_SIZE = 1 << 1,
> >> +};
> >> +
> >> +#define MLX5VF_MIG_REGION_DATA_SIZE SZ_128K
> >> +/* Data section offset from migration region */
> >> +#define MLX5VF_MIG_REGION_DATA_OFFSET \
> >> +	(sizeof(struct vfio_device_migration_info))
> >> +
> >> +#define VFIO_DEVICE_MIGRATION_OFFSET(x) \
> >> +	(offsetof(struct vfio_device_migration_info, x))
> >> +
> >> +struct mlx5vf_pci_migration_info {
> >> +	u32 vfio_dev_state; /* VFIO_DEVICE_STATE_XXX */
> >> +	u32 dev_state; /* device migration state */
> >> +	u32 region_state; /* Use MLX5VF_REGION_XXX */
> >> +	u16 vhca_id;
> >> +	struct mlx5_vhca_state_data vhca_state_data;
> >> +};
> >> +
> >> +struct mlx5vf_pci_core_device {
> >> +	struct vfio_pci_core_device core_device;
> >> +	u8 migrate_cap:1;
> >> +	/* protect migration state */
> >> +	struct mutex state_mutex;
> >> +	struct mlx5vf_pci_migration_info vmig;
> >> +};
> >> +
> >> +static int mlx5vf_pci_unquiesce_device(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	return mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
> >> +				      mvdev->vmig.vhca_id,
> >> +				      MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_MASTER);
> >> +}
> >> +
> >> +static int mlx5vf_pci_quiesce_device(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	return mlx5vf_cmd_suspend_vhca(
> >> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
> >> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_MASTER);
> >> +}
> >> +
> >> +static int mlx5vf_pci_unfreeze_device(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	int ret;
> >> +
> >> +	ret = mlx5vf_cmd_resume_vhca(mvdev->core_device.pdev,
> >> +				     mvdev->vmig.vhca_id,
> >> +				     MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_SLAVE);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	mvdev->vmig.dev_state &= ~MLX5VF_PCI_FREEZED;
> >> +	return 0;
> >> +}
> >> +
> >> +static int mlx5vf_pci_freeze_device(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	int ret;
> >> +
> >> +	ret = mlx5vf_cmd_suspend_vhca(
> >> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id,
> >> +		MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_SLAVE);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	mvdev->vmig.dev_state |= MLX5VF_PCI_FREEZED;
> >> +	return 0;
> >> +}
> >> +
> >> +static int mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	u32 state_size = 0;
> >> +	int ret;
> >> +
> >> +	if (!(mvdev->vmig.dev_state & MLX5VF_PCI_FREEZED))
> >> +		return -EFAULT;
> >> +
> >> +	/* If we already read state no reason to re-read */
> >> +	if (mvdev->vmig.vhca_state_data.state_size)
> >> +		return 0;
> >> +
> >> +	ret = mlx5vf_cmd_query_vhca_migration_state(
> >> +		mvdev->core_device.pdev, mvdev->vmig.vhca_id, &state_size);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	return mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
> >> +					  mvdev->vmig.vhca_id, state_size,
> >> +					  &mvdev->vmig.vhca_state_data);
> >> +}
> >> +
> >> +static int mlx5vf_pci_new_write_window(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	struct mlx5_vhca_state_data *state_data =
> >> +		&mvdev->vmig.vhca_state_data;
> >> +	u32 num_pages_needed;
> >> +	u64 allocated_ready;
> >> +	u32 bytes_needed;
> >> +
> >> +	/* Check how many bytes are available from previous flows */
> >> +	WARN_ON(state_data->num_pages * PAGE_SIZE <
> >> +		state_data->win_start_offset);
> >> +	allocated_ready = (state_data->num_pages * PAGE_SIZE) -
> >> +			  state_data->win_start_offset;
> >> +	WARN_ON(allocated_ready > MLX5VF_MIG_REGION_DATA_SIZE);
> >> +
> >> +	bytes_needed = MLX5VF_MIG_REGION_DATA_SIZE - allocated_ready;
> >> +	if (!bytes_needed)
> >> +		return 0;
> >> +
> >> +	num_pages_needed = DIV_ROUND_UP_ULL(bytes_needed, PAGE_SIZE);
> >> +	return mlx5vf_add_migration_pages(state_data, num_pages_needed);
> >> +}
> >> +
> >> +static ssize_t
> >> +mlx5vf_pci_handle_migration_data_size(struct mlx5vf_pci_core_device *mvdev,
> >> +				      char __user *buf, bool iswrite)
> >> +{
> >> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> >> +	u64 data_size;
> >> +	int ret;
> >> +
> >> +	if (iswrite) {
> >> +		/* data_size is writable only during resuming state */
> >> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_RESUMING)
> >> +			return -EINVAL;
> >> +
> >> +		ret = copy_from_user(&data_size, buf, sizeof(data_size));
> >> +		if (ret)
> >> +			return -EFAULT;
> >> +
> >> +		vmig->vhca_state_data.state_size += data_size;
> >> +		vmig->vhca_state_data.win_start_offset += data_size;
> >> +		ret = mlx5vf_pci_new_write_window(mvdev);
> >> +		if (ret)
> >> +			return ret;
> >> +
> >> +	} else {
> >> +		if (vmig->vfio_dev_state != VFIO_DEVICE_STATE_SAVING)
> >> +			return -EINVAL;
> >> +
> >> +		data_size = min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE,
> >> +				  vmig->vhca_state_data.state_size -
> >> +				  vmig->vhca_state_data.win_start_offset);
> >> +		ret = copy_to_user(buf, &data_size, sizeof(data_size));
> >> +		if (ret)
> >> +			return -EFAULT;
> >> +	}
> >> +
> >> +	vmig->region_state |= MLX5VF_REGION_DATA_SIZE;
> >> +	return sizeof(data_size);
> >> +}
> >> +
> >> +static ssize_t
> >> +mlx5vf_pci_handle_migration_data_offset(struct mlx5vf_pci_core_device *mvdev,
> >> +					char __user *buf, bool iswrite)
> >> +{
> >> +	static const u64 data_offset = MLX5VF_MIG_REGION_DATA_OFFSET;
> >> +	int ret;
> >> +
> >> +	/* RO field */
> >> +	if (iswrite)
> >> +		return -EFAULT;
> >> +
> >> +	ret = copy_to_user(buf, &data_offset, sizeof(data_offset));
> >> +	if (ret)
> >> +		return -EFAULT;
> >> +
> >> +	return sizeof(data_offset);
> >> +}
> >> +
> >> +static ssize_t
> >> +mlx5vf_pci_handle_migration_pending_bytes(struct mlx5vf_pci_core_device *mvdev,
> >> +					  char __user *buf, bool iswrite)
> >> +{
> >> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> >> +	u64 pending_bytes;
> >> +	int ret;
> >> +
> >> +	/* RO field */
> >> +	if (iswrite)
> >> +		return -EFAULT;
> >> +
> >> +	if (vmig->vfio_dev_state == (VFIO_DEVICE_STATE_SAVING |
> >> +				     VFIO_DEVICE_STATE_RUNNING)) {
> >> +		/* In pre-copy state we have no data to return for now,
> >> +		 * return 0 pending bytes
> >> +		 */
> >> +		pending_bytes = 0;
> >> +	} else {
> >> +		if (!vmig->vhca_state_data.state_size)
> >> +			return 0;
> >> +		pending_bytes = vmig->vhca_state_data.state_size -
> >> +				vmig->vhca_state_data.win_start_offset;
> >> +	}
> >> +
> >> +	ret = copy_to_user(buf, &pending_bytes, sizeof(pending_bytes));
> >> +	if (ret)
> >> +		return -EFAULT;
> >> +
> >> +	/* Window moves forward once data from previous iteration was read */
> >> +	if (vmig->region_state & MLX5VF_REGION_DATA_SIZE)
> >> +		vmig->vhca_state_data.win_start_offset +=
> >> +			min_t(u64, MLX5VF_MIG_REGION_DATA_SIZE, pending_bytes);
> >> +
> >> +	WARN_ON(vmig->vhca_state_data.win_start_offset >
> >> +		vmig->vhca_state_data.state_size);
> >> +
> >> +	/* New iteration started */
> >> +	vmig->region_state = MLX5VF_REGION_PENDING_BYTES;
> >> +	return sizeof(pending_bytes);
> >> +}
> >> +
> >> +static int mlx5vf_load_state(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	if (!mvdev->vmig.vhca_state_data.state_size)
> >> +		return 0;
> >> +
> >> +	return mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
> >> +					  mvdev->vmig.vhca_id,
> >> +					  &mvdev->vmig.vhca_state_data);
> >> +}
> >> +
> >> +static void mlx5vf_reset_mig_state(struct mlx5vf_pci_core_device *mvdev)
> >> +{
> >> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> >> +
> >> +	vmig->region_state = 0;
> >> +	mlx5vf_reset_vhca_state(&vmig->vhca_state_data);
> >> +}
> >> +
> >> +static int mlx5vf_pci_set_device_state(struct mlx5vf_pci_core_device *mvdev,
> >> +				       u32 state)
> >> +{
> >> +	struct mlx5vf_pci_migration_info *vmig = &mvdev->vmig;
> >> +	u32 old_state = vmig->vfio_dev_state;
> >> +	int ret = 0;
> >> +
> >> +	if (vfio_is_state_invalid(state) || vfio_is_state_invalid(old_state))
> >> +		return -EINVAL;
> >> +
> >> +	/* Running switches off */
> >> +	if ((old_state & VFIO_DEVICE_STATE_RUNNING) !=
> >> +	    (state & VFIO_DEVICE_STATE_RUNNING) &&
> >> +	    (old_state & VFIO_DEVICE_STATE_RUNNING)) {
> >> +		ret = mlx5vf_pci_quiesce_device(mvdev);
> >> +		if (ret)
> >> +			return ret;
> >> +		ret = mlx5vf_pci_freeze_device(mvdev);
> >> +		if (ret) {
> >> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> >> +			return ret;
> >> +		}
> >> +	}
> >> +
> >> +	/* Resuming switches off */
> >> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
> >> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&
> >> +	    (old_state & VFIO_DEVICE_STATE_RESUMING)) {
> >> +		/* deserialize state into the device */
> >> +		ret = mlx5vf_load_state(mvdev);
> >> +		if (ret) {
> >> +			vmig->vfio_dev_state = VFIO_DEVICE_STATE_INVALID;
> >> +			return ret;
> >> +		}
> >> +	}
> >> +
> >> +	/* Resuming switches on */
> >> +	if ((old_state & VFIO_DEVICE_STATE_RESUMING) !=
> >> +	    (state & VFIO_DEVICE_STATE_RESUMING) &&
> >> +	    (state & VFIO_DEVICE_STATE_RESUMING)) {
> >> +		mlx5vf_reset_mig_state(mvdev);
> >> +		ret = mlx5vf_pci_new_write_window(mvdev);
> >> +		if (ret)
> >> +			return ret;
> >> +	}
> >> +
> >> +	/* Saving switches on */
> >> +	if ((old_state & VFIO_DEVICE_STATE_SAVING) !=
> >> +	    (state & VFIO_DEVICE_STATE_SAVING) &&
> >> +	    (state & VFIO_DEVICE_STATE_SAVING)) {
> >> +		if (!(state & VFIO_DEVICE_STATE_RUNNING)) {
> >> +			/* serialize post copy */
> >> +			ret = mlx5vf_pci_save_device_data(mvdev);
> > Does it actually get into post-copy here? The pre-copy state (old_state)
> > has the _SAVING bit set already and the post-copy state (new state) also
> > has _SAVING set. It looks like we need to handle the post copy in the
> > above "Running switches off" branch and check for (state & _SAVING).
> >
> > Or am I missing something?
> >
> >
> 
> The above checks for a change in the SAVING bit: if SAVING was just
> turned on and we are not RUNNING, it means post copy (stop-and-copy).
> 
> Turning on SAVING while we are still RUNNING will end up returning zero
> bytes for pending_bytes, as we don't support dirty page tracking for now.
> 
> See mlx5vf_pci_handle_migration_pending_bytes().

So what you are saying is that QEMU won't set a pre-copy state prior to
post copy here. IIRC, that was not the case in our setup: QEMU does set
the state to pre-copy (_RUNNING | _SAVING), reads pending_bytes, and
then sets it to post copy (_SAVING).
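
The userspace side of that sequence looks roughly like this (a sketch
against the v1 protocol in uapi/vfio.h; device_fd and mig_off are
stand-ins for the device fd and the migration region's file offset as
reported by VFIO_DEVICE_GET_REGION_INFO):

	u32 state;
	off_t ds = mig_off +
		   offsetof(struct vfio_device_migration_info, device_state);

	/* pre-copy: device keeps running while pending_bytes is polled */
	state = VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING;
	pwrite(device_fd, &state, sizeof(state), ds);

	/* ... read pending_bytes / data_offset / data_size ... */

	/* stop-and-copy ("post copy" above): clear RUNNING, keep SAVING */
	state = VFIO_DEVICE_STATE_SAVING;
	pwrite(device_fd, &state, sizeof(state), ds);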

Thanks,
Shameer

> 
> Yishai



end of thread

Thread overview: 44+ messages
2021-10-13  9:46 [PATCH V1 mlx5-next 00/13] Add mlx5 live migration driver Yishai Hadas
2021-10-13  9:46 ` [PATCH V1 mlx5-next 01/13] PCI/IOV: Provide internal VF index Yishai Hadas
2021-10-13 18:14   ` Bjorn Helgaas
2021-10-14  9:08     ` Yishai Hadas
2021-10-13  9:46 ` [PATCH V1 mlx5-next 02/13] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2021-10-13  9:46 ` [PATCH V1 mlx5-next 03/13] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2021-10-13  9:46 ` [PATCH V1 mlx5-next 04/13] PCI/IOV: Allow SRIOV VF drivers to reach the drvdata of a PF Yishai Hadas
2021-10-13 18:27   ` Bjorn Helgaas
2021-10-14 22:11   ` Alex Williamson
2021-10-17 13:43     ` Yishai Hadas
2021-10-13  9:46 ` [PATCH V1 mlx5-next 05/13] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2021-10-13  9:47 ` [PATCH V1 mlx5-next 06/13] vdpa/mlx5: Use mlx5_vf_get_core_dev() to get PF device Yishai Hadas
2021-10-13  9:47 ` [PATCH V1 mlx5-next 07/13] vfio: Add 'invalid' state definitions Yishai Hadas
2021-10-15 16:38   ` Alex Williamson
2021-10-17 14:07     ` Yishai Hadas
2021-10-13  9:47 ` [PATCH V1 mlx5-next 08/13] vfio/pci_core: Make the region->release() function optional Yishai Hadas
2021-10-13  9:47 ` [PATCH V1 mlx5-next 09/13] net/mlx5: Introduce migration bits and structures Yishai Hadas
2021-10-13  9:47 ` [PATCH V1 mlx5-next 10/13] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2021-10-13  9:47 ` [PATCH V1 mlx5-next 11/13] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2021-10-15 19:48   ` Alex Williamson
2021-10-15 19:59     ` Jason Gunthorpe
2021-10-15 20:12       ` Alex Williamson
2021-10-15 20:16         ` Jason Gunthorpe
2021-10-15 20:59           ` Alex Williamson
2021-10-17 14:03             ` Yishai Hadas
2021-10-18 11:51               ` Jason Gunthorpe
2021-10-18 13:26                 ` Yishai Hadas
2021-10-18 13:42                   ` Alex Williamson
2021-10-18 13:46                     ` Yishai Hadas
2021-10-19  9:59   ` Shameerali Kolothum Thodi
2021-10-19 10:30     ` Yishai Hadas
2021-10-19 11:26       ` Shameerali Kolothum Thodi
2021-10-19 11:24     ` Jason Gunthorpe
2021-10-13  9:47 ` [PATCH V1 mlx5-next 12/13] vfio/pci: Add infrastructure to let vfio_pci_core drivers trap device RESET Yishai Hadas
2021-10-15 19:52   ` Alex Williamson
2021-10-15 20:03     ` Jason Gunthorpe
2021-10-15 21:12       ` Alex Williamson
2021-10-17 14:29         ` Yishai Hadas
2021-10-18 12:02           ` Jason Gunthorpe
2021-10-18 13:41             ` Yishai Hadas
2021-10-13  9:47 ` [PATCH V1 mlx5-next 13/13] vfio/mlx5: Trap device RESET and update state accordingly Yishai Hadas
2021-10-13 18:06   ` Jason Gunthorpe
2021-10-14  9:18     ` Yishai Hadas
2021-10-15 19:54       ` Alex Williamson
