linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices
@ 2022-12-06  5:58 Lei Rao
  2022-12-06  5:58 ` [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver Lei Rao
                   ` (4 more replies)
  0 siblings, 5 replies; 65+ messages in thread
From: Lei Rao @ 2022-12-06  5:58 UTC (permalink / raw)
  To: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan,
	Lei Rao

Some devices, such as Infrastructure Processing Units (IPUs) and Data
Processing Units (DPUs), expose SR-IOV capable NVMe devices to the host.
Its VF devices support live migration via specific NVMe commands issued
through the PF's device's admin queue.

One of the problems is there are no standardized NVMe live migration
commands to make our patches spec compliant, which prevents us from
creating a spec-compliant implementation. So we've created a reference
implantation based on the Vendor-specific command fields of the protocol.
We want these commands to be standardized so that the implementation will
be spec compliant in future versions and use this RFC as a basis for the
same.

More importantly, we provide our work to help the community and start the
discussion, so everyone can participate and benefit from our work in
NVMe Live Migration implementation and help drive on standardization
efforts.

This series implements the specific vfio-pci driver to support live
migration of NVMe devices. Adds a new interface in the general
NVMe driver to receive the VF device's commands submission to the PF
device's admin queue. And also documents the details of NVMe device
live migration commands.

Any feedback and comments are greatly appreciated.

Lei Rao (5):
  nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for
    VF driver.
  nvme-vfio: add new vfio-pci driver for NVMe device
  nvme-vfio: enable the function of VFIO live migration.
  nvme-vfio: check if the hardware supports live migration
  nvme-vfio: Add a document for the NVMe device

 drivers/nvme/host/pci.c        |  18 +
 drivers/vfio/pci/Kconfig       |   2 +
 drivers/vfio/pci/Makefile      |   2 +
 drivers/vfio/pci/nvme/Kconfig  |  10 +
 drivers/vfio/pci/nvme/Makefile |   3 +
 drivers/vfio/pci/nvme/nvme.c   | 636 +++++++++++++++++++++++++++++++++
 drivers/vfio/pci/nvme/nvme.h   | 111 ++++++
 drivers/vfio/pci/nvme/nvme.txt | 278 ++++++++++++++
 8 files changed, 1060 insertions(+)
 create mode 100644 drivers/vfio/pci/nvme/Kconfig
 create mode 100644 drivers/vfio/pci/nvme/Makefile
 create mode 100644 drivers/vfio/pci/nvme/nvme.c
 create mode 100644 drivers/vfio/pci/nvme/nvme.h
 create mode 100644 drivers/vfio/pci/nvme/nvme.txt

-- 
2.34.1


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06  5:58 [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices Lei Rao
@ 2022-12-06  5:58 ` Lei Rao
  2022-12-06  6:19   ` Christoph Hellwig
  2022-12-06  5:58 ` [RFC PATCH 2/5] nvme-vfio: add new vfio-pci driver for NVMe device Lei Rao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Lei Rao @ 2022-12-06  5:58 UTC (permalink / raw)
  To: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan,
	Lei Rao

The new function nvme_submit_vf_cmd() helps the host VF driver to issue
VF admin commands. It's helpful in some cases that the host NVMe driver
does not control VF's admin queue. For example, in the virtualization
device pass-through case, the VF controller's admin queue is governed
by the Guest NVMe driver. Host VF driver relies on PF device's admin
queue to control VF devices like vendor-specific live migration commands.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/nvme/host/pci.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 488ad7dabeb8..3d9c54d8e7fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3585,6 +3585,24 @@ static struct pci_driver nvme_driver = {
 	.err_handler	= &nvme_err_handler,
 };
 
+int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+			      size_t *result, void *buffer, unsigned int bufflen)
+{
+	struct nvme_dev *ndev = NULL;
+	union nvme_result res = { };
+	int ret;
+
+	ndev = pci_iov_get_pf_drvdata(dev, &nvme_driver);
+	if (IS_ERR(ndev))
+		return PTR_ERR(ndev);
+	ret = __nvme_submit_sync_cmd(ndev->ctrl.admin_q, cmd, &res, buffer, bufflen,
+			      NVME_QID_ANY, 0, 0);
+	if (ret >= 0 && result)
+		*result = le32_to_cpu(res.u32);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_submit_vf_cmd);
+
 static int __init nvme_init(void)
 {
 	BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH 2/5] nvme-vfio: add new vfio-pci driver for NVMe device
  2022-12-06  5:58 [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices Lei Rao
  2022-12-06  5:58 ` [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver Lei Rao
@ 2022-12-06  5:58 ` Lei Rao
  2022-12-06  5:58 ` [RFC PATCH 3/5] nvme-vfio: enable the function of VFIO live migration Lei Rao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Lei Rao @ 2022-12-06  5:58 UTC (permalink / raw)
  To: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan,
	Lei Rao

NVMe device has specific live migration implementation.
Add an specific VFIO PCI driver for NVMe device. Its
live migration support will be added in the subsequent
patches.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/Kconfig       |  2 +
 drivers/vfio/pci/Makefile      |  2 +
 drivers/vfio/pci/nvme/Kconfig  |  9 ++++
 drivers/vfio/pci/nvme/Makefile |  3 ++
 drivers/vfio/pci/nvme/nvme.c   | 99 ++++++++++++++++++++++++++++++++++
 5 files changed, 115 insertions(+)
 create mode 100644 drivers/vfio/pci/nvme/Kconfig
 create mode 100644 drivers/vfio/pci/nvme/Makefile
 create mode 100644 drivers/vfio/pci/nvme/nvme.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..fcd45144d3e3 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
 
+source "drivers/vfio/pci/nvme/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..eddc8e889726 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme/
diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
new file mode 100644
index 000000000000..c281fe154007
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Kconfig
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVME_VFIO_PCI
+	tristate "VFIO support for NVMe PCI devices"
+	depends on VFIO_PCI_CORE
+	help
+	  This provides generic VFIO PCI support for NVMe device
+	  using the VFIO framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvme/Makefile b/drivers/vfio/pci/nvme/Makefile
new file mode 100644
index 000000000000..2f4a0ad3d9cf
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme-vfio-pci.o
+nvme-vfio-pci-y := nvme.o
diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
new file mode 100644
index 000000000000..f1386d8a9287
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/vfio.h>
+#include <linux/anon_inodes.h>
+#include <linux/kernel.h>
+#include <linux/vfio_pci_core.h>
+
+static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	vfio_pci_core_finish_enable(vdev);
+	return 0;
+}
+
+static const struct vfio_device_ops nvmevf_pci_ops = {
+	.name = "nvme-vfio-pci",
+	.init = vfio_pci_core_init_dev,
+	.release = vfio_pci_core_release_dev,
+	.open_device = nvmevf_pci_open_device,
+	.close_device = vfio_pci_core_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+};
+
+static int nvmevf_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_core_device *vdev;
+	int ret;
+
+	vdev = vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev,
+				&nvmevf_pci_ops);
+	if (IS_ERR(vdev))
+		return PTR_ERR(vdev);
+
+	dev_set_drvdata(&pdev->dev, vdev);
+	ret = vfio_pci_core_register_device(vdev);
+	if (ret)
+		goto out_put_dev;
+
+	return 0;
+
+out_put_dev:
+	vfio_put_device(&vdev->vdev);
+	return ret;
+}
+
+static void nvmevf_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(vdev);
+	vfio_put_device(&vdev->vdev);
+}
+
+static const struct pci_device_id nvmevf_pci_table[] = {
+	/* Intel IPU NVMe Virtual Function */
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x1457) },
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, nvmevf_pci_table);
+
+static struct pci_driver nvmevf_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = nvmevf_pci_table,
+	.probe = nvmevf_pci_probe,
+	.remove = nvmevf_pci_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(nvmevf_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Lei Rao <lei.rao@intel.com>");
+MODULE_DESCRIPTION("NVMe VFIO PCI - Generic VFIO PCI driver for NVMe");
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH 3/5] nvme-vfio: enable the function of VFIO live migration.
  2022-12-06  5:58 [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices Lei Rao
  2022-12-06  5:58 ` [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver Lei Rao
  2022-12-06  5:58 ` [RFC PATCH 2/5] nvme-vfio: add new vfio-pci driver for NVMe device Lei Rao
@ 2022-12-06  5:58 ` Lei Rao
  2023-01-19 10:21   ` Max Gurtovoy
  2022-12-06  5:58 ` [RFC PATCH 4/5] nvme-vfio: check if the hardware supports " Lei Rao
  2022-12-06  5:58 ` [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device Lei Rao
  4 siblings, 1 reply; 65+ messages in thread
From: Lei Rao @ 2022-12-06  5:58 UTC (permalink / raw)
  To: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan,
	Lei Rao

Implement specific VFIO live migration operations for NVMe devices.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/nvme/Kconfig |   5 +-
 drivers/vfio/pci/nvme/nvme.c  | 543 ++++++++++++++++++++++++++++++++--
 drivers/vfio/pci/nvme/nvme.h  | 111 +++++++
 3 files changed, 637 insertions(+), 22 deletions(-)
 create mode 100644 drivers/vfio/pci/nvme/nvme.h

diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
index c281fe154007..12e0eaba0de1 100644
--- a/drivers/vfio/pci/nvme/Kconfig
+++ b/drivers/vfio/pci/nvme/Kconfig
@@ -1,9 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config NVME_VFIO_PCI
 	tristate "VFIO support for NVMe PCI devices"
+	depends on NVME_CORE
 	depends on VFIO_PCI_CORE
 	help
-	  This provides generic VFIO PCI support for NVMe device
-	  using the VFIO framework.
+	  This provides migration support for NVMe devices using the
+	  VFIO framework.
 
 	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
index f1386d8a9287..698e470a4e53 100644
--- a/drivers/vfio/pci/nvme/nvme.c
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -13,29 +13,503 @@
 #include <linux/types.h>
 #include <linux/vfio.h>
 #include <linux/anon_inodes.h>
-#include <linux/kernel.h>
-#include <linux/vfio_pci_core.h>
+
+#include "nvme.h"
+
+#define MAX_MIGRATION_SIZE (256 * 1024)
+
+static int nvmevf_cmd_suspend_device(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.suspend.opcode = nvme_admin_live_mig_suspend;
+	c.suspend.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev, "Suspend virtual function failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int nvmevf_cmd_resume_device(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.resume.opcode = nvme_admin_live_mig_resume;
+	c.resume.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev, "Resume virtual function failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int nvmevf_cmd_query_data_size(struct nvmevf_pci_core_device *nvmevf_dev,
+					  size_t *state_size)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	size_t result;
+	int ret;
+
+	c.query.opcode = nvme_admin_live_mig_query_data_size;
+	c.query.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, &result, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev, "Query the states size failed (ret=0x%x)\n", ret);
+		*state_size = 0;
+		return ret;
+	}
+	*state_size = result;
+	return 0;
+}
+
+static int nvmevf_cmd_save_data(struct nvmevf_pci_core_device *nvmevf_dev,
+				    void *buffer, size_t buffer_len)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.save.opcode = nvme_admin_live_mig_save_data;
+	c.save.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, buffer, buffer_len);
+	if (ret) {
+		dev_warn(&dev->dev, "Save the device states failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int nvmevf_cmd_load_data(struct nvmevf_pci_core_device *nvmevf_dev,
+				    struct nvmevf_migration_file *migf)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.load.opcode = nvme_admin_live_mig_load_data;
+	c.load.vf_index = nvmevf_dev->vf_id;
+	c.load.size = migf->total_length;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL,
+			migf->vf_data, migf->total_length);
+	if (ret) {
+		dev_warn(&dev->dev, "Load the device states failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct nvmevf_pci_core_device, core_device);
+}
+
+static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
+{
+	mutex_lock(&migf->lock);
+
+	/* release the device states buffer */
+	kvfree(migf->vf_data);
+	migf->vf_data = NULL;
+	migf->disabled = true;
+	migf->total_length = 0;
+	migf->filp->f_pos = 0;
+	mutex_unlock(&migf->lock);
+}
+
+static int nvmevf_release_file(struct inode *inode, struct file *filp)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+
+	nvmevf_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+	return 0;
+}
+
+static ssize_t nvmevf_save_read(struct file *filp, char __user *buf, size_t len, loff_t *pos)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	mutex_lock(&migf->lock);
+	if (*pos > migf->total_length) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (migf->disabled) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, migf->total_length - *pos, len);
+	if (len) {
+		ret = copy_to_user(buf, migf->vf_data + *pos, len);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += len;
+		done = len;
+	}
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations nvmevf_save_fops = {
+	.owner = THIS_MODULE,
+	.read = nvmevf_save_read,
+	.release = nvmevf_release_file,
+	.llseek = no_llseek,
+};
+
+static ssize_t nvmevf_resume_write(struct file *filp, const char __user *buf,
+				       size_t len, loff_t *pos)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+	loff_t requested_length;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	if (*pos < 0 || check_add_overflow((loff_t)len, *pos, &requested_length))
+		return -EINVAL;
+
+	if (requested_length > MAX_MIGRATION_SIZE)
+		return -ENOMEM;
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	ret = copy_from_user(migf->vf_data + *pos, buf, len);
+	if (ret) {
+		done = -EFAULT;
+		goto out_unlock;
+	}
+	*pos += len;
+	done = len;
+	migf->total_length += len;
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations nvmevf_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = nvmevf_resume_write,
+	.release = nvmevf_release_file,
+	.llseek = no_llseek,
+};
+
+static void nvmevf_disable_fds(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	if (nvmevf_dev->resuming_migf) {
+		nvmevf_disable_fd(nvmevf_dev->resuming_migf);
+		fput(nvmevf_dev->resuming_migf->filp);
+		nvmevf_dev->resuming_migf = NULL;
+	}
+
+	if (nvmevf_dev->saving_migf) {
+		nvmevf_disable_fd(nvmevf_dev->saving_migf);
+		fput(nvmevf_dev->saving_migf->filp);
+		nvmevf_dev->saving_migf = NULL;
+	}
+}
+
+static struct nvmevf_migration_file *
+nvmevf_pci_resume_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct nvmevf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_resume_fops, migf,
+					O_WRONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	/* Allocate buffer to load the device states and the max states is 256K */
+	migf->vf_data = kvzalloc(MAX_MIGRATION_SIZE, GFP_KERNEL);
+	if (!migf->vf_data) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	return migf;
+
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static struct nvmevf_migration_file *
+nvmevf_pci_save_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct nvmevf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_save_fops, migf,
+					O_RDONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	ret = nvmevf_cmd_query_data_size(nvmevf_dev, &migf->total_length);
+	if (ret)
+		goto out_free;
+	/* Allocate buffer and save the device states*/
+	migf->vf_data = kvzalloc(migf->total_length, GFP_KERNEL);
+	if (!migf->vf_data) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	ret = nvmevf_cmd_save_data(nvmevf_dev, migf->vf_data, migf->total_length);
+	if (ret)
+		goto out_free;
+
+	return migf;
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static struct file *
+nvmevf_pci_step_device_state_locked(struct nvmevf_pci_core_device *nvmevf_dev, u32 new)
+{
+	u32 cur = nvmevf_dev->mig_state;
+	int ret;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = nvmevf_cmd_suspend_device(nvmevf_dev);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct nvmevf_migration_file *migf;
+
+		migf = nvmevf_pci_save_device_data(nvmevf_dev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		nvmevf_dev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
+		nvmevf_disable_fds(nvmevf_dev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
+		struct nvmevf_migration_file *migf;
+
+		migf = nvmevf_pci_resume_device_data(nvmevf_dev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		nvmevf_dev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = nvmevf_cmd_load_data(nvmevf_dev, nvmevf_dev->resuming_migf);
+		if (ret)
+			return ERR_PTR(ret);
+		nvmevf_disable_fds(nvmevf_dev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING) {
+		nvmevf_cmd_resume_device(nvmevf_dev);
+		return NULL;
+	}
+
+	/* vfio_mig_get_next_state() does not use arcs other than the above */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+static void nvmevf_state_mutex_unlock(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+again:
+	spin_lock(&nvmevf_dev->reset_lock);
+	if (nvmevf_dev->deferred_reset) {
+		nvmevf_dev->deferred_reset = false;
+		spin_unlock(&nvmevf_dev->reset_lock);
+		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+		nvmevf_disable_fds(nvmevf_dev);
+		goto again;
+	}
+	mutex_unlock(&nvmevf_dev->state_mutex);
+	spin_unlock(&nvmevf_dev->reset_lock);
+}
+
+static struct file *
+nvmevf_pci_set_device_state(struct vfio_device *vdev, enum vfio_device_mig_state new_state)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(vdev,
+			struct nvmevf_pci_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+	int ret;
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	while (new_state != nvmevf_dev->mig_state) {
+		ret = vfio_mig_get_next_state(vdev, nvmevf_dev->mig_state, new_state, &next_state);
+		if (ret) {
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+
+		res = nvmevf_pci_step_device_state_locked(nvmevf_dev, next_state);
+		if (IS_ERR(res))
+			break;
+		nvmevf_dev->mig_state = next_state;
+		if (WARN_ON(res && new_state != nvmevf_dev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+	return res;
+}
+
+static int nvmevf_pci_get_device_state(struct vfio_device *vdev,
+					   enum vfio_device_mig_state *curr_state)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
+			vdev, struct nvmevf_pci_core_device, core_device.vdev);
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	*curr_state = nvmevf_dev->mig_state;
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+	return 0;
+}
 
 static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
+			core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &nvmevf_dev->core_device;
 	int ret;
 
 	ret = vfio_pci_core_enable(vdev);
 	if (ret)
 		return ret;
 
+	if (nvmevf_dev->migrate_cap)
+		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
 	vfio_pci_core_finish_enable(vdev);
 	return 0;
 }
 
+static void nvmevf_cmd_close_migratable(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	if (!nvmevf_dev->migrate_cap)
+		return;
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	nvmevf_disable_fds(nvmevf_dev);
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+}
+
+static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
+			core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
+
+	nvmevf_cmd_close_migratable(nvmevf_dev);
+	vfio_pci_core_close_device(core_vdev);
+}
+
+static const struct vfio_migration_ops nvmevf_pci_mig_ops = {
+	.migration_set_state = nvmevf_pci_set_device_state,
+	.migration_get_state = nvmevf_pci_get_device_state,
+};
+
+static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(core_vdev,
+					struct nvmevf_pci_core_device, core_device.vdev);
+	struct pci_dev *pdev = to_pci_dev(core_vdev->dev);
+	int vf_id;
+	int ret = -1;
+
+	if (!pdev->is_virtfn)
+		return ret;
+
+	nvmevf_dev->migrate_cap = 1;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return ret;
+	nvmevf_dev->vf_id = vf_id + 1;
+	core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
+
+	mutex_init(&nvmevf_dev->state_mutex);
+	spin_lock_init(&nvmevf_dev->reset_lock);
+	core_vdev->mig_ops = &nvmevf_pci_mig_ops;
+
+	return vfio_pci_core_init_dev(core_vdev);
+}
+
 static const struct vfio_device_ops nvmevf_pci_ops = {
 	.name = "nvme-vfio-pci",
-	.init = vfio_pci_core_init_dev,
+	.init = nvmevf_migration_init_dev,
 	.release = vfio_pci_core_release_dev,
 	.open_device = nvmevf_pci_open_device,
-	.close_device = vfio_pci_core_close_device,
+	.close_device = nvmevf_pci_close_device,
 	.ioctl = vfio_pci_core_ioctl,
 	.device_feature = vfio_pci_core_ioctl_feature,
 	.read = vfio_pci_core_read,
@@ -47,32 +521,56 @@ static const struct vfio_device_ops nvmevf_pci_ops = {
 
 static int nvmevf_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
-	struct vfio_pci_core_device *vdev;
+	struct nvmevf_pci_core_device *nvmevf_dev;
 	int ret;
 
-	vdev = vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev,
-				&nvmevf_pci_ops);
-	if (IS_ERR(vdev))
-		return PTR_ERR(vdev);
+	nvmevf_dev = vfio_alloc_device(nvmevf_pci_core_device, core_device.vdev,
+					&pdev->dev, &nvmevf_pci_ops);
+	if (IS_ERR(nvmevf_dev))
+		return PTR_ERR(nvmevf_dev);
 
-	dev_set_drvdata(&pdev->dev, vdev);
-	ret = vfio_pci_core_register_device(vdev);
+	dev_set_drvdata(&pdev->dev, &nvmevf_dev->core_device);
+	ret = vfio_pci_core_register_device(&nvmevf_dev->core_device);
 	if (ret)
 		goto out_put_dev;
-
 	return 0;
 
 out_put_dev:
-	vfio_put_device(&vdev->vdev);
+	vfio_put_device(&nvmevf_dev->core_device.vdev);
 	return ret;
+
 }
 
 static void nvmevf_pci_remove(struct pci_dev *pdev)
 {
-	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
+
+	vfio_pci_core_unregister_device(&nvmevf_dev->core_device);
+	vfio_put_device(&nvmevf_dev->core_device.vdev);
+}
+
+static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
+
+	if (!nvmevf_dev->migrate_cap)
+		return;
 
-	vfio_pci_core_unregister_device(vdev);
-	vfio_put_device(&vdev->vdev);
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock we need to prevent ABBA deadlock
+	 * with the state_mutex and mm_lock.
+	 * In case the state_mutex was taken already we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&nvmevf_dev->reset_lock);
+	nvmevf_dev->deferred_reset = true;
+	if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
+		spin_unlock(&nvmevf_dev->reset_lock);
+		return;
+	}
+	spin_unlock(&nvmevf_dev->reset_lock);
+	nvmevf_state_mutex_unlock(nvmevf_dev);
 }
 
 static const struct pci_device_id nvmevf_pci_table[] = {
@@ -83,12 +581,17 @@ static const struct pci_device_id nvmevf_pci_table[] = {
 
 MODULE_DEVICE_TABLE(pci, nvmevf_pci_table);
 
+static const struct pci_error_handlers nvmevf_err_handlers = {
+	.reset_done = nvmevf_pci_aer_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
 static struct pci_driver nvmevf_pci_driver = {
 	.name = KBUILD_MODNAME,
 	.id_table = nvmevf_pci_table,
 	.probe = nvmevf_pci_probe,
 	.remove = nvmevf_pci_remove,
-	.err_handler = &vfio_pci_core_err_handlers,
+	.err_handler = &nvmevf_err_handlers,
 	.driver_managed_dma = true,
 };
 
@@ -96,4 +599,4 @@ module_pci_driver(nvmevf_pci_driver);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Lei Rao <lei.rao@intel.com>");
-MODULE_DESCRIPTION("NVMe VFIO PCI - Generic VFIO PCI driver for NVMe");
+MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live migration support for NVMe");
diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
new file mode 100644
index 000000000000..c8464554ef53
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
+ */
+
+#ifndef NVME_VFIO_PCI_H
+#define NVME_VFIO_PCI_H
+
+#include <linux/kernel.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/nvme.h>
+
+struct nvme_live_mig_query_size {
+	__u8	opcode;
+	__u8	flags;
+	__u16	command_id;
+	__u32	rsvd1[9];
+	__u16	vf_index;
+	__u16	rsvd2;
+	__u32	rsvd3[5];
+};
+
+struct nvme_live_mig_suspend {
+	__u8	opcode;
+	__u8	flags;
+	__u16	command_id;
+	__u32	rsvd1[9];
+	__u16	vf_index;
+	__u16	rsvd2;
+	__u32	rsvd3[5];
+};
+
+struct nvme_live_mig_resume {
+	__u8    opcode;
+	__u8    flags;
+	__u16   command_id;
+	__u32   rsvd1[9];
+	__u16   vf_index;
+	__u16   rsvd2;
+	__u32   rsvd3[5];
+};
+
+struct nvme_live_mig_save_data {
+	__u8	opcode;
+	__u8	flags;
+	__u16	command_id;
+	__u32	rsvd1[5];
+	__le64	prp1;
+	__le64	prp2;
+	__u16	vf_index;
+	__u16	rsvd2;
+	__u32	rsvd3[5];
+};
+
+struct nvme_live_mig_load_data {
+	__u8    opcode;
+	__u8    flags;
+	__u16   command_id;
+	__u32   rsvd1[5];
+	__le64  prp1;
+	__le64  prp2;
+	__u16   vf_index;
+	__u16	rsvd2;
+	__u32	size;
+	__u32   rsvd3[4];
+};
+
+enum nvme_live_mig_admin_opcode {
+	nvme_admin_live_mig_query_data_size	= 0xC4,
+	nvme_admin_live_mig_suspend		= 0xC8,
+	nvme_admin_live_mig_resume		= 0xCC,
+	nvme_admin_live_mig_save_data		= 0xD2,
+	nvme_admin_live_mig_load_data		= 0xD5,
+};
+
+struct nvme_live_mig_command {
+	union {
+		struct nvme_live_mig_query_size query;
+		struct nvme_live_mig_suspend	suspend;
+		struct nvme_live_mig_resume	resume;
+		struct nvme_live_mig_save_data	save;
+		struct nvme_live_mig_load_data	load;
+	};
+};
+
+struct nvmevf_migration_file {
+	struct file *filp;
+	struct mutex lock;
+	bool disabled;
+	u8 *vf_data;
+	size_t total_length;
+};
+
+struct nvmevf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	int vf_id;
+	u8 migrate_cap:1;
+	u8 deferred_reset:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	/* protect the reset_done flow */
+	spinlock_t reset_lock;
+	struct nvmevf_migration_file *resuming_migf;
+	struct nvmevf_migration_file *saving_migf;
+};
+
+extern int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+			size_t *result, void *buffer, unsigned int bufflen);
+
+#endif /* NVME_VFIO_PCI_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH 4/5] nvme-vfio: check if the hardware supports live migration
  2022-12-06  5:58 [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices Lei Rao
                   ` (2 preceding siblings ...)
  2022-12-06  5:58 ` [RFC PATCH 3/5] nvme-vfio: enable the function of VFIO live migration Lei Rao
@ 2022-12-06  5:58 ` Lei Rao
  2022-12-06 13:47   ` Keith Busch
  2022-12-06  5:58 ` [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device Lei Rao
  4 siblings, 1 reply; 65+ messages in thread
From: Lei Rao @ 2022-12-06  5:58 UTC (permalink / raw)
  To: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan,
	Lei Rao

NVMe device can extend the vendor-specific field in the identify
controller data structure to indicate whether live migration
is supported. This patch checks if the NVMe device supports
live migration.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/nvme/nvme.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
index 698e470a4e53..2ffc90ad556d 100644
--- a/drivers/vfio/pci/nvme/nvme.c
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -473,6 +473,36 @@ static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
 	vfio_pci_core_close_device(core_vdev);
 }
 
+static bool nvmevf_check_migration(struct pci_dev *pdev)
+{
+	struct nvme_command c = { };
+	struct nvme_id_ctrl *id;
+	u8 live_mig_support;
+	int ret;
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = NVME_ID_CNS_CTRL;
+
+	id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
+	if (!id)
+		return false;
+
+	ret = nvme_submit_vf_cmd(pdev, &c, NULL, id, sizeof(struct nvme_id_ctrl));
+	if (ret) {
+		dev_warn(&pdev->dev, "Get identify ctrl failed (ret=0x%x)\n", ret);
+		goto out;
+	}
+
+	live_mig_support = id->vs[0];
+	if (live_mig_support) {
+		kfree(id);
+		return true;
+	}
+out:
+	kfree(id);
+	return false;
+}
+
 static const struct vfio_migration_ops nvmevf_pci_mig_ops = {
 	.migration_set_state = nvmevf_pci_set_device_state,
 	.migration_get_state = nvmevf_pci_get_device_state,
@@ -489,6 +519,10 @@ static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
 	if (!pdev->is_virtfn)
 		return ret;
 
+	/* Get the identify controller data structure to check the live migration support */
+	if (!nvmevf_check_migration(pdev))
+		return ret;
+
 	nvmevf_dev->migrate_cap = 1;
 
 	vf_id = pci_iov_vf_id(pdev);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06  5:58 [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices Lei Rao
                   ` (3 preceding siblings ...)
  2022-12-06  5:58 ` [RFC PATCH 4/5] nvme-vfio: check if the hardware supports " Lei Rao
@ 2022-12-06  5:58 ` Lei Rao
  2022-12-06  6:26   ` Christoph Hellwig
  4 siblings, 1 reply; 65+ messages in thread
From: Lei Rao @ 2022-12-06  5:58 UTC (permalink / raw)
  To: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan,
	Lei Rao

The documentation describes the details of the NVMe hardware
extension to support VFIO live migration.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+)
 create mode 100644 drivers/vfio/pci/nvme/nvme.txt

diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt
new file mode 100644
index 000000000000..eadcf2082eed
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.txt
@@ -0,0 +1,278 @@
+===========================
+NVMe Live Migration Support
+===========================
+
+Introduction
+------------
+To support live migration, NVMe device designs its own implementation,
+including five new specific admin commands and a capability flag in
+the vendor-specific field in the identify controller data structure to
+support VF's live migration usage. Software can use these live migration
+admin commands to get device migration state data size, save and load the
+data, suspend and resume the given VF device. They are submitted by software
+to the NVMe PF device's admin queue and ignored if placed in the VF device's
+admin queue. This is due to the NVMe VF device being passed to the virtual
+machine in the virtualization scenario. So VF device's admin queue is not
+available for the hypervisor to submit VF device live migration commands.
+The capability flag in the identify controller data structure can be used by
+software to detect if the NVMe device supports live migration. The following
+chapters introduce the detailed format of the commands and the capability flag.
+
+Definition of opcode for live migration commands
+------------------------------------------------
+
++---------------------------+-----------+-----------+------------+
+|                           |           |           |            |
+|     Opcode by Field       |           |           |            |
+|                           |           |           |            |
++--------+---------+--------+           |           |            |
+|        |         |        | Combined  | Namespace |            |
+|    07  |  06:02  | 01:00  |  Opcode   | Identifier|  Command   |
+|        |         |        |           |    used   |            |
++--------+---------+--------+           |           |            |
+|Generic | Function|  Data  |           |           |            |
+|command |         |Transfer|           |           |            |
++--------+---------+--------+-----------+-----------+------------+
+|                                                                |
+|                     Vendor SpecificOpcode                      |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Query the  |
+|   1b   |  10001  |  00    |   0xC4    |           | data size  |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Suspend the|
+|   1b   |  10010  |  00    |   0xC8    |           |    VF      |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Resume the |
+|   1b   |  10011  |  00    |   0xCC    |           |    VF      |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Save the   |
+|   1b   |  10100  |  10    |   0xD2    |           |device data |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Load the   |
+|   1b   |  10101  |  01    |   0xD5    |           |device data |
++--------+---------+--------+-----------+-----------+------------+
+
+Definition of QUERY_DATA_SIZE command
+-------------------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xC4 to indicate a qeury command                 | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for more details[1]      | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1]    | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+|  41:40  |  VF index: means which VF controller internal data size to query                   |
++---------+------------------------------------------------------------------------------------+
+|  63:42  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+
+The QUERY_DATA_SIZE command is used to query the NVMe VF internal data size for live migration.
+When the NVMe firmware receives the command, it will return the size of NVMe VF internal
+data. The data size depends on how many IO queues are created.
+
+Definition of SUSPEND command
+-----------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xC8 to indicate a suspend command               | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe specification for details[1]  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+|  41:40  |  VF index: means which VF controller to suspend                                    |
++---------+------------------------------------------------------------------------------------+
+|  63:42  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+
+The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives
+this command, it will suspend the NVMe VF controller.
+
+Definition of RESUME command
+----------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xCC to indicate a resume command                | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+|  41:40  |  VF index: means which VF controller to resume                                     |
++---------+------------------------------------------------------------------------------------+
+|  63:42  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+
+The RESUME command is used to resume the NVMe VF controller. When firmware receives this command,
+it will restart the NVMe VF controller.
+
+Definition of SAVE_DATA command
+--------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xD2 to indicate a save command                  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  23:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  31:24  | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer              |
++---------+------------------------------------------------------------------------------------+
+|  39:32  | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: means which VF controller internal data to save                          |
++---------+------------------------------------------------------------------------------------+
+|  63:42  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware
+receives this command, it will save the admin queue states, save some registers, drain IO SQs
+and CQs, save every IO queue state, disable the VF controller, and transfer all data to the
+host memory through DMA.
+
+Definition of LOAD_DATA command
+--------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xD5 to indicate a load command                  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1]    | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  23:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  31:24  | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer              |
++---------+------------------------------------------------------------------------------------+
+|  39:32  | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: means which VF controller internal data to load                          |
++---------+------------------------------------------------------------------------------------+
+|  47:44  | Size: means the size of the device's internal data to be loaded                    |
++---------+------------------------------------------------------------------------------------+
+|  63:48  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this
+command, it will read the device internal's data from the host memory through DMA, restore the
+admin queue states and some registers, and restore every IO queue state.
+
+Extensions of the vendor-specific field in the identify controller data structure
+---------------------------------------------------------------------------------
+
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  Bytes  | I/O  |Admin | Disc |        Description            |
+|         |      |      |      |                               |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+| 01:00   |  M   |  M   |  R   | PCI Vendor ID(VID)            |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+| 03:02   |  M   |  M   |  R   | PCI Subsytem Vendor ID(SSVID) |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  ...    | ...  | ...  | ...  |  ...                          |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  3072   |  O   |  O   |  O   | Live Migration Support        |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|4095:3073|  O   |  O   |  O   | Vendor Specific               |
++---------+------+------+------+-------------------------------+
+
+According to NVMe specification, the bytes from 3072 to 4095 are vendor-specific fields.
+NVMe device uses the 3072 bytes in the identify controller data structure to indicate
+whether live migration is supported. 0x0 means live migration is not supported. 0x01 means
+live migration is supported, and other values are reserved.
+
+[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06  5:58 ` [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver Lei Rao
@ 2022-12-06  6:19   ` Christoph Hellwig
  2022-12-06 13:44     ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06  6:19 UTC (permalink / raw)
  To: Lei Rao
  Cc: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> The new function nvme_submit_vf_cmd() helps the host VF driver to issue
> VF admin commands. It's helpful in some cases that the host NVMe driver
> does not control VF's admin queue. For example, in the virtualization
> device pass-through case, the VF controller's admin queue is governed
> by the Guest NVMe driver. Host VF driver relies on PF device's admin
> queue to control VF devices like vendor-specific live migration commands.

WTF are you even smoking when you think this would be acceptable?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06  5:58 ` [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device Lei Rao
@ 2022-12-06  6:26   ` Christoph Hellwig
  2022-12-06 13:05     ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06  6:26 UTC (permalink / raw)
  To: Lei Rao
  Cc: kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck, jgg,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 01:58:16PM +0800, Lei Rao wrote:
> The documentation describes the details of the NVMe hardware
> extension to support VFIO live migration.

This is not a NVMe hardware extension, this is some really strange
and half-assed intel-specific extension to nvme, which like any other
vendors specific non-standard extensions to nvme we refused to support
in Linux.

There is a TPAR for live migration building blocks under discussion in
the NVMe technical working group.  It will still require mediatation
of access to the admin queue to deal with the huge amout of state nvme
has that needs to be migrated (and which doesn't seem to be covered at
all here).  In Linux the equivalent would be to implement a mdev driver
that allows passing through the I/O qeues to a guest, but it might
be a better idea to handle the device model emulation entirely in
Qemu (or other userspace device models) and just find a way to expose
enough of the I/O queues to userspace.

The current TPAR seems to be very complicated for that, as in many
cases we'd only need a way to tіe certain namespaces to certain I/O
queues and not waste a lot of resources on the rest of the controller.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06  6:26   ` Christoph Hellwig
@ 2022-12-06 13:05     ` Jason Gunthorpe
  2022-12-06 13:09       ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 13:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 07:26:04AM +0100, Christoph Hellwig wrote:
> all here).  In Linux the equivalent would be to implement a mdev driver
> that allows passing through the I/O qeues to a guest, but it might

Definately not - "mdev" drivers should be avoided as much as possible.

In this case Intel has a real PCI SRIOV VF to expose to the guest,
with a full VF RID. The proper VFIO abstraction is the variant PCI
driver as this series does. We want to use the variant PCI drivers
because they properly encapsulate all the PCI behaviors (MSI, config
space, regions, reset, etc) without requiring re-implementation of this
in mdev drivers.

mdev drivers should only be considered if a real PCI VF is not
available - eg because the device is doing "SIOV" or something.

We have several migration drivers in VFIO now following this general
pattern, from what I can see they have done it broadly properly from a
VFIO perspective.

> be a better idea to handle the device model emulation entirely in
> Qemu (or other userspace device models) and just find a way to expose
> enough of the I/O queues to userspace.

This is much closer to the VDPA model which is basically providing a
some kernel support to access the IO queue and a lot of SW in qemu to
generate the PCI device in the VM.

The approach has positives and negatives, we have done both in mlx5
devices and we have a preference toward the VFIO model. VPDA
specifically is very big and complicated compared to the VFIO
approach.

Overall having fully functional PCI SRIOV VF's available lets more
uses cases work than just "qemu to create a VM". qemu can always build
a VDPA like thing by using VFIO and VFIO live migration to shift
control of the device between qemu and HW.

I don't think we know enough about this space at the moment to fix a
specification to one path or the other, so I hope the TPAR will settle
on something that can support both models in SW and people can try
things out.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 13:05     ` Jason Gunthorpe
@ 2022-12-06 13:09       ` Christoph Hellwig
  2022-12-06 13:52         ` Jason Gunthorpe
  2022-12-09  2:05         ` Tian, Kevin
  0 siblings, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 13:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> In this case Intel has a real PCI SRIOV VF to expose to the guest,
> with a full VF RID.

RID?

> The proper VFIO abstraction is the variant PCI
> driver as this series does. We want to use the variant PCI drivers
> because they properly encapsulate all the PCI behaviors (MSI, config
> space, regions, reset, etc) without requiring re-implementation of this
> in mdev drivers.

I don't think the code in this series has any chance of actually
working.  There is a lot of state associated with a NVMe subsystem,
controller and namespace, such as the serial number, subsystem NQN,
namespace uniqueue identifiers, Get/Set features state, pending AENs,
log page content.  Just migrating from one device to another without
capturing all this has no chance of actually working.

> I don't think we know enough about this space at the moment to fix a
> specification to one path or the other, so I hope the TPAR will settle
> on something that can support both models in SW and people can try
> things out.

I've not seen anyone from Intel actually contributing to the live
migration TPAR, which is almost two month old by now.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06  6:19   ` Christoph Hellwig
@ 2022-12-06 13:44     ` Jason Gunthorpe
  2022-12-06 13:51       ` Keith Busch
  2022-12-06 13:58       ` Christoph Hellwig
  0 siblings, 2 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 13:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 07:19:40AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> > The new function nvme_submit_vf_cmd() helps the host VF driver to issue
> > VF admin commands. It's helpful in some cases that the host NVMe driver
> > does not control VF's admin queue. For example, in the virtualization
> > device pass-through case, the VF controller's admin queue is governed
> > by the Guest NVMe driver. Host VF driver relies on PF device's admin
> > queue to control VF devices like vendor-specific live migration commands.
> 
> WTF are you even smoking when you think this would be acceptable?

Not speaking to NVMe - but this driver is clearly copying mlx5's live
migration driver, almost completely - including this basic function.

So, to explain why mlx5 works this way..

The VFIO approach is to fully assign an entire VF to the guest OS. The
entire VF assignment means every MMIO register *and all the DMA* of
the VF is owned by the guest operating system.

mlx5 needs to transfer hundreds of megabytes to gigabytes of in-device
state to perform a migration.

So, we must be able to use DMA to transfer the data. However, the VM
exclusively controls the DMA of the VF. The iommu_domain of the VF
belongs to the guest VM through VFIO, and we simply cannot mutate
it. Not only should not, but physically can not, ie when IOMMU nested
translation is in use and the IO page tables are in guest VM memory.

So the VF cannot be used to control the migration, or transfer the
migration data. This leaves only the PF.

Thus, mxl5 has the same sort of design where the VF VFIO driver
reaches into the PF kernel driver and asks the PF driver to perform
some commands targeting the PF's own VFs. The DMA is then done using
the RID of the PF, and reaches the kernel owned iommu_domain of the
PF. This way the entire operation is secure aginst meddling by the
guest.

We can contrast this with the hisilicon live migration driver that
does not use the PF for control. Instead it has a very small state
that the migration driver simply reads out of registers. The VF has a
page of registers that control pause/go of the queues and the VFIO
varient driver denies access to this page from the guest VM so that
the kernel VFIO driver has reliable control over the VF.

Without involving PASID this is broadly the only two choices for doing
SRIOV live migration, AFAIK.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 4/5] nvme-vfio: check if the hardware supports live migration
  2022-12-06  5:58 ` [RFC PATCH 4/5] nvme-vfio: check if the hardware supports " Lei Rao
@ 2022-12-06 13:47   ` Keith Busch
  0 siblings, 0 replies; 65+ messages in thread
From: Keith Busch @ 2022-12-06 13:47 UTC (permalink / raw)
  To: Lei Rao
  Cc: axboe, kch, hch, sagi, alex.williamson, cohuck, jgg, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan

On Tue, Dec 06, 2022 at 01:58:15PM +0800, Lei Rao wrote:
> +static bool nvmevf_check_migration(struct pci_dev *pdev)
> +{
> +	struct nvme_command c = { };
> +	struct nvme_id_ctrl *id;
> +	u8 live_mig_support;
> +	int ret;
> +
> +	c.identify.opcode = nvme_admin_identify;
> +	c.identify.cns = NVME_ID_CNS_CTRL;
> +
> +	id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
> +	if (!id)
> +		return false;
> +
> +	ret = nvme_submit_vf_cmd(pdev, &c, NULL, id, sizeof(struct nvme_id_ctrl));
> +	if (ret) {
> +		dev_warn(&pdev->dev, "Get identify ctrl failed (ret=0x%x)\n", ret);
> +		goto out;
> +	}
> +
> +	live_mig_support = id->vs[0];

Considering this is a vendor specific region, it seems rather presumptuous to
assume this byte means "live migration supported".

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 13:44     ` Jason Gunthorpe
@ 2022-12-06 13:51       ` Keith Busch
  2022-12-06 14:27         ` Jason Gunthorpe
  2022-12-06 13:58       ` Christoph Hellwig
  1 sibling, 1 reply; 65+ messages in thread
From: Keith Busch @ 2022-12-06 13:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, axboe, kch, sagi, alex.williamson,
	cohuck, yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 09:44:08AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 07:19:40AM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> > > The new function nvme_submit_vf_cmd() helps the host VF driver to issue
> > > VF admin commands. It's helpful in some cases that the host NVMe driver
> > > does not control VF's admin queue. For example, in the virtualization
> > > device pass-through case, the VF controller's admin queue is governed
> > > by the Guest NVMe driver. Host VF driver relies on PF device's admin
> > > queue to control VF devices like vendor-specific live migration commands.
> > 
> > WTF are you even smoking when you think this would be acceptable?
> 
> Not speaking to NVMe - but this driver is clearly copying mlx5's live
> migration driver, almost completely - including this basic function.
> 
> So, to explain why mlx5 works this way..
> 
> The VFIO approach is to fully assign an entire VF to the guest OS. The
> entire VF assignment means every MMIO register *and all the DMA* of
> the VF is owned by the guest operating system.
> 
> mlx5 needs to transfer hundreds of megabytes to gigabytes of in-device
> state to perform a migration.

For storage, though, you can't just transfer the controller state. You have to
transfer all the namespace user data, too. So potentially many terabytes?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 13:09       ` Christoph Hellwig
@ 2022-12-06 13:52         ` Jason Gunthorpe
  2022-12-06 14:00           ` Christoph Hellwig
  2022-12-09  2:05         ` Tian, Kevin
  1 sibling, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 13:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 02:09:01PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> > In this case Intel has a real PCI SRIOV VF to expose to the guest,
> > with a full VF RID.
> 
> RID?

"Requester ID" - PCI SIG term that in Linux basically means you get to
assign an iommu_domain to the vfio device.

Compared to a mdev where many vfio devices will share the same RID and
cannot have iommu_domain's without using PASID.

> > The proper VFIO abstraction is the variant PCI
> > driver as this series does. We want to use the variant PCI drivers
> > because they properly encapsulate all the PCI behaviors (MSI, config
> > space, regions, reset, etc) without requiring re-implementation of this
> > in mdev drivers.
> 
> I don't think the code in this series has any chance of actually
> working.  There is a lot of state associated with a NVMe subsystem,
> controller and namespace, such as the serial number, subsystem NQN,
> namespace uniqueue identifiers, Get/Set features state, pending AENs,
> log page content.  Just migrating from one device to another without
> capturing all this has no chance of actually working.

From what I understood this series basically allows two Intel devices
to pass a big opaque blob of data. Intel didn't document what is in
that blob, so I assume it captures everything you mention above.

At least, that is the approach we have taken with mlx5. Every single
bit of device state is serialized into the blob and when the device
resumes it is indistinguishable from the original. Otherwise it is a
bug.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 13:44     ` Jason Gunthorpe
  2022-12-06 13:51       ` Keith Busch
@ 2022-12-06 13:58       ` Christoph Hellwig
  2022-12-06 15:22         ` Jason Gunthorpe
  1 sibling, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 13:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 09:44:08AM -0400, Jason Gunthorpe wrote:
> Not speaking to NVMe - but this driver is clearly copying mlx5's live
> migration driver, almost completely - including this basic function.

Maybe that's not a good idea in an NVMe environment, and maybe that
should have talked to the standards committee before spending their
time on cargo cult engineering.

Most importantly NVMe is very quiet on the relationship between
VFs and PFs, and there is no way to guarantee that a PF is, at the
NVMe level, much in control of a VF at all.  In other words this
concept really badly breaks NVMe abstractions.

> Thus, mxl5 has the same sort of design where the VF VFIO driver
> reaches into the PF kernel driver and asks the PF driver to perform
> some commands targeting the PF's own VFs. The DMA is then done using
> the RID of the PF, and reaches the kernel owned iommu_domain of the
> PF. This way the entire operation is secure aginst meddling by the
> guest.

And the works for you because you have a clearly defined relationship.
In NVMe we do not have this.  We'd either need to define a way
to query that relationship or find another way to deal with the
problem.  But in doubt the best design would be to drive VF
live migration entirely from the PF, turn the lookup from controlled
function to controlling function upside down, that is a list of
controlled functions (which could very well be, and in some designs
are, additional PFs and not VFs) by controlling function.  In fact
NVMe already has that list in it's architecture with the
"Secondary Controller List".

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 13:52         ` Jason Gunthorpe
@ 2022-12-06 14:00           ` Christoph Hellwig
  2022-12-06 14:20             ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 14:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 09:52:54AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 02:09:01PM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> > > In this case Intel has a real PCI SRIOV VF to expose to the guest,
> > > with a full VF RID.
> > 
> > RID?
> 
> "Requester ID" - PCI SIG term that in Linux basically means you get to
> assign an iommu_domain to the vfio device.

Yeah I now the Requester ID, I've just never seen that shortcut for it.

> >From what I understood this series basically allows two Intel devices
> to pass a big opaque blob of data. Intel didn't document what is in
> that blob, so I assume it captures everything you mention above.

Which would be just as bad, because it then changes the IDs under
the live OS on a restore.  This is not something that can be done
behind the back of the hypervisors / control plane OS.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 14:00           ` Christoph Hellwig
@ 2022-12-06 14:20             ` Jason Gunthorpe
  2022-12-06 14:31               ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 14:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 03:00:02PM +0100, Christoph Hellwig wrote:
> > >From what I understood this series basically allows two Intel devices
> > to pass a big opaque blob of data. Intel didn't document what is in
> > that blob, so I assume it captures everything you mention above.
> 
> Which would be just as bad, because it then changes the IDs under
> the live OS on a restore.  This is not something that can be done
> behind the back of the hypervisors / control plane OS.

Sorry, what live OS?

In the VFIO restore model there is no "live OS" on resume. The
load/resume cycle is as destructive as reset to the vfio device.

When qemu operates vfio the destination CPU will not be running until
the load/resume of all the VFIO devices is completed.

So from the VM perspective it sees a complete no change, so long as
the data blob causes the destination vfio device to fully match the
source, including all IDs, etc.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 13:51       ` Keith Busch
@ 2022-12-06 14:27         ` Jason Gunthorpe
  0 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 14:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Lei Rao, axboe, kch, sagi, alex.williamson,
	cohuck, yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 01:51:32PM +0000, Keith Busch wrote:
> On Tue, Dec 06, 2022 at 09:44:08AM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 06, 2022 at 07:19:40AM +0100, Christoph Hellwig wrote:
> > > On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> > > > The new function nvme_submit_vf_cmd() helps the host VF driver to issue
> > > > VF admin commands. It's helpful in some cases that the host NVMe driver
> > > > does not control VF's admin queue. For example, in the virtualization
> > > > device pass-through case, the VF controller's admin queue is governed
> > > > by the Guest NVMe driver. Host VF driver relies on PF device's admin
> > > > queue to control VF devices like vendor-specific live migration commands.
> > > 
> > > WTF are you even smoking when you think this would be acceptable?
> > 
> > Not speaking to NVMe - but this driver is clearly copying mlx5's live
> > migration driver, almost completely - including this basic function.
> > 
> > So, to explain why mlx5 works this way..
> > 
> > The VFIO approach is to fully assign an entire VF to the guest OS. The
> > entire VF assignment means every MMIO register *and all the DMA* of
> > the VF is owned by the guest operating system.
> > 
> > mlx5 needs to transfer hundreds of megabytes to gigabytes of in-device
> > state to perform a migration.
> 
> For storage, though, you can't just transfer the controller state. You have to
> transfer all the namespace user data, too. So potentially many terabytes?

There are two different scenarios - lets call it shared medium and
local medium.

If the medium is shared then only the controller state needs to be
transfered. The controller state would include enough information to
locate and identify the shared medium.

This would apply to cases like DPU/smart NIC, multi-port physical
drives, etc.

local medium will require the medium copy. Within Linux I don't have a
clear sense if that should be done within the VFIO migration
framework, or if it would better to have its own operations.

For the NVMe spec I'd strongly suggest keeping medium copy as its own
set of commands.

It will be interesting to see how to standardize all these scenarios :)

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 14:20             ` Jason Gunthorpe
@ 2022-12-06 14:31               ` Christoph Hellwig
  2022-12-06 14:48                 ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 14:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 10:20:26AM -0400, Jason Gunthorpe wrote:
> In the VFIO restore model there is no "live OS" on resume. The
> load/resume cycle is as destructive as reset to the vfio device.

Of course there may be and OS.  As soon as the VF is live Linux
will by default bind to it.  And that's the big problem here,
the VF should not actually exist or at least not be usable when
such a restore happens - or to say it in NVMe terms, the Secondary
Controller better be in offline state when state is loaded into it.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 14:31               ` Christoph Hellwig
@ 2022-12-06 14:48                 ` Jason Gunthorpe
  2022-12-06 15:01                   ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 14:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 03:31:26PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 10:20:26AM -0400, Jason Gunthorpe wrote:
> > In the VFIO restore model there is no "live OS" on resume. The
> > load/resume cycle is as destructive as reset to the vfio device.
> 
> Of course there may be and OS.  As soon as the VF is live Linux
> will by default bind to it.  And that's the big problem here,
> the VF should not actually exist or at least not be usable when
> such a restore happens - or to say it in NVMe terms, the Secondary
> Controller better be in offline state when state is loaded into it.

Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
use.

What we do have is a transfer of control from the normal OS driver (eg
nvme) to the VFIO driver. Also, remember, that VFIO only does live
migration between VFIO devices. We cannot use live migration and end
up with a situation where the normal nvme driver is controlling the
VF.

The VFIO load model is explicitly destructive. We replace the current
VF with the loading VF. Both the VFIO variant driver and the VFIO
userspace issuing the load have to be aware of this and understand
that the whole device will change.

From an implementation perspective, I would expect the nvme varient
driver to either place the nvme device in the correct state during
load, or refuse to execute load if it is in the wrong state.

To be compatible with what qemu is doing the "right state" should be
entered by completing function level reset of the VF.

The Linux/qemu parts are still being finalized, so if you see
something that could be changed to better match nvme it would be a
great time to understand that.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 14:48                 ` Jason Gunthorpe
@ 2022-12-06 15:01                   ` Christoph Hellwig
  2022-12-06 15:28                     ` Jason Gunthorpe
  2022-12-11 12:05                     ` Max Gurtovoy
  0 siblings, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 15:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
> use.

Beward:  The secondary function might as well be a physical function
as well.  In fact one of the major customers for "smart" multifunction
nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
symmetric dual ported devices are multi-PF as well).

So this isn't really about a VF live cycle, but how to manage life
migration, especially on the receive / restore side.  And restoring
the entire controller state is extremely invasive and can't be done
on a controller that is in any classic form live.  In fact a lot
of the state is subsystem-wide, so without some kind of virtualization
of the subsystem it is impossible to actually restore the state.

To cycle back to the hardware that is posted here, I'm really confused
how it actually has any chance to work and no one has even tried
to explain how it is supposed to work.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 13:58       ` Christoph Hellwig
@ 2022-12-06 15:22         ` Jason Gunthorpe
  2022-12-06 15:38           ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 02:58:10PM +0100, Christoph Hellwig wrote:

> Most importantly NVMe is very quiet on the relationship between
> VFs and PFs, and there is no way to guarantee that a PF is, at the
> NVMe level, much in control of a VF at all.  In other words this
> concept really badly breaks NVMe abstractions.

Yeah, I think the spec effort is going to be interesting for sure.

From a pure Linux and implementation perspective a decision must be
made early on how to label the DMAs for kernel/qemu vs VM controlled
items at the PCI TLP level.

> controlled functions (which could very well be, and in some designs
> are, additional PFs and not VFs) by controlling function.  

In principle PF vs VF doesn't matter much - the question is really TLP
labeling. If the spec says RID A is the controlling RID and RID B is
the guest RID, then it doesn't matter if they have a PF/VF
relationship or PF/PF relationship.

We have locking issues in Linux SW connecting different SW drivers for
things that are not a PF/VF relationship, but perhaps that can be
solved.

Using VF RID / VF PASID is appealing at first glance, but there is
list of PCI emulation details that have to be worked out for that to
be good. eg what do you do with guest triggered FLR? Or guest
triggered memory disable? How do you handle PCIe AER? Also lack of
PASID support in CPUs is problematic.

Lots of trade offs..

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 15:01                   ` Christoph Hellwig
@ 2022-12-06 15:28                     ` Jason Gunthorpe
  2022-12-06 15:35                       ` Christoph Hellwig
  2022-12-11 12:05                     ` Max Gurtovoy
  1 sibling, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 15:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 04:01:31PM +0100, Christoph Hellwig wrote:

> So this isn't really about a VF live cycle, but how to manage life
> migration, especially on the receive / restore side.  And restoring
> the entire controller state is extremely invasive and can't be done
> on a controller that is in any classic form live.  In fact a lot
> of the state is subsystem-wide, so without some kind of virtualization
> of the subsystem it is impossible to actually restore the state.

I cannot speak to nvme, but for mlx5 the VF is laregly a contained
unit so we just replace the whole thing.

From the PF there is some observability, eg the VF's MAC address is
visible and a few other things. So the PF has to re-synchronize after
the migration to get those things aligned.

> To cycle back to the hardware that is posted here, I'm really confused
> how it actually has any chance to work and no one has even tried
> to explain how it is supposed to work.

I'm interested as well, my mental model goes as far as mlx5 and
hisillicon, so if nvme prevents the VFs from being contained units, it
is a really big deviation from VFIO's migration design..

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 15:28                     ` Jason Gunthorpe
@ 2022-12-06 15:35                       ` Christoph Hellwig
  2022-12-06 18:00                         ` Dong, Eddie
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 15:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 11:28:12AM -0400, Jason Gunthorpe wrote:
> I'm interested as well, my mental model goes as far as mlx5 and
> hisillicon, so if nvme prevents the VFs from being contained units, it
> is a really big deviation from VFIO's migration design..

In NVMe the controller (which maps to a PCIe physical or virtual
function) is unfortunately not very self contained.  A lot of
state is subsystem-wide, where the subsystem is, roughly speaking,
the container for all controllers that shared storage.  That is
the right thing to do for say dual ported SSDs that are used for
clustering or multi-pathing, for tentant isolation is it about
as wrong as it gets.

There is nothing in the NVMe spec that prohibits your from
implementing multiple subsystems for multiple functions of a PCIe
device, but if you do that there is absolutely no support in the
spec to manage shared resources or any other interaction between
them.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 15:22         ` Jason Gunthorpe
@ 2022-12-06 15:38           ` Christoph Hellwig
  2022-12-06 15:51             ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 15:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 11:22:33AM -0400, Jason Gunthorpe wrote:
> > controlled functions (which could very well be, and in some designs
> > are, additional PFs and not VFs) by controlling function.  
> 
> In principle PF vs VF doesn't matter much - the question is really TLP
> labeling. If the spec says RID A is the controlling RID and RID B is
> the guest RID, then it doesn't matter if they have a PF/VF
> relationship or PF/PF relationship.

Yes.  Or in fact if you use PASIDs inside a single function.

> We have locking issues in Linux SW connecting different SW drivers for
> things that are not a PF/VF relationship, but perhaps that can be
> solved.

And I think the only reasonable answer is that the entire workflow
must be 100% managed from the controlling function, and the controlled
function is just around for a ride, with the controlling function
enabling/disabling it as needed without ever interacting with software
that directly deals with the controlled function.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 15:38           ` Christoph Hellwig
@ 2022-12-06 15:51             ` Jason Gunthorpe
  2022-12-06 16:55               ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 15:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 04:38:11PM +0100, Christoph Hellwig wrote:

> > We have locking issues in Linux SW connecting different SW drivers for
> > things that are not a PF/VF relationship, but perhaps that can be
> > solved.
> 
> And I think the only reasonable answer is that the entire workflow
> must be 100% managed from the controlling function, and the controlled
> function is just around for a ride, with the controlling function
> enabling/disabling it as needed without ever interacting with software
> that directly deals with the controlled function.

That is a big deviation from where VFIO is right now, the controlled
function is the one with the VFIO driver, it should be the one that
drives the migration uAPI components.

Adding another uAPI that can manipulate the same VFIO device from some
unrelated chardev feels wrong.

There are certain things that need to be co-ordinated for eveything to
work. Like you can't suspend the VFIO device unless you promise to
also stop MMIO operations. Stuff like FLR interfers with the migration
operation and has to be co-ordinated. Some migration operation
failures, like load failure, have to be healed through FLR.

It really does not want to be two different uAPIs even if that is
convenient for the kernel.

I'd much rather try to fix the problems PASID brings that try to make
this work :\

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 15:51             ` Jason Gunthorpe
@ 2022-12-06 16:55               ` Christoph Hellwig
  2022-12-06 19:15                 ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-06 16:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
> That is a big deviation from where VFIO is right now, the controlled
> function is the one with the VFIO driver, it should be the one that
> drives the migration uAPI components.

Well, that is one way to see it, but I think the more natural
way to deal with it is to drive everyting from the controlling
function, because that is by definition much more in control.

More importantly any sane design will have easy ways to list and
manipulate all the controlled functions from the controlling
functions, while getting from the controlled function to the
controlling one is extremely awkward, as anything that can be
used for that is by definition and information leak.  It seems
mlx5 just gets away with that by saying controlled functions are
always VFs, and the controlling function is a PF, but that will
break down very easily, especially once you want to nest the
controlling scheme (and yes, I'm not making this up, as the
nesting scheme is being proposed for nvme privately).

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 15:35                       ` Christoph Hellwig
@ 2022-12-06 18:00                         ` Dong, Eddie
  2022-12-12  7:57                           ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Dong, Eddie @ 2022-12-06 18:00 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: Rao, Lei, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, Tian, Kevin, mjrosato,
	linux-kernel, linux-nvme, kvm, Li, Yadong, Liu, Yi L, Wilk,
	Konrad, stephen, Yuan, Hang



> -----Original Message-----
> From: Christoph Hellwig <hch@lst.de>
> Sent: Tuesday, December 6, 2022 7:36 AM
> To: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Christoph Hellwig <hch@lst.de>; Rao, Lei <Lei.Rao@intel.com>;
> kbusch@kernel.org; axboe@fb.com; kch@nvidia.com; sagi@grimberg.me;
> alex.williamson@redhat.com; cohuck@redhat.com; yishaih@nvidia.com;
> shameerali.kolothum.thodi@huawei.com; Tian, Kevin <kevin.tian@intel.com>;
> mjrosato@linux.ibm.com; linux-kernel@vger.kernel.org; linux-
> nvme@lists.infradead.org; kvm@vger.kernel.org; Dong, Eddie
> <eddie.dong@intel.com>; Li, Yadong <yadong.li@intel.com>; Liu, Yi L
> <yi.l.liu@intel.com>; Wilk, Konrad <konrad.wilk@oracle.com>;
> stephen@eideticom.com; Yuan, Hang <hang.yuan@intel.com>
> Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
> 
> On Tue, Dec 06, 2022 at 11:28:12AM -0400, Jason Gunthorpe wrote:
> > I'm interested as well, my mental model goes as far as mlx5 and
> > hisillicon, so if nvme prevents the VFs from being contained units, it
> > is a really big deviation from VFIO's migration design..
> 
> In NVMe the controller (which maps to a PCIe physical or virtual
> function) is unfortunately not very self contained.  A lot of state is subsystem-
> wide, where the subsystem is, roughly speaking, the container for all
> controllers that shared storage.  That is the right thing to do for say dual
> ported SSDs that are used for clustering or multi-pathing, for tentant isolation
> is it about as wrong as it gets.


NVMe spec is general, but the implementation details (such as internal state) may 
be vendor specific. If the migration happens between 2 identical NVMe devices 
(from same vendor/device w/ same firmware version), migration of 
subsystem-wide state can be naturally covered, right?

> 
> There is nothing in the NVMe spec that prohibits your from implementing
> multiple subsystems for multiple functions of a PCIe device, but if you do that
> there is absolutely no support in the spec to manage shared resources or any
> other interaction between them.

In IPU/DPU area, it seems multiple VFs with SR-IOV is widely adopted.

In VFs, the usage of shared resource can be viewed as implementation specific, 
and load/save state of a VF can rely on the hardware/firmware itself.
Migration of NVMe devices crossing vendor/device is another story: it may
be useful, but brings additional challenges. 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 16:55               ` Christoph Hellwig
@ 2022-12-06 19:15                 ` Jason Gunthorpe
  2022-12-07  2:30                   ` Max Gurtovoy
  2022-12-07  7:54                   ` Christoph Hellwig
  0 siblings, 2 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 19:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 05:55:03PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
> > That is a big deviation from where VFIO is right now, the controlled
> > function is the one with the VFIO driver, it should be the one that
> > drives the migration uAPI components.
> 
> Well, that is one way to see it, but I think the more natural
> way to deal with it is to drive everyting from the controlling
> function, because that is by definition much more in control.

Sure, the controlling function should (and does in mlx5) drive
everything here.

What the kernel is doing is providing the abstraction to link the
controlling function to the VFIO device in a general way.

We don't want to just punt this problem to user space and say 'good
luck finding the right cdev for migration control'. If the kernel
struggles to link them then userspace will not fare better on its own.

Especially, we do not want every VFIO device to have its own crazy way
for userspace to link the controlling/controlled functions
together. This is something the kernel has to abstract away.

So, IMHO, we must assume the kernel is aware of the relationship,
whatever algorithm it uses to become aware.

It just means the issue is doing the necessary cross-subsystem
locking.

That combined with the fact they really are two halfs of the same
thing - operations on the controlling function have to be sequenced
with operations on the VFIO device - makes me prefer the single uAPI.

> More importantly any sane design will have easy ways to list and
> manipulate all the controlled functions from the controlling
> functions, while getting from the controlled function to the
> controlling one is extremely awkward, as anything that can be
> used for that is by definition and information leak.  

We have spend some time looking at this for mlx5. It is a hard
problem. The VFIO driver cannot operate the device, eg it cannot do
MMIO and things, so it is limited to items in the PCI config space to
figure out the device identity.

> It seems mlx5 just gets away with that by saying controlled
> functions are always VFs, and the controlling function is a PF, but
> that will break down very easily, 

Yes, that is part of the current mlx5 model. It is not inherent to the
device design, but the problems with arbitary nesting were hard enough
they were not tackled at this point.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 19:15                 ` Jason Gunthorpe
@ 2022-12-07  2:30                   ` Max Gurtovoy
  2022-12-07  7:58                     ` Christoph Hellwig
  2022-12-07  7:54                   ` Christoph Hellwig
  1 sibling, 1 reply; 65+ messages in thread
From: Max Gurtovoy @ 2022-12-07  2:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan


On 12/6/2022 9:15 PM, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 05:55:03PM +0100, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
>>> That is a big deviation from where VFIO is right now, the controlled
>>> function is the one with the VFIO driver, it should be the one that
>>> drives the migration uAPI components.
>> Well, that is one way to see it, but I think the more natural
>> way to deal with it is to drive everyting from the controlling
>> function, because that is by definition much more in control.
> Sure, the controlling function should (and does in mlx5) drive
> everything here.
>
> What the kernel is doing is providing the abstraction to link the
> controlling function to the VFIO device in a general way.
>
> We don't want to just punt this problem to user space and say 'good
> luck finding the right cdev for migration control'. If the kernel
> struggles to link them then userspace will not fare better on its own.
>
> Especially, we do not want every VFIO device to have its own crazy way
> for userspace to link the controlling/controlled functions
> together. This is something the kernel has to abstract away.
>
> So, IMHO, we must assume the kernel is aware of the relationship,
> whatever algorithm it uses to become aware.
>
> It just means the issue is doing the necessary cross-subsystem
> locking.

Agree. This is not the first time we have cross-subsystem interactions 
in the kernel, right ?

>
> That combined with the fact they really are two halfs of the same
> thing - operations on the controlling function have to be sequenced
> with operations on the VFIO device - makes me prefer the single uAPI.
>
>> More importantly any sane design will have easy ways to list and
>> manipulate all the controlled functions from the controlling
>> functions, while getting from the controlled function to the
>> controlling one is extremely awkward, as anything that can be
>> used for that is by definition and information leak.
> We have spend some time looking at this for mlx5. It is a hard
> problem. The VFIO driver cannot operate the device, eg it cannot do
> MMIO and things, so it is limited to items in the PCI config space to
> figure out the device identity.

Christoph,

I'm not sure how awkward is for migration driver to ask the controlling 
device driver to operate a migration action.

The controlling device driver can expose limited API for that matter.

I don't see why is it wrong to submit some admin commands between 
subsystems - we already have a way to send admin commands from linux cli 
or from nvmet drivers to an NVMe device.

Also the concept of primary controller that control it's secondary 
controllers is already in the SPEC so we can use it. It's not introduced 
in this RFC but we're here to fix it.

In our case the primary controller is the PF and the secondary 
controllers are the VFs.

If we'll think of  a concept of new optional "Migration Management" 
admin commands that will be supported by primary controllers to manage 
the migration process of its secondary controllers - we can at least 
solve the initial step of migrating VFs by its parent PF.

>
>> It seems mlx5 just gets away with that by saying controlled
>> functions are always VFs, and the controlling function is a PF, but
>> that will break down very easily,
> Yes, that is part of the current mlx5 model. It is not inherent to the
> device design, but the problems with arbitary nesting were hard enough
> they were not tackled at this point.

Agree. There are a lot of challenges in the non-nested migration that we 
can just limit the NVMe SPEC to it and extend later on - like we did in 
other features such as copy-Offload.

Most of the infrastructure work for LM was done in the VFIO subsystem in 
the last year so we can now focus on the Architecture aspects of NVMe.

>
> Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-06 19:15                 ` Jason Gunthorpe
  2022-12-07  2:30                   ` Max Gurtovoy
@ 2022-12-07  7:54                   ` Christoph Hellwig
  2022-12-07 10:59                     ` Max Gurtovoy
  2022-12-07 13:34                     ` Jason Gunthorpe
  1 sibling, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-07  7:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> What the kernel is doing is providing the abstraction to link the
> controlling function to the VFIO device in a general way.
> 
> We don't want to just punt this problem to user space and say 'good
> luck finding the right cdev for migration control'. If the kernel
> struggles to link them then userspace will not fare better on its own.

Yes.  But the right interface for that is to issue the userspace
commands for anything that is not normal PCIe function level
to the controlling funtion, and to discover the controlled functions
based on the controlling functions.

In other words:  there should be absolutely no need to have any
special kernel support for the controlled function.  Instead the
controlling function enumerates all the function it controls exports
that to userspace and exposes the functionality to save state from
and restore state to the controlled functions.

> Especially, we do not want every VFIO device to have its own crazy way
> for userspace to link the controlling/controlled functions
> together. This is something the kernel has to abstract away.

Yes.  But the direction must go controlling to controlled, not the
other way around. 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07  2:30                   ` Max Gurtovoy
@ 2022-12-07  7:58                     ` Christoph Hellwig
  2022-12-09  2:11                       ` Tian, Kevin
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-07  7:58 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Christoph Hellwig, Lei Rao, kbusch, axboe, kch,
	sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan

On Wed, Dec 07, 2022 at 04:30:20AM +0200, Max Gurtovoy wrote:
> I'm not sure how awkward is for migration driver to ask the controlling 
> device driver to operate a migration action.

It can't.  That's the whole point.  The controlled function that is
being migrate must be absolutely unaware of that (except for things
like quiescing access or FLRs that could happen anyway), because
otherwise your have a fundamental information leak.

> The controlling device driver can expose limited API for that matter.

No, it can't.  It must be in charge.

> Also the concept of primary controller that control it's secondary 
> controllers is already in the SPEC so we can use it. It's not introduced in 
> this RFC but we're here to fix it.

Yes, it is as I've pointed out multiple times.  But, that relationship
is only visible to the primary controller.  It also doesn't help with
the general problem where the secondary controller must be able to
completely change it's identify and thus the subsystem.

> In our case the primary controller is the PF and the secondary controllers 
> are the VFs.

Yes, that's your case, and probably a very common one.  But also far from
the only one, so there is no way Linux or the specification can rely
on that dumb fact.  Never mind that there are virtualization schemes
(look at the s390 code) where the PF to VF relationship gets lost.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07  7:54                   ` Christoph Hellwig
@ 2022-12-07 10:59                     ` Max Gurtovoy
  2022-12-07 13:46                       ` Christoph Hellwig
  2022-12-07 13:34                     ` Jason Gunthorpe
  1 sibling, 1 reply; 65+ messages in thread
From: Max Gurtovoy @ 2022-12-07 10:59 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan


On 12/7/2022 9:54 AM, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
>> What the kernel is doing is providing the abstraction to link the
>> controlling function to the VFIO device in a general way.
>>
>> We don't want to just punt this problem to user space and say 'good
>> luck finding the right cdev for migration control'. If the kernel
>> struggles to link them then userspace will not fare better on its own.
> Yes.  But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling funtion, and to discover the controlled functions
> based on the controlling functions.
>
> In other words:  there should be absolutely no need to have any
> special kernel support for the controlled function.  Instead the
> controlling function enumerates all the function it controls exports
> that to userspace and exposes the functionality to save state from
> and restore state to the controlled functions.

Why is it preferred that the migration SW will talk directly to the PF 
and not via VFIO interface ?

It's just an implementation detail.

I feel like it's even sounds more reasonable to have a common API like 
we have today to save_state/resume_state/quiesce_device/freeze_device 
and each device implementation will translate this functionality to its 
own SPEC.

If I understand your direction is to have QEMU code to talk to 
nvmecli/new_mlx5cli/my_device_cli to do that and I'm not sure it's needed.

The controlled device is not aware of any of the migration process. Only 
the migration SW, system admin and controlling device.

I see 2 orthogonal discussions here: NVMe standardization for LM and 
Linux implementation for LM.

For the NVMe standardization: I think we all agree, in high level, that 
primary controller manages the LM of the secondary controllers. Primary 
controller can list the secondary controllers. Primary controller expose 
APIs using its admin_queue to manage LM process of its secondary 
controllers. LM Capabilities will be exposed using identify_ctrl admin 
cmd of the primary controller.

For the Linux implementation: the direction we started last year is to 
have vendor specific (mlx5/hisi/..) or protocol specific 
(nvme/virtio/..) vfio drivers. We built an infrastructure to do that by 
dividing the vfio_pci driver to vfio_pci and vfio_pci_core and updated 
uAPIs as well to support the P2P case for live migration. Dirty page 
tracking is also progressing. More work is still to be done to improve 
this infrastructure for sure.
I hope that all the above efforts are going to be used also with NVMe LM 
implementation unless there is something NVMe specific in the way of 
migrating PCI functions that I can't see now.
If there is something that is NVMe specific for LM then the migration SW 
and QEMU will need to be aware of that, and in this awareness we lose 
the benefit of generic VFIO interface.

>
>> Especially, we do not want every VFIO device to have its own crazy way
>> for userspace to link the controlling/controlled functions
>> together. This is something the kernel has to abstract away.
> Yes.  But the direction must go controlling to controlled, not the
> other way around.

So in the source:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable 
(controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start migration process.


And in the destination:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migration resume to nvme1, nvme2, nvme3 - now these VFs are 
resumable (controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start migration resume process.


>   

>   

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07  7:54                   ` Christoph Hellwig
  2022-12-07 10:59                     ` Max Gurtovoy
@ 2022-12-07 13:34                     ` Jason Gunthorpe
  2022-12-07 13:52                       ` Christoph Hellwig
  1 sibling, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 13:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 08:54:15AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> > What the kernel is doing is providing the abstraction to link the
> > controlling function to the VFIO device in a general way.
> > 
> > We don't want to just punt this problem to user space and say 'good
> > luck finding the right cdev for migration control'. If the kernel
> > struggles to link them then userspace will not fare better on its own.
> 
> Yes.  But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling funtion, and to discover the controlled functions
> based on the controlling functions.

I don't think we should mix up how the HW works, what PCI function the
commands are executed at, with how the uAPI is designed.

The VFIO design assumes that the "vfio migration driver" will talk to
both functions under the hood, and I don't see a fundamental problem
with this beyond it being awkward with the driver core.

It is done this way deliberately. When we did the work on it we found
there are problems with some device models, like when you suspend them
and then wrongly MMIO touch them they can trigger AER and maybe even a
machine check crash. One of the roles of the VFIO driver is to fix
these HW bugs and make the uAPI safe. Eg by revoking mmaps or
whatever.

Even more importantly, we do not want migration to ever be operated
unless VFIO is in control of the device. In general, migration resume
basically allows userspace to command the device to do effectively any
DMA. This is a kernel security problem if the device is not
constrained by a user iommu_domain - for security we must not allow
userspace to resume a VF that is under kernel control and potentially
linked to an passthrough iommu_domain.

VFIO provides the security model to make all of this safe - the two
concepts cannot become disconnected. Even if we create a new migration
uAPI it just means that the nvme driver has to be awkwardly aware of
VFIO VF drivers and interlock with their state, and the uAPI is
useless unless you already have a VFIO FD open.

Even the basic assumption that there would be a controlling/controlled
relationship is not universally true. The mdev type drivers, and 
SIOV-like devices are unlikely to have that. Once you can use PASID
the reasons to split things at the HW level go away, and a VF could
certainly self-migrate.

VFIO's goal is to abstract all of the above stuff. You get one char
device that inherently provides the security guarentees required to
operate migration. It provides all the necessary hooks to fix up HW
issues, which so far every merged device has:

 - mlx5 has a weird issue where FLR on the VF resets the migration
   context, we fix that in the VFIO driver (in mlx5 even though the
   commands are issued via the PF they are logically executed inside
   the VF context)

 - hi-silicon has security problems because it doesn't have
   controlling/controlled, so it needs to carve out BAR MMIO maps and
   other stuff

So while I agree that, in principle, migration and the VFIO VF are
seperate concerns our broken reality makes them linked.

Even the idea that started this thread - that PF/PF could be a problem
- seems to have been explored by the Pensando RFC which is using the
aux devices to connect arbitrary PF/PF for their model.

The logical model we have been using on complex multi-function devices
(like every DPU driver) has been to split the driver into subsystem
local code and thread all the pieces together with aux devices.

The model has a PCI driver that operates the lowest level of the
device, eg the 'admin queue' and then aux devices create
subsystem-local drivers (netdev, rdma, vdpa, iscsi, vfio, etc, etc)
that ride on the common API exported by the PCI driver.

So, when you see both Intel and Pensando proposing this kind of
layered model for NVMe where migration is subsystem-local to VFIO, I
think this is where the inspiration is coming from. Their native DPU
drivers already work this way.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 10:59                     ` Max Gurtovoy
@ 2022-12-07 13:46                       ` Christoph Hellwig
  2022-12-07 14:50                         ` Max Gurtovoy
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-07 13:46 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Christoph Hellwig, Jason Gunthorpe, Lei Rao, kbusch, axboe, kch,
	sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan

On Wed, Dec 07, 2022 at 12:59:00PM +0200, Max Gurtovoy wrote:
> Why is it preferred that the migration SW will talk directly to the PF and 
> not via VFIO interface ?

It should never talk directly to any hardware, but through a kernel
interface, and that's probably vfio.  But that interface needs to
centered around the controlling function for all the reasons I've
written down multiple times now.

> It's just an implementation detail.

No, it's not.  While you could come up with awkward ways to map how
the hardware interface must work to a completely contrary kernel
interface that's just going to create the need for lots of boilerplate
code _and_ confuses users.  The function that is beeing migrated can
fundamentally not be in control of itself.  Any interface that pretends
it is broken and a long term nightmare for users and implementers.

> I feel like it's even sounds more reasonable to have a common API like we 
> have today to save_state/resume_state/quiesce_device/freeze_device and each 
> device implementation will translate this functionality to its own SPEC.

Absolutely.

> If I understand your direction is to have QEMU code to talk to 
> nvmecli/new_mlx5cli/my_device_cli to do that and I'm not sure it's needed.

No.

> The controlled device is not aware of any of the migration process. Only 
> the migration SW, system admin and controlling device.

Exactly.

> So in the source:
>
> 1. We enable SRIOV on the NVMe driver

Again.  Nothing in live migration is tied to SR-IOV at all.  SR-IOV
is just one way to get multiple functions.

> 2. We list all the secondary controllers: nvme1, nvme2, nvme3
>
> 3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable 
> (controlling to controlled).
>
> 4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver
>
> 5. We pass these functions to VM

And you need to pass the controlling function (or rather a handle for
it), because there is absolutely no sane way to discover that from
the controlled function as it can't have that information by the
fact that it is beeing passed to unprivilged VMs.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 13:34                     ` Jason Gunthorpe
@ 2022-12-07 13:52                       ` Christoph Hellwig
  2022-12-07 15:07                         ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-07 13:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> The VFIO design assumes that the "vfio migration driver" will talk to
> both functions under the hood, and I don't see a fundamental problem
> with this beyond it being awkward with the driver core.

And while that is a fine concept per see, the current incarnation of
that is fundamentally broken is it centered around the controlled
VM.  Which really can't work.

> Even the basic assumption that there would be a controlling/controlled
> relationship is not universally true. The mdev type drivers, and 
> SIOV-like devices are unlikely to have that. Once you can use PASID
> the reasons to split things at the HW level go away, and a VF could
> certainly self-migrate.

Even then you need a controlling and a controlled entity.  The
controlling entity even in SIOV remains a PCIe function.  The
controlled entity might just be a bunch of hardware resoures and
a PASID.  Making it important again that all migration is driven
by the controlling entity.

Also the whole concept that only VFIO can do live migration is
a little bogus.  With checkpoint and restart it absolutely
does make sense to live migrate a container, and with that
the hardware interface (e.g. nvme controller) assigned to it.

> So, when you see both Intel and Pensando proposing this kind of
> layered model for NVMe where migration is subsystem-local to VFIO, I
> think this is where the inspiration is coming from. Their native DPU
> drivers already work this way.

Maybe they should have talked to someone not high on their own
supply before designing this.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 13:46                       ` Christoph Hellwig
@ 2022-12-07 14:50                         ` Max Gurtovoy
  2022-12-07 16:35                           ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Max Gurtovoy @ 2022-12-07 14:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan


On 12/7/2022 3:46 PM, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 12:59:00PM +0200, Max Gurtovoy wrote:
>> Why is it preferred that the migration SW will talk directly to the PF and
>> not via VFIO interface ?
> It should never talk directly to any hardware, but through a kernel
> interface, and that's probably vfio.  But that interface needs to
> centered around the controlling function for all the reasons I've
> written down multiple times now.
>
>> It's just an implementation detail.
> No, it's not.  While you could come up with awkward ways to map how
> the hardware interface must work to a completely contrary kernel
> interface that's just going to create the need for lots of boilerplate
> code _and_ confuses users.  The function that is beeing migrated can
> fundamentally not be in control of itself.  Any interface that pretends
> it is broken and a long term nightmare for users and implementers.

We're defining the SPEC and interfaces now :)

Bellow is some possible direction I can think of.

>> I feel like it's even sounds more reasonable to have a common API like we
>> have today to save_state/resume_state/quiesce_device/freeze_device and each
>> device implementation will translate this functionality to its own SPEC.
> Absolutely.
>
>> If I understand your direction is to have QEMU code to talk to
>> nvmecli/new_mlx5cli/my_device_cli to do that and I'm not sure it's needed.
> No.
great.
>
>> The controlled device is not aware of any of the migration process. Only
>> the migration SW, system admin and controlling device.
> Exactly.
>
>> So in the source:
>>
>> 1. We enable SRIOV on the NVMe driver
> Again.  Nothing in live migration is tied to SR-IOV at all.  SR-IOV
> is just one way to get multiple functions.

Sure.

It's just an example. It can be some mdev.

>
>> 2. We list all the secondary controllers: nvme1, nvme2, nvme3
>>
>> 3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable
>> (controlling to controlled).
>>
>> 4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver
>>
>> 5. We pass these functions to VM
> And you need to pass the controlling function (or rather a handle for
> it), because there is absolutely no sane way to discover that from
> the controlled function as it can't have that information by the
> fact that it is beeing passed to unprivilged VMs.

Just thinking out loud:

When we perform step #3 we are narrowing it's scope and maybe some caps 
that you're concerned of. After this setting, the controlled function is 
in LM mode (we should define what does that mean in order to be able to 
migrate it correctly) and the controlling function is the migration 
master of it. Both can be aware of that. The only one that can master 
the controlled function is the controlling function in LM mode. Thus, it 
will be easy to keep that handle inside the kernel for VFs and for MDEVs 
as well.
Although I'm not against passing this handle to migration SW somehow in 
the command line of the QEMU but I still can't completely agree it's 
necessary.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 13:52                       ` Christoph Hellwig
@ 2022-12-07 15:07                         ` Jason Gunthorpe
  2022-12-07 16:38                           ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 15:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 02:52:03PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> > The VFIO design assumes that the "vfio migration driver" will talk to
> > both functions under the hood, and I don't see a fundamental problem
> > with this beyond it being awkward with the driver core.
> 
> And while that is a fine concept per see, the current incarnation of
> that is fundamentally broken is it centered around the controlled
> VM.  Which really can't work.

I don't see why you keep saying this. It is centered around the struct
vfio_device object in the kernel, which is definately NOT the VM.

The struct vfio_device is the handle for the hypervisor to control
the physical assigned device - and it is the hypervisor that controls
the migration.

We do not need the hypervisor userspace to have a handle to the hidden
controlling function. It provides no additional functionality,
security or insight to what qemu needs to do. Keeping that
relationship abstracted inside the kernel is a reasonable choice and
is not "fundamentally broken".

> > Even the basic assumption that there would be a controlling/controlled
> > relationship is not universally true. The mdev type drivers, and 
> > SIOV-like devices are unlikely to have that. Once you can use PASID
> > the reasons to split things at the HW level go away, and a VF could
> > certainly self-migrate.
> 
> Even then you need a controlling and a controlled entity.  The
> controlling entity even in SIOV remains a PCIe function.  The
> controlled entity might just be a bunch of hardware resoures and
> a PASID.  Making it important again that all migration is driven
> by the controlling entity.

If they are the same driver implementing vfio_device you may be able
to claim they conceptually exist, but it is pretty artificial to draw
this kind of distinction inside a single driver.

> Also the whole concept that only VFIO can do live migration is
> a little bogus.  With checkpoint and restart it absolutely
> does make sense to live migrate a container, and with that
> the hardware interface (e.g. nvme controller) assigned to it.

I agree people may want to do this, but it is very unclear how SRIOV
live migration can help do this.

SRIOV live migration is all about not disturbing the kernel driver,
assuming it is the same kernel driver on both sides. If you have two
different kernel's there is nothing worth migrating. There isn't even
an assurance the dma API will have IOMMU mapped the same objects to
the same IOVAs. eg so you have re-establish your admin queue, IO
queues, etc after migration anyhow.

Let alone how to solve the security problems of allow userspace to
load arbitary FW blobs into a device with potentially insecure DMA
access..

At that point it isn't really the same kind of migration.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 14:50                         ` Max Gurtovoy
@ 2022-12-07 16:35                           ` Christoph Hellwig
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-07 16:35 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Christoph Hellwig, Jason Gunthorpe, Lei Rao, kbusch, axboe, kch,
	sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan

On Wed, Dec 07, 2022 at 04:50:00PM +0200, Max Gurtovoy wrote:
> When we perform step #3 we are narrowing it's scope and maybe some caps 
> that you're concerned of. After this setting, the controlled function is in 
> LM mode (we should define what does that mean in order to be able to 
> migrate it correctly) and the controlling function is the migration master 
> of it. Both can be aware of that. The only one that can master the 
> controlled function is the controlling function in LM mode. Thus, it will 
> be easy to keep that handle inside the kernel for VFs and for MDEVs as 
> well.

Maybe.  So you'd introduce a kernel linkage that both side would have
to be part of?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 15:07                         ` Jason Gunthorpe
@ 2022-12-07 16:38                           ` Christoph Hellwig
  2022-12-07 17:31                             ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-07 16:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 11:07:11AM -0400, Jason Gunthorpe wrote:
> > And while that is a fine concept per see, the current incarnation of
> > that is fundamentally broken is it centered around the controlled
> > VM.  Which really can't work.
> 
> I don't see why you keep saying this. It is centered around the struct
> vfio_device object in the kernel, which is definately NOT the VM.

Sorry, I meant VF.  Your continued using of SR-IOV teminology now keeps
confusing my mind so much that I start mistyping things.

> > Even then you need a controlling and a controlled entity.  The
> > controlling entity even in SIOV remains a PCIe function.  The
> > controlled entity might just be a bunch of hardware resoures and
> > a PASID.  Making it important again that all migration is driven
> > by the controlling entity.
> 
> If they are the same driver implementing vfio_device you may be able
> to claim they conceptually exist, but it is pretty artificial to draw
> this kind of distinction inside a single driver.

How are they in this reply?  I can't parse how this even relates to
what I wrote.

> > Also the whole concept that only VFIO can do live migration is
> > a little bogus.  With checkpoint and restart it absolutely
> > does make sense to live migrate a container, and with that
> > the hardware interface (e.g. nvme controller) assigned to it.
> 
> I agree people may want to do this, but it is very unclear how SRIOV
> live migration can help do this.

SRIOV live migration doesn't, because honestly there is no such
thing as "SRIOV" live migration to start with.

> Let alone how to solve the security problems of allow userspace to
> load arbitary FW blobs into a device with potentially insecure DMA
> access..

Any time you assign a PCI device to userspace you might get into that.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 16:38                           ` Christoph Hellwig
@ 2022-12-07 17:31                             ` Jason Gunthorpe
  2022-12-07 18:33                               ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 17:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 05:38:57PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 11:07:11AM -0400, Jason Gunthorpe wrote:
> > > And while that is a fine concept per see, the current incarnation of
> > > that is fundamentally broken is it centered around the controlled
> > > VM.  Which really can't work.
> > 
> > I don't see why you keep saying this. It is centered around the struct
> > vfio_device object in the kernel, which is definately NOT the VM.
> 
> Sorry, I meant VF.  Your continued using of SR-IOV teminology now keeps
> confusing my mind so much that I start mistyping things.

Well, what words do you want to use?

Regardless of VF/VM, it doesn't matter - my point is that the
vfio_device is the hypervisor control for *whatever* is under the
vfio_device and it is not desirable to break it up along arbitrary HW
boundaries.

I've given lots of reasons why not to do this now.

I strongly suspect it can work technically - as ugly as it is Pensando
shows an approach.

So I don't think I've learned anything more about your objection.

"fundamentally broken" doesn't help

It is a major point, because if we are going to have to rip apart all
the uAPI we built here and are completing the qemu work for, we need
to come to an understanding very soon.

And given how difficult it has been to get to this point, I want a
*really* good reason why we have to start again, and rewrite all the
drivers that exist and are coming.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 17:31                             ` Jason Gunthorpe
@ 2022-12-07 18:33                               ` Christoph Hellwig
  2022-12-07 20:08                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-07 18:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
> > Sorry, I meant VF.  Your continued using of SR-IOV teminology now keeps
> > confusing my mind so much that I start mistyping things.
> 
> Well, what words do you want to use?

The same I've used through this whole thread:  controlling and
controlled function.

> So I don't think I've learned anything more about your objection.
> 
> "fundamentally broken" doesn't help

The objection is that:

 - in hardware fundamentally only the controlling funtion can
   control live migration features on the controlled function,
   because the controlled function is assigned to a VM which has
   control over it.
 - for the same reason there is no portable way to even find
   the controlling function from a controlled function, unless
   you want to equate PF = controlling and VF = controlled,
   and even that breaks down in some corner cases
 - if you want to control live migration from the controlled
   VM you need a new vfio subdriver for a function that has
   absolutely no new functionality itself related to live
   migration (quirks for bugs excepted).

So by this architecture you build a convoluted mess where you
need tons of vfio subdrivers that mostly just talk to the
driver for the controlling function, which they can't even
sanely discover.  That's what I keep calling fundamentally
broken.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 18:33                               ` Christoph Hellwig
@ 2022-12-07 20:08                                 ` Jason Gunthorpe
  2022-12-09  2:50                                   ` Tian, Kevin
                                                     ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 20:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 07:33:33PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
> > > Sorry, I meant VF.  Your continued using of SR-IOV teminology now keeps
> > > confusing my mind so much that I start mistyping things.
> > 
> > Well, what words do you want to use?
> 
> The same I've used through this whole thread:  controlling and
> controlled function.
> 
> > So I don't think I've learned anything more about your objection.
> > 
> > "fundamentally broken" doesn't help
> 
> The objection is that:
> 
>  - in hardware fundamentally only the controlling funtion can
>    control live migration features on the controlled function,
>    because the controlled function is assigned to a VM which has
>    control over it.

Yes

However hisilicon managed to do their implementation without this, or
rather you could say their "controlling function" is a single MMIO BAR
page in their PCI VF and their "controlled function" is the rest of
the PCI VF.

>  - for the same reason there is no portable way to even find
>    the controlling function from a controlled function, unless
>    you want to equate PF = controlling and VF = controlled,
>    and even that breaks down in some corner cases

As you say, the kernel must know the relationship between
controlling->controlled. Nothing else is sane.

If the kernel knows this information then we can find a way for the
vfio_device to have pointers to both controlling and controlled
objects. I have a suggestion below.

>  - if you want to control live migration from the controlled
>    VM you need a new vfio subdriver for a function that has
>    absolutely no new functionality itself related to live
>    migration (quirks for bugs excepted).

I see it differently, the VFIO driver *is* the live migration
driver. Look at all the drivers that have come through and they are
99% live migration code. They have, IMHO, properly split the live
migration concern out of their controlling/PF driver and placed it in
the "VFIO live migration driver".

We've done a pretty good job of allowing the VFIO live migration
driver to pretty much just focus on live migration stuff and delegate
the VFIO part to library code.

Excepting quirks and bugs sounds nice, except we actually can't ignore
them. Having all the VFIO capabilities is exactly how we are fixing
the quirks and bugs in the first place, and I don't see your vision
how we can continue to do that if we split all the live migration code
into yet another subsystem.

For instance how do I trap FLR like mlx5 must do if the
drivers/live_migration code cannot intercept the FLR VFIO ioctl?

How do I mangle and share the BAR like hisilicon does?

Which is really why this is in VFIO in the first place. It actually is
coupled in practice, if not in theory.

> So by this architecture you build a convoluted mess where you need
> tons of vfio subdrivers that mostly just talk to the driver for the
> controlling function, which they can't even sanely discover.  That's
> what I keep calling fundamentally broken.

The VFIO live migration drivers will look basically the same if you
put them under drivers/live_migration. This cannot be considered a
"convoluted mess" as splitting things by subsystem is best-practice,
AFAIK.

If we accept that drivers/vfio can be the "live migration subsystem"
then lets go all the way and have the controlling driver to call
vfio_device_group_register() to create the VFIO char device for the
controlled function.

This solves the "sanely discover" problem because of course the
controlling function driver knows what the controlled function is and
it can acquire both functions before it calls
vfio_device_group_register().

This is actually what I want to do anyhow for SIOV-like functions and
VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
really dislike that our current SRIOV model in Linux forces the VF to
instantly exist without a chance for the controlling driver to
provision it.

We have some challenges on how to do this in the kernel, but I think
we can overcome them. VFIO is ready for this thanks to all the
restructuring work we already did.

I'd really like to get away from VFIO having to do all this crazy
sysfs crap to activate its driver. I think there is a lot of appeal to
having, say, a nvmecli command that just commands the controlling
driver to provision a function, enable live migration, configure it
and then make it visible via VFIO. The same API regardless if the
underlying controlled function technology is PF/VF/SIOV.

At least we have been very interested in doing that for networking
devices.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 13:09       ` Christoph Hellwig
  2022-12-06 13:52         ` Jason Gunthorpe
@ 2022-12-09  2:05         ` Tian, Kevin
  2022-12-09 16:53           ` Li, Yadong
  1 sibling, 1 reply; 65+ messages in thread
From: Tian, Kevin @ 2022-12-09  2:05 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: Rao, Lei, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, mjrosato, linux-kernel,
	linux-nvme, kvm, Dong, Eddie, Li, Yadong, Liu, Yi L, Wilk,
	Konrad, stephen, Yuan, Hang

> From: Christoph Hellwig <hch@lst.de>
> Sent: Tuesday, December 6, 2022 9:09 PM
> 
> 
> > I don't think we know enough about this space at the moment to fix a
> > specification to one path or the other, so I hope the TPAR will settle
> > on something that can support both models in SW and people can try
> > things out.
> 
> I've not seen anyone from Intel actually contributing to the live
> migration TPAR, which is almost two month old by now.

Not a NVMe guy but obviously Intel should join. I have forwarded this
to internal folks to check the actual status.

And I also asked them to prepare a document explaining how this
opaque cmd actually works to address your concerns.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07  7:58                     ` Christoph Hellwig
@ 2022-12-09  2:11                       ` Tian, Kevin
  2022-12-12  7:41                         ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Tian, Kevin @ 2022-12-09  2:11 UTC (permalink / raw)
  To: Christoph Hellwig, Max Gurtovoy
  Cc: Jason Gunthorpe, Rao, Lei, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	mjrosato, linux-kernel, linux-nvme, kvm, Dong, Eddie, Li, Yadong,
	Liu, Yi L, Wilk, Konrad, stephen, Yuan, Hang

> From: Christoph Hellwig <hch@lst.de>
> Sent: Wednesday, December 7, 2022 3:59 PM
> 
> On Wed, Dec 07, 2022 at 04:30:20AM +0200, Max Gurtovoy wrote:
> > I'm not sure how awkward is for migration driver to ask the controlling
> > device driver to operate a migration action.
> 
> It can't.  That's the whole point.  The controlled function that is
> being migrate must be absolutely unaware of that (except for things
> like quiescing access or FLRs that could happen anyway), because
> otherwise your have a fundamental information leak.
> 

Can you elaborate which information is leaked?

No matter who provides the uAPI (controlling or controlled), end of
the day it's the same cmd invoked on the controlling device to save/
restore the state of the controlled device itself.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 20:08                                 ` Jason Gunthorpe
@ 2022-12-09  2:50                                   ` Tian, Kevin
  2022-12-09 18:56                                     ` Dong, Eddie
  2022-12-11 11:39                                   ` Max Gurtovoy
  2022-12-12  7:50                                   ` Christoph Hellwig
  2 siblings, 1 reply; 65+ messages in thread
From: Tian, Kevin @ 2022-12-09  2:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Rao, Lei, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, mjrosato, linux-kernel,
	linux-nvme, kvm, Dong, Eddie, Li, Yadong, Liu, Yi L, Wilk,
	Konrad, stephen, Yuan, Hang

> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Thursday, December 8, 2022 4:08 AM
> 
> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
> 
> How do I mangle and share the BAR like hisilicon does?
> 
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.
> 

Those are good questions which I also buy in to stay with the
current VFIO framework.

Actually the same controlling vs. controlled design choice also exists
in the hardware side. There are plenty of SR-IOV devices supporting
doorbells for VF (controlled function) to call services on PF (controlling
function) while the doorbell interface is implemented on the VF side.

If following the direction having controlling function to explicitly
provide services then all those doorbells should be deprecated
and instead we want explicit communications between VF driver
and PF driver.

From userspace driver p.o.v. the VFIO uAPI is kind of a device
programming interface. Here we just present everything related
to the controlled device itself (including the state management)
via a centralized portal though in the kernel there might be linkage
out of VFIO to reach the controlling driver. kind of a sw doorbell. 😊

btw just to add more background to this work. Half a year ago Lei
actually did a flavor [1] in the other way.

The controlling function of Intel IPU card also supports a network gRPC 
protocol to manage the state of controlled NVMe function. 

Then that series attempted to introduce an out-of-band migration
framework in Qemu so instead of doing in-line state management
via kernel VFIO uAPI Qemu can turn to an external plugin which
forwards the state cmd via gRPC to the controlling function.

Just that the controlling driver is not in the kernel.

It's dropped as the inline way was preferred.

Thanks
Kevin

[1] https://lore.kernel.org/all/20220524061848.1615706-14-lei.rao@intel.com/T/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-09  2:05         ` Tian, Kevin
@ 2022-12-09 16:53           ` Li, Yadong
  2022-12-12  8:11             ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Li, Yadong @ 2022-12-09 16:53 UTC (permalink / raw)
  To: Tian, Kevin, Christoph Hellwig, Jason Gunthorpe
  Cc: Rao, Lei, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, mjrosato, linux-kernel,
	linux-nvme, kvm, Dong, Eddie, Liu, Yi L, Wilk, Konrad, stephen,
	Yuan, Hang

>From: Tian, Kevin <kevin.tian@intel.com> 

>Not a NVMe guy but obviously Intel should join. I have forwarded this
>to internal folks to check the actual status.
Intel team has been actively participating in the TPAR definition over last month.
The key leadership in NVMe WG, Peter and Nick (from Intel), are on top of this TPAR,
and help influence the scope of the TPAR to better support IPU/DPUs.  IPU/DPU is a
new type of device that desperately needs a standard based live migration solution.
Soon, you may also see more Intel people to be part of the SW WG for the NVMe live 
migration standardization effort.

>And I also asked them to prepare a document explaining how this
>opaque cmd actually works to address your concerns.
There is nothing secret.  It's the VF states including VF CSR registers, IO QP states, and 
the AdminQ state.  The AdminQ state may also include the pending AER command.
As part of the NVMe live migration standardization effort, it's desirable to standardize
the structure as much as we can, and then have an extension for any device specific
state that may be required to work around specific design HW limitations.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-09  2:50                                   ` Tian, Kevin
@ 2022-12-09 18:56                                     ` Dong, Eddie
  0 siblings, 0 replies; 65+ messages in thread
From: Dong, Eddie @ 2022-12-09 18:56 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Christoph Hellwig
  Cc: Rao, Lei, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, mjrosato, linux-kernel,
	linux-nvme, kvm, Li, Yadong, Liu, Yi L, Wilk, Konrad, stephen,
	Yuan, Hang

> 
> If following the direction having controlling function to explicitly provide
> services then all those doorbells should be deprecated and instead we want
> explicit communications between VF driver and PF driver.

Hardware mechanism of communication (door bell here) can be used to 
support broader scenario when VF driver and PF driver run in 2 different
OS kernels (when the VF is assigned to a VM).

> 
> From userspace driver p.o.v. the VFIO uAPI is kind of a device programming
> interface. Here we just present everything related to the controlled device
> itself (including the state management) via a centralized portal though in the
> kernel there might be linkage out of VFIO to reach the controlling driver. kind
> of a sw doorbell. 😊

Yes.
And sw doorbell is more efficient than hardware doorbell 😊 .

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 20:08                                 ` Jason Gunthorpe
  2022-12-09  2:50                                   ` Tian, Kevin
@ 2022-12-11 11:39                                   ` Max Gurtovoy
  2022-12-12  7:55                                     ` Christoph Hellwig
  2022-12-12  7:50                                   ` Christoph Hellwig
  2 siblings, 1 reply; 65+ messages in thread
From: Max Gurtovoy @ 2022-12-11 11:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan, Oren Duer


On 12/7/2022 10:08 PM, Jason Gunthorpe wrote:
> On Wed, Dec 07, 2022 at 07:33:33PM +0100, Christoph Hellwig wrote:
>> On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
>>>> Sorry, I meant VF.  Your continued using of SR-IOV teminology now keeps
>>>> confusing my mind so much that I start mistyping things.
>>> Well, what words do you want to use?
>> The same I've used through this whole thread:  controlling and
>> controlled function.
>>
>>> So I don't think I've learned anything more about your objection.
>>>
>>> "fundamentally broken" doesn't help
>> The objection is that:
>>
>>   - in hardware fundamentally only the controlling funtion can
>>     control live migration features on the controlled function,
>>     because the controlled function is assigned to a VM which has
>>     control over it.
> Yes
>
> However hisilicon managed to do their implementation without this, or
> rather you could say their "controlling function" is a single MMIO BAR
> page in their PCI VF and their "controlled function" is the rest of
> the PCI VF.
>
>>   - for the same reason there is no portable way to even find
>>     the controlling function from a controlled function, unless
>>     you want to equate PF = controlling and VF = controlled,
>>     and even that breaks down in some corner cases
> As you say, the kernel must know the relationship between
> controlling->controlled. Nothing else is sane.
>
> If the kernel knows this information then we can find a way for the
> vfio_device to have pointers to both controlling and controlled
> objects. I have a suggestion below.
>
>>   - if you want to control live migration from the controlled
>>     VM you need a new vfio subdriver for a function that has
>>     absolutely no new functionality itself related to live
>>     migration (quirks for bugs excepted).
> I see it differently, the VFIO driver *is* the live migration
> driver. Look at all the drivers that have come through and they are
> 99% live migration code. They have, IMHO, properly split the live
> migration concern out of their controlling/PF driver and placed it in
> the "VFIO live migration driver".
>
> We've done a pretty good job of allowing the VFIO live migration
> driver to pretty much just focus on live migration stuff and delegate
> the VFIO part to library code.
>
> Excepting quirks and bugs sounds nice, except we actually can't ignore
> them. Having all the VFIO capabilities is exactly how we are fixing
> the quirks and bugs in the first place, and I don't see your vision
> how we can continue to do that if we split all the live migration code
> into yet another subsystem.
>
> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
>
> How do I mangle and share the BAR like hisilicon does?
>
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.
>
>> So by this architecture you build a convoluted mess where you need
>> tons of vfio subdrivers that mostly just talk to the driver for the
>> controlling function, which they can't even sanely discover.  That's
>> what I keep calling fundamentally broken.
> The VFIO live migration drivers will look basically the same if you
> put them under drivers/live_migration. This cannot be considered a
> "convoluted mess" as splitting things by subsystem is best-practice,
> AFAIK.
>
> If we accept that drivers/vfio can be the "live migration subsystem"
> then lets go all the way and have the controlling driver to call
> vfio_device_group_register() to create the VFIO char device for the
> controlled function.
>
> This solves the "sanely discover" problem because of course the
> controlling function driver knows what the controlled function is and
> it can acquire both functions before it calls
> vfio_device_group_register().
>
> This is actually what I want to do anyhow for SIOV-like functions and
> VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
> really dislike that our current SRIOV model in Linux forces the VF to
> instantly exist without a chance for the controlling driver to
> provision it.
>
> We have some challenges on how to do this in the kernel, but I think
> we can overcome them. VFIO is ready for this thanks to all the
> restructuring work we already did.
>
> I'd really like to get away from VFIO having to do all this crazy
> sysfs crap to activate its driver. I think there is a lot of appeal to
> having, say, a nvmecli command that just commands the controlling
> driver to provision a function, enable live migration, configure it
> and then make it visible via VFIO. The same API regardless if the
> underlying controlled function technology is PF/VF/SIOV.
>
> At least we have been very interested in doing that for networking
> devices.
>
> Jason

Jason/Christoph,

As I mentioned earlier we have 2 orthogonal paths here: implementation 
and SPEC. They are for some reason mixed in this discussion.

I've tried to understand some SPEC related issues that were raised here, 
that if we fix them - the implementation will be possible and all the 
VFIO efforts we did can be re-used.

In high level I think that for the SPEC:

1. Need to define a concept of a "virtual subsystem". A primary 
controller will be able to create a virtual subsystem. Inside this 
subsystem the primary controller will be the master ("the controlling") 
of the migration process. It will also be able to add secondary 
controllers to this virtual subsystem and assign "virtual controller ID" 
to it.
something like:
- nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
--dev_vcid = 1
- nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
--secondary_dev=/dev/nvme2 --secondary_dev_vcid=20

2. Now the primary controller have a list of ctrls inside it's virtual 
subsystem for migration. It has handle to it that doesn't go away after 
binding the controlled function to VFIO.

3. Same virtual subsystem should be created in the destination hypervisor.

4. Now, migration process starts using the VFIO uAPI - we will get to a 
point that VFIO driver of the controlled function needs to ask the 
controlling function to send admin commands to manage the migration process.
     How to do it ? implementation detail. We can set a pointer in 
pci_dev/dev structures or we can ask 
nvme_migration_handle_get(controlled_function) or NVMe can expose API's 
dedicated to migration nvme_state_save(controlled_function).


When creating a virtual subsystem and adding controllers to it, we can 
control any leakage or narrow some functionality that make migration 
impossible. This can be using more admin commands for example.
After the migration process is over, one can remove the secondary 
controller from the virtual subsystem and re-use it.

WDYT ?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 15:01                   ` Christoph Hellwig
  2022-12-06 15:28                     ` Jason Gunthorpe
@ 2022-12-11 12:05                     ` Max Gurtovoy
  2022-12-11 13:21                       ` Rao, Lei
  1 sibling, 1 reply; 65+ messages in thread
From: Max Gurtovoy @ 2022-12-11 12:05 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan, Oren Duer


On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>> use.
> Beward:  The secondary function might as well be a physical function
> as well.  In fact one of the major customers for "smart" multifunction
> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
> symmetric dual ported devices are multi-PF as well).
>
> So this isn't really about a VF live cycle, but how to manage life
> migration, especially on the receive / restore side.  And restoring
> the entire controller state is extremely invasive and can't be done
> on a controller that is in any classic form live.  In fact a lot
> of the state is subsystem-wide, so without some kind of virtualization
> of the subsystem it is impossible to actually restore the state.

ohh, great !

I read this subsystem virtualization proposal of yours after I sent my 
proposal for subsystem virtualization in patch 1/5 thread.
I guess this means that this is the right way to go.
Lets continue brainstorming this idea. I think this can be the way to 
migrate NVMe controllers in a standard way.

>
> To cycle back to the hardware that is posted here, I'm really confused
> how it actually has any chance to work and no one has even tried
> to explain how it is supposed to work.

I guess in vendor specific implementation you can assume some things 
that we are discussing now for making it as a standard.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-11 12:05                     ` Max Gurtovoy
@ 2022-12-11 13:21                       ` Rao, Lei
  2022-12-11 14:51                         ` Max Gurtovoy
  0 siblings, 1 reply; 65+ messages in thread
From: Rao, Lei @ 2022-12-11 13:21 UTC (permalink / raw)
  To: Max Gurtovoy, Christoph Hellwig, Jason Gunthorpe
  Cc: kbusch, axboe, kch, sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan, Oren Duer



On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
> 
> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>> use.
>> Beward:  The secondary function might as well be a physical function
>> as well.  In fact one of the major customers for "smart" multifunction
>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>> symmetric dual ported devices are multi-PF as well).
>>
>> So this isn't really about a VF live cycle, but how to manage life
>> migration, especially on the receive / restore side.  And restoring
>> the entire controller state is extremely invasive and can't be done
>> on a controller that is in any classic form live.  In fact a lot
>> of the state is subsystem-wide, so without some kind of virtualization
>> of the subsystem it is impossible to actually restore the state.
> 
> ohh, great !
> 
> I read this subsystem virtualization proposal of yours after I sent my proposal for subsystem virtualization in patch 1/5 thread.
> I guess this means that this is the right way to go.
> Lets continue brainstorming this idea. I think this can be the way to migrate NVMe controllers in a standard way.
> 
>>
>> To cycle back to the hardware that is posted here, I'm really confused
>> how it actually has any chance to work and no one has even tried
>> to explain how it is supposed to work.
> 
> I guess in vendor specific implementation you can assume some things that we are discussing now for making it as a standard.

Yes, as I wrote in the cover letter, this is a reference implementation to
start a discussion and help drive standardization efforts, but this series
works well for Intel IPU NVMe. As Jason said, there are two use cases:
shared medium and local medium. I think the live migration of the local medium
is complicated due to the large amount of user data that needs to be migrated.
I don't have a good idea to deal with this situation. But for Intel IPU NVMe,
each VF can connect to remote storage via the NVMF protocol to achieve storage
offloading. This is the shared medium. In this case, we don't need to migrate
the user data, which will significantly simplify the work of live migration.
The series tries to solve the problem of live migration of shared medium.
But it still lacks dirty page tracking and P2P support, we are also developing
these features.

About the nvme device state, As described in my document, the VF states include
VF CSR registers, Every IO Queue Pair state, and the AdminQ state. During the
implementation, I found that the device state data is small per VF. So, I decided
to use the admin queue of the Primary controller to send the live migration
commands to save and restore the VF states like MLX5.

Thanks,
Lei

> 
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-11 13:21                       ` Rao, Lei
@ 2022-12-11 14:51                         ` Max Gurtovoy
  2022-12-12  1:20                           ` Rao, Lei
  2022-12-12  8:09                           ` Christoph Hellwig
  0 siblings, 2 replies; 65+ messages in thread
From: Max Gurtovoy @ 2022-12-11 14:51 UTC (permalink / raw)
  To: Rao, Lei, Christoph Hellwig, Jason Gunthorpe
  Cc: kbusch, axboe, kch, sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan, Oren Duer


On 12/11/2022 3:21 PM, Rao, Lei wrote:
>
>
> On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>>
>> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>>> use.
>>> Beward:  The secondary function might as well be a physical function
>>> as well.  In fact one of the major customers for "smart" multifunction
>>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>>> symmetric dual ported devices are multi-PF as well).
>>>
>>> So this isn't really about a VF live cycle, but how to manage life
>>> migration, especially on the receive / restore side.  And restoring
>>> the entire controller state is extremely invasive and can't be done
>>> on a controller that is in any classic form live.  In fact a lot
>>> of the state is subsystem-wide, so without some kind of virtualization
>>> of the subsystem it is impossible to actually restore the state.
>>
>> ohh, great !
>>
>> I read this subsystem virtualization proposal of yours after I sent 
>> my proposal for subsystem virtualization in patch 1/5 thread.
>> I guess this means that this is the right way to go.
>> Lets continue brainstorming this idea. I think this can be the way to 
>> migrate NVMe controllers in a standard way.
>>
>>>
>>> To cycle back to the hardware that is posted here, I'm really confused
>>> how it actually has any chance to work and no one has even tried
>>> to explain how it is supposed to work.
>>
>> I guess in vendor specific implementation you can assume some things 
>> that we are discussing now for making it as a standard.
>
> Yes, as I wrote in the cover letter, this is a reference 
> implementation to
> start a discussion and help drive standardization efforts, but this 
> series
> works well for Intel IPU NVMe. As Jason said, there are two use cases:
> shared medium and local medium. I think the live migration of the 
> local medium
> is complicated due to the large amount of user data that needs to be 
> migrated.
> I don't have a good idea to deal with this situation. But for Intel 
> IPU NVMe,
> each VF can connect to remote storage via the NVMF protocol to achieve 
> storage
> offloading. This is the shared medium. In this case, we don't need to 
> migrate
> the user data, which will significantly simplify the work of live 
> migration.

I don't think that medium migration should be part of the SPEC. We can 
specify it's out of scope.

All the idea of live migration is to have a short downtime and I don't 
think we can guarantee short downtime if we need to copy few terabytes 
throw the networking.
If the media copy is taking few seconds, there is no need to do live 
migration of few milisecs downtime. Just do regular migration of a VM.

>
> The series tries to solve the problem of live migration of shared medium.
> But it still lacks dirty page tracking and P2P support, we are also 
> developing
> these features.
>
> About the nvme device state, As described in my document, the VF 
> states include
> VF CSR registers, Every IO Queue Pair state, and the AdminQ state. 
> During the
> implementation, I found that the device state data is small per VF. 
> So, I decided
> to use the admin queue of the Primary controller to send the live 
> migration
> commands to save and restore the VF states like MLX5.

I think and hope we all agree that the AdminQ of the controlling NVMe 
function will be used to migrate the controlled NVMe function.

which document are you refereeing to ?

>
> Thanks,
> Lei
>
>>
>>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-11 14:51                         ` Max Gurtovoy
@ 2022-12-12  1:20                           ` Rao, Lei
  2022-12-12  8:09                           ` Christoph Hellwig
  1 sibling, 0 replies; 65+ messages in thread
From: Rao, Lei @ 2022-12-12  1:20 UTC (permalink / raw)
  To: Max Gurtovoy, Christoph Hellwig, Jason Gunthorpe
  Cc: kbusch, axboe, kch, sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan, Oren Duer



On 12/11/2022 10:51 PM, Max Gurtovoy wrote:
> 
> On 12/11/2022 3:21 PM, Rao, Lei wrote:
>>
>>
>> On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>>>
>>> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>>>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>>>> use.
>>>> Beward:  The secondary function might as well be a physical function
>>>> as well.  In fact one of the major customers for "smart" multifunction
>>>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>>>> symmetric dual ported devices are multi-PF as well).
>>>>
>>>> So this isn't really about a VF live cycle, but how to manage life
>>>> migration, especially on the receive / restore side.  And restoring
>>>> the entire controller state is extremely invasive and can't be done
>>>> on a controller that is in any classic form live.  In fact a lot
>>>> of the state is subsystem-wide, so without some kind of virtualization
>>>> of the subsystem it is impossible to actually restore the state.
>>>
>>> ohh, great !
>>>
>>> I read this subsystem virtualization proposal of yours after I sent my proposal for subsystem virtualization in patch 1/5 thread.
>>> I guess this means that this is the right way to go.
>>> Lets continue brainstorming this idea. I think this can be the way to migrate NVMe controllers in a standard way.
>>>
>>>>
>>>> To cycle back to the hardware that is posted here, I'm really confused
>>>> how it actually has any chance to work and no one has even tried
>>>> to explain how it is supposed to work.
>>>
>>> I guess in vendor specific implementation you can assume some things that we are discussing now for making it as a standard.
>>
>> Yes, as I wrote in the cover letter, this is a reference implementation to
>> start a discussion and help drive standardization efforts, but this series
>> works well for Intel IPU NVMe. As Jason said, there are two use cases:
>> shared medium and local medium. I think the live migration of the local medium
>> is complicated due to the large amount of user data that needs to be migrated.
>> I don't have a good idea to deal with this situation. But for Intel IPU NVMe,
>> each VF can connect to remote storage via the NVMF protocol to achieve storage
>> offloading. This is the shared medium. In this case, we don't need to migrate
>> the user data, which will significantly simplify the work of live migration.
> 
> I don't think that medium migration should be part of the SPEC. We can specify it's out of scope.
> 
> All the idea of live migration is to have a short downtime and I don't think we can guarantee short downtime if we need to copy few terabytes throw the networking.
> If the media copy is taking few seconds, there is no need to do live migration of few milisecs downtime. Just do regular migration of a
> 
>>
>> The series tries to solve the problem of live migration of shared medium.
>> But it still lacks dirty page tracking and P2P support, we are also developing
>> these features.
>>
>> About the nvme device state, As described in my document, the VF states include
>> VF CSR registers, Every IO Queue Pair state, and the AdminQ state. During the
>> implementation, I found that the device state data is small per VF. So, I decided
>> to use the admin queue of the Primary controller to send the live migration
>> commands to save and restore the VF states like MLX5.
> 
> I think and hope we all agree that the AdminQ of the controlling NVMe function will be used to migrate the controlled NVMe function.

Fully agree.

> 
> which document are you refereeing to ?

The fifth patch includes the definition of these commands and how the firmware handles
these live migration commands. It's the documentation that I referenced.

>>
>>>
>>>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-09  2:11                       ` Tian, Kevin
@ 2022-12-12  7:41                         ` Christoph Hellwig
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-12  7:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Christoph Hellwig, Max Gurtovoy, Jason Gunthorpe, Rao, Lei,
	kbusch, axboe, kch, sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, mjrosato, linux-kernel, linux-nvme,
	kvm, Dong, Eddie, Li, Yadong, Liu, Yi L, Wilk, Konrad, stephen,
	Yuan, Hang

On Fri, Dec 09, 2022 at 02:11:21AM +0000, Tian, Kevin wrote:
> > It can't.  That's the whole point.  The controlled function that is
> > being migrate must be absolutely unaware of that (except for things
> > like quiescing access or FLRs that could happen anyway), because
> > otherwise your have a fundamental information leak.
> > 
> 
> Can you elaborate which information is leaked?

Information about what controllers exist, what namespaces exist, etc.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-07 20:08                                 ` Jason Gunthorpe
  2022-12-09  2:50                                   ` Tian, Kevin
  2022-12-11 11:39                                   ` Max Gurtovoy
@ 2022-12-12  7:50                                   ` Christoph Hellwig
  2022-12-13 14:01                                     ` Jason Gunthorpe
  2 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-12  7:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote:
> However hisilicon managed to do their implementation without this, or
> rather you could say their "controlling function" is a single MMIO BAR
> page in their PCI VF and their "controlled function" is the rest of
> the PCI VF.

Eww.  So you need to carefully filter the BAR and can't actually do
any DMA at all?  I'm not sure that is actually practical, especially
not for something with a lot of state.

> If the kernel knows this information then we can find a way for the
> vfio_device to have pointers to both controlling and controlled
> objects. I have a suggestion below.

So now we need to write a vfio shim for every function even if there
is absolutely nothing special about that function?  Migrating really
is the controlling functions behavior, and writing a new vfio bit
for every controlled thing just does not scale.

> I see it differently, the VFIO driver *is* the live migration
> driver. Look at all the drivers that have come through and they are
> 99% live migration code.

Yes, and that's the problem, because they are associated with the
controlled function, and now we have a communication problem between
that vfio driver binding to the controlled function and the drive
that actually controlls live migration that is associated with the
controlling function.  In other words: you've created a giant mess.

> Excepting quirks and bugs sounds nice, except we actually can't ignore
> them.

I'm not proposing to ignore them.  But they should not be needed most
of the time.

> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
> 
> How do I mangle and share the BAR like hisilicon does?
> 
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.

So you've created a long term userspace API around working around
around buggy and/or misdesigned early designs and now want to force
it down everyones throat?

Can we please take a step back and think about how things should work,
and only then think how to work around weirdo devices that do strange
things as a second step? 

> If we accept that drivers/vfio can be the "live migration subsystem"
> then lets go all the way and have the controlling driver to call
> vfio_device_group_register() to create the VFIO char device for the
> controlled function.

While creating the VFs from the PF driver makes a lot more sense,
remember that vfio is absolutely not the only use case for VFs.
There are plenty use cases where you want to use them with the normal
kernel driver as well.  So the interface to create VFs needs a now
to decide if it should be vfio exported, or use the normal kernel
binding.

> This solves the "sanely discover" problem because of course the
> controlling function driver knows what the controlled function is and
> it can acquire both functions before it calls
> vfio_device_group_register().

Yes.

> This is actually what I want to do anyhow for SIOV-like functions and
> VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
> really dislike that our current SRIOV model in Linux forces the VF to
> instantly exist without a chance for the controlling driver to
> provision it.

For SIOV you have no other choice anyway.  But I agree that it is
the right thing to do for VFIO.  Now the next step is to control
live migration from the controlling function, so that for most sane
devices the controlled device does not need all the pointless
boilerplate of its own vfio driver.

> I'd really like to get away from VFIO having to do all this crazy
> sysfs crap to activate its driver. I think there is a lot of appeal to
> having, say, a nvmecli command that just commands the controlling
> driver to provision a function, enable live migration, configure it
> and then make it visible via VFIO. The same API regardless if the
> underlying controlled function technology is PF/VF/SIOV.

Yes.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-11 11:39                                   ` Max Gurtovoy
@ 2022-12-12  7:55                                     ` Christoph Hellwig
  2022-12-12 14:49                                       ` Max Gurtovoy
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-12  7:55 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Christoph Hellwig, Lei Rao, kbusch, axboe, kch,
	sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan, Oren Duer

On Sun, Dec 11, 2022 at 01:39:37PM +0200, Max Gurtovoy wrote:
> 1. Need to define a concept of a "virtual subsystem". A primary controller 
> will be able to create a virtual subsystem. Inside this subsystem the 
> primary controller will be the master ("the controlling") of the migration 
> process. It will also be able to add secondary controllers to this virtual 
> subsystem and assign "virtual controller ID" to it.
> something like:
> - nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
> --dev_vcid = 1
> - nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
> --secondary_dev=/dev/nvme2 --secondary_dev_vcid=20

Yes. Note that there is a bit more state than just the NQN.  You also
need at least a serial number, and also probably a different vendor
ID (the PCI vendor ID that is also mirror in Identify Controller and
the IEEE OUI), and new unique namespace identifier.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-06 18:00                         ` Dong, Eddie
@ 2022-12-12  7:57                           ` Christoph Hellwig
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-12  7:57 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: Christoph Hellwig, Jason Gunthorpe, Rao, Lei, kbusch, axboe, kch,
	sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, Tian, Kevin, mjrosato, linux-kernel,
	linux-nvme, kvm, Li, Yadong, Liu, Yi L, Wilk, Konrad, stephen,
	Yuan, Hang

On Tue, Dec 06, 2022 at 06:00:27PM +0000, Dong, Eddie wrote:
> NVMe spec is general, but the implementation details (such as internal state) may 
> be vendor specific. If the migration happens between 2 identical NVMe devices 
> (from same vendor/device w/ same firmware version), migration of 
> subsystem-wide state can be naturally covered, right?

No.  If you want live migration for nvme supported in Linux, it must
be speced in the NVMe technical working group and interoperate between
different implementations.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-11 14:51                         ` Max Gurtovoy
  2022-12-12  1:20                           ` Rao, Lei
@ 2022-12-12  8:09                           ` Christoph Hellwig
  1 sibling, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-12  8:09 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Rao, Lei, Christoph Hellwig, Jason Gunthorpe, kbusch, axboe, kch,
	sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, kevin.tian, mjrosato, linux-kernel,
	linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu, Konrad.wilk,
	stephen, hang.yuan, Oren Duer

On Sun, Dec 11, 2022 at 04:51:02PM +0200, Max Gurtovoy wrote:
> I don't think that medium migration should be part of the SPEC. We can 
> specify it's out of scope.

This is the main item in the TPAR in the technical working group,
with SQ/CQ state beeing the other one.  So instead of arguing here
I'd suggest you all get involved in the working group ASAP.

> All the idea of live migration is to have a short downtime and I don't 
> think we can guarantee short downtime if we need to copy few terabytes 
> throw the networking.

You can.  Look at the existing qemu code for live migration for
image based storage, the same concepts also work for hardware offloads.

> If the media copy is taking few seconds, there is no need to do live 
> migration of few milisecs downtime. Just do regular migration of a VM.

The point is of course to not do the data migration during the downtime,
but to track newly written LBAs after the start of the copy proces.
Again look at qemu for how this has been done for years in software.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
  2022-12-09 16:53           ` Li, Yadong
@ 2022-12-12  8:11             ` Christoph Hellwig
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-12  8:11 UTC (permalink / raw)
  To: Li, Yadong
  Cc: Tian, Kevin, Christoph Hellwig, Jason Gunthorpe, Rao, Lei,
	kbusch, axboe, kch, sagi, alex.williamson, cohuck, yishaih,
	shameerali.kolothum.thodi, mjrosato, linux-kernel, linux-nvme,
	kvm, Dong, Eddie, Liu, Yi L, Wilk, Konrad, stephen, Yuan, Hang

On Fri, Dec 09, 2022 at 04:53:47PM +0000, Li, Yadong wrote:
> The key leadership in NVMe WG, Peter and Nick (from Intel), are on top of this TPAR,
> and help influence the scope of the TPAR to better support IPU/DPUs.

You guys should talk more to each other.  I think Peter especially has
been vocal to reduce the scope and not include this.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-12  7:55                                     ` Christoph Hellwig
@ 2022-12-12 14:49                                       ` Max Gurtovoy
  0 siblings, 0 replies; 65+ messages in thread
From: Max Gurtovoy @ 2022-12-12 14:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan, Oren Duer


On 12/12/2022 9:55 AM, Christoph Hellwig wrote:
> On Sun, Dec 11, 2022 at 01:39:37PM +0200, Max Gurtovoy wrote:
>> 1. Need to define a concept of a "virtual subsystem". A primary controller
>> will be able to create a virtual subsystem. Inside this subsystem the
>> primary controller will be the master ("the controlling") of the migration
>> process. It will also be able to add secondary controllers to this virtual
>> subsystem and assign "virtual controller ID" to it.
>> something like:
>> - nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
>> --dev_vcid = 1
>> - nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
>> --secondary_dev=/dev/nvme2 --secondary_dev_vcid=20
> Yes. Note that there is a bit more state than just the NQN.  You also
> need at least a serial number, and also probably a different vendor
> ID (the PCI vendor ID that is also mirror in Identify Controller and
> the IEEE OUI), and new unique namespace identifier.

Yes for sure there is more bits we should consider.

I wanted to describe the high level.

I think that we can maybe say the when a secondary function is moved to 
a virtual subsystem its feature set of the virtual ctrl is narrowed to 
mandatory NVMe features set.
And we'll provide an API to set/extend it's feature set to a maximum of 
the feature set that the original ctrl of the secondary function had.
Then the sys admin will configure the virtual ctrl in both src/dst hosts.

The high level idea is to have a programmable way to set the features of 
a virtual controller inside a virtual subsystem that is also programmable.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-12  7:50                                   ` Christoph Hellwig
@ 2022-12-13 14:01                                     ` Jason Gunthorpe
  2022-12-13 16:08                                       ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-13 14:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Mon, Dec 12, 2022 at 08:50:46AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote:
> > However hisilicon managed to do their implementation without this, or
> > rather you could say their "controlling function" is a single MMIO BAR
> > page in their PCI VF and their "controlled function" is the rest of
> > the PCI VF.
> 
> Eww.  So you need to carefully filter the BAR and can't actually do
> any DMA at all?  I'm not sure that is actually practical, especially
> not for something with a lot of state.

Indeed, but it is what they did and the HW should be supported by the
kernel, IMO.

> > If the kernel knows this information then we can find a way for the
> > vfio_device to have pointers to both controlling and controlled
> > objects. I have a suggestion below.
> 
> So now we need to write a vfio shim for every function even if there
> is absolutely nothing special about that function?  Migrating really
> is the controlling functions behavior, and writing a new vfio bit
> for every controlled thing just does not scale.

Huh? "does not scale?" We are looking at boilerplates of around 20-30
lines to make a VFIO driver for a real PCI device. Why is that even
something we should worry about optimizing?

And when you get into exciting future devices like SIOV you already
need to make a special VFIO driver anyhow.

> Yes, and that's the problem, because they are associated with the
> controlled function, and now we have a communication problem between
> that vfio driver binding to the controlled function and the drive
> that actually controlls live migration that is associated with the
> controlling function.  In other words: you've created a giant mess.

So far 100% of the drivers that have been presented, including the two
RFC ones, have entanglements between live migration and vfio. Shifting
things to dev/live_migration doesn't make the "communication problem"
away, it just shifted it into another subsystem.

This is my point, I've yet to even see a driver that meets your
theoretical standard that it can exist without vfio entanglement. So
I'd be much more interested in this idea if we had a stable of drivers
that obviously were harmed by VFIO. We don't have that, and I don't
even think that we ever will, considering both RFC drivers have
devices that also stepped on the mlx5 FLR problem too.

The FLR thing actually makes sense, becauase it is not actually the
controlling function that is doing the live migration inside the
devices. The controlling function is just a *proxy* to deliver
commands to the controlled function. So FLR on the controlled device
effects commands being executed on the controlling function. It is a
pain.

As it is, live migration is only useful with VFIO, so they are put
together to keep things simpler. The only complexity is the
controlled/controlling issue and for all existing SRIOV PF/VF
relationships we have an OK solution (at least it isn't buggy).

nvme's higher flexability needs more work, but that doesn't mean the
idea is so wrong. I think you are reall overstating the "mess"

> I'm not proposing to ignore them.  But they should not be needed most
> of the time.

I'm not seeing that in the drivers I've looked at.

> > Which is really why this is in VFIO in the first place. It actually is
> > coupled in practice, if not in theory.
> 
> So you've created a long term userspace API around working around
> around buggy and/or misdesigned early designs and now want to force
> it down everyones throat?

No, we coupled live migration and VFIO because they are never useful
apart, and we accept that we must find reasonable solutions to linking
the controlling/controlled device because it is necessary in all cases
we've seen.

> > If we accept that drivers/vfio can be the "live migration subsystem"
> > then lets go all the way and have the controlling driver to call
> > vfio_device_group_register() to create the VFIO char device for the
> > controlled function.
> 
> While creating the VFs from the PF driver makes a lot more sense,
> remember that vfio is absolutely not the only use case for VFs.
> There are plenty use cases where you want to use them with the normal
> kernel driver as well.  So the interface to create VFs needs a now
> to decide if it should be vfio exported, or use the normal kernel
> binding.

Yes, that is why this problem has been open for so long. Fixing it
well requires some reconsideration of how the driver core works :(

It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
etc.

When we create the function we really want to tell the device what
kind of function it is, and that also tells the kernel what driver
should be bound to it.

mlx5 even has weird limitations, like a controlled function that is
live migration capable has fewer features than a function that is
not. So the user must specify what parameters it wants the controlled
function to have..

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-13 14:01                                     ` Jason Gunthorpe
@ 2022-12-13 16:08                                       ` Christoph Hellwig
  2022-12-13 17:49                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2022-12-13 16:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Lei Rao, kbusch, axboe, kch, sagi,
	alex.williamson, cohuck, yishaih, shameerali.kolothum.thodi,
	kevin.tian, mjrosato, linux-kernel, linux-nvme, kvm, eddie.dong,
	yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

On Tue, Dec 13, 2022 at 10:01:03AM -0400, Jason Gunthorpe wrote:
> > So now we need to write a vfio shim for every function even if there
> > is absolutely nothing special about that function?  Migrating really
> > is the controlling functions behavior, and writing a new vfio bit
> > for every controlled thing just does not scale.
> 
> Huh? "does not scale?" We are looking at boilerplates of around 20-30
> lines to make a VFIO driver for a real PCI device. Why is that even
> something we should worry about optimizing?

But we need a new driver for every controlled function now, which
is very different from the classic VFIO model where we had one
vfio_pci.

> And when you get into exciting future devices like SIOV you already
> need to make a special VFIO driver anyhow.

You need to special support for it.  It's probably not another
Linux driver but part of the parent one, though.

> So far 100% of the drivers that have been presented, including the two
> RFC ones, have entanglements between live migration and vfio. Shifting
> things to dev/live_migration doesn't make the "communication problem"
> away, it just shifted it into another subsystem.

The main entanglement seems to be that it needs to support a vfio
interface for live migration while the actual commands go to the
parent device.

> This is my point, I've yet to even see a driver that meets your
> theoretical standard that it can exist without vfio entanglement.

It can't right now due to the VFIO design.

> > While creating the VFs from the PF driver makes a lot more sense,
> > remember that vfio is absolutely not the only use case for VFs.
> > There are plenty use cases where you want to use them with the normal
> > kernel driver as well.  So the interface to create VFs needs a now
> > to decide if it should be vfio exported, or use the normal kernel
> > binding.
> 
> Yes, that is why this problem has been open for so long. Fixing it
> well requires some reconsideration of how the driver core works :(
> 
> It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
> a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
> etc.

This seems to violate the PCIe spec, which says:

"All VFs associated with a PF must be the same device type as the PF,
(e.g., the same network device type or the same storage device type.)",

which is also enforced by not allowing to read vendor/device/class
fields from VFs.

(not that I'm arguing that this is a good limit, but that's how
PCIe does it).

> When we create the function we really want to tell the device what
> kind of function it is, and that also tells the kernel what driver
> should be bound to it.

I'd rather have different ways to probe by passing a "kind" or "type"
argument along the device IDs during probing.  E.g. "driver"
and "vfio", and then only match for the kind the creator of the device
added them to the device model for. 

> mlx5 even has weird limitations, like a controlled function that is
> live migration capable has fewer features than a function that is
> not. So the user must specify what parameters it wants the controlled
> function to have..

I don't think that is weird.  If you want to live migrate, you need to

 a) make sure the feature set is compatible with the other side
 b) there is only state that actually is migratable

so I'd expect that for any other sufficiently complex device.  NVMe
for sure will have limits like this.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
  2022-12-13 16:08                                       ` Christoph Hellwig
@ 2022-12-13 17:49                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2022-12-13 17:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lei Rao, kbusch, axboe, kch, sagi, alex.williamson, cohuck,
	yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm, eddie.dong, yadong.li, yi.l.liu,
	Konrad.wilk, stephen, hang.yuan

On Tue, Dec 13, 2022 at 05:08:07PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 13, 2022 at 10:01:03AM -0400, Jason Gunthorpe wrote:
> > > So now we need to write a vfio shim for every function even if there
> > > is absolutely nothing special about that function?  Migrating really
> > > is the controlling functions behavior, and writing a new vfio bit
> > > for every controlled thing just does not scale.
> > 
> > Huh? "does not scale?" We are looking at boilerplates of around 20-30
> > lines to make a VFIO driver for a real PCI device. Why is that even
> > something we should worry about optimizing?
> 
> But we need a new driver for every controlled function now, which
> is very different from the classic VFIO model where we had one
> vfio_pci.

To be fair, mainly vfio_pci had that model. Other uses of VFIO have
device specific drivers already. We have the reset drivers in vfio
platform, and the mdevs already. SIOV drivers are coming and they will
not be general either. I know a few coming non-migration VFIO PCI
variant drivers as well to deal with HW issues.

Remember, we did a bunch of work to make this reasonable. Userspace
can properly probe the correct VFIO driver for the HW it wants to use,
just like normal devices. If we spawn the VFIO from the controlling
function then it obviously will bring the correct driver along too.

The mental model I have for VFIO is that every vfio_device has a
driver, and we have three "universal" drivers that wildcard match to
many devices (pci, fsl, and platform acpi reset). Otherwise VFIO is
like every other driver subsystem out there, with physical devices and
matching drivers that support them.

Creating drivers for HW is not a problem, that is what a driver
subsystem is for. We already invested effort in VFIO to make this
scalable.

> > And when you get into exciting future devices like SIOV you already
> > need to make a special VFIO driver anyhow.
> 
> You need to special support for it.  It's probably not another
> Linux driver but part of the parent one, though.

The designs we have done in mlx5 are split. The "parent" has just
enough shim to describe what the SIOV is in terms of a 'slice of the
parents resources' and then we layer another driver, located in the
proper subsystem, to operate that slice. VDPA makes a
/dev/virtio-whatever, VFIO would make a fake PCI function, mlx5 makes
a netdev, etc.

It is not so different from how a PF/VF relationship works, just that
the SIOV is described by a struct auxillary_device not a struct
pci_dev.

I don't really like implementing VFIO drivers outside drivers/vfio, I
think that has historically had bad outcomes in other subsystems.

> > So far 100% of the drivers that have been presented, including the two
> > RFC ones, have entanglements between live migration and vfio. Shifting
> > things to dev/live_migration doesn't make the "communication problem"
> > away, it just shifted it into another subsystem.
> 
> The main entanglement seems to be that it needs to support a vfio
> interface for live migration while the actual commands go to the
> parent device.

Not at all, that is only a couple function calls in 4 of the drivers
so far.

The entanglement is that the live migration FSM and the VFIO device
operation are not isolated. I keep repeating this - mlx5 and the two
RFC drivers must trap VFIO operations and relay them to their
migration logic. hns has to mangle its BARs. These are things that
only exist on the VFIO side.

So, you are viewing live migration as orthogonal and separable to
VFIO, and I don't agree with this because I haven't yet seen any proof
in implementations.

Let's go through the nvme spec process and see how it works out. If
NVMe can address things which are tripping up other implemenations,
like FLR of the controlled function. Then we may have the first
example. If not, then it is just how things are.

FLR is trickey, it not obvious to me that you want a definition of
migration that isolates controlled function FLR from the migration
FSM..

There are advantages to having a reliable, universal, way to bring a
function back to a clean slate, including restoring it to full
operation (ie canceling any migration operation). The current
definition of FLR provides this.

> > It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
> > a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
> > etc.
> 
> This seems to violate the PCIe spec, which says:
> 
> "All VFs associated with a PF must be the same device type as the PF,
> (e.g., the same network device type or the same storage device type.)",

For VFs there are multiple PFs to follow the above, and for SIOV this
language doesn't apply.

It seems the PDS RFC driver does violate this spec requirement though..

> > When we create the function we really want to tell the device what
> > kind of function it is, and that also tells the kernel what driver
> > should be bound to it.
> 
> I'd rather have different ways to probe by passing a "kind" or "type"
> argument along the device IDs during probing.  E.g. "driver"
> and "vfio", and then only match for the kind the creator of the device
> added them to the device model for. 

Not everything can be done during driver probing. There are certainly
steps at SIOV instantiation time or VF provisioning that impact what
exactly is available on the controlled function. Eg on mlx5 when we
create a VDPA device it actually is different from a full-function
mlx5 device and that customization was done before any driver was
probed.

In fact, not only is it done before driver binding, but it can be
enforced as a security property from the DPU side when the DPU is the
thing creating the function.

I like the general idea of type to help specify the driver to probe,
we tried to work on something like that once and it didn't go far, but
I did like the concept of it.

> > mlx5 even has weird limitations, like a controlled function that is
> > live migration capable has fewer features than a function that is
> > not. So the user must specify what parameters it wants the controlled
> > function to have..
> 
> I don't think that is weird.  If you want to live migrate, you need to
> 
>  a) make sure the feature set is compatible with the other side
>  b) there is only state that actually is migratable
> 
> so I'd expect that for any other sufficiently complex device.  NVMe
> for sure will have limits like this.

Oy, this has been pretty hard to define in mlx5 already :( Hopefully
nvme-cli can sort it out for NVMe configurables.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 3/5] nvme-vfio: enable the function of VFIO live migration.
  2022-12-06  5:58 ` [RFC PATCH 3/5] nvme-vfio: enable the function of VFIO live migration Lei Rao
@ 2023-01-19 10:21   ` Max Gurtovoy
  2023-02-09  9:09     ` Rao, Lei
  0 siblings, 1 reply; 65+ messages in thread
From: Max Gurtovoy @ 2023-01-19 10:21 UTC (permalink / raw)
  To: Lei Rao, kbusch, axboe, kch, hch, sagi, alex.williamson, cohuck,
	jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, mjrosato,
	linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan

Hi Leo,

On 06/12/2022 7:58, Lei Rao wrote:
> Implement specific VFIO live migration operations for NVMe devices.
>
> Signed-off-by: Lei Rao <lei.rao@intel.com>
> Signed-off-by: Yadong Li <yadong.li@intel.com>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> Reviewed-by: Eddie Dong <eddie.dong@intel.com>
> Reviewed-by: Hang Yuan <hang.yuan@intel.com>
> ---
> drivers/vfio/pci/nvme/Kconfig |   5 +-
> drivers/vfio/pci/nvme/nvme.c  | 543 ++++++++++++++++++++++++++++++++--
> drivers/vfio/pci/nvme/nvme.h  | 111 +++++++
> 3 files changed, 637 insertions(+), 22 deletions(-)
> create mode 100644 drivers/vfio/pci/nvme/nvme.h
>
> diff --git a/drivers/vfio/pci/nvme/Kconfig 
> b/drivers/vfio/pci/nvme/Kconfig
> index c281fe154007..12e0eaba0de1 100644
> --- a/drivers/vfio/pci/nvme/Kconfig
> +++ b/drivers/vfio/pci/nvme/Kconfig
> @@ -1,9 +1,10 @@
> # SPDX-License-Identifier: GPL-2.0-only
> config NVME_VFIO_PCI
>     tristate "VFIO support for NVMe PCI devices"
> +    depends on NVME_CORE
>     depends on VFIO_PCI_CORE
>     help
> -      This provides generic VFIO PCI support for NVMe device
> -      using the VFIO framework.
> +      This provides migration support for NVMe devices using the
> +      VFIO framework.
>
>       If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
> index f1386d8a9287..698e470a4e53 100644
> --- a/drivers/vfio/pci/nvme/nvme.c
> +++ b/drivers/vfio/pci/nvme/nvme.c
> @@ -13,29 +13,503 @@
> #include <linux/types.h>
> #include <linux/vfio.h>
> #include <linux/anon_inodes.h>
> -#include <linux/kernel.h>
> -#include <linux/vfio_pci_core.h>
> +
> +#include "nvme.h"
> +
> +#define MAX_MIGRATION_SIZE (256 * 1024)
> +
> +static int nvmevf_cmd_suspend_device(struct nvmevf_pci_core_device 
> *nvmevf_dev)
> +{
> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
> +    struct nvme_live_mig_command c = { };
> +    int ret;
> +
> +    c.suspend.opcode = nvme_admin_live_mig_suspend;
> +    c.suspend.vf_index = nvmevf_dev->vf_id;
> +
> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, 
> NULL, 0);
> +    if (ret) {
> +        dev_warn(&dev->dev, "Suspend virtual function failed 
> (ret=0x%x)\n", ret);
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +static int nvmevf_cmd_resume_device(struct nvmevf_pci_core_device 
> *nvmevf_dev)
> +{
> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
> +    struct nvme_live_mig_command c = { };
> +    int ret;
> +
> +    c.resume.opcode = nvme_admin_live_mig_resume;
> +    c.resume.vf_index = nvmevf_dev->vf_id;
> +
> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, 
> NULL, 0);
> +    if (ret) {
> +        dev_warn(&dev->dev, "Resume virtual function failed 
> (ret=0x%x)\n", ret);
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +static int nvmevf_cmd_query_data_size(struct nvmevf_pci_core_device 
> *nvmevf_dev,
> +                      size_t *state_size)
> +{
> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
> +    struct nvme_live_mig_command c = { };
> +    size_t result;
> +    int ret;
> +
> +    c.query.opcode = nvme_admin_live_mig_query_data_size;
> +    c.query.vf_index = nvmevf_dev->vf_id;
> +
> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, &result, 
> NULL, 0);
> +    if (ret) {
> +        dev_warn(&dev->dev, "Query the states size failed 
> (ret=0x%x)\n", ret);
> +        *state_size = 0;
> +        return ret;
> +    }
> +    *state_size = result;
> +    return 0;
> +}
> +
> +static int nvmevf_cmd_save_data(struct nvmevf_pci_core_device 
> *nvmevf_dev,
> +                    void *buffer, size_t buffer_len)
> +{
> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
> +    struct nvme_live_mig_command c = { };
> +    int ret;
> +
> +    c.save.opcode = nvme_admin_live_mig_save_data;
> +    c.save.vf_index = nvmevf_dev->vf_id;
> +
> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, 
> buffer, buffer_len);
> +    if (ret) {
> +        dev_warn(&dev->dev, "Save the device states failed 
> (ret=0x%x)\n", ret);
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +static int nvmevf_cmd_load_data(struct nvmevf_pci_core_device 
> *nvmevf_dev,
> +                    struct nvmevf_migration_file *migf)
> +{
> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
> +    struct nvme_live_mig_command c = { };
> +    int ret;
> +
> +    c.load.opcode = nvme_admin_live_mig_load_data;
> +    c.load.vf_index = nvmevf_dev->vf_id;
> +    c.load.size = migf->total_length;
> +
> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL,
> +            migf->vf_data, migf->total_length);
> +    if (ret) {
> +        dev_warn(&dev->dev, "Load the device states failed 
> (ret=0x%x)\n", ret);
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev 
> *pdev)
> +{
> +    struct vfio_pci_core_device *core_device = 
> dev_get_drvdata(&pdev->dev);
> +
> +    return container_of(core_device, struct nvmevf_pci_core_device, 
> core_device);
> +}
> +
> +static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
> +{
> +    mutex_lock(&migf->lock);
> +
> +    /* release the device states buffer */
> +    kvfree(migf->vf_data);
> +    migf->vf_data = NULL;
> +    migf->disabled = true;
> +    migf->total_length = 0;
> +    migf->filp->f_pos = 0;
> +    mutex_unlock(&migf->lock);
> +}
> +
> +static int nvmevf_release_file(struct inode *inode, struct file *filp)
> +{
> +    struct nvmevf_migration_file *migf = filp->private_data;
> +
> +    nvmevf_disable_fd(migf);
> +    mutex_destroy(&migf->lock);
> +    kfree(migf);
> +    return 0;
> +}
> +
> +static ssize_t nvmevf_save_read(struct file *filp, char __user *buf, 
> size_t len, loff_t *pos)
> +{
> +    struct nvmevf_migration_file *migf = filp->private_data;
> +    ssize_t done = 0;
> +    int ret;
> +
> +    if (pos)
> +        return -ESPIPE;
> +    pos = &filp->f_pos;
> +
> +    mutex_lock(&migf->lock);
> +    if (*pos > migf->total_length) {
> +        done = -EINVAL;
> +        goto out_unlock;
> +    }
> +
> +    if (migf->disabled) {
> +        done = -EINVAL;
> +        goto out_unlock;
> +    }
> +
> +    len = min_t(size_t, migf->total_length - *pos, len);
> +    if (len) {
> +        ret = copy_to_user(buf, migf->vf_data + *pos, len);
> +        if (ret) {
> +            done = -EFAULT;
> +            goto out_unlock;
> +        }
> +        *pos += len;
> +        done = len;
> +    }
> +
> +out_unlock:
> +    mutex_unlock(&migf->lock);
> +    return done;
> +}
> +
> +static const struct file_operations nvmevf_save_fops = {
> +    .owner = THIS_MODULE,
> +    .read = nvmevf_save_read,
> +    .release = nvmevf_release_file,
> +    .llseek = no_llseek,
> +};
> +
> +static ssize_t nvmevf_resume_write(struct file *filp, const char 
> __user *buf,
> +                       size_t len, loff_t *pos)
> +{
> +    struct nvmevf_migration_file *migf = filp->private_data;
> +    loff_t requested_length;
> +    ssize_t done = 0;
> +    int ret;
> +
> +    if (pos)
> +        return -ESPIPE;
> +    pos = &filp->f_pos;
> +
> +    if (*pos < 0 || check_add_overflow((loff_t)len, *pos, 
> &requested_length))
> +        return -EINVAL;
> +
> +    if (requested_length > MAX_MIGRATION_SIZE)
> +        return -ENOMEM;
> +    mutex_lock(&migf->lock);
> +    if (migf->disabled) {
> +        done = -ENODEV;
> +        goto out_unlock;
> +    }
> +
> +    ret = copy_from_user(migf->vf_data + *pos, buf, len);
> +    if (ret) {
> +        done = -EFAULT;
> +        goto out_unlock;
> +    }
> +    *pos += len;
> +    done = len;
> +    migf->total_length += len;
> +
> +out_unlock:
> +    mutex_unlock(&migf->lock);
> +    return done;
> +}
> +
> +static const struct file_operations nvmevf_resume_fops = {
> +    .owner = THIS_MODULE,
> +    .write = nvmevf_resume_write,
> +    .release = nvmevf_release_file,
> +    .llseek = no_llseek,
> +};
> +
> +static void nvmevf_disable_fds(struct nvmevf_pci_core_device 
> *nvmevf_dev)
> +{
> +    if (nvmevf_dev->resuming_migf) {
> +        nvmevf_disable_fd(nvmevf_dev->resuming_migf);
> +        fput(nvmevf_dev->resuming_migf->filp);
> +        nvmevf_dev->resuming_migf = NULL;
> +    }
> +
> +    if (nvmevf_dev->saving_migf) {
> +        nvmevf_disable_fd(nvmevf_dev->saving_migf);
> +        fput(nvmevf_dev->saving_migf->filp);
> +        nvmevf_dev->saving_migf = NULL;
> +    }
> +}
> +
> +static struct nvmevf_migration_file *
> +nvmevf_pci_resume_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
> +{
> +    struct nvmevf_migration_file *migf;
> +    int ret;
> +
> +    migf = kzalloc(sizeof(*migf), GFP_KERNEL);
> +    if (!migf)
> +        return ERR_PTR(-ENOMEM);
> +
> +    migf->filp = anon_inode_getfile("nvmevf_mig", 
> &nvmevf_resume_fops, migf,
> +                    O_WRONLY);
> +    if (IS_ERR(migf->filp)) {
> +        int err = PTR_ERR(migf->filp);
> +
> +        kfree(migf);
> +        return ERR_PTR(err);
> +    }
> +    stream_open(migf->filp->f_inode, migf->filp);
> +    mutex_init(&migf->lock);
> +
> +    /* Allocate buffer to load the device states and the max states 
> is 256K */
> +    migf->vf_data = kvzalloc(MAX_MIGRATION_SIZE, GFP_KERNEL);
> +    if (!migf->vf_data) {
> +        ret = -ENOMEM;
> +        goto out_free;
> +    }
> +
> +    return migf;
> +
> +out_free:
> +    fput(migf->filp);
> +    return ERR_PTR(ret);
> +}
> +
> +static struct nvmevf_migration_file *
> +nvmevf_pci_save_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
> +{
> +    struct nvmevf_migration_file *migf;
> +    int ret;
> +
> +    migf = kzalloc(sizeof(*migf), GFP_KERNEL);
> +    if (!migf)
> +        return ERR_PTR(-ENOMEM);
> +
> +    migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_save_fops, 
> migf,
> +                    O_RDONLY);
> +    if (IS_ERR(migf->filp)) {
> +        int err = PTR_ERR(migf->filp);
> +
> +        kfree(migf);
> +        return ERR_PTR(err);
> +    }
> +
> +    stream_open(migf->filp->f_inode, migf->filp);
> +    mutex_init(&migf->lock);
> +
> +    ret = nvmevf_cmd_query_data_size(nvmevf_dev, &migf->total_length);
> +    if (ret)
> +        goto out_free;
> +    /* Allocate buffer and save the device states*/
> +    migf->vf_data = kvzalloc(migf->total_length, GFP_KERNEL);
> +    if (!migf->vf_data) {
> +        ret = -ENOMEM;
> +        goto out_free;
> +    }
> +
> +    ret = nvmevf_cmd_save_data(nvmevf_dev, migf->vf_data, 
> migf->total_length);
> +    if (ret)
> +        goto out_free;
> +
> +    return migf;
> +out_free:
> +    fput(migf->filp);
> +    return ERR_PTR(ret);
> +}
> +
> +static struct file *
> +nvmevf_pci_step_device_state_locked(struct nvmevf_pci_core_device 
> *nvmevf_dev, u32 new)
> +{
> +    u32 cur = nvmevf_dev->mig_state;
> +    int ret;
> +
> +    if (cur == VFIO_DEVICE_STATE_RUNNING && new == 
> VFIO_DEVICE_STATE_STOP) {
> +        ret = nvmevf_cmd_suspend_device(nvmevf_dev);
> +        if (ret)
> +            return ERR_PTR(ret);
> +        return NULL;
> +    }
> +
> +    if (cur == VFIO_DEVICE_STATE_STOP && new == 
> VFIO_DEVICE_STATE_STOP_COPY) {
> +        struct nvmevf_migration_file *migf;
> +
> +        migf = nvmevf_pci_save_device_data(nvmevf_dev);
> +        if (IS_ERR(migf))
> +            return ERR_CAST(migf);
> +        get_file(migf->filp);
> +        nvmevf_dev->saving_migf = migf;
> +        return migf->filp;
> +    }
> +
> +    if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == 
> VFIO_DEVICE_STATE_STOP) {
> +        nvmevf_disable_fds(nvmevf_dev);
> +        return NULL;
> +    }
> +
> +    if (cur == VFIO_DEVICE_STATE_STOP && new == 
> VFIO_DEVICE_STATE_RESUMING) {
> +        struct nvmevf_migration_file *migf;
> +
> +        migf = nvmevf_pci_resume_device_data(nvmevf_dev);
> +        if (IS_ERR(migf))
> +            return ERR_CAST(migf);
> +        get_file(migf->filp);
> +        nvmevf_dev->resuming_migf = migf;
> +        return migf->filp;
> +    }
> +
> +    if (cur == VFIO_DEVICE_STATE_RESUMING && new == 
> VFIO_DEVICE_STATE_STOP) {
> +        ret = nvmevf_cmd_load_data(nvmevf_dev, 
> nvmevf_dev->resuming_migf);
> +        if (ret)
> +            return ERR_PTR(ret);
> +        nvmevf_disable_fds(nvmevf_dev);
> +        return NULL;
> +    }
> +
> +    if (cur == VFIO_DEVICE_STATE_STOP && new == 
> VFIO_DEVICE_STATE_RUNNING) {
> +        nvmevf_cmd_resume_device(nvmevf_dev);
> +        return NULL;
> +    }
> +
> +    /* vfio_mig_get_next_state() does not use arcs other than the 
> above */
> +    WARN_ON(true);
> +    return ERR_PTR(-EINVAL);
> +}
> +
> +static void nvmevf_state_mutex_unlock(struct nvmevf_pci_core_device 
> *nvmevf_dev)
> +{
> +again:
> +    spin_lock(&nvmevf_dev->reset_lock);
> +    if (nvmevf_dev->deferred_reset) {
> +        nvmevf_dev->deferred_reset = false;
> +        spin_unlock(&nvmevf_dev->reset_lock);
> +        nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
> +        nvmevf_disable_fds(nvmevf_dev);
> +        goto again;
> +    }
> +    mutex_unlock(&nvmevf_dev->state_mutex);
> +    spin_unlock(&nvmevf_dev->reset_lock);
> +}
> +
> +static struct file *
> +nvmevf_pci_set_device_state(struct vfio_device *vdev, enum 
> vfio_device_mig_state new_state)
> +{
> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(vdev,
> +            struct nvmevf_pci_core_device, core_device.vdev);
> +    enum vfio_device_mig_state next_state;
> +    struct file *res = NULL;
> +    int ret;
> +
> +    mutex_lock(&nvmevf_dev->state_mutex);
> +    while (new_state != nvmevf_dev->mig_state) {
> +        ret = vfio_mig_get_next_state(vdev, nvmevf_dev->mig_state, 
> new_state, &next_state);
> +        if (ret) {
> +            res = ERR_PTR(-EINVAL);
> +            break;
> +        }
> +
> +        res = nvmevf_pci_step_device_state_locked(nvmevf_dev, 
> next_state);
> +        if (IS_ERR(res))
> +            break;
> +        nvmevf_dev->mig_state = next_state;
> +        if (WARN_ON(res && new_state != nvmevf_dev->mig_state)) {
> +            fput(res);
> +            res = ERR_PTR(-EINVAL);
> +            break;
> +        }
> +    }
> +    nvmevf_state_mutex_unlock(nvmevf_dev);
> +    return res;
> +}
> +
> +static int nvmevf_pci_get_device_state(struct vfio_device *vdev,
> +                       enum vfio_device_mig_state *curr_state)
> +{
> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(
> +            vdev, struct nvmevf_pci_core_device, core_device.vdev);
> +
> +    mutex_lock(&nvmevf_dev->state_mutex);
> +    *curr_state = nvmevf_dev->mig_state;
> +    nvmevf_state_mutex_unlock(nvmevf_dev);
> +    return 0;
> +}
>
> static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
> {
> -    struct vfio_pci_core_device *vdev =
> -        container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(
> +            core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
> +    struct vfio_pci_core_device *vdev = &nvmevf_dev->core_device;
>     int ret;
>
>     ret = vfio_pci_core_enable(vdev);
>     if (ret)
>         return ret;
>
> +    if (nvmevf_dev->migrate_cap)
> +        nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
>     vfio_pci_core_finish_enable(vdev);
>     return 0;
> }
>
> +static void nvmevf_cmd_close_migratable(struct nvmevf_pci_core_device 
> *nvmevf_dev)
> +{
> +    if (!nvmevf_dev->migrate_cap)
> +        return;
> +
> +    mutex_lock(&nvmevf_dev->state_mutex);
> +    nvmevf_disable_fds(nvmevf_dev);
> +    nvmevf_state_mutex_unlock(nvmevf_dev);
> +}
> +
> +static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
> +{
> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(
> +            core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
> +
> +    nvmevf_cmd_close_migratable(nvmevf_dev);
> +    vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static const struct vfio_migration_ops nvmevf_pci_mig_ops = {
> +    .migration_set_state = nvmevf_pci_set_device_state,
> +    .migration_get_state = nvmevf_pci_get_device_state,
> +};
> +
> +static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
> +{
> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(core_vdev,
> +                    struct nvmevf_pci_core_device, core_device.vdev);
> +    struct pci_dev *pdev = to_pci_dev(core_vdev->dev);
> +    int vf_id;
> +    int ret = -1;
> +
> +    if (!pdev->is_virtfn)
> +        return ret;
> +
> +    nvmevf_dev->migrate_cap = 1;
> +
> +    vf_id = pci_iov_vf_id(pdev);
> +    if (vf_id < 0)
> +        return ret;
> +    nvmevf_dev->vf_id = vf_id + 1;
> +    core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
> +
> +    mutex_init(&nvmevf_dev->state_mutex);
> +    spin_lock_init(&nvmevf_dev->reset_lock);
> +    core_vdev->mig_ops = &nvmevf_pci_mig_ops;
> +
> +    return vfio_pci_core_init_dev(core_vdev);
> +}
> +
> static const struct vfio_device_ops nvmevf_pci_ops = {
>     .name = "nvme-vfio-pci",
> -    .init = vfio_pci_core_init_dev,
> +    .init = nvmevf_migration_init_dev,
>     .release = vfio_pci_core_release_dev,
>     .open_device = nvmevf_pci_open_device,
> -    .close_device = vfio_pci_core_close_device,
> +    .close_device = nvmevf_pci_close_device,
>     .ioctl = vfio_pci_core_ioctl,
>     .device_feature = vfio_pci_core_ioctl_feature,
>     .read = vfio_pci_core_read,
> @@ -47,32 +521,56 @@ static const struct vfio_device_ops 
> nvmevf_pci_ops = {
>
> static int nvmevf_pci_probe(struct pci_dev *pdev, const struct 
> pci_device_id *id)
> {
> -    struct vfio_pci_core_device *vdev;
> +    struct nvmevf_pci_core_device *nvmevf_dev;
>     int ret;
>
> -    vdev = vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev,
> -                &nvmevf_pci_ops);
> -    if (IS_ERR(vdev))
> -        return PTR_ERR(vdev);
> +    nvmevf_dev = vfio_alloc_device(nvmevf_pci_core_device, 
> core_device.vdev,
> +                    &pdev->dev, &nvmevf_pci_ops);
> +    if (IS_ERR(nvmevf_dev))
> +        return PTR_ERR(nvmevf_dev);
>
> -    dev_set_drvdata(&pdev->dev, vdev);
> -    ret = vfio_pci_core_register_device(vdev);
> +    dev_set_drvdata(&pdev->dev, &nvmevf_dev->core_device);
> +    ret = vfio_pci_core_register_device(&nvmevf_dev->core_device);
>     if (ret)
>         goto out_put_dev;
> -
>     return 0;
>
> out_put_dev:
> -    vfio_put_device(&vdev->vdev);
> +    vfio_put_device(&nvmevf_dev->core_device.vdev);
>     return ret;
> +
> }
>
> static void nvmevf_pci_remove(struct pci_dev *pdev)
> {
> -    struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
> +    struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
> +
> + vfio_pci_core_unregister_device(&nvmevf_dev->core_device);
> +    vfio_put_device(&nvmevf_dev->core_device.vdev);
> +}
> +
> +static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
> +{
> +    struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
> +
> +    if (!nvmevf_dev->migrate_cap)
> +        return;
>
> -    vfio_pci_core_unregister_device(vdev);
> -    vfio_put_device(&vdev->vdev);
> +    /*
> +     * As the higher VFIO layers are holding locks across reset and 
> using
> +     * those same locks with the mm_lock we need to prevent ABBA 
> deadlock
> +     * with the state_mutex and mm_lock.
> +     * In case the state_mutex was taken already we defer the cleanup 
> work
> +     * to the unlock flow of the other running context.
> +     */
> +    spin_lock(&nvmevf_dev->reset_lock);
> +    nvmevf_dev->deferred_reset = true;
> +    if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
> +        spin_unlock(&nvmevf_dev->reset_lock);
> +        return;
> +    }
> +    spin_unlock(&nvmevf_dev->reset_lock);
> +    nvmevf_state_mutex_unlock(nvmevf_dev);
> }
>
> static const struct pci_device_id nvmevf_pci_table[] = {
> @@ -83,12 +581,17 @@ static const struct pci_device_id 
> nvmevf_pci_table[] = {
>
> MODULE_DEVICE_TABLE(pci, nvmevf_pci_table);
>
> +static const struct pci_error_handlers nvmevf_err_handlers = {
> +    .reset_done = nvmevf_pci_aer_reset_done,
> +    .error_detected = vfio_pci_core_aer_err_detected,
> +};
> +
> static struct pci_driver nvmevf_pci_driver = {
>     .name = KBUILD_MODNAME,
>     .id_table = nvmevf_pci_table,
>     .probe = nvmevf_pci_probe,
>     .remove = nvmevf_pci_remove,
> -    .err_handler = &vfio_pci_core_err_handlers,
> +    .err_handler = &nvmevf_err_handlers,
>     .driver_managed_dma = true,
> };
>
> @@ -96,4 +599,4 @@ module_pci_driver(nvmevf_pci_driver);
>
> MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Lei Rao <lei.rao@intel.com>");
> -MODULE_DESCRIPTION("NVMe VFIO PCI - Generic VFIO PCI driver for NVMe");
> +MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live 
> migration support for NVMe");
> diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
> new file mode 100644
> index 000000000000..c8464554ef53
> --- /dev/null
> +++ b/drivers/vfio/pci/nvme/nvme.h
> @@ -0,0 +1,111 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
> + */
> +
> +#ifndef NVME_VFIO_PCI_H
> +#define NVME_VFIO_PCI_H
> +
> +#include <linux/kernel.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/nvme.h>
> +
> +struct nvme_live_mig_query_size {
> +    __u8    opcode;
> +    __u8    flags;
> +    __u16    command_id;
> +    __u32    rsvd1[9];
> +    __u16    vf_index;
> +    __u16    rsvd2;
> +    __u32    rsvd3[5];
> +};
> +
> +struct nvme_live_mig_suspend {
> +    __u8    opcode;
> +    __u8    flags;
> +    __u16    command_id;
> +    __u32    rsvd1[9];
> +    __u16    vf_index;
> +    __u16    rsvd2;
> +    __u32    rsvd3[5];
> +};
> +
> +struct nvme_live_mig_resume {
> +    __u8    opcode;
> +    __u8    flags;
> +    __u16   command_id;
> +    __u32   rsvd1[9];
> +    __u16   vf_index;
> +    __u16   rsvd2;
> +    __u32   rsvd3[5];
> +};
> +
> +struct nvme_live_mig_save_data {
> +    __u8    opcode;
> +    __u8    flags;
> +    __u16    command_id;
> +    __u32    rsvd1[5];
> +    __le64    prp1;
> +    __le64    prp2;
> +    __u16    vf_index;
> +    __u16    rsvd2;
> +    __u32    rsvd3[5];
> +};

Just noticed that the save_data (similar to READ operation) doesn't have 
size/length member.
How does it work for you ?

BTW,

Are you doing iterative reads from the device to get the state ?

> +
> +struct nvme_live_mig_load_data {
> +    __u8    opcode;
> +    __u8    flags;
> +    __u16   command_id;
> +    __u32   rsvd1[5];
> +    __le64  prp1;
> +    __le64  prp2;
> +    __u16   vf_index;
> +    __u16    rsvd2;
> +    __u32    size;
> +    __u32   rsvd3[4];
> +};
> +
> +enum nvme_live_mig_admin_opcode {
> +    nvme_admin_live_mig_query_data_size    = 0xC4,
> +    nvme_admin_live_mig_suspend        = 0xC8,
> +    nvme_admin_live_mig_resume        = 0xCC,
> +    nvme_admin_live_mig_save_data        = 0xD2,
> +    nvme_admin_live_mig_load_data        = 0xD5,
> +};
> +
> +struct nvme_live_mig_command {
> +    union {
> +        struct nvme_live_mig_query_size query;
> +        struct nvme_live_mig_suspend    suspend;
> +        struct nvme_live_mig_resume    resume;
> +        struct nvme_live_mig_save_data    save;
> +        struct nvme_live_mig_load_data    load;
> +    };
> +};
> +
> +struct nvmevf_migration_file {
> +    struct file *filp;
> +    struct mutex lock;
> +    bool disabled;
> +    u8 *vf_data;
> +    size_t total_length;
> +};
> +
> +struct nvmevf_pci_core_device {
> +    struct vfio_pci_core_device core_device;
> +    int vf_id;
> +    u8 migrate_cap:1;
> +    u8 deferred_reset:1;
> +    /* protect migration state */
> +    struct mutex state_mutex;
> +    enum vfio_device_mig_state mig_state;
> +    /* protect the reset_done flow */
> +    spinlock_t reset_lock;
> +    struct nvmevf_migration_file *resuming_migf;
> +    struct nvmevf_migration_file *saving_migf;
> +};
> +
> +extern int nvme_submit_vf_cmd(struct pci_dev *dev, struct 
> nvme_command *cmd,
> +            size_t *result, void *buffer, unsigned int bufflen);
> +
> +#endif /* NVME_VFIO_PCI_H */

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 3/5] nvme-vfio: enable the function of VFIO live migration.
  2023-01-19 10:21   ` Max Gurtovoy
@ 2023-02-09  9:09     ` Rao, Lei
  0 siblings, 0 replies; 65+ messages in thread
From: Rao, Lei @ 2023-02-09  9:09 UTC (permalink / raw)
  To: Max Gurtovoy, kbusch, axboe, kch, hch, sagi, alex.williamson,
	cohuck, jgg, yishaih, shameerali.kolothum.thodi, kevin.tian,
	mjrosato, linux-kernel, linux-nvme, kvm
  Cc: eddie.dong, yadong.li, yi.l.liu, Konrad.wilk, stephen, hang.yuan



On 1/19/2023 6:21 PM, Max Gurtovoy wrote:
> Hi Leo,
> 
> On 06/12/2022 7:58, Lei Rao wrote:
>> Implement specific VFIO live migration operations for NVMe devices.
>>
>> Signed-off-by: Lei Rao <lei.rao@intel.com>
>> Signed-off-by: Yadong Li <yadong.li@intel.com>
>> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
>> Reviewed-by: Eddie Dong <eddie.dong@intel.com>
>> Reviewed-by: Hang Yuan <hang.yuan@intel.com>
>> ---
>> drivers/vfio/pci/nvme/Kconfig |   5 +-
>> drivers/vfio/pci/nvme/nvme.c  | 543 ++++++++++++++++++++++++++++++++--
>> drivers/vfio/pci/nvme/nvme.h  | 111 +++++++
>> 3 files changed, 637 insertions(+), 22 deletions(-)
>> create mode 100644 drivers/vfio/pci/nvme/nvme.h
>>
>> diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
>> index c281fe154007..12e0eaba0de1 100644
>> --- a/drivers/vfio/pci/nvme/Kconfig
>> +++ b/drivers/vfio/pci/nvme/Kconfig
>> @@ -1,9 +1,10 @@
>> # SPDX-License-Identifier: GPL-2.0-only
>> config NVME_VFIO_PCI
>>     tristate "VFIO support for NVMe PCI devices"
>> +    depends on NVME_CORE
>>     depends on VFIO_PCI_CORE
>>     help
>> -      This provides generic VFIO PCI support for NVMe device
>> -      using the VFIO framework.
>> +      This provides migration support for NVMe devices using the
>> +      VFIO framework.
>>
>>       If you don't know what to do here, say N.
>> diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
>> index f1386d8a9287..698e470a4e53 100644
>> --- a/drivers/vfio/pci/nvme/nvme.c
>> +++ b/drivers/vfio/pci/nvme/nvme.c
>> @@ -13,29 +13,503 @@
>> #include <linux/types.h>
>> #include <linux/vfio.h>
>> #include <linux/anon_inodes.h>
>> -#include <linux/kernel.h>
>> -#include <linux/vfio_pci_core.h>
>> +
>> +#include "nvme.h"
>> +
>> +#define MAX_MIGRATION_SIZE (256 * 1024)
>> +
>> +static int nvmevf_cmd_suspend_device(struct nvmevf_pci_core_device *nvmevf_dev)
>> +{
>> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
>> +    struct nvme_live_mig_command c = { };
>> +    int ret;
>> +
>> +    c.suspend.opcode = nvme_admin_live_mig_suspend;
>> +    c.suspend.vf_index = nvmevf_dev->vf_id;
>> +
>> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, NULL, 0);
>> +    if (ret) {
>> +        dev_warn(&dev->dev, "Suspend virtual function failed (ret=0x%x)\n", ret);
>> +        return ret;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int nvmevf_cmd_resume_device(struct nvmevf_pci_core_device *nvmevf_dev)
>> +{
>> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
>> +    struct nvme_live_mig_command c = { };
>> +    int ret;
>> +
>> +    c.resume.opcode = nvme_admin_live_mig_resume;
>> +    c.resume.vf_index = nvmevf_dev->vf_id;
>> +
>> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, NULL, 0);
>> +    if (ret) {
>> +        dev_warn(&dev->dev, "Resume virtual function failed (ret=0x%x)\n", ret);
>> +        return ret;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int nvmevf_cmd_query_data_size(struct nvmevf_pci_core_device *nvmevf_dev,
>> +                      size_t *state_size)
>> +{
>> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
>> +    struct nvme_live_mig_command c = { };
>> +    size_t result;
>> +    int ret;
>> +
>> +    c.query.opcode = nvme_admin_live_mig_query_data_size;
>> +    c.query.vf_index = nvmevf_dev->vf_id;
>> +
>> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, &result, NULL, 0);
>> +    if (ret) {
>> +        dev_warn(&dev->dev, "Query the states size failed (ret=0x%x)\n", ret);
>> +        *state_size = 0;
>> +        return ret;
>> +    }
>> +    *state_size = result;
>> +    return 0;
>> +}
>> +
>> +static int nvmevf_cmd_save_data(struct nvmevf_pci_core_device *nvmevf_dev,
>> +                    void *buffer, size_t buffer_len)
>> +{
>> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
>> +    struct nvme_live_mig_command c = { };
>> +    int ret;
>> +
>> +    c.save.opcode = nvme_admin_live_mig_save_data;
>> +    c.save.vf_index = nvmevf_dev->vf_id;
>> +
>> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, buffer, buffer_len);
>> +    if (ret) {
>> +        dev_warn(&dev->dev, "Save the device states failed (ret=0x%x)\n", ret);
>> +        return ret;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int nvmevf_cmd_load_data(struct nvmevf_pci_core_device *nvmevf_dev,
>> +                    struct nvmevf_migration_file *migf)
>> +{
>> +    struct pci_dev *dev = nvmevf_dev->core_device.pdev;
>> +    struct nvme_live_mig_command c = { };
>> +    int ret;
>> +
>> +    c.load.opcode = nvme_admin_live_mig_load_data;
>> +    c.load.vf_index = nvmevf_dev->vf_id;
>> +    c.load.size = migf->total_length;
>> +
>> +    ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL,
>> +            migf->vf_data, migf->total_length);
>> +    if (ret) {
>> +        dev_warn(&dev->dev, "Load the device states failed (ret=0x%x)\n", ret);
>> +        return ret;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev *pdev)
>> +{
>> +    struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
>> +
>> +    return container_of(core_device, struct nvmevf_pci_core_device, core_device);
>> +}
>> +
>> +static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
>> +{
>> +    mutex_lock(&migf->lock);
>> +
>> +    /* release the device states buffer */
>> +    kvfree(migf->vf_data);
>> +    migf->vf_data = NULL;
>> +    migf->disabled = true;
>> +    migf->total_length = 0;
>> +    migf->filp->f_pos = 0;
>> +    mutex_unlock(&migf->lock);
>> +}
>> +
>> +static int nvmevf_release_file(struct inode *inode, struct file *filp)
>> +{
>> +    struct nvmevf_migration_file *migf = filp->private_data;
>> +
>> +    nvmevf_disable_fd(migf);
>> +    mutex_destroy(&migf->lock);
>> +    kfree(migf);
>> +    return 0;
>> +}
>> +
>> +static ssize_t nvmevf_save_read(struct file *filp, char __user *buf, size_t len, loff_t *pos)
>> +{
>> +    struct nvmevf_migration_file *migf = filp->private_data;
>> +    ssize_t done = 0;
>> +    int ret;
>> +
>> +    if (pos)
>> +        return -ESPIPE;
>> +    pos = &filp->f_pos;
>> +
>> +    mutex_lock(&migf->lock);
>> +    if (*pos > migf->total_length) {
>> +        done = -EINVAL;
>> +        goto out_unlock;
>> +    }
>> +
>> +    if (migf->disabled) {
>> +        done = -EINVAL;
>> +        goto out_unlock;
>> +    }
>> +
>> +    len = min_t(size_t, migf->total_length - *pos, len);
>> +    if (len) {
>> +        ret = copy_to_user(buf, migf->vf_data + *pos, len);
>> +        if (ret) {
>> +            done = -EFAULT;
>> +            goto out_unlock;
>> +        }
>> +        *pos += len;
>> +        done = len;
>> +    }
>> +
>> +out_unlock:
>> +    mutex_unlock(&migf->lock);
>> +    return done;
>> +}
>> +
>> +static const struct file_operations nvmevf_save_fops = {
>> +    .owner = THIS_MODULE,
>> +    .read = nvmevf_save_read,
>> +    .release = nvmevf_release_file,
>> +    .llseek = no_llseek,
>> +};
>> +
>> +static ssize_t nvmevf_resume_write(struct file *filp, const char __user *buf,
>> +                       size_t len, loff_t *pos)
>> +{
>> +    struct nvmevf_migration_file *migf = filp->private_data;
>> +    loff_t requested_length;
>> +    ssize_t done = 0;
>> +    int ret;
>> +
>> +    if (pos)
>> +        return -ESPIPE;
>> +    pos = &filp->f_pos;
>> +
>> +    if (*pos < 0 || check_add_overflow((loff_t)len, *pos, &requested_length))
>> +        return -EINVAL;
>> +
>> +    if (requested_length > MAX_MIGRATION_SIZE)
>> +        return -ENOMEM;
>> +    mutex_lock(&migf->lock);
>> +    if (migf->disabled) {
>> +        done = -ENODEV;
>> +        goto out_unlock;
>> +    }
>> +
>> +    ret = copy_from_user(migf->vf_data + *pos, buf, len);
>> +    if (ret) {
>> +        done = -EFAULT;
>> +        goto out_unlock;
>> +    }
>> +    *pos += len;
>> +    done = len;
>> +    migf->total_length += len;
>> +
>> +out_unlock:
>> +    mutex_unlock(&migf->lock);
>> +    return done;
>> +}
>> +
>> +static const struct file_operations nvmevf_resume_fops = {
>> +    .owner = THIS_MODULE,
>> +    .write = nvmevf_resume_write,
>> +    .release = nvmevf_release_file,
>> +    .llseek = no_llseek,
>> +};
>> +
>> +static void nvmevf_disable_fds(struct nvmevf_pci_core_device *nvmevf_dev)
>> +{
>> +    if (nvmevf_dev->resuming_migf) {
>> +        nvmevf_disable_fd(nvmevf_dev->resuming_migf);
>> +        fput(nvmevf_dev->resuming_migf->filp);
>> +        nvmevf_dev->resuming_migf = NULL;
>> +    }
>> +
>> +    if (nvmevf_dev->saving_migf) {
>> +        nvmevf_disable_fd(nvmevf_dev->saving_migf);
>> +        fput(nvmevf_dev->saving_migf->filp);
>> +        nvmevf_dev->saving_migf = NULL;
>> +    }
>> +}
>> +
>> +static struct nvmevf_migration_file *
>> +nvmevf_pci_resume_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
>> +{
>> +    struct nvmevf_migration_file *migf;
>> +    int ret;
>> +
>> +    migf = kzalloc(sizeof(*migf), GFP_KERNEL);
>> +    if (!migf)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_resume_fops, migf,
>> +                    O_WRONLY);
>> +    if (IS_ERR(migf->filp)) {
>> +        int err = PTR_ERR(migf->filp);
>> +
>> +        kfree(migf);
>> +        return ERR_PTR(err);
>> +    }
>> +    stream_open(migf->filp->f_inode, migf->filp);
>> +    mutex_init(&migf->lock);
>> +
>> +    /* Allocate buffer to load the device states and the max states is 256K */
>> +    migf->vf_data = kvzalloc(MAX_MIGRATION_SIZE, GFP_KERNEL);
>> +    if (!migf->vf_data) {
>> +        ret = -ENOMEM;
>> +        goto out_free;
>> +    }
>> +
>> +    return migf;
>> +
>> +out_free:
>> +    fput(migf->filp);
>> +    return ERR_PTR(ret);
>> +}
>> +
>> +static struct nvmevf_migration_file *
>> +nvmevf_pci_save_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
>> +{
>> +    struct nvmevf_migration_file *migf;
>> +    int ret;
>> +
>> +    migf = kzalloc(sizeof(*migf), GFP_KERNEL);
>> +    if (!migf)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_save_fops, migf,
>> +                    O_RDONLY);
>> +    if (IS_ERR(migf->filp)) {
>> +        int err = PTR_ERR(migf->filp);
>> +
>> +        kfree(migf);
>> +        return ERR_PTR(err);
>> +    }
>> +
>> +    stream_open(migf->filp->f_inode, migf->filp);
>> +    mutex_init(&migf->lock);
>> +
>> +    ret = nvmevf_cmd_query_data_size(nvmevf_dev, &migf->total_length);
>> +    if (ret)
>> +        goto out_free;
>> +    /* Allocate buffer and save the device states*/
>> +    migf->vf_data = kvzalloc(migf->total_length, GFP_KERNEL);
>> +    if (!migf->vf_data) {
>> +        ret = -ENOMEM;
>> +        goto out_free;
>> +    }
>> +
>> +    ret = nvmevf_cmd_save_data(nvmevf_dev, migf->vf_data, migf->total_length);
>> +    if (ret)
>> +        goto out_free;
>> +
>> +    return migf;
>> +out_free:
>> +    fput(migf->filp);
>> +    return ERR_PTR(ret);
>> +}
>> +
>> +static struct file *
>> +nvmevf_pci_step_device_state_locked(struct nvmevf_pci_core_device *nvmevf_dev, u32 new)
>> +{
>> +    u32 cur = nvmevf_dev->mig_state;
>> +    int ret;
>> +
>> +    if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) {
>> +        ret = nvmevf_cmd_suspend_device(nvmevf_dev);
>> +        if (ret)
>> +            return ERR_PTR(ret);
>> +        return NULL;
>> +    }
>> +
>> +    if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
>> +        struct nvmevf_migration_file *migf;
>> +
>> +        migf = nvmevf_pci_save_device_data(nvmevf_dev);
>> +        if (IS_ERR(migf))
>> +            return ERR_CAST(migf);
>> +        get_file(migf->filp);
>> +        nvmevf_dev->saving_migf = migf;
>> +        return migf->filp;
>> +    }
>> +
>> +    if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
>> +        nvmevf_disable_fds(nvmevf_dev);
>> +        return NULL;
>> +    }
>> +
>> +    if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
>> +        struct nvmevf_migration_file *migf;
>> +
>> +        migf = nvmevf_pci_resume_device_data(nvmevf_dev);
>> +        if (IS_ERR(migf))
>> +            return ERR_CAST(migf);
>> +        get_file(migf->filp);
>> +        nvmevf_dev->resuming_migf = migf;
>> +        return migf->filp;
>> +    }
>> +
>> +    if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
>> +        ret = nvmevf_cmd_load_data(nvmevf_dev, nvmevf_dev->resuming_migf);
>> +        if (ret)
>> +            return ERR_PTR(ret);
>> +        nvmevf_disable_fds(nvmevf_dev);
>> +        return NULL;
>> +    }
>> +
>> +    if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING) {
>> +        nvmevf_cmd_resume_device(nvmevf_dev);
>> +        return NULL;
>> +    }
>> +
>> +    /* vfio_mig_get_next_state() does not use arcs other than the above */
>> +    WARN_ON(true);
>> +    return ERR_PTR(-EINVAL);
>> +}
>> +
>> +static void nvmevf_state_mutex_unlock(struct nvmevf_pci_core_device *nvmevf_dev)
>> +{
>> +again:
>> +    spin_lock(&nvmevf_dev->reset_lock);
>> +    if (nvmevf_dev->deferred_reset) {
>> +        nvmevf_dev->deferred_reset = false;
>> +        spin_unlock(&nvmevf_dev->reset_lock);
>> +        nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
>> +        nvmevf_disable_fds(nvmevf_dev);
>> +        goto again;
>> +    }
>> +    mutex_unlock(&nvmevf_dev->state_mutex);
>> +    spin_unlock(&nvmevf_dev->reset_lock);
>> +}
>> +
>> +static struct file *
>> +nvmevf_pci_set_device_state(struct vfio_device *vdev, enum vfio_device_mig_state new_state)
>> +{
>> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(vdev,
>> +            struct nvmevf_pci_core_device, core_device.vdev);
>> +    enum vfio_device_mig_state next_state;
>> +    struct file *res = NULL;
>> +    int ret;
>> +
>> +    mutex_lock(&nvmevf_dev->state_mutex);
>> +    while (new_state != nvmevf_dev->mig_state) {
>> +        ret = vfio_mig_get_next_state(vdev, nvmevf_dev->mig_state, new_state, &next_state);
>> +        if (ret) {
>> +            res = ERR_PTR(-EINVAL);
>> +            break;
>> +        }
>> +
>> +        res = nvmevf_pci_step_device_state_locked(nvmevf_dev, next_state);
>> +        if (IS_ERR(res))
>> +            break;
>> +        nvmevf_dev->mig_state = next_state;
>> +        if (WARN_ON(res && new_state != nvmevf_dev->mig_state)) {
>> +            fput(res);
>> +            res = ERR_PTR(-EINVAL);
>> +            break;
>> +        }
>> +    }
>> +    nvmevf_state_mutex_unlock(nvmevf_dev);
>> +    return res;
>> +}
>> +
>> +static int nvmevf_pci_get_device_state(struct vfio_device *vdev,
>> +                       enum vfio_device_mig_state *curr_state)
>> +{
>> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(
>> +            vdev, struct nvmevf_pci_core_device, core_device.vdev);
>> +
>> +    mutex_lock(&nvmevf_dev->state_mutex);
>> +    *curr_state = nvmevf_dev->mig_state;
>> +    nvmevf_state_mutex_unlock(nvmevf_dev);
>> +    return 0;
>> +}
>>
>> static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
>> {
>> -    struct vfio_pci_core_device *vdev =
>> -        container_of(core_vdev, struct vfio_pci_core_device, vdev);
>> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(
>> +            core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
>> +    struct vfio_pci_core_device *vdev = &nvmevf_dev->core_device;
>>     int ret;
>>
>>     ret = vfio_pci_core_enable(vdev);
>>     if (ret)
>>         return ret;
>>
>> +    if (nvmevf_dev->migrate_cap)
>> +        nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
>>     vfio_pci_core_finish_enable(vdev);
>>     return 0;
>> }
>>
>> +static void nvmevf_cmd_close_migratable(struct nvmevf_pci_core_device *nvmevf_dev)
>> +{
>> +    if (!nvmevf_dev->migrate_cap)
>> +        return;
>> +
>> +    mutex_lock(&nvmevf_dev->state_mutex);
>> +    nvmevf_disable_fds(nvmevf_dev);
>> +    nvmevf_state_mutex_unlock(nvmevf_dev);
>> +}
>> +
>> +static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
>> +{
>> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(
>> +            core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
>> +
>> +    nvmevf_cmd_close_migratable(nvmevf_dev);
>> +    vfio_pci_core_close_device(core_vdev);
>> +}
>> +
>> +static const struct vfio_migration_ops nvmevf_pci_mig_ops = {
>> +    .migration_set_state = nvmevf_pci_set_device_state,
>> +    .migration_get_state = nvmevf_pci_get_device_state,
>> +};
>> +
>> +static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
>> +{
>> +    struct nvmevf_pci_core_device *nvmevf_dev = container_of(core_vdev,
>> +                    struct nvmevf_pci_core_device, core_device.vdev);
>> +    struct pci_dev *pdev = to_pci_dev(core_vdev->dev);
>> +    int vf_id;
>> +    int ret = -1;
>> +
>> +    if (!pdev->is_virtfn)
>> +        return ret;
>> +
>> +    nvmevf_dev->migrate_cap = 1;
>> +
>> +    vf_id = pci_iov_vf_id(pdev);
>> +    if (vf_id < 0)
>> +        return ret;
>> +    nvmevf_dev->vf_id = vf_id + 1;
>> +    core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
>> +
>> +    mutex_init(&nvmevf_dev->state_mutex);
>> +    spin_lock_init(&nvmevf_dev->reset_lock);
>> +    core_vdev->mig_ops = &nvmevf_pci_mig_ops;
>> +
>> +    return vfio_pci_core_init_dev(core_vdev);
>> +}
>> +
>> static const struct vfio_device_ops nvmevf_pci_ops = {
>>     .name = "nvme-vfio-pci",
>> -    .init = vfio_pci_core_init_dev,
>> +    .init = nvmevf_migration_init_dev,
>>     .release = vfio_pci_core_release_dev,
>>     .open_device = nvmevf_pci_open_device,
>> -    .close_device = vfio_pci_core_close_device,
>> +    .close_device = nvmevf_pci_close_device,
>>     .ioctl = vfio_pci_core_ioctl,
>>     .device_feature = vfio_pci_core_ioctl_feature,
>>     .read = vfio_pci_core_read,
>> @@ -47,32 +521,56 @@ static const struct vfio_device_ops nvmevf_pci_ops = {
>>
>> static int nvmevf_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>> {
>> -    struct vfio_pci_core_device *vdev;
>> +    struct nvmevf_pci_core_device *nvmevf_dev;
>>     int ret;
>>
>> -    vdev = vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev,
>> -                &nvmevf_pci_ops);
>> -    if (IS_ERR(vdev))
>> -        return PTR_ERR(vdev);
>> +    nvmevf_dev = vfio_alloc_device(nvmevf_pci_core_device, core_device.vdev,
>> +                    &pdev->dev, &nvmevf_pci_ops);
>> +    if (IS_ERR(nvmevf_dev))
>> +        return PTR_ERR(nvmevf_dev);
>>
>> -    dev_set_drvdata(&pdev->dev, vdev);
>> -    ret = vfio_pci_core_register_device(vdev);
>> +    dev_set_drvdata(&pdev->dev, &nvmevf_dev->core_device);
>> +    ret = vfio_pci_core_register_device(&nvmevf_dev->core_device);
>>     if (ret)
>>         goto out_put_dev;
>> -
>>     return 0;
>>
>> out_put_dev:
>> -    vfio_put_device(&vdev->vdev);
>> +    vfio_put_device(&nvmevf_dev->core_device.vdev);
>>     return ret;
>> +
>> }
>>
>> static void nvmevf_pci_remove(struct pci_dev *pdev)
>> {
>> -    struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
>> +    struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
>> +
>> + vfio_pci_core_unregister_device(&nvmevf_dev->core_device);
>> +    vfio_put_device(&nvmevf_dev->core_device.vdev);
>> +}
>> +
>> +static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
>> +{
>> +    struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
>> +
>> +    if (!nvmevf_dev->migrate_cap)
>> +        return;
>>
>> -    vfio_pci_core_unregister_device(vdev);
>> -    vfio_put_device(&vdev->vdev);
>> +    /*
>> +     * As the higher VFIO layers are holding locks across reset and using
>> +     * those same locks with the mm_lock we need to prevent ABBA deadlock
>> +     * with the state_mutex and mm_lock.
>> +     * In case the state_mutex was taken already we defer the cleanup work
>> +     * to the unlock flow of the other running context.
>> +     */
>> +    spin_lock(&nvmevf_dev->reset_lock);
>> +    nvmevf_dev->deferred_reset = true;
>> +    if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
>> +        spin_unlock(&nvmevf_dev->reset_lock);
>> +        return;
>> +    }
>> +    spin_unlock(&nvmevf_dev->reset_lock);
>> +    nvmevf_state_mutex_unlock(nvmevf_dev);
>> }
>>
>> static const struct pci_device_id nvmevf_pci_table[] = {
>> @@ -83,12 +581,17 @@ static const struct pci_device_id nvmevf_pci_table[] = {
>>
>> MODULE_DEVICE_TABLE(pci, nvmevf_pci_table);
>>
>> +static const struct pci_error_handlers nvmevf_err_handlers = {
>> +    .reset_done = nvmevf_pci_aer_reset_done,
>> +    .error_detected = vfio_pci_core_aer_err_detected,
>> +};
>> +
>> static struct pci_driver nvmevf_pci_driver = {
>>     .name = KBUILD_MODNAME,
>>     .id_table = nvmevf_pci_table,
>>     .probe = nvmevf_pci_probe,
>>     .remove = nvmevf_pci_remove,
>> -    .err_handler = &vfio_pci_core_err_handlers,
>> +    .err_handler = &nvmevf_err_handlers,
>>     .driver_managed_dma = true,
>> };
>>
>> @@ -96,4 +599,4 @@ module_pci_driver(nvmevf_pci_driver);
>>
>> MODULE_LICENSE("GPL");
>> MODULE_AUTHOR("Lei Rao <lei.rao@intel.com>");
>> -MODULE_DESCRIPTION("NVMe VFIO PCI - Generic VFIO PCI driver for NVMe");
>> +MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live migration support for NVMe");
>> diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
>> new file mode 100644
>> index 000000000000..c8464554ef53
>> --- /dev/null
>> +++ b/drivers/vfio/pci/nvme/nvme.h
>> @@ -0,0 +1,111 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
>> + */
>> +
>> +#ifndef NVME_VFIO_PCI_H
>> +#define NVME_VFIO_PCI_H
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/vfio_pci_core.h>
>> +#include <linux/nvme.h>
>> +
>> +struct nvme_live_mig_query_size {
>> +    __u8    opcode;
>> +    __u8    flags;
>> +    __u16    command_id;
>> +    __u32    rsvd1[9];
>> +    __u16    vf_index;
>> +    __u16    rsvd2;
>> +    __u32    rsvd3[5];
>> +};
>> +
>> +struct nvme_live_mig_suspend {
>> +    __u8    opcode;
>> +    __u8    flags;
>> +    __u16    command_id;
>> +    __u32    rsvd1[9];
>> +    __u16    vf_index;
>> +    __u16    rsvd2;
>> +    __u32    rsvd3[5];
>> +};
>> +
>> +struct nvme_live_mig_resume {
>> +    __u8    opcode;
>> +    __u8    flags;
>> +    __u16   command_id;
>> +    __u32   rsvd1[9];
>> +    __u16   vf_index;
>> +    __u16   rsvd2;
>> +    __u32   rsvd3[5];
>> +};
>> +
>> +struct nvme_live_mig_save_data {
>> +    __u8    opcode;
>> +    __u8    flags;
>> +    __u16    command_id;
>> +    __u32    rsvd1[5];
>> +    __le64    prp1;
>> +    __le64    prp2;
>> +    __u16    vf_index;
>> +    __u16    rsvd2;
>> +    __u32    rsvd3[5];
>> +};
> 
> Just noticed that the save_data (similar to READ operation) doesn't have size/length member.
> How does it work for you ?

Sorry for late reply due to the CNY.
Before the save_data, it will send the Query command to get the data size, and then the driver
will allocate the data buffer used to receive the device state through DMA. That means both the
controller and Host driver know the device state's size.

> 
> BTW,
> 
> Are you doing iterative reads from the device to get the state ?

Yes, in theory, iterative reads are needed.

The size of the device state is related to the number of queue pairs.
For NVMe, the states of one queue pair that needs to be migrated should have only dozens of bytes.
As we know, NVMe can support a maximum number of 64K IO queues, but the actual hardware
can support far less than 64K queues. Therefore, without considering DMA dirty pages, the size
of the device state is small. Practically one read should be able to retrieve all the device states.

Actually, this is also related to maximum data transfer size (MDTS). If the device state size
exceeds MDTS, it must be read iteratively. If needed, we can add a command field to support iteration.

Lei.

> 
>> +
>> +struct nvme_live_mig_load_data {
>> +    __u8    opcode;
>> +    __u8    flags;
>> +    __u16   command_id;
>> +    __u32   rsvd1[5];
>> +    __le64  prp1;
>> +    __le64  prp2;
>> +    __u16   vf_index;
>> +    __u16    rsvd2;
>> +    __u32    size;
>> +    __u32   rsvd3[4];
>> +};
>> +
>> +enum nvme_live_mig_admin_opcode {
>> +    nvme_admin_live_mig_query_data_size    = 0xC4,
>> +    nvme_admin_live_mig_suspend        = 0xC8,
>> +    nvme_admin_live_mig_resume        = 0xCC,
>> +    nvme_admin_live_mig_save_data        = 0xD2,
>> +    nvme_admin_live_mig_load_data        = 0xD5,
>> +};
>> +
>> +struct nvme_live_mig_command {
>> +    union {
>> +        struct nvme_live_mig_query_size query;
>> +        struct nvme_live_mig_suspend    suspend;
>> +        struct nvme_live_mig_resume    resume;
>> +        struct nvme_live_mig_save_data    save;
>> +        struct nvme_live_mig_load_data    load;
>> +    };
>> +};
>> +
>> +struct nvmevf_migration_file {
>> +    struct file *filp;
>> +    struct mutex lock;
>> +    bool disabled;
>> +    u8 *vf_data;
>> +    size_t total_length;
>> +};
>> +
>> +struct nvmevf_pci_core_device {
>> +    struct vfio_pci_core_device core_device;
>> +    int vf_id;
>> +    u8 migrate_cap:1;
>> +    u8 deferred_reset:1;
>> +    /* protect migration state */
>> +    struct mutex state_mutex;
>> +    enum vfio_device_mig_state mig_state;
>> +    /* protect the reset_done flow */
>> +    spinlock_t reset_lock;
>> +    struct nvmevf_migration_file *resuming_migf;
>> +    struct nvmevf_migration_file *saving_migf;
>> +};
>> +
>> +extern int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
>> +            size_t *result, void *buffer, unsigned int bufflen);
>> +
>> +#endif /* NVME_VFIO_PCI_H */

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2023-02-09  9:09 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-06  5:58 [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices Lei Rao
2022-12-06  5:58 ` [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver Lei Rao
2022-12-06  6:19   ` Christoph Hellwig
2022-12-06 13:44     ` Jason Gunthorpe
2022-12-06 13:51       ` Keith Busch
2022-12-06 14:27         ` Jason Gunthorpe
2022-12-06 13:58       ` Christoph Hellwig
2022-12-06 15:22         ` Jason Gunthorpe
2022-12-06 15:38           ` Christoph Hellwig
2022-12-06 15:51             ` Jason Gunthorpe
2022-12-06 16:55               ` Christoph Hellwig
2022-12-06 19:15                 ` Jason Gunthorpe
2022-12-07  2:30                   ` Max Gurtovoy
2022-12-07  7:58                     ` Christoph Hellwig
2022-12-09  2:11                       ` Tian, Kevin
2022-12-12  7:41                         ` Christoph Hellwig
2022-12-07  7:54                   ` Christoph Hellwig
2022-12-07 10:59                     ` Max Gurtovoy
2022-12-07 13:46                       ` Christoph Hellwig
2022-12-07 14:50                         ` Max Gurtovoy
2022-12-07 16:35                           ` Christoph Hellwig
2022-12-07 13:34                     ` Jason Gunthorpe
2022-12-07 13:52                       ` Christoph Hellwig
2022-12-07 15:07                         ` Jason Gunthorpe
2022-12-07 16:38                           ` Christoph Hellwig
2022-12-07 17:31                             ` Jason Gunthorpe
2022-12-07 18:33                               ` Christoph Hellwig
2022-12-07 20:08                                 ` Jason Gunthorpe
2022-12-09  2:50                                   ` Tian, Kevin
2022-12-09 18:56                                     ` Dong, Eddie
2022-12-11 11:39                                   ` Max Gurtovoy
2022-12-12  7:55                                     ` Christoph Hellwig
2022-12-12 14:49                                       ` Max Gurtovoy
2022-12-12  7:50                                   ` Christoph Hellwig
2022-12-13 14:01                                     ` Jason Gunthorpe
2022-12-13 16:08                                       ` Christoph Hellwig
2022-12-13 17:49                                         ` Jason Gunthorpe
2022-12-06  5:58 ` [RFC PATCH 2/5] nvme-vfio: add new vfio-pci driver for NVMe device Lei Rao
2022-12-06  5:58 ` [RFC PATCH 3/5] nvme-vfio: enable the function of VFIO live migration Lei Rao
2023-01-19 10:21   ` Max Gurtovoy
2023-02-09  9:09     ` Rao, Lei
2022-12-06  5:58 ` [RFC PATCH 4/5] nvme-vfio: check if the hardware supports " Lei Rao
2022-12-06 13:47   ` Keith Busch
2022-12-06  5:58 ` [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device Lei Rao
2022-12-06  6:26   ` Christoph Hellwig
2022-12-06 13:05     ` Jason Gunthorpe
2022-12-06 13:09       ` Christoph Hellwig
2022-12-06 13:52         ` Jason Gunthorpe
2022-12-06 14:00           ` Christoph Hellwig
2022-12-06 14:20             ` Jason Gunthorpe
2022-12-06 14:31               ` Christoph Hellwig
2022-12-06 14:48                 ` Jason Gunthorpe
2022-12-06 15:01                   ` Christoph Hellwig
2022-12-06 15:28                     ` Jason Gunthorpe
2022-12-06 15:35                       ` Christoph Hellwig
2022-12-06 18:00                         ` Dong, Eddie
2022-12-12  7:57                           ` Christoph Hellwig
2022-12-11 12:05                     ` Max Gurtovoy
2022-12-11 13:21                       ` Rao, Lei
2022-12-11 14:51                         ` Max Gurtovoy
2022-12-12  1:20                           ` Rao, Lei
2022-12-12  8:09                           ` Christoph Hellwig
2022-12-09  2:05         ` Tian, Kevin
2022-12-09 16:53           ` Li, Yadong
2022-12-12  8:11             ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).