linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
@ 2021-07-02  9:58 Shameer Kolothum
  2021-07-02  9:58 ` [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices Shameer Kolothum
                   ` (5 more replies)
  0 siblings, 6 replies; 55+ messages in thread
From: Shameer Kolothum @ 2021-07-02  9:58 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-crypto
  Cc: alex.williamson, jgg, mgurtovoy, linuxarm, liulongfang,
	prime.zeng, yuzenghui, jonathan.cameron, wangzhou1

Hi,

This series attempts to add vfio live migration support for
HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
includes both the functional register space and migration 
control register space. As discussed in RFCv1[0], this may create
security issues as these regions get shared between the Guest
driver and the migration driver. Based on the feedback, we tried
to address those concerns in this version. 

This is now based on the new vfio-pci-core framework proposal[1].
Understand that the framework proposal is still under discussion,
but really appreciate any feedback on the approach taken here
to mitigate the security risks.

The major changes from v1 are,

 -Adds a new vendor-specific vfio_pci driver(hisi-acc-vfio-pci)
  for HiSilicon ACC VF devices based on the new vfio-pci-core
  framework proposal.

 -Since HiSilicon ACC VF device MMIO space contains both the
  functional register space and migration control register space,
  override the vfio_device_ops ioctl method to report only the
  functional space to VMs.

 -For a successful migration, we still need access to VF dev
  functional register space mainly to read the status registers.
  But accessing these while the Guest vCPUs are running may leave
  a security hole. To avoid any potential security issues, we
  map/unmap the MMIO regions on a need basis and is safe to do so.
  (Please see hisi_acc_vf_ioremap/unmap() fns in patch #4).
 
 -Dropped debugfs support for now.
 -Uses common QM functions for mailbox access(patch #3).

This is now sanity tested on HiSilicon platforms that support these
ACC devices.

From v1:
 -In order to ensure the compatibility of the devices before and
  after the migration, the device compatibility information check
  will be performed in the Pre-copy stage. If the check fails,
  an error will be returned and the source VM will exit the migration.
 -After the compatibility check is passed, it will enter the
  Stop-and-copy stage. At this time, all the live migration data
  will be copied and saved to the VF device of the destination.
  Finally, the VF device of the destination will be started.

Thanks,
Shameer
[0] https://lore.kernel.org/lkml/20210415220137.GA1672608@nvidia.com/
[1] https://lore.kernel.org/lkml/20210603160809.15845-1-mgurtovoy@nvidia.com/

Longfang Liu (2):
  crypto: hisilicon/qm - Export mailbox functions for common use
  hisi_acc_vfio_pci: Add support for vfio live migration

Shameer Kolothum (2):
  hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size

 drivers/crypto/hisilicon/qm.c        |    8 +-
 drivers/crypto/hisilicon/qm.h        |    4 +
 drivers/vfio/pci/Kconfig             |   13 +
 drivers/vfio/pci/Makefile            |    2 +
 drivers/vfio/pci/hisi_acc_vfio_pci.c | 1155 ++++++++++++++++++++++++++
 drivers/vfio/pci/hisi_acc_vfio_pci.h |  144 ++++
 6 files changed, 1323 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-02  9:58 [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Shameer Kolothum
@ 2021-07-02  9:58 ` Shameer Kolothum
  2021-07-02 20:29   ` Alex Williamson
  2021-07-04  7:03   ` Leon Romanovsky
  2021-07-02  9:58 ` [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size Shameer Kolothum
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 55+ messages in thread
From: Shameer Kolothum @ 2021-07-02  9:58 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-crypto
  Cc: alex.williamson, jgg, mgurtovoy, linuxarm, liulongfang,
	prime.zeng, yuzenghui, jonathan.cameron, wangzhou1

Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
This will be extended in follow-up patches to add support for
vfio live migration feature.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 drivers/vfio/pci/Kconfig             |   9 +++
 drivers/vfio/pci/Makefile            |   2 +
 drivers/vfio/pci/hisi_acc_vfio_pci.c | 100 +++++++++++++++++++++++++++
 3 files changed, 111 insertions(+)
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 9cdef46dd299..709807c28153 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -57,3 +57,12 @@ config MLX5_VFIO_PCI
 	  framework.
 
 	  If you don't know what to do here, say N.
+
+config HISI_ACC_VFIO_PCI
+	tristate "VFIO support for HiSilicon ACC devices"
+	depends on ARM64 && VFIO_PCI_CORE
+	help
+	  This provides generic PCI support for HiSilicon devices using the VFIO
+	  framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index a0df9c2a4bd9..d1de3e81921f 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -3,6 +3,7 @@
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
+obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisi-acc-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
@@ -11,3 +12,4 @@ vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 
 mlx5-vfio-pci-y := mlx5_vfio_pci.o
+hisi-acc-vfio-pci-y := hisi_acc_vfio_pci.o
diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisi_acc_vfio_pci.c
new file mode 100644
index 000000000000..a9e173098ab5
--- /dev/null
+++ b/drivers/vfio/pci/hisi_acc_vfio_pci.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021, HiSilicon Ltd.
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/vfio.h>
+#include <linux/vfio_pci_core.h>
+
+static int hisi_acc_vfio_pci_open(struct vfio_device *core_vdev)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	lockdep_assert_held(&core_vdev->reflck->lock);
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	vfio_pci_core_finish_enable(vdev);
+
+	return 0;
+}
+
+static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
+	.name		= "hisi-acc-vfio-pci",
+	.open		= hisi_acc_vfio_pci_open,
+	.release	= vfio_pci_core_release,
+	.ioctl		= vfio_pci_core_ioctl,
+	.read		= vfio_pci_core_read,
+	.write		= vfio_pci_core_write,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+	.reflck_attach	= vfio_pci_core_reflck_attach,
+};
+
+static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_core_device *vdev;
+	int ret;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	ret = vfio_pci_core_register_device(vdev, pdev, &hisi_acc_vfio_pci_ops);
+	if (ret)
+		goto out_free;
+
+	dev_set_drvdata(&pdev->dev, vdev);
+
+	return 0;
+
+out_free:
+	kfree(vdev);
+	return ret;
+}
+
+static void hisi_acc_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(vdev);
+	kfree(vdev);
+}
+
+static const struct pci_device_id hisi_acc_vfio_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa256) }, /* SEC VF */
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa259) }, /* HPRE VF */
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa251) }, /* ZIP VF */
+	{ 0, }
+};
+
+MODULE_DEVICE_TABLE(pci, hisi_acc_vfio_pci_table);
+
+static struct pci_driver hisi_acc_vfio_pci_driver = {
+	.name			= "hisi-acc-vfio-pci",
+	.id_table		= hisi_acc_vfio_pci_table,
+	.probe			= hisi_acc_vfio_pci_probe,
+	.remove			= hisi_acc_vfio_pci_remove,
+#ifdef CONFIG_PCI_IOV
+	.sriov_configure	= vfio_pci_core_sriov_configure,
+#endif
+	.err_handler		= &vfio_pci_core_err_handlers,
+};
+
+module_pci_driver(hisi_acc_vfio_pci_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Liu Longfang <liulongfang@huawei.com>");
+MODULE_AUTHOR("Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>");
+MODULE_DESCRIPTION("HiSilicon VFIO PCI - Generic VFIO PCI driver for HiSilicon ACC device family");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size
  2021-07-02  9:58 [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Shameer Kolothum
  2021-07-02  9:58 ` [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices Shameer Kolothum
@ 2021-07-02  9:58 ` Shameer Kolothum
  2021-07-02 20:29   ` Alex Williamson
  2021-07-02  9:58 ` [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for common use Shameer Kolothum
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Shameer Kolothum @ 2021-07-02  9:58 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-crypto
  Cc: alex.williamson, jgg, mgurtovoy, linuxarm, liulongfang,
	prime.zeng, yuzenghui, jonathan.cameron, wangzhou1

HiSilicon ACC VF device BAR2 region consists of both functional register
space and migration control register space. From a security point of
view, it's not advisable to export the migration control region to Guest.

Hence, hide the migration region and report only the functional register
space.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 drivers/vfio/pci/hisi_acc_vfio_pci.c | 42 +++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisi_acc_vfio_pci.c
index a9e173098ab5..d57cf6d7adaf 100644
--- a/drivers/vfio/pci/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisi_acc_vfio_pci.c
@@ -12,6 +12,46 @@
 #include <linux/vfio.h>
 #include <linux/vfio_pci_core.h>
 
+static long hisi_acc_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+				    unsigned long arg)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+
+	if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct pci_dev *pdev = vdev->pdev;
+		struct vfio_region_info info;
+		unsigned long minsz;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_BAR2_REGION_INDEX) {
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+
+			/*
+			 * ACC VF dev BAR2 region(64K) consists of both functional
+			 * register space and migration control register space.
+			 * Report only the first 32K(functional region) to Guest.
+			 */
+			info.size = pci_resource_len(pdev, info.index) / 2;
+
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+					VFIO_REGION_INFO_FLAG_WRITE |
+					VFIO_REGION_INFO_FLAG_MMAP;
+
+			return copy_to_user((void __user *)arg, &info, minsz) ?
+					    -EFAULT : 0;
+		}
+	}
+	return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+}
+
 static int hisi_acc_vfio_pci_open(struct vfio_device *core_vdev)
 {
 	struct vfio_pci_core_device *vdev =
@@ -33,7 +73,7 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
 	.name		= "hisi-acc-vfio-pci",
 	.open		= hisi_acc_vfio_pci_open,
 	.release	= vfio_pci_core_release,
-	.ioctl		= vfio_pci_core_ioctl,
+	.ioctl		= hisi_acc_vfio_pci_ioctl,
 	.read		= vfio_pci_core_read,
 	.write		= vfio_pci_core_write,
 	.mmap		= vfio_pci_core_mmap,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for common use
  2021-07-02  9:58 [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Shameer Kolothum
  2021-07-02  9:58 ` [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices Shameer Kolothum
  2021-07-02  9:58 ` [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size Shameer Kolothum
@ 2021-07-02  9:58 ` Shameer Kolothum
  2021-07-04  9:34   ` Max Gurtovoy
  2021-07-02  9:58 ` [RFC v2 4/4] hisi_acc_vfio_pci: Add support for vfio live migration Shameer Kolothum
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Shameer Kolothum @ 2021-07-02  9:58 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-crypto
  Cc: alex.williamson, jgg, mgurtovoy, linuxarm, liulongfang,
	prime.zeng, yuzenghui, jonathan.cameron, wangzhou1

From: Longfang Liu <liulongfang@huawei.com>

Export QM mailbox functions so that they can be used from HiSilicon
ACC vfio live migration driver in follow-up patch.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 drivers/crypto/hisilicon/qm.c | 8 +++++---
 drivers/crypto/hisilicon/qm.h | 4 ++++
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
index ce439a0c66c9..87fc0199705e 100644
--- a/drivers/crypto/hisilicon/qm.c
+++ b/drivers/crypto/hisilicon/qm.c
@@ -492,7 +492,7 @@ static bool qm_qp_avail_state(struct hisi_qm *qm, struct hisi_qp *qp,
 }
 
 /* return 0 mailbox ready, -ETIMEDOUT hardware timeout */
-static int qm_wait_mb_ready(struct hisi_qm *qm)
+int qm_wait_mb_ready(struct hisi_qm *qm)
 {
 	u32 val;
 
@@ -500,6 +500,7 @@ static int qm_wait_mb_ready(struct hisi_qm *qm)
 					  val, !((val >> QM_MB_BUSY_SHIFT) &
 					  0x1), POLL_PERIOD, POLL_TIMEOUT);
 }
+EXPORT_SYMBOL_GPL(qm_wait_mb_ready);
 
 /* 128 bit should be written to hardware at one time to trigger a mailbox */
 static void qm_mb_write(struct hisi_qm *qm, const void *src)
@@ -523,8 +524,8 @@ static void qm_mb_write(struct hisi_qm *qm, const void *src)
 		     : "memory");
 }
 
-static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
-		 bool op)
+int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
+	  bool op)
 {
 	struct qm_mailbox mailbox;
 	int ret = 0;
@@ -563,6 +564,7 @@ static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
 		atomic64_inc(&qm->debug.dfx.mb_err_cnt);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(qm_mb);
 
 static void qm_db_v1(struct hisi_qm *qm, u16 qn, u8 cmd, u16 index, u8 priority)
 {
diff --git a/drivers/crypto/hisilicon/qm.h b/drivers/crypto/hisilicon/qm.h
index acefdf8b3a50..18b010d5452d 100644
--- a/drivers/crypto/hisilicon/qm.h
+++ b/drivers/crypto/hisilicon/qm.h
@@ -396,6 +396,10 @@ pci_ers_result_t hisi_qm_dev_slot_reset(struct pci_dev *pdev);
 void hisi_qm_reset_prepare(struct pci_dev *pdev);
 void hisi_qm_reset_done(struct pci_dev *pdev);
 
+int qm_wait_mb_ready(struct hisi_qm *qm);
+int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
+	  bool op);
+
 struct hisi_acc_sgl_pool;
 struct hisi_acc_hw_sgl *hisi_acc_sg_buf_map_to_hw_sgl(struct device *dev,
 	struct scatterlist *sgl, struct hisi_acc_sgl_pool *pool,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC v2 4/4] hisi_acc_vfio_pci: Add support for vfio live migration
  2021-07-02  9:58 [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Shameer Kolothum
                   ` (2 preceding siblings ...)
  2021-07-02  9:58 ` [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for common use Shameer Kolothum
@ 2021-07-02  9:58 ` Shameer Kolothum
  2022-02-02 13:14 ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Jason Gunthorpe
  2022-02-18 16:37 ` Jason Gunthorpe
  5 siblings, 0 replies; 55+ messages in thread
From: Shameer Kolothum @ 2021-07-02  9:58 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-crypto
  Cc: alex.williamson, jgg, mgurtovoy, linuxarm, liulongfang,
	prime.zeng, yuzenghui, jonathan.cameron, wangzhou1

From: Longfang Liu <liulongfang@huawei.com>

VMs assigned with HiSilicon ACC VF devices can now perform
live migration if the VF devices are bind to the hisi-acc-vfio-pci
driver.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 drivers/vfio/pci/Kconfig             |   10 +-
 drivers/vfio/pci/hisi_acc_vfio_pci.c | 1023 +++++++++++++++++++++++++-
 drivers/vfio/pci/hisi_acc_vfio_pci.h |  144 ++++
 3 files changed, 1170 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 709807c28153..25dec1bb7741 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,10 +59,14 @@ config MLX5_VFIO_PCI
 	  If you don't know what to do here, say N.
 
 config HISI_ACC_VFIO_PCI
-	tristate "VFIO support for HiSilicon ACC devices"
+	tristate "VFIO live migration support for HiSilicon ACC devices"
 	depends on ARM64 && VFIO_PCI_CORE
+	select CRYPTO_DEV_HISI_QM
+	depends on PCI && PCI_MSI
+	depends on UACCE || UACCE=n
+	depends on ACPI
 	help
-	  This provides generic PCI support for HiSilicon devices using the VFIO
-	  framework.
+	  This provides live migration support for HiSilicon ACC devices
+	  using the VFIO framework.
 
 	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisi_acc_vfio_pci.c
index d57cf6d7adaf..f7906a151def 100644
--- a/drivers/vfio/pci/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisi_acc_vfio_pci.c
@@ -12,6 +12,1015 @@
 #include <linux/vfio.h>
 #include <linux/vfio_pci_core.h>
 
+#include "hisi_acc_vfio_pci.h"
+
+/* return 0 VM acc device ready, -ETIMEDOUT hardware timeout */
+static int qm_wait_dev_ready(struct hisi_qm *qm)
+{
+	u32 val;
+
+	return readl_relaxed_poll_timeout(qm->io_base + QM_VF_STATE,
+				val, !(val & 0x1), MB_POLL_PERIOD_US,
+				MB_POLL_TIMEOUT_US);
+}
+
+/*
+ * Each state Reg is checked 100 times,
+ * with a delay of 100 microseconds after each check
+ */
+static u32 acc_check_reg_state(struct hisi_qm *qm, u32 regs)
+{
+	int check_times = 0;
+	u32 state;
+
+	state = readl(qm->io_base + regs);
+	while (state && check_times < ERROR_CHECK_TIMEOUT) {
+		udelay(CHECK_DELAY_TIME);
+		state = readl(qm->io_base + regs);
+		check_times++;
+	}
+
+	return state;
+}
+
+/* Check the PF's RAS state and Function INT state */
+static int qm_check_int_state(struct acc_vf_migration *acc_vf_dev)
+{
+	struct hisi_qm *vfqm = acc_vf_dev->vf_qm;
+	struct hisi_qm *qm = acc_vf_dev->pf_qm;
+	struct pci_dev *vf_pdev = acc_vf_dev->vf_dev;
+	struct device *dev = &qm->pdev->dev;
+	u32 state;
+
+	/* Check RAS state */
+	state = acc_check_reg_state(qm, QM_ABNORMAL_INT_STATUS);
+	if (state) {
+		dev_err(dev, "failed to check QM RAS state!\n");
+		return -EBUSY;
+	}
+
+	/* Check Function Communication state between PF and VF */
+	state = acc_check_reg_state(vfqm, QM_IFC_INT_STATUS);
+	if (state) {
+		dev_err(dev, "failed to check QM IFC INT state!\n");
+		return -EBUSY;
+	}
+	state = acc_check_reg_state(vfqm, QM_IFC_INT_SET_V);
+	if (state) {
+		dev_err(dev, "failed to check QM IFC INT SET state!\n");
+		return -EBUSY;
+	}
+
+	/* Check submodule task state */
+	switch (vf_pdev->device) {
+	case HISI_SEC_VF_DEV_ID:
+		state = acc_check_reg_state(qm, SEC_CORE_INT_STATUS);
+		if (state) {
+			dev_err(dev, "failed to check QM SEC Core INT state!\n");
+			return -EBUSY;
+		}
+		return 0;
+	case HISI_HPRE_VF_DEV_ID:
+		state = acc_check_reg_state(qm, HPRE_HAC_INT_STATUS);
+		if (state) {
+			dev_err(dev, "failed to check QM HPRE HAC INT state!\n");
+			return -EBUSY;
+		}
+		return 0;
+	case HISI_ZIP_VF_DEV_ID:
+		state = acc_check_reg_state(qm, HZIP_CORE_INT_STATUS);
+		if (state) {
+			dev_err(dev, "failed to check QM ZIP Core INT state!\n");
+			return -EBUSY;
+		}
+		return 0;
+	default:
+		dev_err(dev, "failed to detect acc module type!\n");
+		return -EINVAL;
+	}
+}
+
+static int qm_read_reg(struct hisi_qm *qm, u32 reg_addr,
+		       u32 *data, u8 nums)
+{
+	int i;
+
+	if (nums < 1 || nums > QM_REGS_MAX_LEN)
+		return -EINVAL;
+
+	for (i = 0; i < nums; i++) {
+		data[i] = readl(qm->io_base + reg_addr);
+		reg_addr += QM_REG_ADDR_OFFSET;
+	}
+
+	return 0;
+}
+
+static int qm_write_reg(struct hisi_qm *qm, u32 reg,
+			u32 *data, u8 nums)
+{
+	int i;
+
+	if (nums < 1 || nums > QM_REGS_MAX_LEN)
+		return -EINVAL;
+
+	for (i = 0; i < nums; i++)
+		writel(data[i], qm->io_base + reg + i * QM_REG_ADDR_OFFSET);
+
+	return 0;
+}
+
+static int qm_get_vft(struct hisi_qm *qm, u32 *base)
+{
+	u64 sqc_vft;
+	u32 qp_num;
+	int ret;
+
+	ret = qm_mb(qm, QM_MB_CMD_SQC_VFT_V2, 0, 0, 1);
+	if (ret)
+		return ret;
+
+	sqc_vft = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
+		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) <<
+		  QM_XQC_ADDR_OFFSET);
+	*base = QM_SQC_VFT_BASE_MASK_V2 & (sqc_vft >> QM_SQC_VFT_BASE_SHIFT_V2);
+	qp_num = (QM_SQC_VFT_NUM_MASK_V2 &
+		  (sqc_vft >> QM_SQC_VFT_NUM_SHIFT_V2)) + 1;
+
+	return qp_num;
+}
+
+static int qm_get_sqc(struct hisi_qm *qm, u64 *addr)
+{
+	int ret;
+
+	ret = qm_mb(qm, QM_MB_CMD_SQC_BT, 0, 0, 1);
+	if (ret)
+		return ret;
+
+	*addr = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
+		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) <<
+		  QM_XQC_ADDR_OFFSET);
+
+	return 0;
+}
+
+static int qm_get_cqc(struct hisi_qm *qm, u64 *addr)
+{
+	int ret;
+
+	ret = qm_mb(qm, QM_MB_CMD_CQC_BT, 0, 0, 1);
+	if (ret)
+		return ret;
+
+	*addr = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
+		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) <<
+		  QM_XQC_ADDR_OFFSET);
+
+	return 0;
+}
+
+static int qm_rw_regs_read(struct hisi_qm *qm, struct acc_vf_data *vf_data)
+{
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	ret = qm_read_reg(qm, QM_VF_AEQ_INT_MASK, &vf_data->aeq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_VF_AEQ_INT_MASK\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_VF_EQ_INT_MASK, &vf_data->eq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_VF_EQ_INT_MASK\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_IFC_INT_SOURCE_V,
+			  &vf_data->ifc_int_source, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_IFC_INT_SOURCE_V\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_IFC_INT_MASK, &vf_data->ifc_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_IFC_INT_MASK\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_IFC_INT_SET_V, &vf_data->ifc_int_set, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_IFC_INT_SET_V\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_QUE_ISO_CFG_V, &vf_data->que_iso_cfg, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_QUE_ISO_CFG_V\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_PAGE_SIZE, &vf_data->page_size, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_PAGE_SIZE\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_VF_STATE, &vf_data->vf_state, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_VF_STATE\n");
+		return ret;
+	}
+
+	/* QM_EQC_DW has 7 regs */
+	ret = qm_read_reg(qm, QM_EQC_DW0, vf_data->qm_eqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to read QM_EQC_DW\n");
+		return ret;
+	}
+
+	/* QM_AEQC_DW has 7 regs */
+	ret = qm_read_reg(qm, QM_AEQC_DW0, vf_data->qm_aeqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to read QM_AEQC_DW\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static int qm_rw_regs_write(struct hisi_qm *qm, struct acc_vf_data *vf_data)
+{
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	/* check VF state */
+	if (unlikely(qm_wait_mb_ready(qm))) {
+		dev_err(&qm->pdev->dev, "QM device is not ready to write\n");
+		return -EBUSY;
+	}
+
+	ret = qm_write_reg(qm, QM_VF_AEQ_INT_MASK, &vf_data->aeq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_VF_AEQ_INT_MASK\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_VF_EQ_INT_MASK, &vf_data->eq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_VF_EQ_INT_MASK\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_IFC_INT_SOURCE_V,
+			   &vf_data->ifc_int_source, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_IFC_INT_SOURCE_V\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_IFC_INT_MASK, &vf_data->ifc_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_IFC_INT_MASK\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_IFC_INT_SET_V, &vf_data->ifc_int_set, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_IFC_INT_SET_V\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_QUE_ISO_CFG_V, &vf_data->que_iso_cfg, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_QUE_ISO_CFG_V\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_PAGE_SIZE, &vf_data->page_size, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_PAGE_SIZE\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_VF_STATE, &vf_data->vf_state, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_VF_STATE\n");
+		return ret;
+	}
+
+	/* QM_EQC_DW has 7 regs */
+	ret = qm_write_reg(qm, QM_EQC_DW0, vf_data->qm_eqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to write QM_EQC_DW\n");
+		return ret;
+	}
+
+	/* QM_AEQC_DW has 7 regs */
+	ret = qm_write_reg(qm, QM_AEQC_DW0, vf_data->qm_aeqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to write QM_AEQC_DW\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static void qm_db(struct hisi_qm *qm, u16 qn, u8 cmd,
+		  u16 index, u8 priority)
+{
+	u64 doorbell;
+	u64 dbase;
+	u16 randata = 0;
+
+	if (cmd == QM_DOORBELL_CMD_SQ || cmd == QM_DOORBELL_CMD_CQ)
+		dbase = QM_DOORBELL_SQ_CQ_BASE_V2;
+	else
+		dbase = QM_DOORBELL_EQ_AEQ_BASE_V2;
+
+	doorbell = qn | ((u64)cmd << QM_DB_CMD_SHIFT_V2) |
+		   ((u64)randata << QM_DB_RAND_SHIFT_V2) |
+		   ((u64)index << QM_DB_INDEX_SHIFT_V2)	 |
+		   ((u64)priority << QM_DB_PRIORITY_SHIFT_V2);
+
+	writeq(doorbell, qm->io_base + dbase);
+}
+
+static int vf_migration_data_store(struct hisi_qm *qm,
+				   struct acc_vf_migration *acc_vf_dev)
+{
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	ret = qm_rw_regs_read(qm, vf_data);
+	if (ret)
+		return -EINVAL;
+
+	/* Every reg is 32 bit, the dma address is 64 bit. */
+	vf_data->eqe_dma = vf_data->qm_eqc_dw[2];
+	vf_data->eqe_dma <<= QM_XQC_ADDR_OFFSET;
+	vf_data->eqe_dma |= vf_data->qm_eqc_dw[1];
+	vf_data->aeqe_dma = vf_data->qm_aeqc_dw[2];
+	vf_data->aeqe_dma <<= QM_XQC_ADDR_OFFSET;
+	vf_data->aeqe_dma |= vf_data->qm_aeqc_dw[1];
+
+	/* Through SQC_BT/CQC_BT to get sqc and cqc address */
+	ret = qm_get_sqc(qm, &vf_data->sqc_dma);
+	if (ret) {
+		dev_err(dev, "failed to read SQC addr!\n");
+		return -EINVAL;
+	}
+
+	ret = qm_get_cqc(qm, &vf_data->cqc_dma);
+	if (ret) {
+		dev_err(dev, "failed to read CQC addr!\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void qm_dev_cmd_init(struct hisi_qm *qm)
+{
+	/* Clear VF communication status registers. */
+	writel(0x1, qm->io_base + QM_IFC_INT_SOURCE_V);
+
+	/* Enable pf and vf communication. */
+	writel(0x0, qm->io_base + QM_IFC_INT_MASK);
+}
+
+static void vf_qm_fun_restart(struct hisi_qm *qm,
+			      struct acc_vf_migration *acc_vf_dev)
+{
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct device *dev = &qm->pdev->dev;
+	int i;
+
+	/* Check if we need to restart VF */
+	if (vf_data->vf_state == VF_PREPARE) {
+		dev_info(dev, "No need to restart VF\n");
+		return;
+	}
+
+	for (i = 0; i < qm->qp_num; i++)
+		qm_db(qm, i, QM_DOORBELL_CMD_SQ, 0, 1);
+}
+
+static int vf_migration_data_recover(struct hisi_qm *qm,
+				     struct acc_vf_migration *acc_vf_dev)
+{
+	struct device *dev = &qm->pdev->dev;
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	int ret;
+
+	if (!mig_ctl->data_size)
+		return 0;
+
+	qm->eqe_dma = vf_data->eqe_dma;
+	qm->aeqe_dma = vf_data->aeqe_dma;
+	qm->sqc_dma = vf_data->sqc_dma;
+	qm->cqc_dma = vf_data->cqc_dma;
+
+	qm->qp_base = vf_data->qp_base;
+	qm->qp_num = vf_data->qp_num;
+
+	ret = qm_rw_regs_write(qm, vf_data);
+	if (ret) {
+		dev_err(dev, "Set VF regs failed\n");
+		return ret;
+	}
+
+	ret = qm_mb(qm, QM_MB_CMD_SQC_BT, qm->sqc_dma, 0, 0);
+	if (ret) {
+		dev_err(dev, "Set sqc failed\n");
+		return ret;
+	}
+
+	ret = qm_mb(qm, QM_MB_CMD_CQC_BT, qm->cqc_dma, 0, 0);
+	if (ret) {
+		dev_err(dev, "Set cqc failed\n");
+		return ret;
+	}
+
+	qm_dev_cmd_init(qm);
+
+	return 0;
+}
+
+static int vf_qm_cache_wb(struct hisi_qm *qm)
+{
+	unsigned int val;
+
+	writel(0x1, qm->io_base + QM_CACHE_WB_START);
+	if (readl_relaxed_poll_timeout(qm->io_base + QM_CACHE_WB_DONE,
+				       val, val & BIT(0), MB_POLL_PERIOD_US,
+				       MB_POLL_TIMEOUT_US)) {
+		dev_err(&qm->pdev->dev, "vf QM writeback sqc cache fail\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int vf_qm_func_stop(struct hisi_qm *qm)
+{
+	return qm_mb(qm, QM_MB_CMD_PAUSE_QM, 0, 0, 0);
+}
+
+static int pf_qm_get_qp_num(struct hisi_qm *qm, int vf_id, u32 *rbase)
+{
+	unsigned int val;
+	u64 sqc_vft;
+	u32 qp_num;
+	int ret;
+
+	ret = readl_relaxed_poll_timeout(qm->io_base + QM_VFT_CFG_RDY, val,
+					 val & BIT(0), MB_POLL_PERIOD_US,
+					 MB_POLL_TIMEOUT_US);
+	if (ret)
+		return ret;
+
+	writel(0x1, qm->io_base + QM_VFT_CFG_OP_WR);
+	/* 0 mean SQC VFT */
+	writel(0x0, qm->io_base + QM_VFT_CFG_TYPE);
+	writel(vf_id, qm->io_base + QM_VFT_CFG);
+
+	writel(0x0, qm->io_base + QM_VFT_CFG_RDY);
+	writel(0x1, qm->io_base + QM_VFT_CFG_OP_ENABLE);
+
+	ret = readl_relaxed_poll_timeout(qm->io_base + QM_VFT_CFG_RDY, val,
+					 val & BIT(0), MB_POLL_PERIOD_US,
+					 MB_POLL_TIMEOUT_US);
+	if (ret)
+		return ret;
+
+	sqc_vft = readl(qm->io_base + QM_VFT_CFG_DATA_L) |
+		  ((u64)readl(qm->io_base + QM_VFT_CFG_DATA_H) <<
+		  QM_XQC_ADDR_OFFSET);
+	*rbase = QM_SQC_VFT_BASE_MASK_V2 &
+		  (sqc_vft >> QM_SQC_VFT_BASE_SHIFT_V2);
+	qp_num = (QM_SQC_VFT_NUM_MASK_V2 &
+		  (sqc_vft >> QM_SQC_VFT_NUM_SHIFT_V2)) + 1;
+
+	return qp_num;
+}
+
+static int pf_qm_state_pre_save(struct hisi_qm *qm,
+				struct acc_vf_migration *acc_vf_dev)
+{
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	struct device *dev = &qm->pdev->dev;
+	int vf_id = acc_vf_dev->vf_id;
+	int ret;
+
+	/* save device id */
+	vf_data->dev_id = acc_vf_dev->vf_dev->device;
+
+	/* vf qp num save from PF */
+	ret = pf_qm_get_qp_num(qm, vf_id, &qm->qp_base);
+	if (ret <= 0) {
+		dev_err(dev, "failed to get vft qp nums!\n");
+		return -EINVAL;
+	}
+	qm->qp_num = ret;
+	vf_data->qp_base = qm->qp_base;
+	vf_data->qp_num = qm->qp_num;
+
+	/* vf isolation state save from PF */
+	ret = qm_read_reg(qm, QM_QUE_ISO_CFG, &vf_data->que_iso_cfg, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_QUE_ISO_CFG!\n");
+		return ret;
+	}
+
+	mig_ctl->data_size = QM_MATCH_SIZE;
+	mig_ctl->pending_bytes = mig_ctl->data_size;
+
+	return 0;
+}
+
+static int vf_qm_state_save(struct hisi_qm *qm,
+			    struct acc_vf_migration *acc_vf_dev)
+{
+	struct device *dev = &acc_vf_dev->vf_dev->dev;
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	int ret;
+
+	mig_ctl->data_size = 0;
+	mig_ctl->pending_bytes = 0;
+
+	if (unlikely(qm_wait_dev_ready(qm))) {
+		dev_info(&qm->pdev->dev, "QM device not ready, no data to transfer\n");
+		return 0;
+	}
+
+	/* First stop the ACC vf function */
+	ret = vf_qm_func_stop(qm);
+	if (ret) {
+		dev_err(dev, "failed to stop QM VF function!\n");
+		return ret;
+	}
+
+	ret = qm_check_int_state(acc_vf_dev);
+	if (ret) {
+		dev_err(dev, "failed to check QM INT state!\n");
+		goto state_error;
+	}
+
+	ret = vf_qm_cache_wb(qm);
+	if (ret) {
+		dev_err(dev, "failed to writeback QM Cache!\n");
+		goto state_error;
+	}
+
+	ret = vf_migration_data_store(qm, acc_vf_dev);
+	if (ret) {
+		dev_err(dev, "failed to get and store migration data!\n");
+		goto state_error;
+	}
+
+	mig_ctl->data_size = sizeof(struct acc_vf_data);
+	mig_ctl->pending_bytes = mig_ctl->data_size;
+
+	return 0;
+
+state_error:
+	vf_qm_fun_restart(qm, acc_vf_dev);
+	return ret;
+}
+
+static int vf_qm_state_resume(struct hisi_qm *qm,
+			      struct acc_vf_migration *acc_vf_dev)
+{
+	struct device *dev = &acc_vf_dev->vf_dev->dev;
+	int ret;
+
+	/* recover data to VF */
+	ret = vf_migration_data_recover(qm, acc_vf_dev);
+	if (ret) {
+		dev_err(dev, "failed to recover the VF!\n");
+		return ret;
+	}
+
+	/* restart all destination VF's QP */
+	vf_qm_fun_restart(qm, acc_vf_dev);
+
+	return 0;
+}
+
+/*
+ * HiSilicon ACC VF dev MMIO space contains both the functional register
+ * space and the migration control register space. We hide the migration
+ * control space from the Guest. But to successfully complete the live
+ * migration, we still need access to the functional MMIO space assigned
+ * to the Guest. To avoid any potential security issues, we need to be
+ * careful not to access this region while the Guest vCPUs are running.
+ *
+ * Hence check the device state before we map the region.
+ */
+static int hisi_acc_vf_ioremap(struct acc_vf_migration *acc_vf_dev, u32 state)
+{
+	struct hisi_qm *vfqm = acc_vf_dev->vf_qm;
+	struct pci_dev *pdev = acc_vf_dev->vf_dev;
+
+	if (state == (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING))
+		return -EINVAL;
+
+	if (vfqm->io_base)
+		return 0;
+
+	vfqm->io_base = ioremap(vfqm->phys_base,
+				pci_resource_len(pdev, VFIO_PCI_BAR2_REGION_INDEX));
+	if (!vfqm->io_base)
+		return -EIO;
+
+	return 0;
+}
+
+static void hisi_acc_vf_iounmap(struct acc_vf_migration *acc_vf_dev)
+{
+	struct hisi_qm *vfqm = acc_vf_dev->vf_qm;
+
+	if (!vfqm->io_base)
+		return;
+
+	iounmap(vfqm->io_base);
+	vfqm->io_base = NULL;
+}
+
+static int hisi_acc_vf_set_device_state(struct acc_vf_migration *acc_vf_dev,
+					u32 state)
+{
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	struct hisi_qm *pfqm = acc_vf_dev->pf_qm;
+	struct hisi_qm *vfqm = acc_vf_dev->vf_qm;
+	int ret = 0;
+
+	if (state == mig_ctl->device_state)
+		return 0;
+
+	switch (state) {
+	case VFIO_DEVICE_STATE_RUNNING:
+		if (mig_ctl->device_state == VFIO_DEVICE_STATE_RESUMING) {
+			ret = hisi_acc_vf_ioremap(acc_vf_dev, state);
+			if (ret)
+				return ret;
+
+			ret = vf_qm_state_resume(vfqm, acc_vf_dev);
+			if (ret)
+				goto out;
+		}
+		break;
+	case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+		ret = pf_qm_state_pre_save(pfqm, acc_vf_dev);
+		if (ret)
+			return ret;
+
+		break;
+	case VFIO_DEVICE_STATE_SAVING:
+		ret = hisi_acc_vf_ioremap(acc_vf_dev, state);
+		if (ret)
+			return ret;
+
+		ret = vf_qm_state_save(vfqm, acc_vf_dev);
+		if (ret)
+			goto out;
+
+		break;
+	case VFIO_DEVICE_STATE_STOP:
+		ret = hisi_acc_vf_ioremap(acc_vf_dev, state);
+		if (ret)
+			return ret;
+
+		vf_qm_fun_restart(vfqm, acc_vf_dev);
+		break;
+	case VFIO_DEVICE_STATE_RESUMING:
+		break;
+	default:
+		return -EFAULT;
+	}
+
+	mig_ctl->device_state = state;
+
+out:
+	hisi_acc_vf_iounmap(acc_vf_dev);
+	return ret;
+}
+
+static int hisi_acc_vf_match_check(struct acc_vf_migration *acc_vf_dev)
+{
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct hisi_qm *qm = acc_vf_dev->vf_qm;
+	struct device *dev = &qm->pdev->dev;
+	u32 que_iso_state;
+	int ret;
+
+	/*
+	 * Check we are in the correct dev state and have enough data to
+	 * perform the check.
+	 */
+	if (mig_ctl->device_state != VFIO_DEVICE_STATE_RESUMING ||
+	    mig_ctl->data_size != QM_MATCH_SIZE)
+		return 0;
+
+	/* vf acc dev type check */
+	if (vf_data->dev_id != acc_vf_dev->vf_dev->device) {
+		dev_err(dev, "failed to match VF devices\n");
+		return -EINVAL;
+	}
+
+	ret = hisi_acc_vf_ioremap(acc_vf_dev, mig_ctl->device_state);
+	if (ret)
+		return ret;
+
+	/* vf qp num check */
+	ret = qm_get_vft(qm, &qm->qp_base);
+	if (ret <= 0) {
+		dev_err(dev, "failed to get vft qp nums\n");
+		ret = -EINVAL;
+		goto out;
+	}
+	qm->qp_num = ret;
+
+	if (vf_data->qp_num != qm->qp_num) {
+		dev_err(dev, "failed to match VF qp num\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* vf isolation state check */
+	ret = qm_read_reg(qm, QM_QUE_ISO_CFG_V, &que_iso_state, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_QUE_ISO_CFG_V\n");
+		goto out;
+	}
+	if (vf_data->que_iso_cfg != que_iso_state) {
+		dev_err(dev, "failed to match isolation state\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* clear the VF match data size */
+	mig_ctl->pending_bytes = 0;
+	mig_ctl->data_size = 0;
+
+out:
+	hisi_acc_vf_iounmap(acc_vf_dev);
+	return ret;
+}
+
+static int hisi_acc_vf_data_transfer(struct acc_vf_migration *acc_vf_dev,
+				     char __user *buf, size_t count, u64 pos,
+				     bool iswrite)
+{
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	void *data_addr = acc_vf_dev->vf_data;
+	int ret = 0;
+
+	if (!count || pos < mig_ctl->data_offset || pos > MIGRATION_REGION_SZ)
+		return -EINVAL;
+
+	data_addr += pos - mig_ctl->data_offset;
+	if (iswrite)  {
+		ret = copy_from_user(data_addr, buf, count);
+		if (ret)
+			return -EFAULT;
+
+		mig_ctl->pending_bytes += count;
+	} else {
+		ret = copy_to_user(buf, data_addr, count);
+		if (ret)
+			return -EFAULT;
+
+		mig_ctl->pending_bytes -= count;
+	}
+
+	return count;
+}
+
+static size_t hisi_acc_vf_migrn_rw(struct vfio_pci_core_device *vdev,
+				   char __user *buf, size_t count, loff_t *ppos,
+				   bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos) -
+				VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_region	*region = &vdev->region[index];
+	struct acc_vf_migration *acc_vf_dev;
+	struct vfio_device_migration_info *mig_ctl;
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct device *dev;
+	u32 device_state;
+	int ret;
+
+	if (!region)
+		return -EINVAL;
+
+	acc_vf_dev = region->data;
+	dev = &acc_vf_dev->vf_dev->dev;
+	mig_ctl = acc_vf_dev->mig_ctl;
+
+	if (pos >= region->size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(region->size - pos));
+
+	switch (pos) {
+	case VDM_OFFSET(device_state):
+		if (count != sizeof(mig_ctl->device_state))
+			return -EFAULT;
+
+		if (iswrite) {
+			ret = copy_from_user(&device_state, buf, count);
+			if (ret)
+				return -EFAULT;
+
+			ret = hisi_acc_vf_set_device_state(acc_vf_dev, device_state);
+			if (ret)
+				return ret;
+		} else {
+			ret = copy_to_user(buf, &mig_ctl->device_state, count);
+			if (ret)
+				return -EFAULT;
+		}
+
+		return count;
+	case VDM_OFFSET(reserved):
+		return 0;
+	case VDM_OFFSET(pending_bytes):
+		if (count != sizeof(mig_ctl->pending_bytes))
+			return -EINVAL;
+
+		if (iswrite)
+			return -EFAULT;
+
+		ret = copy_to_user(buf, &mig_ctl->pending_bytes, count);
+		if (ret)
+			return -EFAULT;
+
+		return count;
+	case VDM_OFFSET(data_offset):
+		if (count != sizeof(mig_ctl->data_offset))
+			return -EINVAL;
+
+		if (iswrite) {
+			ret = copy_from_user(&mig_ctl->data_offset, buf, count);
+			if (ret)
+				return -EFAULT;
+		} else {
+			ret = copy_to_user(buf, &mig_ctl->data_offset, count);
+			if (ret)
+				return -EFAULT;
+		}
+
+		return count;
+	case VDM_OFFSET(data_size):
+		if (count != sizeof(mig_ctl->data_size))
+			return -EINVAL;
+
+		if (iswrite) {
+			ret = copy_from_user(&mig_ctl->data_size, buf, count);
+			if (ret)
+				return -EFAULT;
+
+			/* Check whether the src and dst VF's match */
+			ret = hisi_acc_vf_match_check(acc_vf_dev);
+			if (ret)
+				return ret;
+		} else {
+			ret = copy_to_user(buf, &mig_ctl->data_size, count);
+			if (ret)
+				return -EFAULT;
+		}
+
+		return count;
+	default:
+		/* Transfer data */
+		return hisi_acc_vf_data_transfer(acc_vf_dev, buf,
+						count, pos, iswrite);
+	}
+}
+
+static void hisi_acc_vfio_pci_uninit(struct acc_vf_migration *acc_vf_dev)
+{
+	kfree(acc_vf_dev->mig_ctl);
+	kfree(acc_vf_dev->vf_qm);
+}
+
+static void hisi_acc_vf_migrn_release(struct vfio_pci_core_device *vdev,
+				      struct vfio_pci_region *region)
+{
+	struct acc_vf_migration *acc_vf_dev = region->data;
+
+	hisi_acc_vfio_pci_uninit(acc_vf_dev);
+	kfree(acc_vf_dev);
+}
+
+static const struct vfio_pci_regops hisi_acc_vfio_pci_regops = {
+	.rw = hisi_acc_vf_migrn_rw,
+	.release = hisi_acc_vf_migrn_release,
+};
+
+static int hisi_acc_vf_dev_init(struct pci_dev *pdev, struct hisi_qm *pf_qm,
+				struct acc_vf_migration *acc_vf_dev)
+{
+	struct vfio_device_migration_info *mig_ctl;
+	struct hisi_qm *vf_qm;
+
+	vf_qm = kzalloc(sizeof(*vf_qm), GFP_KERNEL);
+	if (!vf_qm)
+		return -ENOMEM;
+
+	vf_qm->dev_name = pf_qm->dev_name;
+	vf_qm->fun_type = QM_HW_VF;
+	vf_qm->phys_base = pci_resource_start(pdev, VFIO_PCI_BAR2_REGION_INDEX);
+	vf_qm->pdev = pdev;
+	mutex_init(&vf_qm->mailbox_lock);
+
+	acc_vf_dev->vf_qm = vf_qm;
+	acc_vf_dev->pf_qm = pf_qm;
+
+	/* the data region must follow migration info */
+	mig_ctl = kzalloc(MIGRATION_REGION_SZ, GFP_KERNEL);
+	if (!mig_ctl)
+		goto init_qm_error;
+
+	acc_vf_dev->mig_ctl = mig_ctl;
+
+	acc_vf_dev->vf_data = (void *)(mig_ctl + 1);
+
+	mig_ctl->device_state = VFIO_DEVICE_STATE_RUNNING;
+	mig_ctl->data_offset = sizeof(*mig_ctl);
+	mig_ctl->data_size = sizeof(struct acc_vf_data);
+
+	return 0;
+
+init_qm_error:
+	kfree(vf_qm);
+	return -ENOMEM;
+}
+
+static int hisi_acc_vfio_pci_init(struct vfio_pci_core_device *vdev)
+{
+	struct acc_vf_migration *acc_vf_dev;
+	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pf_dev, *vf_dev;
+	struct hisi_qm *pf_qm;
+	int vf_id, ret;
+
+	pf_dev = pdev->physfn;
+	vf_dev = pdev;
+
+	pf_qm = pci_get_drvdata(pf_dev);
+	if (!pf_qm) {
+		pr_err("HiSi ACC qm driver not loaded\n");
+		return -EINVAL;
+	}
+
+	if (pf_qm->ver < QM_HW_V3) {
+		dev_err(&pdev->dev,
+			"Migration not supported, hw version: 0x%x\n",
+			 pf_qm->ver);
+		return -ENODEV;
+	}
+
+	vf_id = PCI_FUNC(vf_dev->devfn);
+	acc_vf_dev = kzalloc(sizeof(*acc_vf_dev), GFP_KERNEL);
+	if (!acc_vf_dev)
+		return -ENOMEM;
+
+	acc_vf_dev->vf_id = vf_id;
+	acc_vf_dev->pf_dev = pf_dev;
+	acc_vf_dev->vf_dev = vf_dev;
+
+	ret = hisi_acc_vf_dev_init(pdev, pf_qm, acc_vf_dev);
+	if (ret) {
+		kfree(acc_vf_dev);
+		return -ENOMEM;
+	}
+
+	ret = vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
+					   VFIO_REGION_SUBTYPE_MIGRATION,
+					   &hisi_acc_vfio_pci_regops,
+					   MIGRATION_REGION_SZ,
+					   VFIO_REGION_INFO_FLAG_READ |
+					   VFIO_REGION_INFO_FLAG_WRITE,
+					   acc_vf_dev);
+	if (ret)
+		goto out;
+
+	return 0;
+out:
+	hisi_acc_vfio_pci_uninit(acc_vf_dev);
+	kfree(acc_vf_dev);
+	return ret;
+}
+
 static long hisi_acc_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 				    unsigned long arg)
 {
@@ -64,6 +1073,12 @@ static int hisi_acc_vfio_pci_open(struct vfio_device *core_vdev)
 	if (ret)
 		return ret;
 
+	ret = hisi_acc_vfio_pci_init(vdev);
+	if (ret) {
+		vfio_pci_core_disable(vdev);
+		return ret;
+	}
+
 	vfio_pci_core_finish_enable(vdev);
 
 	return 0;
@@ -113,9 +1128,9 @@ static void hisi_acc_vfio_pci_remove(struct pci_dev *pdev)
 }
 
 static const struct pci_device_id hisi_acc_vfio_pci_table[] = {
-	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa256) }, /* SEC VF */
-	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa259) }, /* HPRE VF */
-	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa251) }, /* ZIP VF */
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, HISI_SEC_VF_DEV_ID) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, HISI_HPRE_VF_DEV_ID) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, HISI_ZIP_VF_DEV_ID) },
 	{ 0, }
 };
 
@@ -137,4 +1152,4 @@ module_pci_driver(hisi_acc_vfio_pci_driver);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Liu Longfang <liulongfang@huawei.com>");
 MODULE_AUTHOR("Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>");
-MODULE_DESCRIPTION("HiSilicon VFIO PCI - Generic VFIO PCI driver for HiSilicon ACC device family");
+MODULE_DESCRIPTION("HiSilicon VFIO PCI - VFIO PCI driver with live migration support for HiSilicon ACC device family");
diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.h b/drivers/vfio/pci/hisi_acc_vfio_pci.h
new file mode 100644
index 000000000000..f3f8c38d6f1a
--- /dev/null
+++ b/drivers/vfio/pci/hisi_acc_vfio_pci.h
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021 HiSilicon Ltd. */
+
+#ifndef HISI_ACC_VFIO_PCI_H
+#define HISI_ACC_VFIO_PCI_H
+
+#include "../../drivers/crypto/hisilicon/qm.h"
+
+#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
+#define MIGRATION_REGION_SZ (sizeof(struct acc_vf_data) + \
+			      sizeof(struct vfio_device_migration_info))
+
+#define HISI_SEC_VF_DEV_ID		0xa256
+#define HISI_HPRE_VF_DEV_ID		0xa259
+#define HISI_ZIP_VF_DEV_ID		0xa251
+
+#define MB_POLL_PERIOD_US		10
+#define MB_POLL_TIMEOUT_US		1000
+#define QM_CACHE_WB_START		0x204
+#define QM_CACHE_WB_DONE		0x208
+#define QM_MB_CMD_PAUSE_QM		0xe
+#define QM_ABNORMAL_INT_STATUS	0x100008
+#define QM_IFC_INT_STATUS		0x0028
+#define SEC_CORE_INT_STATUS		0x301008
+#define HPRE_HAC_INT_STATUS		0x301800
+#define HZIP_CORE_INT_STATUS		0x3010AC
+#define QM_QUE_ISO_CFG			0x301154
+
+#define QM_VFT_CFG_RDY			0x10006c
+#define QM_VFT_CFG_OP_WR		0x100058
+#define QM_VFT_CFG_TYPE			0x10005c
+#define QM_VFT_CFG			0x100060
+#define QM_VFT_CFG_OP_ENABLE		0x100054
+#define QM_VFT_CFG_DATA_L		0x100064
+#define QM_VFT_CFG_DATA_H		0x100068
+
+#define ERROR_CHECK_TIMEOUT		100
+#define CHECK_DELAY_TIME		100
+
+#define QM_SQC_VFT_BASE_SHIFT_V2	28
+#define QM_SQC_VFT_BASE_MASK_V2	GENMASK(15, 0)
+#define QM_SQC_VFT_NUM_SHIFT_V2	45
+#define QM_SQC_VFT_NUM_MASK_V2	GENMASK(9, 0)
+
+/* mailbox */
+#define QM_MB_CMD_SQC_BT		0x4
+#define QM_MB_CMD_CQC_BT		0x5
+#define QM_MB_CMD_SQC_VFT_V2		0x6
+
+#define QM_MB_CMD_SEND_BASE		0x300
+#define QM_MB_BUSY_SHIFT		13
+#define QM_MB_OP_SHIFT			14
+#define QM_MB_CMD_DATA_ADDR_L		0x304
+#define QM_MB_CMD_DATA_ADDR_H		0x308
+#define QM_MB_MAX_WAIT_CNT		6000
+
+/* doorbell */
+#define QM_DOORBELL_CMD_SQ		0
+#define QM_DOORBELL_CMD_CQ		1
+#define QM_DOORBELL_SQ_CQ_BASE_V2	0x1000
+#define QM_DOORBELL_EQ_AEQ_BASE_V2	0x2000
+#define QM_DB_CMD_SHIFT_V2		12
+#define QM_DB_RAND_SHIFT_V2		16
+#define QM_DB_INDEX_SHIFT_V2		32
+#define QM_DB_PRIORITY_SHIFT_V2	48
+
+/* RW regs */
+#define QM_REGS_MAX_LEN		7
+#define QM_REG_ADDR_OFFSET		0x0004
+
+#define QM_XQC_ADDR_OFFSET		32U
+#define QM_VF_AEQ_INT_MASK		0x0004
+#define QM_VF_EQ_INT_MASK		0x000c
+#define QM_IFC_INT_SOURCE_V		0x0020
+#define QM_IFC_INT_MASK			0x0024
+#define QM_IFC_INT_SET_V		0x002c
+#define QM_QUE_ISO_CFG_V		0x0030
+#define QM_PAGE_SIZE			0x0034
+#define QM_VF_STATE			0x0060
+
+#define QM_EQC_DW0		0X8000
+#define QM_AEQC_DW0		0X8020
+
+#define QM_MATCH_SIZE           32L
+
+enum vf_state {
+	VF_READY = 0x0,
+	VF_NOT_READY,
+	VF_PREPARE,
+};
+
+struct acc_vf_data {
+	/* QM match information */
+	u32 qp_num;
+	u32 dev_id;
+	u32 que_iso_cfg;
+	u32 qp_base;
+	/* QM reserved 4 match information */
+	u32 qm_rsv_state[4];
+
+	/* QM RW regs */
+	u32 aeq_int_mask;
+	u32 eq_int_mask;
+	u32 ifc_int_source;
+	u32 ifc_int_mask;
+	u32 ifc_int_set;
+	u32 page_size;
+	u32 vf_state;
+
+	/*
+	 * QM_VF_MB has 4 regs don't need to migration
+	 * mailbox regs writeback value will cause
+	 * hardware to perform command operations
+	 */
+
+	/* QM_EQC_DW has 7 regs */
+	u32 qm_eqc_dw[7];
+
+	/* QM_AEQC_DW has 7 regs */
+	u32 qm_aeqc_dw[7];
+
+	/* QM reserved 5 regs */
+	u32 qm_rsv_regs[5];
+
+	/* qm memory init information */
+	dma_addr_t eqe_dma;
+	dma_addr_t aeqe_dma;
+	dma_addr_t sqc_dma;
+	dma_addr_t cqc_dma;
+};
+
+struct acc_vf_migration {
+	struct pci_dev			*pf_dev;
+	struct pci_dev			*vf_dev;
+	struct hisi_qm			*pf_qm;
+	struct hisi_qm			*vf_qm;
+	int				vf_id;
+
+	struct vfio_device_migration_info *mig_ctl;
+	struct acc_vf_data		*vf_data;
+};
+
+#endif /* HISI_ACC_VFIO_PCI_H */
+
-- 
2.17.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-02  9:58 ` [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices Shameer Kolothum
@ 2021-07-02 20:29   ` Alex Williamson
  2021-07-05  7:20     ` Shameerali Kolothum Thodi
  2021-07-04  7:03   ` Leon Romanovsky
  1 sibling, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2021-07-02 20:29 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvm, linux-kernel, linux-crypto, jgg, mgurtovoy, linuxarm,
	liulongfang, prime.zeng, yuzenghui, jonathan.cameron, wangzhou1

On Fri, 2 Jul 2021 10:58:46 +0100
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
> This will be extended in follow-up patches to add support for
> vfio live migration feature.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  drivers/vfio/pci/Kconfig             |   9 +++
>  drivers/vfio/pci/Makefile            |   2 +
>  drivers/vfio/pci/hisi_acc_vfio_pci.c | 100 +++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+)
>  create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 9cdef46dd299..709807c28153 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -57,3 +57,12 @@ config MLX5_VFIO_PCI
>  	  framework.
>  
>  	  If you don't know what to do here, say N.
> +
> +config HISI_ACC_VFIO_PCI
> +	tristate "VFIO support for HiSilicon ACC devices"
> +	depends on ARM64 && VFIO_PCI_CORE
> +	help
> +	  This provides generic PCI support for HiSilicon devices using the VFIO
> +	  framework.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index a0df9c2a4bd9..d1de3e81921f 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -3,6 +3,7 @@
>  obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>  obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
> +obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisi-acc-vfio-pci.o
>  
>  vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
> @@ -11,3 +12,4 @@ vfio-pci-y := vfio_pci.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  
>  mlx5-vfio-pci-y := mlx5_vfio_pci.o
> +hisi-acc-vfio-pci-y := hisi_acc_vfio_pci.o
> diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> new file mode 100644
> index 000000000000..a9e173098ab5
> --- /dev/null
> +++ b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> @@ -0,0 +1,100 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2021, HiSilicon Ltd.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +
> +static int hisi_acc_vfio_pci_open(struct vfio_device *core_vdev)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	lockdep_assert_held(&core_vdev->reflck->lock);
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	vfio_pci_core_finish_enable(vdev);
> +
> +	return 0;
> +}
> +
> +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> +	.name		= "hisi-acc-vfio-pci",
> +	.open		= hisi_acc_vfio_pci_open,
> +	.release	= vfio_pci_core_release,
> +	.ioctl		= vfio_pci_core_ioctl,
> +	.read		= vfio_pci_core_read,
> +	.write		= vfio_pci_core_write,
> +	.mmap		= vfio_pci_core_mmap,
> +	.request	= vfio_pci_core_request,
> +	.match		= vfio_pci_core_match,
> +	.reflck_attach	= vfio_pci_core_reflck_attach,
> +};
> +
> +static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> +	struct vfio_pci_core_device *vdev;
> +	int ret;
> +
> +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	ret = vfio_pci_core_register_device(vdev, pdev, &hisi_acc_vfio_pci_ops);
> +	if (ret)
> +		goto out_free;
> +
> +	dev_set_drvdata(&pdev->dev, vdev);
> +
> +	return 0;
> +
> +out_free:
> +	kfree(vdev);
> +	return ret;
> +}
> +
> +static void hisi_acc_vfio_pci_remove(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
> +
> +	vfio_pci_core_unregister_device(vdev);
> +	kfree(vdev);
> +}
> +
> +static const struct pci_device_id hisi_acc_vfio_pci_table[] = {
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa256) }, /* SEC VF */
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa259) }, /* HPRE VF */
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, 0xa251) }, /* ZIP VF */
> +	{ 0, }
> +};
> +
> +MODULE_DEVICE_TABLE(pci, hisi_acc_vfio_pci_table);
> +
> +static struct pci_driver hisi_acc_vfio_pci_driver = {
> +	.name			= "hisi-acc-vfio-pci",
> +	.id_table		= hisi_acc_vfio_pci_table,
> +	.probe			= hisi_acc_vfio_pci_probe,
> +	.remove			= hisi_acc_vfio_pci_remove,
> +#ifdef CONFIG_PCI_IOV
> +	.sriov_configure	= vfio_pci_core_sriov_configure,
> +#endif

The device table suggests only VFs are supported by this driver, so it
really shouldn't need sriov_configure support, right?  Thanks,

Alex

> +	.err_handler		= &vfio_pci_core_err_handlers,
> +};
> +
> +module_pci_driver(hisi_acc_vfio_pci_driver);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Liu Longfang <liulongfang@huawei.com>");
> +MODULE_AUTHOR("Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>");
> +MODULE_DESCRIPTION("HiSilicon VFIO PCI - Generic VFIO PCI driver for HiSilicon ACC device family");


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size
  2021-07-02  9:58 ` [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size Shameer Kolothum
@ 2021-07-02 20:29   ` Alex Williamson
  2021-07-05  7:22     ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2021-07-02 20:29 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvm, linux-kernel, linux-crypto, jgg, mgurtovoy, linuxarm,
	liulongfang, prime.zeng, yuzenghui, jonathan.cameron, wangzhou1

On Fri, 2 Jul 2021 10:58:47 +0100
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> HiSilicon ACC VF device BAR2 region consists of both functional register
> space and migration control register space. From a security point of
> view, it's not advisable to export the migration control region to Guest.
> 
> Hence, hide the migration region and report only the functional register
> space.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  drivers/vfio/pci/hisi_acc_vfio_pci.c | 42 +++++++++++++++++++++++++++-
>  1 file changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> index a9e173098ab5..d57cf6d7adaf 100644
> --- a/drivers/vfio/pci/hisi_acc_vfio_pci.c
> +++ b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> @@ -12,6 +12,46 @@
>  #include <linux/vfio.h>
>  #include <linux/vfio_pci_core.h>
>  
> +static long hisi_acc_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> +				    unsigned long arg)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +
> +	if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
> +		struct pci_dev *pdev = vdev->pdev;
> +		struct vfio_region_info info;
> +		unsigned long minsz;
> +
> +		minsz = offsetofend(struct vfio_region_info, offset);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_BAR2_REGION_INDEX) {
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +
> +			/*
> +			 * ACC VF dev BAR2 region(64K) consists of both functional
> +			 * register space and migration control register space.
> +			 * Report only the first 32K(functional region) to Guest.
> +			 */
> +			info.size = pci_resource_len(pdev, info.index) / 2;
> +

Great, but what actually prevents the user from accessing the full
extent of the BAR since you're re-using core code for read/write/mmap,
which are all basing access on pci_resource_len()?  Thanks,

Alex

> +			info.flags = VFIO_REGION_INFO_FLAG_READ |
> +					VFIO_REGION_INFO_FLAG_WRITE |
> +					VFIO_REGION_INFO_FLAG_MMAP;
> +
> +			return copy_to_user((void __user *)arg, &info, minsz) ?
> +					    -EFAULT : 0;
> +		}
> +	}
> +	return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> +}
> +
>  static int hisi_acc_vfio_pci_open(struct vfio_device *core_vdev)
>  {
>  	struct vfio_pci_core_device *vdev =
> @@ -33,7 +73,7 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
>  	.name		= "hisi-acc-vfio-pci",
>  	.open		= hisi_acc_vfio_pci_open,
>  	.release	= vfio_pci_core_release,
> -	.ioctl		= vfio_pci_core_ioctl,
> +	.ioctl		= hisi_acc_vfio_pci_ioctl,
>  	.read		= vfio_pci_core_read,
>  	.write		= vfio_pci_core_write,
>  	.mmap		= vfio_pci_core_mmap,


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-02  9:58 ` [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices Shameer Kolothum
  2021-07-02 20:29   ` Alex Williamson
@ 2021-07-04  7:03   ` Leon Romanovsky
  2021-07-05  8:47     ` Shameerali Kolothum Thodi
  1 sibling, 1 reply; 55+ messages in thread
From: Leon Romanovsky @ 2021-07-04  7:03 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, jgg, mgurtovoy,
	linuxarm, liulongfang, prime.zeng, yuzenghui, jonathan.cameron,
	wangzhou1

On Fri, Jul 02, 2021 at 10:58:46AM +0100, Shameer Kolothum wrote:
> Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
> This will be extended in follow-up patches to add support for
> vfio live migration feature.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  drivers/vfio/pci/Kconfig             |   9 +++
>  drivers/vfio/pci/Makefile            |   2 +
>  drivers/vfio/pci/hisi_acc_vfio_pci.c | 100 +++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+)
>  create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c

<...>

> +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> +	.name		= "hisi-acc-vfio-pci",
> +	.open		= hisi_acc_vfio_pci_open,
> +	.release	= vfio_pci_core_release,
> +	.ioctl		= vfio_pci_core_ioctl,
> +	.read		= vfio_pci_core_read,
> +	.write		= vfio_pci_core_write,
> +	.mmap		= vfio_pci_core_mmap,
> +	.request	= vfio_pci_core_request,
> +	.match		= vfio_pci_core_match,
> +	.reflck_attach	= vfio_pci_core_reflck_attach,

I don't remember what was proposed in vfio-pci-core conversion patches,
but would expect that default behaviour is to fallback to vfio_pci_core_* API
if ".release/.ioctl/e.t.c" are not redefined.

Thanks

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for common use
  2021-07-02  9:58 ` [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for common use Shameer Kolothum
@ 2021-07-04  9:34   ` Max Gurtovoy
  2021-07-05 10:23     ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 55+ messages in thread
From: Max Gurtovoy @ 2021-07-04  9:34 UTC (permalink / raw)
  To: Shameer Kolothum, kvm, linux-kernel, linux-crypto
  Cc: alex.williamson, jgg, linuxarm, liulongfang, prime.zeng,
	yuzenghui, jonathan.cameron, wangzhou1


On 7/2/2021 12:58 PM, Shameer Kolothum wrote:
> From: Longfang Liu <liulongfang@huawei.com>
>
> Export QM mailbox functions so that they can be used from HiSilicon
> ACC vfio live migration driver in follow-up patch.
>
> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>   drivers/crypto/hisilicon/qm.c | 8 +++++---
>   drivers/crypto/hisilicon/qm.h | 4 ++++
>   2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
> index ce439a0c66c9..87fc0199705e 100644
> --- a/drivers/crypto/hisilicon/qm.c
> +++ b/drivers/crypto/hisilicon/qm.c
> @@ -492,7 +492,7 @@ static bool qm_qp_avail_state(struct hisi_qm *qm, struct hisi_qp *qp,
>   }
>   
>   /* return 0 mailbox ready, -ETIMEDOUT hardware timeout */
> -static int qm_wait_mb_ready(struct hisi_qm *qm)
> +int qm_wait_mb_ready(struct hisi_qm *qm)
>   {
>   	u32 val;
>   
> @@ -500,6 +500,7 @@ static int qm_wait_mb_ready(struct hisi_qm *qm)
>   					  val, !((val >> QM_MB_BUSY_SHIFT) &
>   					  0x1), POLL_PERIOD, POLL_TIMEOUT);
>   }
> +EXPORT_SYMBOL_GPL(qm_wait_mb_ready);
>   
>   /* 128 bit should be written to hardware at one time to trigger a mailbox */
>   static void qm_mb_write(struct hisi_qm *qm, const void *src)
> @@ -523,8 +524,8 @@ static void qm_mb_write(struct hisi_qm *qm, const void *src)
>   		     : "memory");
>   }
>   
> -static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
> -		 bool op)
> +int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
> +	  bool op)
>   {
>   	struct qm_mailbox mailbox;
>   	int ret = 0;
> @@ -563,6 +564,7 @@ static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
>   		atomic64_inc(&qm->debug.dfx.mb_err_cnt);
>   	return ret;
>   }
> +EXPORT_SYMBOL_GPL(qm_mb);
>   
>   static void qm_db_v1(struct hisi_qm *qm, u16 qn, u8 cmd, u16 index, u8 priority)
>   {
> diff --git a/drivers/crypto/hisilicon/qm.h b/drivers/crypto/hisilicon/qm.h
> index acefdf8b3a50..18b010d5452d 100644
> --- a/drivers/crypto/hisilicon/qm.h
> +++ b/drivers/crypto/hisilicon/qm.h
> @@ -396,6 +396,10 @@ pci_ers_result_t hisi_qm_dev_slot_reset(struct pci_dev *pdev);
>   void hisi_qm_reset_prepare(struct pci_dev *pdev);
>   void hisi_qm_reset_done(struct pci_dev *pdev);
>   
> +int qm_wait_mb_ready(struct hisi_qm *qm);
> +int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
> +	  bool op);
> +

maybe you can put it under include/linux/.. ?


>   struct hisi_acc_sgl_pool;
>   struct hisi_acc_hw_sgl *hisi_acc_sg_buf_map_to_hw_sgl(struct device *dev,
>   	struct scatterlist *sgl, struct hisi_acc_sgl_pool *pool,

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-02 20:29   ` Alex Williamson
@ 2021-07-05  7:20     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-07-05  7:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-crypto, jgg, mgurtovoy, Linuxarm,
	liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)



> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 02 July 2021 21:29
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; jgg@nvidia.com; mgurtovoy@nvidia.com;
> Linuxarm <linuxarm@huawei.com>; liulongfang <liulongfang@huawei.com>;
> Zengtao (B) <prime.zeng@hisilicon.com>; yuzenghui
> <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon
> ACC devices
> 
> On Fri, 2 Jul 2021 10:58:46 +0100
> Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:
> 
> > Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
> > This will be extended in follow-up patches to add support for
> > vfio live migration feature.
> >
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  drivers/vfio/pci/Kconfig             |   9 +++
> >  drivers/vfio/pci/Makefile            |   2 +
> >  drivers/vfio/pci/hisi_acc_vfio_pci.c | 100 +++++++++++++++++++++++++++
> >  3 files changed, 111 insertions(+)
> >  create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
> >
> > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> > index 9cdef46dd299..709807c28153 100644
> > --- a/drivers/vfio/pci/Kconfig
> > +++ b/drivers/vfio/pci/Kconfig
> > @@ -57,3 +57,12 @@ config MLX5_VFIO_PCI
> >  	  framework.
> >
> >  	  If you don't know what to do here, say N.
> > +
> > +config HISI_ACC_VFIO_PCI
> > +	tristate "VFIO support for HiSilicon ACC devices"
> > +	depends on ARM64 && VFIO_PCI_CORE
> > +	help
> > +	  This provides generic PCI support for HiSilicon devices using the VFIO
> > +	  framework.
> > +
> > +	  If you don't know what to do here, say N.
> > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> > index a0df9c2a4bd9..d1de3e81921f 100644
> > --- a/drivers/vfio/pci/Makefile
> > +++ b/drivers/vfio/pci/Makefile
> > @@ -3,6 +3,7 @@
> >  obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
> >  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> >  obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
> > +obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisi-acc-vfio-pci.o
> >
> >  vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o
> vfio_pci_config.o
> >  vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
> > @@ -11,3 +12,4 @@ vfio-pci-y := vfio_pci.o
> >  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> >
> >  mlx5-vfio-pci-y := mlx5_vfio_pci.o
> > +hisi-acc-vfio-pci-y := hisi_acc_vfio_pci.o
> > diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.c
> b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> > new file mode 100644
> > index 000000000000..a9e173098ab5
> > --- /dev/null
> > +++ b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> > @@ -0,0 +1,100 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (c) 2021, HiSilicon Ltd.
> > + */
> > +
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/file.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/module.h>
> > +#include <linux/pci.h>
> > +#include <linux/vfio.h>
> > +#include <linux/vfio_pci_core.h>
> > +
> > +static int hisi_acc_vfio_pci_open(struct vfio_device *core_vdev)
> > +{
> > +	struct vfio_pci_core_device *vdev =
> > +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> > +	int ret;
> > +
> > +	lockdep_assert_held(&core_vdev->reflck->lock);
> > +
> > +	ret = vfio_pci_core_enable(vdev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	vfio_pci_core_finish_enable(vdev);
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> > +	.name		= "hisi-acc-vfio-pci",
> > +	.open		= hisi_acc_vfio_pci_open,
> > +	.release	= vfio_pci_core_release,
> > +	.ioctl		= vfio_pci_core_ioctl,
> > +	.read		= vfio_pci_core_read,
> > +	.write		= vfio_pci_core_write,
> > +	.mmap		= vfio_pci_core_mmap,
> > +	.request	= vfio_pci_core_request,
> > +	.match		= vfio_pci_core_match,
> > +	.reflck_attach	= vfio_pci_core_reflck_attach,
> > +};
> > +
> > +static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct
> pci_device_id *id)
> > +{
> > +	struct vfio_pci_core_device *vdev;
> > +	int ret;
> > +
> > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > +	if (!vdev)
> > +		return -ENOMEM;
> > +
> > +	ret = vfio_pci_core_register_device(vdev, pdev, &hisi_acc_vfio_pci_ops);
> > +	if (ret)
> > +		goto out_free;
> > +
> > +	dev_set_drvdata(&pdev->dev, vdev);
> > +
> > +	return 0;
> > +
> > +out_free:
> > +	kfree(vdev);
> > +	return ret;
> > +}
> > +
> > +static void hisi_acc_vfio_pci_remove(struct pci_dev *pdev)
> > +{
> > +	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
> > +
> > +	vfio_pci_core_unregister_device(vdev);
> > +	kfree(vdev);
> > +}
> > +
> > +static const struct pci_device_id hisi_acc_vfio_pci_table[] = {
> > +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI,
> 0xa256) }, /* SEC VF */
> > +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI,
> 0xa259) }, /* HPRE VF */
> > +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI,
> 0xa251) }, /* ZIP VF */
> > +	{ 0, }
> > +};
> > +
> > +MODULE_DEVICE_TABLE(pci, hisi_acc_vfio_pci_table);
> > +
> > +static struct pci_driver hisi_acc_vfio_pci_driver = {
> > +	.name			= "hisi-acc-vfio-pci",
> > +	.id_table		= hisi_acc_vfio_pci_table,
> > +	.probe			= hisi_acc_vfio_pci_probe,
> > +	.remove			= hisi_acc_vfio_pci_remove,
> > +#ifdef CONFIG_PCI_IOV
> > +	.sriov_configure	= vfio_pci_core_sriov_configure,
> > +#endif
> 
> The device table suggests only VFs are supported by this driver, so it
> really shouldn't need sriov_configure support, right?  Thanks,

Right. Only VFs are supported. I will remove this.

Thanks,
Shameer

> 
> Alex
> 
> > +	.err_handler		= &vfio_pci_core_err_handlers,
> > +};
> > +
> > +module_pci_driver(hisi_acc_vfio_pci_driver);
> > +
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR("Liu Longfang <liulongfang@huawei.com>");
> > +MODULE_AUTHOR("Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>");
> > +MODULE_DESCRIPTION("HiSilicon VFIO PCI - Generic VFIO PCI driver for
> HiSilicon ACC device family");


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size
  2021-07-02 20:29   ` Alex Williamson
@ 2021-07-05  7:22     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-07-05  7:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, linux-crypto, jgg, mgurtovoy, Linuxarm,
	liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)



> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 02 July 2021 21:30
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; jgg@nvidia.com; mgurtovoy@nvidia.com;
> Linuxarm <linuxarm@huawei.com>; liulongfang <liulongfang@huawei.com>;
> Zengtao (B) <prime.zeng@hisilicon.com>; yuzenghui
> <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit
> BAR2 region size
> 
> On Fri, 2 Jul 2021 10:58:47 +0100
> Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:
> 
> > HiSilicon ACC VF device BAR2 region consists of both functional register
> > space and migration control register space. From a security point of
> > view, it's not advisable to export the migration control region to Guest.
> >
> > Hence, hide the migration region and report only the functional register
> > space.
> >
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  drivers/vfio/pci/hisi_acc_vfio_pci.c | 42 +++++++++++++++++++++++++++-
> >  1 file changed, 41 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/hisi_acc_vfio_pci.c
> b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> > index a9e173098ab5..d57cf6d7adaf 100644
> > --- a/drivers/vfio/pci/hisi_acc_vfio_pci.c
> > +++ b/drivers/vfio/pci/hisi_acc_vfio_pci.c
> > @@ -12,6 +12,46 @@
> >  #include <linux/vfio.h>
> >  #include <linux/vfio_pci_core.h>
> >
> > +static long hisi_acc_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned
> int cmd,
> > +				    unsigned long arg)
> > +{
> > +	struct vfio_pci_core_device *vdev =
> > +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> > +
> > +	if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
> > +		struct pci_dev *pdev = vdev->pdev;
> > +		struct vfio_region_info info;
> > +		unsigned long minsz;
> > +
> > +		minsz = offsetofend(struct vfio_region_info, offset);
> > +
> > +		if (copy_from_user(&info, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (info.argsz < minsz)
> > +			return -EINVAL;
> > +
> > +		if (info.index == VFIO_PCI_BAR2_REGION_INDEX) {
> > +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > +
> > +			/*
> > +			 * ACC VF dev BAR2 region(64K) consists of both functional
> > +			 * register space and migration control register space.
> > +			 * Report only the first 32K(functional region) to Guest.
> > +			 */
> > +			info.size = pci_resource_len(pdev, info.index) / 2;
> > +
> 
> Great, but what actually prevents the user from accessing the full
> extent of the BAR since you're re-using core code for read/write/mmap,
> which are all basing access on pci_resource_len()?  Thanks,

Ah..true. Missed that. I will add overrides for read/write/mmap and limit
the access.

Thanks,
Shameer

> 
> Alex
> 
> > +			info.flags = VFIO_REGION_INFO_FLAG_READ |
> > +					VFIO_REGION_INFO_FLAG_WRITE |
> > +					VFIO_REGION_INFO_FLAG_MMAP;
> > +
> > +			return copy_to_user((void __user *)arg, &info, minsz) ?
> > +					    -EFAULT : 0;
> > +		}
> > +	}
> > +	return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> > +}
> > +
> >  static int hisi_acc_vfio_pci_open(struct vfio_device *core_vdev)
> >  {
> >  	struct vfio_pci_core_device *vdev =
> > @@ -33,7 +73,7 @@ static const struct vfio_device_ops
> hisi_acc_vfio_pci_ops = {
> >  	.name		= "hisi-acc-vfio-pci",
> >  	.open		= hisi_acc_vfio_pci_open,
> >  	.release	= vfio_pci_core_release,
> > -	.ioctl		= vfio_pci_core_ioctl,
> > +	.ioctl		= hisi_acc_vfio_pci_ioctl,
> >  	.read		= vfio_pci_core_read,
> >  	.write		= vfio_pci_core_write,
> >  	.mmap		= vfio_pci_core_mmap,


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-04  7:03   ` Leon Romanovsky
@ 2021-07-05  8:47     ` Shameerali Kolothum Thodi
  2021-07-05  9:41       ` Max Gurtovoy
  0 siblings, 1 reply; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-07-05  8:47 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, jgg, mgurtovoy,
	Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)



> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: 04 July 2021 08:04
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; alex.williamson@redhat.com; jgg@nvidia.com;
> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon
> ACC devices
> 
> On Fri, Jul 02, 2021 at 10:58:46AM +0100, Shameer Kolothum wrote:
> > Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
> > This will be extended in follow-up patches to add support for
> > vfio live migration feature.
> >
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  drivers/vfio/pci/Kconfig             |   9 +++
> >  drivers/vfio/pci/Makefile            |   2 +
> >  drivers/vfio/pci/hisi_acc_vfio_pci.c | 100 +++++++++++++++++++++++++++
> >  3 files changed, 111 insertions(+)
> >  create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
> 
> <...>
> 
> > +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> > +	.name		= "hisi-acc-vfio-pci",
> > +	.open		= hisi_acc_vfio_pci_open,
> > +	.release	= vfio_pci_core_release,
> > +	.ioctl		= vfio_pci_core_ioctl,
> > +	.read		= vfio_pci_core_read,
> > +	.write		= vfio_pci_core_write,
> > +	.mmap		= vfio_pci_core_mmap,
> > +	.request	= vfio_pci_core_request,
> > +	.match		= vfio_pci_core_match,
> > +	.reflck_attach	= vfio_pci_core_reflck_attach,
> 
> I don't remember what was proposed in vfio-pci-core conversion patches,
> but would expect that default behaviour is to fallback to vfio_pci_core_* API
> if ".release/.ioctl/e.t.c" are not redefined.

Yes, that would be nice, but don't think it does that in latest(v4).

Hi Max,
Could we please consider fall back to the core defaults, may be check and assign defaults
in vfio_pci_core_register_device() ?

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-05  8:47     ` Shameerali Kolothum Thodi
@ 2021-07-05  9:41       ` Max Gurtovoy
  2021-07-05 10:18         ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 55+ messages in thread
From: Max Gurtovoy @ 2021-07-05  9:41 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Leon Romanovsky
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, jgg, Linuxarm,
	liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)


On 7/5/2021 11:47 AM, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Leon Romanovsky [mailto:leon@kernel.org]
>> Sent: 04 July 2021 08:04
>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> linux-crypto@vger.kernel.org; alex.williamson@redhat.com; jgg@nvidia.com;
>> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
>> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
>> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
>> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
>> Subject: Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon
>> ACC devices
>>
>> On Fri, Jul 02, 2021 at 10:58:46AM +0100, Shameer Kolothum wrote:
>>> Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
>>> This will be extended in follow-up patches to add support for
>>> vfio live migration feature.
>>>
>>> Signed-off-by: Shameer Kolothum
>> <shameerali.kolothum.thodi@huawei.com>
>>> ---
>>>   drivers/vfio/pci/Kconfig             |   9 +++
>>>   drivers/vfio/pci/Makefile            |   2 +
>>>   drivers/vfio/pci/hisi_acc_vfio_pci.c | 100 +++++++++++++++++++++++++++
>>>   3 files changed, 111 insertions(+)
>>>   create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
>> <...>
>>
>>> +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
>>> +	.name		= "hisi-acc-vfio-pci",
>>> +	.open		= hisi_acc_vfio_pci_open,
>>> +	.release	= vfio_pci_core_release,
>>> +	.ioctl		= vfio_pci_core_ioctl,
>>> +	.read		= vfio_pci_core_read,
>>> +	.write		= vfio_pci_core_write,
>>> +	.mmap		= vfio_pci_core_mmap,
>>> +	.request	= vfio_pci_core_request,
>>> +	.match		= vfio_pci_core_match,
>>> +	.reflck_attach	= vfio_pci_core_reflck_attach,
>> I don't remember what was proposed in vfio-pci-core conversion patches,
>> but would expect that default behaviour is to fallback to vfio_pci_core_* API
>> if ".release/.ioctl/e.t.c" are not redefined.
> Yes, that would be nice, but don't think it does that in latest(v4).
>
> Hi Max,
> Could we please consider fall back to the core defaults, may be check and assign defaults
> in vfio_pci_core_register_device() ?

I don't see why we should do this.

vfio_pci_core.ko is just a library driver. It shouldn't decide for the 
vendor driver ops.

If a vendor driver would like to use its helper functions - great.

If it wants to override it - great.

If it wants to leave some op as NULL - it can do it also.


>
> Thanks,
> Shameer

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-05  9:41       ` Max Gurtovoy
@ 2021-07-05 10:18         ` Shameerali Kolothum Thodi
  2021-07-05 18:27           ` Leon Romanovsky
  0 siblings, 1 reply; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-07-05 10:18 UTC (permalink / raw)
  To: Max Gurtovoy, Leon Romanovsky
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, jgg, Linuxarm,
	liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)



> -----Original Message-----
> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> Sent: 05 July 2021 10:42
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> Leon Romanovsky <leon@kernel.org>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; alex.williamson@redhat.com; jgg@nvidia.com;
> Linuxarm <linuxarm@huawei.com>; liulongfang <liulongfang@huawei.com>;
> Zengtao (B) <prime.zeng@hisilicon.com>; yuzenghui
> <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon
> ACC devices
> 
> 
> On 7/5/2021 11:47 AM, Shameerali Kolothum Thodi wrote:
> >
> >> -----Original Message-----
> >> From: Leon Romanovsky [mailto:leon@kernel.org]
> >> Sent: 04 July 2021 08:04
> >> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> >> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> jgg@nvidia.com;
> >> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> >> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> >> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> >> <jonathan.cameron@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>
> >> Subject: Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for
> HiSilicon
> >> ACC devices
> >>
> >> On Fri, Jul 02, 2021 at 10:58:46AM +0100, Shameer Kolothum wrote:
> >>> Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
> >>> This will be extended in follow-up patches to add support for
> >>> vfio live migration feature.
> >>>
> >>> Signed-off-by: Shameer Kolothum
> >> <shameerali.kolothum.thodi@huawei.com>
> >>> ---
> >>>   drivers/vfio/pci/Kconfig             |   9 +++
> >>>   drivers/vfio/pci/Makefile            |   2 +
> >>>   drivers/vfio/pci/hisi_acc_vfio_pci.c | 100
> +++++++++++++++++++++++++++
> >>>   3 files changed, 111 insertions(+)
> >>>   create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
> >> <...>
> >>
> >>> +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> >>> +	.name		= "hisi-acc-vfio-pci",
> >>> +	.open		= hisi_acc_vfio_pci_open,
> >>> +	.release	= vfio_pci_core_release,
> >>> +	.ioctl		= vfio_pci_core_ioctl,
> >>> +	.read		= vfio_pci_core_read,
> >>> +	.write		= vfio_pci_core_write,
> >>> +	.mmap		= vfio_pci_core_mmap,
> >>> +	.request	= vfio_pci_core_request,
> >>> +	.match		= vfio_pci_core_match,
> >>> +	.reflck_attach	= vfio_pci_core_reflck_attach,
> >> I don't remember what was proposed in vfio-pci-core conversion patches,
> >> but would expect that default behaviour is to fallback to vfio_pci_core_*
> API
> >> if ".release/.ioctl/e.t.c" are not redefined.
> > Yes, that would be nice, but don't think it does that in latest(v4).
> >
> > Hi Max,
> > Could we please consider fall back to the core defaults, may be check and
> assign defaults
> > in vfio_pci_core_register_device() ?
> 
> I don't see why we should do this.
> 
> vfio_pci_core.ko is just a library driver. It shouldn't decide for the
> vendor driver ops.
> 
> If a vendor driver would like to use its helper functions - great.
> 
> If it wants to override it - great.
> 
> If it wants to leave some op as NULL - it can do it also.

Based on the documentation of the vfio_device_ops callbacks,
It looks like we already have a precedence in the case of reflck_attach
callback where it uses the vfio core default one, if it is not implemented.

Also I would imagine that in majority use cases the vendor drivers will be
defaulting to core functions. 

I think, in any case, it would be good to update the Documentation based on
which way we end up doing this.

Thanks,
Shameer 

 
> 
> 
> >
> > Thanks,
> > Shameer

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for common use
  2021-07-04  9:34   ` Max Gurtovoy
@ 2021-07-05 10:23     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-07-05 10:23 UTC (permalink / raw)
  To: Max Gurtovoy, kvm, linux-kernel, linux-crypto
  Cc: alex.williamson, jgg, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)



> -----Original Message-----
> From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> Sent: 04 July 2021 10:34
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org
> Cc: alex.williamson@redhat.com; jgg@nvidia.com; Linuxarm
> <linuxarm@huawei.com>; liulongfang <liulongfang@huawei.com>; Zengtao (B)
> <prime.zeng@hisilicon.com>; yuzenghui <yuzenghui@huawei.com>; Jonathan
> Cameron <jonathan.cameron@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for
> common use
> 
> 
> On 7/2/2021 12:58 PM, Shameer Kolothum wrote:
> > From: Longfang Liu <liulongfang@huawei.com>
> >
> > Export QM mailbox functions so that they can be used from HiSilicon
> > ACC vfio live migration driver in follow-up patch.
> >
> > Signed-off-by: Longfang Liu <liulongfang@huawei.com>
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > ---
> >   drivers/crypto/hisilicon/qm.c | 8 +++++---
> >   drivers/crypto/hisilicon/qm.h | 4 ++++
> >   2 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/crypto/hisilicon/qm.c
> > b/drivers/crypto/hisilicon/qm.c index ce439a0c66c9..87fc0199705e
> > 100644
> > --- a/drivers/crypto/hisilicon/qm.c
> > +++ b/drivers/crypto/hisilicon/qm.c
> > @@ -492,7 +492,7 @@ static bool qm_qp_avail_state(struct hisi_qm *qm,
> struct hisi_qp *qp,
> >   }
> >
> >   /* return 0 mailbox ready, -ETIMEDOUT hardware timeout */ -static
> > int qm_wait_mb_ready(struct hisi_qm *qm)
> > +int qm_wait_mb_ready(struct hisi_qm *qm)
> >   {
> >   	u32 val;
> >
> > @@ -500,6 +500,7 @@ static int qm_wait_mb_ready(struct hisi_qm *qm)
> >   					  val, !((val >> QM_MB_BUSY_SHIFT) &
> >   					  0x1), POLL_PERIOD, POLL_TIMEOUT);
> >   }
> > +EXPORT_SYMBOL_GPL(qm_wait_mb_ready);
> >
> >   /* 128 bit should be written to hardware at one time to trigger a mailbox
> */
> >   static void qm_mb_write(struct hisi_qm *qm, const void *src) @@
> > -523,8 +524,8 @@ static void qm_mb_write(struct hisi_qm *qm, const void
> *src)
> >   		     : "memory");
> >   }
> >
> > -static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16
> queue,
> > -		 bool op)
> > +int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
> > +	  bool op)
> >   {
> >   	struct qm_mailbox mailbox;
> >   	int ret = 0;
> > @@ -563,6 +564,7 @@ static int qm_mb(struct hisi_qm *qm, u8 cmd,
> dma_addr_t dma_addr, u16 queue,
> >   		atomic64_inc(&qm->debug.dfx.mb_err_cnt);
> >   	return ret;
> >   }
> > +EXPORT_SYMBOL_GPL(qm_mb);
> >
> >   static void qm_db_v1(struct hisi_qm *qm, u16 qn, u8 cmd, u16 index, u8
> priority)
> >   {
> > diff --git a/drivers/crypto/hisilicon/qm.h
> > b/drivers/crypto/hisilicon/qm.h index acefdf8b3a50..18b010d5452d
> > 100644
> > --- a/drivers/crypto/hisilicon/qm.h
> > +++ b/drivers/crypto/hisilicon/qm.h
> > @@ -396,6 +396,10 @@ pci_ers_result_t hisi_qm_dev_slot_reset(struct
> pci_dev *pdev);
> >   void hisi_qm_reset_prepare(struct pci_dev *pdev);
> >   void hisi_qm_reset_done(struct pci_dev *pdev);
> >
> > +int qm_wait_mb_ready(struct hisi_qm *qm); int qm_mb(struct hisi_qm
> > +*qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
> > +	  bool op);
> > +
> 
> maybe you can put it under include/linux/.. ?

Ok. I suppose we could do that.

Thanks,
Shameer

> 
> 
> >   struct hisi_acc_sgl_pool;
> >   struct hisi_acc_hw_sgl *hisi_acc_sg_buf_map_to_hw_sgl(struct device
> *dev,
> >   	struct scatterlist *sgl, struct hisi_acc_sgl_pool *pool,

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-05 10:18         ` Shameerali Kolothum Thodi
@ 2021-07-05 18:27           ` Leon Romanovsky
  2021-07-05 18:32             ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Leon Romanovsky @ 2021-07-05 18:27 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Max Gurtovoy, kvm, linux-kernel, linux-crypto, alex.williamson,
	jgg, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Mon, Jul 05, 2021 at 10:18:59AM +0000, Shameerali Kolothum Thodi wrote:
> 
> 
> > -----Original Message-----
> > From: Max Gurtovoy [mailto:mgurtovoy@nvidia.com]
> > Sent: 05 July 2021 10:42
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > Leon Romanovsky <leon@kernel.org>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-crypto@vger.kernel.org; alex.williamson@redhat.com; jgg@nvidia.com;
> > Linuxarm <linuxarm@huawei.com>; liulongfang <liulongfang@huawei.com>;
> > Zengtao (B) <prime.zeng@hisilicon.com>; yuzenghui
> > <yuzenghui@huawei.com>; Jonathan Cameron
> > <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> > Subject: Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon
> > ACC devices
> > 
> > 
> > On 7/5/2021 11:47 AM, Shameerali Kolothum Thodi wrote:
> > >
> > >> -----Original Message-----
> > >> From: Leon Romanovsky [mailto:leon@kernel.org]
> > >> Sent: 04 July 2021 08:04
> > >> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > >> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > >> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> > jgg@nvidia.com;
> > >> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> > >> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> > >> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> > >> <jonathan.cameron@huawei.com>; Wangzhou (B)
> > <wangzhou1@hisilicon.com>
> > >> Subject: Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for
> > HiSilicon
> > >> ACC devices
> > >>
> > >> On Fri, Jul 02, 2021 at 10:58:46AM +0100, Shameer Kolothum wrote:
> > >>> Add a vendor-specific vfio_pci driver for HiSilicon ACC devices.
> > >>> This will be extended in follow-up patches to add support for
> > >>> vfio live migration feature.
> > >>>
> > >>> Signed-off-by: Shameer Kolothum
> > >> <shameerali.kolothum.thodi@huawei.com>
> > >>> ---
> > >>>   drivers/vfio/pci/Kconfig             |   9 +++
> > >>>   drivers/vfio/pci/Makefile            |   2 +
> > >>>   drivers/vfio/pci/hisi_acc_vfio_pci.c | 100
> > +++++++++++++++++++++++++++
> > >>>   3 files changed, 111 insertions(+)
> > >>>   create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
> > >> <...>
> > >>
> > >>> +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> > >>> +	.name		= "hisi-acc-vfio-pci",
> > >>> +	.open		= hisi_acc_vfio_pci_open,
> > >>> +	.release	= vfio_pci_core_release,
> > >>> +	.ioctl		= vfio_pci_core_ioctl,
> > >>> +	.read		= vfio_pci_core_read,
> > >>> +	.write		= vfio_pci_core_write,
> > >>> +	.mmap		= vfio_pci_core_mmap,
> > >>> +	.request	= vfio_pci_core_request,
> > >>> +	.match		= vfio_pci_core_match,
> > >>> +	.reflck_attach	= vfio_pci_core_reflck_attach,
> > >> I don't remember what was proposed in vfio-pci-core conversion patches,
> > >> but would expect that default behaviour is to fallback to vfio_pci_core_*
> > API
> > >> if ".release/.ioctl/e.t.c" are not redefined.
> > > Yes, that would be nice, but don't think it does that in latest(v4).
> > >
> > > Hi Max,
> > > Could we please consider fall back to the core defaults, may be check and
> > assign defaults
> > > in vfio_pci_core_register_device() ?
> > 
> > I don't see why we should do this.
> > 
> > vfio_pci_core.ko is just a library driver. It shouldn't decide for the
> > vendor driver ops.
> > 
> > If a vendor driver would like to use its helper functions - great.
> > 
> > If it wants to override it - great.
> > 
> > If it wants to leave some op as NULL - it can do it also.
> 
> Based on the documentation of the vfio_device_ops callbacks,
> It looks like we already have a precedence in the case of reflck_attach
> callback where it uses the vfio core default one, if it is not implemented.

The reflck_attach pattern is pretty common pattern in the kernel to provide fallback.

> 
> Also I would imagine that in majority use cases the vendor drivers will be
> defaulting to core functions. 

Right, this is whole idea of having core functionality in one place, if
vendor wants/needs, he will overwrite.

> 
> I think, in any case, it would be good to update the Documentation based on
> which way we end up doing this.

The request to update Documentation can be seen as an example of
choosing not-good API decisions. Expectation to see all drivers to
use same callbacks with same vfio-core function calls sounds strange
to me.

Thanks

> 
> Thanks,
> Shameer 
> 
>  
> > 
> > 
> > >
> > > Thanks,
> > > Shameer

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-05 18:27           ` Leon Romanovsky
@ 2021-07-05 18:32             ` Jason Gunthorpe
  2021-07-06  3:59               ` Leon Romanovsky
  2021-07-06  4:39               ` Christoph Hellwig
  0 siblings, 2 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2021-07-05 18:32 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Shameerali Kolothum Thodi, Max Gurtovoy, kvm, linux-kernel,
	linux-crypto, alex.williamson, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Mon, Jul 05, 2021 at 09:27:37PM +0300, Leon Romanovsky wrote:

> > I think, in any case, it would be good to update the Documentation based on
> > which way we end up doing this.
> 
> The request to update Documentation can be seen as an example of
> choosing not-good API decisions. Expectation to see all drivers to
> use same callbacks with same vfio-core function calls sounds strange
> to me.

It is not vfio-core, it is vfio-pci-core. It is similar to how some of
the fops stuff works, eg the generic_file whatever functions everyone
puts in.

It would be improved a bit by making the ops struct mutable and
populating it at runtime like we do in RDMA. Then the PCI ops and
driver ops could be merged together without the repetition.

Probably something that is more interesting if there are more drivers
as it is a fair amount of typing to make.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-05 18:32             ` Jason Gunthorpe
@ 2021-07-06  3:59               ` Leon Romanovsky
  2021-07-06  4:39               ` Christoph Hellwig
  1 sibling, 0 replies; 55+ messages in thread
From: Leon Romanovsky @ 2021-07-06  3:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, Max Gurtovoy, kvm, linux-kernel,
	linux-crypto, alex.williamson, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Mon, Jul 05, 2021 at 03:32:47PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 05, 2021 at 09:27:37PM +0300, Leon Romanovsky wrote:
> 
> > > I think, in any case, it would be good to update the Documentation based on
> > > which way we end up doing this.
> > 
> > The request to update Documentation can be seen as an example of
> > choosing not-good API decisions. Expectation to see all drivers to
> > use same callbacks with same vfio-core function calls sounds strange
> > to me.
> 
> It is not vfio-core, it is vfio-pci-core. It is similar to how some of
> the fops stuff works, eg the generic_file whatever functions everyone
> puts in.

It doesn't really matter if it is vfio-core or vfio-pci-core. This looks
horrible and it is going to be repeated for every driver:

+       .release        = vfio_pci_core_release,
+       .ioctl          = vfio_pci_core_ioctl,
+       .read           = vfio_pci_core_read,
+       .write          = vfio_pci_core_write,
+       .mmap           = vfio_pci_core_mmap,
+       .request        = vfio_pci_core_request,
+       .match          = vfio_pci_core_match,
+       .reflck_attach  = vfio_pci_core_reflck_attach,
+};

At some point of time you will add new .XXX callback and will
find yourself changing all drivers to have something like
".XXX = vfio_pci_core_XXX,"

Thanks

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-05 18:32             ` Jason Gunthorpe
  2021-07-06  3:59               ` Leon Romanovsky
@ 2021-07-06  4:39               ` Christoph Hellwig
  2021-07-06 11:51                 ` Jason Gunthorpe
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2021-07-06  4:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Shameerali Kolothum Thodi, Max Gurtovoy, kvm,
	linux-kernel, linux-crypto, alex.williamson, Linuxarm,
	liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Mon, Jul 05, 2021 at 03:32:47PM -0300, Jason Gunthorpe wrote:
> It would be improved a bit by making the ops struct mutable and
> populating it at runtime like we do in RDMA. Then the PCI ops and
> driver ops could be merged together without the repetition.

No, that would be everything but an improvement.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  2021-07-06  4:39               ` Christoph Hellwig
@ 2021-07-06 11:51                 ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2021-07-06 11:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Shameerali Kolothum Thodi, Max Gurtovoy, kvm,
	linux-kernel, linux-crypto, alex.williamson, Linuxarm,
	liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Tue, Jul 06, 2021 at 05:39:25AM +0100, Christoph Hellwig wrote:
> On Mon, Jul 05, 2021 at 03:32:47PM -0300, Jason Gunthorpe wrote:
> > It would be improved a bit by making the ops struct mutable and
> > populating it at runtime like we do in RDMA. Then the PCI ops and
> > driver ops could be merged together without the repetition.
> 
> No, that would be everything but an improvement.

Do you have an alternative? It has worked reasonably with the similar
RDMA problems.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2021-07-02  9:58 [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Shameer Kolothum
                   ` (3 preceding siblings ...)
  2021-07-02  9:58 ` [RFC v2 4/4] hisi_acc_vfio_pci: Add support for vfio live migration Shameer Kolothum
@ 2022-02-02 13:14 ` Jason Gunthorpe
  2022-02-02 14:34   ` Shameerali Kolothum Thodi
  2022-02-18 16:37 ` Jason Gunthorpe
  5 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-02 13:14 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, mgurtovoy,
	linuxarm, liulongfang, prime.zeng, yuzenghui, jonathan.cameron,
	wangzhou1

On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> This series attempts to add vfio live migration support for
> HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> includes both the functional register space and migration 
> control register space. As discussed in RFCv1[0], this may create
> security issues as these regions get shared between the Guest
> driver and the migration driver. Based on the feedback, we tried
> to address those concerns in this version. 
> 
> This is now based on the new vfio-pci-core framework proposal[1].
> Understand that the framework proposal is still under discussion,
> but really appreciate any feedback on the approach taken here
> to mitigate the security risks.

Hi, can you look at the v6 proposal for the mlx5 implementation of the
migration API and see if it meets hisilicon acc's needs as well?

https://lore.kernel.org/all/20220130160826.32449-1-yishaih@nvidia.com/

There are few topics to consider:
 - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
   sense for this driver?

   I see pf_qm_state_pre_save() but didn't understand why it wanted to
   send the first 32 bytes in the PRECOPY mode? It is fine, but it
   will add some complexity to continue to do this.

 - I think we discussed the P2P implementation and decided it would
   work for this device? Can you re-read and confirm?

 - Are the arcs we defined going to work here as well? The current
   implementation in hisi_acc_vf_set_device_state() is very far away
   from what the v1 protocol is, so I'm having a hard time guessing,
   but..

      RESUMING -> STOP
        Probably vf_qm_state_resume()

      RUNNING -> STOP
        vf_qm_fun_restart() - that is oddly named..

      STOP -> RESUMING
        Seems to be a nop (likely a bug)

      STOP -> RUNNING
         Not implemented currenty? (also a bug)

      STOP -> STOP_COPY
         pf_qm_state_pre_save / vf_qm_state_save

      STOP_COPY -> STOP
         NOP

   And the modification for the P2P/NO DMA is presumably just
   fun_restart too since stopping the device and stopping DMA are
   going to be the same thing here?

The mlx5 implementation linked above is a full example you can cut and
paste from for how to implement the state function and the how to do
the data transfer. The f_ops read/write implementation for acc looks
trivial as it only streams the fixed size and pre-allocated 'struct
acc_vf_data'

It looks like it would be a short path to implement our v2 proposal
and remove a lot of driver code, as we saw in mlx5.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 13:14 ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Jason Gunthorpe
@ 2022-02-02 14:34   ` Shameerali Kolothum Thodi
  2022-02-02 15:39     ` Jason Gunthorpe
  2022-02-02 17:30     ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Alex Williamson
  0 siblings, 2 replies; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-02-02 14:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, mgurtovoy,
	Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 02 February 2022 13:15
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
> 
> On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> > This series attempts to add vfio live migration support for
> > HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> > includes both the functional register space and migration
> > control register space. As discussed in RFCv1[0], this may create
> > security issues as these regions get shared between the Guest
> > driver and the migration driver. Based on the feedback, we tried
> > to address those concerns in this version.
> >
> > This is now based on the new vfio-pci-core framework proposal[1].
> > Understand that the framework proposal is still under discussion,
> > but really appreciate any feedback on the approach taken here
> > to mitigate the security risks.
> 
> Hi, can you look at the v6 proposal for the mlx5 implementation of the
> migration API and see if it meets hisilicon acc's needs as well?
> 
> https://lore.kernel.org/all/20220130160826.32449-1-yishaih@nvidia.com/

Yes, I saw that one. Thanks for that and is now looking into it.

> 
> There are few topics to consider:
>  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
>    sense for this driver?

I think it will be STOP_COPY only for now. We might have PRECOPY feature once
we have the SMMUv3 HTTU support in future.

> 
>    I see pf_qm_state_pre_save() but didn't understand why it wanted to
>    send the first 32 bytes in the PRECOPY mode? It is fine, but it
>    will add some complexity to continue to do this.

That was mainly to do a quick verification between src and dst compatibility
before we start saving the state. I think probably we can delay that check
for later.

>  - I think we discussed the P2P implementation and decided it would
>    work for this device? Can you re-read and confirm?

In our case these devices are Integrated End Point devices and doesn't have
P2P DMA capability. Hence the FSM arcs will be limited to STOP_COPY feature
I guess. Also, since we cannot guarantee a NDMA state in STOP, my
assumption currently is the onus of making sure that no MMIO access happens 
in STOP is on the user. Is that a valid assumption?

>  - Are the arcs we defined going to work here as well? The current
>    implementation in hisi_acc_vf_set_device_state() is very far away
>    from what the v1 protocol is, so I'm having a hard time guessing,
>    but..

Right. The FSM has changed a couple of times since we posted this.
I am going to rebase all that now.

>       RESUMING -> STOP
>         Probably vf_qm_state_resume()
> 
>       RUNNING -> STOP
>         vf_qm_fun_restart() - that is oddly named..
> 
>       STOP -> RESUMING
>         Seems to be a nop (likely a bug)
> 
>       STOP -> RUNNING
>          Not implemented currenty? (also a bug)
> 
>       STOP -> STOP_COPY
>          pf_qm_state_pre_save / vf_qm_state_save
> 
>       STOP_COPY -> STOP
>          NOP

I will check and verify this.

>    And the modification for the P2P/NO DMA is presumably just
>    fun_restart too since stopping the device and stopping DMA are
>    going to be the same thing here?

Yes, in our case stopping device and stopping DMA are effectively the
same thing.

> 
> The mlx5 implementation linked above is a full example you can cut and
> paste from for how to implement the state function and the how to do
> the data transfer. The f_ops read/write implementation for acc looks
> trivial as it only streams the fixed size and pre-allocated 'struct
> acc_vf_data'
> 
> It looks like it would be a short path to implement our v2 proposal
> and remove a lot of driver code, as we saw in mlx5.
> 

Ok. These are the git repo I am using for the rework,
https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2
https://github.com/jgunthorpe/linux/tree/vfio_migration_v2

Please let me know if the above are not up to date.

Also, just noted that my quick prototype is now failing
with below error,

" Error: VFIO device doesn't support migration"

Do we need to set the below before the feature query?
Or am I using a wrong Qemu/kernel repo?

--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice
*vbasedev, uint64_t *mig_flags)
     struct vfio_device_feature_migration *mig = (void *)feature->data;

     feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | VFIO_DEVICE_FEATURE_GET;
     if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0)
         return -EOPNOTSUPP;

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 14:34   ` Shameerali Kolothum Thodi
@ 2022-02-02 15:39     ` Jason Gunthorpe
  2022-02-02 16:10       ` Shameerali Kolothum Thodi
  2022-02-02 17:30     ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Alex Williamson
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-02 15:39 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, mgurtovoy,
	Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi wrote:

> > There are few topics to consider:
> >  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
> >    sense for this driver?
> 
> I think it will be STOP_COPY only for now. We might have PRECOPY feature once
> we have the SMMUv3 HTTU support in future.

HTTU is the dirty tracking feature? To be clear VFIO migration support
for PRECOPY has nothing to do with IOMMU based dirty page tracking.
 
> >  - I think we discussed the P2P implementation and decided it would
> >    work for this device? Can you re-read and confirm?
> 
> In our case these devices are Integrated End Point devices and doesn't have
> P2P DMA capability. Hence the FSM arcs will be limited to STOP_COPY feature
> I guess. Also, since we cannot guarantee a NDMA state in STOP, my
> assumption currently is the onus of making sure that no MMIO access happens 
> in STOP is on the user. Is that a valid assumption?

Yes, you can treat RUNNING_P2P as the same as STOP and rely on no MMIO
access to sustain it.

(and I'm wondering sometimes if we should rename RUNNING_P2P to
STOP_P2P - ie the device is stopped but still allows inbound P2P to
make this clearer)

> Do we need to set the below before the feature query?
> Or am I using a wrong Qemu/kernel repo?
> 
> +++ b/hw/vfio/migration.c
> @@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice
> *vbasedev, uint64_t *mig_flags)
>      struct vfio_device_feature_migration *mig = (void *)feature->data;
> 
>      feature->argsz = sizeof(buf);
> +    feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | VFIO_DEVICE_FEATURE_GET;
>      if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0)
>          return -EOPNOTSUPP;

Oh, this is my mistake I thought this got pushed to that github
already but didn't, I updated it.

If you have a prototype can you post another RFC?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 15:39     ` Jason Gunthorpe
@ 2022-02-02 16:10       ` Shameerali Kolothum Thodi
  2022-02-02 17:03         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-02-02 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, mgurtovoy,
	Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 02 February 2022 15:40
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
> 
> On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > > There are few topics to consider:
> > >  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
> > >    sense for this driver?
> >
> > I think it will be STOP_COPY only for now. We might have PRECOPY
> > feature once we have the SMMUv3 HTTU support in future.
> 
> HTTU is the dirty tracking feature? To be clear VFIO migration support for
> PRECOPY has nothing to do with IOMMU based dirty page tracking.

Yes, it is based on the IOMMU hardware dirty bit management support.
A RFC was posted sometime back,
https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/

Ok, my guess was that the PRECOPY here was related. Thanks for clarifying.

> 
> > >  - I think we discussed the P2P implementation and decided it would
> > >    work for this device? Can you re-read and confirm?
> >
> > In our case these devices are Integrated End Point devices and doesn't
> > have P2P DMA capability. Hence the FSM arcs will be limited to
> > STOP_COPY feature I guess. Also, since we cannot guarantee a NDMA
> > state in STOP, my assumption currently is the onus of making sure that
> > no MMIO access happens in STOP is on the user. Is that a valid assumption?
> 
> Yes, you can treat RUNNING_P2P as the same as STOP and rely on no MMIO
> access to sustain it.

Ok.
 
> (and I'm wondering sometimes if we should rename RUNNING_P2P to
> STOP_P2P - ie the device is stopped but still allows inbound P2P to make this
> clearer)
> 
> > Do we need to set the below before the feature query?
> > Or am I using a wrong Qemu/kernel repo?
> >
> > +++ b/hw/vfio/migration.c
> > @@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice
> > *vbasedev, uint64_t *mig_flags)
> >      struct vfio_device_feature_migration *mig = (void
> > *)feature->data;
> >
> >      feature->argsz = sizeof(buf);
> > +    feature->flags = VFIO_DEVICE_FEATURE_MIGRATION |
> > + VFIO_DEVICE_FEATURE_GET;
> >      if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0)
> >          return -EOPNOTSUPP;
> 
> Oh, this is my mistake I thought this got pushed to that github already but
> didn't, I updated it.

Ok. Thanks.
 
> If you have a prototype can you post another RFC?

Sure, will do. I just started and has only a skeleton proto based on v2 now.
Will send out a RFC soon once I have all the FSM arcs implemented 
and sanity tested.

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 16:10       ` Shameerali Kolothum Thodi
@ 2022-02-02 17:03         ` Jason Gunthorpe
  2022-02-02 19:05           ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-02 17:03 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, mgurtovoy,
	Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Wed, Feb 02, 2022 at 04:10:06PM +0000, Shameerali Kolothum Thodi wrote:
> 
> 
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > Sent: 02 February 2022 15:40
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> > mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> > <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> > yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> > <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> > Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
> > 
> > On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi
> > wrote:
> > 
> > > > There are few topics to consider:
> > > >  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
> > > >    sense for this driver?
> > >
> > > I think it will be STOP_COPY only for now. We might have PRECOPY
> > > feature once we have the SMMUv3 HTTU support in future.
> > 
> > HTTU is the dirty tracking feature? To be clear VFIO migration support for
> > PRECOPY has nothing to do with IOMMU based dirty page tracking.
> 
> Yes, it is based on the IOMMU hardware dirty bit management support.
> A RFC was posted sometime back,
> https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/

Yes, I saw that. I was hoping to have a discussion on this soon about
how to integrate that with the iommufd work, which I hope will allow
that series, and the other IOMMU drivers that can support this to be
merged..

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 14:34   ` Shameerali Kolothum Thodi
  2022-02-02 15:39     ` Jason Gunthorpe
@ 2022-02-02 17:30     ` Alex Williamson
  2022-02-02 18:04       ` Jason Gunthorpe
  1 sibling, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2022-02-02 17:30 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, kvm, linux-kernel, linux-crypto, mgurtovoy,
	Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Wed, 2 Feb 2022 14:34:52 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> 
> > 
> >    I see pf_qm_state_pre_save() but didn't understand why it wanted to
> >    send the first 32 bytes in the PRECOPY mode? It is fine, but it
> >    will add some complexity to continue to do this.  
> 
> That was mainly to do a quick verification between src and dst compatibility
> before we start saving the state. I think probably we can delay that check
> for later.

In the v1 migration scheme, this was considered good practice.  It
shouldn't be limited to PRECOPY, as there's no requirement to use
PRECOPY, but the earlier in the migration process that we can trigger a
device or data stream compatibility fault, the better.  TBH, even in
the case where a device doesn't support live dirty tracking for a
PRECOPY phase, using it for compatibility testing continues to seem
like good practice.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 17:30     ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Alex Williamson
@ 2022-02-02 18:04       ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-02 18:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Wed, Feb 02, 2022 at 10:30:41AM -0700, Alex Williamson wrote:
> On Wed, 2 Feb 2022 14:34:52 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> 
> > > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > 
> > > 
> > >    I see pf_qm_state_pre_save() but didn't understand why it wanted to
> > >    send the first 32 bytes in the PRECOPY mode? It is fine, but it
> > >    will add some complexity to continue to do this.  
> > 
> > That was mainly to do a quick verification between src and dst compatibility
> > before we start saving the state. I think probably we can delay that check
> > for later.
> 
> In the v1 migration scheme, this was considered good practice.  It
> shouldn't be limited to PRECOPY, as there's no requirement to use
> PRECOPY, but the earlier in the migration process that we can trigger a
> device or data stream compatibility fault, the better.  TBH, even in
> the case where a device doesn't support live dirty tracking for a
> PRECOPY phase, using it for compatibility testing continues to seem
> like good practice.

At least with our thinking here, we'd rather the device expose an
explicit compatibility data via get/test system calls so we can build
proper infrastructure around this.

Every device will have compatibility requirements and we can build
more shared common code this way. ie qemu can ideally fetch the data
before migration starts and do an exchange with the live migration
target to see if it is OK. Orchestration can inventory the systems,
and automation can select live migration targets that can actually
work.

If it is hidden inside the migration stream it is too invisible to be
fully useful.

This is something we've been talking about here but don't have much
concrete to say for mlx5 yet.

The device still has to self-protect itself against a corrupted
migration stream impacting integrity, of course.

IIRC qemu has a nice spot to put this in the existing protocol.

Just overall, now that PRECOPY is optional, we should avoid using it
without a good reason. The driver implementation does have a cost.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 17:03         ` Jason Gunthorpe
@ 2022-02-02 19:05           ` Joao Martins
  2022-02-03 15:18             ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-02 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Shameerali Kolothum Thodi
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, mgurtovoy,
	Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/2/22 17:03, Jason Gunthorpe wrote:
> On Wed, Feb 02, 2022 at 04:10:06PM +0000, Shameerali Kolothum Thodi wrote:
>>> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
>>> Sent: 02 February 2022 15:40
>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>>> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>>> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
>>> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
>>> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
>>> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
>>> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
>>> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
>>>
>>> On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi
>>> wrote:
>>>
>>>>> There are few topics to consider:
>>>>>  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
>>>>>    sense for this driver?
>>>>
>>>> I think it will be STOP_COPY only for now. We might have PRECOPY
>>>> feature once we have the SMMUv3 HTTU support in future.
>>>
>>> HTTU is the dirty tracking feature? To be clear VFIO migration support for
>>> PRECOPY has nothing to do with IOMMU based dirty page tracking.
>>
>> Yes, it is based on the IOMMU hardware dirty bit management support.
>> A RFC was posted sometime back,
>> https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/
> 
> Yes, I saw that. I was hoping to have a discussion on this soon about

Sorry to snatch the thread, I was thinking in doing the same this week,
as I was playing around that area, and wanted to ask the community... but
since you mentioned it :D

> how to integrate that with the iommufd work, which I hope will allow
> that series, and the other IOMMU drivers that can support this to be
> merged..

The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
there, but TBH I am not up to speed on iommu-fd yet so I missed something
obvious for sure. When you say 'integrate that with the iommufd' can you
expand on that?

Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
/internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
VMM to scan iommu pagetables itself and see what was marked dirty or not?

My over-simplistic/naive view was that the proposal in the link above sounded a lot
simpler. While iommu-fd had more longevity for many other usecases outside dirty tracking, no?

I have a PoC-ish using the interface in the link, with AMD IOMMU dirty bit supported
(including Qemu emulated amd-iommu for folks lacking the hardware). Albeit the
eager-spliting + collapsing of IOMMU hugepages is not yet done there, and I wanted to play
around the emulated intel-iommu SLADS from specs looks quite similar. Happy to join
existing effort anyways.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-02 19:05           ` Joao Martins
@ 2022-02-03 15:18             ` Jason Gunthorpe
  2022-02-04 19:53               ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-03 15:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> On 2/2/22 17:03, Jason Gunthorpe wrote:

> > how to integrate that with the iommufd work, which I hope will allow
> > that series, and the other IOMMU drivers that can support this to be
> > merged..
> 
> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> obvious for sure. When you say 'integrate that with the iommufd' can you
> expand on that?

The general idea is that iommufd is the place to put all the iommu
driver uAPI for consumption by userspace. The IOMMU feature of dirty
tracking would belong there.

So, some kind of API needs to be designed to meet the needs of the
IOMMU drivers.

> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
> VMM to scan iommu pagetables itself and see what was marked dirty or
> not?

iommufd and VFIO container's don't co-exist, either iommufd is
providing the IOMMU interface, or the current type 1 code - not both
together.

iommfd current approach presents the same ABI as the type1 container
as compatability, and it is a possible direction to provide the
iommu_domain stored dirty bits through that compat API.

But, as you say, it looks unnatural and inefficient when the domain
itself is storing the dirty bits inside the IOPTE.

It need some study I haven't got into yet :)

> My over-simplistic/naive view was that the proposal in the link
> above sounded a lot simpler. While iommu-fd had more longevity for
> many other usecases outside dirty tracking, no?

I'd prefer we don't continue to hack on the type1 code if iommufd is
expected to take over in this role - especially for a multi-vendor
feature like dirty tracking.

It is actually a pretty complicated topic because migration capable
PCI devices are also include their own dirty tracking HW, all this
needs to be harmonized somehow. VFIO proposed to squash everything
into the container code, but I've been mulling about having iommufd
only do system iommu and push the PCI device internal tracking over to
VFIO.

> I have a PoC-ish using the interface in the link, with AMD IOMMU
> dirty bit supported (including Qemu emulated amd-iommu for folks
> lacking the hardware). Albeit the eager-spliting + collapsing of
> IOMMU hugepages is not yet done there, and I wanted to play around
> the emulated intel-iommu SLADS from specs looks quite similar. Happy
> to join existing effort anyways.

This sounds great, I would love to bring the AMD IOMMU along with a
dirty tracking implementation! Can you share some patches so we can
see what the HW implementation looks like?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-03 15:18             ` Jason Gunthorpe
@ 2022-02-04 19:53               ` Joao Martins
  2022-02-04 23:07                 ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-04 19:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/3/22 15:18, Jason Gunthorpe wrote:
> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>> how to integrate that with the iommufd work, which I hope will allow
>>> that series, and the other IOMMU drivers that can support this to be
>>> merged..
>>
>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>> obvious for sure. When you say 'integrate that with the iommufd' can you
>> expand on that?
> 
> The general idea is that iommufd is the place to put all the iommu
> driver uAPI for consumption by userspace. The IOMMU feature of dirty
> tracking would belong there.
> 
> So, some kind of API needs to be designed to meet the needs of the
> IOMMU drivers.
> 
/me nods

I am gonna assume below is the most up-to-date to iommufd (as you pointed
out in another thread IIRC):

  https://github.com/jgunthorpe/linux iommufd

Let me know if it's not :)

>> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
>> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
>> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
>> VMM to scan iommu pagetables itself and see what was marked dirty or
>> not?
> 
> iommufd and VFIO container's don't co-exist, either iommufd is
> providing the IOMMU interface, or the current type 1 code - not both
> together.
> 
> iommfd current approach presents the same ABI as the type1 container
> as compatability, and it is a possible direction to provide the
> iommu_domain stored dirty bits through that compat API.
> 
> But, as you say, it looks unnatural and inefficient when the domain
> itself is storing the dirty bits inside the IOPTE.
> 
How much is this already represented as the io-pgtable in IOMMU internal kAPI
(if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

> It need some study I haven't got into yet :)
> 
Heh :)

Depending on how easy it is to obtain full extent of IO pagetables via iommu_fd
and whether userspace code can scan the dirty bits on its own ... then potentially
VMM/process can more efficiently scan the dirtied set? But if some layer needs to
somehow mediate between the vendor IOPTE representation and an UAPI IOPTE representation,
to be able to make that delegation to userspace ... then maybe both might be inefficient?
I didn't see how iommu-fd would abstract the IOPTEs lookup as far as I glanced through the
code, perhaps that's another ioctl().

>> My over-simplistic/naive view was that the proposal in the link
>> above sounded a lot simpler. While iommu-fd had more longevity for
>> many other usecases outside dirty tracking, no?
> 
> I'd prefer we don't continue to hack on the type1 code if iommufd is
> expected to take over in this role - especially for a multi-vendor
> feature like dirty tracking.
> 
But what strikes /specifically/ on the dirty bit feature is that it looks
simpler with the current VFIO, the heavy lifting seems to be
mostly on the IOMMU vendor. The proposed API above for VFIO looking at
the container (small changes), and IOMMU vendor would do most of it:

* Toggling API for start/stop dirty tracking
* API to get the IOVAs dirtied
* API to clear the IOVAs dirtied
[this last one I am not sure it is needed as the clear could be done as we scan
 the ones, thus minimize TLB flush cost if these are separate]

At the same time, what particularly scares me perf-wise (for the device being migrated)
... is the fact that we need to dynamically split and collapse page tables to
increase the granularity of which we track. In the above interface it splits/collapses
when you turn on/off the dirty tracking (respectively). That's *probably* where we
need more flexibility, not sure.

> It is actually a pretty complicated topic because migration capable
> PCI devices are also include their own dirty tracking HW, all this
> needs to be harmonized somehow. 

Do you have thoughts on what such device-dirty interface could look like?
(Perhaps too early to poke while the FSM/UAPI is being worked out)

I was wondering if container has a dirty scan/sync callback funnelled
by a vendor IOMMU ops implemented (as Shameerali patches proposed), and vfio vendor driver
provides one per device. Or propagate the dirty tracking API to vendor vfio driver[*]. The
reporting of the dirtying, though, looks hazzy to achieve if you try to make
it uniform even to userspace. Perhaps with iommu-fd you're thinking to mmap()
the dirty region back to userspace, or an iommu-fd ioctl() updates the PTEs,
while letting the kernel clear the dirty status via the mmap() object. And that
would be the common API regardless of dirty-hw scheme. Anyway, just thinking
out loud.

[*] considering the device may choose where to place its tracking storage, and
which scheme (bitmap, ring, etc) it might be.

> VFIO proposed to squash everything
> into the container code, but I've been mulling about having iommufd
> only do system iommu and push the PCI device internal tracking over to
> VFIO.
> 

Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
not sure yet how much one can we 'offload' to a generic layer, at least
compared with this other proposal.

>> I have a PoC-ish using the interface in the link, with AMD IOMMU
>> dirty bit supported (including Qemu emulated amd-iommu for folks
>> lacking the hardware). Albeit the eager-spliting + collapsing of
>> IOMMU hugepages is not yet done there, and I wanted to play around
>> the emulated intel-iommu SLADS from specs looks quite similar. Happy
>> to join existing effort anyways.
> 
> This sounds great, I would love to bring the AMD IOMMU along with a
> dirty tracking implementation! Can you share some patches so we can
> see what the HW implementation looks like?

Oh yes for sure! As I said I'm happy to help&implement along.
I would really love to leverage the IOMMU feature, as that relieves the
migrateable PCI device having to do the dirty tracking themselves.

And well, we seem to be getting there -- spec-wise everybody has that
feature *at least* documented :)

Give me some time (few days only, as I gotta sort some things) and I'll
respond here as follow up with link to a branch with the WIP/PoC patches.

Summing up a couple of remarks on hw, hopefully they enlight:

1) On AMD the feature is advertised by their extended feature register
as supported or not. On Intel, same in its equivalent (ECAP). On course
different bits on different IOMMUs registers. If it is supported, the
IOMMU updates (when activated) two bits per page table entry to indicate
'access' and 'dirty'. Slightly different page table format bit-wise on
where access/dirty is located (between the two vendors).

2) Change protection domain flags to enable the dirty/access tracking.
On AMD, it's the DTE flags (a new flag for dirty tracking).
[Intel does this in the PASID table entry]

3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
bit is cheap, clearing them is 'expensive' because one needs to flush
IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
operations to actually update the Access/Dirty bit in concurrency with
the CPU. The AMD manuals are a tad misleading as they talk about marking
non-present, but that would be catastrophic for migration as it would
mean a DMA target abort for the PCI device, unless I missed something obvious.
In any case, this means that the dirty bit *clearing* needs to be
batched as much as possible, to amortize the cost of flushing the IOTLB.
This is the same for Intel *IIUC*.

4) Adjust the granularity of pagetables in place:
[This item wasn't done, but it is generic to any IOMMU because it
is mostly the ability to split existing IO pages in place.]

4.a) Eager splitting and late collapsing IO hugepages -- essentially
to have tracking at finer granularity. as you know generally the
IOMMU map is generally done on the biggest IO page size, specially on
guests. We sadly can't write-protect or anything fancier that we
usually do so otherwise the PCI device gets a DMA target-abort :(
So splitting, when we start dirty tracking (what I mean by eager), and
do for the whole IO page table. When finished tracking, collapse the
pages (which should probably be deemed optional in the common case of
succeeding the migration). Guest expected to have higher IOTLB miss.
This part is particularly worrying for the IO guest performance, but
hopefully it doesn't turn out to be too bad.

4.b) Optionally starting dirtying earlier (at provisioning) and let
userspace dynamically split pages. This is to hopefully minimize the
IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
So dirty tracking would be enabled at creation of the protection domain
after the vfio container is set up, and we would use pages dirtied
as a indication of what needs to be splited. Problem is for IO page
sizes bigger than 1G, which might unnecessarily lead to marking too
much as dirty early on; but at least it's better than transferring the
whole set.

	Joao

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-04 19:53               ` Joao Martins
@ 2022-02-04 23:07                 ` Jason Gunthorpe
  2022-02-11 17:28                   ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-04 23:07 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
> On 2/3/22 15:18, Jason Gunthorpe wrote:
> > On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> >> On 2/2/22 17:03, Jason Gunthorpe wrote:
> >>> how to integrate that with the iommufd work, which I hope will allow
> >>> that series, and the other IOMMU drivers that can support this to be
> >>> merged..
> >>
> >> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> >> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> >> obvious for sure. When you say 'integrate that with the iommufd' can you
> >> expand on that?
> > 
> > The general idea is that iommufd is the place to put all the iommu
> > driver uAPI for consumption by userspace. The IOMMU feature of dirty
> > tracking would belong there.
> > 
> > So, some kind of API needs to be designed to meet the needs of the
> > IOMMU drivers.
> > 
> /me nods
> 
> I am gonna assume below is the most up-to-date to iommufd (as you pointed
> out in another thread IIRC):
> 
>   https://github.com/jgunthorpe/linux iommufd
> 
> Let me know if it's not :)

The iommufd part is pretty good, but there is hacky patch to hook it
into vfio that isn't there, if you want to actually try it.

> > But, as you say, it looks unnatural and inefficient when the domain
> > itself is storing the dirty bits inside the IOPTE.
> 
> How much is this already represented as the io-pgtable in IOMMU internal kAPI
> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

Which are you looking at? AFACIT there is no diry page support in
iommu_ops ?

> then potentially VMM/process can more efficiently scan the dirtied
> set? But if some layer needs to somehow mediate between the vendor
> IOPTE representation and an UAPI IOPTE representation, to be able to
> make that delegation to userspace ... then maybe both might be
> inefficient?  I didn't see how iommu-fd would abstract the IOPTEs
> lookup as far as I glanced through the code, perhaps that's another
> ioctl().

It is based around the same model as VFIO container - map/unmap of
user address space into the IOPTEs and the user space doesn't see
anything resembling a 'pte' - at least for kernel owned IO page
tables.

User space page tables will not be abstracted and the userspace must
know the direct HW format of the IOMMU they is being used.

> But what strikes /specifically/ on the dirty bit feature is that it looks
> simpler with the current VFIO, the heavy lifting seems to be
> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
> the container (small changes), and IOMMU vendor would do most of it:

It is basically the same, almost certainly the user API in iommufd
will be some 'get dirty bits' and 'unmap and give me the dirty bits'
just like vfio has.
 
The tricky details are around how do you manage this when the system
may have multiple things invovled capable, or not, of actualy doing
dirty tracking.

> At the same time, what particularly scares me perf-wise (for the
> device being migrated) ... is the fact that we need to dynamically
> split and collapse page tables to increase the granularity of which
> we track. In the above interface it splits/collapses when you turn
> on/off the dirty tracking (respectively). That's *probably* where we
> need more flexibility, not sure.

For sure that is a particularly big adventure in the iommu driver..

> Do you have thoughts on what such device-dirty interface could look like?
> (Perhaps too early to poke while the FSM/UAPI is being worked out)

I've been thinking the same general read-and-clear of a dirty
bitmap. It matches nicely the the KVM interface.

> I was wondering if container has a dirty scan/sync callback funnelled
> by a vendor IOMMU ops implemented (as Shameerali patches proposed), 

Yes, this is almost certainly how the in-kernel parts will look

> and vfio vendor driver provides one per device. 

But this is less clear..

> Or propagate the dirty tracking API to vendor vfio driver[*]. 
> [*] considering the device may choose where to place its tracking storage, and
> which scheme (bitmap, ring, etc) it might be.

This has been my thinking, yes

> The reporting of the dirtying, though, looks hazzy to achieve if you
> try to make it uniform even to userspace. Perhaps with iommu-fd
> you're thinking to mmap() the dirty region back to userspace, or an
> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
> the dirty status via the mmap() object. And that would be the common
> API regardless of dirty-hw scheme. Anyway, just thinking out loud.

My general thinking has be that iommufd would control only the system
IOMMU hardware. The FD interface directly exposes the iommu_domain as
a manipulable object, so I'd imagine making userspace have a simple
1:1 connection to the iommu_ops of a single iommu_domain.

Doing this avoids all the weirdo questions about what do you do if
there is non-uniformity in the iommu_domain's.

Keeping with that theme the vfio_device would provide a similar
interface, on its own device FD.

I don't know if mmap should be involed here, the dirty bitmaps are not
so big, I suspect a simple get_user_pages_fast() would be entirely OK.

> > VFIO proposed to squash everything
> > into the container code, but I've been mulling about having iommufd
> > only do system iommu and push the PCI device internal tracking over to
> > VFIO.
> > 
> 
> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
> not sure yet how much one can we 'offload' to a generic layer, at least
> compared with this other proposal.

Yes, I expect there is very little generic code here if we go this
way. The generic layer is just marshalling the ioctl(s) to the iommu
drivers. Certainly not providing storage or anything/

> Give me some time (few days only, as I gotta sort some things) and I'll
> respond here as follow up with link to a branch with the WIP/PoC patches.

Great!
  
> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
> bit is cheap, clearing them is 'expensive' because one needs to flush
> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
> operations to actually update the Access/Dirty bit in concurrency with
> the CPU. The AMD manuals are a tad misleading as they talk about marking
> non-present, but that would be catastrophic for migration as it would
> mean a DMA target abort for the PCI device, unless I missed something obvious.
> In any case, this means that the dirty bit *clearing* needs to be
> batched as much as possible, to amortize the cost of flushing the IOTLB.
> This is the same for Intel *IIUC*.

You have to mark it as non-present to do the final read out if
something unmaps while the tracker is on - eg emulating a viommu or
something. Then you mark non-present, flush the iotlb and read back
the dirty bit.

Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
then read and clear them.

> 4) Adjust the granularity of pagetables in place:
> [This item wasn't done, but it is generic to any IOMMU because it
> is mostly the ability to split existing IO pages in place.]

This seems like it would be some interesting amount of driver work,
but yes it could be a generic new iommu_domina op.

> 4.b) Optionally starting dirtying earlier (at provisioning) and let
> userspace dynamically split pages. This is to hopefully minimize the
> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
> So dirty tracking would be enabled at creation of the protection domain
> after the vfio container is set up, and we would use pages dirtied
> as a indication of what needs to be splited. Problem is for IO page
> sizes bigger than 1G, which might unnecessarily lead to marking too
> much as dirty early on; but at least it's better than transferring the
> whole set.

I'm not sure running with dirty tracking permanently on would be good
for guest performance either.

I'd suspect you'd be better to have a warm up period where you track
dirtys and split down pages.

It is interesting, this is a possible reason why device dirty tracking
might actually perfom better because it can operate at a different
granularity from the system iommu without disrupting the guest DMA
performance.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-04 23:07                 ` Jason Gunthorpe
@ 2022-02-11 17:28                   ` Joao Martins
  2022-02-11 17:49                     ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-11 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/4/22 23:07, Jason Gunthorpe wrote:
> On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
>> On 2/3/22 15:18, Jason Gunthorpe wrote:
>>> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>>>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>>>> how to integrate that with the iommufd work, which I hope will allow
>>>>> that series, and the other IOMMU drivers that can support this to be
>>>>> merged..
>>>>
>>>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>>>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>>>> obvious for sure. When you say 'integrate that with the iommufd' can you
>>>> expand on that?
>>>
>>> The general idea is that iommufd is the place to put all the iommu
>>> driver uAPI for consumption by userspace. The IOMMU feature of dirty
>>> tracking would belong there.
>>>
>>> So, some kind of API needs to be designed to meet the needs of the
>>> IOMMU drivers.
>>>
>> /me nods
>>
>> I am gonna assume below is the most up-to-date to iommufd (as you pointed
>> out in another thread IIRC):
>>
>>   https://github.com/jgunthorpe/linux iommufd
>>
>> Let me know if it's not :)
> 
> The iommufd part is pretty good, but there is hacky patch to hook it
> into vfio that isn't there, if you want to actually try it.
> 
OK -- Thanks for the heads up.

>>> But, as you say, it looks unnatural and inefficient when the domain
>>> itself is storing the dirty bits inside the IOPTE.
>>
>> How much is this already represented as the io-pgtable in IOMMU internal kAPI
>> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
>> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(
> 
> Which are you looking at? AFACIT there is no diry page support in
> iommu_ops ?
> 
There is not, indeed. But how you manage IO page tables felt *somewhat
hinting* to io-pgtable -- but maybe it's a stretch. The dirty
page support builds on top of that, as ARM IOMMUs use iommu io-pgtable.

In the series in the link, it pretty much boils down to
essentially 3 iommu_ops introduced:

* support_dirty_log(domain)
  - Checks whether a given IOMMU domain supports dirty tracking
* switch_dirty_log(domain, enable, iova, size, prot)
  - Enables/Disables IOMMU dirty tracking on an IOVA range
  (The range I suppose it's because ARM supports range tracking, contrary
   to Intel and AMD which affect all the protection domain pg tables)
* sync_dirty_log(domain, iova, size, &bitmap, ...)

The IO pgtables, have an interesting set of APIs strictly for page table
manipulation, and they are sort of optional (so far ARM and AMD). Those
include, map, unmap, iova_to_pfn translation. So, the dirty log supports
for ARM builds on that and adds split, merge (for splitting and collapsing
IO pagtables), and finally sync_dirty_log() which is where we get the
bitmap of dirtied pages *updated*.

But well, at the end of the day for an IOMMU driver the domain ops are
the important stuff, maybe IO pgtable framework isn't as critical
(Intel for example, doesn't use that at all).

>> then potentially VMM/process can more efficiently scan the dirtied
>> set? But if some layer needs to somehow mediate between the vendor
>> IOPTE representation and an UAPI IOPTE representation, to be able to
>> make that delegation to userspace ... then maybe both might be
>> inefficient?  I didn't see how iommu-fd would abstract the IOPTEs
>> lookup as far as I glanced through the code, perhaps that's another
>> ioctl().
> 
> It is based around the same model as VFIO container - map/unmap of
> user address space into the IOPTEs and the user space doesn't see
> anything resembling a 'pte' - at least for kernel owned IO page
> tables.
> 
Ah! I misinterpreted what you were saying.

> User space page tables will not be abstracted and the userspace must
> know the direct HW format of the IOMMU they is being used.
> 
That's countering earlier sentence? Because hw format (for me at least)
means PTE and protection domain config format too. And if iommufd
abstracts the HW format, modelling after the IOMMU domain and its ops,
then it's abstracting userspace from those details e.g. it works over
IOVAs, but not over its vendor representation of how that IOVA is set up.

I am probably being dense.

>> But what strikes /specifically/ on the dirty bit feature is that it looks
>> simpler with the current VFIO, the heavy lifting seems to be
>> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
>> the container (small changes), and IOMMU vendor would do most of it:
> 
> It is basically the same, almost certainly the user API in iommufd
> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> just like vfio has.
> 

The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
migration flow. Considering that we pause vCPUs, and we quiesce the
VFs, I guess we are mostly supposed to have a guarantee that no DMA is
supposed to be happening (excluding P2P)? So perhaps the unmap is
unneeded after quiescing the VF. Which probably is important
migration-wise considering the unmap is really what's expensive (e.g.
needs IOTLB flush) and that would add up between stop copy and the time it
would resume on the destination, if we need to wait to unmap the whole IOVA
from source.

> The tricky details are around how do you manage this when the system
> may have multiple things invovled capable, or not, of actualy doing
> dirty tracking.
> 
Yeap.

>> At the same time, what particularly scares me perf-wise (for the
>> device being migrated) ... is the fact that we need to dynamically
>> split and collapse page tables to increase the granularity of which
>> we track. In the above interface it splits/collapses when you turn
>> on/off the dirty tracking (respectively). That's *probably* where we
>> need more flexibility, not sure.
> 
> For sure that is a particularly big adventure in the iommu driver..
> 
Yeah.

Me playing around was with VFIO IOMMU hugepages disabled, and this
was something I am planning on working when I get back to this.

>> Do you have thoughts on what such device-dirty interface could look like?
>> (Perhaps too early to poke while the FSM/UAPI is being worked out)
> 
> I've been thinking the same general read-and-clear of a dirty
> bitmap. It matches nicely the the KVM interface.
> 

A bitmap I think it is a good start.

We have a bitmap based interface in KVM, but there's also a recent ring
interface for dirty tracking, which has some probably more determinism than
a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
and breaking its entries on-demand IIRC, whereas Intel resembles something
closer to a 512 entries 'ring' with VMX PML, which tells what has been
dirtied.

>> I was wondering if container has a dirty scan/sync callback funnelled
>> by a vendor IOMMU ops implemented (as Shameerali patches proposed), 
> 
> Yes, this is almost certainly how the in-kernel parts will look
> 
/me nods

>> and vfio vendor driver provides one per device. 
> 
> But this is less clear..
> 
>> Or propagate the dirty tracking API to vendor vfio driver[*]. 
>> [*] considering the device may choose where to place its tracking storage, and
>> which scheme (bitmap, ring, etc) it might be.
> 
> This has been my thinking, yes
> 
/me nods

>> The reporting of the dirtying, though, looks hazzy to achieve if you
>> try to make it uniform even to userspace. Perhaps with iommu-fd
>> you're thinking to mmap() the dirty region back to userspace, or an
>> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
>> the dirty status via the mmap() object. And that would be the common
>> API regardless of dirty-hw scheme. Anyway, just thinking out loud.
> 
> My general thinking has be that iommufd would control only the system
> IOMMU hardware. The FD interface directly exposes the iommu_domain as
> a manipulable object, so I'd imagine making userspace have a simple
> 1:1 connection to the iommu_ops of a single iommu_domain.
> 
> Doing this avoids all the weirdo questions about what do you do if
> there is non-uniformity in the iommu_domain's.
> 
> Keeping with that theme the vfio_device would provide a similar
> interface, on its own device FD.
> 
> I don't know if mmap should be involed here, the dirty bitmaps are not
> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> 
Considering that is 32MB of a bitmap per TB maybe it is cheap.

>>> VFIO proposed to squash everything
>>> into the container code, but I've been mulling about having iommufd
>>> only do system iommu and push the PCI device internal tracking over to
>>> VFIO.
>>>
>>
>> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
>> not sure yet how much one can we 'offload' to a generic layer, at least
>> compared with this other proposal.
> 
> Yes, I expect there is very little generic code here if we go this
> way. The generic layer is just marshalling the ioctl(s) to the iommu
> drivers. Certainly not providing storage or anything/
> 
Gotcha. Perhaps I need to sort out how this would work with iommufd.

But I guess we can hash out how the iommu ops would look like?

We seem to be on the same page, at least that's the feeling I get.

>> Give me some time (few days only, as I gotta sort some things) and I'll
>> respond here as follow up with link to a branch with the WIP/PoC patches.
> 
> Great!
>
Here it is. "A few days" turn into a week sorry :/

https://github.com/jpemartins/qemu  amd-iommu-hdsup-wip
https://github.com/jpemartins/linux  amd-vfio-iommu-hdsup-wip

Note, it is an early PoC. I still need to get the split/collapse thing going,
and fix the FIXMEs there, and have a second good look at the iommu page tables.

The Qemu link above has one patch to get device-dirty bitmaps via HMP to be able to
measure the rate by hw dirties and another patch for emulated amd iommu support too,
should you lack the hardware to play (I usually do on both, more for people to be
able to coverage test it). I sadly cannot do the end-to-end migration test.

>> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
>> bit is cheap, clearing them is 'expensive' because one needs to flush
>> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
>> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
>> operations to actually update the Access/Dirty bit in concurrency with
>> the CPU. The AMD manuals are a tad misleading as they talk about marking
>> non-present, but that would be catastrophic for migration as it would
>> mean a DMA target abort for the PCI device, unless I missed something obvious.
>> In any case, this means that the dirty bit *clearing* needs to be
>> batched as much as possible, to amortize the cost of flushing the IOTLB.
>> This is the same for Intel *IIUC*.
> 
> You have to mark it as non-present to do the final read out if
> something unmaps while the tracker is on - eg emulating a viommu or
> something. Then you mark non-present, flush the iotlb and read back
> the dirty bit.
> 
You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
too :) without needing VMM intervention (that's also not supported
in Linux).

But the mark as NP for viommu is something I haven't investigated.

> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> then read and clear them.
> 
It's the other way around AIUI. The dirty bits are sticky, so you flush
the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
again on the next memory transaction (or ATS translation).

I am not entirely sure we need to unmap + mark non-present for non-viommu
That would actually mean something is not properly quiscieing the VF DMA.
Maybe we should .. to gate whether if we should actually continue with LM
if something kept doing DMA when it shouldn't have.

Also, considering the dirty-bitmap check happens on very-very long IOVA extents --
some measurements over how long the scans need to be done. Mostly
validated how much a much NIC was dirtying with traditional networking tools.

>> 4) Adjust the granularity of pagetables in place:
>> [This item wasn't done, but it is generic to any IOMMU because it
>> is mostly the ability to split existing IO pages in place.]
> 
> This seems like it would be some interesting amount of driver work,
> but yes it could be a generic new iommu_domina op.
> 
I am slightly at odds that .split and .collapse at .switch() are enough.
But, with iommu if we are working on top of an IOMMU domain object and
.split and .collapse are iommu_ops perhaps that looks to be enough
flexibility to give userspace the ability to decide what it wants to
split, if it starts eargerly/warming-up tracking dirty pages.

The split and collapsing is something I wanted to work on next, to get
to a stage closer to that of an RFC on the AMD side.

>> 4.b) Optionally starting dirtying earlier (at provisioning) and let
>> userspace dynamically split pages. This is to hopefully minimize the
>> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
>> So dirty tracking would be enabled at creation of the protection domain
>> after the vfio container is set up, and we would use pages dirtied
>> as a indication of what needs to be splited. Problem is for IO page
>> sizes bigger than 1G, which might unnecessarily lead to marking too
>> much as dirty early on; but at least it's better than transferring the
>> whole set.
> 
> I'm not sure running with dirty tracking permanently on would be good
> for guest performance either.
> 
Hmmm, judging how the IOMMU works I am not sure this is particularly
affecting DMA performance (not sure yet about RDMA, it's something I
curious to see how it gets to perform with 4K IOPTEs, and with dirty
tracking always enabled). Considering how the bits are sticky, and
unless CPU clears it, it's short of a nop? Unless of course the checking
for A^D during an atomic memory transaction is expensive. Needs some
performance testing nonetheless.

The IOTLB flushes pre/after clearing, tough, would visibly hurt DMA
performance :(

I forgot to mention, but the early enablement of IOMMU dirty tracking
was also meant to fully know since guest creation what needs to be
sent to the destination. Otherwise, wouldn't we need to send the whole
pinned set to destination, if we only start tracking dirty pages during
migration?

> I'd suspect you'd be better to have a warm up period where you track
> dirtys and split down pages.
> 
Yeah, but right now the splitting is done when switching on and off
dirty tracking, for the entirety of page tables.

And well at least on AMD, the flag is attached to a protection domain.
For SMMUv3. it appears that it's per IOVA range if I read the original code
correctly (which is interesting contrast to AMD/Intel)

Also, this is probably a differentiator for iommufd, if we were to provide
split and collapse semantics to IOMMU domain objects that userspace can use.
That would get more freedom, to switch dirty-tracking, and then do the warm
up thingie and piggy back on what it wants to split before migration.
perhaps the switch() should get some flag to pick where to split, I guess.


> It is interesting, this is a possible reason why device dirty tracking
> might actually perfom better because it can operate at a different
> granularity from the system iommu without disrupting the guest DMA
> performance.

I agree -- and the other advantage is that you don't depend on platform
support for dirty tracking, which is largely nonexistent.

Albeit I personally like to have the IOMMU support because it also
frees endpoints to do rather expensive dirty tracking, which is a rather
even more exclusive feature to have. It's trading off max performance for
LM commodity over, where DMA performance can be less (momentarily).

	Joao

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-11 17:28                   ` Joao Martins
@ 2022-02-11 17:49                     ` Jason Gunthorpe
  2022-02-11 21:43                       ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-11 17:49 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:

> But well, at the end of the day for an IOMMU driver the domain ops are
> the important stuff, maybe IO pgtable framework isn't as critical
> (Intel for example, doesn't use that at all).

Right, it doesn't matter what library was used to implement the
domain..

> > User space page tables will not be abstracted and the userspace must
> > know the direct HW format of the IOMMU they is being used.
> > 
> That's countering earlier sentence? Because hw format (for me at least)
> means PTE and protection domain config format too. And if iommufd
> abstracts the HW format, modelling after the IOMMU domain and its ops,
> then it's abstracting userspace from those details e.g. it works over
> IOVAs, but not over its vendor representation of how that IOVA is set up.
> 
> I am probably being dense.

It is both ways, one kind of domain provides a kernel supplied
map/unmap that implements the IO PTE manipulation in kernel memory

But if you want the IOPTE to be in user memory then user must
read/write it and it cannot use that - so a user domain will not have
map/unmap.

> > It is basically the same, almost certainly the user API in iommufd
> > will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> > just like vfio has.
> > 
> 
> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
> migration flow. 

It is essential to implement any kind of viommu behavior where
map/unmap is occuring while the dirty tracking is running. It should
never make a difference except in some ugly edge cases where the dma
and unmap are racing.

> supposed to be happening (excluding P2P)? So perhaps the unmap is
> unneeded after quiescing the VF.

Yes, you don't need to unmap for migration, a simple fully synchronous
read and clear is sufficient. But that final read, while DMA is quite,
must obtain the latest dirty bit data.
 
> We have a bitmap based interface in KVM, but there's also a recent ring
> interface for dirty tracking, which has some probably more determinism than
> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
> and breaking its entries on-demand IIRC, whereas Intel resembles something
> closer to a 512 entries 'ring' with VMX PML, which tells what has been
> dirtied.

KVM has an advantage that it can throttle the rate of dirty generation
so it can rely on logging. The IOMMU can't throttle DMA, so we are
stuck with a bitmap

> > I don't know if mmap should be involed here, the dirty bitmaps are not
> > so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> > 
> Considering that is 32MB of a bitmap per TB maybe it is cheap.

Rigt. GUP fasting a couple huge pages is nothing compared to scanning
1TB of IO page table.

> >> Give me some time (few days only, as I gotta sort some things) and I'll
> >> respond here as follow up with link to a branch with the WIP/PoC patches.
> > 
> > Great!
> >
> Here it is. "A few days" turn into a week sorry :/
> 
> https://github.com/jpemartins/qemu  amd-iommu-hdsup-wip
> https://github.com/jpemartins/linux  amd-vfio-iommu-hdsup-wip
> 
> Note, it is an early PoC. I still need to get the split/collapse thing going,
> and fix the FIXMEs there, and have a second good look at the iommu page tables.

Oh I'll try to look a this thanks

> > You have to mark it as non-present to do the final read out if
> > something unmaps while the tracker is on - eg emulating a viommu or
> > something. Then you mark non-present, flush the iotlb and read back
> > the dirty bit.
> > 
> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
> too :) without needing VMM intervention (that's also not supported
> in Linux).

I'm sure, but dirty tracking has to happen on the kernel owned page
table, not the user owned one I think..

> > Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> > then read and clear them.
> > 
> It's the other way around AIUI. The dirty bits are sticky, so you flush
> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
> again on the next memory transaction (or ATS translation).

I guess it depends on how the HW works, if it writes the dirty bit
back to ram instantly on the first dirty, or if it only writes the
dirty bit when flushing the iotlb.

In any case you have to synchronize with the HW in some way to ensure
that all dirty bits are 'pulled back' into system memory to resolve
races (ie you need the iommu HW to release and the CPU to acquire on
the dirty data) before trying to read them, at least for the final
quieced system read.

> I am not entirely sure we need to unmap + mark non-present for non-viommu
> That would actually mean something is not properly quiscieing the VF DMA.
> Maybe we should .. to gate whether if we should actually continue with LM
> if something kept doing DMA when it shouldn't have.

It is certainly an edge case. A device would be misbehaving to
continue DMA.

> > This seems like it would be some interesting amount of driver work,
> > but yes it could be a generic new iommu_domina op.
> 
> I am slightly at odds that .split and .collapse at .switch() are enough.
> But, with iommu if we are working on top of an IOMMU domain object and
> .split and .collapse are iommu_ops perhaps that looks to be enough
> flexibility to give userspace the ability to decide what it wants to
> split, if it starts eargerly/warming-up tracking dirty pages.
> 
> The split and collapsing is something I wanted to work on next, to get
> to a stage closer to that of an RFC on the AMD side.

split/collapse seems kind of orthogonal to me it doesn't really
connect to dirty tracking other than being mostly useful during dirty
tracking.

And I wonder how hard split is when trying to atomically preserve any
dirty bit..

> Hmmm, judging how the IOMMU works I am not sure this is particularly
> affecting DMA performance (not sure yet about RDMA, it's something I
> curious to see how it gets to perform with 4K IOPTEs, and with dirty
> tracking always enabled). Considering how the bits are sticky, and
> unless CPU clears it, it's short of a nop? Unless of course the checking
> for A^D during an atomic memory transaction is expensive. Needs some
> performance testing nonetheless.

If you leave the bits all dirty then why do it? The point is to see
the dirties, which means the iommu is generating a workload of dirty
cachelines while operating it didn't do before.

> I forgot to mention, but the early enablement of IOMMU dirty tracking
> was also meant to fully know since guest creation what needs to be
> sent to the destination. Otherwise, wouldn't we need to send the whole
> pinned set to destination, if we only start tracking dirty pages during
> migration?

? At the start of migration you have to send everything. Dirty
tracking is intended to allow the VM to be stopped and then have only
a small set of data to xfer.
 
> Also, this is probably a differentiator for iommufd, if we were to provide
> split and collapse semantics to IOMMU domain objects that userspace can use.
> That would get more freedom, to switch dirty-tracking, and then do the warm
> up thingie and piggy back on what it wants to split before migration.
> perhaps the switch() should get some flag to pick where to split, I guess.

Yes, right. Split/collapse should be completely seperate from dirty
tracking.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-11 17:49                     ` Jason Gunthorpe
@ 2022-02-11 21:43                       ` Joao Martins
  2022-02-12  0:01                         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-11 21:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/11/22 17:49, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:
>>> It is basically the same, almost certainly the user API in iommufd
>>> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
>>> just like vfio has.
>>>
>>
>> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
>> migration flow. 
> 
> It is essential to implement any kind of viommu behavior where
> map/unmap is occuring while the dirty tracking is running. It should
> never make a difference except in some ugly edge cases where the dma
> and unmap are racing.
> 
/me nods

>> supposed to be happening (excluding P2P)? So perhaps the unmap is
>> unneeded after quiescing the VF.
> 
> Yes, you don't need to unmap for migration, a simple fully synchronous
> read and clear is sufficient. But that final read, while DMA is quite,
> must obtain the latest dirty bit data.
>

The final read doesn't need special logic in VFIO/IOMMU IUC
(maybe only in the VMM)

Anyways, the above paragraph matches my understanding.

>> We have a bitmap based interface in KVM, but there's also a recent ring
>> interface for dirty tracking, which has some probably more determinism than
>> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
>> and breaking its entries on-demand IIRC, whereas Intel resembles something
>> closer to a 512 entries 'ring' with VMX PML, which tells what has been
>> dirtied.
> 
> KVM has an advantage that it can throttle the rate of dirty generation
> so it can rely on logging. The IOMMU can't throttle DMA, so we are
> stuck with a bitmap
> 
Yeap, sadly :(

>>> I don't know if mmap should be involed here, the dirty bitmaps are not
>>> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
>>>
>> Considering that is 32MB of a bitmap per TB maybe it is cheap.
> 
> Rigt. GUP fasting a couple huge pages is nothing compared to scanning
> 1TB of IO page table.
> 
... With 4K PTEs which is even more ludricous expensive. Well yeah, a
lesser concern the mangling of the bitmap, when you put that way heh

>>> You have to mark it as non-present to do the final read out if
>>> something unmaps while the tracker is on - eg emulating a viommu or
>>> something. Then you mark non-present, flush the iotlb and read back
>>> the dirty bit.
>>>
>> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
>> too :) without needing VMM intervention (that's also not supported
>> in Linux).
> 
> I'm sure, but dirty tracking has to happen on the kernel owned page
> table, not the user owned one I think..
> 
The plumbing for the hw-accelerated vIOMMU is a little different that
a regular vIOMMU, at least IIUC host does not take an active part in the
GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
have time to fiddle with the SDM:

https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf


>>> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
>>> then read and clear them.
>>>
>> It's the other way around AIUI. The dirty bits are sticky, so you flush
>> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
>> again on the next memory transaction (or ATS translation).
> 
> I guess it depends on how the HW works, if it writes the dirty bit
> back to ram instantly on the first dirty, or if it only writes the
> dirty bit when flushing the iotlb.
> 
The manual says roughly that the update is visible to CPU as soon as the
it updates. Particularly from the IOMMU SDM:

"Note that the setting of accessed and dirty status bits in the page tables is visible to
both the CPU and the peripheral when sharing guest page tables. The IOMMU interlocked
operations to update A and D bits must be 64-bit operations and naturally aligned on a
64-bit boundary"

And ...

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the atomic lock with
no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

> In any case you have to synchronize with the HW in some way to ensure
> that all dirty bits are 'pulled back' into system memory to resolve
> races (ie you need the iommu HW to release and the CPU to acquire on
> the dirty data) before trying to read them, at least for the final
> quieced system read.
> 
/me nods

>>> This seems like it would be some interesting amount of driver work,
>>> but yes it could be a generic new iommu_domina op.
>>
>> I am slightly at odds that .split and .collapse at .switch() are enough.
>> But, with iommu if we are working on top of an IOMMU domain object and
>> .split and .collapse are iommu_ops perhaps that looks to be enough
>> flexibility to give userspace the ability to decide what it wants to
>> split, if it starts eargerly/warming-up tracking dirty pages.
>>
>> The split and collapsing is something I wanted to work on next, to get
>> to a stage closer to that of an RFC on the AMD side.
> 
> split/collapse seems kind of orthogonal to me it doesn't really
> connect to dirty tracking other than being mostly useful during dirty
> tracking.
> 
> And I wonder how hard split is when trying to atomically preserve any
> dirty bit..
> 
Would would it be hard? The D bit is supposed to be replicated when you
split to smaller page size.

>> Hmmm, judging how the IOMMU works I am not sure this is particularly
>> affecting DMA performance (not sure yet about RDMA, it's something I
>> curious to see how it gets to perform with 4K IOPTEs, and with dirty
>> tracking always enabled). Considering how the bits are sticky, and
>> unless CPU clears it, it's short of a nop? Unless of course the checking
>> for A^D during an atomic memory transaction is expensive. Needs some
>> performance testing nonetheless.
> 
> If you leave the bits all dirty then why do it? The point is to see
> the dirties, which means the iommu is generating a workload of dirty
> cachelines while operating it didn't do before.
> 
My thinking was that if it's dirtied and in the IOTLB most likely the
descriptor in the IOTLB is cached. And if you need to do a IOMMU page walk
to resolve an IOVA, perhaps the check for the A & D bits needing
to be updated is probably the least problem in this path. Naturally, if
it's not split, you have a much higher chance (e.g. with 1GB entries) to stay
in the IOTLB and just compare two bits *prior* to consider starting
the atomic operation to update the descriptor.

>> I forgot to mention, but the early enablement of IOMMU dirty tracking
>> was also meant to fully know since guest creation what needs to be
>> sent to the destination. Otherwise, wouldn't we need to send the whole
>> pinned set to destination, if we only start tracking dirty pages during
>> migration?
> 
> ? At the start of migration you have to send everything. Dirty
> tracking is intended to allow the VM to be stopped and then have only
> a small set of data to xfer.
>  
Right, that's how it works today.

This is just preemptive longterm thinking about the overal problem space (probably
unnecessary noise at this stage). Particularly whenever I need to migrate 1 to 2TB VMs.
Particular that the stage *prior* to precopy takes way too long to transfer the whole
memory. So I was thinking say only transfer the pages that are populated[*] in the
second-stage page tables (for the CPU) coupled with IOMMU tracking from the beginning
(prior to vcpus even entering). That could probably decrease 1024 1GB Dirtied IOVA
entries, to maybe only dirty a smaller subset, saving a whole bootload of time.

[*] VMs without VFIO would be even easier as the first stage page tables are non-present.

>> Also, this is probably a differentiator for iommufd, if we were to provide
>> split and collapse semantics to IOMMU domain objects that userspace can use.
>> That would get more freedom, to switch dirty-tracking, and then do the warm
>> up thingie and piggy back on what it wants to split before migration.
>> perhaps the switch() should get some flag to pick where to split, I guess.
> 
> Yes, right. Split/collapse should be completely seperate from dirty
> tracking.

Yeap.

I wonder if we could start progressing the dirty tracking as a first initial series and
then have the split + collapse handling as a second part? That would be quite
nice to get me going! :D

	Joao

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-11 21:43                       ` Joao Martins
@ 2022-02-12  0:01                         ` Jason Gunthorpe
  2022-02-14 13:34                           ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-12  0:01 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
> The plumbing for the hw-accelerated vIOMMU is a little different that
> a regular vIOMMU, at least IIUC host does not take an active part in the
> GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
> have time to fiddle with the SDM:
> 
> https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf

That looks like the same nesting everyone else is talking about.

I've been refering to this as 'user space page tables' to avoid a lot
of ambiguity becuase what it fundamentally does is allow userspace to
directly manage an IO page table - building the IO PTEs, issue
invalidations, etc.

To be secure the userspace page table is nested under a kernel owned
page table and all the IO PTE offsets/etc are understood to be within
the address space of the kernel owned page table.

When combined with KVM the userspace is just inside a VM.

> 1. Decodes the read and write intent from the memory access.
> 2. If P=0 in the page descriptor, fail the access.
> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
> 4. If the A or D bits need to be updated in the descriptor:

Ah, so the dirty update is actually atomic on the first write before
any DMA happens - and I suppose all of this happens when the entry is
first loaded into the IOTLB.

So the flush is to allow the IOTLB to see the cleared D bit..

> > split/collapse seems kind of orthogonal to me it doesn't really
> > connect to dirty tracking other than being mostly useful during dirty
> > tracking.
> > 
> > And I wonder how hard split is when trying to atomically preserve any
> > dirty bit..
> > 
> Would would it be hard? The D bit is supposed to be replicated when you
> split to smaller page size.

I guess it depends on how the 'acquire' is done, as the CPU has to
atomically replace a large entry with a pointer to a small entry,
flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
in the old entry then it has to sprinkle it into the new entries with
atomics.

> This is just preemptive longterm thinking about the overal problem
> space (probably unnecessary noise at this stage). Particularly
> whenever I need to migrate 1 to 2TB VMs.  Particular that the stage
> *prior* to precopy takes way too long to transfer the whole
> memory. So I was thinking say only transfer the pages that are
> populated[*] in the second-stage page tables (for the CPU) coupled
> with IOMMU tracking from the beginning (prior to vcpus even
> entering). That could probably decrease 1024 1GB Dirtied IOVA
> entries, to maybe only dirty a smaller subset, saving a whole
> bootload of time.

Oh, you want to effectively optimize zero page detection..

> I wonder if we could start progressing the dirty tracking as a first initial series and
> then have the split + collapse handling as a second part? That would be quite
> nice to get me going! :D

I think so, and I think we should. It is such a big problem space, it
needs to get broken up.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-12  0:01                         ` Jason Gunthorpe
@ 2022-02-14 13:34                           ` Joao Martins
  2022-02-14 14:06                             ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-14 13:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/12/22 00:01, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
>> 1. Decodes the read and write intent from the memory access.
>> 2. If P=0 in the page descriptor, fail the access.
>> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
>> 4. If the A or D bits need to be updated in the descriptor:
> 
> Ah, so the dirty update is actually atomic on the first write before
> any DMA happens - and I suppose all of this happens when the entry is
> first loaded into the IOTLB.
> 
Intel's equivalent feature suggests me that this works the same way there.

ARM update of ioptes looks to be slightly different[*] but this FEAT_BBM
in SMMUv3.2 makes it work in similar way to Intel/AMD. But I could be
misunderstanding something there.

[*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
and then write the new valid entry. Not sure I understood correctly that this
is the 'break-before-make' thingie.

> So the flush is to allow the IOTLB to see the cleared D bit..
> 
Right.

>>> split/collapse seems kind of orthogonal to me it doesn't really
>>> connect to dirty tracking other than being mostly useful during dirty
>>> tracking.
>>>
>>> And I wonder how hard split is when trying to atomically preserve any
>>> dirty bit..
>>>
>> Would would it be hard? The D bit is supposed to be replicated when you
>> split to smaller page size.
> 
> I guess it depends on how the 'acquire' is done, as the CPU has to
> atomically replace a large entry with a pointer to a small entry,
> flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
> in the old entry then it has to sprinkle it into the new entries with
> atomics.
> 
ISTR some mention to what we are chatting here in the IOMMU SDM:

When a non-default page size is used , software must OR the Dirty bits in
all of the replicated host PTEs used to map the page. The IOMMU does not
guarantee the Dirty bits are set in all of the replicated PTEs. Any portion
of the page may have been written even if the Dirty bit is set in only one
of the replicated PTEs.

>> I wonder if we could start progressing the dirty tracking as a first initial series and
>> then have the split + collapse handling as a second part? That would be quite
>> nice to get me going! :D
> 
> I think so, and I think we should. It is such a big problem space, it
> needs to get broken up.

OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
past except with the x86 support only[*]. And we poke holes there I guess.

[*] I might include Intel too, albeit emulated only.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-14 13:34                           ` Joao Martins
@ 2022-02-14 14:06                             ` Jason Gunthorpe
  2022-02-15 16:00                               ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-14 14:06 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:

> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
> and then write the new valid entry. Not sure I understood correctly that this
> is the 'break-before-make' thingie.

Doesn't that explode if the invalid entry is DMA'd to?

> When a non-default page size is used , software must OR the Dirty bits in
> all of the replicated host PTEs used to map the page. The IOMMU does not
> guarantee the Dirty bits are set in all of the replicated PTEs. Any portion
> of the page may have been written even if the Dirty bit is set in only one
> of the replicated PTEs.

I suspect that is talking about something else

> >> I wonder if we could start progressing the dirty tracking as a first initial series and
> >> then have the split + collapse handling as a second part? That would be quite
> >> nice to get me going! :D
> > 
> > I think so, and I think we should. It is such a big problem space, it
> > needs to get broken up.
> 
> OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
> past except with the x86 support only[*]. And we poke holes there I guess.
> 
> [*] I might include Intel too, albeit emulated only.

Like I said, I'd prefer we not build more on the VFIO type 1 code
until we have a conclusion for iommufd..

While returning the dirty data looks straight forward, it is hard to
see an obvious path to enabling and controlling the system iommu the
way vfio is now.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-14 14:06                             ` Jason Gunthorpe
@ 2022-02-15 16:00                               ` Joao Martins
  2022-02-15 16:21                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-15 16:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/14/22 14:06, Jason Gunthorpe wrote:
> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
> 
>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
>> and then write the new valid entry. Not sure I understood correctly that this
>> is the 'break-before-make' thingie.
> 
> Doesn't that explode if the invalid entry is DMA'd to?
> 
Yes, IIUC. Also, the manual has this note:

"Note: For example, to split a block into constituent granules
(or to merge a span of granules into an equivalent block), VMSA
requires the region to be made invalid, a TLB invalidate
performed, then to make the region take the new configuration.

Note: The requirement for a break-before-make sequence can cause
problems for unrelated I/O streams that might use addresses
overlapping a region of interest, because the I/O streams cannot
always be conveniently stopped and might not tolerate translation
faults. It is advantageous to perform live update of a block into
smaller translations, or a set of translations into a larger block
size."

Probably why the original SMMUv3.2 dirty track series requires FEAT_BBM
as it had to do in-place atomic updates to split/collapse IO pgtables.
Not enterily clear if HTTU Dirty access requires the same.

>>>> I wonder if we could start progressing the dirty tracking as a first initial series and
>>>> then have the split + collapse handling as a second part? That would be quite
>>>> nice to get me going! :D
>>>
>>> I think so, and I think we should. It is such a big problem space, it
>>> needs to get broken up.
>>
>> OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
>> past except with the x86 support only[*]. And we poke holes there I guess.
>>
>> [*] I might include Intel too, albeit emulated only.
> 
> Like I said, I'd prefer we not build more on the VFIO type 1 code
> until we have a conclusion for iommufd..
> 

I didn't quite understand what you mean by conclusion.

If by conclusion you mean the whole thing to be merged, how can the work be
broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
in terms of direction...

I can build on top of iommufd -- Just trying to understand how this is
going to work out.

> While returning the dirty data looks straight forward, it is hard to
> see an obvious path to enabling and controlling the system iommu the
> way vfio is now.

It seems strange to have a whole UAPI for userspace [*] meant to
return dirty data to userspace, when dirty right now means the whole
pinned page set and so copying the whole guest ... and the guest is
running so we might be racing with the device changing guest pages with the
VMM/CPU unaware of it. Even with no dwelling of IOMMU pagetables (i.e. split/collapse
IO base pages) it would still help greatly the current status quo of copying
the entire thing :(

Hence my thinking was that the patches /if small/ would let us see how dirty
tracking might work for iommu kAPI (and iommufd) too.

Would it be better to do more iterative steps (when possible) as opposed to
scratch and rebuild VFIO type1 IOMMU handling?

	Joao

[*] VFIO_IOMMU_DIRTY_PAGES{_FLAG_START,_FLAG_STOP,_FLAG_GET_BITMAP}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-15 16:00                               ` Joao Martins
@ 2022-02-15 16:21                                 ` Jason Gunthorpe
  2022-02-22 11:55                                   ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-15 16:21 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote:
> On 2/14/22 14:06, Jason Gunthorpe wrote:
> > On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
> > 
> >> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
> >> and then write the new valid entry. Not sure I understood correctly that this
> >> is the 'break-before-make' thingie.
> > 
> > Doesn't that explode if the invalid entry is DMA'd to?
> > 
> Yes, IIUC. Also, the manual has this note:

Heh, sounds like "this doesn't work" to me :)

> > Like I said, I'd prefer we not build more on the VFIO type 1 code
> > until we have a conclusion for iommufd..
> > 
> 
> I didn't quite understand what you mean by conclusion.

If people are dead-set against doing iommufd, then lets abandon the
idea and go back to hacking up vfio.
 
> If by conclusion you mean the whole thing to be merged, how can the work be
> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> in terms of direction...

I think go ahead and build it on top of iommufd, start working out the
API details, etc. I think once the direction is concluded the new APIs
will go forward.

> > While returning the dirty data looks straight forward, it is hard to
> > see an obvious path to enabling and controlling the system iommu the
> > way vfio is now.
> 
> It seems strange to have a whole UAPI for userspace [*] meant to
> return dirty data to userspace, when dirty right now means the whole
> pinned page set and so copying the whole guest ... 

Yes, the whole thing is only partially implemented, and doesn't have
any in-kernel user. It is another place holder for an implementation
to come someday.

> Hence my thinking was that the patches /if small/ would let us see how dirty
> tracking might work for iommu kAPI (and iommufd) too.

It could be tried, but I think if you go into there you will find it
quickly turns quite complicated to address all the edge cases. Eg what
do you do if you have a mdev present after you turn on system
tracking? What if the mdev is using a PASID? What about hotplug of new
VFIO devices?

Remember, dirty tracking for vfio is totally useless without also
having vfio device migration. Do you already have a migration capable
device to use with this?

> Would it be better to do more iterative steps (when possible) as opposed to
> scratch and rebuild VFIO type1 IOMMU handling?

Possibly, but every thing that gets added has to be carried over to
the new code too, and energy has to be expended trying to figure out
how the half implemented stuff should work while finishing it.

At the very least we must decide what to do with device-provided dirty
tracking before the VFIO type1 stuff can be altered to use the system
IOMMU.

This is very much like the migration FSM, the only appeal is the
existing qemu implementation of the protocol.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2021-07-02  9:58 [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Shameer Kolothum
                   ` (4 preceding siblings ...)
  2022-02-02 13:14 ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Jason Gunthorpe
@ 2022-02-18 16:37 ` Jason Gunthorpe
  5 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-18 16:37 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvm, linux-kernel, linux-crypto, alex.williamson, mgurtovoy,
	linuxarm, liulongfang, prime.zeng, yuzenghui, jonathan.cameron,
	wangzhou1

On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> Hi,
> 
> This series attempts to add vfio live migration support for
> HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> includes both the functional register space and migration 
> control register space. As discussed in RFCv1[0], this may create
> security issues as these regions get shared between the Guest
> driver and the migration driver. Based on the feedback, we tried
> to address those concerns in this version. 
> 
> This is now based on the new vfio-pci-core framework proposal[1].
> Understand that the framework proposal is still under discussion,
> but really appreciate any feedback on the approach taken here
> to mitigate the security risks.

Do you want to try to get this into this kernel along with the mlx5
driver? How RFCy is this?

If you can post a v2 with the changes discussed soon I think there is
a good chance of this.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-15 16:21                                 ` Jason Gunthorpe
@ 2022-02-22 11:55                                   ` Joao Martins
  2022-02-23  1:03                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-22 11:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/15/22 16:21, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote:
>> On 2/14/22 14:06, Jason Gunthorpe wrote:
>>> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
>>>
>>>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
>>>> and then write the new valid entry. Not sure I understood correctly that this
>>>> is the 'break-before-make' thingie.
>>>
>>> Doesn't that explode if the invalid entry is DMA'd to?
>>>
>> Yes, IIUC. Also, the manual has this note:
> 
> Heh, sounds like "this doesn't work" to me :)
> 
Yeah, but I remember reading in manual that HTTUD (what ARM calls it for dirty
tracking, albeit DBM is another term for the same thing) requires FEAT_BBM
which avoids us to play the above games. So, supposedly, we can "just"
use atomics with IOPTE changes and IOTLB flush. Not if we need the latter
flush before or after on smmuv3.

>>> Like I said, I'd prefer we not build more on the VFIO type 1 code
>>> until we have a conclusion for iommufd..
>>>
>>
>> I didn't quite understand what you mean by conclusion.
> 
> If people are dead-set against doing iommufd, then lets abandon the
> idea and go back to hacking up vfio.
>  
Heh, I was under the impression everybody was investing so much *because*
that direction was set onto iommufd direction.

>> If by conclusion you mean the whole thing to be merged, how can the work be
>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>> in terms of direction...
> 
> I think go ahead and build it on top of iommufd, start working out the
> API details, etc. I think once the direction is concluded the new APIs
> will go forward.
>
/me nods, will do. Looking at your repository it is looking good.

>>> While returning the dirty data looks straight forward, it is hard to
>>> see an obvious path to enabling and controlling the system iommu the
>>> way vfio is now.
>>
>> It seems strange to have a whole UAPI for userspace [*] meant to
>> return dirty data to userspace, when dirty right now means the whole
>> pinned page set and so copying the whole guest ... 
> 
> Yes, the whole thing is only partially implemented, and doesn't have
> any in-kernel user. It is another place holder for an implementation
> to come someday.
> 
Yeap, seems like.

>> Hence my thinking was that the patches /if small/ would let us see how dirty
>> tracking might work for iommu kAPI (and iommufd) too.
> 
> It could be tried, but I think if you go into there you will find it
> quickly turns quite complicated to address all the edge cases. Eg what
> do you do if you have a mdev present after you turn on system
> tracking? What if the mdev is using a PASID?
> What about hotplug of new
> VFIO devices?
> 
> Remember, dirty tracking for vfio is totally useless without also
> having vfio device migration. 

Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
if we can't do the device part first. But if quiescing DMA and saving
state are two hard requirements that are mandatory for a live migrateable
VF, having dirty tracking in the devices I suspect might be more rare.
So perhaps people will look at IOMMUs as a commodity-workaround to avoid
a whole bunch of hardware logic for dirty tracking, even bearing what it
entails for DMA performance (hisilicon might be an example).

> Do you already have a migration capable
> device to use with this?
> 
Not yet, but soon I hope.

>> Would it be better to do more iterative steps (when possible) as opposed to
>> scratch and rebuild VFIO type1 IOMMU handling?
> 
> Possibly, but every thing that gets added has to be carried over to
> the new code too, and energy has to be expended trying to figure out
> how the half implemented stuff should work while finishing it.
> 
/me nods I understand

> At the very least we must decide what to do with device-provided dirty
> tracking before the VFIO type1 stuff can be altered to use the system
> IOMMU.
> 
I, too, have been wondering what that is going to look like -- and how do we
convey the setup of dirty tracking versus the steering of it.

> This is very much like the migration FSM, the only appeal is the
> existing qemu implementation of the protocol.

Yeah.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-22 11:55                                   ` Joao Martins
@ 2022-02-23  1:03                                     ` Jason Gunthorpe
  2022-02-25 19:18                                       ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-23  1:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:

> > If people are dead-set against doing iommufd, then lets abandon the
> > idea and go back to hacking up vfio.
> >  
> Heh, I was under the impression everybody was investing so much *because*
> that direction was set onto iommufd direction.

Such is the hope :)

> >> If by conclusion you mean the whole thing to be merged, how can the work be
> >> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >> in terms of direction...
> > 
> > I think go ahead and build it on top of iommufd, start working out the
> > API details, etc. I think once the direction is concluded the new APIs
> > will go forward.
> >
> /me nods, will do. Looking at your repository it is looking good.

I would like to come with some plan for dirty tracking on iommufd and
combine that with a plan for dirty tracking inside the new migration
drivers.

> Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
> if we can't do the device part first. But if quiescing DMA and saving
> state are two hard requirements that are mandatory for a live migrateable
> VF, having dirty tracking in the devices I suspect might be more
> rare.

So far all but one of the live migration devices I know about can do
dirty tracking internally..

> So perhaps people will look at IOMMUs as a commodity-workaround to avoid
> a whole bunch of hardware logic for dirty tracking, even bearing what it
> entails for DMA performance (hisilicon might be an example).

I do expect this will be true.

> > At the very least we must decide what to do with device-provided dirty
> > tracking before the VFIO type1 stuff can be altered to use the system
> > IOMMU.
> 
> I, too, have been wondering what that is going to look like -- and how do we
> convey the setup of dirty tracking versus the steering of it.

What I suggested was to just split them.

Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
would be on a per-domain basis, not on the ioas.

Some ioctl toward the vfio device will turn on the device's tracker.

Userspace has to iterate and query each enabled tracker. This is not
so bad because the time to make the system call is going to be tiny
compared to the time to marshal XXGB of dirty bits.

This makes qemu more complicated because it has to decide what
trackers to turn on, but that is also the point because we do want
userspace to be able to decide.

The other idea that has some possible interest is to allow the
trackers to dump their dirty bits into the existing kvm tracker, then
userspace just does a single kvm centric dirty pass.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-23  1:03                                     ` Jason Gunthorpe
@ 2022-02-25 19:18                                       ` Joao Martins
  2022-02-25 20:44                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-02-25 19:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/23/22 01:03, Jason Gunthorpe wrote:
> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>> in terms of direction...
>>>
>>> I think go ahead and build it on top of iommufd, start working out the
>>> API details, etc. I think once the direction is concluded the new APIs
>>> will go forward.
>>>
>> /me nods, will do. Looking at your repository it is looking good.
> 
> I would like to come with some plan for dirty tracking on iommufd and
> combine that with a plan for dirty tracking inside the new migration
> drivers.
> 
I had a few things going on my end over the past weeks, albeit it is
getting a bit better now and I will be coming back to this topic. I hope/want
to give you a more concrete update/feedback over the coming week or two wrt
to dirty-tracking+iommufd+amd.

So far, I am not particularly concerned that this will affect overall iommufd
design. The main thing is really lookups to get vendor iopte, upon on what might
be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling the tracking,
I'll be simplifying the interface in the other type1 series I had and making it
a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
enable/disable over a domain. Perhaps same trick could be expanded to other
features to have a bit more control on what userspace is allowed to use. I think
this just needs to set/clear a feature bit in the domain, for VFIO or userspace
to have full control during the different stages of migration of dirty tracking.
In all of the IOMMU implementations/manuals I read it means setting a protection
domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
(albeit past work had also it always-on).

Provided the iommufd does /separately/ more finer granularity on what we can
do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
at will as separate operations, before and after migration respectivally. That logic
would probably be better to be in separate iommufd ioctls(), as that it's going to be
expensive.


>> I, too, have been wondering what that is going to look like -- and how do we
>> convey the setup of dirty tracking versus the steering of it.
> 
> What I suggested was to just split them.
> 
> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> would be on a per-domain basis, not on the ioas.
> 
> Some ioctl toward the vfio device will turn on the device's tracker.
> 
In the activation/fetching-data of either trackers I see some things in common in
terms of UAPI with the difference that whether a device or a list of devices are passed on
as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
your suggestion)

Albeit perhaps the main difference is going to be that one needs to setup with
hardware interface with the device tracker and how we carry the regions of memory that
want to be tracked i.e. GPA/IOVA ranges that the device should track. The tracking-GPA
space is not linear GPA space sadly. But at the same point perhaps the internal VFIO API
between core-VFIO and vendor-VFIO is just reading the @dma ranges we mapped.

In IOMMU this is sort of cheap and 'stateless', but on the setup of the
device tracker might mean giving all the IOVA ranges to the PF (once?).
Perhaps leaving to the vendor driver to pick when to register the IOVA space to
be tracked, or perhaps when you turn on the device's tracker. But on all cases,
the driver needs some form of awareness of and convey that to the PF for
tracking purposes.

> Userspace has to iterate and query each enabled tracker. This is not
> so bad because the time to make the system call is going to be tiny
> compared to the time to marshal XXGB of dirty bits.
> 
Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
following by a smaller cost copying back to userspace (which KVM does too, when it's using
a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
(gup as you mentioned earlier in the thread), or just copying based on the input bitmap
(from PF) number of leading zeroes within some threshold.

> This makes qemu more complicated because it has to decide what
> trackers to turn on, but that is also the point because we do want
> userspace to be able to decide.
> 
If the interface wants extending to pass a device or an array of devices (if I understood
you correctly), it would free/simplify VFIO from having to concatenate potentially
different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
more to avoid performing too much copying.

> The other idea that has some possible interest is to allow the
> trackers to dump their dirty bits into the existing kvm tracker, then
> userspace just does a single kvm centric dirty pass.

That would probably limit certain more modern options of ring based dirty tracking,
as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least,
would require KVM to always use a bitmap during migration/dirty-rate-estimation with
the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
core-iommu aware of bitmap as external storage for tracking (that is not iommufd as
it's a module).

Although, perhaps IOMMU sharing pgtables with CPUs would probably better align with
reusing KVM existing dirty bitmap interface. But I haven't given much thought on that one.
I just remember reading something on the IOMMU manual about this (ISTR that iommu_v2 was
made for that purpose).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-25 19:18                                       ` Joao Martins
@ 2022-02-25 20:44                                         ` Jason Gunthorpe
  2022-02-28 13:01                                           ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-25 20:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> On 2/23/22 01:03, Jason Gunthorpe wrote:
> > On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>>> If by conclusion you mean the whole thing to be merged, how can the work be
> >>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >>>> in terms of direction...
> >>>
> >>> I think go ahead and build it on top of iommufd, start working out the
> >>> API details, etc. I think once the direction is concluded the new APIs
> >>> will go forward.
> >>>
> >> /me nods, will do. Looking at your repository it is looking good.
> > 
> > I would like to come with some plan for dirty tracking on iommufd and
> > combine that with a plan for dirty tracking inside the new migration
> > drivers.
> > 
> I had a few things going on my end over the past weeks, albeit it is
> getting a bit better now and I will be coming back to this topic. I hope/want
> to give you a more concrete update/feedback over the coming week or two wrt
> to dirty-tracking+iommufd+amd.
> 
> So far, I am not particularly concerned that this will affect overall iommufd
> design. The main thing is really lookups to get vendor iopte, upon on what might
> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
> the tracking,

I'm not very keen on these multiplexer interfaces. I think you should
just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
'read_dirty_bits'

NULL op means not supported.

IMHO we don't need a kapi wrapper if only iommufd is going to call the
op.

> I'll be simplifying the interface in the other type1 series I had and making it
> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> enable/disable over a domain. Perhaps same trick could be expanded to other
> features to have a bit more control on what userspace is allowed to use. I think
> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> to have full control during the different stages of migration of dirty tracking.
> In all of the IOMMU implementations/manuals I read it means setting a protection
> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> (albeit past work had also it always-on).
> 
> Provided the iommufd does /separately/ more finer granularity on what we can
> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> at will as separate operations, before and after migration respectivally. That logic
> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> expensive.

This all sounds right to me

Questions I have:
 - Do we need ranges for some reason? You mentioned ARM SMMU wants
   ranges? how/what/why?

 - What about the unmap and read dirty without races operation that
   vfio has?

> >> I, too, have been wondering what that is going to look like -- and how do we
> >> convey the setup of dirty tracking versus the steering of it.
> > 
> > What I suggested was to just split them.
> > 
> > Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> > would be on a per-domain basis, not on the ioas.
> > 
> > Some ioctl toward the vfio device will turn on the device's tracker.
> > 
> In the activation/fetching-data of either trackers I see some things in common in
> terms of UAPI with the difference that whether a device or a list of devices are passed on
> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
> your suggestion)

I was thinking a VFIO_DEVICE ioctl located on the device FD
implemented in the end VFIO driver (like mlx5_vfio). No lists..

As you say the driver should just take in the request to set dirty
tracking and take core of it somehow. There is no value the core VFIO
code can add here.

> Albeit perhaps the main difference is going to be that one needs to
> setup with hardware interface with the device tracker and how we
> carry the regions of memory that want to be tracked i.e. GPA/IOVA
> ranges that the device should track. The tracking-GPA space is not
> linear GPA space sadly. But at the same point perhaps the internal
> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
> ranges we mapped.

Yes, this is a point that needs some answering. One option is to pass
in the tracking range list from userspace. Another is to query it in
the driver from the currently mapped areas in IOAS.

I know devices have limitations here in terms of how many/how big the
ranges can be, and devices probably can't track dynamic changes.

> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
> device tracker might mean giving all the IOVA ranges to the PF (once?).
> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
> the driver needs some form of awareness of and convey that to the PF for
> tracking purposes.

Yes, this is right
 
> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
> following by a smaller cost copying back to userspace (which KVM does too, when it's using
> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
> (from PF) number of leading zeroes within some threshold.

What I would probably strive for is an API that deliberately OR's in
the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
in your dirty data', then do the next page. etc. That is 134M of IOVA
per chunk which seems OK.

> > This makes qemu more complicated because it has to decide what
> > trackers to turn on, but that is also the point because we do want
> > userspace to be able to decide.
> > 
> If the interface wants extending to pass a device or an array of devices (if I understood
> you correctly), it would free/simplify VFIO from having to concatenate potentially
> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
> more to avoid performing too much copying.

Yes. Currently VFIO maintains its own bitmap so it also saves that
memory by keeping the dirty bits stored in the IOPTEs until read out.
 
> > The other idea that has some possible interest is to allow the
> > trackers to dump their dirty bits into the existing kvm tracker, then
> > userspace just does a single kvm centric dirty pass.
> 
> That would probably limit certain more modern options of ring based dirty tracking,
> as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least,
> would require KVM to always use a bitmap during migration/dirty-rate-estimation with
> the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
> core-iommu aware of bitmap as external storage for tracking (that is not iommufd as
> it's a module).

Yes, I don't know enough about kvm to say if that is a great idea or
not. The fact the CPUs seem to be going to logging instead of bitmaps
suggests it isn't. I don't think DMA devices can work effectively with
logging..

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-25 20:44                                         ` Jason Gunthorpe
@ 2022-02-28 13:01                                           ` Joao Martins
  2022-02-28 21:01                                             ` Jason Gunthorpe
  2022-03-11 13:51                                             ` iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver) Joao Martins
  0 siblings, 2 replies; 55+ messages in thread
From: Joao Martins @ 2022-02-28 13:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/25/22 20:44, Jason Gunthorpe wrote:
> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>> in terms of direction...
>>>>>
>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>> will go forward.
>>>>>
>>>> /me nods, will do. Looking at your repository it is looking good.
>>>
>>> I would like to come with some plan for dirty tracking on iommufd and
>>> combine that with a plan for dirty tracking inside the new migration
>>> drivers.
>>>
>> I had a few things going on my end over the past weeks, albeit it is
>> getting a bit better now and I will be coming back to this topic. I hope/want
>> to give you a more concrete update/feedback over the coming week or two wrt
>> to dirty-tracking+iommufd+amd.
>>
>> So far, I am not particularly concerned that this will affect overall iommufd
>> design. The main thing is really lookups to get vendor iopte, upon on what might
>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>> the tracking,
> 
> I'm not very keen on these multiplexer interfaces. I think you should
> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> 'read_dirty_bits'
> 
> NULL op means not supported.
> 
> IMHO we don't need a kapi wrapper if only iommufd is going to call the
> op.
> 

OK, good point.

Even without a kapi wrapper I am still wondering whether the iommu op needs to
be something like a generic iommu feature toggling (e.g. .set_feature()), rather
than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
the only feature we will change out-of-band in the protection-domain.
I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
because the sync() doesn't only read, but also "writes" to root pagetable to update
the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.

And TBH, the question on whether we need a clear op isn't immediately obvious: reading
the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive
io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what
it wants to clear (in principle cheaper on io-page-table walk as you just iterate over
sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the
IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this
isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my
thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of
get_dirty_bitmap() and clear_dirty_ditmap().

Hopefully a futuristic IOMMUs would just get a new DTE/CD/etc field for stashing a set of
PFNs (for a bitmap) that the IOMMU would use for setting the dirty bits there. That is,
rather than forcing sw to split/merge pagetables for better granularity and having to
flush IOTLBs for dirty to be written again :(

>> I'll be simplifying the interface in the other type1 series I had and making it
>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>> enable/disable over a domain. Perhaps same trick could be expanded to other
>> features to have a bit more control on what userspace is allowed to use. I think
>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>> to have full control during the different stages of migration of dirty tracking.
>> In all of the IOMMU implementations/manuals I read it means setting a protection
>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>> (albeit past work had also it always-on).
>>
>> Provided the iommufd does /separately/ more finer granularity on what we can
>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>> at will as separate operations, before and after migration respectivally. That logic
>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>> expensive.
> 
> This all sounds right to me
> 
> Questions I have:
>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>    ranges? how/what/why?
> 
Ignore that. I got mislead by the implementation and when I read the SDM
I realized that the implementation was doing the same thing I was doing
i.e. enabling dirty-bit in the protection domain at start rather than
dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
bit in the context descriptor to enable dirty bits or the STE.S2HD in the
stream table entry for the stage2 equivalent. Nothing here is per-range
basis. And the ranges was only used by the implementation for the eager
splitting/merging of IO page table levels.

>  - What about the unmap and read dirty without races operation that
>    vfio has?
> 
I am afraid that might need a new unmap iommu op that reads the dirty bit
after clearing the page table entry. It's marshalling the bits from
iopte into a bitmap as opposed to some logic added on top. As far as I
looked for AMD this isn't difficult to add, (same for Intel) it can
*I think* reuse all of the unmap code.

>>>> I, too, have been wondering what that is going to look like -- and how do we
>>>> convey the setup of dirty tracking versus the steering of it.
>>>
>>> What I suggested was to just split them.
>>>
>>> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
>>> would be on a per-domain basis, not on the ioas.
>>>
>>> Some ioctl toward the vfio device will turn on the device's tracker.
>>>
>> In the activation/fetching-data of either trackers I see some things in common in
>> terms of UAPI with the difference that whether a device or a list of devices are passed on
>> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
>> your suggestion)
> 
> I was thinking a VFIO_DEVICE ioctl located on the device FD
> implemented in the end VFIO driver (like mlx5_vfio). No lists..
> 
Interesting. I was assuming that we wanted to keep the same ioctl()
interface for dirty-tracking hence me mentioning the device/device-list on
top of the DIRTY ioctl. But well on a second thought it doesn't make sense
given that we query a container and we want to move away from vfio for iopgtbl
related-work and rather into iommufd direct access instead by the VMM. So
placing more dependency on that ioctl wouldn't align with that. I suppose, it
makes sense that this is on a per-vfio-device basis, hence device-vfio ioctl().

> As you say the driver should just take in the request to set dirty
> tracking and take core of it somehow. There is no value the core VFIO
> code can add here.
> 
>> Albeit perhaps the main difference is going to be that one needs to
>> setup with hardware interface with the device tracker and how we
>> carry the regions of memory that want to be tracked i.e. GPA/IOVA
>> ranges that the device should track. The tracking-GPA space is not
>> linear GPA space sadly. But at the same point perhaps the internal
>> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
>> ranges we mapped.
> 
> Yes, this is a point that needs some answering. One option is to pass
> in the tracking range list from userspace. Another is to query it in
> the driver from the currently mapped areas in IOAS.
> 
I sort of like the second given that we de-duplicate info that is already
tracked by IOAS -- it would be mostly internal and then it would be a
matter of when does this device tracker is set up, and whether we should
differentiate tracker "start"/"stop" vs "setup"/"teardown".

> I know devices have limitations here in terms of how many/how big the
> ranges can be, and devices probably can't track dynamic changes.
> 
/me nods

Should this be some sort of capability perhaps? Given that this may limit
how many concurrent VFs can be migrated and how much ranges it can store,
for example.

(interestingly and speaking of VF capabilities, it's yet another thing to
tackle in the migration stream between src and dst hosts)

>> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
>> device tracker might mean giving all the IOVA ranges to the PF (once?).
>> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
>> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
>> the driver needs some form of awareness of and convey that to the PF for
>> tracking purposes.
> 
> Yes, this is right
>  
>> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
>> following by a smaller cost copying back to userspace (which KVM does too, when it's using
>> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
>> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
>> (from PF) number of leading zeroes within some threshold.
> 
> What I would probably strive for is an API that deliberately OR's in
> the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
> in your dirty data', then do the next page. etc. That is 134M of IOVA
> per chunk which seems OK.
> 
Yeap, sort of what I was aiming for.

>>> This makes qemu more complicated because it has to decide what
>>> trackers to turn on, but that is also the point because we do want
>>> userspace to be able to decide.
>>>
>> If the interface wants extending to pass a device or an array of devices (if I understood
>> you correctly), it would free/simplify VFIO from having to concatenate potentially
>> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
>> more to avoid performing too much copying.
> 
> Yes. Currently VFIO maintains its own bitmap so it also saves that
> memory by keeping the dirty bits stored in the IOPTEs until read out.
>  
/me nods with the iommu, yes.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-28 13:01                                           ` Joao Martins
@ 2022-02-28 21:01                                             ` Jason Gunthorpe
  2022-03-01 13:06                                               ` Joao Martins
  2022-03-11 13:51                                             ` iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver) Joao Martins
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 21:01 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote:
> On 2/25/22 20:44, Jason Gunthorpe wrote:
> > On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> >> On 2/23/22 01:03, Jason Gunthorpe wrote:
> >>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
> >>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >>>>>> in terms of direction...
> >>>>>
> >>>>> I think go ahead and build it on top of iommufd, start working out the
> >>>>> API details, etc. I think once the direction is concluded the new APIs
> >>>>> will go forward.
> >>>>>
> >>>> /me nods, will do. Looking at your repository it is looking good.
> >>>
> >>> I would like to come with some plan for dirty tracking on iommufd and
> >>> combine that with a plan for dirty tracking inside the new migration
> >>> drivers.
> >>>
> >> I had a few things going on my end over the past weeks, albeit it is
> >> getting a bit better now and I will be coming back to this topic. I hope/want
> >> to give you a more concrete update/feedback over the coming week or two wrt
> >> to dirty-tracking+iommufd+amd.
> >>
> >> So far, I am not particularly concerned that this will affect overall iommufd
> >> design. The main thing is really lookups to get vendor iopte, upon on what might
> >> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
> >> the tracking,
> > 
> > I'm not very keen on these multiplexer interfaces. I think you should
> > just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> > 'read_dirty_bits'
> > 
> > NULL op means not supported.
> > 
> > IMHO we don't need a kapi wrapper if only iommufd is going to call the
> > op.
> > 
> 
> OK, good point.
> 
> Even without a kapi wrapper I am still wondering whether the iommu op needs to
> be something like a generic iommu feature toggling (e.g. .set_feature()), rather
> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
> the only feature we will change out-of-band in the protection-domain.
> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

I just generally dislike multiplexers like this. We are already
calling through a function pointer struct, why should the driver
implement another switch/case just to find out which function pointer
the caller really ment to use? It doesn't make things faster, it
doesn't make things smaller, it doesn't use less LOC. Why do it?

> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
> because the sync() doesn't only read, but also "writes" to root pagetable to update
> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.

'read and clear' is what I'd suggest

> And TBH, the question on whether we need a clear op isn't immediately obvious: reading
> the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive
> io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what
> it wants to clear (in principle cheaper on io-page-table walk as you just iterate over
> sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the
> IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this
> isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my
> thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of
> get_dirty_bitmap() and clear_dirty_ditmap().

Yes, I wouldn't split them.

> > Questions I have:
> >  - Do we need ranges for some reason? You mentioned ARM SMMU wants
> >    ranges? how/what/why?
> > 
> Ignore that. I got mislead by the implementation and when I read the SDM
> I realized that the implementation was doing the same thing I was
> doing

Ok

> >  - What about the unmap and read dirty without races operation that
> >    vfio has?
> > 
> I am afraid that might need a new unmap iommu op that reads the dirty bit
> after clearing the page table entry. It's marshalling the bits from
> iopte into a bitmap as opposed to some logic added on top. As far as I
> looked for AMD this isn't difficult to add, (same for Intel) it can
> *I think* reuse all of the unmap code.

Ok. It feels necessary to be complete

> > Yes, this is a point that needs some answering. One option is to pass
> > in the tracking range list from userspace. Another is to query it in
> > the driver from the currently mapped areas in IOAS.
> > 
> I sort of like the second given that we de-duplicate info that is already
> tracked by IOAS -- it would be mostly internal and then it would be a
> matter of when does this device tracker is set up, and whether we should
> differentiate tracker "start"/"stop" vs "setup"/"teardown".

One problem with this is that devices that don't support dynamic
tracking are stuck in vIOMMU cases where the IOAS will have some tiny
set of all memory mapped. 

So I'd probably say qemu should forward the entire guest CPU memory
space, as well as any future memory hotplug area, as ranges and not
rely on the IOAS already mapping anything.

> > I know devices have limitations here in terms of how many/how big the
> > ranges can be, and devices probably can't track dynamic changes.
> > 
> /me nods
> 
> Should this be some sort of capability perhaps? Given that this may limit
> how many concurrent VFs can be migrated and how much ranges it can store,
> for example.
> 
> (interestingly and speaking of VF capabilities, it's yet another thing to
> tackle in the migration stream between src and dst hosts)

I don't know what to do with these limitations right now..

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-02-28 21:01                                             ` Jason Gunthorpe
@ 2022-03-01 13:06                                               ` Joao Martins
  2022-03-01 13:54                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-03-01 13:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/28/22 21:01, Jason Gunthorpe wrote:
> On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote:
>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>>>> in terms of direction...
>>>>>>>
>>>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>>>> will go forward.
>>>>>>>
>>>>>> /me nods, will do. Looking at your repository it is looking good.
>>>>>
>>>>> I would like to come with some plan for dirty tracking on iommufd and
>>>>> combine that with a plan for dirty tracking inside the new migration
>>>>> drivers.
>>>>>
>>>> I had a few things going on my end over the past weeks, albeit it is
>>>> getting a bit better now and I will be coming back to this topic. I hope/want
>>>> to give you a more concrete update/feedback over the coming week or two wrt
>>>> to dirty-tracking+iommufd+amd.
>>>>
>>>> So far, I am not particularly concerned that this will affect overall iommufd
>>>> design. The main thing is really lookups to get vendor iopte, upon on what might
>>>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>>>> the tracking,
>>>
>>> I'm not very keen on these multiplexer interfaces. I think you should
>>> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
>>> 'read_dirty_bits'
>>>
>>> NULL op means not supported.
>>>
>>> IMHO we don't need a kapi wrapper if only iommufd is going to call the
>>> op.
>>>
>>
>> OK, good point.
>>
>> Even without a kapi wrapper I am still wondering whether the iommu op needs to
>> be something like a generic iommu feature toggling (e.g. .set_feature()), rather
>> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
>> the only feature we will change out-of-band in the protection-domain.
>> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
>> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).
> 
> I just generally dislike multiplexers like this. We are already
> calling through a function pointer struct, why should the driver
> implement another switch/case just to find out which function pointer
> the caller really ment to use? It doesn't make things faster, it
> doesn't make things smaller, it doesn't use less LOC. Why do it?
> 
I concur with you in the above but I don't mean it like a multiplexer.
Rather, mimicking the general nature of feature bits in the protection domain
(or the hw pagetable abstraction). So hypothetically for every bit ... if you
wanted to create yet another op that just flips a bit of the
DTEs/PASID-table/CD it would be an excessive use of callbacks to get
to in the iommu_domain_ops if they all set do the same thing.
Right now it's only Dirty (Access bit I don't see immediate use for it
right now) bits, but could there be more? Perhaps this is something to
think about.

Anyways, I am anyways sticking with plain ops function.

>> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
>> because the sync() doesn't only read, but also "writes" to root pagetable to update
>> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
>> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.
> 
> 'read and clear' is what I'd suggest
> 
/me nods


>>>  - What about the unmap and read dirty without races operation that
>>>    vfio has?
>>>
>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>> after clearing the page table entry. It's marshalling the bits from
>> iopte into a bitmap as opposed to some logic added on top. As far as I
>> looked for AMD this isn't difficult to add, (same for Intel) it can
>> *I think* reuse all of the unmap code.
> 
> Ok. It feels necessary to be complete
> 
Yes, I guess it's mandatory if we want to fully implement vfio-compat dirty
tracking IOCTLs, too.

>>> Yes, this is a point that needs some answering. One option is to pass
>>> in the tracking range list from userspace. Another is to query it in
>>> the driver from the currently mapped areas in IOAS.
>>>
>> I sort of like the second given that we de-duplicate info that is already
>> tracked by IOAS -- it would be mostly internal and then it would be a
>> matter of when does this device tracker is set up, and whether we should
>> differentiate tracker "start"/"stop" vs "setup"/"teardown".
> 
> One problem with this is that devices that don't support dynamic
> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
> set of all memory mapped. 
> 
Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
that you have no limitation of ranges and fw/hw can cope with being fed of
'new-ranges' to enable dirty-tracking. Or does it mean something else at
hw level? Like dirty bitmap being pulled out of device memory, and hw keeps
its own state of dirtying and sw 'just' needs to sync the bitmap every scan.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-03-01 13:06                                               ` Joao Martins
@ 2022-03-01 13:54                                                 ` Jason Gunthorpe
  2022-03-01 14:27                                                   ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-03-01 13:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote:

> I concur with you in the above but I don't mean it like a multiplexer.
> Rather, mimicking the general nature of feature bits in the protection domain
> (or the hw pagetable abstraction). So hypothetically for every bit ... if you
> wanted to create yet another op that just flips a bit of the
> DTEs/PASID-table/CD it would be an excessive use of callbacks to get
> to in the iommu_domain_ops if they all set do the same thing.
> Right now it's only Dirty (Access bit I don't see immediate use for it
> right now) bits, but could there be more? Perhaps this is something to
> think about.

Not sure it matters :)

> > One problem with this is that devices that don't support dynamic
> > tracking are stuck in vIOMMU cases where the IOAS will have some tiny
> > set of all memory mapped. 
> > 
> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
> that you have no limitation of ranges and fw/hw can cope with being fed of
> 'new-ranges' to enable dirty-tracking. 

Yes, the ranges can change once the tracking starts, like the normal
IOMMU can do

We are looking at devices where the HW can track a range at the start
of tracking and that range cannot change over time.

So, we need to track the whole guest memory and some extra for memory
hot plug, not just the currently viommu mapped things.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
  2022-03-01 13:54                                                 ` Jason Gunthorpe
@ 2022-03-01 14:27                                                   ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2022-03-01 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 3/1/22 13:54, Jason Gunthorpe wrote:
> On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote:
> 
>> I concur with you in the above but I don't mean it like a multiplexer.
>> Rather, mimicking the general nature of feature bits in the protection domain
>> (or the hw pagetable abstraction). So hypothetically for every bit ... if you
>> wanted to create yet another op that just flips a bit of the
>> DTEs/PASID-table/CD it would be an excessive use of callbacks to get
>> to in the iommu_domain_ops if they all set do the same thing.
>> Right now it's only Dirty (Access bit I don't see immediate use for it
>> right now) bits, but could there be more? Perhaps this is something to
>> think about.
> 
> Not sure it matters :)
> 
Hehe, most likely it doesn't :)

>>> One problem with this is that devices that don't support dynamic
>>> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
>>> set of all memory mapped. 
>>>
>> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
>> that you have no limitation of ranges and fw/hw can cope with being fed of
>> 'new-ranges' to enable dirty-tracking. 
> 
> Yes, the ranges can change once the tracking starts, like the normal
> IOMMU can do
> 
> We are looking at devices where the HW can track a range at the start
> of tracking and that range cannot change over time.
> 
> So, we need to track the whole guest memory and some extra for memory
> hot plug, not just the currently viommu mapped things.

I'm aware of the guest memmap accounting foo, so I was already factoring
that in the equation. I was just mainly grokking at hardware limitations
you are coming from to be captured in these future VFIO extensions.

Tracking the max possible GPA (as you hinted earlier) could solve the two natures
of write-tracking, also given that the IOMMU ultimately protects what's really
allowed to be DMA-written to. Unless such tracking-bitmap is placed on device
memory, where space probably needs to be more carefully managed.

Anyways, thanks for the insights!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver)
  2022-02-28 13:01                                           ` Joao Martins
  2022-02-28 21:01                                             ` Jason Gunthorpe
@ 2022-03-11 13:51                                             ` Joao Martins
  2022-03-15 19:29                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-03-11 13:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 2/28/22 13:01, Joao Martins wrote:
> On 2/25/22 20:44, Jason Gunthorpe wrote:
>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>> I'll be simplifying the interface in the other type1 series I had and making it
>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>> features to have a bit more control on what userspace is allowed to use. I think
>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>> to have full control during the different stages of migration of dirty tracking.
>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>> (albeit past work had also it always-on).
>>>
>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>> at will as separate operations, before and after migration respectivally. That logic
>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>> expensive.
>>
>> This all sounds right to me
>>
>> Questions I have:
>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>    ranges? how/what/why?
>>
> Ignore that. I got mislead by the implementation and when I read the SDM
> I realized that the implementation was doing the same thing I was doing
> i.e. enabling dirty-bit in the protection domain at start rather than
> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> stream table entry for the stage2 equivalent. Nothing here is per-range
> basis. And the ranges was only used by the implementation for the eager
> splitting/merging of IO page table levels.
> 
>>  - What about the unmap and read dirty without races operation that
>>    vfio has?
>>
> I am afraid that might need a new unmap iommu op that reads the dirty bit
> after clearing the page table entry. It's marshalling the bits from
> iopte into a bitmap as opposed to some logic added on top. As far as I
> looked for AMD this isn't difficult to add, (same for Intel) it can
> *I think* reuse all of the unmap code.
> 

OK, made some progress.

It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
revising locking, bugs, and more -- works with my emulated qemu patches which
is a good sign. But hopefully starts some sort of skeleton of what we were
talking about in the above thread.

The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
the vfio-compat one as it was easier provided there's existing userspace to work
with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
steering the dirty tracking, read-clear the dirty bits, unmap and get dirty. But
as we discussed earlier, the one gap of VFIO dirty API is that it lacks controls
for upgrading/downgrading area/IOPTE sizes which is where IOMMUFD would most
likely shine. When that latter part is done we can probably adopt an eager-split
approach inside vfio-compat.

Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
Some of the commits made notes of some of research I made, so *I think* the APIs
introduced capture h/w semantics for all the three IOMMUs supporting dirty
tracking.

I am storing my dirty-tracking bits here:

	https://github.com/jpemartins/linux	iommufd

It's a version of it rebased on top of your iommufd branch.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver)
  2022-03-11 13:51                                             ` iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver) Joao Martins
@ 2022-03-15 19:29                                               ` Jason Gunthorpe
  2022-03-16 16:36                                                 ` iommufd(+vfio-compat) dirty tracking Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-03-15 19:29 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
> On 2/28/22 13:01, Joao Martins wrote:
> > On 2/25/22 20:44, Jason Gunthorpe wrote:
> >> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> >>> On 2/23/22 01:03, Jason Gunthorpe wrote:
> >>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>> I'll be simplifying the interface in the other type1 series I had and making it
> >>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> >>> enable/disable over a domain. Perhaps same trick could be expanded to other
> >>> features to have a bit more control on what userspace is allowed to use. I think
> >>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> >>> to have full control during the different stages of migration of dirty tracking.
> >>> In all of the IOMMU implementations/manuals I read it means setting a protection
> >>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> >>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> >>> (albeit past work had also it always-on).
> >>>
> >>> Provided the iommufd does /separately/ more finer granularity on what we can
> >>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> >>> at will as separate operations, before and after migration respectivally. That logic
> >>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> >>> expensive.
> >>
> >> This all sounds right to me
> >>
> >> Questions I have:
> >>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
> >>    ranges? how/what/why?
> >>
> > Ignore that. I got mislead by the implementation and when I read the SDM
> > I realized that the implementation was doing the same thing I was doing
> > i.e. enabling dirty-bit in the protection domain at start rather than
> > dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> > bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> > stream table entry for the stage2 equivalent. Nothing here is per-range
> > basis. And the ranges was only used by the implementation for the eager
> > splitting/merging of IO page table levels.
> > 
> >>  - What about the unmap and read dirty without races operation that
> >>    vfio has?
> >>
> > I am afraid that might need a new unmap iommu op that reads the dirty bit
> > after clearing the page table entry. It's marshalling the bits from
> > iopte into a bitmap as opposed to some logic added on top. As far as I
> > looked for AMD this isn't difficult to add, (same for Intel) it can
> > *I think* reuse all of the unmap code.
> > 
> 
> OK, made some progress.

Nice!

 
> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
> revising locking, bugs, and more -- works with my emulated qemu patches which
> is a good sign. But hopefully starts some sort of skeleton of what we were
> talking about in the above thread.

I'm a bit bogged with the coming merge window right now, but will have
more to say in a bit
 
> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
> the vfio-compat one as it was easier provided there's existing userspace to work
> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
> steering the dirty tracking, read-clear the dirty bits, unmap and
> get dirty. 

I think this is fine to start experimenting with

> But as we discussed earlier, the one gap of VFIO dirty API is that
> it lacks controls for upgrading/downgrading area/IOPTE sizes which
> is where IOMMUFD would most likely shine. When that latter part is
> done we can probably adopt an eager-split approach inside
> vfio-compat.

I think the native API should be new ioctls that operate on the
hw_pagetable object to:
  - enable/disable dirty tracking 
  - read&clear a bitmap from a range
  - read&unmap a bitmap from a range
  - Manipulate IOPTE sizes

As you say it should not be much distance from the VFIO compat stuff

Most probably I would say to leave dirty tracking out of the type1 api
and compat for it. Maybe we can make some limited cases work back
compat, like the whole ioas supports iommu dirty tracking or
something..

Need to understand if it is wortwhile - remember to use any of this
you need a qemu that is updated to the v2 migration interface,
so there is little practical value in back compat to old qemu if we
expect qemu will use the native interface anyhow.

> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
> Some of the commits made notes of some of research I made, so *I think* the APIs
> introduced capture h/w semantics for all the three IOMMUs supporting dirty
> tracking.

I think the primitives are pretty basic, I'd be surprised if there is
something different :)

Though things to be thinking about are how does this work with nested
translation and other advanced features.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: iommufd(+vfio-compat) dirty tracking
  2022-03-15 19:29                                               ` Jason Gunthorpe
@ 2022-03-16 16:36                                                 ` Joao Martins
  2022-03-16 20:37                                                   ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-03-16 16:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 3/15/22 19:29, Jason Gunthorpe wrote:
> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>> On 2/28/22 13:01, Joao Martins wrote:
>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>> I'll be simplifying the interface in the other type1 series I had and making it
>>>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>>>> features to have a bit more control on what userspace is allowed to use. I think
>>>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>>>> to have full control during the different stages of migration of dirty tracking.
>>>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>>>> (albeit past work had also it always-on).
>>>>>
>>>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>>>> at will as separate operations, before and after migration respectivally. That logic
>>>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>>>> expensive.
>>>>
>>>> This all sounds right to me
>>>>
>>>> Questions I have:
>>>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>    ranges? how/what/why?
>>>>
>>> Ignore that. I got mislead by the implementation and when I read the SDM
>>> I realized that the implementation was doing the same thing I was doing
>>> i.e. enabling dirty-bit in the protection domain at start rather than
>>> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
>>> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
>>> stream table entry for the stage2 equivalent. Nothing here is per-range
>>> basis. And the ranges was only used by the implementation for the eager
>>> splitting/merging of IO page table levels.
>>>
>>>>  - What about the unmap and read dirty without races operation that
>>>>    vfio has?
>>>>
>>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>>> after clearing the page table entry. It's marshalling the bits from
>>> iopte into a bitmap as opposed to some logic added on top. As far as I
>>> looked for AMD this isn't difficult to add, (same for Intel) it can
>>> *I think* reuse all of the unmap code.
>>>
>>
>> OK, made some progress.
> 
> Nice!
> 
>  
>> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
>> revising locking, bugs, and more -- works with my emulated qemu patches which
>> is a good sign. But hopefully starts some sort of skeleton of what we were
>> talking about in the above thread.
> 
> I'm a bit bogged with the coming merge window right now, but will have
> more to say in a bit
> 

Thanks! Take your time :)

>> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
>> the vfio-compat one as it was easier provided there's existing userspace to work
>> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
>> steering the dirty tracking, read-clear the dirty bits, unmap and
>> get dirty. 
> 
> I think this is fine to start experimenting with
> 
>> But as we discussed earlier, the one gap of VFIO dirty API is that
>> it lacks controls for upgrading/downgrading area/IOPTE sizes which
>> is where IOMMUFD would most likely shine. When that latter part is
>> done we can probably adopt an eager-split approach inside
>> vfio-compat.
> 
> I think the native API should be new ioctls that operate on the
> hw_pagetable object to:
>   - enable/disable dirty tracking 
>   - read&clear a bitmap from a range
>   - read&unmap a bitmap from a range
>   - Manipulate IOPTE sizes
> 
Yeap -- heh nice, it is roughly what I was into already. I will take the
opportunity of doing the zerocopy/gup of the bitmap when writing the
iommufd native ioctls. The vfio-compat is still copying.

other things on my mind for iommufd native interface are:

1) The iopte split *could* be while we read the
dirty bits of a pgsize != than the args pgsize. The
read_and_clear_dirty() is already expensive, but I wonder about the idea
to take advantage of the dirty-clearing needed IOTLB flush, to also also
update the IOPTEs on the next memory-request/translation-request. Albeit
iopte collpase still needs to be routed via a specific ioctl() to promote
to a higher page size (should the migration fail or something).

(I referred to this in the past as lazy-split contrast to explicit IOPTE
size manipulation iommufd ioctls i.e. {manual,eager}-split)

2) I was thinking io_pagetable new APIs instead of hw_pagetable, but if you're
sort of seeing this as a hw_pagetable object primitive, then I am
wondering why? Albeit at the end of the day we operate on the iopt inside the hw_pagetable.

Note: IIUC that hw_pagetable is supposed to model a iommu_domain
direct manipulation which dirty fits in, but wouldn't that be applicable
to _MAP and _UNMAP too maybe?

> As you say it should not be much distance from the VFIO compat stuff
> 
> Most probably I would say to leave dirty tracking out of the type1 api
> and compat for it. Maybe we can make some limited cases work back
> compat, like the whole ioas supports iommu dirty tracking or
> something..
> 
> Need to understand if it is wortwhile - remember to use any of this
> you need a qemu that is updated to the v2 migration interface,
> so there is little practical value in back compat to old qemu if we
> expect qemu will use the native interface anyhow.
> 

Perhaps the value is really just *how much* applications need to get rewritten
to adjust to iommufd versus iteratively adding small (new) parts of it .
migration-v2 didn't turn into a complete rewrite of the vfio initial part iiuc.
Albeit I suspect keeping the compat was perhaps a struggle to keep, which we
might not what we desire if the semantics diverge too much.

e.g. If the semantics are about the same, for example, new VMMs could use the
iommufd new features using IOPTE size change ioctls() via the VFIO_IOAS (if available)
while leaving vfio-compat somewhat usable on the existent dirty ioctls. While slowly
moving to newer iommufd (and less vfio-iommu-type1) until the point you totally deprecate
it. Even without IOPTE size modification in vfio-compat you would still mark as dirty
a lot less than the current logic would (which really represents the whole bitmap).

I guess at the end of the day it is the maintenance cost -- Fortunately
APIs are looking similar and most of vfio-compat is dealing with userspace
args/validation.

Anyhow just some thoughts, open to anything really :)

>> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
>> Some of the commits made notes of some of research I made, so *I think* the APIs
>> introduced capture h/w semantics for all the three IOMMUs supporting dirty
>> tracking.
> 
> I think the primitives are pretty basic, I'd be surprised if there is
> something different :)
> 
Yeah, they are very basic.

My main unknown in the rework was how the hw protection domain context can deal with
dynamically switching dirty on and off. But I think I answered myself that restriction at
least from groking into specs and experimentation.

> Though things to be thinking about are how does this work with nested
> translation and other advanced features
/me nods

There's usually some flags/bits on the 'stage 2' domain, at least ARM (S2CD) and Intel
(second-level table). But for AMD, you usually control via the vIOMMU what features the hw
exposes[*], and I think there's an extra bit to set to disable vIOMMU dirty tracking from
VMM. But I think still the only thing needed is first-stage tracking, as that's what
matters for L0 live migration.

[*] AMD vIOMMU also offloads some of the command processing, not just the pagetables. Not
sure if ARM/Intel has the equivalent.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: iommufd(+vfio-compat) dirty tracking
  2022-03-16 16:36                                                 ` iommufd(+vfio-compat) dirty tracking Joao Martins
@ 2022-03-16 20:37                                                   ` Joao Martins
  2022-03-18 17:12                                                     ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-03-16 20:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 3/16/22 16:36, Joao Martins wrote:
> On 3/15/22 19:29, Jason Gunthorpe wrote:
>> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>>> On 2/28/22 13:01, Joao Martins wrote:
>>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>> Questions I have:
>>>>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>>    ranges? how/what/why?
>>>>>

An amend here.

Sigh, ARM turns out is slightly more unique compared to x86. As I am re-reviewing
the ARM side. Apparently you have two controls: one is a 'feature bit'
just like x86 and another is a modifier (arm-only).

The Context descriptor (CD) equivalent to AMD DTEs or Intel context descriptor
equivalent for second-level. That's the top-level enabler to actually a *second*
modifier bit per-PTE (or per-TTD for more accurate terminology) which is the so
called DBM (dirty-bit-modifier). The latter when set, changes the meaning of
read/write access-flags of the PTE AP[2].

If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a
transition in the SMMU from "writable Clean" to "written" means that the the
access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0)
if-and-only-if PTE.DBM = 1 (and does not generate a permission IO page fault
like it normally would be with DBM = 0). Same thing for stage-2, except that
the access-bits are reversed (S2AP[1] is set when "written" and it's cleared
when it's "writable" (when DBM is also set).

Now you could say that this allows you to control on a per-range basis.
Gah, no, more like a per-PTE basis is more accurate.

And in practice I suppose that means that dynamically switching on/off SMMU
dirty-tracking *dynamically* means not only setting CD.HD but also walking the
page tables, and atomically setting/clearing both the DBM and AP[2].

References:

DDI0487H, Table D5-30 Data access permissions
SMMU 3.2 spec, 3.13.3 Dirty flag hardware update

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: iommufd(+vfio-compat) dirty tracking
  2022-03-16 20:37                                                   ` Joao Martins
@ 2022-03-18 17:12                                                     ` Joao Martins
  2022-03-18 17:34                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-03-18 17:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On 3/16/22 20:37, Joao Martins wrote:
> On 3/16/22 16:36, Joao Martins wrote:
>> On 3/15/22 19:29, Jason Gunthorpe wrote:
>>> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>>>> On 2/28/22 13:01, Joao Martins wrote:
>>>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> Questions I have:
>>>>>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>>>    ranges? how/what/why?
>>>>>>
> 
> An amend here.
> 
> Sigh, ARM turns out is slightly more unique compared to x86. As I am re-reviewing
> the ARM side. Apparently you have two controls: one is a 'feature bit'
> just like x86 and another is a modifier (arm-only).
> 
> The Context descriptor (CD) equivalent to AMD DTEs or Intel context descriptor
> equivalent for second-level. That's the top-level enabler to actually a *second*
> modifier bit per-PTE (or per-TTD for more accurate terminology) which is the so
> called DBM (dirty-bit-modifier). The latter when set, changes the meaning of
> read/write access-flags of the PTE AP[2].
> 
> If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a
> transition in the SMMU from "writable Clean" to "written" means that the the
> access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0)
> if-and-only-if PTE.DBM = 1 (and does not generate a permission IO page fault
> like it normally would be with DBM = 0). Same thing for stage-2, except that
> the access-bits are reversed (S2AP[1] is set when "written" and it's cleared
> when it's "writable" (when DBM is also set).
> 
> Now you could say that this allows you to control on a per-range basis.
> Gah, no, more like a per-PTE basis is more accurate.
> 
> And in practice I suppose that means that dynamically switching on/off SMMU
> dirty-tracking *dynamically* means not only setting CD.HD but also walking the
> page tables, and atomically setting/clearing both the DBM and AP[2].
> 
> References:
> 
> DDI0487H, Table D5-30 Data access permissions
> SMMU 3.2 spec, 3.13.3 Dirty flag hardware update

I updated my branch and added an SMMUv3 implementation of the whole thing (slightly based
on the past work) and adjusted the 'set tracking' structure to cover this slightly
different h/w construct above. At the high-level we have 'set_dirty_tracking_range' API,
which is internal in iommufd obviously. The UAPI won't change ofc.

It's only compile-tested sadly as I have no SMMUv3.2 hardware, and to have this SMMUv3
DBM/HTTU support there's some requirements on the processor that I am not sure they can be
fully emulated. The Intel iommu implementation follows same model as AMD and I will get to
that next with a compliant iommu emulation too.

But now, I will be focusing on hw_pagetable UAPI part. I understand your thinking of being
tied to the hw_pagetable obj as opposed to IOAS for the dirty tracking APIs.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: iommufd(+vfio-compat) dirty tracking
  2022-03-18 17:12                                                     ` Joao Martins
@ 2022-03-18 17:34                                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, kvm, linux-kernel, linux-crypto,
	alex.williamson, mgurtovoy, Linuxarm, liulongfang, Zengtao (B),
	yuzenghui, Jonathan Cameron, Wangzhou (B)

On Fri, Mar 18, 2022 at 05:12:01PM +0000, Joao Martins wrote:
> I updated my branch and added an SMMUv3 implementation of the whole thing (slightly based
> on the past work) and adjusted the 'set tracking' structure to cover this slightly
> different h/w construct above. At the high-level we have 'set_dirty_tracking_range' API,
> which is internal in iommufd obviously. The UAPI won't change ofc.

Shameerali has HW, AFAIK..

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2022-03-18 17:34 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-02  9:58 [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Shameer Kolothum
2021-07-02  9:58 ` [RFC v2 1/4] hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices Shameer Kolothum
2021-07-02 20:29   ` Alex Williamson
2021-07-05  7:20     ` Shameerali Kolothum Thodi
2021-07-04  7:03   ` Leon Romanovsky
2021-07-05  8:47     ` Shameerali Kolothum Thodi
2021-07-05  9:41       ` Max Gurtovoy
2021-07-05 10:18         ` Shameerali Kolothum Thodi
2021-07-05 18:27           ` Leon Romanovsky
2021-07-05 18:32             ` Jason Gunthorpe
2021-07-06  3:59               ` Leon Romanovsky
2021-07-06  4:39               ` Christoph Hellwig
2021-07-06 11:51                 ` Jason Gunthorpe
2021-07-02  9:58 ` [RFC v2 2/4] hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size Shameer Kolothum
2021-07-02 20:29   ` Alex Williamson
2021-07-05  7:22     ` Shameerali Kolothum Thodi
2021-07-02  9:58 ` [RFC v2 3/4] crypto: hisilicon/qm - Export mailbox functions for common use Shameer Kolothum
2021-07-04  9:34   ` Max Gurtovoy
2021-07-05 10:23     ` Shameerali Kolothum Thodi
2021-07-02  9:58 ` [RFC v2 4/4] hisi_acc_vfio_pci: Add support for vfio live migration Shameer Kolothum
2022-02-02 13:14 ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Jason Gunthorpe
2022-02-02 14:34   ` Shameerali Kolothum Thodi
2022-02-02 15:39     ` Jason Gunthorpe
2022-02-02 16:10       ` Shameerali Kolothum Thodi
2022-02-02 17:03         ` Jason Gunthorpe
2022-02-02 19:05           ` Joao Martins
2022-02-03 15:18             ` Jason Gunthorpe
2022-02-04 19:53               ` Joao Martins
2022-02-04 23:07                 ` Jason Gunthorpe
2022-02-11 17:28                   ` Joao Martins
2022-02-11 17:49                     ` Jason Gunthorpe
2022-02-11 21:43                       ` Joao Martins
2022-02-12  0:01                         ` Jason Gunthorpe
2022-02-14 13:34                           ` Joao Martins
2022-02-14 14:06                             ` Jason Gunthorpe
2022-02-15 16:00                               ` Joao Martins
2022-02-15 16:21                                 ` Jason Gunthorpe
2022-02-22 11:55                                   ` Joao Martins
2022-02-23  1:03                                     ` Jason Gunthorpe
2022-02-25 19:18                                       ` Joao Martins
2022-02-25 20:44                                         ` Jason Gunthorpe
2022-02-28 13:01                                           ` Joao Martins
2022-02-28 21:01                                             ` Jason Gunthorpe
2022-03-01 13:06                                               ` Joao Martins
2022-03-01 13:54                                                 ` Jason Gunthorpe
2022-03-01 14:27                                                   ` Joao Martins
2022-03-11 13:51                                             ` iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver) Joao Martins
2022-03-15 19:29                                               ` Jason Gunthorpe
2022-03-16 16:36                                                 ` iommufd(+vfio-compat) dirty tracking Joao Martins
2022-03-16 20:37                                                   ` Joao Martins
2022-03-18 17:12                                                     ` Joao Martins
2022-03-18 17:34                                                       ` Jason Gunthorpe
2022-02-02 17:30     ` [RFC v2 0/4] vfio/hisilicon: add acc live migration driver Alex Williamson
2022-02-02 18:04       ` Jason Gunthorpe
2022-02-18 16:37 ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).