* [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver
@ 2021-04-13  3:36 Longfang Liu
  2021-04-13  3:36 ` [RFC PATCH 1/3] " Longfang Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Longfang Liu @ 2021-04-13  3:36 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, linux-kernel, linuxarm, liulongfang

The live migration solution relies on the vfio_device_migration_info protocol.
The vfio_device_migration_info structure is placed at offset 0 of the
VFIO_REGION_SUBTYPE_MIGRATION region and is used to get and set VFIO device
migration state. Fields of this structure may only be accessed at their
native width and alignment; otherwise, the result is undefined and vendor
drivers should return an error.

(1). The driver framework is built on vfio_pci_register_dev_region() in
vfio-pci: a new live migration region is added, and live migration is
implemented through the ops of this region.

(2). To ensure that the devices are compatible before and after migration,
a device compatibility check is performed in the Pre-copy stage. If the
check fails, an error is returned and the source VM aborts the migration.

(3). Once the compatibility check passes, migration enters the Stop-and-copy
stage. All live migration data is copied and saved to the destination VF
device; the destination VF device is then started and the source VM is
stopped.

Longfang Liu (3):
  vfio/hisilicon: add acc live migration driver
  vfio/hisilicon: register the driver to vfio
  vfio/hisilicon: add debugfs for driver

 drivers/vfio/pci/Kconfig                      |    8 +
 drivers/vfio/pci/Makefile                     |    1 +
 drivers/vfio/pci/hisilicon/acc_vf_migration.c | 1337 +++++++++++++++++++++++++
 drivers/vfio/pci/hisilicon/acc_vf_migration.h |  170 ++++
 drivers/vfio/pci/vfio_pci.c                   |   11 +
 drivers/vfio/pci/vfio_pci_private.h           |    9 +
 6 files changed, 1536 insertions(+)
 create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.c
 create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.h

-- 
2.8.1



* [RFC PATCH 1/3] vfio/hisilicon: add acc live migration driver
  2021-04-13  3:36 [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Longfang Liu
@ 2021-04-13  3:36 ` Longfang Liu
  2021-04-13  3:36 ` [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio Longfang Liu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Longfang Liu @ 2021-04-13  3:36 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, linux-kernel, linuxarm, liulongfang

This driver adds the code required by the HiSilicon
accelerator device to implement the live migration function.
It mainly includes the following:
(1). Matching the accelerator device with the vfio-pci
driver framework.
(2). Handling the live migration device states and processing
the migration data.
(3). Operating the accelerator hardware device and processing its data.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/Kconfig                      |    8 +
 drivers/vfio/pci/Makefile                     |    1 +
 drivers/vfio/pci/hisilicon/acc_vf_migration.c | 1144 +++++++++++++++++++++++++
 drivers/vfio/pci/hisilicon/acc_vf_migration.h |  168 ++++
 4 files changed, 1321 insertions(+)
 create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.c
 create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 4abddbe..c1b181f 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -45,3 +45,11 @@ config VFIO_PCI_NVLINK2
 	depends on VFIO_PCI && PPC_POWERNV && SPAPR_TCE_IOMMU
 	help
 	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+
+config VFIO_PCI_HISI_MIGRATION
+	bool "Support for HiSilicon Live Migration"
+	depends on ARM64 && ACPI
+	select PCI_IOV
+	default y
+	help
+	  Support for VF live migration of HiSilicon vfio-pci accelerator devices.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index eff97a7..d17cec4 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -4,5 +4,6 @@ vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_S390) += vfio_pci_zdev.o
+vfio-pci-$(CONFIG_VFIO_PCI_HISI_MIGRATION) += hisilicon/acc_vf_migration.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/hisilicon/acc_vf_migration.c b/drivers/vfio/pci/hisilicon/acc_vf_migration.c
new file mode 100644
index 0000000..5d8650d
--- /dev/null
+++ b/drivers/vfio/pci/hisilicon/acc_vf_migration.c
@@ -0,0 +1,1144 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 HiSilicon Limited. */
+
+#include <linux/device.h>
+#include <linux/debugfs.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/sysfs.h>
+#include <linux/vfio.h>
+
+#include "acc_vf_migration.h"
+
+#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
+void vfio_pci_hisilicon_acc_uninit(struct acc_vf_migration *acc_vf_dev);
+
+/* return 0 if the mailbox is ready, -ETIMEDOUT on hardware timeout */
+static int qm_wait_mb_ready(struct hisi_qm *qm)
+{
+	u32 val;
+
+	return readl_relaxed_poll_timeout(qm->io_base + QM_MB_CMD_SEND_BASE,
+					val, !((val >> QM_MB_BUSY_SHIFT) &
+					0x1), MB_POLL_PERIOD_US,
+					MB_POLL_TIMEOUT_US);
+}
+
+/* return 0 if the VM acc device is ready, -ETIMEDOUT on hardware timeout */
+static int qm_wait_dev_ready(struct hisi_qm *qm)
+{
+	u32 val;
+
+	return readl_relaxed_poll_timeout(qm->io_base + QM_VF_EQ_INT_MASK,
+				val, !(val & 0x1), MB_POLL_PERIOD_US,
+				MB_POLL_TIMEOUT_US);
+}
+
+
+/* 128 bits must be written to the hardware in one go to trigger a mailbox */
+static void qm_mb_write(struct hisi_qm *qm, const void *src)
+{
+	void __iomem *fun_base = qm->io_base + QM_MB_CMD_SEND_BASE;
+	unsigned long tmp0 = 0;
+	unsigned long tmp1 = 0;
+
+	if (!IS_ENABLED(CONFIG_ARM64)) {
+		memcpy_toio(fun_base, src, 16);
+		wmb();
+		return;
+	}
+
+	asm volatile("ldp %0, %1, %3\n"
+		     "stp %0, %1, %2\n"
+		     "dsb sy\n"
+		     : "=&r" (tmp0),
+		       "=&r" (tmp1),
+		       "+Q" (*((char __iomem *)fun_base))
+		     : "Q" (*((char *)src))
+		     : "memory");
+}
+
+static void qm_mb_pre_init(struct qm_mailbox *mailbox, u8 cmd,
+			   u16 queue, bool op)
+{
+	mailbox->w0 = cpu_to_le16(cmd |
+		     (op ? 0x1 << QM_MB_OP_SHIFT : 0) |
+		     (0x1 << QM_MB_BUSY_SHIFT));
+	mailbox->queue_num = cpu_to_le16(queue);
+	mailbox->rsvd = 0;
+}
+
+static int qm_mb_nolock(struct hisi_qm *qm, struct qm_mailbox *mailbox)
+{
+	int cnt = 0;
+
+	if (unlikely(qm_wait_mb_ready(qm))) {
+		dev_err(&qm->pdev->dev, "QM mailbox is busy, cannot start!\n");
+		return -EBUSY;
+	}
+
+	qm_mb_write(qm, mailbox);
+	while (true) {
+		if (!qm_wait_mb_ready(qm))
+			break;
+		if (++cnt > QM_MB_MAX_WAIT_CNT) {
+			dev_err(&qm->pdev->dev, "QM mailbox operation timeout!\n");
+			return -EBUSY;
+		}
+	}
+	return 0;
+}
+
+static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
+		 bool op)
+{
+	struct qm_mailbox mailbox;
+	int ret;
+
+	dev_dbg(&qm->pdev->dev, "QM mailbox request to q%u: %u-0x%llx\n",
+			queue, cmd, (unsigned long long)dma_addr);
+
+	qm_mb_pre_init(&mailbox, cmd, queue, op);
+	mailbox.base_l = cpu_to_le32(lower_32_bits(dma_addr));
+	mailbox.base_h = cpu_to_le32(upper_32_bits(dma_addr));
+
+	mutex_lock(&qm->mailbox_lock);
+	ret = qm_mb_nolock(qm, &mailbox);
+	mutex_unlock(&qm->mailbox_lock);
+
+	return ret;
+}
+
+/*
+ * Each state Reg is checked 100 times,
+ * with a delay of 100 microseconds after each check
+ */
+static u32 acc_check_reg_state(struct hisi_qm *qm, u32 regs)
+{
+	int check_times = 0;
+	u32 state;
+
+	state = readl(qm->io_base + regs);
+	while (state && check_times < ERROR_CHECK_TIMEOUT) {
+		udelay(CHECK_DELAY_TIME);
+		state = readl(qm->io_base + regs);
+		check_times++;
+	}
+
+	return state;
+}
+
+/* Check the PF's RAS state and the Function INT state */
+static int qm_check_int_state(struct acc_vf_migration *acc_vf_dev)
+{
+	struct hisi_qm *vfqm = acc_vf_dev->vf_qm;
+	struct hisi_qm *qm = acc_vf_dev->pf_qm;
+	struct device *dev = &qm->pdev->dev;
+	u32 state;
+
+	/* Check RAS state */
+	state = acc_check_reg_state(qm, QM_ABNORMAL_INT_STATUS);
+	if (state) {
+		dev_err(dev, "failed to check QM RAS state!\n");
+		return -EBUSY;
+	}
+
+	/* Check the communication state between the PF and the VF */
+	state = acc_check_reg_state(vfqm, QM_IFC_INT_STATUS);
+	if (state) {
+		dev_err(dev, "failed to check QM IFC INT state!\n");
+		return -EBUSY;
+	}
+	state = acc_check_reg_state(vfqm, QM_IFC_INT_SET_V);
+	if (state) {
+		dev_err(dev, "failed to check QM IFC INT SET state!\n");
+		return -EBUSY;
+	}
+
+	/* Check submodule task state */
+	switch (acc_vf_dev->acc_type) {
+	case HISI_SEC:
+		state = acc_check_reg_state(qm, SEC_CORE_INT_STATUS);
+		if (state) {
+			dev_err(dev, "failed to check QM SEC Core INT state!\n");
+			return -EBUSY;
+		}
+		break;
+	case HISI_HPRE:
+		state = acc_check_reg_state(qm, HPRE_HAC_INT_STATUS);
+		if (state) {
+			dev_err(dev, "failed to check QM HPRE HAC INT state!\n");
+			return -EBUSY;
+		}
+		break;
+	case HISI_ZIP:
+		state = acc_check_reg_state(qm, HZIP_CORE_INT_STATUS);
+		if (state) {
+			dev_err(dev, "failed to check QM ZIP Core INT state!\n");
+			return -EBUSY;
+		}
+		break;
+	default:
+		dev_err(dev, "failed to detect acc module type!\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int qm_read_reg(struct hisi_qm *qm, u32 reg_addr,
+			 u32 *data, u8 nums)
+{
+	int i;
+
+	if (nums < 1 || nums > QM_REGS_MAX_LEN) {
+		dev_err(&qm->pdev->dev, "invalid QM register read parameter!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nums; i++) {
+		data[i] = readl(qm->io_base + reg_addr);
+		reg_addr += QM_REG_ADDR_OFFSET;
+	}
+
+	return 0;
+}
+
+static int qm_write_reg(struct hisi_qm *qm, u32 reg_addr,
+			 u32 *data, u8 nums)
+{
+	int i;
+
+	if (nums < 1 || nums > QM_REGS_MAX_LEN) {
+		dev_err(&qm->pdev->dev, "invalid QM register write parameter!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nums; i++) {
+		writel(data[i], qm->io_base + reg_addr);
+		reg_addr += QM_REG_ADDR_OFFSET;
+	}
+
+	return 0;
+}
+
+static int qm_get_vft(struct hisi_qm *qm, u32 *base, u32 *number)
+{
+	u64 sqc_vft;
+	int ret;
+
+	ret = qm_mb(qm, QM_MB_CMD_SQC_VFT_V2, 0, 0, 1);
+	if (ret)
+		return ret;
+
+	sqc_vft = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
+		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) <<
+		  QM_XQC_ADDR_OFFSET);
+	*base = QM_SQC_VFT_BASE_MASK_V2 & (sqc_vft >> QM_SQC_VFT_BASE_SHIFT_V2);
+	*number = (QM_SQC_VFT_NUM_MASK_V2 &
+		  (sqc_vft >> QM_SQC_VFT_NUM_SHIFT_V2)) + 1;
+
+	return 0;
+}
+
+static int qm_get_sqc(struct hisi_qm *qm, u64 *addr)
+{
+	int ret;
+
+	ret = qm_mb(qm, QM_MB_CMD_SQC_BT, 0, 0, 1);
+	if (ret)
+		return ret;
+
+	*addr = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
+		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) <<
+		  QM_XQC_ADDR_OFFSET);
+
+	return 0;
+}
+
+static int qm_get_cqc(struct hisi_qm *qm, u64 *addr)
+{
+	int ret;
+
+	ret = qm_mb(qm, QM_MB_CMD_CQC_BT, 0, 0, 1);
+	if (ret)
+		return ret;
+
+	*addr = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
+		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) <<
+		  QM_XQC_ADDR_OFFSET);
+
+	return 0;
+}
+
+static int qm_rw_regs_read(struct hisi_qm *qm, struct acc_vf_data *vf_data)
+{
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	/* check VM task driver state */
+	if (unlikely(qm_wait_dev_ready(qm))) {
+		dev_err(&qm->pdev->dev, "QM device is not ready to read!\n");
+		return -EBUSY;
+	}
+
+	ret = qm_read_reg(qm, QM_VF_AEQ_INT_MASK, &vf_data->aeq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_VF_AEQ_INT_MASK!\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_VF_EQ_INT_MASK, &vf_data->eq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_VF_EQ_INT_MASK!\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_IFC_INT_SOURCE_V,
+			   &vf_data->ifc_int_source, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_IFC_INT_SOURCE_V!\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_IFC_INT_MASK, &vf_data->ifc_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_IFC_INT_MASK!\n");
+		return ret;
+	}
+
+	ret = qm_read_reg(qm, QM_IFC_INT_SET_V, &vf_data->ifc_int_set, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_IFC_INT_SET_V!\n");
+		return ret;
+	}
+
+	/* QM_EQC_DW has 7 regs */
+	ret = qm_read_reg(qm, QM_EQC_DW0, vf_data->qm_eqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to read QM_EQC_DW!\n");
+		return ret;
+	}
+
+	/* QM_AEQC_DW has 7 regs */
+	ret = qm_read_reg(qm, QM_AEQC_DW0, vf_data->qm_aeqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to read QM_AEQC_DW!\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static int qm_rw_regs_write(struct hisi_qm *qm, struct acc_vf_data *vf_data)
+{
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	/* check VF state */
+	if (unlikely(qm_wait_mb_ready(qm))) {
+		dev_err(&qm->pdev->dev, "QM device is not ready to write!\n");
+		return -EBUSY;
+	}
+
+	ret = qm_write_reg(qm, QM_VF_AEQ_INT_MASK, &vf_data->aeq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_VF_AEQ_INT_MASK!\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_VF_EQ_INT_MASK, &vf_data->eq_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_VF_EQ_INT_MASK!\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_IFC_INT_SOURCE_V,
+			   &vf_data->ifc_int_source, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_IFC_INT_SOURCE_V!\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_IFC_INT_MASK, &vf_data->ifc_int_mask, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_IFC_INT_MASK!\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_IFC_INT_SET_V, &vf_data->ifc_int_set, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_IFC_INT_SET_V!\n");
+		return ret;
+	}
+
+	ret = qm_write_reg(qm, QM_QUE_ISO_CFG_V, &vf_data->que_iso_cfg, 1);
+	if (ret) {
+		dev_err(dev, "failed to write QM_QUE_ISO_CFG_V!\n");
+		return ret;
+	}
+
+	/* QM_EQC_DW has 7 regs */
+	ret = qm_write_reg(qm, QM_EQC_DW0, vf_data->qm_eqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to write QM_EQC_DW!\n");
+		return ret;
+	}
+
+	/* QM_AEQC_DW has 7 regs */
+	ret = qm_write_reg(qm, QM_AEQC_DW0, vf_data->qm_aeqc_dw, 7);
+	if (ret) {
+		dev_err(dev, "failed to write QM_AEQC_DW!\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * The VF QM has been unbound from the host and is bound inside the VM,
+ * so the qm structure here only holds the address taken from the PCI
+ * device; everything else is empty and must be read back from the
+ * hardware registers.
+ */
+static int vf_migration_data_store(struct hisi_qm *qm,
+			struct acc_vf_migration *acc_vf_dev)
+{
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	ret = qm_rw_regs_read(qm, vf_data);
+	if (ret) {
+		dev_err(dev, "failed to read QM regs!\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Each register is 32 bits while the DMA address is 64 bits, so
+	 * the DMA address is stored in Reg2 (high part) and Reg1 (low part).
+	 */
+	vf_data->eqe_dma = vf_data->qm_eqc_dw[2];
+	vf_data->eqe_dma <<= QM_XQC_ADDR_OFFSET;
+	vf_data->eqe_dma |= vf_data->qm_eqc_dw[1];
+	vf_data->aeqe_dma = vf_data->qm_aeqc_dw[2];
+	vf_data->aeqe_dma <<= QM_XQC_ADDR_OFFSET;
+	vf_data->aeqe_dma |= vf_data->qm_aeqc_dw[1];
+
+	/* Through SQC_BT/CQC_BT to get sqc and cqc address */
+	ret = qm_get_sqc(qm, &vf_data->sqc_dma);
+	if (ret) {
+		dev_err(dev, "failed to read SQC addr!\n");
+		return -EINVAL;
+	}
+
+	ret = qm_get_cqc(qm, &vf_data->cqc_dma);
+	if (ret) {
+		dev_err(dev, "failed to read CQC addr!\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void qm_dev_cmd_init(struct hisi_qm *qm)
+{
+	/* clear VF communication status registers. */
+	writel(0x1, qm->io_base + QM_IFC_INT_SOURCE_V);
+
+	/* enable pf and vf communication. */
+	writel(0x0, qm->io_base + QM_IFC_INT_MASK);
+}
+
+static void qm_db(struct hisi_qm *qm, u16 qn, u8 cmd,
+	u16 index, u8 priority)
+{
+	u64 doorbell;
+	u64 dbase;
+	u16 randata = 0;
+
+	if (cmd == QM_DOORBELL_CMD_SQ || cmd == QM_DOORBELL_CMD_CQ)
+		dbase = QM_DOORBELL_SQ_CQ_BASE_V2;
+	else
+		dbase = QM_DOORBELL_EQ_AEQ_BASE_V2;
+
+	doorbell = qn | ((u64)cmd << QM_DB_CMD_SHIFT_V2) |
+		   ((u64)randata << QM_DB_RAND_SHIFT_V2) |
+		   ((u64)index << QM_DB_INDEX_SHIFT_V2)	 |
+		   ((u64)priority << QM_DB_PRIORITY_SHIFT_V2);
+
+	writeq(doorbell, qm->io_base + dbase);
+}
+
+static void vf_qm_fun_restart(struct hisi_qm *qm)
+{
+	int i;
+
+	for (i = 0; i < qm->qp_num; i++)
+		qm_db(qm, i, QM_DOORBELL_CMD_SQ, 0, 1);
+}
+
+static int vf_info_check(struct hisi_qm *qm,
+	struct acc_vf_migration *acc_vf_dev)
+{
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct device *dev = &qm->pdev->dev;
+	u32 que_iso_state;
+	int ret;
+
+	/* vf acc type check */
+	if (vf_data->acc_type != acc_vf_dev->acc_type) {
+		dev_err(dev, "failed to match VF acc type!\n");
+		return -EINVAL;
+	}
+
+	/* vf qp num check */
+	ret = qm_get_vft(qm, &qm->qp_base, &qm->qp_num);
+	if (ret || qm->qp_num <= 1) {
+		dev_err(dev, "failed to get vft qp nums!\n");
+		return ret ? ret : -EINVAL;
+	}
+	if (vf_data->qp_num != qm->qp_num) {
+		dev_err(dev, "failed to match VF qp num!\n");
+		return -EINVAL;
+	}
+
+	/* vf isolation state check */
+	ret = qm_read_reg(qm, QM_QUE_ISO_CFG_V, &que_iso_state, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_QUE_ISO_CFG_V!\n");
+		return ret;
+	}
+	if (vf_data->que_iso_cfg != que_iso_state) {
+		dev_err(dev, "failed to match isolation state!\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int vf_migration_data_recover(struct hisi_qm *qm,
+	struct acc_vf_data *vf_data)
+{
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	qm->eqe_dma = vf_data->eqe_dma;
+	qm->aeqe_dma = vf_data->aeqe_dma;
+	qm->sqc_dma = vf_data->sqc_dma;
+	qm->cqc_dma = vf_data->cqc_dma;
+
+	qm->qp_base = vf_data->qp_base;
+	qm->qp_num = vf_data->qp_num;
+
+	ret = qm_rw_regs_write(qm, vf_data);
+	if (ret) {
+		dev_err(dev, "Set VF regs failed!\n");
+		return ret;
+	}
+
+	ret = qm_mb(qm, QM_MB_CMD_SQC_BT, qm->sqc_dma, 0, 0);
+	if (ret) {
+		dev_err(dev, "Set sqc failed!\n");
+		return ret;
+	}
+
+	ret = qm_mb(qm, QM_MB_CMD_CQC_BT, qm->cqc_dma, 0, 0);
+	if (ret) {
+		dev_err(dev, "Set cqc failed!\n");
+		return ret;
+	}
+
+	/* which ACC modules need to be reinitialized? */
+	qm_dev_cmd_init(qm);
+
+	/* re-enable the VF MSI interrupts */
+	writel(0x0, qm->io_base + QM_VF_EQ_INT_MASK);
+	writel(0x0, qm->io_base + QM_VF_AEQ_INT_MASK);
+
+	return 0;
+}
+
+static int vf_qm_cache_wb(struct hisi_qm *qm)
+{
+	unsigned int val;
+
+	writel(0x1, qm->io_base + QM_CACHE_WB_START);
+	if (readl_relaxed_poll_timeout(qm->io_base + QM_CACHE_WB_DONE,
+					val, val & BIT(0), MB_POLL_PERIOD_US,
+					MB_POLL_TIMEOUT_US)) {
+		dev_err(&qm->pdev->dev, "VF QM SQC cache writeback failed!\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int vf_qm_func_stop(struct hisi_qm *qm)
+{
+	return qm_mb(qm, QM_MB_CMD_PAUSE_QM, 0, 0, 0);
+}
+
+static int vf_qm_state_pre_save(struct hisi_qm *qm,
+		struct acc_vf_migration *acc_vf_dev)
+{
+	struct acc_vf_data *vf_data = acc_vf_dev->vf_data;
+	struct device *dev = &qm->pdev->dev;
+	int ret;
+
+	/* check VM task driver state */
+	if (unlikely(qm_wait_dev_ready(qm))) {
+		dev_err(&qm->pdev->dev, "QM device is not ready to read!\n");
+		return -EBUSY;
+	}
+
+	/* vf acc type save */
+	vf_data->acc_type = acc_vf_dev->acc_type;
+
+	/* vf qp num save */
+	ret = qm_get_vft(qm, &qm->qp_base, &qm->qp_num);
+	if (ret || qm->qp_num <= 1) {
+		dev_err(dev, "failed to get vft qp nums!\n");
+		return -EINVAL;
+	}
+	vf_data->qp_base = qm->qp_base;
+	vf_data->qp_num = qm->qp_num;
+
+	/* vf isolation state save */
+	ret = qm_read_reg(qm, QM_QUE_ISO_CFG_V, &vf_data->que_iso_cfg, 1);
+	if (ret) {
+		dev_err(dev, "failed to read QM_QUE_ISO_CFG_V!\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static int vf_qm_state_save(struct hisi_qm *qm,
+		struct acc_vf_migration *acc_vf_dev)
+{
+	struct device *dev = &acc_vf_dev->vf_dev->dev;
+	int ret;
+
+	/* First stop the ACC vf function */
+	ret = vf_qm_func_stop(qm);
+	if (ret) {
+		dev_err(dev, "failed to stop QM VF function!\n");
+		goto state_error;
+	}
+
+	/* Check the VF's RAS and interrupt state */
+	ret = qm_check_int_state(acc_vf_dev);
+	if (ret) {
+		dev_err(dev, "failed to check QM INT state!\n");
+		goto state_error;
+	}
+
+	/* write the QM cache data back to DDR */
+	ret = vf_qm_cache_wb(qm);
+	if (ret) {
+		dev_err(dev, "failed to writeback QM Cache!\n");
+		goto state_error;
+	}
+
+	ret = vf_migration_data_store(qm, acc_vf_dev);
+	if (ret) {
+		dev_err(dev, "failed to get and store migration data!\n");
+		goto state_error;
+	}
+
+	return 0;
+
+state_error:
+	vf_qm_fun_restart(qm);
+	return ret;
+}
+
+static int vf_qm_state_resume(struct hisi_qm *qm,
+		struct acc_vf_migration *acc_vf_dev)
+{
+	struct device *dev = &acc_vf_dev->vf_dev->dev;
+	int ret;
+
+	/* recover data to VF */
+	ret = vf_migration_data_recover(qm, acc_vf_dev->vf_data);
+	if (ret) {
+		dev_err(dev, "failed to recover the VF!\n");
+		return ret;
+	}
+
+	/* restart all destination VF's QP */
+	vf_qm_fun_restart(qm);
+
+	return 0;
+}
+
+static int acc_vf_set_device_state(struct acc_vf_migration *acc_vf_dev,
+				       u32 state)
+{
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	struct device *dev = &acc_vf_dev->vf_dev->dev;
+	struct hisi_qm *qm = acc_vf_dev->vf_qm;
+	int ret = 0;
+
+	if (state == mig_ctl->device_state)
+		return 0;
+
+	switch (state) {
+	case VFIO_DEVICE_STATE_RUNNING:
+		if (mig_ctl->device_state == VFIO_DEVICE_STATE_RESUMING) {
+			ret = vf_qm_state_resume(qm, acc_vf_dev);
+			if (ret) {
+				dev_err(dev, "failed to resume device!\n");
+				return -EFAULT;
+			}
+		}
+
+		break;
+	case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+		/* in the Pre-copy stage, read the device match information */
+		ret = vf_qm_state_pre_save(qm, acc_vf_dev);
+		if (ret) {
+			dev_err(dev, "failed to pre save device state!\n");
+			return -EFAULT;
+		}
+
+		/* set pending_bytes and the match data size */
+		mig_ctl->data_size = QM_MATCH_SIZE;
+		mig_ctl->pending_bytes = mig_ctl->data_size;
+
+		break;
+	case VFIO_DEVICE_STATE_SAVING:
+		/* stop the vf function */
+		ret = vf_qm_state_save(qm, acc_vf_dev);
+		if (ret) {
+			dev_err(dev, "failed to save device state!\n");
+			return -EFAULT;
+		}
+
+		/* set pending_bytes and data_size */
+		mig_ctl->data_size = sizeof(struct acc_vf_data);
+		mig_ctl->pending_bytes = mig_ctl->data_size;
+
+		break;
+	case VFIO_DEVICE_STATE_STOP:
+		/* restart all of the VF's QPs */
+		vf_qm_fun_restart(qm);
+
+		break;
+	case VFIO_DEVICE_STATE_RESUMING:
+
+		break;
+	default:
+		ret = -EFAULT;
+	}
+
+	if (!ret) {
+		dev_info(dev, "migration state: %s ----------> %s!\n",
+				vf_dev_state[mig_ctl->device_state],
+				vf_dev_state[state]);
+		mig_ctl->device_state = state;
+	}
+
+	return ret;
+}
+
+static int acc_vf_data_transfer(struct acc_vf_migration *acc_vf_dev,
+	char __user *buf, size_t count, u64 pos, bool iswrite)
+{
+	struct vfio_device_migration_info *mig_ctl = acc_vf_dev->mig_ctl;
+	void *data_addr = acc_vf_dev->vf_data;
+	int ret = 0;
+
+	if (!count) {
+		dev_err(&acc_vf_dev->vf_dev->dev,
+			 "QEMU operation data size error!\n");
+		return -EINVAL;
+	}
+
+	data_addr += pos - mig_ctl->data_offset;
+	if (iswrite)  {
+		ret = copy_from_user(data_addr, buf, count) ?
+				     -EFAULT : count;
+		if (ret == count)
+			mig_ctl->pending_bytes += count;
+	} else {
+		ret = copy_to_user(buf, data_addr, count) ?
+				   -EFAULT : count;
+		if (ret == count)
+			mig_ctl->pending_bytes -= count;
+	}
+
+	return ret;
+}
+
+static size_t acc_vf_migration_rw(struct vfio_pci_device *vdev,
+	char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos) -
+				VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_region	*region = &vdev->region[index];
+	struct vfio_device_migration_info *mig_ctl;
+	struct acc_vf_migration *acc_vf_dev;
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct hisi_qm *qm;
+	struct device *dev;
+	u32 device_state;
+	int ret = 0;
+
+	if (!region) {
+		pr_err("Failed to get valid migration region!\n");
+		return -EINVAL;
+	}
+
+	acc_vf_dev = region->data;
+	qm = acc_vf_dev->vf_qm;
+	dev = &acc_vf_dev->vf_dev->dev;
+	mig_ctl = acc_vf_dev->mig_ctl;
+	if (pos >= region->size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(region->size - pos));
+
+	switch (pos) {
+	case VDM_OFFSET(device_state):
+		if (count != sizeof(mig_ctl->device_state)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (iswrite) {
+			if (copy_from_user(&device_state, buf, count)) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = acc_vf_set_device_state(acc_vf_dev,
+					   device_state) ? ret : count;
+		} else {
+			ret = copy_to_user(buf, &mig_ctl->device_state,
+					   count) ? -EFAULT : count;
+		}
+		break;
+	case VDM_OFFSET(reserved):
+		ret = -EFAULT;
+		break;
+	case VDM_OFFSET(pending_bytes):
+		if (count != sizeof(mig_ctl->pending_bytes)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (iswrite)
+			ret = -EFAULT;
+		else
+			ret = copy_to_user(buf, &mig_ctl->pending_bytes,
+					   count) ? -EFAULT : count;
+		break;
+	case VDM_OFFSET(data_offset):
+		if (count != sizeof(mig_ctl->data_offset)) {
+			ret = -EINVAL;
+			break;
+		}
+		if (iswrite)
+			ret = copy_from_user(&mig_ctl->data_offset, buf,
+					count) ? -EFAULT : count;
+		else
+			ret = copy_to_user(buf, &mig_ctl->data_offset,
+					count) ? -EFAULT : count;
+		break;
+	case VDM_OFFSET(data_size):
+		if (count != sizeof(mig_ctl->data_size)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (iswrite)
+			ret = copy_from_user(&mig_ctl->data_size, buf, count) ?
+					   -EFAULT : count;
+		else
+			ret = copy_to_user(buf, &mig_ctl->data_size, count) ?
+					   -EFAULT : count;
+		break;
+	default:
+		ret = -EFAULT;
+		break;
+	}
+
+	/* Transfer data section */
+	if (pos >= mig_ctl->data_offset &&
+	    pos < MIGRATION_REGION_SZ) {
+		ret = acc_vf_data_transfer(acc_vf_dev, buf,
+				count, pos, iswrite);
+		if (ret != count)
+			return ret;
+	}
+
+	if (mig_ctl->device_state == VFIO_DEVICE_STATE_RESUMING &&
+	    mig_ctl->pending_bytes == QM_MATCH_SIZE &&
+	    mig_ctl->data_size == QM_MATCH_SIZE) {
+		/* check the VF match information */
+		ret = vf_info_check(qm, acc_vf_dev);
+		if (ret) {
+			dev_err(dev, "failed to check match information!\n");
+			return -EFAULT;
+		}
+		ret = count;
+
+		/* clear the VF match data size */
+		mig_ctl->pending_bytes = 0;
+		mig_ctl->data_size = 0;
+	}
+
+	return ret;
+}
+
+static int acc_vf_migration_mmap(struct vfio_pci_device *vdev,
+				  struct vfio_pci_region *region,
+				  struct vm_area_struct *vma)
+{
+	return -EFAULT;
+}
+
+static void acc_vf_migration_release(struct vfio_pci_device *vdev,
+				      struct vfio_pci_region *region)
+{
+	struct acc_vf_migration *acc_vf_dev = region->data;
+
+	vfio_pci_hisilicon_acc_uninit(acc_vf_dev);
+}
+
+static int acc_vf_migration_add_capability(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vfio_info_cap *caps)
+{
+	struct vfio_region_info_cap_type cap_type;
+
+	cap_type.header.id = VFIO_REGION_INFO_CAP_TYPE;
+	cap_type.header.version = 1;
+	cap_type.type = region->type;
+	cap_type.subtype = region->subtype;
+
+	return vfio_info_add_capability(caps, &cap_type.header,
+				       sizeof(cap_type));
+}
+
+static const struct vfio_pci_regops vfio_pci_acc_regops = {
+	.rw = acc_vf_migration_rw,
+	.mmap = acc_vf_migration_mmap,
+	.release = acc_vf_migration_release,
+	.add_capability = acc_vf_migration_add_capability,
+};
+
+static int qm_acc_type_init(struct acc_vf_migration *acc_vf_dev)
+{
+	struct pci_dev *pdev = acc_vf_dev->vf_dev;
+
+	switch (pdev->device) {
+	case SEC_VF_DEVICE_ID:
+		acc_vf_dev->acc_type = HISI_SEC;
+		break;
+	case HPRE_VF_DEVICE_ID:
+		acc_vf_dev->acc_type = HISI_HPRE;
+		break;
+	case ZIP_VF_DEVICE_ID:
+		acc_vf_dev->acc_type = HISI_ZIP;
+		break;
+	default:
+		acc_vf_dev->acc_type = 0;
+		dev_err(&pdev->dev, "failed to check acc type!\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int vf_qm_pci_init(struct pci_dev *pdev, struct hisi_qm *vfqm)
+{
+	struct device *dev = &pdev->dev;
+	int ret;
+
+	ret = pci_request_mem_regions(pdev, vfqm->dev_name);
+	if (ret < 0) {
+		dev_err(dev, "failed to request mem regions!\n");
+		return ret;
+	}
+
+	vfqm->phys_base = pci_resource_start(pdev, PCI_BASE_BAR);
+	vfqm->io_base = devm_ioremap(dev,
+					pci_resource_start(pdev, PCI_BASE_BAR),
+					pci_resource_len(pdev, PCI_BASE_BAR));
+	if (!vfqm->io_base) {
+		ret = -EIO;
+		goto err_ioremap;
+	}
+
+	vfqm->pdev = pdev;
+	mutex_init(&vfqm->mailbox_lock);
+
+	/*
+	 * Release the memory regions so that the VF device can still be
+	 * bound inside the VM while this migration driver is loaded.
+	 */
+	pci_release_mem_regions(pdev);
+
+	return 0;
+
+err_ioremap:
+	pci_release_mem_regions(pdev);
+	return ret;
+}
+
+static int acc_vf_dev_init(struct pci_dev *pdev, struct hisi_qm *pf_qm,
+			struct acc_vf_migration *acc_vf_dev)
+{
+	struct vfio_device_migration_info *mig_ctl;
+	struct hisi_qm *vf_qm;
+	__u64 mig_offset;
+	void *vf_data;
+	int ret;
+
+	vf_qm = kzalloc(sizeof(*vf_qm), GFP_KERNEL);
+	if (!vf_qm)
+		return -ENOMEM;
+
+	/* get vf qm dev name from pf */
+	vf_qm->dev_name = pf_qm->dev_name;
+	vf_qm->fun_type = QM_HW_VF;
+	acc_vf_dev->vf_qm = vf_qm;
+	acc_vf_dev->pf_qm = pf_qm;
+
+	ret = vf_qm_pci_init(pdev, vf_qm);
+	if (ret)
+		goto init_qm_error;
+
+	ret = qm_acc_type_init(acc_vf_dev);
+	if (ret)
+		goto init_qm_error;
+
+	/* the data region must follow migration info */
+	mig_offset = sizeof(*mig_ctl);
+	mig_ctl = kzalloc(MIGRATION_REGION_SZ, GFP_KERNEL);
+	if (!mig_ctl)
+		goto init_qm_error;
+
+	acc_vf_dev->mig_ctl = mig_ctl;
+
+	vf_data = (void *)mig_ctl + mig_offset;
+	acc_vf_dev->vf_data = vf_data;
+
+	mig_ctl->device_state = VFIO_DEVICE_STATE_RUNNING;
+	mig_ctl->data_offset = mig_offset;
+	mig_ctl->data_size = sizeof(struct acc_vf_data);
+
+	return 0;
+
+init_qm_error:
+	kfree(vf_qm);
+	return -ENOMEM;
+}
+
+static int pci_dev_id_check(struct pci_dev *pdev)
+{
+	switch (pdev->device) {
+	case SEC_VF_DEVICE_ID:
+	case HPRE_VF_DEVICE_ID:
+	case ZIP_VF_DEVICE_ID:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev)
+{
+	struct acc_vf_migration *acc_vf_dev;
+	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pf_dev, *vf_dev;
+	struct hisi_qm *pf_qm;
+	int vf_id, ret;
+
+	pf_dev = pdev->physfn;
+	vf_dev = pdev;
+
+	if (pci_dev_id_check(vf_dev)) {
+		dev_err(&pdev->dev, "device does not support live migration!\n");
+		return -ENODEV;
+	}
+
+	pf_qm = pci_get_drvdata(pf_dev);
+	if (!pf_qm) {
+		dev_err(&pdev->dev, "host QM driver is not loaded!\n");
+		return -EINVAL;
+	}
+	if (pf_qm->ver < QM_HW_V3) {
+		dev_err(&pdev->dev,
+			 "device can't support migration! version: 0x%x\n",
+			 pf_qm->ver);
+		return -ENODEV;
+	}
+
+	vf_id = PCI_FUNC(vf_dev->devfn);
+	acc_vf_dev = kzalloc(sizeof(*acc_vf_dev), GFP_KERNEL);
+	if (!acc_vf_dev)
+		return -ENOMEM;
+
+	if (pci_num_vf(pf_dev) <= 0) {
+		dev_info(&pdev->dev, "vf device: %s, vf id: %d\n",
+			  pf_qm->dev_name, vf_id);
+		return -EINVAL;
+	}
+	acc_vf_dev->vf_id = vf_id;
+	acc_vf_dev->pf_dev = pf_dev;
+	acc_vf_dev->vf_dev = vf_dev;
+	mutex_init(&acc_vf_dev->reflock);
+
+	ret = acc_vf_dev_init(pdev, pf_qm, acc_vf_dev);
+	if (ret) {
+		kfree(acc_vf_dev);
+		return -ENOMEM;
+	}
+
+	ret = vfio_pci_register_dev_region(vdev,
+						VFIO_REGION_TYPE_MIGRATION,
+						VFIO_REGION_SUBTYPE_MIGRATION,
+						&vfio_pci_acc_regops,
+						MIGRATION_REGION_SZ,
+						VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE,
+						acc_vf_dev);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to register vfio_pci_acc_regops!\n");
+		goto register_error;
+	}
+
+	return 0;
+
+register_error:
+	vfio_pci_hisilicon_acc_uninit(acc_vf_dev);
+	return ret;
+}
+
+void vfio_pci_hisilicon_acc_uninit(struct acc_vf_migration *acc_vf_dev)
+{
+	struct device *dev = &acc_vf_dev->vf_dev->dev;
+	struct hisi_qm *qm = acc_vf_dev->vf_qm;
+
+	kfree(acc_vf_dev->mig_ctl);
+	acc_vf_dev->mig_ctl = NULL;
+
+	devm_iounmap(dev, qm->io_base);
+	kfree(qm);
+
+	kfree(acc_vf_dev->regions);
+	acc_vf_dev->regions = NULL;
+	acc_vf_dev->num_regions = 0;
+
+	kfree(acc_vf_dev);
+}
+
diff --git a/drivers/vfio/pci/hisilicon/acc_vf_migration.h b/drivers/vfio/pci/hisilicon/acc_vf_migration.h
new file mode 100644
index 0000000..e8b116be
--- /dev/null
+++ b/drivers/vfio/pci/hisilicon/acc_vf_migration.h
@@ -0,0 +1,168 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021 HiSilicon Limited. */
+
+#ifndef ACC_MIG_H
+#define ACC_MIG_H
+
+#include <linux/mdev.h>
+#include <linux/pci.h>
+#include <linux/vfio.h>
+
+#include "../vfio_pci_private.h"
+#include "../../../drivers/crypto/hisilicon/qm.h"
+
+#define MIGRATION_REGION_SZ (sizeof(struct acc_vf_data) + \
+			      sizeof(struct vfio_device_migration_info))
+#define VFIO_DEV_DBG_LEN		256
+#define VFIO_DBG_LOG_LEN		16
+#define VFIO_DEVFN_MASK		0xFF
+#define SEC_VF_DEVICE_ID		0xa256
+#define HPRE_VF_DEVICE_ID		0xa259
+#define ZIP_VF_DEVICE_ID		0xa251
+
+#define PCI_BASE_BAR			2
+#define MB_POLL_PERIOD_US		10
+#define MB_POLL_TIMEOUT_US		1000
+#define QM_CACHE_WB_START		0x204
+#define QM_CACHE_WB_DONE		0x208
+#define QM_MB_CMD_PAUSE_QM		0xe
+#define QM_ABNORMAL_INT_STATUS	0x100008
+#define QM_IFC_INT_STATUS		0x0028
+#define SEC_CORE_INT_STATUS		0x301008
+#define HPRE_HAC_INT_STATUS		0x301800
+#define HZIP_CORE_INT_STATUS		0x3010AC
+#define ERROR_CHECK_TIMEOUT		100
+#define CHECK_DELAY_TIME		100
+
+#define QM_SQC_VFT_BASE_SHIFT_V2	28
+#define QM_SQC_VFT_BASE_MASK_V2	GENMASK(15, 0)
+#define QM_SQC_VFT_NUM_SHIFT_V2	45
+#define QM_SQC_VFT_NUM_MASK_V2	GENMASK(9, 0)
+
+/* mailbox */
+#define QM_MB_CMD_SQC_BT		0x4
+#define QM_MB_CMD_CQC_BT		0x5
+#define QM_MB_CMD_SQC_VFT_V2		0x6
+
+#define QM_MB_CMD_SEND_BASE		0x300
+#define QM_MB_BUSY_SHIFT		13
+#define QM_MB_OP_SHIFT			14
+#define QM_MB_CMD_DATA_ADDR_L		0x304
+#define QM_MB_CMD_DATA_ADDR_H		0x308
+#define QM_MB_MAX_WAIT_CNT		6000
+
+/* doorbell */
+#define QM_DOORBELL_CMD_SQ		0
+#define QM_DOORBELL_CMD_CQ		1
+#define QM_DOORBELL_SQ_CQ_BASE_V2	0x1000
+#define QM_DOORBELL_EQ_AEQ_BASE_V2	0x2000
+#define QM_DB_CMD_SHIFT_V2		12
+#define QM_DB_RAND_SHIFT_V2		16
+#define QM_DB_INDEX_SHIFT_V2		32
+#define QM_DB_PRIORITY_SHIFT_V2	48
+
+/* RW regs */
+#define QM_REGS_MAX_LEN		7
+#define QM_REG_ADDR_OFFSET		0x0004
+
+#define QM_XQC_ADDR_OFFSET		32U
+#define QM_VF_AEQ_INT_MASK		0x0004
+#define QM_VF_EQ_INT_MASK		0x000c
+#define QM_IFC_INT_SOURCE_V		0x0020
+#define QM_IFC_INT_MASK		0x0024
+#define QM_IFC_INT_SET_V		0x002c
+#define QM_QUE_ISO_CFG_V		0x0030
+
+#define QM_EQC_DW0		0x3000
+#define QM_AEQC_DW0		0x3020
+
+struct qm_mailbox {
+	__le16 w0;
+	__le16 queue_num;
+	__le32 base_l;
+	__le32 base_h;
+	__le32 rsvd;
+};
+
+enum acc_type {
+	HISI_SEC = 0x1,
+	HISI_HPRE = 0x2,
+	HISI_ZIP = 0x3,
+};
+
+struct vf_acc_type {
+	const char *name;
+	u32 type;
+};
+
+enum mig_debug_cmd {
+	STATE_SAVE,
+	STATE_RESUME,
+	MB_TEST,
+	MIG_DATA_DUMP,
+	MIG_DEV_SHOW,
+};
+
+static const char * const vf_dev_state[] = {
+	"Stop",
+	"Running",
+	"Saving",
+	"Running & Saving",
+	"Resuming",
+};
+
+#define QM_MATCH_SIZE		32L
+struct acc_vf_data {
+	/* QM match information */
+	u32 qp_num;
+	u32 acc_type;
+	u32 que_iso_cfg;
+	u32 qp_base;
+	/* QM reserved 4 match information */
+	u32 qm_rsv_state[4];
+
+	/* QM RW regs */
+	u32 aeq_int_mask;
+	u32 eq_int_mask;
+	u32 ifc_int_source;
+	u32 ifc_int_mask;
+	u32 ifc_int_set;
+
+	/*
+	 * QM_VF_MB has 4 regs that do not need migration:
+	 * writing the mailbox reg values back would cause
+	 * the hardware to re-execute command operations.
+	 */
+
+	/* QM_EQC_DW has 7 regs */
+	u32 qm_eqc_dw[7];
+
+	/* QM_AEQC_DW has 7 regs */
+	u32 qm_aeqc_dw[7];
+
+	/* QM reserved 5 regs */
+	u32 qm_rsv_regs[5];
+
+	/* qm memory init information */
+	dma_addr_t eqe_dma;
+	dma_addr_t aeqe_dma;
+	dma_addr_t sqc_dma;
+	dma_addr_t cqc_dma;
+};
+
+struct acc_vf_migration {
+	struct pci_dev			*pf_dev;
+	struct pci_dev			*vf_dev;
+	struct hisi_qm			*pf_qm;
+	struct hisi_qm			*vf_qm;
+	int				vf_id;
+	u8				acc_type;
+
+	struct vfio_device_migration_info *mig_ctl;
+	struct acc_vf_data		*vf_data;
+	struct vfio_pci_region		*regions;
+	int				num_regions;
+};
+
+#endif /* ACC_MIG_H */
+
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-13  3:36 [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Longfang Liu
  2021-04-13  3:36 ` [RFC PATCH 1/3] " Longfang Liu
@ 2021-04-13  3:36 ` Longfang Liu
  2021-04-15 22:01   ` Jason Gunthorpe
  2021-04-13  3:36 ` [RFC PATCH 3/3] vfio/hisilicom: add debugfs for driver Longfang Liu
  2021-04-15 21:35 ` [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Alex Williamson
  3 siblings, 1 reply; 34+ messages in thread
From: Longfang Liu @ 2021-04-13  3:36 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, linux-kernel, linuxarm, liulongfang

Register the live migration driver of the accelerator module to vfio

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
 drivers/vfio/pci/vfio_pci_private.h |  9 +++++++++
 2 files changed, 20 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b..e1b0e37 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -407,6 +407,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
+		ret = vfio_pci_hisilicon_acc_init(vdev);
+		if (ret && ret != -ENODEV) {
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup Hisilicon ACC region\n");
+			vfio_pci_disable(vdev);
+			return ret;
+		}
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9cd1882..83c51be 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -214,6 +214,15 @@ static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 }
 #endif
 
+#ifdef CONFIG_VFIO_PCI_HISI_MIGRATION
+extern int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
+
 #ifdef CONFIG_S390
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				       struct vfio_info_cap *caps);
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 3/3] vfio/hisilicom: add debugfs for driver
  2021-04-13  3:36 [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Longfang Liu
  2021-04-13  3:36 ` [RFC PATCH 1/3] " Longfang Liu
  2021-04-13  3:36 ` [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio Longfang Liu
@ 2021-04-13  3:36 ` Longfang Liu
  2021-04-15 21:35 ` [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Alex Williamson
  3 siblings, 0 replies; 34+ messages in thread
From: Longfang Liu @ 2021-04-13  3:36 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, linux-kernel, linuxarm, liulongfang

Add a debugfs debugging interface to the live migration driver

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/hisilicon/acc_vf_migration.c | 193 ++++++++++++++++++++++++++
 drivers/vfio/pci/hisilicon/acc_vf_migration.h |   2 +
 2 files changed, 195 insertions(+)

diff --git a/drivers/vfio/pci/hisilicon/acc_vf_migration.c b/drivers/vfio/pci/hisilicon/acc_vf_migration.c
index 5d8650d..d4eacaf 100644
--- a/drivers/vfio/pci/hisilicon/acc_vf_migration.c
+++ b/drivers/vfio/pci/hisilicon/acc_vf_migration.c
@@ -16,6 +16,9 @@
 
 #define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
 void vfio_pci_hisilicon_acc_uninit(struct acc_vf_migration *acc_vf_dev);
+static void vf_debugfs_exit(struct acc_vf_migration *acc_vf_dev);
+static struct dentry *mig_debugfs_root;
+static int mig_root_ref;
 
 /* return 0 mailbox ready, -ETIMEDOUT hardware timeout */
 static int qm_wait_mb_ready(struct hisi_qm *qm)
@@ -933,6 +936,193 @@ static const struct vfio_pci_regops vfio_pci_acc_regops = {
 	.add_capability = acc_vf_migration_add_capability,
 };
 
+static ssize_t acc_vf_debug_read(struct file *filp, char __user *buffer,
+			   size_t count, loff_t *pos)
+{
+	char buf[VFIO_DEV_DBG_LEN];
+	int len;
+
+	len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s",
+			"echo 0: test vf data store\n"
+			"echo 1: test vf data writeback\n"
+			"echo 2: test vf send mailbox\n"
+			"echo 3: dump vf dev data\n"
+			"echo 4: dump migration state\n");
+
+	return simple_read_from_buffer(buffer, count, pos, buf, len);
+}
+
+static ssize_t acc_vf_debug_write(struct file *filp, const char __user *buffer,
+			    size_t count, loff_t *pos)
+{
+	struct acc_vf_migration *acc_vf_dev = filp->private_data;
+	struct device *dev = &acc_vf_dev->vf_dev->dev;
+	struct hisi_qm *qm = acc_vf_dev->vf_qm;
+	char tbuf[VFIO_DEV_DBG_LEN];
+	unsigned long val;
+	u64 data;
+	int len, ret;
+
+	if (*pos)
+		return 0;
+
+	if (count >= VFIO_DEV_DBG_LEN)
+		return -ENOSPC;
+
+	len = simple_write_to_buffer(tbuf, VFIO_DEV_DBG_LEN - 1,
+					pos, buffer, count);
+	if (len < 0)
+		return len;
+	tbuf[len] = '\0';
+	if (kstrtoul(tbuf, 0, &val))
+		return -EINVAL;
+
+	switch (val) {
+	case STATE_SAVE:
+		ret = vf_qm_state_save(qm, acc_vf_dev);
+		if (ret)
+			return -EINVAL;
+		break;
+	case STATE_RESUME:
+		ret = vf_qm_state_resume(qm, acc_vf_dev);
+		if (ret)
+			return -EINVAL;
+		break;
+	case MB_TEST:
+		data = readl(qm->io_base + QM_MB_CMD_SEND_BASE);
+		dev_info(dev, "debug mailbox addr: 0x%lx, mailbox val: 0x%llx\n",
+			 (uintptr_t)qm->io_base, data);
+		break;
+	case MIG_DATA_DUMP:
+		dev_info(dev, "dumped vf migration data:\n");
+		print_hex_dump(KERN_INFO, "Mig Data:", DUMP_PREFIX_OFFSET,
+				VFIO_DBG_LOG_LEN, 1,
+				(unsigned char *)acc_vf_dev->vf_data,
+				sizeof(struct acc_vf_data), false);
+		break;
+	case MIG_DEV_SHOW:
+		if (!acc_vf_dev->mig_ctl)
+			dev_info(dev, "migration region has been released!\n");
+		else
+			dev_info(dev,
+				 "device  state: %u\n"
+				 "pending bytes: %llu\n"
+				 "data   offset: %llu\n"
+				 "data     size: %llu\n"
+				 "data     addr: 0x%lx\n",
+				 acc_vf_dev->mig_ctl->device_state,
+				 acc_vf_dev->mig_ctl->pending_bytes,
+				 acc_vf_dev->mig_ctl->data_offset,
+				 acc_vf_dev->mig_ctl->data_size,
+				 (uintptr_t)acc_vf_dev->vf_data);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+static const struct file_operations acc_vf_debug_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.read = acc_vf_debug_read,
+	.write = acc_vf_debug_write,
+};
+
+static ssize_t acc_vf_state_read(struct file *filp, char __user *buffer,
+			   size_t count, loff_t *pos)
+{
+	struct acc_vf_migration *acc_vf_dev = filp->private_data;
+	char buf[VFIO_DEV_DBG_LEN];
+	u32 state;
+	int len;
+
+	if (!acc_vf_dev->mig_ctl) {
+		len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s\n", "Invalid");
+	} else {
+		state = acc_vf_dev->mig_ctl->device_state;
+		switch (state) {
+		case VFIO_DEVICE_STATE_RUNNING:
+			len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s\n",
+				"RUNNING");
+			break;
+		case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+			len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s\n",
+				"SAVING and RUNNING");
+			break;
+		case VFIO_DEVICE_STATE_SAVING:
+			len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s\n",
+				"SAVING");
+			break;
+		case VFIO_DEVICE_STATE_STOP:
+			len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s\n",
+				"STOP");
+			break;
+		case VFIO_DEVICE_STATE_RESUMING:
+			len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s\n",
+				"RESUMING");
+			break;
+		default:
+			len = scnprintf(buf, VFIO_DEV_DBG_LEN, "%s\n",
+				"Error");
+		}
+	}
+
+	return simple_read_from_buffer(buffer, count, pos, buf, len);
+}
+
+static const struct file_operations acc_vf_state_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.read = acc_vf_state_read,
+};
+
+static void vf_debugfs_init(struct acc_vf_migration *acc_vf_dev)
+{
+	char name[VFIO_DEV_DBG_LEN];
+	int node_id;
+
+	mutex_lock(&acc_vf_dev->reflock);
+	if (!mig_root_ref++)
+		mig_debugfs_root = debugfs_create_dir("vfio_acc", NULL);
+	mutex_unlock(&acc_vf_dev->reflock);
+
+	node_id = dev_to_node(&acc_vf_dev->vf_dev->dev);
+	if (node_id < 0)
+		node_id = 0;
+
+	if (acc_vf_dev->acc_type == HISI_SEC)
+		scnprintf(name, VFIO_DEV_DBG_LEN, "sec_vf%d-%d",
+			  node_id, acc_vf_dev->vf_id);
+	else if (acc_vf_dev->acc_type == HISI_HPRE)
+		scnprintf(name, VFIO_DEV_DBG_LEN, "hpre_vf%d-%d",
+			  node_id, acc_vf_dev->vf_id);
+	else
+		scnprintf(name, VFIO_DEV_DBG_LEN, "zip_vf%d-%d",
+			  node_id, acc_vf_dev->vf_id);
+
+	acc_vf_dev->debug_root = debugfs_create_dir(name, mig_debugfs_root);
+
+	debugfs_create_file("debug", 0644, acc_vf_dev->debug_root,
+			      acc_vf_dev, &acc_vf_debug_fops);
+	debugfs_create_file("state", 0444, acc_vf_dev->debug_root,
+			      acc_vf_dev, &acc_vf_state_fops);
+}
+
+static void vf_debugfs_exit(struct acc_vf_migration *acc_vf_dev)
+{
+	debugfs_remove_recursive(acc_vf_dev->debug_root);
+
+	mutex_lock(&acc_vf_dev->reflock);
+	if (!--mig_root_ref)
+		debugfs_remove_recursive(mig_debugfs_root);
+	mutex_unlock(&acc_vf_dev->reflock);
+}
+
 static int qm_acc_type_init(struct acc_vf_migration *acc_vf_dev)
 {
 	struct pci_dev *pdev = acc_vf_dev->vf_dev;
@@ -1117,6 +1307,8 @@ int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev)
 		goto register_error;
 	}
 
+	vf_debugfs_init(acc_vf_dev);
+
 	return 0;
 
 register_error:
@@ -1139,6 +1331,7 @@ void vfio_pci_hisilicon_acc_uninit(struct acc_vf_migration *acc_vf_dev)
 	acc_vf_dev->regions = NULL;
 	acc_vf_dev->num_regions = 0;
 
+	vf_debugfs_exit(acc_vf_dev);
 	kfree(acc_vf_dev);
 }
 
diff --git a/drivers/vfio/pci/hisilicon/acc_vf_migration.h b/drivers/vfio/pci/hisilicon/acc_vf_migration.h
index e8b116be..0a5438f 100644
--- a/drivers/vfio/pci/hisilicon/acc_vf_migration.h
+++ b/drivers/vfio/pci/hisilicon/acc_vf_migration.h
@@ -157,11 +157,13 @@ struct acc_vf_migration {
 	struct hisi_qm			*vf_qm;
 	int				vf_id;
 	u8				acc_type;
+	struct mutex			reflock;
 
 	struct vfio_device_migration_info *mig_ctl;
 	struct acc_vf_data		*vf_data;
 	struct vfio_pci_region		*regions;
 	int				num_regions;
+	struct dentry			*debug_root;
 };
 
 #endif /* ACC_MIG_H */
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver
  2021-04-13  3:36 [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Longfang Liu
                   ` (2 preceding siblings ...)
  2021-04-13  3:36 ` [RFC PATCH 3/3] vfio/hisilicom: add debugfs for driver Longfang Liu
@ 2021-04-15 21:35 ` Alex Williamson
  3 siblings, 0 replies; 34+ messages in thread
From: Alex Williamson @ 2021-04-15 21:35 UTC (permalink / raw)
  To: Longfang Liu
  Cc: cohuck, linux-kernel, linuxarm, Tarun Gupta, Neo Jia,
	Max Gurtovoy, Jason Gunthorpe

[Cc+ NVIDIA folks both from migration and vfio-pci-core discussion]

On Tue, 13 Apr 2021 11:36:20 +0800
Longfang Liu <liulongfang@huawei.com> wrote:

> The live migration solution relies on the vfio_device_migration_info protocol.
> The structure vfio_device_migration_info is placed at the 0th offset of
> the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
> migration information. Field accesses from this structure are only supported
> at their native width and alignment. Otherwise, the result is undefined and
> vendor drivers should return an error.
> 
> (1).The driver framework is based on vfio_pci_register_dev_region() of vfio-pci,
> and then a new live migration region is added, and the live migration is
> realized through the ops of this region.
> 
> (2).In order to ensure the compatibility of the devices before and after the
> migration, the device compatibility information check will be performed in
> the Pre-copy stage. If the check fails, an error will be returned and the
> source VM will exit the migration function.
> 
> (3).After the compatibility check is passed, it will enter the Stop-and-copy
> stage. At this time, all the live migration data will be copied, and then
> saved to the VF device of the destination, and then the VF device of the
> destination will be started and the VM of the source will be exited.
> 
> Longfang Liu (3):
>   vfio/hisilicon: add acc live migration driver
>   vfio/hisilicon: register the driver to vfio
>   vfio/hisilicom: add debugfs for driver
> 
>  drivers/vfio/pci/Kconfig                      |    8 +
>  drivers/vfio/pci/Makefile                     |    1 +
>  drivers/vfio/pci/hisilicon/acc_vf_migration.c | 1337 +++++++++++++++++++++++++
>  drivers/vfio/pci/hisilicon/acc_vf_migration.h |  170 ++++
>  drivers/vfio/pci/vfio_pci.c                   |   11 +
>  drivers/vfio/pci/vfio_pci_private.h           |    9 +
>  6 files changed, 1536 insertions(+)
>  create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.c
>  create mode 100644 drivers/vfio/pci/hisilicon/acc_vf_migration.h
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-13  3:36 ` [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio Longfang Liu
@ 2021-04-15 22:01   ` Jason Gunthorpe
  2021-04-19 12:24     ` liulongfang
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2021-04-15 22:01 UTC (permalink / raw)
  To: Longfang Liu; +Cc: alex.williamson, cohuck, linux-kernel, linuxarm

On Tue, Apr 13, 2021 at 11:36:22AM +0800, Longfang Liu wrote:
> Register the live migration driver of the accelerator module to vfio
> 
> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
>  drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |  9 +++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b..e1b0e37 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -407,6 +407,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
> +		ret = vfio_pci_hisilicon_acc_init(vdev);

This is exactly what we want to avoid with the work we are doing on
the vfio-pci modularity.

It is a complete mess to just keep adding more stuff here, and as
we've discussed to death really ugly to have such a ridiculously wide
ID match.

You should be working with us on that project and base your work on
top of Max's series.. Alex given the interest here I think we should
find some way forward while we work on completed version of the mlx5
pci vfio driver..

I'm also confused how this works securely at all, as a general rule a
VFIO PCI driver cannot access the MMIO memory of the function it is
planning to assign to the guest. There is a lot of danger that the
guest could access that MMIO space one way or another.

Here I see the driver obtaining a devm_ioremap() of the same pdev it
is going to assign (and I really wonder why pci_release_mem_regions()
exists at all..)

This is why the mlx5 RFC was showing how to juggle two PCI devices via
the aux device connector.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-15 22:01   ` Jason Gunthorpe
@ 2021-04-19 12:24     ` liulongfang
  2021-04-19 12:33       ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: liulongfang @ 2021-04-19 12:24 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: alex.williamson, cohuck, linux-kernel, linuxarm

On 2021/4/16 6:01, Jason Gunthorpe wrote:
> On Tue, Apr 13, 2021 at 11:36:22AM +0800, Longfang Liu wrote:
>> Register the live migration driver of the accelerator module to vfio
>>
>> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
>>  drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
>>  drivers/vfio/pci/vfio_pci_private.h |  9 +++++++++
>>  2 files changed, 20 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 65e7e6b..e1b0e37 100644
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -407,6 +407,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  		}
>>  	}
>>  
>> +	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
>> +		ret = vfio_pci_hisilicon_acc_init(vdev);
> 
> This is exactly what we want to avoid with the work we are doing on
> the vfio-pci modularity.
> 
> It is a complete mess to just keep adding more stuff here, and as
> we've discussed to death really ugly to have such a ridiculously wide
> ID match.
> 
> You should be working with us on that project and base your work on
> top of Max's series.. Alex given the interest here I think we should
> find some way forward while we work on completed version of the mlx5
> pci vfio driver..
> 
> I'm also confused how this works securely at all, as a general rule a
> VFIO PCI driver cannot access the MMIO memory of the function it is
> planning to assign to the guest. There is a lot of danger that the
> guest could access that MMIO space one way or another.
> 
VF's MMIO memory is divided into two parts, one is the guest part,
and the other is the live migration part. They do not affect each other,
so there is no security problem.

> Here I see the driver obtaining a devm_ioremap() of the same pdev it
> is going to assign (and I really wonder why pci_release_mem_regions()
> exists at all..)
> 
The driver here obtains the VF's MMIO memory address through devm_ioremap(),
and pci_release_mem_regions() is added so that the guest can still claim
the MMIO region after this driver has mapped it.

If pci_release_mem_regions() is not added here,
The guests using pci_request_mem_regions() will return an error.
Then, the guest will not be able to obtain the MMIO address of the VF.

> This is why the mlx5 RFC was showing how to juggle two PCI devices via
> the aux device connector.
> 
I also noticed the mlx5 RFC, but it will take some time to analyze it. If the
analysis goes well, I will adopt the mlx5 RFC approach in our driver and use
it to test and verify our live migration function.

If we use the mlx5 RFC approach, what should we pay attention to?

> Jason
> .
> 
Thanks.
Longfang.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-19 12:24     ` liulongfang
@ 2021-04-19 12:33       ` Jason Gunthorpe
  2021-04-20 12:50         ` liulongfang
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2021-04-19 12:33 UTC (permalink / raw)
  To: liulongfang; +Cc: alex.williamson, cohuck, linux-kernel, linuxarm

On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:

> > I'm also confused how this works securely at all, as a general rule a
> > VFIO PCI driver cannot access the MMIO memory of the function it is
> > planning to assign to the guest. There is a lot of danger that the
> > guest could access that MMIO space one way or another.
> 
> VF's MMIO memory is divided into two parts, one is the guest part,
> and the other is the live migration part. They do not affect each other,
> so there is no security problem.

AFAIK there are several scenarios where a guest can access this MMIO
memory using DMA even if it is not mapped into the guest for CPU
access.

> If pci_release_mem_regions() is not added here,
> The guests using pci_request_mem_regions() will return an error.
> Then, the guest will not be able to obtain the MMIO address of the VF.

Which is why VFIO has this protection to prevent sharing MMIO regions
on the VF assigned to the guest

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-19 12:33       ` Jason Gunthorpe
@ 2021-04-20 12:50         ` liulongfang
  2021-04-20 12:59           ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: liulongfang @ 2021-04-20 12:50 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: alex.williamson, cohuck, linux-kernel, linuxarm

On 2021/4/19 20:33, Jason Gunthorpe wrote:
> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> 
>>> I'm also confused how this works securely at all, as a general rule a
>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>> planning to assign to the guest. There is a lot of danger that the
>>> guest could access that MMIO space one way or another.
>>
>> VF's MMIO memory is divided into two parts, one is the guest part,
>> and the other is the live migration part. They do not affect each other,
>> so there is no security problem.
> 
> AFAIK there are several scenarios where a guest can access this MMIO
> memory using DMA even if it is not mapped into the guest for CPU
> access.
> 
The hardware divides VF's MMIO memory into two parts. The live migration
driver in the host uses the live migration part, and the device driver in
the guest uses the guest part. They obtain the address of VF's MMIO memory
in their respective drivers, although these two parts The memory is
continuous on the hardware device, but due to the needs of the drive function,
they will not perform operations on another part of the memory, and the
device hardware also independently responds to the operation commands of
the two parts.
So, I still don't understand what the security risk you are talking about is,
and what do you think the security design should look like?
Can you elaborate on it?

>> If pci_release_mem_regions() is not added here,
>> The guests using pci_request_mem_regions() will return an error.
>> Then, the guest will not be able to obtain the MMIO address of the VF.
> 
> Which is why VFIO has this protection to prevent sharing MMIO regions
> on the VF assigned to the guest
> 
> Jason
> .
> 

Thanks.
Longfang.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-20 12:50         ` liulongfang
@ 2021-04-20 12:59           ` Jason Gunthorpe
  2021-04-20 13:28             ` liulongfang
  2021-04-20 22:04             ` Alex Williamson
  0 siblings, 2 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2021-04-20 12:59 UTC (permalink / raw)
  To: liulongfang; +Cc: alex.williamson, cohuck, linux-kernel, linuxarm

On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
> On 2021/4/19 20:33, Jason Gunthorpe wrote:
> > On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> > 
> >>> I'm also confused how this works securely at all, as a general rule a
> >>> VFIO PCI driver cannot access the MMIO memory of the function it is
> >>> planning to assign to the guest. There is a lot of danger that the
> >>> guest could access that MMIO space one way or another.
> >>
> >> VF's MMIO memory is divided into two parts, one is the guest part,
> >> and the other is the live migration part. They do not affect each other,
> >> so there is no security problem.
> > 
> > AFAIK there are several scenarios where a guest can access this MMIO
> > memory using DMA even if it is not mapped into the guest for CPU
> > access.
> > 
> The hardware divides VF's MMIO memory into two parts. The live migration
> driver in the host uses the live migration part, and the device driver in
> the guest uses the guest part. They obtain the address of VF's MMIO memory
> in their respective drivers, although these two parts The memory is
> continuous on the hardware device, but due to the needs of the drive function,
> they will not perform operations on another part of the memory, and the
> device hardware also independently responds to the operation commands of
> the two parts.

It doesn't matter, the memory is still under the same PCI BDF and VFIO
supports scenarios where devices in the same IOMMU group are not
isolated from each other.

This is why the granual of isolation is a PCI BDF - VFIO directly
blocks kernel drivers from attaching to PCI BDFs that are not
completely isolated from VFIO BDF.

Bypassing this prevention and attaching a kernel driver directly to
the same BDF being exposed to the guest breaks that isolation model.

> So, I still don't understand what the security risk you are talking about is,
> and what do you think the security design should look like?
> Can you elaborate on it?

Each security domain must have its own PCI BDF.

The migration control registers must be on a different VF from the VF
being plugged into a guest and the two VFs have to be in different
IOMMU groups to ensure they are isolated from each other.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-20 12:59           ` Jason Gunthorpe
@ 2021-04-20 13:28             ` liulongfang
  2021-04-20 14:55               ` Jason Gunthorpe
  2021-04-20 22:04             ` Alex Williamson
  1 sibling, 1 reply; 34+ messages in thread
From: liulongfang @ 2021-04-20 13:28 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: alex.williamson, cohuck, linux-kernel, linuxarm

On 2021/4/20 20:59, Jason Gunthorpe wrote:
> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
>> On 2021/4/19 20:33, Jason Gunthorpe wrote:
>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
>>>
>>>>> I'm also confused how this works securely at all, as a general rule a
>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>>>> planning to assign to the guest. There is a lot of danger that the
>>>>> guest could access that MMIO space one way or another.
>>>>
>>>> VF's MMIO memory is divided into two parts, one is the guest part,
>>>> and the other is the live migration part. They do not affect each other,
>>>> so there is no security problem.
>>>
>>> AFAIK there are several scenarios where a guest can access this MMIO
>>> memory using DMA even if it is not mapped into the guest for CPU
>>> access.
>>>
>> The hardware divides VF's MMIO memory into two parts. The live migration
>> driver in the host uses the live migration part, and the device driver in
>> the guest uses the guest part. They obtain the address of VF's MMIO memory
>> in their respective drivers, although these two parts The memory is
>> continuous on the hardware device, but due to the needs of the drive function,
>> they will not perform operations on another part of the memory, and the
>> device hardware also independently responds to the operation commands of
>> the two parts.
> 
> It doesn't matter, the memory is still under the same PCI BDF and VFIO
> supports scenarios where devices in the same IOMMU group are not
> isolated from each other.
> 
> This is why the granual of isolation is a PCI BDF - VFIO directly
> blocks kernel drivers from attaching to PCI BDFs that are not
> completely isolated from VFIO BDF.
> 
> Bypassing this prevention and attaching a kernel driver directly to
> the same BDF being exposed to the guest breaks that isolation model.
> 
>> So, I still don't understand what the security risk you are talking about is,
>> and what do you think the security design should look like?
>> Can you elaborate on it?
> 
> Each security domain must have its own PCI BDF.
> 
The basic unit that performs the live migration function is the VF, and the
basic function of the VF is the business function of the device. If the live
migration function and the business function were completely separated and
bound to separate VFs, then a VF that runs the business function in the
guest could not perform live migration, and a VF with the live migration
function could not run the business function in the guest.

In fact, the scenario that requires live migration is migrating a VF's
business functions to VFs on other hosts when its business load is heavy.
This usage scenario itself requires the VF to have both functions at the
same time.

> The migration control registers must be on a different VF from the VF
> being plugged into a guest and the two VFs have to be in different
> IOMMU groups to ensure they are isolated from each other.
> 
> Jason
> .
> 

Thanks.
Longfang.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-20 13:28             ` liulongfang
@ 2021-04-20 14:55               ` Jason Gunthorpe
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2021-04-20 14:55 UTC (permalink / raw)
  To: liulongfang; +Cc: alex.williamson, cohuck, linux-kernel, linuxarm

On Tue, Apr 20, 2021 at 09:28:46PM +0800, liulongfang wrote:
> >> So, I still don't understand what the security risk you are talking about is,
> >> and what do you think the security design should look like?
> >> Can you elaborate on it?
> > 
> > Each security domain must have its own PCI BDF.
> > 
> The basic unit that performs the live migration function is the VF, and the
> basic function of the VF is the business function of the device. If the live
> migration function and the business function were completely separated and
> bound to separate VFs, then a VF that runs the business function in the
> guest could not perform live migration, and a VF with the live migration
> function could not run the business function in the guest.
> 
> In fact, the scenario that requires live migration is migrating a VF's
> business functions to VFs on other hosts when its business load is heavy.
> This usage scenario itself requires the VF to have both functions at the
> same time.

Other vendors are managing; it is not an unsolvable problem.

I think these patches can't advance in any form without somehow
addressing the isolation issue.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-20 12:59           ` Jason Gunthorpe
  2021-04-20 13:28             ` liulongfang
@ 2021-04-20 22:04             ` Alex Williamson
  2021-04-20 23:18               ` Jason Gunthorpe
  2021-04-21  9:59               ` liulongfang
  1 sibling, 2 replies; 34+ messages in thread
From: Alex Williamson @ 2021-04-20 22:04 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: liulongfang, cohuck, linux-kernel, linuxarm

On Tue, 20 Apr 2021 09:59:57 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
> > On 2021/4/19 20:33, Jason Gunthorpe wrote:  
> > > On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> > >   
> > >>> I'm also confused how this works securely at all, as a general rule a
> > >>> VFIO PCI driver cannot access the MMIO memory of the function it is
> > >>> planning to assign to the guest. There is a lot of danger that the
> > >>> guest could access that MMIO space one way or another.  
> > >>
> > >> VF's MMIO memory is divided into two parts, one is the guest part,
> > >> and the other is the live migration part. They do not affect each other,
> > >> so there is no security problem.  
> > > 
> > > AFAIK there are several scenarios where a guest can access this MMIO
> > > memory using DMA even if it is not mapped into the guest for CPU
> > > access.
> > >   
> > The hardware divides the VF's MMIO memory into two parts: the live migration
> > driver in the host uses the live migration part, and the device driver in
> > the guest uses the guest part. Each driver obtains the address of its own
> > part of the VF's MMIO memory. Although the two parts are contiguous on the
> > hardware device, by design neither driver operates on the other part of
> > the memory, and the device hardware also responds independently to the
> > operation commands of the two parts.  
> 
> It doesn't matter, the memory is still under the same PCI BDF and VFIO
> supports scenarios where devices in the same IOMMU group are not
> isolated from each other.
> 
> This is why the granularity of isolation is a PCI BDF - VFIO directly
> blocks kernel drivers from attaching to PCI BDFs that are not
> completely isolated from VFIO BDF.
> 
> Bypassing this prevention and attaching a kernel driver directly to
> the same BDF being exposed to the guest breaks that isolation model.
> 
> > So, I still don't understand what the security risk you are talking about is,
> > and what do you think the security design should look like?
> > Can you elaborate on it?  
> 
> Each security domain must have its own PCI BDF.
> 
> The migration control registers must be on a different VF from the VF
> being plugged into a guest and the two VFs have to be in different
> IOMMU groups to ensure they are isolated from each other.

I think that's a solution, I don't know if it's the only solution.
AIUI, the issue here is that we have a device specific kernel driver
extending vfio-pci with migration support for this device by using an
MMIO region of the same device.  This is susceptible to DMA
manipulation by the user device.   Whether that's a security issue or
not depends on how the user can break the device.  If the scope is
limited to breaking their own device, they can do that any number of
ways and it's not very interesting.  If the user can manipulate device
state in order to trigger an exploit of the host-side kernel driver,
that's obviously more of a problem.

The other side of this is that if migration support can be implemented
entirely within the VF using this portion of the device MMIO space, why
do we need the host kernel to support this rather than implementing it
in userspace?  For example, QEMU could know about this device,
manipulate the BAR size to expose only the operational portion of MMIO
to the VM and use the remainder to support migration itself.  I'm
afraid that just like mdev, the vfio migration uAPI is going to be used
as an excuse to create kernel drivers simply to be able to make use of
that uAPI.  I haven't looked at this driver to know if it has some
other reason to exist beyond what could be done through vfio-pci and
userspace migration support.  Thanks,

Alex
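Alex's suggestion above, a device-aware QEMU extension claiming the
migration portion of the device and mimicking the migration-region protocol
in userspace, could be sketched roughly as below. Only the device_state
word, which the protocol places at offset 0 of the region, is modeled; the
class and method names are illustrative and the bytearray stands in for an
mmap() of the real BAR, so this is a sketch of the idea, not a real QEMU or
VFIO interface.

```python
import struct

# Per the migration protocol, vfio_device_migration_info sits at the 0th
# offset of the region and device_state is its first field.
DEVICE_STATE_OFFSET = 0x0


class UserspaceMigrationRegion:
    """Minimal stand-in for a userspace driver that owns the migration BAR
    and re-exposes the migration-region protocol to the VMM, instead of a
    kernel driver re-using the vfio uAPI."""

    def __init__(self, bar_size: int):
        # A real driver would mmap() the BAR found via
        # VFIO_DEVICE_GET_REGION_INFO; a plain buffer stands in here.
        self.bar = bytearray(bar_size)

    def read_device_state(self) -> int:
        (state,) = struct.unpack_from("<I", self.bar, DEVICE_STATE_OFFSET)
        return state

    def write_device_state(self, state: int) -> None:
        struct.pack_into("<I", self.bar, DEVICE_STATE_OFFSET, state)
```

The point of the sketch is that nothing here needs kernel privileges: the
state word is ordinary memory access once the BAR is mapped.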


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-20 22:04             ` Alex Williamson
@ 2021-04-20 23:18               ` Jason Gunthorpe
  2021-04-21  9:59                 ` liulongfang
  2021-04-21  9:59               ` liulongfang
  1 sibling, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2021-04-20 23:18 UTC (permalink / raw)
  To: Alex Williamson; +Cc: liulongfang, cohuck, linux-kernel, linuxarm

On Tue, Apr 20, 2021 at 04:04:57PM -0600, Alex Williamson wrote:

> > The migration control registers must be on a different VF from the VF
> > being plugged into a guest and the two VFs have to be in different
> > IOMMU groups to ensure they are isolated from each other.
> 
> I think that's a solution, I don't know if it's the only solution.

Maybe, but that approach does offer DMA access for the migration. For
instance to implement something that needs a lot of data like
migrating a complicated device state, or dirty page tracking or
whatever.

This driver seems very simple - it has only 17 state elements - and
doesn't use DMA.

I can't quite tell, but does this pass the hypervisor BAR into the
guest anyhow? That would certainly be an adequate statement that it is
safe, assuming someone did a good security analysis.

> ways and it's not very interesting.  If the user can manipulate device
> state in order to trigger an exploit of the host-side kernel driver,
> that's obviously more of a problem.

Well, for instance, we have an implementation of
(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING) which means the
guest CPUs are still running and a hostile guest can be manipulating
the device.
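The state combination referred to here can be written out with the v1
migration uAPI bit values. A small sketch, using the device_state bits from
the (since-removed) v1 definitions in <linux/vfio.h>; the helper names are
mine, not part of any uAPI:

```python
# device_state bits of vfio_device_migration_info in the v1 migration uAPI.
VFIO_DEVICE_STATE_STOP = 0
VFIO_DEVICE_STATE_RUNNING = 1 << 0
VFIO_DEVICE_STATE_SAVING = 1 << 1
VFIO_DEVICE_STATE_RESUMING = 1 << 2

PRE_COPY = VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING  # 0x3


def is_pre_copy(state: int) -> bool:
    """Pre-copy: the device is saving while guest CPUs still run, so a
    hostile guest can keep manipulating the device during migration."""
    return state & PRE_COPY == PRE_COPY


def is_stop_and_copy(state: int) -> bool:
    """Stop-and-copy: saving with the guest stopped."""
    return bool(state & VFIO_DEVICE_STATE_SAVING) and not (
        state & VFIO_DEVICE_STATE_RUNNING
    )
```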

But this driver is running code, like vf_qm_state_pre_save() in this
state. Looks very suspicious.

One quick attack I can imagine is to use the guest CPU to DOS the
migration and permanently block it, eg by causing qm_mb() or other
looping functions to fail.

There may be worse things possible, it is a bit hard to tell just from
the code.

.. also drivers should not be open coding ARM assembly as in
qm_mb_write()

.. and also, code cannot randomly call pci_get_drvdata() on a struct
device it isn't attached to without having verified that the right
driver is bound, or without correct locking.

> manipulate the BAR size to expose only the operational portion of MMIO
> to the VM and use the remainder to support migration itself.  I'm
> afraid that just like mdev, the vfio migration uAPI is going to be used
> as an excuse to create kernel drivers simply to be able to make use of
> that uAPI.

I thought that is the general direction people had agreed on during
the IDXD mdev discussion?

People want the IOCTLs from VFIO to be the single API to program all
the VMMs to and to not implement user space drivers..

This actually seems like a great candidate for a userspace driver.

I would like to know we are still settled on this direction as the
mlx5 drivers we are working on also have some complicated option to be
user space only.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-20 22:04             ` Alex Williamson
  2021-04-20 23:18               ` Jason Gunthorpe
@ 2021-04-21  9:59               ` liulongfang
  2021-04-21 18:12                 ` Alex Williamson
  1 sibling, 1 reply; 34+ messages in thread
From: liulongfang @ 2021-04-21  9:59 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe; +Cc: cohuck, linux-kernel, linuxarm

On 2021/4/21 6:04, Alex Williamson wrote:
> On Tue, 20 Apr 2021 09:59:57 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
>>> On 2021/4/19 20:33, Jason Gunthorpe wrote:  
>>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
>>>>   
>>>>>> I'm also confused how this works securely at all, as a general rule a
>>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>>>>> planning to assign to the guest. There is a lot of danger that the
>>>>>> guest could access that MMIO space one way or another.  
>>>>>
>>>>> VF's MMIO memory is divided into two parts, one is the guest part,
>>>>> and the other is the live migration part. They do not affect each other,
>>>>> so there is no security problem.  
>>>>
>>>> AFAIK there are several scenarios where a guest can access this MMIO
>>>> memory using DMA even if it is not mapped into the guest for CPU
>>>> access.
>>>>   
>>> The hardware divides the VF's MMIO memory into two parts: the live migration
>>> driver in the host uses the live migration part, and the device driver in
>>> the guest uses the guest part. Each driver obtains the address of its own
>>> part of the VF's MMIO memory. Although the two parts are contiguous on the
>>> hardware device, by design neither driver operates on the other part of
>>> the memory, and the device hardware also responds independently to the
>>> operation commands of the two parts.  
>>
>> It doesn't matter, the memory is still under the same PCI BDF and VFIO
>> supports scenarios where devices in the same IOMMU group are not
>> isolated from each other.
>>
>> This is why the granularity of isolation is a PCI BDF - VFIO directly
>> blocks kernel drivers from attaching to PCI BDFs that are not
>> completely isolated from VFIO BDF.
>>
>> Bypassing this prevention and attaching a kernel driver directly to
>> the same BDF being exposed to the guest breaks that isolation model.
>>
>>> So, I still don't understand what the security risk you are talking about is,
>>> and what do you think the security design should look like?
>>> Can you elaborate on it?  
>>
>> Each security domain must have its own PCI BDF.
>>
>> The migration control registers must be on a different VF from the VF
>> being plugged into a guest and the two VFs have to be in different
>> IOMMU groups to ensure they are isolated from each other.
> 
> I think that's a solution, I don't know if it's the only solution.
> AIUI, the issue here is that we have a device specific kernel driver
> extending vfio-pci with migration support for this device by using an

If the two parts of the MMIO region are split into different BAR spaces on
the device, the MMIO region of the business function stays in BAR2, and the
MMIO region of the live migration function moves to BAR4. Only BAR2 is
mapped in the guest, and only BAR4 is mapped in the host.
This would solve this security issue.

> MMIO region of the same device.  This is susceptible to DMA
> manipulation by the user device.  Whether that's a security issue or
> not depends on how the user can break the device.  If the scope is
> limited to breaking their own device, they can do that any number of
> ways and it's not very interesting.  If the user can manipulate device
> state in order to trigger an exploit of the host-side kernel driver,
> that's obviously more of a problem.
> 
> The other side of this is that if migration support can be implemented
> entirely within the VF using this portion of the device MMIO space, why
> do we need the host kernel to support this rather than implementing it
> in userspace?  For example, QEMU could know about this device,
> manipulate the BAR size to expose only the operational portion of MMIO
> to the VM and use the remainder to support migration itself.  I'm
> afraid that just like mdev, the vfio migration uAPI is going to be used
> as an excuse to create kernel drivers simply to be able to make use of
> that uAPI.  I haven't looked at this driver to know if it has some

When the accelerator device was designed to support the live migration
function, the design was based on the uAPI of the migration region, so the
live migration function requires a driver that connects to this uAPI.
Is this set of interfaces not open to us now?

> other reason to exist beyond what could be done through vfio-pci and
> userspace migration support.  Thanks,
> 
> Alex
> 
> .
> 
Thanks,
Longfang.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-20 23:18               ` Jason Gunthorpe
@ 2021-04-21  9:59                 ` liulongfang
  0 siblings, 0 replies; 34+ messages in thread
From: liulongfang @ 2021-04-21  9:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson; +Cc: cohuck, linux-kernel, linuxarm

On 2021/4/21 7:18, Jason Gunthorpe wrote:
> On Tue, Apr 20, 2021 at 04:04:57PM -0600, Alex Williamson wrote:
> 
>>> The migration control registers must be on a different VF from the VF
>>> being plugged into a guest and the two VFs have to be in different
>>> IOMMU groups to ensure they are isolated from each other.
>>
>> I think that's a solution, I don't know if it's the only solution.
> 
> Maybe, but that approach does offer DMA access for the migration. For
> instance to implement something that needs a lot of data like
> migrating a complicated device state, or dirty page tracking or
> whatever.
> 
> This driver seems very simple - it has only 17 state elements - and
> doesn't use DMA.
> 
Yes, this driver operates on MMIO addresses, not DMA addresses, but the
device's internal DMA addresses are included in the data that is
migrated.

> I can't quite tell, but does this pass the hypervisor BAR into the
> guest anyhow? That would certainly be an adequate statement that it is
> safe, assuming someone did a good security analysis.
> 
>> ways and it's not very interesting.  If the user can manipulate device
>> state in order to trigger an exploit of the host-side kernel driver,
>> that's obviously more of a problem.
> 
> Well, for instance, we have an implementation of
> (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING) which means the
> guest CPUs are still running and a hostile guest can be manipulating
> the device.
> 
> But this driver is running code, like vf_qm_state_pre_save() in this
> state. Looks very suspicious.
> 
> One quick attack I can imagine is to use the guest CPU to DOS the
> migration and permanently block it, eg by causing qm_mb() or other
> looping functions to fail.
> 
> There may be worse things possible, it is a bit hard to tell just from
> the code.
> 
> .. also drivers should not be open coding ARM assembly as in
> qm_mb_write()
> 
OK, this code needs to be encapsulated and should not appear in this
driver.

> .. and also, code cannot randomly call pci_get_drvdata() on a struct
> device it isn't attached to without having verified that the right
> driver is bound, or without correct locking.
> 
Yes, this call needs to be placed behind an encapsulating interface,
and access protection needs to be added.

>> manipulate the BAR size to expose only the operational portion of MMIO
>> to the VM and use the remainder to support migration itself.  I'm
>> afraid that just like mdev, the vfio migration uAPI is going to be used
>> as an excuse to create kernel drivers simply to be able to make use of
>> that uAPI.
> 
> I thought that is the general direction people had agreed on during
> the IDXD mdev discussion?
> 
> People want the IOCTLs from VFIO to be the single API to program all
> the VMMs to and to not implement user space drivers..
> 
> This actually seems like a great candidate for a userspace driver.
> 
> I would like to know we are still settled on this direction as the
> mlx5 drivers we are working on also have some complicated option to be
> user space only.
> 
> Jason
> .
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-21  9:59               ` liulongfang
@ 2021-04-21 18:12                 ` Alex Williamson
  2021-04-26 11:49                   ` liulongfang
  2021-05-12  8:39                   ` liulongfang
  0 siblings, 2 replies; 34+ messages in thread
From: Alex Williamson @ 2021-04-21 18:12 UTC (permalink / raw)
  To: liulongfang; +Cc: Jason Gunthorpe, cohuck, linux-kernel, linuxarm

On Wed, 21 Apr 2021 17:59:02 +0800
liulongfang <liulongfang@huawei.com> wrote:

> On 2021/4/21 6:04, Alex Williamson wrote:
> > On Tue, 20 Apr 2021 09:59:57 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> >> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:  
> >>> On 2021/4/19 20:33, Jason Gunthorpe wrote:    
> >>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> >>>>     
> >>>>>> I'm also confused how this works securely at all, as a general rule a
> >>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
> >>>>>> planning to assign to the guest. There is a lot of danger that the
> >>>>>> guest could access that MMIO space one way or another.    
> >>>>>
> >>>>> VF's MMIO memory is divided into two parts, one is the guest part,
> >>>>> and the other is the live migration part. They do not affect each other,
> >>>>> so there is no security problem.    
> >>>>
> >>>> AFAIK there are several scenarios where a guest can access this MMIO
> >>>> memory using DMA even if it is not mapped into the guest for CPU
> >>>> access.
> >>>>     
> >>> The hardware divides the VF's MMIO memory into two parts: the live migration
> >>> driver in the host uses the live migration part, and the device driver in
> >>> the guest uses the guest part. Each driver obtains the address of its own
> >>> part of the VF's MMIO memory. Although the two parts are contiguous on the
> >>> hardware device, by design neither driver operates on the other part of
> >>> the memory, and the device hardware also responds independently to the
> >>> operation commands of the two parts.    
> >>
> >> It doesn't matter, the memory is still under the same PCI BDF and VFIO
> >> supports scenarios where devices in the same IOMMU group are not
> >> isolated from each other.
> >>
> >> This is why the granularity of isolation is a PCI BDF - VFIO directly
> >> blocks kernel drivers from attaching to PCI BDFs that are not
> >> completely isolated from VFIO BDF.
> >>
> >> Bypassing this prevention and attaching a kernel driver directly to
> >> the same BDF being exposed to the guest breaks that isolation model.
> >>  
> >>> So, I still don't understand what the security risk you are talking about is,
> >>> and what do you think the security design should look like?
> >>> Can you elaborate on it?    
> >>
> >> Each security domain must have its own PCI BDF.
> >>
> >> The migration control registers must be on a different VF from the VF
> >> being plugged into a guest and the two VFs have to be in different
> >> IOMMU groups to ensure they are isolated from each other.  
> > 
> > I think that's a solution, I don't know if it's the only solution.
> > AIUI, the issue here is that we have a device specific kernel driver
> > extending vfio-pci with migration support for this device by using an  
> 
> If the two parts of the MMIO region are split into different BAR spaces on
> the device, the MMIO region of the business function stays in BAR2, and the
> MMIO region of the live migration function moves to BAR4. Only BAR2 is
> mapped in the guest, and only BAR4 is mapped in the host.
> This would solve this security issue.

The concern is really the "on the device" part rather than whether the
resources are within the same BAR or not.  We need to assume that a
user driver can generate a DMA targeting any address in the system,
including in this case the user driver could generate a DMA targeting
this migration BAR.  Ideally this would be routed upstream to the IOMMU
where it would be blocked for lack of a translation entry.  However,
because this range resides on the same PCIe requester ID, it's
logically more similar to a two-function device where the functions are
considered non-isolated and are therefore exposed within the same IOMMU
group.  We would not allow a kernel driver for one of those functions
and a userspace driver for the other.  In this case those drivers are
strongly related, but we still need to consider to what extent a
malicious user driver can interfere with or exploit the kernel side
driver.
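The isolation rule described above can be checked from sysfs, where each
PCI device links to its IOMMU group. A rough sketch of that check follows;
the sysfs layout is real, while the helper names and example BDFs are
illustrative only:

```python
import os


def iommu_group(bdf, sysfs="/sys/bus/pci/devices"):
    """IOMMU group number for a PCI BDF (e.g. "0000:75:00.1"), or None if
    the device or its iommu_group link is not visible on this system."""
    try:
        return os.path.basename(os.readlink(os.path.join(sysfs, bdf, "iommu_group")))
    except OSError:
        return None


def isolated(bdf_a, bdf_b, lookup=iommu_group):
    """Two functions may be split between a kernel driver and a userspace
    driver only if they land in different IOMMU groups; two ranges behind
    the same PCIe requester ID never do."""
    group_a, group_b = lookup(bdf_a), lookup(bdf_b)
    return None not in (group_a, group_b) and group_a != group_b
```

On real hardware the same information is visible by hand under
/sys/kernel/iommu_groups/<n>/devices/.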


> > MMIO region of the same device.  This is susceptible to DMA
> > manipulation by the user device.  Whether that's a security issue or
> > not depends on how the user can break the device.  If the scope is
> > limited to breaking their own device, they can do that any number of
> > ways and it's not very interesting.  If the user can manipulate device
> > state in order to trigger an exploit of the host-side kernel driver,
> > that's obviously more of a problem.
> > 
> > The other side of this is that if migration support can be implemented
> > entirely within the VF using this portion of the device MMIO space, why
> > do we need the host kernel to support this rather than implementing it
> > in userspace?  For example, QEMU could know about this device,
> > manipulate the BAR size to expose only the operational portion of MMIO
> > to the VM and use the remainder to support migration itself.  I'm
> > afraid that just like mdev, the vfio migration uAPI is going to be used
> > as an excuse to create kernel drivers simply to be able to make use of
> > that uAPI.  I haven't looked at this driver to know if it has some  
> 
> When the accelerator device was designed to support the live migration
> function, the design was based on the uAPI of the migration region, so the
> live migration function requires a driver that connects to this uAPI.
> Is this set of interfaces not open to us now?

In your model, if both BARs are exposed to userspace and a device
specific extension in QEMU claims the migration BAR rather than
exposing it to the VM, could that driver mimic the migration region
uAPI from userspace?  For example, you don't need page pinning to
interact with the IOMMU, you don't need resources beyond the scope
of the endpoint device itself, and the migration info BAR is safe for
userspace to manage?  If so, then a kernel-based driver to expose a
migration uAPI seems like it's only a risk for the kernel, ie. moving
what could be a userspace driver into the kernel for the convenience of
re-using a kernel uAPI.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-21 18:12                 ` Alex Williamson
@ 2021-04-26 11:49                   ` liulongfang
  2021-05-12  8:39                   ` liulongfang
  1 sibling, 0 replies; 34+ messages in thread
From: liulongfang @ 2021-04-26 11:49 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jason Gunthorpe, cohuck, linux-kernel, linuxarm

On 2021/4/22 2:12, Alex Williamson wrote:
> On Wed, 21 Apr 2021 17:59:02 +0800
> liulongfang <liulongfang@huawei.com> wrote:
> 
>> On 2021/4/21 6:04, Alex Williamson wrote:
>>> On Tue, 20 Apr 2021 09:59:57 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>   
>>>> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:  
>>>>> On 2021/4/19 20:33, Jason Gunthorpe wrote:    
>>>>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
>>>>>>     
>>>>>>>> I'm also confused how this works securely at all, as a general rule a
>>>>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>>>>>>> planning to assign to the guest. There is a lot of danger that the
>>>>>>>> guest could access that MMIO space one way or another.    
>>>>>>>
>>>>>>> VF's MMIO memory is divided into two parts, one is the guest part,
>>>>>>> and the other is the live migration part. They do not affect each other,
>>>>>>> so there is no security problem.    
>>>>>>
>>>>>> AFAIK there are several scenarios where a guest can access this MMIO
>>>>>> memory using DMA even if it is not mapped into the guest for CPU
>>>>>> access.
>>>>>>     
>>>>> The hardware divides the VF's MMIO memory into two parts: the live migration
>>>>> driver in the host uses the live migration part, and the device driver in
>>>>> the guest uses the guest part. Each driver obtains the address of its own
>>>>> part of the VF's MMIO memory. Although the two parts are contiguous on the
>>>>> hardware device, by design neither driver operates on the other part of
>>>>> the memory, and the device hardware also responds independently to the
>>>>> operation commands of the two parts.    
>>>>
>>>> It doesn't matter, the memory is still under the same PCI BDF and VFIO
>>>> supports scenarios where devices in the same IOMMU group are not
>>>> isolated from each other.
>>>>
>>>> This is why the granularity of isolation is a PCI BDF - VFIO directly
>>>> blocks kernel drivers from attaching to PCI BDFs that are not
>>>> completely isolated from VFIO BDF.
>>>>
>>>> Bypassing this prevention and attaching a kernel driver directly to
>>>> the same BDF being exposed to the guest breaks that isolation model.
>>>>  
>>>>> So, I still don't understand what the security risk you are talking about is,
>>>>> and what do you think the security design should look like?
>>>>> Can you elaborate on it?    
>>>>
>>>> Each security domain must have its own PCI BDF.
>>>>
>>>> The migration control registers must be on a different VF from the VF
>>>> being plugged into a guest and the two VFs have to be in different
>>>> IOMMU groups to ensure they are isolated from each other.  
>>>
>>> I think that's a solution, I don't know if it's the only solution.
>>> AIUI, the issue here is that we have a device specific kernel driver
>>> extending vfio-pci with migration support for this device by using an  
>>
>> If the two parts of the MMIO region are split into different BAR spaces on
>> the device, the MMIO region of the business function stays in BAR2, and the
>> MMIO region of the live migration function moves to BAR4. Only BAR2 is
>> mapped in the guest, and only BAR4 is mapped in the host.
>> This would solve this security issue.
> 
> The concern is really the "on the device" part rather than whether the
> resources are within the same BAR or not.  We need to assume that a
> user driver can generate a DMA targeting any address in the system,
> including in this case the user driver could generate a DMA targeting
> this migration BAR.  Ideally this would be routed upstream to the IOMMU
> where it would be blocked for lack of a translation entry.  However,
> because this range resides on the same PCIe requester ID, it's
> logically more similar to a two-function device where the functions are
> considered non-isolated and are therefore exposed within the same IOMMU
> group.  We would not allow a kernel driver for one of those functions
> and a userspace driver for the other.  In this case those drivers are
> strongly related, but we still need to consider to what extent a
> malicious user driver can interfere with or exploit the kernel side
> driver.
> 
> 
>>> MMIO region of the same device.  This is susceptible to DMA
>>> manipulation by the user device.  Whether that's a security issue or
>>> not depends on how the user can break the device.  If the scope is
>>> limited to breaking their own device, they can do that any number of
>>> ways and it's not very interesting.  If the user can manipulate device
>>> state in order to trigger an exploit of the host-side kernel driver,
>>> that's obviously more of a problem.
>>>
>>> The other side of this is that if migration support can be implemented
>>> entirely within the VF using this portion of the device MMIO space, why
>>> do we need the host kernel to support this rather than implementing it
>>> in userspace?  For example, QEMU could know about this device,
>>> manipulate the BAR size to expose only the operational portion of MMIO
>>> to the VM and use the remainder to support migration itself.  I'm
>>> afraid that just like mdev, the vfio migration uAPI is going to be used
>>> as an excuse to create kernel drivers simply to be able to make use of
>>> that uAPI.  I haven't looked at this driver to know if it has some  
>>
>> When the accelerator device was designed to support the live migration
>> function, the design was based on the uAPI of the migration region, so the
>> live migration function requires a driver that connects to this uAPI.
>> Is this set of interfaces not open to us now?
> 
> In your model, if both BARs are exposed to userspace and a device
> specific extension in QEMU claims the migration BAR rather than
> exposing it to the VM, could that driver mimic the migration region
> uAPI from userspace?  For example, you don't need page pinning to
> interact with the IOMMU, you don't need resources beyond the scope
> of the endpoint device itself, and the migration info BAR is safe for
> userspace to manage?  If so, then a kernel-based driver to expose a
> migration uAPI seems like it's only a risk for the kernel, ie. moving
> what could be a userspace driver into the kernel for the convenience of
> re-using a kernel uAPI.  Thanks,
> 
> Alex
> 
> .
> 
Thanks for your review and suggestions.

We discussed the security issues you mentioned and came up with some new solutions.
These new solutions are being implemented now; after they pass testing,
they will be posted in the second version of the RFC.

Thanks,
Longfang.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-21 18:12                 ` Alex Williamson
  2021-04-26 11:49                   ` liulongfang
@ 2021-05-12  8:39                   ` liulongfang
  2021-05-12 12:10                     ` Jason Gunthorpe
  1 sibling, 1 reply; 34+ messages in thread
From: liulongfang @ 2021-05-12  8:39 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jason Gunthorpe, cohuck, linux-kernel, linuxarm

On 2021/4/22 2:12, Alex Williamson wrote:
> On Wed, 21 Apr 2021 17:59:02 +0800
> liulongfang <liulongfang@huawei.com> wrote:
> 
>> On 2021/4/21 6:04, Alex Williamson wrote:
>>> On Tue, 20 Apr 2021 09:59:57 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>   
>>>> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:  
>>>>> On 2021/4/19 20:33, Jason Gunthorpe wrote:    
>>>>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
>>>>>>     
>>>>>>>> I'm also confused how this works securely at all, as a general rule a
>>>>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>>>>>>> planning to assign to the guest. There is a lot of danger that the
>>>>>>>> guest could access that MMIO space one way or another.    
>>>>>>>
>>>>>>> VF's MMIO memory is divided into two parts, one is the guest part,
>>>>>>> and the other is the live migration part. They do not affect each other,
>>>>>>> so there is no security problem.    
>>>>>>
>>>>>> AFAIK there are several scenarios where a guest can access this MMIO
>>>>>> memory using DMA even if it is not mapped into the guest for CPU
>>>>>> access.
>>>>>>     
>>>>> The hardware divides VF's MMIO memory into two parts. The live migration
>>>>> driver in the host uses the live migration part, and the device driver in
>>>>> the guest uses the guest part. Each driver obtains the address of its own
>>>>> part of VF's MMIO memory. Although the two parts are contiguous on the
>>>>> hardware device, by design neither driver operates on the other part of
>>>>> the memory, and the device hardware responds to the operation commands of
>>>>> the two parts independently.    
>>>>
>>>> It doesn't matter, the memory is still under the same PCI BDF and VFIO
>>>> supports scenarios where devices in the same IOMMU group are not
>>>> isolated from each other.
>>>>
>>>> This is why the granual of isolation is a PCI BDF - VFIO directly
>>>> blocks kernel drivers from attaching to PCI BDFs that are not
>>>> completely isolated from VFIO BDF.
>>>>
>>>> Bypassing this prevention and attaching a kernel driver directly to
>>>> the same BDF being exposed to the guest breaks that isolation model.
>>>>  
>>>>> So, I still don't understand what the security risk you are talking about is,
>>>>> and what do you think the security design should look like?
>>>>> Can you elaborate on it?    
>>>>
>>>> Each security domain must have its own PCI BDF.
>>>>
>>>> The migration control registers must be on a different VF from the VF
>>>> being plugged into a guest and the two VFs have to be in different
>>>> IOMMU groups to ensure they are isolated from each other.  
>>>
>>> I think that's a solution, I don't know if it's the only solution.
>>> AIUI, the issue here is that we have a device specific kernel driver
>>> extending vfio-pci with migration support for this device by using an  
>>
>> If the two parts of the MMIO region are split into different BAR spaces on
>> the device, the MMIO region of the business function stays in BAR2,
>> and the MMIO region of the live migration function moves to BAR4.
>> Only BAR2 is mapped in the guest, and only BAR4 is mapped in the host.
>> This would solve the security issue.
> 
> The concern is really the "on the device" part rather than whether the
> resources are within the same BAR or not.  We need to assume that a
> user driver can generate a DMA targeting any address in the system,
> including in this case the user driver could generate a DMA targeting
> this migration BAR.  Ideally this would be routed upstream to the IOMMU
> where it would be blocked for lack of a translation entry.  However,
> because this range resides on the same PCIe requester ID, it's
> logically more similar to a two-function device where the functions are
> considered non-isolated and are therefore exposed within the same IOMMU
> group.  We would not allow a kernel driver for one of those functions
> and a userspace driver for the other.  In this case those drivers are
> strongly related, but we still need to consider to what extent a
> malicious user driver can interfere with or exploit the kernel side
> driver.
> 

Hello Alex! While updating the driver scheme, we tested limiting the length
of the BAR configuration space reported by VFIO to 0x2000, so that the guest
cannot obtain the configuration of the migration region (which sits after
0x2000). When the driver in the guest then tries to access the configuration
space beyond 0x2000, the test found that the OS in the guest directly reports
a memory access error.

Therefore, this method of limiting the length of the BAR configuration space can
prevent unsafe operations of the memory.
> 
>>> MMIO region of the same device.  This is susceptible to DMA
>>> manipulation by the user device.  Whether that's a security issue or
>>> not depends on how the user can break the device.  If the scope is
>>> limited to breaking their own device, they can do that any number of
>>> ways and it's not very interesting.  If the user can manipulate device
>>> state in order to trigger an exploit of the host-side kernel driver,
>>> that's obviously more of a problem.
>>>
>>> The other side of this is that if migration support can be implemented
>>> entirely within the VF using this portion of the device MMIO space, why
>>> do we need the host kernel to support this rather than implementing it
>>> in userspace?  For example, QEMU could know about this device,
>>> manipulate the BAR size to expose only the operational portion of MMIO
>>> to the VM and use the remainder to support migration itself.  I'm
>>> afraid that just like mdev, the vfio migration uAPI is going to be used
>>> as an excuse to create kernel drivers simply to be able to make use of
>>> that uAPI.  I haven't looked at this driver to know if it has some  
>>
>> When the accelerator device is designed to support the live migration
>> function, it is based on the uAPI of the migration region to realize the
>> live migration function, so the live migration function requires a driver
>> that connects to this uAPI.
>> Is this set of interfaces not open to us now?
> 
> In your model, if both BARs are exposed to userspace and a device
> specific extension in QEMU claims the migration BAR rather than
> exposing it to the VM, could that driver mimic the migration region
> uAPI from userspace?  For example, you don't need page pinning to
> interact with the IOMMU, you don't need resources beyond the scope
> of the endpoint device itself, and the migration info BAR is safe for
> userspace to manage?  If so, then a kernel-based driver to expose a
> migration uAPI seems like it's only a risk for the kernel, ie. moving
> what could be a userspace driver into the kernel for the convenience of
> re-using a kernel uAPI.  Thanks,
> 
> Alex
> 
> .
> 
Thanks.
Longfang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-12  8:39                   ` liulongfang
@ 2021-05-12 12:10                     ` Jason Gunthorpe
  2021-05-13  2:08                       ` liulongfang
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2021-05-12 12:10 UTC (permalink / raw)
  To: liulongfang; +Cc: Alex Williamson, cohuck, linux-kernel, linuxarm

On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:

> Therefore, this method of limiting the length of the BAR
> configuration space can prevent unsafe operations of the memory.

The issue is DMA controlled by the guest accessing the secure BAR
area, not the guest CPU.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-12 12:10                     ` Jason Gunthorpe
@ 2021-05-13  2:08                       ` liulongfang
  2021-05-13 13:44                         ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: liulongfang @ 2021-05-13  2:08 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, cohuck, linux-kernel, linuxarm

On 2021/5/12 20:10, Jason Gunthorpe wrote:
> On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:
> 
>> Therefore, this method of limiting the length of the BAR
>> configuration space can prevent unsafe operations of the memory.
> 
> The issue is DMA controlled by the guest accessing the secure BAR
> area, not the guest CPU.
> 
> Jason
> .
> 
This secure BAR area is not presented to the Guest,
which makes it impossible for the Guest to obtain the secure BAR area
when establishing the DMA mapping of the configuration space.
If the DMA controller accesses the secure BAR area, the access will
be blocked by the SMMU.

Thanks
Longfang.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13  2:08                       ` liulongfang
@ 2021-05-13 13:44                         ` Jason Gunthorpe
  2021-05-13 15:49                           ` [Linuxarm] " Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2021-05-13 13:44 UTC (permalink / raw)
  To: liulongfang; +Cc: Alex Williamson, cohuck, linux-kernel, linuxarm

On Thu, May 13, 2021 at 10:08:28AM +0800, liulongfang wrote:
> On 2021/5/12 20:10, Jason Gunthorpe wrote:
> > On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:
> > 
> >> Therefore, this method of limiting the length of the BAR
> >> configuration space can prevent unsafe operations of the memory.
> > 
> > The issue is DMA controlled by the guest accessing the secure BAR
> > area, not the guest CPU.
> > 
> > Jason
> > .
> > 
> This secure BAR area is not presented to the Guest,
> which makes it impossible for the Guest to obtain the secure BAR area
> when establishing the DMA mapping of the configuration space.
> If the DMA controller accesses the secure BAR area, the access will
> be blocked by the SMMU.

There are scenarios where this is not true.

At a minimum the mdev driver should refuse to work in those cases.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 13:44                         ` Jason Gunthorpe
@ 2021-05-13 15:49                           ` Shameerali Kolothum Thodi
  2021-05-13 17:03                             ` Alex Williamson
  0 siblings, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Jason Gunthorpe, liulongfang
  Cc: Alex Williamson, cohuck, linux-kernel, linuxarm



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 13 May 2021 14:44
> To: liulongfang <liulongfang@huawei.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; cohuck@redhat.com;
> linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> vfio
> 
> On Thu, May 13, 2021 at 10:08:28AM +0800, liulongfang wrote:
> > On 2021/5/12 20:10, Jason Gunthorpe wrote:
> > > On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:
> > >
> > >> Therefore, this method of limiting the length of the BAR
> > >> configuration space can prevent unsafe operations of the memory.
> > >
> > > The issue is DMA controlled by the guest accessing the secure BAR
> > > area, not the guest CPU.
> > >
> > > Jason
> > > .
> > >
> > This secure BAR area is not presented to the Guest,
> > which makes it impossible for the Guest to obtain the secure BAR area
> > when establishing the DMA mapping of the configuration space.
> > If the DMA controller accesses the secure BAR area, the access will
> > be blocked by the SMMU.
> 
> There are scenarios where this is not true.
> 
> At a minimum the mdev driver should refuse to work in those cases.
> 

Hi,

I think the idea here is not a generic solution, but a quirk for this specific dev.

Something like, 

--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -866,7 +866,12 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
                        break;
                case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
                        info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-                       info.size = pci_resource_len(pdev, info.index);
+
+                       if (check_hisi_acc_quirk(pdev, info))
+                               info.size = new_size;// BAR is limited without migration region.
+                       else
+                               info.size = pci_resource_len(pdev, info.index);
+
                        if (!info.size) {
                                info.flags = 0;
                                break;

Is this an acceptable/workable solution here?

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 15:49                           ` [Linuxarm] " Shameerali Kolothum Thodi
@ 2021-05-13 17:03                             ` Alex Williamson
  2021-05-13 17:52                               ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Williamson @ 2021-05-13 17:03 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, liulongfang, cohuck, linux-kernel, linuxarm

On Thu, 13 May 2021 15:49:25 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > Sent: 13 May 2021 14:44
> > To: liulongfang <liulongfang@huawei.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>; cohuck@redhat.com;
> > linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> > Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> > vfio
> > 
> > On Thu, May 13, 2021 at 10:08:28AM +0800, liulongfang wrote:  
> > > On 2021/5/12 20:10, Jason Gunthorpe wrote:  
> > > > On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:
> > > >  
> > > >> Therefore, this method of limiting the length of the BAR
> > > >> configuration space can prevent unsafe operations of the memory.  
> > > >
> > > > The issue is DMA controlled by the guest accessing the secure BAR
> > > > area, not the guest CPU.
> > > >
> > > > Jason
> > > > .
> > > >  
> > > This secure BAR area is not presented to the Guest,
> > > which makes it impossible for the Guest to obtain the secure BAR area
> > > when establishing the DMA mapping of the configuration space.
> > > If the DMA controller accesses the secure BAR area, the access will
> > > be blocked by the SMMU.  
> > 
> > There are scenarios where this is not true.
> > 
> > At a minimum the mdev driver should refuse to work in those cases.
> >   
> 
> Hi,
> 
> I think the idea here is not a generic solution, but a quirk for this specific dev.
> 
> Something like, 
> 
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -866,7 +866,12 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
>                         break;
>                 case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>                         info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> -                       info.size = pci_resource_len(pdev, info.index);
> +
> +                       if (check_hisi_acc_quirk(pdev, info))
> +                               info.size = new_size;// BAR is limited without migration region.
> +                       else
> +                               info.size = pci_resource_len(pdev, info.index);
> +
>                         if (!info.size) {
>                                 info.flags = 0;
>                                 break;
> 
> Is this an acceptable/workable solution here?

As Jason says, this only restricts CPU access to the BAR, the issue is
DMA access.  As the hardware vendor you may be able to guarantee that
a DMA transaction generated by the device targeting the remainder of
the BAR will always go upstream, but can you guarantee the routing
between the device and the SMMU?  For instance if this device can be
implemented as a plugin card, then it can be installed into a
downstream port that may not support ACS.  That downstream port may
implement request redirection allowing the transaction to reflect back
to the device without IOMMU translation.  At that point the userspace
driver can target the kernel driver half of the BAR and potentially
expose a security risk.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 17:03                             ` Alex Williamson
@ 2021-05-13 17:52                               ` Shameerali Kolothum Thodi
  2021-05-13 18:22                                 ` Alex Williamson
  2021-05-13 18:24                                 ` Jason Gunthorpe
  0 siblings, 2 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-05-13 17:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, liulongfang, cohuck, linux-kernel, linuxarm

Hi Alex,

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 13 May 2021 18:04
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; liulongfang
> <liulongfang@huawei.com>; cohuck@redhat.com;
> linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> vfio
> 
> On Thu, 13 May 2021 15:49:25 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> wrote:
> 
> > > -----Original Message-----
> > > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > > Sent: 13 May 2021 14:44
> > > To: liulongfang <liulongfang@huawei.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>; cohuck@redhat.com;
> > > linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> > > Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> > > vfio
> > >
> > > On Thu, May 13, 2021 at 10:08:28AM +0800, liulongfang wrote:
> > > > On 2021/5/12 20:10, Jason Gunthorpe wrote:
> > > > > On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:
> > > > >
> > > > >> Therefore, this method of limiting the length of the BAR
> > > > >> configuration space can prevent unsafe operations of the memory.
> > > > >
> > > > > The issue is DMA controlled by the guest accessing the secure BAR
> > > > > area, not the guest CPU.
> > > > >
> > > > > Jason
> > > > > .
> > > > >
> > > > This secure BAR area is not presented to the Guest,
> > > > which makes it impossible for the Guest to obtain the secure BAR area
> > > > when establishing the DMA mapping of the configuration space.
> > > > If the DMA controller accesses the secure BAR area, the access will
> > > > be blocked by the SMMU.
> > >
> > > There are scenarios where this is not true.
> > >
> > > At a minimum the mdev driver should refuse to work in those cases.
> > >
> >
> > Hi,
> >
> > I think the idea here is not a generic solution, but a quirk for this specific dev.
> >
> > Something like,
> >
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -866,7 +866,12 @@ static long vfio_pci_ioctl(struct vfio_device
> *core_vdev,
> >                         break;
> >                 case VFIO_PCI_BAR0_REGION_INDEX ...
> VFIO_PCI_BAR5_REGION_INDEX:
> >                         info.offset =
> VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > -                       info.size = pci_resource_len(pdev, info.index);
> > +
> > +                       if (check_hisi_acc_quirk(pdev, info))
> > +                               info.size = new_size;// BAR is limited
> without migration region.
> > +                       else
> > +                               info.size = pci_resource_len(pdev,
> info.index);
> > +
> >                         if (!info.size) {
> >                                 info.flags = 0;
> >                                 break;
> >
> > Is this an acceptable/workable solution here?
> 
> As Jason says, this only restricts CPU access to the BAR, the issue is
> DMA access.  As the hardware vendor you may be able to guarantee that
> a DMA transaction generated by the device targeting the remainder of
> the BAR will always go upstream, but can you guarantee the routing
> between the device and the SMMU?  For instance if this device can be
> implemented as a plugin card, then it can be installed into a
> downstream port that may not support ACS.  That downstream port may
> implement request redirection allowing the transaction to reflect back
> to the device without IOMMU translation.  At that point the userspace
> driver can target the kernel driver half of the BAR and potentially
> expose a security risk.  Thanks,

The ACC devices on this platform are not pluggable devices. They are exposed
as integrated endpoint devices. So I am not sure the above concern is valid in this
case.

I had a look at the userspace driver approach you suggested. But unfortunately
the migration state change for the vf has to check some of the pf registers for
confirming the state. So even if we move the implementation to Qemu, we
still may have to use the migration uAPI to access the pf device registers.

Since the devices we are concerned here are all integrated endpoints and if the 
above quirk is an acceptable one, then we can use the uAPI as done in this
series without overly complicating things here.

Please let me know your thoughts,

Thanks,
Shameer




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 17:52                               ` Shameerali Kolothum Thodi
@ 2021-05-13 18:22                                 ` Alex Williamson
  2021-05-13 18:31                                   ` Shameerali Kolothum Thodi
  2021-05-13 18:34                                   ` Jason Gunthorpe
  2021-05-13 18:24                                 ` Jason Gunthorpe
  1 sibling, 2 replies; 34+ messages in thread
From: Alex Williamson @ 2021-05-13 18:22 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, liulongfang, cohuck, linux-kernel, linuxarm

On Thu, 13 May 2021 17:52:56 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> Hi Alex,
> 
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: 13 May 2021 18:04
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>; liulongfang
> > <liulongfang@huawei.com>; cohuck@redhat.com;
> > linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> > Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> > vfio
> > 
> > On Thu, 13 May 2021 15:49:25 +0000
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > wrote:
> >   
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > > > Sent: 13 May 2021 14:44
> > > > To: liulongfang <liulongfang@huawei.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>; cohuck@redhat.com;
> > > > linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> > > > Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> > > > vfio
> > > >
> > > > On Thu, May 13, 2021 at 10:08:28AM +0800, liulongfang wrote:  
> > > > > On 2021/5/12 20:10, Jason Gunthorpe wrote:  
> > > > > > On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:
> > > > > >  
> > > > > >> Therefore, this method of limiting the length of the BAR
> > > > > >> configuration space can prevent unsafe operations of the memory.  
> > > > > >
> > > > > > The issue is DMA controlled by the guest accessing the secure BAR
> > > > > > area, not the guest CPU.
> > > > > >
> > > > > > Jason
> > > > > > .
> > > > > >  
> > > > > This secure BAR area is not presented to the Guest,
> > > > > which makes it impossible for the Guest to obtain the secure BAR area
> > > > > when establishing the DMA mapping of the configuration space.
> > > > > If the DMA controller accesses the secure BAR area, the access will
> > > > > be blocked by the SMMU.  
> > > >
> > > > There are scenarios where this is not true.
> > > >
> > > > At a minimum the mdev driver should refuse to work in those cases.
> > > >  
> > >
> > > Hi,
> > >
> > > I think the idea here is not a generic solution, but a quirk for this specific dev.
> > >
> > > Something like,
> > >
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -866,7 +866,12 @@ static long vfio_pci_ioctl(struct vfio_device  
> > *core_vdev,  
> > >                         break;
> > >                 case VFIO_PCI_BAR0_REGION_INDEX ...  
> > VFIO_PCI_BAR5_REGION_INDEX:  
> > >                         info.offset =  
> > VFIO_PCI_INDEX_TO_OFFSET(info.index);  
> > > -                       info.size = pci_resource_len(pdev, info.index);
> > > +
> > > +                       if (check_hisi_acc_quirk(pdev, info))
> > > +                               info.size = new_size;// BAR is limited  
> > without migration region.  
> > > +                       else
> > > +                               info.size = pci_resource_len(pdev,  
> > info.index);  
> > > +
> > >                         if (!info.size) {
> > >                                 info.flags = 0;
> > >                                 break;
> > >
> > > Is this an acceptable/workable solution here?  
> > 
> > As Jason says, this only restricts CPU access to the BAR, the issue is
> > DMA access.  As the hardware vendor you may be able to guarantee that
> > a DMA transaction generated by the device targeting the remainder of
> > the BAR will always go upstream, but can you guarantee the routing
> > between the device and the SMMU?  For instance if this device can be
> > implemented as a plugin card, then it can be installed into a
> > downstream port that may not support ACS.  That downstream port may
> > implement request redirection allowing the transaction to reflect back
> > to the device without IOMMU translation.  At that point the userspace
> > driver can target the kernel driver half of the BAR and potentially
> > expose a security risk.  Thanks,  
> 
> The ACC devices on this platform are not pluggable devices. They are exposed
> as integrated endpoint devices. So I am not sure the above concern is valid in this
> case.
> 
> I had a look at the userspace driver approach you suggested. But unfortunately
> the migration state change for the vf has to check some of the pf registers for
> confirming the state. So even if we move the implementation to Qemu, we
> still may have to use the migration uAPI to access the pf device registers.
> 
> Since the devices we are concerned here are all integrated endpoints and if the 
> above quirk is an acceptable one, then we can use the uAPI as done in this
> series without overly complicating things here.

If you expect this device to appear only as an integrated endpoint, then
I think Jason's suggestion above is correct.  Your driver that supports
migration can refuse to load for devices where the topology is other
than expected, and you're effectively guaranteeing DMA isolation of the
user and in-kernel drivers by hardware DMA semantics and topology.

Requiring access to the PF to support the migration protocol also
suggests that an in-kernel driver to support migration is our best
option.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 17:52                               ` Shameerali Kolothum Thodi
  2021-05-13 18:22                                 ` Alex Williamson
@ 2021-05-13 18:24                                 ` Jason Gunthorpe
  2021-05-13 18:35                                   ` Shameerali Kolothum Thodi
  2021-05-27 10:11                                   ` Shameerali Kolothum Thodi
  1 sibling, 2 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2021-05-13 18:24 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Alex Williamson, liulongfang, cohuck, linux-kernel, linuxarm

On Thu, May 13, 2021 at 05:52:56PM +0000, Shameerali Kolothum Thodi wrote:

> Since the devices we are concerned here are all integrated endpoints and if the 
> above quirk is an acceptable one, then we can use the uAPI as done in this
> series without overly complicating things here.

IMHO such a quirk in the core code should not be done. You need to
override this in your own driver like Max was showing.

I think we are very close to having worked out a great solution to the
lingering questions on Max's last RFC, hopefully we can post an
updated version soon

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 18:22                                 ` Alex Williamson
@ 2021-05-13 18:31                                   ` Shameerali Kolothum Thodi
  2021-05-13 18:34                                   ` Jason Gunthorpe
  1 sibling, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-05-13 18:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, liulongfang, cohuck, linux-kernel, linuxarm



> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 13 May 2021 19:23
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; liulongfang
> <liulongfang@huawei.com>; cohuck@redhat.com;
> linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> Subject: Re: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> vfio
> 
> On Thu, 13 May 2021 17:52:56 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> wrote:
> 
> > Hi Alex,
> >
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: 13 May 2021 18:04
> > > To: Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> > > Cc: Jason Gunthorpe <jgg@nvidia.com>; liulongfang
> > > <liulongfang@huawei.com>; cohuck@redhat.com;
> > > linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> > > Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> > > vfio
> > >
> > > On Thu, 13 May 2021 15:49:25 +0000
> > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > > wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > > > > Sent: 13 May 2021 14:44
> > > > > To: liulongfang <liulongfang@huawei.com>
> > > > > Cc: Alex Williamson <alex.williamson@redhat.com>;
> cohuck@redhat.com;
> > > > > linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> > > > > Subject: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the
> driver to
> > > > > vfio
> > > > >
> > > > > On Thu, May 13, 2021 at 10:08:28AM +0800, liulongfang wrote:
> > > > > > On 2021/5/12 20:10, Jason Gunthorpe wrote:
> > > > > > > On Wed, May 12, 2021 at 04:39:43PM +0800, liulongfang wrote:
> > > > > > >
> > > > > > >> Therefore, this method of limiting the length of the BAR
> > > > > > >> configuration space can prevent unsafe operations of the memory.
> > > > > > >
> > > > > > > The issue is DMA controlled by the guest accessing the secure BAR
> > > > > > > area, not the guest CPU.
> > > > > > >
> > > > > > > Jason
> > > > > > > .
> > > > > > >
> > > > > > This secure BAR area is not presented to the Guest,
> > > > > > which makes it impossible for the Guest to obtain the secure BAR area
> > > > > > when establishing the DMA mapping of the configuration space.
> > > > > > If the DMA controller accesses the secure BAR area, the access will
> > > > > > be blocked by the SMMU.
> > > > >
> > > > > There are scenarios where this is not true.
> > > > >
> > > > > At a minimum the mdev driver should refuse to work in those cases.
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > I think the idea here is not a generic solution, but a quirk for this
> > > > specific dev.
> > > >
> > > > Something like,
> > > >
> > > > --- a/drivers/vfio/pci/vfio_pci.c
> > > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > > @@ -866,7 +866,12 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
> > > >                         break;
> > > >                 case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > >                         info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > > > -                       info.size = pci_resource_len(pdev, info.index);
> > > > +
> > > > +                       if (check_hisi_acc_quirk(pdev, info))
> > > > +                               info.size = new_size; // BAR is limited without migration region
> > > > +                       else
> > > > +                               info.size = pci_resource_len(pdev, info.index);
> > > > +
> > > >                         if (!info.size) {
> > > >                                 info.flags = 0;
> > > >                                 break;
> > > >
> > > > Is this an acceptable/workable solution here?
> > >
> > > As Jason says, this only restricts CPU access to the BAR, the issue is
> > > DMA access.  As the hardware vendor you may be able to guarantee that
> > > a DMA transaction generated by the device targeting the remainder of
> > > the BAR will always go upstream, but can you guarantee the routing
> > > between the device and the SMMU?  For instance if this device can be
> > > implemented as a plugin card, then it can be installed into a
> > > downstream port that may not support ACS.  That downstream port may
> > > implement request redirection allowing the transaction to reflect back
> > > to the device without IOMMU translation.  At that point the userspace
> > > driver can target the kernel driver half of the BAR and potentially
> > > expose a security risk.  Thanks,
> >
> > The ACC devices on this platform are not pluggable devices. They are
> > exposed as integrated endpoint devices. So I am not sure the above
> > concern is valid in this case.
> >
> > I had a look at the userspace driver approach you suggested. But
> > unfortunately the migration state change for the VF has to check some of
> > the PF registers to confirm the state. So even if we move the
> > implementation to QEMU, we still may have to use the migration uAPI to
> > access the PF device registers.
> >
> > Since the devices we are concerned with here are all integrated endpoints,
> > and if the above quirk is an acceptable one, then we can use the uAPI as
> > done in this series without overly complicating things here.
> 
> If you expect this device to appear only as an integrated endpoint, then
> I think Jason's suggestion above is correct.  Your driver that supports
> migration can refuse to load for devices where the topology is other
> than expected and you're effectively guaranteeing DMA isolation of the
> user and in-kernel drivers by hardware DMA semantics and topology.

Ok. Will take a look at the recommendation.

> Requiring access to the PF to support the migration protocol also
> suggests that an in-kernel driver to support migration is our best
> option.  

Thanks,
Shameer


* Re: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 18:22                                 ` Alex Williamson
  2021-05-13 18:31                                   ` Shameerali Kolothum Thodi
@ 2021-05-13 18:34                                   ` Jason Gunthorpe
  1 sibling, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2021-05-13 18:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Shameerali Kolothum Thodi, liulongfang, cohuck, linux-kernel, linuxarm

On Thu, May 13, 2021 at 12:22:32PM -0600, Alex Williamson wrote:

> If you expect this device to appear only as an integrated endpoint, then
> I think Jason's suggestion above is correct.  Your driver that supports
> migration can refuse to load for devices where the topology is other
> than expected and you're effectively guaranteeing DMA isolation of the
> user and in-kernel drivers by hardware DMA semantics and topology.

Some kind of VFIO API, 'vfio_is_this_pci_dev_is_safe_to_share()',
seems appropriate here.

I saw Intel doing the same thing in one of their VDPA driver proposals.

Jason


* RE: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 18:24                                 ` Jason Gunthorpe
@ 2021-05-13 18:35                                   ` Shameerali Kolothum Thodi
  2021-05-27 10:11                                   ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-05-13 18:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, liulongfang, cohuck, linux-kernel, linuxarm



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 13 May 2021 19:24
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; liulongfang
> <liulongfang@huawei.com>; cohuck@redhat.com;
> linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> Subject: Re: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> vfio
> 
> On Thu, May 13, 2021 at 05:52:56PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > Since the devices we are concerned with here are all integrated endpoints,
> > and if the above quirk is an acceptable one, then we can use the uAPI as
> > done in this series without overly complicating things here.
> 
> IMHO such a quirk in the core code should not be done. You need to
> override this in your own driver like Max was showing.

Ok. Sure, will take a look at that.

> I think we are very close to having worked out a great solution to the
> lingering questions on Max's last RFC, hopefully we can post an
> updated version soon

Cool. If possible kindly CC us when it happens.

Thanks,
Shameer



* RE: [Linuxarm]  Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-13 18:24                                 ` Jason Gunthorpe
  2021-05-13 18:35                                   ` Shameerali Kolothum Thodi
@ 2021-05-27 10:11                                   ` Shameerali Kolothum Thodi
  2021-05-27 10:30                                     ` Max Gurtovoy
  1 sibling, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-05-27 10:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: liulongfang, cohuck, linux-kernel, linuxarm, mgurtovoy, Zengtao (B)

[+]

Hi,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 13 May 2021 19:24
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; liulongfang
> <liulongfang@huawei.com>; cohuck@redhat.com;
> linux-kernel@vger.kernel.org; linuxarm@openeuler.org
> Subject: Re: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
> vfio
> 
> On Thu, May 13, 2021 at 05:52:56PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > Since the devices we are concerned with here are all integrated endpoints,
> > and if the above quirk is an acceptable one, then we can use the uAPI as
> > done in this series without overly complicating things here.
> 
> IMHO such a quirk in the core code should not be done. You need to
> override this in your own driver like Max was showing.
> 
> I think we are very close to having worked out a great solution to the
> lingering questions on Max's last RFC, hopefully we can post an
> updated version soon

We have now integrated this HiSilicon ACC live migration driver into the
proposed[0] vfio-pci-core subsystem framework. Basically, we now have
a new vendor extension driver based on this framework. It indeed makes it
easy to register with the vfio core and to override functions where required.
We performed some basic sanity tests with the prototype and it seems to be
working.

Also, we have now managed to get rid of any access to the Guest VF device MMIO
space during the (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)
state. However, we still need to access the Guest VF MMIO space during a
few other device states, e.g. VFIO_DEVICE_STATE_SAVING, VFIO_DEVICE_STATE_STOP,
etc. These accesses are to save and restore the device register state. As per
our analysis the Guest vCPUs are stopped in these device states, so I hope we
don't have any security-related holes with these accesses.

Please let us know if we have missed something, or if anything else needs to be
taken care of with this approach.

We are planning to send out a revised RFC soon, and if there is a plan to
post an updated version of Max's series as mentioned above, we will base
ours on that one. Please let us know.

Thanks,
Shameer
[0] https://lore.kernel.org/lkml/20210310123127.GT2356281@nvidia.com/T/




* Re: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-05-27 10:11                                   ` Shameerali Kolothum Thodi
@ 2021-05-27 10:30                                     ` Max Gurtovoy
  0 siblings, 0 replies; 34+ messages in thread
From: Max Gurtovoy @ 2021-05-27 10:30 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Jason Gunthorpe, Alex Williamson
  Cc: liulongfang, cohuck, linux-kernel, linuxarm, Zengtao (B), Oren Duer


On 5/27/2021 1:11 PM, Shameerali Kolothum Thodi wrote:
> [+]
>
> Hi,
>
>> -----Original Message-----
>> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
>> Sent: 13 May 2021 19:24
>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>> Cc: Alex Williamson <alex.williamson@redhat.com>; liulongfang
>> <liulongfang@huawei.com>; cohuck@redhat.com;
>> linux-kernel@vger.kernel.org; linuxarm@openeuler.org
>> Subject: Re: [Linuxarm] Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to
>> vfio
>>
>> On Thu, May 13, 2021 at 05:52:56PM +0000, Shameerali Kolothum Thodi
>> wrote:
>>
>>> Since the devices we are concerned with here are all integrated endpoints,
>>> and if the above quirk is an acceptable one, then we can use the uAPI as
>>> done in this series without overly complicating things here.
>> IMHO such a quirk in the core code should not be done. You need to
>> override this in your own driver like Max was showing.
>>
>> I think we are very close to having worked out a great solution to the
>> lingering questions on Max's last RFC, hopefully we can post an
>> updated version soon
> We have now integrated this HiSilicon ACC live migration driver into the
> proposed[0] vfio-pci-core subsystem framework. Basically, we now have
> a new vendor extension driver based on this framework. It indeed makes it
> easy to register with the vfio core and to override functions where required.
> We performed some basic sanity tests with the prototype and it seems to be
> working.

Great news.


>
> Also, we have now managed to get rid of any access to the Guest VF device MMIO
> space during the (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)
> state. However, we still need to access the Guest VF MMIO space during a
> few other device states, e.g. VFIO_DEVICE_STATE_SAVING, VFIO_DEVICE_STATE_STOP,
> etc. These accesses are to save and restore the device register state. As per
> our analysis the Guest vCPUs are stopped in these device states, so I hope we
> don't have any security-related holes with these accesses.

We can start talking about more APIs that are missing.

>
> Please let us know if we have missed something, or if anything else needs to be
> taken care of with this approach.
>
> We are planning to send out a revised RFC soon, and if there is a plan to
> post an updated version of Max's series as mentioned above, we will base
> ours on that one. Please let us know.

I'll send a new version of the series soon with many improvements taken from
the previous discussion.

I'll add you as CC.


>
> Thanks,
> Shameer
> [0] https://lore.kernel.org/lkml/20210310123127.GT2356281@nvidia.com/T/
>
>


* [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-13  1:20 [RFC PATCH v3 " Longfang Liu
@ 2021-04-13  1:20 ` Longfang Liu
  0 siblings, 0 replies; 34+ messages in thread
From: Longfang Liu @ 2021-04-13  1:20 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, linux-kernel, linuxarm, liulongfang

Register the live migration driver of the accelerator module to vfio

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
 drivers/vfio/pci/vfio_pci_private.h | 10 ++++++++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f041b1a..50d1138 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -293,6 +293,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
+		ret = vfio_pci_hisilicon_acc_init(vdev);
+		if (ret && ret != -ENODEV) {
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup Hisilicon ACC region\n");
+			vfio_pci_disable(vdev);
+			return ret;
+		}
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index f561ac1..74bb611 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -138,4 +138,14 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
+
+#ifdef CONFIG_VFIO_PCI_HISI_MIGRATION
+extern int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
+
 #endif /* VFIO_PCI_PRIVATE_H */
-- 
2.8.1



* [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio
  2021-04-12  8:53 [RFC PATCH v3 0/3] vfio/hisilicon: add acc live migration driver Longfang Liu
@ 2021-04-12  8:53 ` Longfang Liu
  0 siblings, 0 replies; 34+ messages in thread
From: Longfang Liu @ 2021-04-12  8:53 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, linux-kernel, linuxarm, liulongfang

Register the live migration driver of the accelerator module to vfio

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Zengtao <prime.zeng@hisilicon.com>
---
 drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
 drivers/vfio/pci/vfio_pci_private.h | 10 ++++++++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f041b1a..50d1138 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -293,6 +293,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
+		ret = vfio_pci_hisilicon_acc_init(vdev);
+		if (ret && ret != -ENODEV) {
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup Hisilicon ACC region\n");
+			vfio_pci_disable(vdev);
+			return ret;
+		}
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index f561ac1..74bb611 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -138,4 +138,14 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
+
+#ifdef CONFIG_VFIO_PCI_HISI_MIGRATION
+extern int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
+
 #endif /* VFIO_PCI_PRIVATE_H */
-- 
2.8.1



end of thread, other threads:[~2021-05-27 10:30 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-13  3:36 [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Longfang Liu
2021-04-13  3:36 ` [RFC PATCH 1/3] " Longfang Liu
2021-04-13  3:36 ` [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio Longfang Liu
2021-04-15 22:01   ` Jason Gunthorpe
2021-04-19 12:24     ` liulongfang
2021-04-19 12:33       ` Jason Gunthorpe
2021-04-20 12:50         ` liulongfang
2021-04-20 12:59           ` Jason Gunthorpe
2021-04-20 13:28             ` liulongfang
2021-04-20 14:55               ` Jason Gunthorpe
2021-04-20 22:04             ` Alex Williamson
2021-04-20 23:18               ` Jason Gunthorpe
2021-04-21  9:59                 ` liulongfang
2021-04-21  9:59               ` liulongfang
2021-04-21 18:12                 ` Alex Williamson
2021-04-26 11:49                   ` liulongfang
2021-05-12  8:39                   ` liulongfang
2021-05-12 12:10                     ` Jason Gunthorpe
2021-05-13  2:08                       ` liulongfang
2021-05-13 13:44                         ` Jason Gunthorpe
2021-05-13 15:49                           ` [Linuxarm] " Shameerali Kolothum Thodi
2021-05-13 17:03                             ` Alex Williamson
2021-05-13 17:52                               ` Shameerali Kolothum Thodi
2021-05-13 18:22                                 ` Alex Williamson
2021-05-13 18:31                                   ` Shameerali Kolothum Thodi
2021-05-13 18:34                                   ` Jason Gunthorpe
2021-05-13 18:24                                 ` Jason Gunthorpe
2021-05-13 18:35                                   ` Shameerali Kolothum Thodi
2021-05-27 10:11                                   ` Shameerali Kolothum Thodi
2021-05-27 10:30                                     ` Max Gurtovoy
2021-04-13  3:36 ` [RFC PATCH 3/3] vfio/hisilicom: add debugfs for driver Longfang Liu
2021-04-15 21:35 ` [RFC PATCH 0/3] vfio/hisilicon: add acc live migration driver Alex Williamson
  -- strict thread matches above, loose matches on Subject: below --
2021-04-13  1:20 [RFC PATCH v3 " Longfang Liu
2021-04-13  1:20 ` [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio Longfang Liu
2021-04-12  8:53 [RFC PATCH v3 0/3] vfio/hisilicon: add acc live migration driver Longfang Liu
2021-04-12  8:53 ` [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio Longfang Liu
