kvm.vger.kernel.org archive mirror
* [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device
@ 2019-06-08 13:21 Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 1/9] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

This patchset adds a vfio-pci-like meta driver as a demo user of the
vfio changes introduced in the "vfio/mdev: IOMMU aware mediated device"
patchset from Baolu Lu. Besides serving as a test, per Alex's comments,
it could also be a good base driver for experimenting with
device-specific mdev migration.

Specific interface tested in this proposal:
 *) int mdev_set_iommu_device(struct device *dev,
 				struct device *iommu_device)
    introduced in the following patch:
    "[PATCH v5 6/8] vfio/mdev: Add iommu related member in mdev_device"

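As a rough illustration of what this interface is for, here is a small
userspace model, not kernel code. Every model_* name and type is a
hypothetical stand-in; only the two-argument shape (a device plus the
device that backs it for IOMMU purposes) is taken from the prototype
quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct device; only a name is modelled. */
struct model_device {
	const char *name;
};

/* Stand-in for a mediated device that remembers its backing device. */
struct model_mdev {
	struct model_device dev;
	struct model_device *iommu_device; /* NULL until a parent sets it */
};

/* Models the shape of the setter: record the IOMMU-backing device. */
int model_set_iommu_device(struct model_mdev *mdev,
			   struct model_device *iommu_device)
{
	assert(mdev != NULL);
	mdev->iommu_device = iommu_device;
	return 0;
}

/* Models a consumer asking which device, if any, backs this mdev. */
struct model_device *model_iommu_device(const struct model_mdev *mdev)
{
	assert(mdev != NULL);
	return mdev->iommu_device;
}
```

In this model a wrapper driver would call model_set_iommu_device() once
when the mediated device is created, passing the physical device it
wraps; before that call model_iommu_device() returns NULL.
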
Patch Overview:
 *) patch 1 ~ 7: refactor the existing vfio-pci module, moving the
                 common code from vfio_pci.c to vfio_pci_common.c
 *) patch 8: add protection to perm_bits alloc/free
 *) patch 9: add vfio-mdev-pci sample driver

Links:
 *) Link of "vfio/mdev: IOMMU aware mediated device"
         https://lwn.net/Articles/780522/
 *) Previous versions:
         RFC v1: https://lkml.org/lkml/2019/3/4/529
         RFC v2: https://lkml.org/lkml/2019/3/13/113
         RFC v3: https://lkml.org/lkml/2019/4/24/495
 *) The code can be tried from the repo below;
    the current version is on branch "v5.2-rc3-pci-mdev":
         https://github.com/luxis1999/vfio-mdev-pci-sample-driver.git

Please feel free to give your comments.

Thanks,
Yi Liu

Change log:
  RFC v3 -> patch v1:
  - split the patchset from 3 patches to 9 patches to better demonstrate
    the changes step by step

  v2->v3:
  - use vfio-mdev-pci instead of vfio-pci-mdev
  - place the new driver under drivers/vfio/pci while defining its
    Kconfig entry in samples/Kconfig to make clear it is a sample driver

  v1->v2:
  - instead of adding a kernel option to the existing vfio-pci
    module as in v1, v2 follows Alex's suggestion to add a
    separate vfio-pci-mdev module.
  - new patchset subject: "vfio/pci: wrap pci device as a mediated device"

Liu Yi L (9):
  vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header
  vfio_pci: refine user config reference in vfio-pci module
  vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  vfio_pci: make common functions be extern
  vfio_pci: duplicate vfio_pci.c
  vfio_pci: shrink vfio_pci_common.c
  vfio_pci: shrink vfio_pci.c
  vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op
  samples: add vfio-mdev-pci driver

 drivers/vfio/pci/Makefile           |    9 +-
 drivers/vfio/pci/vfio_mdev_pci.c    |  403 ++++++++++
 drivers/vfio/pci/vfio_pci.c         | 1449 +---------------------------------
 drivers/vfio/pci/vfio_pci_common.c  | 1458 +++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  |    9 +
 drivers/vfio/pci/vfio_pci_private.h |   36 +
 samples/Kconfig                     |   11 +
 7 files changed, 1933 insertions(+), 1442 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_mdev_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_common.c

-- 
2.7.4



* [PATCH v1 1/9] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 2/9] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

This patch moves vfio_pci_is_vga() and vfio_vga_disabled() into
vfio_pci_private.h to fix an always_inline build error, e.g.:

"error: inlining failed in call to always_inline ‘vfio_pci_is_vga’:
function body not available".
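
For background, a minimal userspace sketch of why a header is the safe
home for such helpers: a function marked always_inline must have its
body visible wherever it is called. The model_* names and the struct
are hypothetical stand-ins, and the attribute spelling assumes GCC or
Clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same value as the kernel's PCI_CLASS_DISPLAY_VGA (0x0300). */
#define MODEL_CLASS_DISPLAY_VGA 0x0300

/* Stand-in for struct pci_dev; holds the 24-bit class code only. */
struct model_pci_dev {
	unsigned int class_code;
};

/*
 * This definition would live in a shared header. Because the body
 * travels with the declaration, every file that includes the header
 * can inline the call.
 */
static inline __attribute__((always_inline)) bool
model_is_vga(const struct model_pci_dev *pdev)
{
	return (pdev->class_code >> 8) == MODEL_CLASS_DISPLAY_VGA;
}

/* A caller, as it might appear in any file including that header. */
bool model_has_vga(const struct model_pci_dev *pdev)
{
	assert(pdev != NULL);
	return model_is_vga(pdev);
}
```

If only a prototype of model_is_vga() were visible to the caller, the
compiler could not honour always_inline and would report an error like
the one quoted above.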

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 14 --------------
 drivers/vfio/pci/vfio_pci_private.h | 14 ++++++++++++++
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index cab71da..3841460 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -57,15 +57,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-static inline bool vfio_vga_disabled(void)
-{
-#ifdef CONFIG_VFIO_PCI_VGA
-	return disable_vga;
-#else
-	return true;
-#endif
-}
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -105,11 +96,6 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 	return decodes;
 }
 
-static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
-{
-	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-}
-
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
 	struct resource *res;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 1812cf2..60c03e6 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -133,6 +133,20 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
+{
+	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
+}
+
+static inline bool vfio_vga_disabled(void)
+{
+#ifdef CONFIG_VFIO_PCI_VGA
+	return disable_vga;
+#else
+	return true;
+#endif
+}
+
 extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
 
-- 
2.7.4



* [PATCH v1 2/9] vfio_pci: refine user config reference in vfio-pci module
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 1/9] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 3/9] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

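This patch adds three fields to struct vfio_pci_device (nointxmask,
disable_vga and disable_idle_d3) that carry the vfio-pci module
parameters, so that functions which may later become common code read
them from the device instead of from module-level variables.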
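
The pattern can be sketched in userspace as follows. This is a model
under stated assumptions, not the driver code: the model_* types and
functions are hypothetical, and only the three field names come from
this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the module parameters of one particular module. */
struct model_params {
	bool nointxmask;
	bool disable_vga;
	bool disable_idle_d3;
};

/* Stand-in for struct vfio_pci_device, reduced to the new fields. */
struct model_vdev {
	bool nointxmask;
	bool disable_vga;
	bool disable_idle_d3;
};

/* Probe time: snapshot the owning module's settings into the device. */
void model_probe(struct model_vdev *vdev, const struct model_params *params)
{
	assert(vdev != NULL && params != NULL);
	vdev->nointxmask = params->nointxmask;
	vdev->disable_vga = params->disable_vga;
	vdev->disable_idle_d3 = params->disable_idle_d3;
}

/* Shared helpers consult the device, never a module-level variable. */
bool model_vga_disabled(const struct model_vdev *vdev)
{
	return vdev->disable_vga;
}

bool model_idle_d3_allowed(const struct model_vdev *vdev)
{
	return !vdev->disable_idle_d3;
}
```

Because the helpers take the device as their only input, two modules
with different parameter values could share them unchanged.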

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 24 +++++++++++++++---------
 drivers/vfio/pci/vfio_pci_private.h |  9 +++++++--
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 3841460..2aa8a84 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -72,7 +72,8 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 	unsigned char max_busnr;
 	unsigned int decodes;
 
-	if (single_vga || !vfio_vga_disabled() || pci_is_root_bus(pdev->bus))
+	if (single_vga || !vfio_vga_disabled(vdev) ||
+		pci_is_root_bus(pdev->bus))
 		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
 		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
 
@@ -276,7 +277,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	if (!vdev->pci_saved_state)
 		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
 
-	if (likely(!nointxmask)) {
+	if (likely(!vdev->nointxmask)) {
 		if (vfio_pci_nointx(pdev)) {
 			pci_info(pdev, "Masking broken INTx support\n");
 			vdev->nointx = true;
@@ -313,7 +314,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	} else
 		vdev->msix_bar = 0xFF;
 
-	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
+	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
 
 
@@ -439,7 +440,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 
 	vfio_pci_try_bus_reset(vdev);
 
-	if (!disable_idle_d3)
+	if (!vdev->disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
@@ -1307,6 +1308,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	spin_lock_init(&vdev->irqlock);
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret) {
@@ -1331,7 +1337,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	vfio_pci_probe_power_state(vdev);
 
-	if (!disable_idle_d3) {
+	if (!vdev->disable_idle_d3) {
 		/*
 		 * pci-core sets the device power state to an unknown value at
 		 * bootup and after being removed from a driver.  The only
@@ -1362,7 +1368,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	kfree(vdev->region);
 	mutex_destroy(&vdev->ioeventfds_lock);
 
-	if (!disable_idle_d3)
+	if (!vdev->disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D0);
 
 	kfree(vdev->pm_save);
@@ -1597,7 +1603,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 		if (!ret) {
 			tmp->needs_reset = false;
 
-			if (tmp != vdev && !disable_idle_d3)
+			if (tmp != vdev && !tmp->disable_idle_d3)
 				vfio_pci_set_power_state(tmp, PCI_D3hot);
 		}
 
@@ -1613,7 +1619,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(void)
+static void __init vfio_pci_fill_ids(char *ids)
 {
 	char *p, *id;
 	int rc;
@@ -1668,7 +1674,7 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		goto out_driver;
 
-	vfio_pci_fill_ids();
+	vfio_pci_fill_ids(&ids[0]);
 
 	return 0;
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 60c03e6..b53fe34 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -125,6 +125,11 @@ struct vfio_pci_device {
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
+	bool			nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	bool			disable_vga;
+#endif
+	bool			disable_idle_d3;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -138,10 +143,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
-static inline bool vfio_vga_disabled(void)
+static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
-	return disable_vga;
+	return vdev->disable_vga;
 #else
 	return true;
 #endif
-- 
2.7.4



* [PATCH v1 3/9] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 1/9] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 2/9] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 4/9] vfio_pci: make common functions be extern Liu Yi L
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

This patch replaces the vfio_pci_driver references in vfio_pci.c with
pci_dev_driver(vdev->pdev), so that these functions no longer depend on
one specific module's pci_driver and can be shared across modules.
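
A minimal userspace sketch of the comparison, assuming stand-in types:
the model_* names are hypothetical and only mirror the idea of
comparing a candidate device's driver against that of a reference
device rather than against a fixed driver symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct pci_driver. */
struct model_driver {
	const char *name;
};

/* Stand-in for struct pci_dev, reduced to its bound driver. */
struct model_pci_dev {
	struct model_driver *driver;
};

/* Models a lookup of the driver bound to a device. */
static struct model_driver *model_dev_driver(const struct model_pci_dev *pdev)
{
	assert(pdev != NULL);
	return pdev->driver;
}

/*
 * True when the candidate is bound to the same driver as the reference
 * device. No driver symbol is named here, so the check works for
 * whichever module owns the reference device.
 */
bool model_same_driver(const struct model_pci_dev *candidate,
		       const struct model_pci_dev *ref)
{
	return model_dev_driver(candidate) == model_dev_driver(ref);
}
```

The callbacks in this patch follow the same idea by carrying the
reference device (vdev) through their data argument.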

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 2aa8a84..af28e4c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1445,24 +1445,25 @@ static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
 
 static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 {
-	struct vfio_pci_reflck **preflck = data;
+	struct vfio_pci_device *vdev = data;
+	struct vfio_pci_reflck **preflck = &vdev->reflck;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_device *tmp;
 
 	device = vfio_device_get_from_dev(&pdev->dev);
 	if (!device)
 		return 0;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
 		vfio_device_put(device);
 		return 0;
 	}
 
-	vdev = vfio_device_data(device);
+	tmp = vfio_device_data(device);
 
-	if (vdev->reflck) {
-		vfio_pci_reflck_get(vdev->reflck);
-		*preflck = vdev->reflck;
+	if (tmp->reflck) {
+		vfio_pci_reflck_get(tmp->reflck);
+		*preflck = tmp->reflck;
 		vfio_device_put(device);
 		return 1;
 	}
@@ -1479,7 +1480,7 @@ static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 
 	if (pci_is_root_bus(vdev->pdev->bus) ||
 	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
-					  &vdev->reflck, slot) <= 0)
+					  vdev, slot) <= 0)
 		vdev->reflck = vfio_pci_reflck_alloc();
 
 	mutex_unlock(&reflck_lock);
@@ -1504,6 +1505,7 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 
 struct vfio_devices {
 	struct vfio_device **devices;
+	struct vfio_pci_device *vdev;
 	int cur_index;
 	int max_index;
 };
@@ -1512,7 +1514,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_device *tmp;
 
 	if (devs->cur_index == devs->max_index)
 		return -ENOSPC;
@@ -1521,15 +1523,15 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	if (!device)
 		return -EINVAL;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
-	vdev = vfio_device_data(device);
+	tmp = vfio_device_data(device);
 
 	/* Fault if the device is not unused */
-	if (vdev->refcnt) {
+	if (tmp->refcnt) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
@@ -1575,6 +1577,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	if (!devs.devices)
 		return;
 
+	devs.vdev = vdev;
 	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
 					  vfio_pci_get_unused_devs,
 					  &devs, slot))
@@ -1619,7 +1622,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(char *ids)
+static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
 	int rc;
@@ -1647,7 +1650,7 @@ static void __init vfio_pci_fill_ids(char *ids)
 			continue;
 		}
 
-		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
+		rc = pci_add_dynid(driver, vendor, device,
 				   subvendor, subdevice, class, class_mask, 0);
 		if (rc)
 			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
@@ -1674,7 +1677,7 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		goto out_driver;
 
-	vfio_pci_fill_ids(&ids[0]);
+	vfio_pci_fill_ids(&ids[0], &vfio_pci_driver);
 
 	return 0;
 
-- 
2.7.4



* [PATCH v1 4/9] vfio_pci: make common functions be extern
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (2 preceding siblings ...)
  2019-06-08 13:21 ` [PATCH v1 3/9] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 5/9] vfio_pci: duplicate vfio_pci.c Liu Yi L
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

This patch makes the common (module-agnostic) functions in vfio_pci.c
extern so that they can be moved to a common source file. The affected
functions are:
  *) vfio_pci_set_vga_decode
  *) vfio_pci_enable
  *) vfio_pci_disable
  *) vfio_pci_ioctl
  *) vfio_pci_read
  *) vfio_pci_write
  *) vfio_pci_mmap
  *) vfio_pci_request
  *) vfio_pci_fill_ids
  *) vfio_pci_reflck_attach
  *) vfio_pci_reflck_put
  *) vfio_pci_probe_power_state

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 30 +++++++++++++-----------------
 drivers/vfio/pci/vfio_pci_private.h | 15 +++++++++++++++
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index af28e4c..4da653e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -65,7 +65,7 @@ MODULE_PARM_DESC(disable_idle_d3,
  * has no way to get to it and routing can be disabled externally at the
  * bridge.
  */
-static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
+unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 {
 	struct vfio_pci_device *vdev = opaque;
 	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
@@ -166,7 +166,6 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 }
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
-static void vfio_pci_disable(struct vfio_pci_device *vdev);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -197,7 +196,7 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
-static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u16 pmcsr;
@@ -248,7 +247,7 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
 	return ret;
 }
 
-static int vfio_pci_enable(struct vfio_pci_device *vdev)
+int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int ret;
@@ -355,7 +354,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	return ret;
 }
 
-static void vfio_pci_disable(struct vfio_pci_device *vdev)
+void vfio_pci_disable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
@@ -669,8 +668,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static long vfio_pci_ioctl(void *device_data,
-			   unsigned int cmd, unsigned long arg)
+long vfio_pci_ioctl(void *device_data,
+		   unsigned int cmd, unsigned long arg)
 {
 	struct vfio_pci_device *vdev = device_data;
 	unsigned long minsz;
@@ -1155,7 +1154,7 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 	return -EINVAL;
 }
 
-static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+ssize_t vfio_pci_read(void *device_data, char __user *buf,
 			     size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1164,7 +1163,7 @@ static ssize_t vfio_pci_read(void *device_data, char __user *buf,
 	return vfio_pci_rw(device_data, buf, count, ppos, false);
 }
 
-static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 			      size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1173,7 +1172,7 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
-static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1235,7 +1234,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 			       req_len, vma->vm_page_prot);
 }
 
-static void vfio_pci_request(void *device_data, unsigned int count)
+void vfio_pci_request(void *device_data, unsigned int count)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1267,9 +1266,6 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.request	= vfio_pci_request,
 };
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
-static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
-
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
@@ -1472,7 +1468,7 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 {
 	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
 
@@ -1498,7 +1494,7 @@ static void vfio_pci_reflck_release(struct kref *kref)
 	mutex_unlock(&reflck_lock);
 }
 
-static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 {
 	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
 }
@@ -1622,7 +1618,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
+void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
 	int rc;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index b53fe34..7b99881 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -185,6 +185,21 @@ extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 
 extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
 				    pci_power_t state);
+extern unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga);
+extern int vfio_pci_enable(struct vfio_pci_device *vdev);
+extern void vfio_pci_disable(struct vfio_pci_device *vdev);
+extern long vfio_pci_ioctl(void *device_data,
+			unsigned int cmd, unsigned long arg);
+extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			size_t count, loff_t *ppos);
+extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
+extern void vfio_pci_request(void *device_data, unsigned int count);
+extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
+extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
 
 #ifdef CONFIG_VFIO_PCI_IGD
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
-- 
2.7.4



* [PATCH v1 5/9] vfio_pci: duplicate vfio_pci.c
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (3 preceding siblings ...)
  2019-06-08 13:21 ` [PATCH v1 4/9] vfio_pci: make common functions be extern Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 6/9] vfio_pci: shrink vfio_pci_common.c Liu Yi L
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

This patch makes no code change; it only copies vfio_pci.c to
vfio_pci_common.c. In the following patches, vfio_pci_common.c will be
reduced to the common functions and their related static functions from
the original vfio_pci.c, while vfio_pci.c will be reduced to the code
specific to the vfio-pci module.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_common.c | 1691 ++++++++++++++++++++++++++++++++++++
 1 file changed, 1691 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_common.c

diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
new file mode 100644
index 0000000..4da653e
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -0,0 +1,1691 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#define dev_fmt pr_fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool nointxmask;
+module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nointxmask,
+		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
+
+#ifdef CONFIG_VFIO_PCI_VGA
+static bool disable_vga;
+module_param(disable_vga, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-pci");
+#endif
+
+static bool disable_idle_d3;
+module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_idle_d3,
+		 "Disable using the PCI D3 low power state for idle, unused devices");
+
+/*
+ * Our VGA arbiter participation is limited since we don't know anything
+ * about the device itself.  However, if the device is the only VGA device
+ * downstream of a bridge and VFIO VGA support is disabled, then we can
+ * safely return legacy VGA IO and memory as not decoded since the user
+ * has no way to get to it and routing can be disabled externally at the
+ * bridge.
+ */
+unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
+{
+	struct vfio_pci_device *vdev = opaque;
+	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
+	unsigned char max_busnr;
+	unsigned int decodes;
+
+	if (single_vga || !vfio_vga_disabled(vdev) ||
+		pci_is_root_bus(pdev->bus))
+		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
+
+	max_busnr = pci_bus_max_busnr(pdev->bus);
+	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
+
+	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
+		if (tmp == pdev ||
+		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
+		    pci_is_root_bus(tmp->bus))
+			continue;
+
+		if (tmp->bus->number >= pdev->bus->number &&
+		    tmp->bus->number <= max_busnr) {
+			pci_dev_put(tmp);
+			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
+			break;
+		}
+	}
+
+	return decodes;
+}
+
+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case that hot-add
+			 * device's bar is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.name = "vfio sub-page reserved";
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+						&dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}
+			dummy_res->index = bar;
+			list_add(&dummy_res->res_next,
+					&vdev->dummy_resources_list);
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+		/*
+		 * Here we don't handle the case when the BAR is not page
+		 * aligned because we can't expect the BAR will be
+		 * assigned into the same location in a page in guest
+		 * when we passthrough the BAR. And it's hard to access
+		 * this BAR in userspace because we have no way to get
+		 * the BAR's location in a page.
+		 */
+no_mmap:
+		vdev->bar_mmap_supported[bar] = false;
+	}
+}
+
+static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
+
+/*
+ * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
+ * _and_ the ability detect when the device is asserting INTx via PCI_STATUS.
+ * If a device implements the former but not the latter we would typically
+ * expect broken_intx_masking be set and require an exclusive interrupt.
+ * However since we do have control of the device's ability to assert INTx,
+ * we can instead pretend that the device does not implement INTx, virtualizing
+ * the pin register to report zero and maintaining DisINTx set on the host.
+ */
+static bool vfio_pci_nointx(struct pci_dev *pdev)
+{
+	switch (pdev->vendor) {
+	case PCI_VENDOR_ID_INTEL:
+		switch (pdev->device) {
+		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
+		case 0x1572:
+		case 0x1574:
+		case 0x1580 ... 0x1581:
+		case 0x1583 ... 0x158b:
+		case 0x37d0 ... 0x37d2:
+			return true;
+		default:
+			return false;
+		}
+	}
+
+	return false;
+}
+
+void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 pmcsr;
+
+	if (!pdev->pm_cap)
+		return;
+
+	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
+
+	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
+}
+
+/*
+ * pci_set_power_state() wrapper handling devices which perform a soft reset on
+ * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
+ * restore when returned to D0.  Saved separately from pci_saved_state for use
+ * by PM capability emulation and separately from pci_dev internal saved state
+ * to avoid it being overwritten and consumed around other resets.
+ */
+int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	bool needs_restore = false, needs_save = false;
+	int ret;
+
+	if (vdev->needs_pm_restore) {
+		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
+			pci_save_state(pdev);
+			needs_save = true;
+		}
+
+		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
+			needs_restore = true;
+	}
+
+	ret = pci_set_power_state(pdev, state);
+
+	if (!ret) {
+		/* D3 might be unsupported via quirk, skip unless in D3 */
+		if (needs_save && pdev->current_state >= PCI_D3hot) {
+			vdev->pm_save = pci_store_saved_state(pdev);
+		} else if (needs_restore) {
+			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
+			pci_restore_state(pdev);
+		}
+	}
+
+	return ret;
+}
+
+int vfio_pci_enable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+	u16 cmd;
+	u8 msix_pos;
+
+	vfio_pci_set_power_state(vdev, PCI_D0);
+
+	/* Don't allow our initial saved state to include busmaster */
+	pci_clear_master(pdev);
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		return ret;
+
+	/* If reset fails because of the device lock, fail this path entirely */
+	ret = pci_try_reset_function(pdev);
+	if (ret == -EAGAIN) {
+		pci_disable_device(pdev);
+		return ret;
+	}
+
+	vdev->reset_works = !ret;
+	pci_save_state(pdev);
+	vdev->pci_saved_state = pci_store_saved_state(pdev);
+	if (!vdev->pci_saved_state)
+		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
+
+	if (likely(!vdev->nointxmask)) {
+		if (vfio_pci_nointx(pdev)) {
+			pci_info(pdev, "Masking broken INTx support\n");
+			vdev->nointx = true;
+			pci_intx(pdev, 0);
+		} else
+			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+	}
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
+		cmd &= ~PCI_COMMAND_INTX_DISABLE;
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+	}
+
+	ret = vfio_config_init(vdev);
+	if (ret) {
+		kfree(vdev->pci_saved_state);
+		vdev->pci_saved_state = NULL;
+		pci_disable_device(pdev);
+		return ret;
+	}
+
+	msix_pos = pdev->msix_cap;
+	if (msix_pos) {
+		u16 flags;
+		u32 table;
+
+		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
+		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
+
+		vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
+		vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
+		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
+	} else
+		vdev->msix_bar = 0xFF;
+
+	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
+		vdev->has_vga = true;
+
+
+	if (vfio_pci_is_vga(pdev) &&
+	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
+		ret = vfio_pci_igd_init(vdev);
+		if (ret) {
+			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_ibm_npu2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
+			goto disable_exit;
+		}
+	}
+
+	vfio_pci_probe_mmaps(vdev);
+
+	return 0;
+
+disable_exit:
+	vfio_pci_disable(vdev);
+	return ret;
+}
+
+void vfio_pci_disable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_dummy_resource *dummy_res, *tmp;
+	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
+	int i, bar;
+
+	/* Stop the device from further DMA */
+	pci_clear_master(pdev);
+
+	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
+				VFIO_IRQ_SET_ACTION_TRIGGER,
+				vdev->irq_type, 0, 0, NULL);
+
+	/* Device closed, don't need mutex here */
+	list_for_each_entry_safe(ioeventfd, ioeventfd_tmp,
+				 &vdev->ioeventfds_list, next) {
+		vfio_virqfd_disable(&ioeventfd->virqfd);
+		list_del(&ioeventfd->next);
+		kfree(ioeventfd);
+	}
+	vdev->ioeventfds_nr = 0;
+
+	vdev->virq_disabled = false;
+
+	for (i = 0; i < vdev->num_regions; i++)
+		vdev->region[i].ops->release(vdev, &vdev->region[i]);
+
+	vdev->num_regions = 0;
+	kfree(vdev->region);
+	vdev->region = NULL; /* don't krealloc a freed pointer */
+
+	vfio_config_free(vdev);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		if (!vdev->barmap[bar])
+			continue;
+		pci_iounmap(pdev, vdev->barmap[bar]);
+		pci_release_selected_regions(pdev, 1 << bar);
+		vdev->barmap[bar] = NULL;
+	}
+
+	list_for_each_entry_safe(dummy_res, tmp,
+				 &vdev->dummy_resources_list, res_next) {
+		list_del(&dummy_res->res_next);
+		release_resource(&dummy_res->resource);
+		kfree(dummy_res);
+	}
+
+	vdev->needs_reset = true;
+
+	/*
+	 * If we have saved state, restore it.  If we can reset the device,
+	 * even better.  Resetting with current state seems better than
+	 * nothing, but saving and restoring current state without reset
+	 * is just busy work.
+	 */
+	if (pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)) {
+		pci_info(pdev, "%s: Couldn't reload saved state\n", __func__);
+
+		if (!vdev->reset_works)
+			goto out;
+
+		pci_save_state(pdev);
+	}
+
+	/*
+	 * Disable INTx and MSI, presumably to avoid spurious interrupts
+	 * during reset.  Stolen from pci_reset_function()
+	 */
+	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+
+	/*
+	 * Try to reset the device.  The success of this is dependent on
+	 * being able to lock the device, which is not always possible.
+	 */
+	if (vdev->reset_works && !pci_try_reset_function(pdev))
+		vdev->needs_reset = false;
+
+	pci_restore_state(pdev);
+out:
+	pci_disable_device(pdev);
+
+	vfio_pci_try_bus_reset(vdev);
+
+	if (!vdev->disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+}
+
+static void vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!(--vdev->refcnt)) {
+		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_pci_disable(vdev);
+	}
+
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_spapr_pci_eeh_open(vdev->pdev);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (ret)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+{
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
+		    vdev->nointx || vdev->pdev->is_virtfn)
+			return 0;
+
+		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+
+		return pin ? 1 : 0;
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = vdev->pdev->msi_cap;
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSI_FLAGS, &flags);
+			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = vdev->pdev->msix_cap;
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSIX_FLAGS, &flags);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
+		if (pci_is_pcie(vdev->pdev))
+			return 1;
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+		return 1;
+	}
+
+	return 0;
+}
+
+static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
+{
+	(*(int *)data)++;
+	return 0;
+}
+
+struct vfio_pci_fill_info {
+	int max;
+	int cur;
+	struct vfio_pci_dependent_device *devices;
+};
+
+static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_fill_info *fill = data;
+	struct iommu_group *iommu_group;
+
+	if (fill->cur == fill->max)
+		return -EAGAIN; /* Something changed, try again */
+
+	iommu_group = iommu_group_get(&pdev->dev);
+	if (!iommu_group)
+		return -EPERM; /* Cannot reset non-isolated devices */
+
+	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
+	fill->devices[fill->cur].bus = pdev->bus->number;
+	fill->devices[fill->cur].devfn = pdev->devfn;
+	fill->cur++;
+	iommu_group_put(iommu_group);
+	return 0;
+}
+
+struct vfio_pci_group_entry {
+	struct vfio_group *group;
+	int id;
+};
+
+struct vfio_pci_group_info {
+	int count;
+	struct vfio_pci_group_entry *groups;
+};
+
+static int vfio_pci_validate_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_group_info *info = data;
+	struct iommu_group *group;
+	int id, i;
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EPERM;
+
+	id = iommu_group_id(group);
+
+	for (i = 0; i < info->count; i++)
+		if (info->groups[i].id == id)
+			break;
+
+	iommu_group_put(group);
+
+	return (i == info->count) ? -EINVAL : 0;
+}
+
+static bool vfio_pci_dev_below_slot(struct pci_dev *pdev, struct pci_slot *slot)
+{
+	for (; pdev; pdev = pdev->bus->self)
+		if (pdev->bus == slot->bus)
+			return (pdev->slot == slot);
+	return false;
+}
+
+struct vfio_pci_walk_info {
+	int (*fn)(struct pci_dev *, void *data);
+	void *data;
+	struct pci_dev *pdev;
+	bool slot;
+	int ret;
+};
+
+static int vfio_pci_walk_wrapper(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_walk_info *walk = data;
+
+	if (!walk->slot || vfio_pci_dev_below_slot(pdev, walk->pdev->slot))
+		walk->ret = walk->fn(pdev, walk->data);
+
+	return walk->ret;
+}
+
+static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
+					 int (*fn)(struct pci_dev *,
+						   void *data), void *data,
+					 bool slot)
+{
+	struct vfio_pci_walk_info walk = {
+		.fn = fn, .data = data, .pdev = pdev, .slot = slot, .ret = 0,
+	};
+
+	pci_walk_bus(pdev->bus, vfio_pci_walk_wrapper, &walk);
+
+	return walk.ret;
+}
+
+static int msix_mmappable_cap(struct vfio_pci_device *vdev,
+			      struct vfio_info_cap *caps)
+{
+	struct vfio_info_cap_header header = {
+		.id = VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
+		.version = 1
+	};
+
+	return vfio_info_add_capability(caps, &header, sizeof(header));
+}
+
+int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+				 unsigned int type, unsigned int subtype,
+				 const struct vfio_pci_regops *ops,
+				 size_t size, u32 flags, void *data)
+{
+	struct vfio_pci_region *region;
+
+	region = krealloc(vdev->region,
+			  (vdev->num_regions + 1) * sizeof(*region),
+			  GFP_KERNEL);
+	if (!region)
+		return -ENOMEM;
+
+	vdev->region = region;
+	vdev->region[vdev->num_regions].type = type;
+	vdev->region[vdev->num_regions].subtype = subtype;
+	vdev->region[vdev->num_regions].ops = ops;
+	vdev->region[vdev->num_regions].size = size;
+	vdev->region[vdev->num_regions].flags = flags;
+	vdev->region[vdev->num_regions].data = data;
+
+	vdev->num_regions++;
+
+	return 0;
+}
+
+long vfio_pci_ioctl(void *device_data,
+		   unsigned int cmd, unsigned long arg)
+{
+	struct vfio_pci_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (vdev->reset_works)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct pci_dev *pdev = vdev->pdev;
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		int i, ret;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = pdev->cfg_size;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = pci_resource_len(pdev, info.index);
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			if (vdev->bar_mmap_supported[info.index]) {
+				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+				if (info.index == vdev->msix_bar) {
+					ret = msix_mmappable_cap(vdev, &caps);
+					if (ret)
+						return ret;
+				}
+			}
+
+			break;
+		case VFIO_PCI_ROM_REGION_INDEX:
+		{
+			void __iomem *io;
+			size_t size;
+			u16 orig_cmd;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.flags = 0;
+
+			/* Report the BAR size, not the ROM size */
+			info.size = pci_resource_len(pdev, info.index);
+			if (!info.size) {
+				/* Shadow ROMs appear as PCI option ROMs */
+				if (pdev->resource[PCI_ROM_RESOURCE].flags &
+							IORESOURCE_ROM_SHADOW)
+					info.size = 0x20000;
+				else
+					break;
+			}
+
+			/*
+			 * Is it really there?  Enable memory decode for
+			 * implicit access in pci_map_rom().
+			 */
+			pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
+			pci_write_config_word(pdev, PCI_COMMAND,
+					      orig_cmd | PCI_COMMAND_MEMORY);
+
+			io = pci_map_rom(pdev, &size);
+			if (io) {
+				info.flags = VFIO_REGION_INFO_FLAG_READ;
+				pci_unmap_rom(pdev, io);
+			} else {
+				info.size = 0;
+			}
+
+			pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
+			break;
+		}
+		case VFIO_PCI_VGA_REGION_INDEX:
+			if (!vdev->has_vga)
+				return -EINVAL;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0xc0000;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+
+			break;
+		default:
+		{
+			struct vfio_region_info_cap_type cap_type = {
+					.header.id = VFIO_REGION_INFO_CAP_TYPE,
+					.header.version = 1 };
+
+			if (info.index >=
+			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
+				return -EINVAL;
+			info.index = array_index_nospec(info.index,
+							VFIO_PCI_NUM_REGIONS +
+							vdev->num_regions);
+
+			i = info.index - VFIO_PCI_NUM_REGIONS;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->region[i].size;
+			info.flags = vdev->region[i].flags;
+
+			cap_type.type = vdev->region[i].type;
+			cap_type.subtype = vdev->region[i].subtype;
+
+			ret = vfio_info_add_capability(&caps, &cap_type.header,
+						       sizeof(cap_type));
+			if (ret)
+				return ret;
+
+			if (vdev->region[i].ops->add_capability) {
+				ret = vdev->region[i].ops->add_capability(vdev,
+						&vdev->region[i], &caps);
+				if (ret)
+					return ret;
+			}
+		}
+		}
+
+		if (caps.size) {
+			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+			if (info.argsz < sizeof(info) + caps.size) {
+				info.argsz = sizeof(info) + caps.size;
+				info.cap_offset = 0;
+			} else {
+				vfio_info_cap_shift(&caps, sizeof(info));
+				if (copy_to_user((void __user *)arg +
+						  sizeof(info), caps.buf,
+						  caps.size)) {
+					kfree(caps.buf);
+					return -EFAULT;
+				}
+				info.cap_offset = sizeof(info);
+			}
+
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+		case VFIO_PCI_ERR_IRQ_INDEX:
+			if (pci_is_pcie(vdev->pdev))
+				break;
+		/* fall through */
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+		info.count = vfio_pci_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+				       VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		int max, ret = 0;
+		size_t data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		max = vfio_pci_get_irq_count(vdev, hdr.index);
+
+		ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
+						 VFIO_PCI_NUM_IRQS, &data_size);
+		if (ret)
+			return ret;
+
+		if (data_size) {
+			data = memdup_user((void __user *)(arg + minsz),
+					    data_size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		mutex_lock(&vdev->igate);
+
+		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+					      hdr.start, hdr.count, data);
+
+		mutex_unlock(&vdev->igate);
+		kfree(data);
+
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_RESET) {
+		return vdev->reset_works ?
+			pci_try_reset_function(vdev->pdev) : -EINVAL;
+
+	} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
+		struct vfio_pci_hot_reset_info hdr;
+		struct vfio_pci_fill_info fill = { 0 };
+		struct vfio_pci_dependent_device *devices = NULL;
+		bool slot = false;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_pci_hot_reset_info, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz)
+			return -EINVAL;
+
+		hdr.flags = 0;
+
+		/* Can we do a slot or bus reset or neither? */
+		if (!pci_probe_reset_slot(vdev->pdev->slot))
+			slot = true;
+		else if (pci_probe_reset_bus(vdev->pdev->bus))
+			return -ENODEV;
+
+		/* How many devices are affected? */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_count_devs,
+						    &fill.max, slot);
+		if (ret)
+			return ret;
+
+		WARN_ON(!fill.max); /* Should always be at least one */
+
+		/*
+		 * If there's enough space, fill it now, otherwise return
+		 * -ENOSPC and the number of devices affected.
+		 */
+		if (hdr.argsz < sizeof(hdr) + (fill.max * sizeof(*devices))) {
+			ret = -ENOSPC;
+			hdr.count = fill.max;
+			goto reset_info_exit;
+		}
+
+		devices = kcalloc(fill.max, sizeof(*devices), GFP_KERNEL);
+		if (!devices)
+			return -ENOMEM;
+
+		fill.devices = devices;
+
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_fill_devs,
+						    &fill, slot);
+
+		/*
+		 * If a device was removed between counting and filling,
+		 * we may come up short of fill.max.  If a device was
+		 * added, we'll have a return of -EAGAIN above.
+		 */
+		if (!ret)
+			hdr.count = fill.cur;
+
+reset_info_exit:
+		if (copy_to_user((void __user *)arg, &hdr, minsz))
+			ret = -EFAULT;
+
+		if (!ret) {
+			if (copy_to_user((void __user *)(arg + minsz), devices,
+					 hdr.count * sizeof(*devices)))
+				ret = -EFAULT;
+		}
+
+		kfree(devices);
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_PCI_HOT_RESET) {
+		struct vfio_pci_hot_reset hdr;
+		int32_t *group_fds;
+		struct vfio_pci_group_entry *groups;
+		struct vfio_pci_group_info info;
+		bool slot = false;
+		int i, count = 0, ret = 0;
+
+		minsz = offsetofend(struct vfio_pci_hot_reset, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.flags)
+			return -EINVAL;
+
+		/* Can we do a slot or bus reset or neither? */
+		if (!pci_probe_reset_slot(vdev->pdev->slot))
+			slot = true;
+		else if (pci_probe_reset_bus(vdev->pdev->bus))
+			return -ENODEV;
+
+		/*
+		 * We can't let userspace give us an arbitrarily large
+		 * buffer to copy, so verify how many we think there
+		 * could be.  Note groups can have multiple devices so
+		 * one group per device is the max.
+		 */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_count_devs,
+						    &count, slot);
+		if (ret)
+			return ret;
+
+		/* Somewhere between 1 and count is OK */
+		if (!hdr.count || hdr.count > count)
+			return -EINVAL;
+
+		group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
+		groups = kcalloc(hdr.count, sizeof(*groups), GFP_KERNEL);
+		if (!group_fds || !groups) {
+			kfree(group_fds);
+			kfree(groups);
+			return -ENOMEM;
+		}
+
+		if (copy_from_user(group_fds, (void __user *)(arg + minsz),
+				   hdr.count * sizeof(*group_fds))) {
+			kfree(group_fds);
+			kfree(groups);
+			return -EFAULT;
+		}
+
+		/*
+		 * For each group_fd, get the group through the vfio external
+		 * user interface and store the group and iommu ID.  This
+		 * ensures the group is held across the reset.
+		 */
+		for (i = 0; i < hdr.count; i++) {
+			struct vfio_group *group;
+			struct fd f = fdget(group_fds[i]);
+			if (!f.file) {
+				ret = -EBADF;
+				break;
+			}
+
+			group = vfio_group_get_external_user(f.file);
+			fdput(f);
+			if (IS_ERR(group)) {
+				ret = PTR_ERR(group);
+				break;
+			}
+
+			groups[i].group = group;
+			groups[i].id = vfio_external_user_iommu_id(group);
+		}
+
+		kfree(group_fds);
+
+		/* release reference to groups on error */
+		if (ret)
+			goto hot_reset_release;
+
+		info.count = hdr.count;
+		info.groups = groups;
+
+		/*
+		 * Test whether all the affected devices are contained
+		 * by the set of groups provided by the user.
+		 */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_validate_devs,
+						    &info, slot);
+		if (!ret)
+			/* User has access, do the reset */
+			ret = pci_reset_bus(vdev->pdev);
+
+hot_reset_release:
+		for (i--; i >= 0; i--)
+			vfio_group_put_external_user(groups[i].group);
+
+		kfree(groups);
+		return ret;
+	} else if (cmd == VFIO_DEVICE_IOEVENTFD) {
+		struct vfio_device_ioeventfd ioeventfd;
+		int count;
+
+		minsz = offsetofend(struct vfio_device_ioeventfd, fd);
+
+		if (copy_from_user(&ioeventfd, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ioeventfd.argsz < minsz)
+			return -EINVAL;
+
+		if (ioeventfd.flags & ~VFIO_DEVICE_IOEVENTFD_SIZE_MASK)
+			return -EINVAL;
+
+		count = ioeventfd.flags & VFIO_DEVICE_IOEVENTFD_SIZE_MASK;
+
+		if (hweight8(count) != 1 || ioeventfd.fd < -1)
+			return -EINVAL;
+
+		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
+					  ioeventfd.data, count, ioeventfd.fd);
+	}
+
+	return -ENOTTY;
+}
+
+static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+		if (iswrite)
+			return -EINVAL;
+		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_VGA_REGION_INDEX:
+		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+	default:
+		index -= VFIO_PCI_NUM_REGIONS;
+		return vdev->region[index].ops->rw(vdev, buf,
+						   count, ppos, iswrite);
+	}
+
+	return -EINVAL;
+}
+
+ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	if (!count)
+		return 0;
+
+	return vfio_pci_rw(device_data, buf, count, ppos, false);
+}
+
+ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	if (!count)
+		return 0;
+
+	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
+}
+
+int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index;
+	u64 phys_len, req_len, pgoff, req_start;
+	int ret;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		int regnum = index - VFIO_PCI_NUM_REGIONS;
+		struct vfio_pci_region *region = vdev->region + regnum;
+
+		if (region && region->ops && region->ops->mmap &&
+		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+			return region->ops->mmap(vdev, region, vma);
+		return -EINVAL;
+	}
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	if (!vdev->bar_mmap_supported[index])
+		return -EINVAL;
+
+	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (req_start + req_len > phys_len)
+		return -EINVAL;
+
+	/*
+	 * Even though we don't make use of the barmap for the mmap,
+	 * we need to request the region and the barmap tracks that.
+	 */
+	if (!vdev->barmap[index]) {
+		ret = pci_request_selected_regions(pdev,
+						   1 << index, "vfio-pci");
+		if (ret)
+			return ret;
+
+		vdev->barmap[index] = pci_iomap(pdev, index, 0);
+		if (!vdev->barmap[index]) {
+			pci_release_selected_regions(pdev, 1 << index);
+			return -ENOMEM;
+		}
+	}
+
+	vma->vm_private_data = vdev;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			       req_len, vma->vm_page_prot);
+}
+
+void vfio_pci_request(void *device_data, unsigned int count)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	mutex_lock(&vdev->igate);
+
+	if (vdev->req_trigger) {
+		if (!(count % 10))
+			pci_notice_ratelimited(pdev,
+				"Relaying device request to user (#%u)\n",
+				count);
+		eventfd_signal(vdev->req_trigger, 1);
+	} else if (count == 0) {
+		pci_warn(pdev,
+			"No device request channel registered, blocked until released by user\n");
+	}
+
+	mutex_unlock(&vdev->igate);
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_open,
+	.release	= vfio_pci_release,
+	.ioctl		= vfio_pci_ioctl,
+	.read		= vfio_pci_read,
+	.write		= vfio_pci_write,
+	.mmap		= vfio_pci_mmap,
+	.request	= vfio_pci_request,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_device *vdev;
+	struct iommu_group *group;
+	int ret;
+
+	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	/*
+	 * Prevent binding to PFs with VFs enabled, this too easily allows
+	 * userspace instances with VFs and PFs from the same device, which
+	 * cannot work.  Disabling SR-IOV here would initiate removing the
+	 * VFs, which would unbind the driver, which is prone to blocking
+	 * if that VF is also in use by vfio-pci.  Just reject these PFs
+	 * and let the user sort it out.
+	 */
+	if (pci_num_vf(pdev)) {
+		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
+		return -EBUSY;
+	}
+
+	group = vfio_iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		vfio_iommu_group_put(group, &pdev->dev);
+		return -ENOMEM;
+	}
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	mutex_init(&vdev->ioeventfds_lock);
+	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
+
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (ret) {
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		vfio_del_group_dev(&pdev->dev);
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
+		vga_set_legacy_decoding(pdev,
+					vfio_pci_set_vga_decode(vdev, false));
+	}
+
+	vfio_pci_probe_power_state(vdev);
+
+	if (!vdev->disable_idle_d3) {
+		/*
+		 * pci-core sets the device power state to an unknown value at
+		 * bootup and after being removed from a driver.  The only
+		 * transition it allows from this unknown state is to D0, which
+		 * typically happens when a driver calls pci_enable_device().
+		 * We're not ready to enable the device yet, but we do want to
+		 * be able to get to D3.  Therefore first do a D0 transition
+		 * before going to D3.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	}
+
+	return ret;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = vfio_del_group_dev(&pdev->dev);
+	if (!vdev)
+		return;
+
+	vfio_pci_reflck_put(vdev->reflck);
+
+	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
+	kfree(vdev->region);
+	mutex_destroy(&vdev->ioeventfds_lock);
+
+	if (!vdev->disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	kfree(vdev->pm_save);
+	kfree(vdev);
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, NULL, NULL, NULL);
+		vga_set_legacy_decoding(pdev,
+				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
+	}
+}
+
+static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
+						  pci_channel_state_t state)
+{
+	struct vfio_pci_device *vdev;
+	struct vfio_device *device;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (device == NULL)
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	vdev = vfio_device_data(device);
+	if (vdev == NULL) {
+		vfio_device_put(device);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	mutex_lock(&vdev->igate);
+
+	if (vdev->err_trigger)
+		eventfd_signal(vdev->err_trigger, 1);
+
+	mutex_unlock(&vdev->igate);
+
+	vfio_device_put(device);
+
+	return PCI_ERS_RESULT_CAN_RECOVER;
+}
+
+static const struct pci_error_handlers vfio_err_handlers = {
+	.error_detected = vfio_pci_aer_err_detected,
+};
+
+static struct pci_driver vfio_pci_driver = {
+	.name		= "vfio-pci",
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_pci_probe,
+	.remove		= vfio_pci_remove,
+	.err_handler	= &vfio_err_handlers,
+};
+
+static DEFINE_MUTEX(reflck_lock);
+
+static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
+{
+	struct vfio_pci_reflck *reflck;
+
+	reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
+	if (!reflck)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&reflck->kref);
+	mutex_init(&reflck->lock);
+
+	return reflck;
+}
+
+static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
+{
+	kref_get(&reflck->kref);
+}
+
+static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_device *vdev = data;
+	struct vfio_pci_reflck **preflck = &vdev->reflck;
+	struct vfio_device *device;
+	struct vfio_pci_device *tmp;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return 0;
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	tmp = vfio_device_data(device);
+
+	if (tmp->reflck) {
+		vfio_pci_reflck_get(tmp->reflck);
+		*preflck = tmp->reflck;
+		vfio_device_put(device);
+		return 1;
+	}
+
+	vfio_device_put(device);
+	return 0;
+}
+
+int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+{
+	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
+
+	mutex_lock(&reflck_lock);
+
+	if (pci_is_root_bus(vdev->pdev->bus) ||
+	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
+					  vdev, slot) <= 0)
+		vdev->reflck = vfio_pci_reflck_alloc();
+
+	mutex_unlock(&reflck_lock);
+
+	return PTR_ERR_OR_ZERO(vdev->reflck);
+}
+
+static void vfio_pci_reflck_release(struct kref *kref)
+{
+	struct vfio_pci_reflck *reflck = container_of(kref,
+						      struct vfio_pci_reflck,
+						      kref);
+
+	kfree(reflck);
+	mutex_unlock(&reflck_lock);
+}
+
+void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+{
+	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
+}
+
+struct vfio_devices {
+	struct vfio_device **devices;
+	struct vfio_pci_device *vdev;
+	int cur_index;
+	int max_index;
+};
+
+static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_devices *devs = data;
+	struct vfio_device *device;
+	struct vfio_pci_device *tmp;
+
+	if (devs->cur_index == devs->max_index)
+		return -ENOSPC;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return -EINVAL;
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	tmp = vfio_device_data(device);
+
+	/* Fault if the device is not unused */
+	if (tmp->refcnt) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	devs->devices[devs->cur_index++] = device;
+	return 0;
+}
+
+/*
+ * If a bus or slot reset is available for the provided device and:
+ *  - All of the devices affected by that bus or slot reset are unused
+ *    (!refcnt)
+ *  - At least one of the affected devices is marked dirty via
+ *    needs_reset (such as by lack of FLR support)
+ * Then attempt to perform that bus or slot reset.  Callers are required
+ * to hold vdev->reflck->lock, protecting the bus/slot reset group from
+ * concurrent opens.  A vfio_device reference is acquired for each device
+ * to prevent unbinds during the reset operation.
+ *
+ * NB: vfio-core considers a group to be viable even if some devices are
+ * bound to drivers like pci-stub or pcieport.  Here we require all devices
+ * to be bound to vfio_pci since that's the only way we can be sure they
+ * stay put.
+ */
+static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
+{
+	struct vfio_devices devs = { .cur_index = 0 };
+	int i = 0, ret = -EINVAL;
+	bool slot = false;
+	struct vfio_pci_device *tmp;
+
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return;
+
+	if (vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
+					  &i, slot) || !i)
+		return;
+
+	devs.max_index = i;
+	devs.devices = kcalloc(i, sizeof(struct vfio_device *), GFP_KERNEL);
+	if (!devs.devices)
+		return;
+
+	devs.vdev = vdev;
+	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
+					  vfio_pci_get_unused_devs,
+					  &devs, slot))
+		goto put_devs;
+
+	/* Does at least one need a reset? */
+	for (i = 0; i < devs.cur_index; i++) {
+		tmp = vfio_device_data(devs.devices[i]);
+		if (tmp->needs_reset) {
+			ret = pci_reset_bus(vdev->pdev);
+			break;
+		}
+	}
+
+put_devs:
+	for (i = 0; i < devs.cur_index; i++) {
+		tmp = vfio_device_data(devs.devices[i]);
+
+		/*
+		 * If reset was successful, affected devices no longer need
+		 * a reset and we should return all the collateral devices
+		 * to low power.  If not successful, we either didn't reset
+		 * the bus or timed out waiting for it, so let's not touch
+		 * the power state.
+		 */
+		if (!ret) {
+			tmp->needs_reset = false;
+
+			if (tmp != vdev && !tmp->disable_idle_d3)
+				vfio_pci_set_power_state(tmp, PCI_D3hot);
+		}
+
+		vfio_device_put(devs.devices[i]);
+	}
+
+	kfree(devs.devices);
+}
+
+static void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+	vfio_pci_uninit_perm_bits();
+}
+
+void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
+{
+	char *p, *id;
+	int rc;
+
+	/* no ids passed actually */
+	if (ids[0] == '\0')
+		return;
+
+	/* add ids specified in the module parameter */
+	p = ids;
+	while ((id = strsep(&p, ","))) {
+		unsigned int vendor, device, subvendor = PCI_ANY_ID,
+			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
+		int fields;
+
+		if (!strlen(id))
+			continue;
+
+		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
+				&vendor, &device, &subvendor, &subdevice,
+				&class, &class_mask);
+
+		if (fields < 2) {
+			pr_warn("invalid id string \"%s\"\n", id);
+			continue;
+		}
+
+		rc = pci_add_dynid(driver, vendor, device,
+				   subvendor, subdevice, class, class_mask, 0);
+		if (rc)
+			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask, rc);
+		else
+			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask);
+	}
+}
+
+static int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permision data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	vfio_pci_fill_ids(&ids[0], &vfio_pci_driver);
+
+	return 0;
+
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.4



* [PATCH v1 6/9] vfio_pci: shrink vfio_pci_common.c
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (4 preceding siblings ...)
  2019-06-08 13:21 ` [PATCH v1 5/9] vfio_pci: duplicate vfio_pci.c Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 7/9] vfio_pci: shrink vfio_pci.c Liu Yi L
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

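This patch removes the vfio-pci module-specific code (module parameters,
open/release, vfio_pci_ops, probe/remove, the pci_driver definition and
module init/exit) from vfio_pci_common.c, so that the file contains only
code that can be shared as a common source file.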

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
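Illustration only, not part of the patch: after this change
vfio_pci_fill_ids() stays in vfio_pci_common.c, while the "ids" module
parameter it consumes is no longer defined there. The id string format
"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]" can be
sketched in userspace as below. The names parse_ids(), struct id_entry and
ANY_ID are hypothetical stand-ins, and strchr() replaces the kernel's
strsep() for portability; this is a sketch of the parsing rule, not the
kernel implementation.

```c
#include <stdio.h>
#include <string.h>

#define ANY_ID (~0u) /* hypothetical stand-in for the kernel's PCI_ANY_ID */

/* Hypothetical container for one parsed entry (not a kernel structure). */
struct id_entry {
	unsigned int vendor, device, subvendor, subdevice;
	unsigned int dev_class, class_mask;
};

/*
 * Split a comma separated id list in place and parse each entry.
 * Empty entries are skipped and entries with fewer than two fields are
 * rejected, as in vfio_pci_fill_ids(). Returns the number of entries stored.
 */
static int parse_ids(char *ids, struct id_entry *out, int max)
{
	char *id, *next;
	int n = 0;

	for (id = ids; id && n < max; id = next) {
		/* Optional fields keep these defaults when not given. */
		struct id_entry e = { 0, 0, ANY_ID, ANY_ID, 0, 0 };
		char *comma = strchr(id, ',');

		next = comma ? comma + 1 : NULL;
		if (comma)
			*comma = '\0';

		if (!strlen(id))
			continue;

		if (sscanf(id, "%x:%x:%x:%x:%x:%x",
			   &e.vendor, &e.device, &e.subvendor, &e.subdevice,
			   &e.dev_class, &e.class_mask) < 2) {
			fprintf(stderr, "invalid id string \"%s\"\n", id);
			continue;
		}

		out[n++] = e;
	}

	return n;
}
```

In the driver itself each accepted entry is passed to pci_add_dynid();
the sketch only collects the parsed values so the defaults for the
optional fields are easy to inspect.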
 drivers/vfio/pci/vfio_pci_common.c | 233 -------------------------------------
 1 file changed, 233 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index 4da653e..3aab938 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -33,30 +33,6 @@
 
 #include "vfio_pci_private.h"
 
-#define DRIVER_VERSION  "0.2"
-#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
-#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
-
-static char ids[1024] __initdata;
-module_param_string(ids, ids, sizeof(ids), 0);
-MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
-
-static bool nointxmask;
-module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(nointxmask,
-		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
-
-#ifdef CONFIG_VFIO_PCI_VGA
-static bool disable_vga;
-module_param(disable_vga, bool, S_IRUGO);
-MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-pci");
-#endif
-
-static bool disable_idle_d3;
-module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(disable_idle_d3,
-		 "Disable using the PCI D3 low power state for idle, unused devices");
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -443,47 +419,6 @@ void vfio_pci_disable(struct vfio_pci_device *vdev)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
-static void vfio_pci_release(void *device_data)
-{
-	struct vfio_pci_device *vdev = device_data;
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!(--vdev->refcnt)) {
-		vfio_spapr_pci_eeh_release(vdev->pdev);
-		vfio_pci_disable(vdev);
-	}
-
-	mutex_unlock(&vdev->reflck->lock);
-
-	module_put(THIS_MODULE);
-}
-
-static int vfio_pci_open(void *device_data)
-{
-	struct vfio_pci_device *vdev = device_data;
-	int ret = 0;
-
-	if (!try_module_get(THIS_MODULE))
-		return -ENODEV;
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!vdev->refcnt) {
-		ret = vfio_pci_enable(vdev);
-		if (ret)
-			goto error;
-
-		vfio_spapr_pci_eeh_open(vdev->pdev);
-	}
-	vdev->refcnt++;
-error:
-	mutex_unlock(&vdev->reflck->lock);
-	if (ret)
-		module_put(THIS_MODULE);
-	return ret;
-}
-
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 {
 	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
@@ -1255,129 +1190,6 @@ void vfio_pci_request(void *device_data, unsigned int count)
 	mutex_unlock(&vdev->igate);
 }
 
-static const struct vfio_device_ops vfio_pci_ops = {
-	.name		= "vfio-pci",
-	.open		= vfio_pci_open,
-	.release	= vfio_pci_release,
-	.ioctl		= vfio_pci_ioctl,
-	.read		= vfio_pci_read,
-	.write		= vfio_pci_write,
-	.mmap		= vfio_pci_mmap,
-	.request	= vfio_pci_request,
-};
-
-static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
-{
-	struct vfio_pci_device *vdev;
-	struct iommu_group *group;
-	int ret;
-
-	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
-		return -EINVAL;
-
-	/*
-	 * Prevent binding to PFs with VFs enabled, this too easily allows
-	 * userspace instance with VFs and PFs from the same device, which
-	 * cannot work.  Disabling SR-IOV here would initiate removing the
-	 * VFs, which would unbind the driver, which is prone to blocking
-	 * if that VF is also in use by vfio-pci.  Just reject these PFs
-	 * and let the user sort it out.
-	 */
-	if (pci_num_vf(pdev)) {
-		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
-		return -EBUSY;
-	}
-
-	group = vfio_iommu_group_get(&pdev->dev);
-	if (!group)
-		return -EINVAL;
-
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev) {
-		vfio_iommu_group_put(group, &pdev->dev);
-		return -ENOMEM;
-	}
-
-	vdev->pdev = pdev;
-	vdev->irq_type = VFIO_PCI_NUM_IRQS;
-	mutex_init(&vdev->igate);
-	spin_lock_init(&vdev->irqlock);
-	mutex_init(&vdev->ioeventfds_lock);
-	INIT_LIST_HEAD(&vdev->ioeventfds_list);
-	vdev->nointxmask = nointxmask;
-#ifdef CONFIG_VFIO_PCI_VGA
-	vdev->disable_vga = disable_vga;
-#endif
-	vdev->disable_idle_d3 = disable_idle_d3;
-
-	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
-	if (ret) {
-		vfio_iommu_group_put(group, &pdev->dev);
-		kfree(vdev);
-		return ret;
-	}
-
-	ret = vfio_pci_reflck_attach(vdev);
-	if (ret) {
-		vfio_del_group_dev(&pdev->dev);
-		vfio_iommu_group_put(group, &pdev->dev);
-		kfree(vdev);
-		return ret;
-	}
-
-	if (vfio_pci_is_vga(pdev)) {
-		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
-		vga_set_legacy_decoding(pdev,
-					vfio_pci_set_vga_decode(vdev, false));
-	}
-
-	vfio_pci_probe_power_state(vdev);
-
-	if (!vdev->disable_idle_d3) {
-		/*
-		 * pci-core sets the device power state to an unknown value at
-		 * bootup and after being removed from a driver.  The only
-		 * transition it allows from this unknown state is to D0, which
-		 * typically happens when a driver calls pci_enable_device().
-		 * We're not ready to enable the device yet, but we do want to
-		 * be able to get to D3.  Therefore first do a D0 transition
-		 * before going to D3.
-		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-	}
-
-	return ret;
-}
-
-static void vfio_pci_remove(struct pci_dev *pdev)
-{
-	struct vfio_pci_device *vdev;
-
-	vdev = vfio_del_group_dev(&pdev->dev);
-	if (!vdev)
-		return;
-
-	vfio_pci_reflck_put(vdev->reflck);
-
-	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
-	kfree(vdev->region);
-	mutex_destroy(&vdev->ioeventfds_lock);
-
-	if (!vdev->disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
-
-	kfree(vdev->pm_save);
-	kfree(vdev);
-
-	if (vfio_pci_is_vga(pdev)) {
-		vga_client_register(pdev, NULL, NULL, NULL);
-		vga_set_legacy_decoding(pdev,
-				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
-				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
-	}
-}
-
 static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 						  pci_channel_state_t state)
 {
@@ -1410,14 +1222,6 @@ static const struct pci_error_handlers vfio_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
 };
 
-static struct pci_driver vfio_pci_driver = {
-	.name		= "vfio-pci",
-	.id_table	= NULL, /* only dynamic ids */
-	.probe		= vfio_pci_probe,
-	.remove		= vfio_pci_remove,
-	.err_handler	= &vfio_err_handlers,
-};
-
 static DEFINE_MUTEX(reflck_lock);
 
 static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
@@ -1612,12 +1416,6 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	kfree(devs.devices);
 }
 
-static void __exit vfio_pci_cleanup(void)
-{
-	pci_unregister_driver(&vfio_pci_driver);
-	vfio_pci_uninit_perm_bits();
-}
-
 void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
@@ -1658,34 +1456,3 @@ void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 				class, class_mask);
 	}
 }
-
-static int __init vfio_pci_init(void)
-{
-	int ret;
-
-	/* Allocate shared config space permision data used by all devices */
-	ret = vfio_pci_init_perm_bits();
-	if (ret)
-		return ret;
-
-	/* Register and scan for devices */
-	ret = pci_register_driver(&vfio_pci_driver);
-	if (ret)
-		goto out_driver;
-
-	vfio_pci_fill_ids(&ids[0], &vfio_pci_driver);
-
-	return 0;
-
-out_driver:
-	vfio_pci_uninit_perm_bits();
-	return ret;
-}
-
-module_init(vfio_pci_init);
-module_exit(vfio_pci_cleanup);
-
-MODULE_VERSION(DRIVER_VERSION);
-MODULE_LICENSE("GPL v2");
-MODULE_AUTHOR(DRIVER_AUTHOR);
-MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.4



* [PATCH v1 7/9] vfio_pci: shrink vfio_pci.c
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (5 preceding siblings ...)
  2019-06-08 13:21 ` [PATCH v1 6/9] vfio_pci: shrink vfio_pci_common.c Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 8/9] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 9/9] smaples: add vfio-mdev-pci driver Liu Yi L
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

This patch removes the common code from vfio_pci.c; the vfio-pci module
now uses the common functions implemented in vfio_pci_common.c instead.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
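Illustration only, not part of the patch: vfio_pci_open() and
vfio_pci_release() remain in vfio_pci.c and call the common
vfio_pci_enable()/vfio_pci_disable(). The first-open/last-release
pattern they implement can be sketched in userspace as below. All names
here (struct fake_vdev, common_enable(), driver_open(), ...) are
hypothetical, and the reflck mutex, module reference counting and EEH
hooks of the real functions are deliberately left out.

```c
#include <errno.h>

/* Hypothetical stand-in for struct vfio_pci_device. */
struct fake_vdev {
	int refcnt;      /* number of current opens */
	int enabled;     /* set by enable, cleared by disable */
	int fail_enable; /* knob to simulate an enable failure */
};

/* "Common" side: what would live in the shared source file. */
static int common_enable(struct fake_vdev *v)
{
	if (v->fail_enable)
		return -EIO;
	v->enabled = 1;
	return 0;
}

static void common_disable(struct fake_vdev *v)
{
	v->enabled = 0;
}

/* "Driver" side: only the first open enables the device. */
static int driver_open(struct fake_vdev *v)
{
	if (!v->refcnt) {
		int ret = common_enable(v);

		if (ret)
			return ret; /* refcnt stays untouched on failure */
	}
	v->refcnt++;
	return 0;
}

/* "Driver" side: only the last release disables the device. */
static void driver_release(struct fake_vdev *v)
{
	if (!(--v->refcnt))
		common_disable(v);
}
```

The point of the split is that a second front end (such as the sample
mdev driver later in this series) could supply its own open/release
wrappers around the same enable/disable helpers.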
 drivers/vfio/pci/Makefile           |    3 +-
 drivers/vfio/pci/vfio_pci.c         | 1424 -----------------------------------
 drivers/vfio/pci/vfio_pci_common.c  |    2 +-
 drivers/vfio/pci/vfio_pci_private.h |    2 +
 4 files changed, 5 insertions(+), 1426 deletions(-)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f027f8a..d94317a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
+		vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 4da653e..48abbb9 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -57,392 +57,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-/*
- * Our VGA arbiter participation is limited since we don't know anything
- * about the device itself.  However, if the device is the only VGA device
- * downstream of a bridge and VFIO VGA support is disabled, then we can
- * safely return legacy VGA IO and memory as not decoded since the user
- * has no way to get to it and routing can be disabled externally at the
- * bridge.
- */
-unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
-{
-	struct vfio_pci_device *vdev = opaque;
-	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
-	unsigned char max_busnr;
-	unsigned int decodes;
-
-	if (single_vga || !vfio_vga_disabled(vdev) ||
-		pci_is_root_bus(pdev->bus))
-		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
-		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
-
-	max_busnr = pci_bus_max_busnr(pdev->bus);
-	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
-
-	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
-		if (tmp == pdev ||
-		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
-		    pci_is_root_bus(tmp->bus))
-			continue;
-
-		if (tmp->bus->number >= pdev->bus->number &&
-		    tmp->bus->number <= max_busnr) {
-			pci_dev_put(tmp);
-			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
-			break;
-		}
-	}
-
-	return decodes;
-}
-
-static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
-{
-	struct resource *res;
-	int bar;
-	struct vfio_pci_dummy_resource *dummy_res;
-
-	INIT_LIST_HEAD(&vdev->dummy_resources_list);
-
-	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
-		res = vdev->pdev->resource + bar;
-
-		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
-			goto no_mmap;
-
-		if (!(res->flags & IORESOURCE_MEM))
-			goto no_mmap;
-
-		/*
-		 * The PCI core shouldn't set up a resource with a
-		 * type but zero size. But there may be bugs that
-		 * cause us to do that.
-		 */
-		if (!resource_size(res))
-			goto no_mmap;
-
-		if (resource_size(res) >= PAGE_SIZE) {
-			vdev->bar_mmap_supported[bar] = true;
-			continue;
-		}
-
-		if (!(res->start & ~PAGE_MASK)) {
-			/*
-			 * Add a dummy resource to reserve the remainder
-			 * of the exclusive page in case that hot-add
-			 * device's bar is assigned into it.
-			 */
-			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
-			if (dummy_res == NULL)
-				goto no_mmap;
-
-			dummy_res->resource.name = "vfio sub-page reserved";
-			dummy_res->resource.start = res->end + 1;
-			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
-			dummy_res->resource.flags = res->flags;
-			if (request_resource(res->parent,
-						&dummy_res->resource)) {
-				kfree(dummy_res);
-				goto no_mmap;
-			}
-			dummy_res->index = bar;
-			list_add(&dummy_res->res_next,
-					&vdev->dummy_resources_list);
-			vdev->bar_mmap_supported[bar] = true;
-			continue;
-		}
-		/*
-		 * Here we don't handle the case when the BAR is not page
-		 * aligned because we can't expect the BAR will be
-		 * assigned into the same location in a page in guest
-		 * when we passthrough the BAR. And it's hard to access
-		 * this BAR in userspace because we have no way to get
-		 * the BAR's location in a page.
-		 */
-no_mmap:
-		vdev->bar_mmap_supported[bar] = false;
-	}
-}
-
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
-
-/*
- * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
- * _and_ the ability detect when the device is asserting INTx via PCI_STATUS.
- * If a device implements the former but not the latter we would typically
- * expect broken_intx_masking be set and require an exclusive interrupt.
- * However since we do have control of the device's ability to assert INTx,
- * we can instead pretend that the device does not implement INTx, virtualizing
- * the pin register to report zero and maintaining DisINTx set on the host.
- */
-static bool vfio_pci_nointx(struct pci_dev *pdev)
-{
-	switch (pdev->vendor) {
-	case PCI_VENDOR_ID_INTEL:
-		switch (pdev->device) {
-		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
-		case 0x1572:
-		case 0x1574:
-		case 0x1580 ... 0x1581:
-		case 0x1583 ... 0x158b:
-		case 0x37d0 ... 0x37d2:
-			return true;
-		default:
-			return false;
-		}
-	}
-
-	return false;
-}
-
-void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	u16 pmcsr;
-
-	if (!pdev->pm_cap)
-		return;
-
-	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
-
-	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
-}
-
-/*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0.  Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
- */
-int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	bool needs_restore = false, needs_save = false;
-	int ret;
-
-	if (vdev->needs_pm_restore) {
-		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
-			pci_save_state(pdev);
-			needs_save = true;
-		}
-
-		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
-			needs_restore = true;
-	}
-
-	ret = pci_set_power_state(pdev, state);
-
-	if (!ret) {
-		/* D3 might be unsupported via quirk, skip unless in D3 */
-		if (needs_save && pdev->current_state >= PCI_D3hot) {
-			vdev->pm_save = pci_store_saved_state(pdev);
-		} else if (needs_restore) {
-			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
-			pci_restore_state(pdev);
-		}
-	}
-
-	return ret;
-}
-
-int vfio_pci_enable(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	int ret;
-	u16 cmd;
-	u8 msix_pos;
-
-	vfio_pci_set_power_state(vdev, PCI_D0);
-
-	/* Don't allow our initial saved state to include busmaster */
-	pci_clear_master(pdev);
-
-	ret = pci_enable_device(pdev);
-	if (ret)
-		return ret;
-
-	/* If reset fails because of the device lock, fail this path entirely */
-	ret = pci_try_reset_function(pdev);
-	if (ret == -EAGAIN) {
-		pci_disable_device(pdev);
-		return ret;
-	}
-
-	vdev->reset_works = !ret;
-	pci_save_state(pdev);
-	vdev->pci_saved_state = pci_store_saved_state(pdev);
-	if (!vdev->pci_saved_state)
-		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
-
-	if (likely(!vdev->nointxmask)) {
-		if (vfio_pci_nointx(pdev)) {
-			pci_info(pdev, "Masking broken INTx support\n");
-			vdev->nointx = true;
-			pci_intx(pdev, 0);
-		} else
-			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
-	}
-
-	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
-	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
-		cmd &= ~PCI_COMMAND_INTX_DISABLE;
-		pci_write_config_word(pdev, PCI_COMMAND, cmd);
-	}
-
-	ret = vfio_config_init(vdev);
-	if (ret) {
-		kfree(vdev->pci_saved_state);
-		vdev->pci_saved_state = NULL;
-		pci_disable_device(pdev);
-		return ret;
-	}
-
-	msix_pos = pdev->msix_cap;
-	if (msix_pos) {
-		u16 flags;
-		u32 table;
-
-		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
-		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
-
-		vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
-		vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
-		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
-	} else
-		vdev->msix_bar = 0xFF;
-
-	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
-		vdev->has_vga = true;
-
-
-	if (vfio_pci_is_vga(pdev) &&
-	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
-		ret = vfio_pci_igd_init(vdev);
-		if (ret) {
-			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
-			goto disable_exit;
-		}
-	}
-
-	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
-			goto disable_exit;
-		}
-	}
-
-	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_ibm_npu2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
-			goto disable_exit;
-		}
-	}
-
-	vfio_pci_probe_mmaps(vdev);
-
-	return 0;
-
-disable_exit:
-	vfio_pci_disable(vdev);
-	return ret;
-}
-
-void vfio_pci_disable(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	struct vfio_pci_dummy_resource *dummy_res, *tmp;
-	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
-	int i, bar;
-
-	/* Stop the device from further DMA */
-	pci_clear_master(pdev);
-
-	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
-				VFIO_IRQ_SET_ACTION_TRIGGER,
-				vdev->irq_type, 0, 0, NULL);
-
-	/* Device closed, don't need mutex here */
-	list_for_each_entry_safe(ioeventfd, ioeventfd_tmp,
-				 &vdev->ioeventfds_list, next) {
-		vfio_virqfd_disable(&ioeventfd->virqfd);
-		list_del(&ioeventfd->next);
-		kfree(ioeventfd);
-	}
-	vdev->ioeventfds_nr = 0;
-
-	vdev->virq_disabled = false;
-
-	for (i = 0; i < vdev->num_regions; i++)
-		vdev->region[i].ops->release(vdev, &vdev->region[i]);
-
-	vdev->num_regions = 0;
-	kfree(vdev->region);
-	vdev->region = NULL; /* don't krealloc a freed pointer */
-
-	vfio_config_free(vdev);
-
-	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
-		if (!vdev->barmap[bar])
-			continue;
-		pci_iounmap(pdev, vdev->barmap[bar]);
-		pci_release_selected_regions(pdev, 1 << bar);
-		vdev->barmap[bar] = NULL;
-	}
-
-	list_for_each_entry_safe(dummy_res, tmp,
-				 &vdev->dummy_resources_list, res_next) {
-		list_del(&dummy_res->res_next);
-		release_resource(&dummy_res->resource);
-		kfree(dummy_res);
-	}
-
-	vdev->needs_reset = true;
-
-	/*
-	 * If we have saved state, restore it.  If we can reset the device,
-	 * even better.  Resetting with current state seems better than
-	 * nothing, but saving and restoring current state without reset
-	 * is just busy work.
-	 */
-	if (pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)) {
-		pci_info(pdev, "%s: Couldn't reload saved state\n", __func__);
-
-		if (!vdev->reset_works)
-			goto out;
-
-		pci_save_state(pdev);
-	}
-
-	/*
-	 * Disable INTx and MSI, presumably to avoid spurious interrupts
-	 * during reset.  Stolen from pci_reset_function()
-	 */
-	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
-
-	/*
-	 * Try to reset the device.  The success of this is dependent on
-	 * being able to lock the device, which is not always possible.
-	 */
-	if (vdev->reset_works && !pci_try_reset_function(pdev))
-		vdev->needs_reset = false;
-
-	pci_restore_state(pdev);
-out:
-	pci_disable_device(pdev);
-
-	vfio_pci_try_bus_reset(vdev);
-
-	if (!vdev->disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-}
-
 static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -484,777 +98,6 @@ static int vfio_pci_open(void *device_data)
 	return ret;
 }
 
-static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
-{
-	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
-		u8 pin;
-
-		if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
-		    vdev->nointx || vdev->pdev->is_virtfn)
-			return 0;
-
-		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
-
-		return pin ? 1 : 0;
-	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
-		u8 pos;
-		u16 flags;
-
-		pos = vdev->pdev->msi_cap;
-		if (pos) {
-			pci_read_config_word(vdev->pdev,
-					     pos + PCI_MSI_FLAGS, &flags);
-			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
-		}
-	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
-		u8 pos;
-		u16 flags;
-
-		pos = vdev->pdev->msix_cap;
-		if (pos) {
-			pci_read_config_word(vdev->pdev,
-					     pos + PCI_MSIX_FLAGS, &flags);
-
-			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
-		}
-	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
-			return 1;
-	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
-		return 1;
-	}
-
-	return 0;
-}
-
-static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
-{
-	(*(int *)data)++;
-	return 0;
-}
-
-struct vfio_pci_fill_info {
-	int max;
-	int cur;
-	struct vfio_pci_dependent_device *devices;
-};
-
-static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_fill_info *fill = data;
-	struct iommu_group *iommu_group;
-
-	if (fill->cur == fill->max)
-		return -EAGAIN; /* Something changed, try again */
-
-	iommu_group = iommu_group_get(&pdev->dev);
-	if (!iommu_group)
-		return -EPERM; /* Cannot reset non-isolated devices */
-
-	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
-	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
-	fill->devices[fill->cur].bus = pdev->bus->number;
-	fill->devices[fill->cur].devfn = pdev->devfn;
-	fill->cur++;
-	iommu_group_put(iommu_group);
-	return 0;
-}
-
-struct vfio_pci_group_entry {
-	struct vfio_group *group;
-	int id;
-};
-
-struct vfio_pci_group_info {
-	int count;
-	struct vfio_pci_group_entry *groups;
-};
-
-static int vfio_pci_validate_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_group_info *info = data;
-	struct iommu_group *group;
-	int id, i;
-
-	group = iommu_group_get(&pdev->dev);
-	if (!group)
-		return -EPERM;
-
-	id = iommu_group_id(group);
-
-	for (i = 0; i < info->count; i++)
-		if (info->groups[i].id == id)
-			break;
-
-	iommu_group_put(group);
-
-	return (i == info->count) ? -EINVAL : 0;
-}
-
-static bool vfio_pci_dev_below_slot(struct pci_dev *pdev, struct pci_slot *slot)
-{
-	for (; pdev; pdev = pdev->bus->self)
-		if (pdev->bus == slot->bus)
-			return (pdev->slot == slot);
-	return false;
-}
-
-struct vfio_pci_walk_info {
-	int (*fn)(struct pci_dev *, void *data);
-	void *data;
-	struct pci_dev *pdev;
-	bool slot;
-	int ret;
-};
-
-static int vfio_pci_walk_wrapper(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_walk_info *walk = data;
-
-	if (!walk->slot || vfio_pci_dev_below_slot(pdev, walk->pdev->slot))
-		walk->ret = walk->fn(pdev, walk->data);
-
-	return walk->ret;
-}
-
-static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
-					 int (*fn)(struct pci_dev *,
-						   void *data), void *data,
-					 bool slot)
-{
-	struct vfio_pci_walk_info walk = {
-		.fn = fn, .data = data, .pdev = pdev, .slot = slot, .ret = 0,
-	};
-
-	pci_walk_bus(pdev->bus, vfio_pci_walk_wrapper, &walk);
-
-	return walk.ret;
-}
-
-static int msix_mmappable_cap(struct vfio_pci_device *vdev,
-			      struct vfio_info_cap *caps)
-{
-	struct vfio_info_cap_header header = {
-		.id = VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
-		.version = 1
-	};
-
-	return vfio_info_add_capability(caps, &header, sizeof(header));
-}
-
-int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
-				 unsigned int type, unsigned int subtype,
-				 const struct vfio_pci_regops *ops,
-				 size_t size, u32 flags, void *data)
-{
-	struct vfio_pci_region *region;
-
-	region = krealloc(vdev->region,
-			  (vdev->num_regions + 1) * sizeof(*region),
-			  GFP_KERNEL);
-	if (!region)
-		return -ENOMEM;
-
-	vdev->region = region;
-	vdev->region[vdev->num_regions].type = type;
-	vdev->region[vdev->num_regions].subtype = subtype;
-	vdev->region[vdev->num_regions].ops = ops;
-	vdev->region[vdev->num_regions].size = size;
-	vdev->region[vdev->num_regions].flags = flags;
-	vdev->region[vdev->num_regions].data = data;
-
-	vdev->num_regions++;
-
-	return 0;
-}
-
-long vfio_pci_ioctl(void *device_data,
-		   unsigned int cmd, unsigned long arg)
-{
-	struct vfio_pci_device *vdev = device_data;
-	unsigned long minsz;
-
-	if (cmd == VFIO_DEVICE_GET_INFO) {
-		struct vfio_device_info info;
-
-		minsz = offsetofend(struct vfio_device_info, num_irqs);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		info.flags = VFIO_DEVICE_FLAGS_PCI;
-
-		if (vdev->reset_works)
-			info.flags |= VFIO_DEVICE_FLAGS_RESET;
-
-		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
-		info.num_irqs = VFIO_PCI_NUM_IRQS;
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
-		struct pci_dev *pdev = vdev->pdev;
-		struct vfio_region_info info;
-		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
-		int i, ret;
-
-		minsz = offsetofend(struct vfio_region_info, offset);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		switch (info.index) {
-		case VFIO_PCI_CONFIG_REGION_INDEX:
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = pdev->cfg_size;
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-			break;
-		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = pci_resource_len(pdev, info.index);
-			if (!info.size) {
-				info.flags = 0;
-				break;
-			}
-
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-			if (vdev->bar_mmap_supported[info.index]) {
-				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
-				if (info.index == vdev->msix_bar) {
-					ret = msix_mmappable_cap(vdev, &caps);
-					if (ret)
-						return ret;
-				}
-			}
-
-			break;
-		case VFIO_PCI_ROM_REGION_INDEX:
-		{
-			void __iomem *io;
-			size_t size;
-			u16 orig_cmd;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.flags = 0;
-
-			/* Report the BAR size, not the ROM size */
-			info.size = pci_resource_len(pdev, info.index);
-			if (!info.size) {
-				/* Shadow ROMs appear as PCI option ROMs */
-				if (pdev->resource[PCI_ROM_RESOURCE].flags &
-							IORESOURCE_ROM_SHADOW)
-					info.size = 0x20000;
-				else
-					break;
-			}
-
-			/*
-			 * Is it really there?  Enable memory decode for
-			 * implicit access in pci_map_rom().
-			 */
-			pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
-			pci_write_config_word(pdev, PCI_COMMAND,
-					      orig_cmd | PCI_COMMAND_MEMORY);
-
-			io = pci_map_rom(pdev, &size);
-			if (io) {
-				info.flags = VFIO_REGION_INFO_FLAG_READ;
-				pci_unmap_rom(pdev, io);
-			} else {
-				info.size = 0;
-			}
-
-			pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
-			break;
-		}
-		case VFIO_PCI_VGA_REGION_INDEX:
-			if (!vdev->has_vga)
-				return -EINVAL;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = 0xc0000;
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-
-			break;
-		default:
-		{
-			struct vfio_region_info_cap_type cap_type = {
-					.header.id = VFIO_REGION_INFO_CAP_TYPE,
-					.header.version = 1 };
-
-			if (info.index >=
-			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
-				return -EINVAL;
-			info.index = array_index_nospec(info.index,
-							VFIO_PCI_NUM_REGIONS +
-							vdev->num_regions);
-
-			i = info.index - VFIO_PCI_NUM_REGIONS;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = vdev->region[i].size;
-			info.flags = vdev->region[i].flags;
-
-			cap_type.type = vdev->region[i].type;
-			cap_type.subtype = vdev->region[i].subtype;
-
-			ret = vfio_info_add_capability(&caps, &cap_type.header,
-						       sizeof(cap_type));
-			if (ret)
-				return ret;
-
-			if (vdev->region[i].ops->add_capability) {
-				ret = vdev->region[i].ops->add_capability(vdev,
-						&vdev->region[i], &caps);
-				if (ret)
-					return ret;
-			}
-		}
-		}
-
-		if (caps.size) {
-			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
-			if (info.argsz < sizeof(info) + caps.size) {
-				info.argsz = sizeof(info) + caps.size;
-				info.cap_offset = 0;
-			} else {
-				vfio_info_cap_shift(&caps, sizeof(info));
-				if (copy_to_user((void __user *)arg +
-						  sizeof(info), caps.buf,
-						  caps.size)) {
-					kfree(caps.buf);
-					return -EFAULT;
-				}
-				info.cap_offset = sizeof(info);
-			}
-
-			kfree(caps.buf);
-		}
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
-		struct vfio_irq_info info;
-
-		minsz = offsetofend(struct vfio_irq_info, count);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
-			return -EINVAL;
-
-		switch (info.index) {
-		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
-		case VFIO_PCI_REQ_IRQ_INDEX:
-			break;
-		case VFIO_PCI_ERR_IRQ_INDEX:
-			if (pci_is_pcie(vdev->pdev))
-				break;
-		/* fall through */
-		default:
-			return -EINVAL;
-		}
-
-		info.flags = VFIO_IRQ_INFO_EVENTFD;
-
-		info.count = vfio_pci_get_irq_count(vdev, info.index);
-
-		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
-			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
-				       VFIO_IRQ_INFO_AUTOMASKED);
-		else
-			info.flags |= VFIO_IRQ_INFO_NORESIZE;
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
-		struct vfio_irq_set hdr;
-		u8 *data = NULL;
-		int max, ret = 0;
-		size_t data_size = 0;
-
-		minsz = offsetofend(struct vfio_irq_set, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		max = vfio_pci_get_irq_count(vdev, hdr.index);
-
-		ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
-						 VFIO_PCI_NUM_IRQS, &data_size);
-		if (ret)
-			return ret;
-
-		if (data_size) {
-			data = memdup_user((void __user *)(arg + minsz),
-					    data_size);
-			if (IS_ERR(data))
-				return PTR_ERR(data);
-		}
-
-		mutex_lock(&vdev->igate);
-
-		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
-					      hdr.start, hdr.count, data);
-
-		mutex_unlock(&vdev->igate);
-		kfree(data);
-
-		return ret;
-
-	} else if (cmd == VFIO_DEVICE_RESET) {
-		return vdev->reset_works ?
-			pci_try_reset_function(vdev->pdev) : -EINVAL;
-
-	} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
-		struct vfio_pci_hot_reset_info hdr;
-		struct vfio_pci_fill_info fill = { 0 };
-		struct vfio_pci_dependent_device *devices = NULL;
-		bool slot = false;
-		int ret = 0;
-
-		minsz = offsetofend(struct vfio_pci_hot_reset_info, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (hdr.argsz < minsz)
-			return -EINVAL;
-
-		hdr.flags = 0;
-
-		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
-			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
-			return -ENODEV;
-
-		/* How many devices are affected? */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_count_devs,
-						    &fill.max, slot);
-		if (ret)
-			return ret;
-
-		WARN_ON(!fill.max); /* Should always be at least one */
-
-		/*
-		 * If there's enough space, fill it now, otherwise return
-		 * -ENOSPC and the number of devices affected.
-		 */
-		if (hdr.argsz < sizeof(hdr) + (fill.max * sizeof(*devices))) {
-			ret = -ENOSPC;
-			hdr.count = fill.max;
-			goto reset_info_exit;
-		}
-
-		devices = kcalloc(fill.max, sizeof(*devices), GFP_KERNEL);
-		if (!devices)
-			return -ENOMEM;
-
-		fill.devices = devices;
-
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_fill_devs,
-						    &fill, slot);
-
-		/*
-		 * If a device was removed between counting and filling,
-		 * we may come up short of fill.max.  If a device was
-		 * added, we'll have a return of -EAGAIN above.
-		 */
-		if (!ret)
-			hdr.count = fill.cur;
-
-reset_info_exit:
-		if (copy_to_user((void __user *)arg, &hdr, minsz))
-			ret = -EFAULT;
-
-		if (!ret) {
-			if (copy_to_user((void __user *)(arg + minsz), devices,
-					 hdr.count * sizeof(*devices)))
-				ret = -EFAULT;
-		}
-
-		kfree(devices);
-		return ret;
-
-	} else if (cmd == VFIO_DEVICE_PCI_HOT_RESET) {
-		struct vfio_pci_hot_reset hdr;
-		int32_t *group_fds;
-		struct vfio_pci_group_entry *groups;
-		struct vfio_pci_group_info info;
-		bool slot = false;
-		int i, count = 0, ret = 0;
-
-		minsz = offsetofend(struct vfio_pci_hot_reset, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (hdr.argsz < minsz || hdr.flags)
-			return -EINVAL;
-
-		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
-			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
-			return -ENODEV;
-
-		/*
-		 * We can't let userspace give us an arbitrarily large
-		 * buffer to copy, so verify how many we think there
-		 * could be.  Note groups can have multiple devices so
-		 * one group per device is the max.
-		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_count_devs,
-						    &count, slot);
-		if (ret)
-			return ret;
-
-		/* Somewhere between 1 and count is OK */
-		if (!hdr.count || hdr.count > count)
-			return -EINVAL;
-
-		group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
-		groups = kcalloc(hdr.count, sizeof(*groups), GFP_KERNEL);
-		if (!group_fds || !groups) {
-			kfree(group_fds);
-			kfree(groups);
-			return -ENOMEM;
-		}
-
-		if (copy_from_user(group_fds, (void __user *)(arg + minsz),
-				   hdr.count * sizeof(*group_fds))) {
-			kfree(group_fds);
-			kfree(groups);
-			return -EFAULT;
-		}
-
-		/*
-		 * For each group_fd, get the group through the vfio external
-		 * user interface and store the group and iommu ID.  This
-		 * ensures the group is held across the reset.
-		 */
-		for (i = 0; i < hdr.count; i++) {
-			struct vfio_group *group;
-			struct fd f = fdget(group_fds[i]);
-			if (!f.file) {
-				ret = -EBADF;
-				break;
-			}
-
-			group = vfio_group_get_external_user(f.file);
-			fdput(f);
-			if (IS_ERR(group)) {
-				ret = PTR_ERR(group);
-				break;
-			}
-
-			groups[i].group = group;
-			groups[i].id = vfio_external_user_iommu_id(group);
-		}
-
-		kfree(group_fds);
-
-		/* release reference to groups on error */
-		if (ret)
-			goto hot_reset_release;
-
-		info.count = hdr.count;
-		info.groups = groups;
-
-		/*
-		 * Test whether all the affected devices are contained
-		 * by the set of groups provided by the user.
-		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_validate_devs,
-						    &info, slot);
-		if (!ret)
-			/* User has access, do the reset */
-			ret = pci_reset_bus(vdev->pdev);
-
-hot_reset_release:
-		for (i--; i >= 0; i--)
-			vfio_group_put_external_user(groups[i].group);
-
-		kfree(groups);
-		return ret;
-	} else if (cmd == VFIO_DEVICE_IOEVENTFD) {
-		struct vfio_device_ioeventfd ioeventfd;
-		int count;
-
-		minsz = offsetofend(struct vfio_device_ioeventfd, fd);
-
-		if (copy_from_user(&ioeventfd, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (ioeventfd.argsz < minsz)
-			return -EINVAL;
-
-		if (ioeventfd.flags & ~VFIO_DEVICE_IOEVENTFD_SIZE_MASK)
-			return -EINVAL;
-
-		count = ioeventfd.flags & VFIO_DEVICE_IOEVENTFD_SIZE_MASK;
-
-		if (hweight8(count) != 1 || ioeventfd.fd < -1)
-			return -EINVAL;
-
-		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
-					  ioeventfd.data, count, ioeventfd.fd);
-	}
-
-	return -ENOTTY;
-}
-
-static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
-			   size_t count, loff_t *ppos, bool iswrite)
-{
-	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
-	struct vfio_pci_device *vdev = device_data;
-
-	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
-		return -EINVAL;
-
-	switch (index) {
-	case VFIO_PCI_CONFIG_REGION_INDEX:
-		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
-
-	case VFIO_PCI_ROM_REGION_INDEX:
-		if (iswrite)
-			return -EINVAL;
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
-
-	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
-
-	case VFIO_PCI_VGA_REGION_INDEX:
-		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
-	default:
-		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
-						   count, ppos, iswrite);
-	}
-
-	return -EINVAL;
-}
-
-ssize_t vfio_pci_read(void *device_data, char __user *buf,
-			     size_t count, loff_t *ppos)
-{
-	if (!count)
-		return 0;
-
-	return vfio_pci_rw(device_data, buf, count, ppos, false);
-}
-
-ssize_t vfio_pci_write(void *device_data, const char __user *buf,
-			      size_t count, loff_t *ppos)
-{
-	if (!count)
-		return 0;
-
-	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
-}
-
-int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
-{
-	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
-	unsigned int index;
-	u64 phys_len, req_len, pgoff, req_start;
-	int ret;
-
-	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
-
-	if (vma->vm_end < vma->vm_start)
-		return -EINVAL;
-	if ((vma->vm_flags & VM_SHARED) == 0)
-		return -EINVAL;
-	if (index >= VFIO_PCI_NUM_REGIONS) {
-		int regnum = index - VFIO_PCI_NUM_REGIONS;
-		struct vfio_pci_region *region = vdev->region + regnum;
-
-		if (region && region->ops && region->ops->mmap &&
-		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
-			return region->ops->mmap(vdev, region, vma);
-		return -EINVAL;
-	}
-	if (index >= VFIO_PCI_ROM_REGION_INDEX)
-		return -EINVAL;
-	if (!vdev->bar_mmap_supported[index])
-		return -EINVAL;
-
-	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
-	req_len = vma->vm_end - vma->vm_start;
-	pgoff = vma->vm_pgoff &
-		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
-	req_start = pgoff << PAGE_SHIFT;
-
-	if (req_start + req_len > phys_len)
-		return -EINVAL;
-
-	/*
-	 * Even though we don't make use of the barmap for the mmap,
-	 * we need to request the region and the barmap tracks that.
-	 */
-	if (!vdev->barmap[index]) {
-		ret = pci_request_selected_regions(pdev,
-						   1 << index, "vfio-pci");
-		if (ret)
-			return ret;
-
-		vdev->barmap[index] = pci_iomap(pdev, index, 0);
-		if (!vdev->barmap[index]) {
-			pci_release_selected_regions(pdev, 1 << index);
-			return -ENOMEM;
-		}
-	}
-
-	vma->vm_private_data = vdev;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
-
-	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       req_len, vma->vm_page_prot);
-}
-
-void vfio_pci_request(void *device_data, unsigned int count)
-{
-	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
-
-	mutex_lock(&vdev->igate);
-
-	if (vdev->req_trigger) {
-		if (!(count % 10))
-			pci_notice_ratelimited(pdev,
-				"Relaying device request to user (#%u)\n",
-				count);
-		eventfd_signal(vdev->req_trigger, 1);
-	} else if (count == 0) {
-		pci_warn(pdev,
-			"No device request channel registered, blocked until released by user\n");
-	}
-
-	mutex_unlock(&vdev->igate);
-}
-
 static const struct vfio_device_ops vfio_pci_ops = {
 	.name		= "vfio-pci",
 	.open		= vfio_pci_open,
@@ -1378,38 +221,6 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	}
 }
 
-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
-{
-	struct vfio_pci_device *vdev;
-	struct vfio_device *device;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (device == NULL)
-		return PCI_ERS_RESULT_DISCONNECT;
-
-	vdev = vfio_device_data(device);
-	if (vdev == NULL) {
-		vfio_device_put(device);
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	mutex_lock(&vdev->igate);
-
-	if (vdev->err_trigger)
-		eventfd_signal(vdev->err_trigger, 1);
-
-	mutex_unlock(&vdev->igate);
-
-	vfio_device_put(device);
-
-	return PCI_ERS_RESULT_CAN_RECOVER;
-}
-
-static const struct pci_error_handlers vfio_err_handlers = {
-	.error_detected = vfio_pci_aer_err_detected,
-};
-
 static struct pci_driver vfio_pci_driver = {
 	.name		= "vfio-pci",
 	.id_table	= NULL, /* only dynamic ids */
@@ -1418,247 +229,12 @@ static struct pci_driver vfio_pci_driver = {
 	.err_handler	= &vfio_err_handlers,
 };
 
-static DEFINE_MUTEX(reflck_lock);
-
-static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
-{
-	struct vfio_pci_reflck *reflck;
-
-	reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
-	if (!reflck)
-		return ERR_PTR(-ENOMEM);
-
-	kref_init(&reflck->kref);
-	mutex_init(&reflck->lock);
-
-	return reflck;
-}
-
-static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
-{
-	kref_get(&reflck->kref);
-}
-
-static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_device *vdev = data;
-	struct vfio_pci_reflck **preflck = &vdev->reflck;
-	struct vfio_device *device;
-	struct vfio_pci_device *tmp;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (!device)
-		return 0;
-
-	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
-		vfio_device_put(device);
-		return 0;
-	}
-
-	tmp = vfio_device_data(device);
-
-	if (tmp->reflck) {
-		vfio_pci_reflck_get(tmp->reflck);
-		*preflck = tmp->reflck;
-		vfio_device_put(device);
-		return 1;
-	}
-
-	vfio_device_put(device);
-	return 0;
-}
-
-int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
-{
-	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
-
-	mutex_lock(&reflck_lock);
-
-	if (pci_is_root_bus(vdev->pdev->bus) ||
-	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
-					  vdev, slot) <= 0)
-		vdev->reflck = vfio_pci_reflck_alloc();
-
-	mutex_unlock(&reflck_lock);
-
-	return PTR_ERR_OR_ZERO(vdev->reflck);
-}
-
-static void vfio_pci_reflck_release(struct kref *kref)
-{
-	struct vfio_pci_reflck *reflck = container_of(kref,
-						      struct vfio_pci_reflck,
-						      kref);
-
-	kfree(reflck);
-	mutex_unlock(&reflck_lock);
-}
-
-void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
-{
-	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
-}
-
-struct vfio_devices {
-	struct vfio_device **devices;
-	struct vfio_pci_device *vdev;
-	int cur_index;
-	int max_index;
-};
-
-static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_devices *devs = data;
-	struct vfio_device *device;
-	struct vfio_pci_device *tmp;
-
-	if (devs->cur_index == devs->max_index)
-		return -ENOSPC;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (!device)
-		return -EINVAL;
-
-	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
-		vfio_device_put(device);
-		return -EBUSY;
-	}
-
-	tmp = vfio_device_data(device);
-
-	/* Fault if the device is not unused */
-	if (tmp->refcnt) {
-		vfio_device_put(device);
-		return -EBUSY;
-	}
-
-	devs->devices[devs->cur_index++] = device;
-	return 0;
-}
-
-/*
- * If a bus or slot reset is available for the provided device and:
- *  - All of the devices affected by that bus or slot reset are unused
- *    (!refcnt)
- *  - At least one of the affected devices is marked dirty via
- *    needs_reset (such as by lack of FLR support)
- * Then attempt to perform that bus or slot reset.  Callers are required
- * to hold vdev->reflck->lock, protecting the bus/slot reset group from
- * concurrent opens.  A vfio_device reference is acquired for each device
- * to prevent unbinds during the reset operation.
- *
- * NB: vfio-core considers a group to be viable even if some devices are
- * bound to drivers like pci-stub or pcieport.  Here we require all devices
- * to be bound to vfio_pci since that's the only way we can be sure they
- * stay put.
- */
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
-{
-	struct vfio_devices devs = { .cur_index = 0 };
-	int i = 0, ret = -EINVAL;
-	bool slot = false;
-	struct vfio_pci_device *tmp;
-
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return;
-
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
-					  &i, slot) || !i)
-		return;
-
-	devs.max_index = i;
-	devs.devices = kcalloc(i, sizeof(struct vfio_device *), GFP_KERNEL);
-	if (!devs.devices)
-		return;
-
-	devs.vdev = vdev;
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
-					  vfio_pci_get_unused_devs,
-					  &devs, slot))
-		goto put_devs;
-
-	/* Does at least one need a reset? */
-	for (i = 0; i < devs.cur_index; i++) {
-		tmp = vfio_device_data(devs.devices[i]);
-		if (tmp->needs_reset) {
-			ret = pci_reset_bus(vdev->pdev);
-			break;
-		}
-	}
-
-put_devs:
-	for (i = 0; i < devs.cur_index; i++) {
-		tmp = vfio_device_data(devs.devices[i]);
-
-		/*
-		 * If reset was successful, affected devices no longer need
-		 * a reset and we should return all the collateral devices
-		 * to low power.  If not successful, we either didn't reset
-		 * the bus or timed out waiting for it, so let's not touch
-		 * the power state.
-		 */
-		if (!ret) {
-			tmp->needs_reset = false;
-
-			if (tmp != vdev && !tmp->disable_idle_d3)
-				vfio_pci_set_power_state(tmp, PCI_D3hot);
-		}
-
-		vfio_device_put(devs.devices[i]);
-	}
-
-	kfree(devs.devices);
-}
-
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
 	vfio_pci_uninit_perm_bits();
 }
 
-void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
-{
-	char *p, *id;
-	int rc;
-
-	/* no ids passed actually */
-	if (ids[0] == '\0')
-		return;
-
-	/* add ids specified in the module parameter */
-	p = ids;
-	while ((id = strsep(&p, ","))) {
-		unsigned int vendor, device, subvendor = PCI_ANY_ID,
-			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
-		int fields;
-
-		if (!strlen(id))
-			continue;
-
-		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
-				&vendor, &device, &subvendor, &subdevice,
-				&class, &class_mask);
-
-		if (fields < 2) {
-			pr_warn("invalid id string \"%s\"\n", id);
-			continue;
-		}
-
-		rc = pci_add_dynid(driver, vendor, device,
-				   subvendor, subdevice, class, class_mask, 0);
-		if (rc)
-			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask, rc);
-		else
-			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask);
-	}
-}
-
 static int __init vfio_pci_init(void)
 {
 	int ret;
diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index 3aab938..522d933 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -1218,7 +1218,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
-static const struct pci_error_handlers vfio_err_handlers = {
+const struct pci_error_handlers vfio_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
 };
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 7b99881..e422da0 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -138,6 +138,8 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+extern const struct pci_error_handlers vfio_err_handlers;
+
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-- 
2.7.4


* [PATCH v1 8/9] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (6 preceding siblings ...)
  2019-06-08 13:21 ` [PATCH v1 7/9] vfio_pci: shrink vfio_pci.c Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-08 13:21 ` [PATCH v1 9/9] samples: add vfio-mdev-pci driver Liu Yi L
  8 siblings, 0 replies; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel, kvm

cap_perms and ecap_perms can be (re)allocated by more than one module,
e.g. when the vfio-mdev-pci sample driver is also in use. To protect
the initialization of cap_perms and ecap_perms, this patch adds an
atomic counter that tracks the users of the cap/ecap perm bits. The
first caller of vfio_pci_init_perm_bits() initializes the bits, and
the last caller of vfio_pci_uninit_perm_bits() frees them.
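
As a rough illustration of the counting scheme (a user-space sketch
only, not code from this patch), the same first-in/last-out logic can
be written with C11 atomics. The names perm_table, perm_users,
perm_init() and perm_uninit() are hypothetical stand-ins for the kernel
symbols, and the decrement on allocation failure is an assumption of
the sketch rather than something this patch does:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared cap/ecap permission tables. */
static int *perm_table;

/* Counts the users of perm_table, similar in role to vfio_pci_perm_bits_users. */
static atomic_int perm_users;

/* Only the first caller allocates; later callers just take a reference. */
static int perm_init(void)
{
	/* atomic_fetch_add() returns the old value, so 0 means "first user". */
	if (atomic_fetch_add(&perm_users, 1) != 0)
		return 0;

	perm_table = calloc(16, sizeof(*perm_table));
	if (!perm_table) {
		/* Assumption of this sketch: drop the reference on failure. */
		atomic_fetch_sub(&perm_users, 1);
		return -1;
	}

	return 0;
}

/* Only the last caller frees; earlier callers just drop a reference. */
static void perm_uninit(void)
{
	int prev = atomic_fetch_sub(&perm_users, 1);

	assert(prev > 0);	/* an uninit without a matching init is a bug */
	if (prev != 1)
		return;

	free(perm_table);
	perm_table = NULL;
}
```

With two perm_init() calls followed by two perm_uninit() calls, the
table is still allocated after the first perm_uninit() and is freed by
the second. Note that the counter by itself does not make a second
caller wait until the first caller has finished initializing.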

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 52963a9..2f44d8f 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -995,11 +995,17 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
 	return 0;
 }
 
+/* Track the number of users of the cap/ecap perm_bits */
+atomic_t vfio_pci_perm_bits_users = ATOMIC_INIT(0);
+
 /*
  * Initialize the shared permission tables
  */
 void vfio_pci_uninit_perm_bits(void)
 {
+	if (atomic_dec_return(&vfio_pci_perm_bits_users))
+		return;
+
 	free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
 
 	free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
@@ -1016,6 +1022,9 @@ int __init vfio_pci_init_perm_bits(void)
 {
 	int ret;
 
+	if (atomic_inc_return(&vfio_pci_perm_bits_users) != 1)
+		return 0;
+
 	/* Basic config space */
 	ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
 
-- 
2.7.4


* [PATCH v1 9/9] samples: add vfio-mdev-pci driver
  2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (7 preceding siblings ...)
  2019-06-08 13:21 ` [PATCH v1 8/9] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op Liu Yi L
@ 2019-06-08 13:21 ` Liu Yi L
  2019-06-20  4:26   ` Alex Williamson
  8 siblings, 1 reply; 26+ messages in thread
From: Liu Yi L @ 2019-06-08 13:21 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, Masahiro Yamada

This patch adds a sample driver named vfio-mdev-pci, which wraps a
PCI device as a mediated device. Once a PCI device is bound to the
vfio-mdev-pci driver, user-space access to the device goes through
the vfio mdev framework. The device is then managed in the usual
mdev way, e.g. the user must create an mdev before exposing the
device to user space.

The new driver serves as a sample user of the recent changes from
the "vfio/mdev: IOMMU aware mediated device" patchset. It could also
be a good driver for experimenting with future device-specific mdev
migration support.

To use this driver:
a) build and load the vfio-mdev-pci.ko module
   run "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
   then load the modules with the following commands
   > sudo modprobe vfio
   > sudo modprobe vfio-pci
   > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko

b) unbind the original device driver
   e.g. use the following command to unbind the original driver
   > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind

c) bind the vfio-mdev-pci driver to the physical device
   > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id

d) check the supported mdev types
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
     vfio-mdev-pci-type1
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type1/
     available_instances  create  device_api  devices  name

e) create an mdev on this physical device (only 1 instance)
   > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
     /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type1/create

f) pass the mdev through to the guest
   add the following option to the QEMU command line
   -device vfio-pci,\
    sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003

g) destroy the mdev
   > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
     remove

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Makefile        |   6 +
 drivers/vfio/pci/vfio_mdev_pci.c | 403 +++++++++++++++++++++++++++++++++++++++
 samples/Kconfig                  |  11 ++
 3 files changed, 420 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_mdev_pci.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d94317a..ac118ef 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,4 +5,10 @@ vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
+vfio-mdev-pci-y := vfio_mdev_pci.o vfio_pci_common.o vfio_pci_intrs.o \
+			vfio_pci_rdwr.o vfio_pci_config.o
+vfio-mdev-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-mdev-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
new file mode 100644
index 0000000..07c8067
--- /dev/null
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -0,0 +1,403 @@
+/*
+ * Copyright © 2019 Intel Corporation.
+ *     Author: Liu, Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_pci.c:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+#include <linux/mdev.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Liu, Yi L <yi.l.liu@intel.com>"
+#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
+
+#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool nointxmask;
+module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nointxmask,
+		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
+
+#ifdef CONFIG_VFIO_PCI_VGA
+static bool disable_vga;
+module_param(disable_vga, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
+#endif
+
+static bool disable_idle_d3;
+module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_idle_d3,
+		 "Disable using the PCI D3 low power state for idle, unused devices");
+
+static struct pci_driver vfio_mdev_pci_driver;
+
+static ssize_t
+name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%s-type1\n", dev_name(dev));
+}
+
+MDEV_TYPE_ATTR_RO(name);
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%d\n", 1);
+}
+
+MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+		char *buf)
+{
+	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+
+MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *vfio_mdev_pci_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_device_api.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group vfio_mdev_pci_type_group1 = {
+	.name  = "type1",
+	.attrs = vfio_mdev_pci_types_attrs,
+};
+
+struct attribute_group *vfio_mdev_pci_type_groups[] = {
+	&vfio_mdev_pci_type_group1,
+	NULL,
+};
+
+struct vfio_mdev_pci {
+	struct vfio_pci_device *vdev;
+	struct mdev_device *mdev;
+	unsigned long handle;
+};
+
+static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct device *pdev;
+	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci *pmdev;
+	int ret;
+
+	pdev = mdev_parent_dev(mdev);
+	vdev = dev_get_drvdata(pdev);
+	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
+	if (pmdev == NULL) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	pmdev->mdev = mdev;
+	pmdev->vdev = vdev;
+	mdev_set_drvdata(mdev, pmdev);
+	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
+	if (ret) {
+		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
+			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+static int vfio_mdev_pci_remove(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	kfree(pmdev);
+	pr_info("%s, succeeded for mdev: %s\n", __func__,
+		     dev_name(mdev_dev(mdev)));
+
+	return 0;
+}
+
+static int vfio_mdev_pci_open(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_spapr_pci_eeh_open(vdev->pdev);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (!ret)
+		pr_info("Succeeded to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+	else {
+		pr_info("Failed to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+		module_put(THIS_MODULE);
+	}
+	return ret;
+}
+
+static void vfio_mdev_pci_release(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+
+	pr_info("Release mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!(--vdev->refcnt)) {
+		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_pci_disable(vdev);
+	}
+
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			     unsigned long arg)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
+}
+
+static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
+				struct vm_area_struct *vma)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_mmap(pmdev->vdev, vma);
+}
+
+static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
+}
+
+static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
+				const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
+}
+
+static const struct mdev_parent_ops vfio_mdev_pci_ops = {
+	.supported_type_groups	= vfio_mdev_pci_type_groups,
+	.create			= vfio_mdev_pci_create,
+	.remove			= vfio_mdev_pci_remove,
+
+	.open			= vfio_mdev_pci_open,
+	.release		= vfio_mdev_pci_release,
+
+	.read			= vfio_mdev_pci_read,
+	.write			= vfio_mdev_pci_write,
+	.mmap			= vfio_mdev_pci_mmap,
+	.ioctl			= vfio_mdev_pci_ioctl,
+};
+
+static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
+				       const struct pci_device_id *id)
+{
+	struct vfio_pci_device *vdev;
+	int ret;
+
+	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	/*
+	 * Prevent binding to PFs with VFs enabled, this too easily allows
+	 * userspace instance with VFs and PFs from the same device, which
+	 * cannot work.  Disabling SR-IOV here would initiate removing the
+	 * VFs, which would unbind the driver, which is prone to blocking
+	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
+	 * reject these PFs and let the user sort it out.
+	 */
+	if (pci_num_vf(pdev)) {
+		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
+		return -EBUSY;
+	}
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	mutex_init(&vdev->ioeventfds_lock);
+	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
+
+	pci_set_drvdata(pdev, vdev);
+
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		pci_set_drvdata(pdev, NULL);
+		kfree(vdev);
+		return ret;
+	}
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
+		vga_set_legacy_decoding(pdev,
+					vfio_pci_set_vga_decode(vdev, false));
+	}
+
+	vfio_pci_probe_power_state(vdev);
+
+	if (!vdev->disable_idle_d3) {
+		/*
+		 * pci-core sets the device power state to an unknown value at
+		 * bootup and after being removed from a driver.  The only
+		 * transition it allows from this unknown state is to D0, which
+		 * typically happens when a driver calls pci_enable_device().
+		 * We're not ready to enable the device yet, but we do want to
+		 * be able to get to D3.  Therefore first do a D0 transition
+		 * before going to D3.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	}
+
+	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	if (ret)
+		pr_err("Cannot register mdev for device %s\n",
+			dev_name(&pdev->dev));
+	else
+		pr_info("Wrap device %s as a mdev\n", dev_name(&pdev->dev));
+
+	return ret;
+}
+
+static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = pci_get_drvdata(pdev);
+	if (!vdev)
+		return;
+
+	vfio_pci_reflck_put(vdev->reflck);
+
+	kfree(vdev->region);
+	mutex_destroy(&vdev->ioeventfds_lock);
+
+	if (!disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	kfree(vdev->pm_save);
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, NULL, NULL, NULL);
+		vga_set_legacy_decoding(pdev,
+				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
+	}
+
+	kfree(vdev);
+}
+
+static struct pci_driver vfio_mdev_pci_driver = {
+	.name		= VFIO_MDEV_PCI_NAME,
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_mdev_pci_driver_probe,
+	.remove		= vfio_mdev_pci_driver_remove,
+	.err_handler	= &vfio_err_handlers,
+};
+
+static void __exit vfio_mdev_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_mdev_pci_driver);
+	vfio_pci_uninit_perm_bits();
+}
+
+static int __init vfio_mdev_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permission data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_mdev_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	vfio_pci_fill_ids(ids, &vfio_mdev_pci_driver);
+
+	return 0;
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_mdev_pci_init);
+module_exit(vfio_mdev_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/samples/Kconfig b/samples/Kconfig
index d63cc8a..d799ccd 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -161,4 +161,15 @@ config SAMPLE_VFS
 	  as mount API and statx().  Note that this is restricted to the x86
 	  arch whilst it accesses system calls that aren't yet in all arches.
 
+config SAMPLE_VFIO_MDEV_PCI
+	tristate "Sample driver for wrapping PCI device as a mdev"
+	depends on PCI && EVENTFD && VFIO_MDEV_DEVICE
+	select VFIO_VIRQFD
+	select IRQ_BYPASS_MANAGER
+	help
+	  Sample driver for wrapping a PCI device as a mdev. Once bound to
+	  this driver, device passthrough goes through the mdev path.
+
+	  If you don't know what to do here, say N.
+
 endif # SAMPLES
-- 
2.7.4



* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-08 13:21 ` [PATCH v1 9/9] smaples: add vfio-mdev-pci driver Liu Yi L
@ 2019-06-20  4:26   ` Alex Williamson
  2019-06-20 13:00     ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-06-20  4:26 UTC (permalink / raw)
  To: Liu Yi L
  Cc: kwankhede, kevin.tian, baolu.lu, yi.y.sun, joro, linux-kernel,
	kvm, Masahiro Yamada

On Sat,  8 Jun 2019 21:21:11 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds sample driver named vfio-mdev-pci. It is to wrap
> a PCI device as a mediated device. For a pci device, once bound
> to vfio-mdev-pci driver, user space access of this device will
> go through vfio mdev framework. The usage of the device follows
> mdev management method. e.g. user should create a mdev before
> exposing the device to user-space.
> 
> Benefit of this new driver would be acting as a sample driver
> for recent changes from "vfio/mdev: IOMMU aware mediated device"
> patchset. Also it could be a good experiment driver for future
> device specific mdev migration support.
> 
> To use this driver:
> a) build and load vfio-mdev-pci.ko module
>    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
>    then load it with following command
>    > sudo modprobe vfio
>    > sudo modprobe vfio-pci
>    > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko  
> 
> b) unbind original device driver
>    e.g. use following command to unbind its original driver
>    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> 
> c) bind vfio-mdev-pci driver to the physical device
>    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> 
> d) check the supported mdev instances
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
>      vfio-mdev-pci-type1
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
>      vfio-mdev-pci-type1/
>      available_instances  create  device_api  devices  name


I think the static type name here is a problem (and why does it
include "type1"?).  We generally consider that a type defines a
software compatible mdev, but in this case any PCI device wrapped in
vfio-mdev-pci gets the same mdev type.  This is only a sample driver,
but that's a bad precedent.  I've taken a stab at fixing this in the
patch below, using the PCI vendor ID, device ID, subsystem vendor ID,
subsystem device ID, class code, and revision to try to make the type
as specific to the physical device assigned as we can through PCI.
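
For illustration only, the naming scheme described above can be sketched in
user-space C. This is not the kernel code: struct pci_ident and
format_mdev_type_name() are hypothetical names made up for this sketch, and
only the format string is taken from the kasprintf() call in the patch below.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the identity fields read from struct pci_dev. */
struct pci_ident {
	uint16_t vendor;
	uint16_t device;
	uint16_t subsystem_vendor;
	uint16_t subsystem_device;
	uint32_t class_code;	/* 24-bit class code */
	uint8_t revision;
};

/*
 * Build the type name with the same format string as the patch.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_mdev_type_name(const struct pci_ident *id,
				 char *buf, size_t len)
{
	int n;

	assert(id && buf);
	n = snprintf(buf, len, "%04x:%04x:%04x:%04x:%06x:%02x",
		     (unsigned int)id->vendor,
		     (unsigned int)id->device,
		     (unsigned int)id->subsystem_vendor,
		     (unsigned int)id->subsystem_device,
		     (unsigned int)id->class_code,
		     (unsigned int)id->revision);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Two devices that differ in any of the six fields get different type names,
which is the point of the change: the type then says something about software
compatibility instead of being a fixed string.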

> 
> e)  create mdev on this physical device (only 1 instance)
>    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
>      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
>      vfio-mdev-pci-type1/create

Whoops, available_instances always reports 1 and it doesn't appear that
the create function prevents additional mdevs.  Also addressed in the
patch below.
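
A minimal user-space sketch of the single-instance accounting, assuming C11
atomics in place of the kernel's atomic_t. The names parent_state,
dec_if_positive(), mdev_create() and mdev_remove() are hypothetical;
dec_if_positive() only approximates atomic_dec_if_positive().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical per-parent state; mirrors the "avail" counter in the patch. */
struct parent_state {
	atomic_int avail;
};

/*
 * Approximation of atomic_dec_if_positive(): decrement only when the
 * result stays >= 0, and return the old value minus one either way.
 */
static int dec_if_positive(atomic_int *v)
{
	int old = atomic_load(v);

	while (old > 0 &&
	       !atomic_compare_exchange_weak(v, &old, old - 1))
		;	/* a failed CAS reloads old; retry */
	return old - 1;
}

/* Claim the single instance, or fail when none is left. */
static int mdev_create(struct parent_state *p)
{
	assert(p);
	if (dec_if_positive(&p->avail) < 0)
		return -ENOSPC;
	return 0;
}

/* Give the instance back so a later create can succeed. */
static void mdev_remove(struct parent_state *p)
{
	assert(p);
	atomic_fetch_add(&p->avail, 1);
}
```

With the counter initialised to 1, a second create fails with -ENOSPC until a
remove returns the instance. Any failure after the decrement has to restore
the counter, otherwise the instance is lost.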
 
> f) passthru the mdev to guest
>    add the following line in Qemu boot command
>    -device vfio-pci,\
>     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> 
> g) destroy mdev
>    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
>      remove
> 

I also found that unbinding the parent device doesn't unregister with
mdev, so it cannot be bound again, also fixed below.

However, the patch below just makes the mdev interface behave
correctly, I can't make it work on my system because commit
7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")
used iommu_attach_device() rather than iommu_attach_group() for non-aux
mdev iommu_device.  Is there a requirement that the mdev parent device
is in a singleton iommu group?  If this is a simplification, then
vfio-mdev-pci should not bind to devices where this is violated since
there's no way to use the device.  Can we support it though?

If I have two devices in the same group and bind them both to
vfio-mdev-pci, I end up with three groups, one for each mdev device and
the original physical device group.  vfio.c works with the mdev groups
and will try to match both groups to the container.  vfio_iommu_type1.c
also works with the mdev groups, except for the point where we actually
try to attach a group to a domain, which is the only window where we use
the iommu_device rather than the provided group, but we don't record
that anywhere.  Should struct vfio_group have a pointer to a reference
counted object that tracks the actual iommu_group attached, such that
we can determine that the group is already attached to the domain and
not try to attach again?  Ideally I'd be able to bind one device to
vfio-pci, the other to vfio-mdev-pci, and be able to use them both
within the same container.  It seems like this should be possible, it's
the same effective iommu configuration as if they were both bound to
vfio-pci.  Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
index 07c8067b3f73..09143d3e5473 100644
--- a/drivers/vfio/pci/vfio_mdev_pci.c
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
 
 static struct pci_driver vfio_mdev_pci_driver;
 
-static ssize_t
-name_show(struct kobject *kobj, struct device *dev, char *buf)
-{
-	return sprintf(buf, "%s-type1\n", dev_name(dev));
-}
-
-MDEV_TYPE_ATTR_RO(name);
+struct vfio_mdev_pci_device {
+	struct vfio_pci_device vdev;
+	struct mdev_parent_ops ops;
+	struct attribute_group *groups[2];
+	struct attribute_group attr;
+	atomic_t avail;
+};
 
 static ssize_t
 available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "%d\n", 1);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = pci_get_drvdata(to_pci_dev(dev));
+
+	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
 }
 
 MDEV_TYPE_ATTR_RO(available_instances);
@@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
 MDEV_TYPE_ATTR_RO(device_api);
 
 static struct attribute *vfio_mdev_pci_types_attrs[] = {
-	&mdev_type_attr_name.attr,
 	&mdev_type_attr_device_api.attr,
 	&mdev_type_attr_available_instances.attr,
 	NULL,
 };
 
-static struct attribute_group vfio_mdev_pci_type_group1 = {
-	.name  = "type1",
-	.attrs = vfio_mdev_pci_types_attrs,
-};
-
-struct attribute_group *vfio_mdev_pci_type_groups[] = {
-	&vfio_mdev_pci_type_group1,
-	NULL,
-};
-
 struct vfio_mdev_pci {
 	struct vfio_pci_device *vdev;
 	struct mdev_device *mdev;
-	unsigned long handle;
 };
 
 static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
 {
 	struct device *pdev;
-	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_mdev_pci *pmdev;
 	int ret;
 
 	pdev = mdev_parent_dev(mdev);
-	vdev = dev_get_drvdata(pdev);
+	vmdev = dev_get_drvdata(pdev);
+
+	if (atomic_dec_if_positive(&vmdev->avail) < 0)
+		return -ENOSPC;
+
 	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
-	if (pmdev == NULL) {
-		ret = -EBUSY;
-		goto out;
-	}
+	if (!pmdev)
+		return -ENOMEM;
 
 	pmdev->mdev = mdev;
-	pmdev->vdev = vdev;
+	pmdev->vdev = &vmdev->vdev;
 	mdev_set_drvdata(mdev, pmdev);
 	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
 	if (ret) {
 		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
 			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
-		goto out;
+		kfree(pmdev);
+		atomic_inc(&vmdev->avail);
+		return ret;
 	}
 
-out:
-	return ret;
+	return 0;
 }
 
 static int vfio_mdev_pci_remove(struct mdev_device *mdev)
 {
 	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
 
 	kfree(pmdev);
+	atomic_inc(&vmdev->avail);
 	pr_info("%s, succeeded for mdev: %s\n", __func__,
 		     dev_name(mdev_dev(mdev)));
 
@@ -237,24 +236,12 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
 	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
 }
 
-static const struct mdev_parent_ops vfio_mdev_pci_ops = {
-	.supported_type_groups	= vfio_mdev_pci_type_groups,
-	.create			= vfio_mdev_pci_create,
-	.remove			= vfio_mdev_pci_remove,
-
-	.open			= vfio_mdev_pci_open,
-	.release		= vfio_mdev_pci_release,
-
-	.read			= vfio_mdev_pci_read,
-	.write			= vfio_mdev_pci_write,
-	.mmap			= vfio_mdev_pci_mmap,
-	.ioctl			= vfio_mdev_pci_ioctl,
-};
-
 static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 				       const struct pci_device_id *id)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
+	const struct mdev_parent_ops *ops;
 	int ret;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -273,10 +260,38 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev)
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
 		return -ENOMEM;
 
+	vmdev->attr.name = kasprintf(GFP_KERNEL,
+				     "%04x:%04x:%04x:%04x:%06x:%02x",
+				     pdev->vendor, pdev->device,
+				     pdev->subsystem_vendor,
+				     pdev->subsystem_device, pdev->class,
+				     pdev->revision);
+	if (!vmdev->attr.name) {
+		kfree(vmdev);
+		return -ENOMEM;
+	}
+
+	atomic_set(&vmdev->avail, 1);
+
+	vmdev->attr.attrs = vfio_mdev_pci_types_attrs;
+	vmdev->groups[0] = &vmdev->attr;
+
+	vmdev->ops.supported_type_groups = vmdev->groups;
+	vmdev->ops.create = vfio_mdev_pci_create;
+	vmdev->ops.remove = vfio_mdev_pci_remove;
+	vmdev->ops.open	= vfio_mdev_pci_open;
+	vmdev->ops.release = vfio_mdev_pci_release;
+	vmdev->ops.read = vfio_mdev_pci_read;
+	vmdev->ops.write = vfio_mdev_pci_write;
+	vmdev->ops.mmap = vfio_mdev_pci_mmap;
+	vmdev->ops.ioctl = vfio_mdev_pci_ioctl;
+	ops = &vmdev->ops;
+
+	vdev = &vmdev->vdev;
 	vdev->pdev = pdev;
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	mutex_init(&vdev->igate);
@@ -289,7 +304,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 #endif
 	vdev->disable_idle_d3 = disable_idle_d3;
 
-	pci_set_drvdata(pdev, vdev);
+	pci_set_drvdata(pdev, vmdev);
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret) {
@@ -320,7 +335,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	ret = mdev_register_device(&pdev->dev, ops);
 	if (ret)
 		pr_err("Cannot register mdev for device %s\n",
 			dev_name(&pdev->dev));
@@ -332,12 +347,17 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 
 static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
 
-	vdev = pci_get_drvdata(pdev);
-	if (!vdev)
+	mdev_unregister_device(&pdev->dev);
+
+	vmdev = pci_get_drvdata(pdev);
+	if (!vmdev)
 		return;
 
+	vdev = &vmdev->vdev;
+
 	vfio_pci_reflck_put(vdev->reflck);
 
 	kfree(vdev->region);
@@ -355,7 +375,8 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
 	}
 
-	kfree(vdev);
+	kfree(vmdev->attr.name);
+	kfree(vmdev);
 }
 
 static struct pci_driver vfio_mdev_pci_driver = {


* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-20  4:26   ` Alex Williamson
@ 2019-06-20 13:00     ` Liu, Yi L
  2019-06-20 21:07       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Liu, Yi L @ 2019-06-20 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, June 20, 2019 12:27 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Sat,  8 Jun 2019 21:21:11 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > a PCI device as a mediated device. For a pci device, once bound
> > to vfio-mdev-pci driver, user space access of this device will
> > go through vfio mdev framework. The usage of the device follows
> > mdev management method. e.g. user should create a mdev before
> > exposing the device to user-space.
> >
> > Benefit of this new driver would be acting as a sample driver
> > for recent changes from "vfio/mdev: IOMMU aware mediated device"
> > patchset. Also it could be a good experiment driver for future
> > device specific mdev migration support.
> >
> > To use this driver:
> > a) build and load vfio-mdev-pci.ko module
> >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> >    then load it with following command
> >    > sudo modprobe vfio
> >    > sudo modprobe vfio-pci
> >    > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko
> >
> > b) unbind original device driver
> >    e.g. use following command to unbind its original driver
> >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> >
> > c) bind vfio-mdev-pci driver to the physical device
> >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> >
> > d) check the supported mdev instances
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
> >      vfio-mdev-pci-type1
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type1/
> >      available_instances  create  device_api  devices  name
> 
> 
> I think the static type name here is a problem (and why does it
> include "type1"?).  We generally consider that a type defines a
> software compatible mdev, but in this case any PCI device wrapped in
> vfio-mdev-pci gets the same mdev type.  This is only a sample driver,
> but that's a bad precedent.  I've taken a stab at fixing this in the
> patch below, using the PCI vendor ID, device ID, subsystem vendor ID,
> subsystem device ID, class code, and revision to try to make the type
> as specific to the physical device assigned as we can through PCI.

Thanks, it is much better than what I proposed.

> 
> >
> > e)  create mdev on this physical device (only 1 instance)
> >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type1/create
> 
> Whoops, available_instances always reports 1 and it doesn't appear that
> the create function prevents additional mdevs.  Also addressed in the
> patch below.

yep, thanks.

> 
> > f) passthru the mdev to guest
> >    add the following line in Qemu boot command
> >    -device vfio-pci,\
> >     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> >
> > g) destroy mdev
> >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> >      remove
> >
> 
> I also found that unbinding the parent device doesn't unregister with
> mdev, so it cannot be bound again, also fixed below.

Oops, good catch. :-)

> However, the patch below just makes the mdev interface behave
> correctly, I can't make it work on my system because commit
> 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")

What error did you encounter? I tested the patch with a device in a
singleton iommu group. I'm also looking for a suitable machine with
multiple devices in an iommu group so that I can test that case.

> used iommu_attach_device() rather than iommu_attach_group() for non-aux
> mdev iommu_device.  Is there a requirement that the mdev parent device
> is in a singleton iommu group?

I don't think there should be such a limitation. Per my understanding,
vfio-mdev-pci should also be able to bind to devices which share an
iommu group with other devices. vfio-pci works well for such devices.
And since the two drivers share most of their code, I think vfio-mdev-pci
should naturally support it as well.

> If this is a simplification, then
> vfio-mdev-pci should not bind to devices where this is violated since
> there's no way to use the device.  Can we support it though?

yeah, I think we need to support it.

> If I have two devices in the same group and bind them both to
> vfio-mdev-pci, I end up with three groups, one for each mdev device and
> the original physical device group.  vfio.c works with the mdev groups
> and will try to match both groups to the container.  vfio_iommu_type1.c
> also works with the mdev groups, except for the point where we actually
> try to attach a group to a domain, which is the only window where we use
> the iommu_device rather than the provided group, but we don't record
> that anywhere.  Should struct vfio_group have a pointer to a reference
> counted object that tracks the actual iommu_group attached, such that
> we can determine that the group is already attached to the domain and
> not try to attach again? 

Agreed, we need to avoid such a duplicate attach. Instead of adding a
reference counted object in vfio_group, I'm also considering the logic
below:

    /*
     * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
     * is initialized.
     */
    if (vfio_group->mdev_group) {
            /*
             * vfio_group->mdev_group being true means vfio_group->iommu_group
             * is not the actual iommu_group which is going to be attached to
             * the domain. To avoid a duplicate iommu_group attach, check
             * whether the actual iommu_group is already attached.
             * vfio_get_parent_iommu_group() is a newly added helper function
             * which returns the actual iommu_group going to be attached for
             * this mdev group.
             */
            p_iommu_group = vfio_get_parent_iommu_group(
                                            vfio_group->iommu_group);
            list_for_each_entry(d, &iommu->domain_list, next) {
                    if (find_iommu_group(d, p_iommu_group)) {
                            mutex_unlock(&iommu->lock);
                            /* skip group attach */
                    }
            }
    }
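
For comparison, the reference counted alternative could be modeled roughly as
below. This is a user-space sketch under stated assumptions, not vfio code:
struct domain, struct attached_group, domain_attach() and domain_detach() are
hypothetical, and an integer group_id stands in for a struct iommu_group
pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTACHED 8

/* Hypothetical record of one physical iommu group attached to a domain. */
struct attached_group {
	int group_id;	/* stand-in for struct iommu_group * */
	int refs;	/* number of vfio groups relying on this attachment */
};

struct domain {
	struct attached_group groups[MAX_ATTACHED];
	size_t count;
};

/*
 * Returns 1 if a real attach would be performed, 0 if the group was
 * already attached and only its reference count grew, -1 if the table
 * is full.
 */
static int domain_attach(struct domain *d, int group_id)
{
	size_t i;

	assert(d);
	for (i = 0; i < d->count; i++) {
		if (d->groups[i].group_id == group_id) {
			d->groups[i].refs++;
			return 0;
		}
	}
	if (d->count == MAX_ATTACHED)
		return -1;
	d->groups[d->count].group_id = group_id;
	d->groups[d->count].refs = 1;
	d->count++;
	return 1;
}

/*
 * Returns 1 if the last reference was dropped (a real detach would
 * happen), 0 if other users remain, -1 if the group is not attached.
 */
static int domain_detach(struct domain *d, int group_id)
{
	size_t i;

	assert(d);
	for (i = 0; i < d->count; i++) {
		if (d->groups[i].group_id != group_id)
			continue;
		if (--d->groups[i].refs > 0)
			return 0;
		/* Last user gone: move the tail entry into this slot. */
		d->count--;
		d->groups[i] = d->groups[d->count];
		return 1;
	}
	return -1;
}
```

In this model two mdev groups backed by the same physical group cause one
attach and one detach, which is the behaviour wanted when two devices from one
iommu group end up in the same container.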

> Ideally I'd be able to bind one device to
> vfio-pci, the other to vfio-mdev-pci, and be able to use them both
> within the same container.  It seems like this should be possible, it's
> the same effective iommu configuration as if they were both bound to
> vfio-pci.  Thanks,

Agreed. Will test it. And thanks for the fix patch below. I've tested it
with a device in a singleton iommu group. I still need to test the
scenario you mentioned above. :-)

Thanks,
Yi Liu

> 
> Alex
> 
> diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
> index 07c8067b3f73..09143d3e5473 100644
> --- a/drivers/vfio/pci/vfio_mdev_pci.c
> +++ b/drivers/vfio/pci/vfio_mdev_pci.c
> @@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
> 
>  static struct pci_driver vfio_mdev_pci_driver;
> 
> -static ssize_t
> -name_show(struct kobject *kobj, struct device *dev, char *buf)
> -{
> -	return sprintf(buf, "%s-type1\n", dev_name(dev));
> -}
> -
> -MDEV_TYPE_ATTR_RO(name);
> +struct vfio_mdev_pci_device {
> +	struct vfio_pci_device vdev;
> +	struct mdev_parent_ops ops;
> +	struct attribute_group *groups[2];
> +	struct attribute_group attr;
> +	atomic_t avail;
> +};
> 
>  static ssize_t
>  available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
>  {
> -	return sprintf(buf, "%d\n", 1);
> +	struct vfio_mdev_pci_device *vmdev;
> +
> +	vmdev = pci_get_drvdata(to_pci_dev(dev));
> +
> +	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
>  }
> 
>  MDEV_TYPE_ATTR_RO(available_instances);
> @@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
>  MDEV_TYPE_ATTR_RO(device_api);
> 
>  static struct attribute *vfio_mdev_pci_types_attrs[] = {
> -	&mdev_type_attr_name.attr,
>  	&mdev_type_attr_device_api.attr,
>  	&mdev_type_attr_available_instances.attr,
>  	NULL,
>  };
> 
> -static struct attribute_group vfio_mdev_pci_type_group1 = {
> -	.name  = "type1",
> -	.attrs = vfio_mdev_pci_types_attrs,
> -};
> -
> -struct attribute_group *vfio_mdev_pci_type_groups[] = {
> -	&vfio_mdev_pci_type_group1,
> -	NULL,
> -};
> -
>  struct vfio_mdev_pci {
>  	struct vfio_pci_device *vdev;
>  	struct mdev_device *mdev;
> -	unsigned long handle;
>  };
> 
>  static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
>  {
>  	struct device *pdev;
> -	struct vfio_pci_device *vdev;
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_mdev_pci *pmdev;
>  	int ret;
> 
>  	pdev = mdev_parent_dev(mdev);
> -	vdev = dev_get_drvdata(pdev);
> +	vmdev = dev_get_drvdata(pdev);
> +
> +	if (atomic_dec_if_positive(&vmdev->avail) < 0)
> +		return -ENOSPC;
> +
>  	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> -	if (pmdev == NULL) {
> -		ret = -EBUSY;
> -		goto out;
> -	}
> +	if (!pmdev)
> +		return -ENOMEM;
> 
>  	pmdev->mdev = mdev;
> -	pmdev->vdev = vdev;
> +	pmdev->vdev = &vmdev->vdev;
>  	mdev_set_drvdata(mdev, pmdev);
>  	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
>  	if (ret) {
>  		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
>  			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> -		goto out;
> +		kfree(pmdev);
> +		atomic_inc(&vmdev->avail);
> +		return ret;
>  	}
> 
> -out:
> -	return ret;
> +	return 0;
>  }
> 
>  static int vfio_mdev_pci_remove(struct mdev_device *mdev)
>  {
>  	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +	struct vfio_mdev_pci_device *vmdev;
> +
> +	vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
> 
>  	kfree(pmdev);
> +	atomic_inc(&vmdev->avail);
>  	pr_info("%s, succeeded for mdev: %s\n", __func__,
>  		     dev_name(mdev_dev(mdev)));
> 
> @@ -237,24 +236,12 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
>  	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
>  }
> 
> -static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> -	.supported_type_groups	= vfio_mdev_pci_type_groups,
> -	.create			= vfio_mdev_pci_create,
> -	.remove			= vfio_mdev_pci_remove,
> -
> -	.open			= vfio_mdev_pci_open,
> -	.release		= vfio_mdev_pci_release,
> -
> -	.read			= vfio_mdev_pci_read,
> -	.write			= vfio_mdev_pci_write,
> -	.mmap			= vfio_mdev_pci_mmap,
> -	.ioctl			= vfio_mdev_pci_ioctl,
> -};
> -
>  static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  				       const struct pci_device_id *id)
>  {
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_pci_device *vdev;
> +	const struct mdev_parent_ops *ops;
>  	int ret;
> 
>  	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> @@ -273,10 +260,38 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  		return -EBUSY;
>  	}
> 
> -	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> -	if (!vdev)
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (!vmdev)
>  		return -ENOMEM;
> 
> +	vmdev->attr.name = kasprintf(GFP_KERNEL,
> +				     "%04x:%04x:%04x:%04x:%06x:%02x",
> +				     pdev->vendor, pdev->device,
> +				     pdev->subsystem_vendor,
> +				     pdev->subsystem_device, pdev->class,
> +				     pdev->revision);
> +	if (!vmdev->attr.name) {
> +		kfree(vmdev);
> +		return -ENOMEM;
> +	}
> +
> +	atomic_set(&vmdev->avail, 1);
> +
> +	vmdev->attr.attrs = vfio_mdev_pci_types_attrs;
> +	vmdev->groups[0] = &vmdev->attr;
> +
> +	vmdev->ops.supported_type_groups = vmdev->groups;
> +	vmdev->ops.create = vfio_mdev_pci_create;
> +	vmdev->ops.remove = vfio_mdev_pci_remove;
> +	vmdev->ops.open	= vfio_mdev_pci_open;
> +	vmdev->ops.release = vfio_mdev_pci_release;
> +	vmdev->ops.read = vfio_mdev_pci_read;
> +	vmdev->ops.write = vfio_mdev_pci_write;
> +	vmdev->ops.mmap = vfio_mdev_pci_mmap;
> +	vmdev->ops.ioctl = vfio_mdev_pci_ioctl;
> +	ops = &vmdev->ops;
> +
> +	vdev = &vmdev->vdev;
>  	vdev->pdev = pdev;
>  	vdev->irq_type = VFIO_PCI_NUM_IRQS;
>  	mutex_init(&vdev->igate);
> @@ -289,7 +304,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  #endif
>  	vdev->disable_idle_d3 = disable_idle_d3;
> 
> -	pci_set_drvdata(pdev, vdev);
> +	pci_set_drvdata(pdev, vmdev);
> 
>  	ret = vfio_pci_reflck_attach(vdev);
>  	if (ret) {
> @@ -320,7 +335,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  		vfio_pci_set_power_state(vdev, PCI_D3hot);
>  	}
> 
> -	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
> +	ret = mdev_register_device(&pdev->dev, ops);
>  	if (ret)
>  		pr_err("Cannot register mdev for device %s\n",
>  			dev_name(&pdev->dev));
> @@ -332,12 +347,17 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> 
>  static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
>  {
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_pci_device *vdev;
> 
> -	vdev = pci_get_drvdata(pdev);
> -	if (!vdev)
> +	mdev_unregister_device(&pdev->dev);
> +
> +	vmdev = pci_get_drvdata(pdev);
> +	if (!vmdev)
>  		return;
> 
> +	vdev = &vmdev->vdev;
> +
>  	vfio_pci_reflck_put(vdev->reflck);
> 
>  	kfree(vdev->region);
> @@ -355,7 +375,8 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
>  				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
>  	}
> 
> -	kfree(vdev);
> +	kfree(vmdev->attr.name);
> +	kfree(vmdev);
>  }
> 
>  static struct pci_driver vfio_mdev_pci_driver = {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-20 13:00     ` Liu, Yi L
@ 2019-06-20 21:07       ` Alex Williamson
  2019-06-21 10:23         ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-06-20 21:07 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

On Thu, 20 Jun 2019 13:00:34 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, June 20, 2019 12:27 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Sat,  8 Jun 2019 21:21:11 +0800
> > Liu Yi L <yi.l.liu@intel.com> wrote:
> >   
> > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > a PCI device as a mediated device. For a pci device, once bound
> > > to vfio-mdev-pci driver, user space access of this device will
> > > go through vfio mdev framework. The usage of the device follows
> > > mdev management method. e.g. user should create a mdev before
> > > exposing the device to user-space.
> > >
> > > Benefit of this new driver would be acting as a sample driver
> > > for recent changes from "vfio/mdev: IOMMU aware mediated device"
> > > patchset. Also it could be a good experiment driver for future
> > > device specific mdev migration support.
> > >
> > > To use this driver:
> > > a) build and load vfio-mdev-pci.ko module
> > >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> > >    then load it with following command  
> > >    > sudo modprobe vfio
> > >    > sudo modprobe vfio-pci
> > >    > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko  
> > >
> > > b) unbind original device driver
> > >    e.g. use following command to unbind its original driver  
> > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> > >
> > > c) bind vfio-mdev-pci driver to the physical device  
> > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> > >
> > > d) check the supported mdev instances  
> > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
> > >      vfio-mdev-pci-type1  
> > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
> > >      vfio-mdev-pci-type1/
> > >      available_instances  create  device_api  devices  name  
> > 
> > 
> > I think the static type name here is a problem (and why does it
> > include "type1"?).  We generally consider that a type defines a
> > software compatible mdev, but in this case any PCI device wrapped in
> > vfio-mdev-pci gets the same mdev type.  This is only a sample driver,
> > but that's a bad precedent.  I've taken a stab at fixing this in the
> > patch below, using the PCI vendor ID, device ID, subsystem vendor ID,
> > subsystem device ID, class code, and revision to try to make the type
> > as specific to the physical device assigned as we can through PCI.  
> 
> Thanks, it is much better than what I proposed.
> 
> >   
> > >
> > > e)  create mdev on this physical device (only 1 instance)  
> > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
> > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > >      vfio-mdev-pci-type1/create  
> > 
> > Whoops, available_instances always reports 1 and it doesn't appear that
> > the create function prevents additional mdevs.  Also addressed in the
> > patch below.  
> 
> yep, thanks.
> 
> >   
> > > f) passthru the mdev to guest
> > >    add the following line in Qemu boot command
> > >    -device vfio-pci,\
> > >     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > >
> > > g) destroy mdev  
> > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
> > >      remove
> > >  
> > 
> > I also found that unbinding the parent device doesn't unregister with
> > mdev, so it cannot be bound again, also fixed below.  
> 
> Oops, good catch. :-)
> 
> > However, the patch below just makes the mdev interface behave
> > correctly, I can't make it work on my system because commit
> > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")  
> 
> What error did you encounter. I tested the patch with a device in a
> singleton iommu group. I'm also searching a proper machine with
> multiple devices in an iommu group and test it.

In vfio_iommu_type1, iommu backed mdev devices use the
iommu_attach_device() interface, which includes:

        if (iommu_group_device_count(group) != 1)
                goto out_unlock;

So it's impossible to use with non-singleton groups currently.
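
A minimal userspace sketch of the check described above, assuming
hypothetical names (model_group, model_attach_device) that stand in for
the kernel structures. The -EINVAL return value is an assumption of this
model. It only shows why a per-device attach refuses a group that holds
more than one device:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-in; not the kernel's struct iommu_group. */
struct model_group {
	int device_count;   /* number of devices sharing this group */
	const void *domain; /* domain currently attached, or NULL */
};

/*
 * Model of the per-device attach path: it only proceeds when the
 * group holds exactly one device, mirroring the quoted check.
 */
static int model_attach_device(struct model_group *group, const void *domain)
{
	if (group->device_count != 1)
		return -EINVAL; /* error code assumed for this model */

	group->domain = domain;
	return 0;
}
```

With device_count == 1 the attach succeeds; with two devices in the
group the call fails and the group's domain is left untouched.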

> > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > mdev iommu_device.  Is there a requirement that the mdev parent device
> > is in a singleton iommu group?  
> 
> I don't think there should have such limitation. Per my understanding,
> vfio-mdev-pci should also be able to bind to devices which shares
> iommu group with other devices. vfio-pci works well for such devices.
> And since the two drivers share most of the codes, I think vfio-mdev-pci
> should naturally support it as well.

Yes, the difference though is that vfio.c knows when devices are in the
same group, whereas for mdev, vfio.c only knows about the non-iommu backed
group, not the group that is actually used for the iommu backing.  So
we either need to enlighten vfio.c or further abstract those details in
vfio_iommu_type1.c.
 
> > If this is a simplification, then
> > vfio-mdev-pci should not bind to devices where this is violated since
> > there's no way to use the device.  Can we support it though?  
> 
> yeah, I think we need to support it.
> 
> > If I have two devices in the same group and bind them both to
> > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > the original physical device group.  vfio.c works with the mdev groups
> > and will try to match both groups to the container.  vfio_iommu_type1.c
> > also works with the mdev groups, except for the point where we actually
> > try to attach a group to a domain, which is the only window where we use
> > the iommu_device rather than the provided group, but we don't record
> > that anywhere.  Should struct vfio_group have a pointer to a reference
> > counted object that tracks the actual iommu_group attached, such that
> > we can determine that the group is already attached to the domain and
> > not try to attach again?   
> 
> Agreed, we need to avoid such duplicated attach. Instead of adding
> reference counted object in vfio_group. I'm also considering the logic
> below:
> 
>     /*
>      * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
>      * is initialized.
>      */
>     if (vfio_group->mdev_group) {
>          /*
>           * vfio_group->mdev_group being true means vfio_group->iommu_group
>           * is not the actual iommu_group which is going to be attached to
>           * the domain. To avoid a duplicate iommu_group attach, we need to
>           * check the actual iommu_group. vfio_get_parent_iommu_group() is a
>           * newly added helper function which returns the actual
>           * iommu_group going to be attached for this mdev group.
>           */
>          p_iommu_group = vfio_get_parent_iommu_group(vfio_group->iommu_group);
>          list_for_each_entry(d, &iommu->domain_list, next) {
>                  if (find_iommu_group(d, p_iommu_group)) {
>                          mutex_unlock(&iommu->lock);
>                          // skip group attach;
>                  }
>          }

We don't currently create a struct vfio_group for the parent, only for
the mdev iommu group.  The iommu_attach for an iommu backed mdev
doesn't leave any traces of where it is actually attached, we just
count on retracing our steps for the detach.  That's why I'm thinking
we need an object somewhere to track it and it needs to be reference
counted so that if both a vfio-mdev-pci device and a vfio-pci device
are using it, we leave it in place if either one is removed.
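
The reference-counted tracking idea can be sketched in userspace as
follows. All names here (model_backing_group, model_backing_get,
model_backing_put) are hypothetical and only illustrate the intended
behaviour: the real group is attached on the first user and detached
when the last user goes away.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tracking object; name and fields are illustrative only. */
struct model_backing_group {
	int refs;         /* users, e.g. vfio-pci or vfio-mdev-pci devices */
	int attached;     /* 1 while the backing group is attached */
	int attach_calls; /* how many real attaches were performed */
	int detach_calls; /* how many real detaches were performed */
};

/* Take a reference; only the first user performs the real attach. */
static void model_backing_get(struct model_backing_group *bg)
{
	if (bg->refs++ == 0) {
		bg->attached = 1;
		bg->attach_calls++;
	}
}

/* Drop a reference; only the last user performs the real detach. */
static void model_backing_put(struct model_backing_group *bg)
{
	if (bg->refs == 0)
		return; /* unbalanced put, nothing to do in this model */

	if (--bg->refs == 0) {
		bg->attached = 0;
		bg->detach_calls++;
	}
}
```

Two users taking a reference result in a single attach, and the group
stays attached until both have dropped their reference.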
 
> > Ideally I'd be able to bind one device to
> > vfio-pci, the other to vfio-mdev-pci, and be able to use them both
> > within the same container.  It seems like this should be possible, it's
> > the same effective iommu configuration as if they were both bound to
> > vfio-pci.  Thanks,  
> 
> Agreed. Will test it. And thanks for the fix patch below. I've test it
> with a device in a singleton iommu group. Need to test the scenario
> you mentioned above. :-)

Thanks!

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-20 21:07       ` Alex Williamson
@ 2019-06-21 10:23         ` Liu, Yi L
  2019-06-21 15:57           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Liu, Yi L @ 2019-06-21 10:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, June 21, 2019 5:08 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Thu, 20 Jun 2019 13:00:34 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, June 20, 2019 12:27 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > >
> > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > a PCI device as a mediated device. For a pci device, once bound
> > > > to vfio-mdev-pci driver, user space access of this device will
> > > > go through vfio mdev framework. The usage of the device follows
> > > > mdev management method. e.g. user should create a mdev before
> > > > exposing the device to user-space.
[...]
> >
> > > However, the patch below just makes the mdev interface behave
> > > correctly, I can't make it work on my system because commit
> > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")
> >
> > What error did you encounter. I tested the patch with a device in a
> > singleton iommu group. I'm also searching a proper machine with
> > multiple devices in an iommu group and test it.
> 
> In vfio_iommu_type1, iommu backed mdev devices use the
> iommu_attach_device() interface, which includes:
> 
>         if (iommu_group_device_count(group) != 1)
>                 goto out_unlock;
> 
> So it's impossible to use with non-singleton groups currently.

Hmmm, I think iommu_attach_device() is no longer a good fit for iommu
backed mdev devices. The purpose in this flow is to attach a device to a
domain, and there is no need to check whether the device is in a singleton
iommu group. I think it would be better to use __iommu_attach_device()
instead of iommu_attach_device().

Also, I found a potential mutex lock issue when using iommu_attach_device().
vfio_iommu_attach_group() uses iommu_group_for_each_dev() to loop over all
the devices in the group, and it holds group->mutex while doing so. Then
vfio_mdev_attach_domain() calls iommu_attach_device(), which also tries to
take group->mutex. This would be an issue. If you are fine with it, I may
post another patch for it. :-)
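
The locking pattern in question can be modelled in userspace like this.
It is only a sketch: model_mutex is a hypothetical lock that reports a
re-acquisition as -EDEADLK instead of hanging, and the other names are
stand-ins for the iteration helper and the per-device attach.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical non-recursive lock that reports re-entry instead of hanging. */
struct model_mutex {
	int held;
};

static int model_lock(struct model_mutex *m)
{
	if (m->held)
		return -EDEADLK; /* a real mutex would block forever here */
	m->held = 1;
	return 0;
}

static void model_unlock(struct model_mutex *m)
{
	m->held = 0;
}

struct model_lk_group {
	struct model_mutex mutex;
	int nr_devices;
};

/* Stand-in for the per-device attach: takes the group mutex itself. */
static int model_lk_attach_device(struct model_lk_group *g)
{
	int ret = model_lock(&g->mutex);

	if (ret)
		return ret;
	model_unlock(&g->mutex);
	return 0;
}

/* Stand-in for the iteration helper: holds the mutex across the callback. */
static int model_lk_for_each_dev(struct model_lk_group *g,
				 int (*fn)(struct model_lk_group *))
{
	int i, ret = model_lock(&g->mutex);

	if (ret)
		return ret;
	for (i = 0; i < g->nr_devices; i++) {
		ret = fn(g);
		if (ret)
			break;
	}
	model_unlock(&g->mutex);
	return ret;
}
```

Calling model_lk_attach_device() directly succeeds, while calling it as
the callback of model_lk_for_each_dev() hits the already-held mutex.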

> > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > is in a singleton iommu group?
> >
> > I don't think there should have such limitation. Per my understanding,
> > vfio-mdev-pci should also be able to bind to devices which shares
> > iommu group with other devices. vfio-pci works well for such devices.
> > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > should naturally support it as well.
> 
> Yes, the difference though is that vfio.c knows when devices are in the
> same group, which mdev vfio.c only knows about the non-iommu backed
> group, not the group that is actually used for the iommu backing.  So
> we either need to enlighten vfio.c or further abstract those details in
> vfio_iommu_type1.c.

I'm not sure it is necessary to introduce more changes to vfio.c or
vfio_iommu_type1.c. If it's only for the scenario in which two devices share
an iommu_group, I guess it could be supported by using __iommu_attach_device(),
which does no device counting for the group. But maybe I missed something
here. It would be great if you could elaborate a bit on it. :-)

> 
> > > If this is a simplification, then
> > > vfio-mdev-pci should not bind to devices where this is violated since
> > > there's no way to use the device.  Can we support it though?
> >
> > yeah, I think we need to support it.
> >
> > > If I have two devices in the same group and bind them both to
> > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > the original physical device group.  vfio.c works with the mdev groups
> > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > also works with the mdev groups, except for the point where we actually
> > > try to attach a group to a domain, which is the only window where we use
> > > the iommu_device rather than the provided group, but we don't record
> > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > counted object that tracks the actual iommu_group attached, such that
> > > we can determine that the group is already attached to the domain and
> > > not try to attach again?
> >
> > Agreed, we need to avoid such duplicated attach. Instead of adding
> > reference counted object in vfio_group. I'm also considering the logic
> > below:

After re-walking the code, I find that the duplicated attach will happen on
the vfio-mdev-pci device, as vfio_mdev_attach_domain() only attaches the
parent devices of iommu backed mdevs instead of all the devices within the
physical iommu_group. A vfio-pci device, by contrast, uses
iommu_attach_group(), which attaches all the devices within the iommu backed
group. The same applies to detach: vfio_mdev_detach_domain() detaches
selected devices instead of all devices within the iommu backed group.

> >     /*
> >      * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> >      * is initialized.
> >      */
> >     if (vfio_group->mdev_group) {
> >          /*
> >           * vfio_group->mdev_group being true means vfio_group->iommu_group
> >           * is not the actual iommu_group which is going to be attached to
> >           * the domain. To avoid a duplicate iommu_group attach, we need to
> >           * check the actual iommu_group. vfio_get_parent_iommu_group() is a
> >           * newly added helper function which returns the actual
> >           * iommu_group going to be attached for this mdev group.
> >           */
> >          p_iommu_group = vfio_get_parent_iommu_group(vfio_group->iommu_group);
> >          list_for_each_entry(d, &iommu->domain_list, next) {
> >                  if (find_iommu_group(d, p_iommu_group)) {
> >                          mutex_unlock(&iommu->lock);
> >                          // skip group attach;
> >                  }
> >          }
> 
> We don't currently create a struct vfio_group for the parent, only for
> the mdev iommu group.  The iommu_attach for an iommu backed mdev
> doesn't leave any traces of where it is actually attached, we just
> count on retracing our steps for the detach.  That's why I'm thinking
> we need an object somewhere to track it and it needs to be reference
> counted so that if both a vfio-mdev-pci device and a vfio-pci device
> are using it, we leave it in place if either one is removed.

Hmmm, here we are talking about tracking at the iommu_group level, though
there is no good idea yet on where the object should be placed. However, we
may need to track at the device level, as I mentioned in the paragraph above.
If not, there may be a sequencing issue. e.g. if the vfio-mdev-pci device is
attached first, the object will be initialized, and when the vfio-pci
device is attached, we will find that the attach should be skipped and just
increment the ref count. But actually it should not be skipped, since the
vfio-mdev-pci attach does not attach all devices within the iommu backed
group.

What's more, regarding the sIOV case, a parent device may have multiple
mdevs, and the mdevs may be assigned to the same VM. Thus there will be
multiple attaches on this parent device. This also makes me believe tracking
at the device level would be better.

> 
> > > Ideally I'd be able to bind one device to
> > > vfio-pci, the other to vfio-mdev-pci, and be able to use them both
> > > within the same container.  It seems like this should be possible, it's
> > > the same effective iommu configuration as if they were both bound to
> > > vfio-pci.  Thanks,
> >
> > Agreed. Will test it. And thanks for the fix patch below. I've test it
> > with a device in a singleton iommu group. Need to test the scenario
> > you mentioned above. :-)
> 
> Thanks!

You're welcome. :-)

> Alex

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-21 10:23         ` Liu, Yi L
@ 2019-06-21 15:57           ` Alex Williamson
  2019-06-24  8:20             ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-06-21 15:57 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

On Fri, 21 Jun 2019 10:23:10 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, June 21, 2019 5:08 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 20 Jun 2019 13:00:34 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > go through vfio mdev framework. The usage of the device follows
> > > > > mdev management method. e.g. user should create a mdev before
> > > > > exposing the device to user-space.  
> [...]
> > >  
> > > > However, the patch below just makes the mdev interface behave
> > > > correctly, I can't make it work on my system because commit
> > > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")  
> > >
> > > What error did you encounter. I tested the patch with a device in a
> > > singleton iommu group. I'm also searching a proper machine with
> > > multiple devices in an iommu group and test it.  
> > 
> > In vfio_iommu_type1, iommu backed mdev devices use the
> > iommu_attach_device() interface, which includes:
> > 
> >         if (iommu_group_device_count(group) != 1)
> >                 goto out_unlock;
> > 
> > So it's impossible to use with non-singleton groups currently.  
> 
> Hmmm, I think it is no longer good to use iommu_attach_device() for iommu
> backed mdev devices now. In this flow, the purpose here is to attach a device
> to a domain and no need to check whether the device is in a singleton iommu
> group. I think it would be better to use __iommu_attach_device() instead of
> iommu_attach_device().

That's static and unexported; it's intentionally not an exposed
interface.  We can't attach devices in the same group to separate
domains allocated through iommu_domain_alloc(), as this would violate the
iommu group isolation principles.
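
The isolation rule can be illustrated with a small userspace model,
assuming hypothetical names (model_iso_group, model_attach_group) and an
assumed -EBUSY error code. The point it shows is that a group is
attached as a unit, so every device in it ends up in the same domain and
a second, different domain is refused:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 4

/* Hypothetical model: a group and the per-device view of its domain. */
struct model_iso_group {
	const void *domain;                     /* domain of the whole group */
	int nr_devices;                         /* devices in the group */
	const void *dev_domain[MODEL_MAX_DEVS]; /* what each device sees */
};

/*
 * Group-level attach: all devices move to the same domain together.
 * Attaching to a different domain while one is set is refused.
 */
static int model_attach_group(struct model_iso_group *g, const void *domain)
{
	int i;

	if (g->domain && g->domain != domain)
		return -EBUSY; /* error code assumed for this model */

	g->domain = domain;
	for (i = 0; i < g->nr_devices && i < MODEL_MAX_DEVS; i++)
		g->dev_domain[i] = domain;
	return 0;
}
```

In this model there is deliberately no per-device attach, so two devices
of one group can never end up in different domains.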

> Also I found a potential mutex lock issue if using iommu_attach_device().
> In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev() to loop
> all the devices in the group. It holds group->mutex. And then vfio_mdev_attach_domain()
> calls iommu_attach_device() which also tries to get group->mutex. This would be
> an issue. If you are fine with it, I may post another patch for it. :-)

Gack, yes, please send a patch.

> > > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > > is in a singleton iommu group?  
> > >
> > > I don't think there should have such limitation. Per my understanding,
> > > vfio-mdev-pci should also be able to bind to devices which shares
> > > iommu group with other devices. vfio-pci works well for such devices.
> > > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > > should naturally support it as well.  
> > 
> > Yes, the difference though is that vfio.c knows when devices are in the
> > same group, which mdev vfio.c only knows about the non-iommu backed
> > group, not the group that is actually used for the iommu backing.  So
> > we either need to enlighten vfio.c or further abstract those details in
> > vfio_iommu_type1.c.  
> 
> Not sure if it is necessary to introduce more changes to vfio.c or
> vfio_iommu_type1.c. If it's only for the scenario which two devices share an
> iommu_group, I guess it could be supported by using __iommu_attach_device()
> which has no device counting for the group. But maybe I missed something
> here. It would be great if you can elaborate a bit for it. :-)

We need to use the group semantics; there's a reason
__iommu_attach_device() is not exposed, it's an internal helper.  I
think there's no way around it: we need to track the actual group
we're attaching to somewhere and have the smarts to re-use it for other
devices in the same group.
 
> > > > If this is a simplification, then
> > > > vfio-mdev-pci should not bind to devices where this is violated since
> > > > there's no way to use the device.  Can we support it though?  
> > >
> > > yeah, I think we need to support it.
> > >  
> > > > If I have two devices in the same group and bind them both to
> > > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > > the original physical device group.  vfio.c works with the mdev groups
> > > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > > also works with the mdev groups, except for the point where we actually
> > > > try to attach a group to a domain, which is the only window where we use
> > > > the iommu_device rather than the provided group, but we don't record
> > > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > > counted object that tracks the actual iommu_group attached, such that
> > > > we can determine that the group is already attached to the domain and
> > > > not try to attach again?  
> > >
> > > Agreed, we need to avoid such duplicated attach. Instead of adding
> > > reference counted object in vfio_group. I'm also considering the logic
> > > below:  
> 
> Re-walked the code, I find the duplicated attach will happen on the vfio-mdev-pci
> device as vfio_mdev_attach_domain() only attaches the parent devices of
> iommu backed mdevs instead of all the devices within the physical iommu_group.
> While for a vfio-pci device, it will use iommu_attach_group() which attaches all the
> devices within the iommu backed group. The same with detach,
> vfio_mdev_detach_domain() detaches selective devices instead of all devices within
> the iommu backed group.

Yep, that's not good, for the non-aux case we need to follow the usual
group semantics or else we're limited to singleton groups.

> > >     /*
> > >      * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> > >      * is initialized.
> > >      */
> > >     if (vfio_group->mdev_group) {
> > >          /*
> > >           * vfio_group->mdev_group being true means vfio_group->iommu_group
> > >           * is not the actual iommu_group which is going to be attached to
> > >           * the domain. To avoid a duplicate iommu_group attach, we need to
> > >           * check the actual iommu_group. vfio_get_parent_iommu_group() is a
> > >           * newly added helper function which returns the actual
> > >           * iommu_group going to be attached for this mdev group.
> > >           */
> > >          p_iommu_group = vfio_get_parent_iommu_group(vfio_group->iommu_group);
> > >          list_for_each_entry(d, &iommu->domain_list, next) {
> > >                  if (find_iommu_group(d, p_iommu_group)) {
> > >                          mutex_unlock(&iommu->lock);
> > >                          // skip group attach;
> > >                  }
> > >          }
> > 
> > We don't currently create a struct vfio_group for the parent, only for
> > the mdev iommu group.  The iommu_attach for an iommu backed mdev
> > doesn't leave any traces of where it is actually attached, we just
> > count on retracing our steps for the detach.  That's why I'm thinking
> > we need an object somewhere to track it and it needs to be reference
> > counted so that if both a vfio-mdev-pci device and a vfio-pci device
> > are using it, we leave it in place if either one is removed.  
> 
> Hmmm, here we are talking about tracking in iommu_group level though
> no good idea on where the object should  be placed yet. However, we may
> need to tack in device level as I mentioned in above paragraph. If not,
> there may be sequence issue. e.g. if vfio-mdev-pci device is attached
> firstly, then the object will be initialized, and when vfio-pci device is
> attached, we will find the attach should be skipped and just inc the ref count.
> But actually it should not be skipped since the vfio-mdev-pci attach does not
> attach all devices within the iommu backed group.

We can't do that though, the entire group needs to be attached.

> What's more, regards to sIOV case,  a parent devices may have multiple
> mdevs and the mdevs may be assigned to the same VM. Thus there will be multiple
> attach on this parent device. This also makes me believe track in device level would
> be better. 

The aux domain support essentially specifies that the device can be
attached to multiple domains, so I think we're ok for device-level
group attach there, but not for bare iommu backed devices.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-21 15:57           ` Alex Williamson
@ 2019-06-24  8:20             ` Liu, Yi L
  2019-06-28 15:07               ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Liu, Yi L @ 2019-06-24  8:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, June 21, 2019 11:58 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Fri, 21 Jun 2019 10:23:10 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, June 21, 2019 5:08 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Thu, 20 Jun 2019 13:00:34 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > >
> > > > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > > go through vfio mdev framework. The usage of the device follows
> > > > > > mdev management method. e.g. user should create a mdev before
> > > > > > exposing the device to user-space.
> > [...]
> > > >
> > > > > However, the patch below just makes the mdev interface behave
> > > > > correctly, I can't make it work on my system because commit
> > > > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")
> > > >
> > > > What error did you encounter. I tested the patch with a device in a
> > > > singleton iommu group. I'm also searching a proper machine with
> > > > multiple devices in an iommu group and test it.
> > >
> > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > iommu_attach_device() interface, which includes:
> > >
> > >         if (iommu_group_device_count(group) != 1)
> > >                 goto out_unlock;
> > >
> > > So it's impossible to use with non-singleton groups currently.
> >
> > Hmmm, I think it is no longer good to use iommu_attach_device() for iommu
> > backed mdev devices now. In this flow, the purpose here is to attach a device
> > to a domain and no need to check whether the device is in a singleton iommu
> > group. I think it would be better to use __iommu_attach_device() instead of
> > iommu_attach_device().
> 
> That's a static and unexported, it's intentionally not an exposed
> interface.  We can't attach devices in the same group to separate
> domains allocated through iommu_domain_alloc(), this would violate the
> iommu group isolation principles.

Got it. :-) Then it is not good to expose such an interface. But to support
devices in non-singleton iommu groups, we need a new interface which doesn't
count the devices but attaches all of them.

> > Also I found a potential mutex lock issue if using iommu_attach_device().
> > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev() to loop
> > all the devices in the group. It holds group->mutex. And then
> > vfio_mdev_attach_domain() calls iommu_attach_device() which also tries to
> > get group->mutex. This would be an issue. If you are fine with it, I may
> > post another patch for it. :-)
> 
> Gack, yes, please send a patch.

Will do, maybe together with the support for vfio-mdev-pci on devices in
non-singleton iommu groups.
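
The locking problem can be illustrated with a small userspace model. This is
only a sketch: toy_mutex, toy_group, toy_attach_device() and
toy_for_each_dev() are made-up names standing in for group->mutex, the iommu
group, iommu_attach_device() and iommu_group_for_each_dev(). The toy lock
returns -EDEADLK where a real non-recursive mutex would simply hang.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a non-recursive mutex: a second lock attempt while it
 * is held reports -EDEADLK instead of hanging, so the problem is visible. */
struct toy_mutex {
	bool held;
};

static int toy_lock(struct toy_mutex *m)
{
	if (m->held)
		return -EDEADLK;
	m->held = true;
	return 0;
}

static void toy_unlock(struct toy_mutex *m)
{
	m->held = false;
}

/* Hypothetical model of an iommu group: one mutex, a device count. */
struct toy_group {
	struct toy_mutex mutex;
	int ndevs;
};

/* Models iommu_attach_device(): takes the group mutex itself. */
static int toy_attach_device(struct toy_group *g)
{
	int ret = toy_lock(&g->mutex);

	if (ret)
		return ret;
	/* ... the actual attach would happen here ... */
	toy_unlock(&g->mutex);
	return 0;
}

/* Models iommu_group_for_each_dev(): runs fn() with the mutex held. */
static int toy_for_each_dev(struct toy_group *g,
			    int (*fn)(struct toy_group *g))
{
	int i, ret = toy_lock(&g->mutex);

	if (ret)
		return ret;
	for (i = 0; i < g->ndevs && !ret; i++)
		ret = fn(g);
	toy_unlock(&g->mutex);
	return ret;
}
```

Calling toy_attach_device() directly succeeds, while passing it as the
callback to toy_for_each_dev() fails on the nested lock attempt, which is
the pattern described above.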

> 
> > > > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > > > is in a singleton iommu group?
> > > >
> > > > I don't think there should have such limitation. Per my understanding,
> > > > vfio-mdev-pci should also be able to bind to devices which shares
> > > > iommu group with other devices. vfio-pci works well for such devices.
> > > > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > > > should naturally support it as well.
> > >
> > > Yes, the difference though is that vfio.c knows when devices are in the
> > > same group, which mdev vfio.c only knows about the non-iommu backed
> > > group, not the group that is actually used for the iommu backing.  So
> > > we either need to enlighten vfio.c or further abstract those details in
> > > vfio_iommu_type1.c.
> >
> > Not sure if it is necessary to introduce more changes to vfio.c or
> > vfio_iommu_type1.c. If it's only for the scenario which two devices share an
> > iommu_group, I guess it could be supported by using __iommu_attach_device()
> > which has no device counting for the group. But maybe I missed something
> > here. It would be great if you can elaborate a bit for it. :-)
> 
> We need to use the group semantics, there's a reason
> __iommu_attach_device() is not exposed, it's an internal helper.  I
> think there's no way around that we need to somewhere track the actual
> group we're attaching to and have the smarts to re-use it for other
> devices in the same group.

Hmmm, exposing __iommu_attach_device() is not good, let's forget it. :-)

> > > > > If this is a simplification, then
> > > > > vfio-mdev-pci should not bind to devices where this is violated since
> > > > > there's no way to use the device.  Can we support it though?
> > > >
> > > > yeah, I think we need to support it.
> > > >
> > > > > If I have two devices in the same group and bind them both to
> > > > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > > > the original physical device group.  vfio.c works with the mdev groups
> > > > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > > > also works with the mdev groups, except for the point where we actually
> > > > > try to attach a group to a domain, which is the only window where we use
> > > > > the iommu_device rather than the provided group, but we don't record
> > > > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > > > counted object that tracks the actual iommu_group attached, such that
> > > > > we can determine that the group is already attached to the domain and
> > > > > not try to attach again?
> > > >
> > > > Agreed, we need to avoid such duplicated attach. Instead of adding
> > > > reference counted object in vfio_group. I'm also considering the logic
> > > > below:
> >
> > Re-walked the code, I find the duplicated attach will happen on the vfio-mdev-pci
> > device as vfio_mdev_attach_domain() only attaches the parent devices of
> > iommu backed mdevs instead of all the devices within the physical iommu_group.
> > While for a vfio-pci device, it will use iommu_attach_group() which attaches all the
> > devices within the iommu backed group. The same with detach,
> > vfio_mdev_detach_domain() detaches selective devices instead of all devices
> > within the iommu backed group.
> 
> Yep, that's not good, for the non-aux case we need to follow the usual
> group semantics or else we're limited to singleton groups.

yep.

> 
> > > >     /*
> > > >       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> > > >       * is initialized.
> > > >       */
> > > >     if (vfio_group->mdev_group) {
> > > >          /*
> > > >            * vfio_group->mdev_group is true means vfio_group->iommu_group
> > > >            * is not the actual iommu_group which is going to be attached to
> > > >            * domain. To avoid duplicate iommu_group attach, needs to check if
> > > >            * the actual iommu_group. vfio_get_parent_iommu_group() is a
> > > >            * newly added helper function which returns the actual attach
> > > >            * iommu_group going to be attached for this mdev group.
> > > >               */
> > > >          p_iommu_group = vfio_get_parent_iommu_group(
> > > >                                                                          vfio_group->iommu_group);
> > > >          list_for_each_entry(d, &iommu->domain_list, next) {
> > > >                  if (find_iommu_group(d, p_iommu_group)) {
> > > >                          mutex_unlock(&iommu->lock);
> > > >                          // skip group attach;
> > > >                  }
> > > >          }
> > >
> > > We don't currently create a struct vfio_group for the parent, only for
> > > the mdev iommu group.  The iommu_attach for an iommu backed mdev
> > > doesn't leave any traces of where it is actually attached, we just
> > > count on retracing our steps for the detach.  That's why I'm thinking
> > > we need an object somewhere to track it and it needs to be reference
> > > counted so that if both a vfio-mdev-pci device and a vfio-pci device
> > > are using it, we leave it in place if either one is removed.
> >
> > Hmmm, here we are talking about tracking in iommu_group level though
> > no good idea on where the object should  be placed yet. However, we may
> > need to tack in device level as I mentioned in above paragraph. If not,
> > there may be sequence issue. e.g. if vfio-mdev-pci device is attached
> > firstly, then the object will be initialized, and when vfio-pci device is
> > attached, we will find the attach should be skipped and just inc the ref count.
> > But actually it should not be skipped since the vfio-mdev-pci attach does not
> > attach all devices within the iommu backed group.
> 
> We can't do that though, the entire group needs to be attached.

Agreed. Maybe we can add another interface which is similar to
iommu_attach_device(), but works for devices which are in non-singleton
groups. Then the attach for an iommu backed mdev will also result in a sound
attach of all the devices which share an iommu group with the parent device,
just like for vfio-pci devices. The object for tracking purposes may be
as below:

struct vfio_iommu_object {
	struct iommu_group *group;
	struct kref kref;
};

And I think it should be per-domain and per-iommu backed group since
aux-domain support allows an iommu backed group to be attached to
multiple domains. I'm considering whether it is ok to have a list in
vfio_domain. Before each domain attach, vfio should check in the list whether
the iommu backed group has been attached already. For vfio-pci devices, use
its iommu group to search the list. For vfio-mdev-pci devices, use the parent
device's iommu group to search. This avoids duplicate attach. Thoughts?
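
A rough userspace model of that lookup, only to make the idea concrete. All
names here (toy_iommu_object, toy_domain, toy_find(), toy_attach(),
toy_detach()) are hypothetical. Kernel code would use struct kref and
struct list_head rather than a plain int and a next pointer, and
hw_attach_calls merely stands in for the real group attach.

```c
#include <stdlib.h>

/* Hypothetical tracking object: one per backing group per domain. */
struct toy_iommu_object {
	int backing_group_id;		/* id of the iommu backed group */
	int refcount;			/* users sharing this attach */
	struct toy_iommu_object *next;
};

struct toy_domain {
	struct toy_iommu_object *attached;	/* list of attached groups */
	int hw_attach_calls;			/* counts real attaches */
};

static struct toy_iommu_object *toy_find(struct toy_domain *d, int id)
{
	struct toy_iommu_object *o;

	for (o = d->attached; o; o = o->next)
		if (o->backing_group_id == id)
			return o;
	return NULL;
}

/* Attach the backing group once; later users only take a reference. */
static int toy_attach(struct toy_domain *d, int backing_group_id)
{
	struct toy_iommu_object *o = toy_find(d, backing_group_id);

	if (o) {
		o->refcount++;
		return 0;
	}
	o = calloc(1, sizeof(*o));
	if (!o)
		return -1;
	o->backing_group_id = backing_group_id;
	o->refcount = 1;
	o->next = d->attached;
	d->attached = o;
	d->hw_attach_calls++;	/* stands in for the real group attach */
	return 0;
}

/* Drop a reference; the entry goes away when the last user leaves. */
static void toy_detach(struct toy_domain *d, int backing_group_id)
{
	struct toy_iommu_object **p, *o;

	for (p = &d->attached; (o = *p); p = &o->next) {
		if (o->backing_group_id != backing_group_id)
			continue;
		if (--o->refcount == 0) {
			*p = o->next;
			free(o);
		}
		return;
	}
}
```

With this shape, a vfio-pci device and a vfio-mdev-pci device whose parent
sits in the same iommu backed group would both call toy_attach() with the
same id, and the attach stays in place until both have called toy_detach().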
 
> > What's more, regards to sIOV case,  a parent devices may have multiple
> > mdevs and the mdevs may be assigned to the same VM. Thus there will be multiple
> > attach on this parent device. This also makes me believe track in device level would
> > be better.
> 
> The aux domain support essentially specifies that the device can be
> attached to multiple domains, so I think we're ok for device-level
> group attach there, but not for bare iommu backed devices.  Thanks,

Got it.

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-24  8:20             ` Liu, Yi L
@ 2019-06-28 15:07               ` Alex Williamson
  2019-07-03  8:25                 ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-06-28 15:07 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

On Mon, 24 Jun 2019 08:20:38 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, June 21, 2019 11:58 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Fri, 21 Jun 2019 10:23:10 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, June 21, 2019 5:08 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Thu, 20 Jun 2019 13:00:34 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > > >
> > > > > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > > > go through vfio mdev framework. The usage of the device follows
> > > > > > > mdev management method. e.g. user should create a mdev before
> > > > > > > exposing the device to user-space.  
> > > [...]  
> > > > >  
> > > > > > However, the patch below just makes the mdev interface behave
> > > > > > correctly, I can't make it work on my system because commit
> > > > > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")  
> > > > >
> > > > > What error did you encounter. I tested the patch with a device in a
> > > > > singleton iommu group. I'm also searching a proper machine with
> > > > > multiple devices in an iommu group and test it.  
> > > >
> > > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > > iommu_attach_device() interface, which includes:
> > > >
> > > >         if (iommu_group_device_count(group) != 1)
> > > >                 goto out_unlock;
> > > >
> > > > So it's impossible to use with non-singleton groups currently.  
> > >
> > > Hmmm, I think it is no longer good to use iommu_attach_device() for iommu
> > > backed mdev devices now. In this flow, the purpose here is to attach a device
> > > to a domain and no need to check whether the device is in a singleton iommu
> > > group. I think it would be better to use __iommu_attach_device() instead of
> > > iommu_attach_device().  
> > 
> > That's a static and unexported, it's intentionally not an exposed
> > interface.  We can't attach devices in the same group to separate
> > domains allocated through iommu_domain_alloc(), this would violate the
> > iommu group isolation principles.  
> 
> Go it. :-) Then not good to expose such interface. But to support devices in
> non-singleton iommu group, we need to have a new interface which doesn't
> count the devices but attach all the devices.

We have iommu_attach_group(), we just need to track which groups are
attached.
 
> > > Also I found a potential mutex lock issue if using iommu_attach_device().
> > > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev() to loop
> > > all the devices in the group. It holds group->mutex. And then
> > > vfio_mdev_attach_domain() calls iommu_attach_device() which also tries to
> > > get group->mutex. This would be an issue. If you are fine with it, I may
> > > post another patch for it. :-)
> > 
> > Gack, yes, please send a patch.  
> 
> Would do it, may be together with the support of vfio-mdev-pci on devices in
> non-singleton iommu group.
> 
> >   
> > > > > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > > > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > > > > is in a singleton iommu group?  
> > > > >
> > > > > I don't think there should have such limitation. Per my understanding,
> > > > > vfio-mdev-pci should also be able to bind to devices which shares
> > > > > iommu group with other devices. vfio-pci works well for such devices.
> > > > > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > > > > should naturally support it as well.  
> > > >
> > > > Yes, the difference though is that vfio.c knows when devices are in the
> > > > same group, which mdev vfio.c only knows about the non-iommu backed
> > > > group, not the group that is actually used for the iommu backing.  So
> > > > we either need to enlighten vfio.c or further abstract those details in
> > > > vfio_iommu_type1.c.  
> > >
> > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > vfio_iommu_type1.c. If it's only for the scenario which two devices share an
> > > iommu_group, I guess it could be supported by using __iommu_attach_device()
> > > which has no device counting for the group. But maybe I missed something
> > > here. It would be great if you can elaborate a bit for it. :-)  
> > 
> > We need to use the group semantics, there's a reason
> > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > think there's no way around that we need to somewhere track the actual
> > group we're attaching to and have the smarts to re-use it for other
> > devices in the same group.  
> 
> Hmmm, exposing __iommu_attach_device() is not good, let's forget it. :-)
> 
> > > > > > If this is a simplification, then
> > > > > > vfio-mdev-pci should not bind to devices where this is violated since
> > > > > > there's no way to use the device.  Can we support it though?  
> > > > >
> > > > > yeah, I think we need to support it.
> > > > >  
> > > > > > If I have two devices in the same group and bind them both to
> > > > > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > > > > the original physical device group.  vfio.c works with the mdev groups
> > > > > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > > > > also works with the mdev groups, except for the point where we actually
> > > > > > try to attach a group to a domain, which is the only window where we use
> > > > > > the iommu_device rather than the provided group, but we don't record
> > > > > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > > > > counted object that tracks the actual iommu_group attached, such that
> > > > > > we can determine that the group is already attached to the domain and
> > > > > > not try to attach again?  
> > > > >
> > > > > Agreed, we need to avoid such duplicated attach. Instead of adding
> > > > > reference counted object in vfio_group. I'm also considering the logic
> > > > > below:  
> > >
> > > Re-walked the code, I find the duplicated attach will happen on the vfio-mdev-pci
> > > device as vfio_mdev_attach_domain() only attaches the parent devices of
> > > iommu backed mdevs instead of all the devices within the physical iommu_group.
> > > While for a vfio-pci device, it will use iommu_attach_group() which attaches all the
> > > devices within the iommu backed group. The same with detach,
> > > vfio_mdev_detach_domain() detaches selective devices instead of all devices
> > > within the iommu backed group.
> > 
> > Yep, that's not good, for the non-aux case we need to follow the usual
> > group semantics or else we're limited to singleton groups.  
> 
> yep.
> 
> >   
> > > > >     /*
> > > > >       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> > > > >       * is initialized.
> > > > >       */
> > > > >     if (vfio_group->mdev_group) {
> > > > >          /*
> > > > >            * vfio_group->mdev_group is true means vfio_group->iommu_group
> > > > >            * is not the actual iommu_group which is going to be attached to
> > > > >            * domain. To avoid duplicate iommu_group attach, needs to check if
> > > > >            * the actual iommu_group. vfio_get_parent_iommu_group() is a
> > > > >            * newly added helper function which returns the actual attach
> > > > >            * iommu_group going to be attached for this mdev group.
> > > > >               */
> > > > >          p_iommu_group = vfio_get_parent_iommu_group(
> > > > >                                                                          vfio_group->iommu_group);
> > > > >          list_for_each_entry(d, &iommu->domain_list, next) {
> > > > >                  if (find_iommu_group(d, p_iommu_group)) {
> > > > >                          mutex_unlock(&iommu->lock);
> > > > >                          // skip group attach;
> > > > >                  }
> > > > >          }  
> > > >
> > > > We don't currently create a struct vfio_group for the parent, only for
> > > > the mdev iommu group.  The iommu_attach for an iommu backed mdev
> > > > doesn't leave any traces of where it is actually attached, we just
> > > > count on retracing our steps for the detach.  That's why I'm thinking
> > > > we need an object somewhere to track it and it needs to be reference
> > > > counted so that if both a vfio-mdev-pci device and a vfio-pci device
> > > > are using it, we leave it in place if either one is removed.  
> > >
> > > Hmmm, here we are talking about tracking in iommu_group level though
> > > no good idea on where the object should  be placed yet. However, we may
> > > need to tack in device level as I mentioned in above paragraph. If not,
> > > there may be sequence issue. e.g. if vfio-mdev-pci device is attached
> > > firstly, then the object will be initialized, and when vfio-pci device is
> > > attached, we will find the attach should be skipped and just inc the ref count.
> > > But actually it should not be skipped since the vfio-mdev-pci attach does not
> > > attach all devices within the iommu backed group.  
> > 
> > We can't do that though, the entire group needs to be attached.  
> 
> Agree, may be getting another interface which is similar with
> iommu_attach_device(), but works for devices which is in non-singleton
> groups. So the attach for iommu backed mdev will also result in a sound
> attach to all the devices which share iommu group with the parent device.

iommu_attach_group()...

> This is just like vfio-pci devices. For the object for tracking purpose may be
> as below:
> 
> struct vfio_iommu_object {
> 	struct iommu_group *group;
> 	struct kref kref;
> };
> 
> And I think it should be per-domain and per-iommu backed group since
> aux-domain support allows a iommu backed group to be attached to
> multiple domains. I'm considering if it is ok to have a list in vfio_domain.
> Before each domain attach, vfio should do a check in the list if the iommu
> backed group has been attached already. For vfio-pci devices, use its iommu
> group to do a search in the list. For vfio-mdev-pci devices, use its parent
> devices iommu group to do a search. Thus avoid duplicate attach. Thoughts?

vfio_iommu_type1 already creates a struct vfio_iommu per container,
which includes a linked list of struct vfio_domain objects, where each
vfio_domain has a list of struct vfio_group objects.  So we need to
include the iommu device's iommu group in that latter list somehow.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-06-28 15:07               ` Alex Williamson
@ 2019-07-03  8:25                 ` Liu, Yi L
  2019-07-03 17:22                   ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Liu, Yi L @ 2019-07-03  8:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

Thanks for the comments. I have four inline responses below, and one
of them needs your further help. :-)

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, June 28, 2019 11:08 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Mon, 24 Jun 2019 08:20:38 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, June 21, 2019 11:58 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Fri, 21 Jun 2019 10:23:10 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, June 21, 2019 5:08 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > >
> > > > > On Thu, 20 Jun 2019 13:00:34 +0000 "Liu, Yi L"
> > > > > <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci
> > > > > > > driver
> > > > > > >
> > > > > > > On Sat,  8 Jun 2019 21:21:11 +0800 Liu Yi L
> > > > > > > <yi.l.liu@intel.com> wrote:
> > > > > > >
> > > > > > > > This patch adds sample driver named vfio-mdev-pci. It is
> > > > > > > > to wrap a PCI device as a mediated device. For a pci
> > > > > > > > device, once bound to vfio-mdev-pci driver, user space
> > > > > > > > access of this device will go through vfio mdev framework.
> > > > > > > > The usage of the device follows mdev management method.
> > > > > > > > e.g. user should create a mdev before exposing the device to user-space.
> > > > [...]
> > > > > >
> > > > > > > However, the patch below just makes the mdev interface
> > > > > > > behave correctly, I can't make it work on my system because
> > > > > > > commit 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching
> > > > > > > group helpers")
> > > > > >
> > > > > > What error did you encounter. I tested the patch with a device
> > > > > > in a singleton iommu group. I'm also searching a proper
> > > > > > machine with multiple devices in an iommu group and test it.
> > > > >
> > > > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > > > iommu_attach_device() interface, which includes:
> > > > >
> > > > >         if (iommu_group_device_count(group) != 1)
> > > > >                 goto out_unlock;
> > > > >
> > > > > So it's impossible to use with non-singleton groups currently.
> > > >
> > > > Hmmm, I think it is no longer good to use iommu_attach_device()
> > > > for iommu backed mdev devices now. In this flow, the purpose here
> > > > is to attach a device to a domain and no need to check whether the
> > > > device is in a singleton iommu group. I think it would be better
> > > > to use __iommu_attach_device() instead of iommu_attach_device().
> > >
> > > That's a static and unexported, it's intentionally not an exposed
> > > interface.  We can't attach devices in the same group to separate
> > > domains allocated through iommu_domain_alloc(), this would violate
> > > the iommu group isolation principles.
> >
> > Go it. :-) Then not good to expose such interface. But to support
> > devices in non-singleton iommu group, we need to have a new interface
> > which doesn't count the devices but attach all the devices.
> 
> We have iommu_attach_group(), we just need to track which groups are attached.

yep.

> > > > Also I found a potential mutex lock issue if using iommu_attach_device().
> > > > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev()
> > > > to loop all the devices in the group. It holds group->mutex. And
> > > > then vfio_mdev_attach_domain() calls iommu_attach_device() which
> > > > also tries to get group->mutex.
> > > > This would be an issue. If you are fine with it, I may post
> > > > another patch for it. :-)
> > >
> > > Gack, yes, please send a patch.
> >
> > Would do it, may be together with the support of vfio-mdev-pci on
> > devices in non-singleton iommu group.
> >
> > >
> > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > the mdev parent device is in a singleton iommu group?
> > > > > >
> > > > > > I don't think there should have such limitation. Per my
> > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > devices which shares iommu group with other devices. vfio-pci works
> > > > > > > well for such devices.
> > > > > > And since the two drivers share most of the codes, I think
> > > > > > vfio-mdev-pci should naturally support it as well.
> > > > >
> > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > in the same group, which mdev vfio.c only knows about the
> > > > > non-iommu backed group, not the group that is actually used for
> > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > further abstract those details in vfio_iommu_type1.c.
> > > >
> > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > devices share an iommu_group, I guess it could be supported by
> > > > using __iommu_attach_device() which has no device counting for the
> > > > group. But maybe I missed something here. It would be great if you
> > > > can elaborate a bit for it. :-)
> > >
> > > We need to use the group semantics, there's a reason
> > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > think there's no way around that we need to somewhere track the
> > > actual group we're attaching to and have the smarts to re-use it for
> > > other devices in the same group.
> >
> > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > :-)
> >
> > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > bind to devices where this is violated since there's no way
> > > > > > > to use the device.  Can we support it though?
> > > > > > 
> > > > > > yeah, I think we need to support it.

I've already made the vfio-mdev-pci driver work for non-singleton iommu
groups. E.g. for devices in a single iommu group, I can bind the devices
to either vfio-pci or vfio-mdev-pci and then pass them through to a VM. And
it will fail if the user tries to pass through a vfio-mdev-pci device in the
vfio-pci manner, "-device vfio-pci,host=01:00.1". In other words, a
vfio-mdev-pci device can only be passed through via
"-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
we expect.

However, I encountered a problem when trying to prevent the user from
passing these devices through to different VMs. I've tried on my side, and I
can pass a vfio-pci device and a vfio-mdev-pci device through to different
VMs. But this operation should actually fail. If all the devices
are bound to vfio-pci, Qemu will open the iommu backed group. So
Qemu can check in vfio_get_group() whether a given group has already
been used by an AddressSpace (a.k.a. VM), and thus prevent the
user from passing these devices through to different VMs if the devices
are in the same iommu backed group. However, a vfio-mdev-pci
device has a new group and group ID, so Qemu will not be able to
detect whether the other devices (which share an iommu group with
the vfio-mdev-pci device) are passed through to existing VMs. This is
the major problem for vfio-mdev-pci to support non-singleton groups
on my side now. Even if all devices are bound to the vfio-mdev-pci driver,
Qemu is still unable to check, since each vfio-mdev-pci device
has a separate mdev group.

To fix it, Qemu may need to do more. E.g. if it tries to use a
non-singleton iommu backed group, it needs to check whether any mdev
group is created and used by an existing VM. It also needs to check whether
the iommu backed group is passed through to an existing VM when trying to
use a mdev group. For a singleton iommu backed group and an
aux-domain enabled physical device, passing mdev groups through to
different VMs is still allowed. To perform these checks, Qemu may need
to know whether a group is iommu backed and whether it is singleton.
Do you think it is good to expose such info to userspace? Or
any other idea? :-)
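
To make the check concrete, here is a rough sketch of what the userspace
side could look like if such info were exposed. Everything in it is an
assumption for illustration: struct toy_group_info and toy_can_assign() are
made-up names, none of these fields exist in Qemu or in the vfio uAPI today,
and the exception is read as requiring both a singleton backing group and an
aux-domain capable parent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group record; not an existing Qemu structure. */
struct toy_group_info {
	int group_id;		/* vfio group id seen by userspace */
	int backing_group_id;	/* iommu backed group behind it */
	bool singleton;		/* backing group holds one device */
	bool aux_domain;	/* parent device supports aux domains */
	int vm_id;		/* owning VM, or -1 if unused */
};

/* Return true if 'cand' may be given to VM 'vm_id' under the rules above. */
static bool toy_can_assign(const struct toy_group_info *in_use, size_t n,
			   const struct toy_group_info *cand, int vm_id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct toy_group_info *g = &in_use[i];

		/* Unused groups and groups of the same VM never conflict. */
		if (g->vm_id < 0 || g->vm_id == vm_id)
			continue;
		if (g->backing_group_id != cand->backing_group_id)
			continue;
		/* Same backing group, different VM: only mdevs on a
		 * singleton, aux-domain capable parent may be split. */
		if (cand->singleton && cand->aux_domain)
			continue;
		return false;
	}
	return true;
}
```

The point of the sketch is only that the check keys on the iommu backed
group rather than on the group ID Qemu opens, which is the piece of
information that is missing today.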

> > > > > >
> > > > > > > If I have two devices in the same group and bind them both
> > > > > > > to vfio-mdev-pci, I end up with three groups, one for each
> > > > > > > mdev device and the original physical device group.  vfio.c
> > > > > > > works with the mdev groups and will try to match both groups
> > > > > > > to the container.  vfio_iommu_type1.c also works with the
> > > > > > > mdev groups, except for the point where we actually try to
> > > > > > > attach a group to a domain, which is the only window where
> > > > > > > we use the iommu_device rather than the provided group, but
> > > > > > > we don't record that anywhere.  Should struct vfio_group
> > > > > > > have a pointer to a reference counted object that tracks the
> > > > > > > > actual iommu_group attached, such that we can determine that
> > > > > > > > the group is already attached to the domain and not try to
> > > > > > > > attach again?
> > > > > >
> > > > > > Agreed, we need to avoid such duplicated attach. Instead of
> > > > > > adding reference counted object in vfio_group. I'm also
> > > > > > considering the logic
> > > > > > below:
> > > >
> > > > Re-walked the code, I find the duplicated attach will happen on
> > > > the vfio-mdev-pci device as vfio_mdev_attach_domain() only
> > > > > attaches the parent devices of iommu backed mdevs instead of all the
> > > > > devices within the physical iommu_group.
> > > > While for a vfio-pci device, it will use iommu_attach_group()
> > > > which attaches all the devices within the iommu backed group. The
> > > > same with detach,
> > > > vfio_mdev_detach_domain() detaches selective devices instead of
> > > > > all devices within the iommu backed group.
> > >
> > > Yep, that's not good, for the non-aux case we need to follow the
> > > usual group semantics or else we're limited to singleton groups.
> >
> > yep.
> >
> > >
> > > > > >     /*
> > > > > >      * Do this check in vfio_iommu_type1_attach_group(), after
> > > > > >      * mdev_group is initialized.
> > > > > >      */
> > > > > >     if (vfio_group->mdev_group) {
> > > > > >             /*
> > > > > >              * vfio_group->mdev_group being true means
> > > > > >              * vfio_group->iommu_group is not the actual iommu_group
> > > > > >              * which is going to be attached to the domain. To avoid
> > > > > >              * a duplicate iommu_group attach, check whether the
> > > > > >              * actual iommu_group is already attached.
> > > > > >              * vfio_get_parent_iommu_group() is a newly added helper
> > > > > >              * which returns the iommu_group that is actually going
> > > > > >              * to be attached for this mdev group.
> > > > > >              */
> > > > > >             p_iommu_group =
> > > > > >                     vfio_get_parent_iommu_group(vfio_group->iommu_group);
> > > > > >             list_for_each_entry(d, &iommu->domain_list, next) {
> > > > > >                     if (find_iommu_group(d, p_iommu_group)) {
> > > > > >                             mutex_unlock(&iommu->lock);
> > > > > >                             // skip group attach;
> > > > > >                     }
> > > > > >             }
> > > > > >     }
> > > > >
> > > > > We don't currently create a struct vfio_group for the parent,
> > > > > only for the mdev iommu group.  The iommu_attach for an iommu
> > > > > backed mdev doesn't leave any traces of where it is actually
> > > > > attached, we just count on retracing our steps for the detach.
> > > > > That's why I'm thinking we need an object somewhere to track it
> > > > > and it needs to be reference counted so that if both a
> > > > > vfio-mdev-pci device and a vfio-pci device are using it, we
> > > > > leave it in place if either one is removed.
> > > >
> > > > Hmmm, here we are talking about tracking at the iommu_group level,
> > > > though I have no good idea yet on where the object should be placed.
> > > > However, we may need to track at the device level as I mentioned in
> > > > the paragraph above. If not, there may be a sequence issue. e.g. if
> > > > the vfio-mdev-pci device is attached first, then the object will be
> > > > initialized, and when the vfio-pci device is attached, we will find
> > > > the attach should be skipped and just inc the ref count.
> > > > But actually it should not be skipped since the vfio-mdev-pci
> > > > attach does not attach all devices within the iommu backed group.
> > >
> > > We can't do that though, the entire group needs to be attached.
> >
> > Agreed. Maybe we need another interface which is similar to
> > iommu_attach_device(), but works for devices which are in non-singleton
> > groups, so that the attach for an iommu backed mdev will also result in
> > a sound attach of all the devices which share an iommu group with the
> > parent device.
> 
> iommu_attach_group()...

got it. :-)

> > This is just like vfio-pci devices. The object for tracking
> > purposes may be as below:
> >
> > struct vfio_iommu_object {
> > 	struct iommu_group *group;
> > 	struct kref kref;
> > };
> >
> > And I think it should be per-domain and per-iommu backed group since
> > aux-domain support allows an iommu backed group to be attached to
> > multiple domains. I'm considering whether it is ok to have a list in
> > vfio_domain. Before each domain attach, vfio should check the list to
> > see if the iommu backed group has been attached already. For vfio-pci
> > devices, use the device's iommu group to search the list. For
> > vfio-mdev-pci devices, use the parent device's iommu group to search.
> > This avoids the duplicate attach. Thoughts?
> 
> vfio_iommu_type1 already creates a struct vfio_iommu per container, which
> includes a linked list of struct vfio_domain objects, where each vfio_domain has a
> list of struct vfio_group objects.  So we need to include the iommu device iommu
> group in that latter list somehow.
> Thanks,

Sure, will try it.
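
A rough userspace sketch of the layout Alex describes (container -> list
of domains -> list of groups), with each group entry also recording the
iommu group of its iommu_device. This is only a model, not kernel code:
all model_* names and the backing_group_id field are hypothetical, and
plain pointers stand in for struct list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; these are NOT the kernel's definitions. */
struct model_group {
	int group_id;          /* group userspace opened (mdev "virtual" group) */
	int backing_group_id;  /* group of the iommu_device actually attached */
	struct model_group *next;
};

struct model_domain {
	struct model_group *groups;  /* groups attached to this domain */
	struct model_domain *next;
};

struct model_container {
	struct model_domain *domains;  /* one list of domains per container */
};

/* Return the domain that already holds backing_group_id, or NULL. */
struct model_domain *model_find_backing_group(struct model_container *c,
					      int backing_group_id)
{
	struct model_domain *d;
	struct model_group *g;

	for (d = c->domains; d; d = d->next)
		for (g = d->groups; g; g = g->next)
			if (g->backing_group_id == backing_group_id)
				return d;
	return NULL;
}

/* Link a group entry at the head of a domain's group list. */
void model_domain_add_group(struct model_domain *d, struct model_group *g)
{
	assert(d != NULL && g != NULL);
	g->next = d->groups;
	d->groups = g;
}
```

With something like this, the attach path could call
model_find_backing_group() first and skip the second attach when a domain
in the same container already holds the backing group.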

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-03  8:25                 ` Liu, Yi L
@ 2019-07-03 17:22                   ` Alex Williamson
  2019-07-04  9:11                     ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-07-03 17:22 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

On Wed, 3 Jul 2019 08:25:25 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> Thanks for the comments. I have four inline responses below, and one
> of them needs your further help. :-)
>
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, June 28, 2019 11:08 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Mon, 24 Jun 2019 08:20:38 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, June 21, 2019 11:58 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Fri, 21 Jun 2019 10:23:10 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, June 21, 2019 5:08 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > > >
> > > > > > On Thu, 20 Jun 2019 13:00:34 +0000 "Liu, Yi L"
> > > > > > <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > Hi Alex,
> > > > > > >  
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci
> > > > > > > > driver
> > > > > > > >
> > > > > > > > On Sat,  8 Jun 2019 21:21:11 +0800 Liu Yi L
> > > > > > > > <yi.l.liu@intel.com> wrote:
> > > > > > > >  
> > > > > > > > > This patch adds sample driver named vfio-mdev-pci. It is
> > > > > > > > > to wrap a PCI device as a mediated device. For a pci
> > > > > > > > > device, once bound to vfio-mdev-pci driver, user space
> > > > > > > > > access of this device will go through vfio mdev framework.
> > > > > > > > > The usage of the device follows mdev management method.
> > > > > > > > > e.g. user should create a mdev before exposing the device to user-space.  
> > > > > [...]  
> > > > > > >  
> > > > > > > > However, the patch below just makes the mdev interface
> > > > > > > > behave correctly, I can't make it work on my system because
> > > > > > > > commit 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching
> > > > > > > > group helpers")  
> > > > > > >
> > > > > > > What error did you encounter? I tested the patch with a device
> > > > > > > in a singleton iommu group. I'm also looking for a suitable
> > > > > > > machine with multiple devices in an iommu group to test it on.
> > > > > >
> > > > > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > > > > iommu_attach_device() interface, which includes:
> > > > > >
> > > > > >         if (iommu_group_device_count(group) != 1)
> > > > > >                 goto out_unlock;
> > > > > >
> > > > > > So it's impossible to use with non-singleton groups currently.  
> > > > >
> > > > > Hmmm, I think it is no longer good to use iommu_attach_device()
> > > > > for iommu backed mdev devices now. In this flow, the purpose here
> > > > > is to attach a device to a domain and no need to check whether the
> > > > > device is in a singleton iommu group. I think it would be better
> > > > > to use __iommu_attach_device() instead of iommu_attach_device().  
> > > >
> > > > That's static and unexported, it's intentionally not an exposed
> > > > interface.  We can't attach devices in the same group to separate
> > > > domains allocated through iommu_domain_alloc(), this would violate
> > > > the iommu group isolation principles.  
> > >
> > > Got it. :-) Then it's not good to expose such an interface. But to
> > > support devices in a non-singleton iommu group, we need to have a new
> > > interface which doesn't count the devices but attaches all the devices.
> > 
> > We have iommu_attach_group(), we just need to track which groups are attached.  
> 
> yep.
> 
> > > > > Also I found a potential mutex lock issue if using iommu_attach_device().
> > > > > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev()
> > > > > to loop all the devices in the group. It holds group->mutex. And
> > > > > then vfio_mdev_attach_domain() calls iommu_attach_device(), which
> > > > > also tries to get group->mutex.
> > > > > This would be an issue. If you are fine with it, I may post
> > > > > another patch for it. :-)  
> > > >
> > > > Gack, yes, please send a patch.  
> > >
> > > Would do it, may be together with the support of vfio-mdev-pci on
> > > devices in non-singleton iommu group.
> > >  
> > > >  
> > > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > > the mdev parent device is in a singleton iommu group?  
> > > > > > >
> > > > > > > I don't think there should be such a limitation. Per my
> > > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > devices which share an iommu group with other devices. vfio-pci
> > > > > > > works well for such devices.
> > > > > > > And since the two drivers share most of the code, I think
> > > > > > > vfio-mdev-pci should naturally support it as well.
> > > > > >
> > > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > > in the same group, whereas with mdev, vfio.c only knows about the
> > > > > > non-iommu backed group, not the group that is actually used for
> > > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > > further abstract those details in vfio_iommu_type1.c.
> > > > >
> > > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > > devices share an iommu_group, I guess it could be supported by
> > > > > using __iommu_attach_device() which has no device counting for the
> > > > > group. But maybe I missed something here. It would be great if you
> > > > > can elaborate a bit for it. :-)  
> > > >
> > > > We need to use the group semantics, there's a reason
> > > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > > think there's no way around that we need to somewhere track the
> > > > actual group we're attaching to and have the smarts to re-use it for
> > > > other devices in the same group.  
> > >
> > > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > > :-)
> > >  
> > > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > > bind to devices where this is violated since there's no way
> > > > > > > > to use the device.  Can we support it though?  
> > > > > > > 
> > > > > > > yeah, I think we need to support it.  
> 
> I've already made the vfio-mdev-pci driver work for non-singleton iommu
> groups. e.g. for devices in a single iommu group, I can bind the devices
> to either vfio-pci or vfio-mdev-pci and then passthru them to a VM. And
> it will fail if the user tries to passthru a vfio-mdev-pci device in the
> vfio-pci manner "-device vfio-pci,host=01:00.1". In other words, a
> vfio-mdev-pci device can only be passed through via
> "-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
> we expect.
> 
> However, I encountered a problem when trying to prevent the user from
> passing these devices through to different VMs. I've tried on my side,
> and I can passthru a vfio-pci device and a vfio-mdev-pci device to
> different VMs. But actually this operation should fail. If all the
> devices are bound to vfio-pci, Qemu will open the iommu backed group. So
> Qemu can check if a given group has already been used by an
> AddressSpace (a.k.a. VM) in vfio_get_group(), and thus prevent the
> user from passing these devices through to different VMs if the devices
> are in the same iommu backed group. However, a vfio-mdev-pci device
> has a new group and group ID, so Qemu will not be able to detect
> whether the other devices (sharing an iommu group with the
> vfio-mdev-pci device) are passed through to existing VMs. This is the
> major problem for vfio-mdev-pci to support non-singleton groups
> on my side now. Even if all devices are bound to the vfio-mdev-pci
> driver, Qemu is still unable to check, since each vfio-mdev-pci device
> has a separate mdev group.
> 
> To fix it, Qemu may need to do more things. E.g. if it tries to use a
> non-singleton iommu backed group, it needs to check if any mdev
> group is created and used by an existing VM. Also it needs to check if
> the iommu backed group is passed through to an existing VM when trying
> to use an mdev group. For a singleton iommu backed group and an
> aux-domain enabled physical device, still allow passing mdev
> groups through to different VMs. To achieve these checks, Qemu may need
> to know whether a group is iommu backed and singleton
> or not. Do you think it is good to expose such info to userspace? Or
> any other idea? :-)

QEMU is never responsible for isolating a group, QEMU is just a
userspace driver, it's vfio's responsibility to prevent the user from
splitting groups in ways that are not allowed.  QEMU does not know the
true group association, it only knows the "virtual" group of the mdev
device.  QEMU will create a container and add the mdev virtual group to
the container.  In the kernel, the type1 backend should actually do an
iommu_attach_group(), attaching the iommu_device group to the domain.
When QEMU processes the next device, it will have a different group,
but (assuming no vIOMMU) it will try to attach it to the same
container, which should work because the iommu_device group backing the
mdev virtual group is already attached to this domain.

If we had two separate QEMU processes, each with an mdev device from a
common group at the iommu_device level, the type1 backend should fail
to attach the group to the container for the later caller.  I'd think
this should fail at the iommu_attach_group() call since the group we're
trying to attach is already attached to another domain.
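
As a rough illustration of the two cases above, here is a small
userspace model (not kernel code) of a tracker keyed by the iommu_device
group: a second attach for the same domain only takes a reference, while
an attach for a different domain fails.  The model_* names, the fixed
size table and the -EBUSY return value are all assumptions of the sketch;
it does not show what iommu_attach_group() itself does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_GROUPS 8

/* One record per backing (iommu_device) group; hypothetical model. */
struct model_attach {
	int backing_group_id;
	int domain_id;  /* domain currently holding the group */
	int refs;       /* users of the group within that domain */
};

struct model_tracker {
	struct model_attach slots[MODEL_MAX_GROUPS];
};

/* Find the live record for a backing group, or NULL if not attached. */
static struct model_attach *model_lookup(struct model_tracker *t,
					 int backing_group_id)
{
	size_t i;

	for (i = 0; i < MODEL_MAX_GROUPS; i++)
		if (t->slots[i].refs > 0 &&
		    t->slots[i].backing_group_id == backing_group_id)
			return &t->slots[i];
	return NULL;
}

int model_attach_group(struct model_tracker *t, int backing_group_id,
		       int domain_id)
{
	struct model_attach *a = model_lookup(t, backing_group_id);
	size_t i;

	if (a) {
		/* Held by another domain: refuse to split the group. */
		if (a->domain_id != domain_id)
			return -EBUSY;
		/* Same domain: re-use the existing attach. */
		a->refs++;
		return 0;
	}
	for (i = 0; i < MODEL_MAX_GROUPS; i++) {
		if (t->slots[i].refs == 0) {
			t->slots[i].backing_group_id = backing_group_id;
			t->slots[i].domain_id = domain_id;
			t->slots[i].refs = 1;
			return 0;
		}
	}
	return -ENOSPC;
}

int model_detach_group(struct model_tracker *t, int backing_group_id,
		       int domain_id)
{
	struct model_attach *a = model_lookup(t, backing_group_id);

	if (!a || a->domain_id != domain_id)
		return -ENOENT;
	a->refs--;  /* the record is free again once refs reaches 0 */
	assert(a->refs >= 0);
	return 0;
}
```

In this model the group stays attached until the last user in the owning
domain detaches, which is the reference counting behaviour discussed
earlier in the thread.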

It's really unfortunate that we don't have the mdev inheriting the
iommu group of the iommu_device so that userspace can really understand
this relationship.  A separate group makes sense for the aux-domain
case, and is (I guess) not a significant issue in the case of a
singleton iommu_device group, but it's pretty awkward here.  Perhaps
this is something we should correct in design of iommu backed mdevs.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-03 17:22                   ` Alex Williamson
@ 2019-07-04  9:11                     ` Liu, Yi L
  2019-07-05 15:55                       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Liu, Yi L @ 2019-07-04  9:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, July 4, 2019 1:22 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Wed, 3 Jul 2019 08:25:25 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > Thanks for the comments. Have four inline responses below. And one
> > of them need your further help. :-)

[...]

> > > > >
> > > > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > > > the mdev parent device is in a singleton iommu group?
> > > > > > > >
> > > > > > > > I don't think there should be such a limitation. Per my
> > > > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > > devices which share an iommu group with other devices. vfio-pci
> > > > > > > > works well for such devices.
> > > > > > > > And since the two drivers share most of the code, I think
> > > > > > > > vfio-mdev-pci should naturally support it as well.
> > > > > > >
> > > > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > > > in the same group, which mdev vfio.c only knows about the
> > > > > > > non-iommu backed group, not the group that is actually used for
> > > > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > > > further abstract those details in vfio_iommu_type1.c.
> > > > > >
> > > > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > > > devices share an iommu_group, I guess it could be supported by
> > > > > > using __iommu_attach_device() which has no device counting for the
> > > > > > group. But maybe I missed something here. It would be great if you
> > > > > > can elaborate a bit for it. :-)
> > > > >
> > > > > We need to use the group semantics, there's a reason
> > > > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > > > think there's no way around that we need to somewhere track the
> > > > > actual group we're attaching to and have the smarts to re-use it for
> > > > > other devices in the same group.
> > > >
> > > > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > > > :-)
> > > >
> > > > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > > > bind to devices where this is violated since there's no way
> > > > > > > > > to use the device.  Can we support it though?
> > > > > > > >
> > > > > > > > yeah, I think we need to support it.
> >
> > I've already made vfio-mdev-pci driver work for non-singleton iommu
> > group. e.g. for devices in a single iommu group, I can bind the devices
> > to either vfio-pci or vfio-mdev-pci and then passthru them to a VM. And
> > it will fail if user tries to passthru a vfio-mdev-pci device via vfio-pci
> > manner "-device vfio-pci,host=01:00.1". In other words, vfio-mdev-pci
> > device can only passthru via
> > "-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
> > we expect.
> >
> > However, I encountered a problem when trying to prevent user from
> > passthru these devices to different VMs. I've tried in my side, and I
> > can passthru vfio-pci device and vfio-mdev-pci device to different
> > VMs. But actually this operation should be failed. If all the devices
> > are bound to vfio-pci, Qemu will open iommu backed group. So
> > Qemu can check if a given group has already been used by an
> > AddressSpace (a.k.a. VM) in vfio_get_group() thus to prevent
> > user from passthru these devices to different VMs if the devices
> > are in the same iommu backed group. However, here for a
> > vfio-mdev-pci device, it has a new group and group ID, Qemu
> > will not be able to detect if the other devices (share iommu group
> > with vfio-mdev-pci device) are passthru to existing VMs. This is the
> > major problem for vfio-mdev-pci to support non-singleton group
> > in my side now. Even all devices are bound to vfio-mdev-pci driver,
> > Qemu is still unable to check since all the vfio-mdev-pci devices
> > have a separate mdev group.
> >
> > To fix it, may need Qemu to do more things. E.g. If it tries to use a
> > non-singleton iommu backed group, it needs to check if any mdev
> > group is created and used by an existing VM. Also it needs check if
> > iommu backed group is passthru to an existing VM when trying to
> > use a mdev group. For singleton iommu backed group and
> > aux-domain enabled physical device, still allow to passthru mdev
> > group to different VMs. To achieve these checks, Qemu may need
> > to have knowledge whether a group is iommu backed and singleton
> > or not. Do you think it is good to expose such info to userspace? or
> > any other idea? :-)
> 
> QEMU is never responsible for isolating a group, QEMU is just a
> userspace driver, it's vfio's responsibility to prevent the user from
> splitting groups in ways that are not allowed.  QEMU does not know the

yep, also my concern.

> true group association, it only knows the "virtual" group of the mdev
> device.  QEMU will create a container and add the mdev virtual group to
> the container.  In the kernel, the type1 backend should actually do an
> iommu_attach_group(), attaching the iommu_device group to the domain.
> When QEMU processes the next device, it will have a different group,
> but (assuming no vIOMMU) it will try to attach it to the same
> container, which should work because the iommu_device group backing the
> mdev virtual group is already attached to this domain.
> If we had two separate QEMU processes, each with an mdev device from a
> common group at the iommu_device level, the type1 backend should fail
> to attach the group to the container for the later caller.  I'd think
> this should fail at the iommu_attach_group() call since the group we're
> trying to attach is already attached to another domain.

Agree with you. At first, I wanted to fail it in a similar way to vfio-pci
devices. For vfio-pci devices from a common group, vfio will fail the
operation around the /dev/vfio/group_id open if the user tries to assign the
vfio-pci devices from a common group to multiple QEMU processes. Meanwhile,
QEMU avoids opening a /dev/vfio/group_id multiple times, so current vfio/QEMU
works well for non-singleton groups (no vIOMMU). Unfortunately, it looks like
we have no way to fail vfio-mdev-pci devices with a similar mechanism, as each
mdev has a separate group. So yes, I agree with you that we may fail it around
the group attach phase. Below is my draft idea:

In vfio_iommu_type1_attach_group(), we need to do the following checks.

if (mdev_group) {
	if (iommu_device group enabled aux-domain) {
		/*
		 * An aux-domain enabled iommu_group means the iommu_devices
		 * in this group are aux-domain enabled, e.g. SIOV capable
		 * devices. Also, I think an aux-domain enabled group is
		 * essentially a singleton group, as SIOV capable devices are
		 * required to be in a singleton group.
		 */
		iommu_aux_attach_device();
	} else {
		/*
		 * Need to check group->opened in vfio.c, just like
		 * vfio_group_fopen() does. A new VFIO interface may be
		 * required here since group->opened is private to vfio.c.
		 * vfio_iommu_device_group_opened_inc() will inc
		 * group->opened, so that another VM will fail when trying
		 * to open the group. Another VFIO interface is also
		 * required to dec group->opened when the VM goes down.
		 */
		if (vfio_iommu_device_group_opened_inc(iommu_device_group))
			return -EBUSY;
		iommu_attach_group(iommu_device_group);
	}
}

The concern here is the two new VFIO interfaces. Any thoughts on this proposal? :-)
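
To make the flow easier to poke at, below is a small userspace model of
the draft above. It is only a sketch under my own assumptions: the
model_* names are made up, "opened" is a plain int standing in for
group->opened, and no real IOMMU or VFIO call is made.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Which attach path the draft would take; hypothetical names. */
enum model_path {
	MODEL_PATH_GROUP,          /* not an mdev group: attach it directly */
	MODEL_PATH_AUX,            /* aux-domain capable iommu_device */
	MODEL_PATH_BACKING_GROUP,  /* attach the whole iommu_device group */
};

struct model_backing_group {
	bool aux_domain;  /* iommu_device supports aux-domain */
	int opened;       /* stand-in for group->opened in vfio.c */
};

/* Stand-in for vfio_iommu_device_group_opened_inc() from the draft. */
int model_backing_group_claim(struct model_backing_group *bg)
{
	if (bg->opened)
		return -EBUSY;
	bg->opened++;
	return 0;
}

/* Stand-in for the matching "dec" interface used at VM teardown. */
void model_backing_group_release(struct model_backing_group *bg)
{
	if (bg->opened)
		bg->opened--;
}

/* Pick the attach path; returns 0 and sets *path, or a negative errno. */
int model_attach(bool mdev_group, struct model_backing_group *bg,
		 enum model_path *path)
{
	int ret;

	assert(path != NULL);
	if (!mdev_group) {
		*path = MODEL_PATH_GROUP;
		return 0;
	}
	assert(bg != NULL);
	if (bg->aux_domain) {
		*path = MODEL_PATH_AUX;
		return 0;
	}
	ret = model_backing_group_claim(bg);
	if (ret)
		return ret;
	*path = MODEL_PATH_BACKING_GROUP;
	return 0;
}
```

One thing the model makes visible: with a bare counter, a second mdev
from the same iommu_device group also gets -EBUSY even when it belongs
to the same VM, so the real interface would probably need to know who
holds the group.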

> It's really unfortunate that we don't have the mdev inheriting the
> iommu group of the iommu_device so that userspace can really understand
> this relationship.  A separate group makes sense for the aux-domain
> case, and is (I guess) not a significant issue in the case of a
> singleton iommu_device group, but it's pretty awkward here.  Perhaps
> this is something we should correct in design of iommu backed mdevs.

Yeah, for the aux-domain case it is not a significant issue, as aux-domain
essentially means a singleton iommu_device group. And early on, when designing
the support for wrapping a pci device as an mdev, we also considered letting
vfio-mdev-pci reuse the iommu_device group. But this results in an iommu backed
group that includes both mdev and physical devices, which might also be strange.
Do you think it is worth reconsidering?

> Thanks,
> 
> Alex

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-04  9:11                     ` Liu, Yi L
@ 2019-07-05 15:55                       ` Alex Williamson
  2019-07-11 12:27                         ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-07-05 15:55 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

On Thu, 4 Jul 2019 09:11:02 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, July 4, 2019 1:22 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Wed, 3 Jul 2019 08:25:25 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >
> > > Thanks for the comments. Have four inline responses below. And one
> > > of them need your further help. :-)  
> 
> [...]
> 
> > > > > >  
> > > > > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > > > > the mdev parent device is in a singleton iommu group?  
> > > > > > > > >
> > > > > > > > > I don't think there should be such a limitation. Per my
> > > > > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > > > devices which share an iommu group with other devices. vfio-pci
> > > > > > > > > works well for such devices.
> > > > > > > > > And since the two drivers share most of the code, I think
> > > > > > > > > vfio-mdev-pci should naturally support it as well.
> > > > > > > >
> > > > > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > > > > in the same group, which mdev vfio.c only knows about the
> > > > > > > > non-iommu backed group, not the group that is actually used for
> > > > > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > > > > further abstract those details in vfio_iommu_type1.c.  
> > > > > > >
> > > > > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > > > > devices share an iommu_group, I guess it could be supported by
> > > > > > > using __iommu_attach_device() which has no device counting for the
> > > > > > > group. But maybe I missed something here. It would be great if you
> > > > > > > can elaborate a bit for it. :-)  
> > > > > >
> > > > > > We need to use the group semantics, there's a reason
> > > > > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > > > > think there's no way around that we need to somewhere track the
> > > > > > actual group we're attaching to and have the smarts to re-use it for
> > > > > > other devices in the same group.  
> > > > >
> > > > > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > > > > :-)
> > > > >  
> > > > > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > > > > bind to devices where this is violated since there's no way
> > > > > > > > > > to use the device.  Can we support it though?  
> > > > > > > > >
> > > > > > > > > yeah, I think we need to support it.  
> > >
> > > I've already made vfio-mdev-pci driver work for non-singleton iommu
> > > group. e.g. for devices in a single iommu group, I can bind the devices
> > > to either vfio-pci or vfio-mdev-pci and then passthru them to a VM. And
> > > it will fail if user tries to passthru a vfio-mdev-pci device via vfio-pci
> > > manner "-device vfio-pci,host=01:00.1". In other words, vfio-mdev-pci
> > > device can only passthru via
> > > "-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
> > > we expect.
> > >
> > > However, I encountered a problem when trying to prevent user from
> > > passthru these devices to different VMs. I've tried in my side, and I
> > > can passthru vfio-pci device and vfio-mdev-pci device to different
> > > VMs. But actually this operation should be failed. If all the devices
> > > are bound to vfio-pci, Qemu will open iommu backed group. So
> > > Qemu can check if a given group has already been used by an
> > > AddressSpace (a.k.a. VM) in vfio_get_group() thus to prevent
> > > user from passthru these devices to different VMs if the devices
> > > are in the same iommu backed group. However, here for a
> > > vfio-mdev-pci device, it has a new group and group ID, Qemu
> > > will not be able to detect if the other devices (share iommu group
> > > with vfio-mdev-pci device) are passthru to existing VMs. This is the
> > > major problem for vfio-mdev-pci to support non-singleton group
> > > in my side now. Even all devices are bound to vfio-mdev-pci driver,
> > > Qemu is still unable to check since all the vfio-mdev-pci devices
> > > have a separate mdev group.
> > >
> > > To fix it, may need Qemu to do more things. E.g. If it tries to use a
> > > non-singleton iommu backed group, it needs to check if any mdev
> > > group is created and used by an existing VM. Also it needs check if
> > > iommu backed group is passthru to an existing VM when trying to
> > > use a mdev group. For singleton iommu backed group and
> > > aux-domain enabled physical device, still allow to passthru mdev
> > > group to different VMs. To achieve these checks, Qemu may need
> > > to have knowledge whether a group is iommu backed and singleton
> > > or not. Do you think it is good to expose such info to userspace? or
> > > any other idea? :-)  
> > 
> > QEMU is never responsible for isolating a group, QEMU is just a
> > userspace driver, it's vfio's responsibility to prevent the user from
> > splitting groups in ways that are not allowed.  QEMU does not know the  
> 
> yep, also my concern.
> 
> > true group association, it only knows the "virtual" group of the mdev
> > device.  QEMU will create a container and add the mdev virtual group to
> > the container.  In the kernel, the type1 backend should actually do an
> > iommu_attach_group(), attaching the iommu_device group to the domain.
> > When QEMU processes the next device, it will have a different group,
> > but (assuming no vIOMMU) it will try to attach it to the same
> > container, which should work because the iommu_device group backing the
> > mdev virtual group is already attached to this domain.
> > If we had two separate QEMU processes, each with an mdev device from a
> > common group at the iommu_device level, the type1 backend should fail
> > to attach the group to the container for the later caller.  I'd think
> > this should fail at the iommu_attach_group() call since the group we're
> > trying to attach is already attached to another domain.  
> 
> Agree with you. At first, I want to fail it in similar way with vfio-pci devices.
> For vfio-pci devices from a common group, vfio will fail the operation around
> /dev/vfio/group_id open if user tries to assign the vfio-pci devices from common
> group to multiple QEMU processes. Meanwhile, QEMU will avoid to open a
> /dev/vfio/group_id multiple times, so current vfio/QEMU works well for 
> non- singleton group (no vIOMMU). Unfortunately, looks like we have no way
> to fail vfio-mdev-pci devices in similar mechanism as each mdev has a separate
> group. So yes, I agree with you that we may fail it around the group attach
> phase. Below is my draft idea:
> 
> In vfio_iommu_type1_attach_group(), we need to do the following checks.
> 
> if (mdev_group) {
> 	if (iommu_device group enabled aux-domain) {
> 		/*
> 		  * iommu_group enabled aux-domain means the iommu_devices
> 		  * in this group are aux-domain enabled. e.g. SIOV capable devices.
> 		  * Also, I think for aux-domain enabled group, it essentially means
> 		  * the group is a singleton group as SIOV capable devices require
> 		  * to be in a singleton group.
> 		  */
> 		 iommu_aux_attach_device();
> 	} else {
> 		/*
> 		  * needs to check the group->opened in vfio.c. Just like what
> 		  * vfio_group_fopen() does. May be a new VFIO interface required
> 		  * here since the group->opened is within vfio.c.
> 		  * vfio_iommu_device_group_opened_inc() will inc group->opened, so
> 		  * that other VM will fail when trying to open the group. And another
> 		  * VFIO interface is also required to dec group->opened when VM is
> 		  * down.
> 		  */
> 		if (vfio_iommu_device_group_opened_inc(iommu_device_group))
> 			return -EBUSY;
> 		iommu_attach_group(iommu_device_group);
> 	}
> }
> 
> The concern here is the two new VFIO interfaces. Any thoughts on this proposal? :-)
> 
> > It's really unfortunate that we don't have the mdev inheriting the
> > iommu group of the iommu_device so that userspace can really understand
> > this relationship.  A separate group makes sense for the aux-domain
> > case, and is (I guess) not a significant issue in the case of a
> > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > this is something we should correct in design of iommu backed mdevs.  
> 
> Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> means singleton iommu_device group. And in early time, when designing the support
> for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> iommu_device group. But this results in an iommu backed group includes mdev and
> physical devices, which might also be strange. Do you think it is valuable to reconsider
> it?

From a group perspective, the cleanest solution would seem to be that
IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
group of the iommu_device, but I think the barrier here is that we have
a difficult time determining if the group is "viable" in that case.
For example a group where one device is bound to a native host driver
and the other device bound to a vfio driver would typically be
considered non-viable as it breaks the isolation guarantees.  However
I think in this configuration, the parent device is effectively
participating in the isolation and "donating" its iommu group on behalf
of the mdev device.  I don't think we can simultaneously use that iommu
group for any other purpose.  I'm sure we could come up with a way for
vfio-core to understand this relationship and add it to the white list,
I wonder though how confusing this might be to users who now understand
the group/driver requirement to be "all endpoints bound to vfio
drivers".  This might still be the best approach regardless of this.
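
As an illustration only, the viability rule and the proposed exception can
be modelled in userspace C. This is a sketch, not kernel code: every
model_* name is hypothetical, the donates_group flag stands in for whatever
mechanism vfio-core would use to learn the relationship, and treating an
unbound device as acceptable is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which kind of driver a device in the group is bound to. */
enum model_drv { MODEL_DRV_NONE, MODEL_DRV_HOST, MODEL_DRV_VFIO };

struct model_dev {
	enum model_drv drv;
	/* Hypothetical flag: the host driver only acts on behalf of an mdev. */
	bool donates_group;
};

/*
 * A device is acceptable if it is unbound (assumption), bound to a vfio
 * driver, or explicitly donating its group to an mdev.
 */
bool model_dev_viable(const struct model_dev *dev)
{
	if (dev->drv == MODEL_DRV_NONE || dev->drv == MODEL_DRV_VFIO)
		return true;
	return dev->donates_group;
}

/* The group is viable only if every device in it is acceptable. */
bool model_group_viable(const struct model_dev *devs, size_t ndevs)
{
	assert(devs != NULL || ndevs == 0);
	for (size_t i = 0; i < ndevs; i++) {
		if (!model_dev_viable(&devs[i]))
			return false;
	}
	return true;
}
```

In this model a group holding one host-bound device and one vfio-bound
device is non-viable until the host-bound device sets donates_group.
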
Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-05 15:55                       ` Alex Williamson
@ 2019-07-11 12:27                         ` Liu, Yi L
  2019-07-11 19:08                           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Liu, Yi L @ 2019-07-11 12:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf
> Of Alex Williamson
> Sent: Friday, July 5, 2019 11:55 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Thu, 4 Jul 2019 09:11:02 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, July 4, 2019 1:22 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
[...]
> >
> > > It's really unfortunate that we don't have the mdev inheriting the
> > > iommu group of the iommu_device so that userspace can really understand
> > > this relationship.  A separate group makes sense for the aux-domain
> > > case, and is (I guess) not a significant issue in the case of a
> > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > this is something we should correct in design of iommu backed mdevs.
> >
> > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > means singleton iommu_device group. And in early time, when designing the
> support
> > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > iommu_device group. But this results in an iommu backed group includes mdev and
> > physical devices, which might also be strange. Do you think it is valuable to
> reconsider
> > it?
> 
> From a group perspective, the cleanest solution would seem to be that
> IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> group of the iommu_device,

One thing to confirm: regarding inheriting the IOMMU group of the
iommu_device, do you mean the mdev device should be added to the IOMMU
group of the iommu_device, or that vfio should maintain a parent/inheritor
relationship between them? I guess you mean the latter? :-)

> but I think the barrier here is that we have
> a difficult time determining if the group is "viable" in that case.
> For example a group where one device is bound to a native host driver
> and the other device bound to a vfio driver would typically be
> considered non-viable as it breaks the isolation guarantees.  However

Yes, this is how vfio guarantees isolation before allowing the user to
add a group to a vfio container and so on.

> I think in this configuration, the parent device is effectively
> participating in the isolation and "donating" its iommu group on behalf
> of the mdev device.  I don't think we can simultaneously use that iommu
> group for any other purpose. 

Agreed. At least the host can no longer make use of the iommu group in
such a configuration.

> I'm sure we could come up with a way for
> vfio-core to understand this relationship and add it to the white list,

In this configuration the host driver still exists, while we want the mdev
device to somehow "own" the iommu backed DMA isolation capability. So one
possible way may be to call vfio_add_group_dev(), which creates a vfio_device
instance for the iommu_device in vfio.c, when creating an iommu backed mdev.
Then the iommu group is fairly viable.

> I wonder though how confusing this might be to users who now understand
> the group/driver requirement to be "all endpoints bound to vfio
> drivers".  This might still be the best approach regardless of this.

Yes. Another thing I'm considering is how to prevent such a host driver from
issuing DMA. If we end up with one device bound to vfio-pci and another device
wrapped as an mdev, and pass both through to a VM, the host driver is still
able to issue DMA. The IOMMU can block some DMAs, but not all of them. If a
DMA issued by the host driver happens to have a mapping on the IOMMU side,
then the host is kind of doing things on behalf of the VM. We may trust the
host driver, but it looks a little bit awkward to me. :-(

> Thanks,
> 
> Alex

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-11 12:27                         ` Liu, Yi L
@ 2019-07-11 19:08                           ` Alex Williamson
  2019-07-12 12:55                             ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-07-11 19:08 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

On Thu, 11 Jul 2019 12:27:26 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf
> > Of Alex Williamson
> > Sent: Friday, July 5, 2019 11:55 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 4 Jul 2019 09:11:02 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver  
> [...]
> > >  
> > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > iommu group of the iommu_device so that userspace can really understand
> > > > this relationship.  A separate group makes sense for the aux-domain
> > > > case, and is (I guess) not a significant issue in the case of a
> > > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > > this is something we should correct in design of iommu backed mdevs.  
> > >
> > > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > > means singleton iommu_device group. And in early time, when designing the
> > support  
> > > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > > iommu_device group. But this results in an iommu backed group includes mdev and
> > > physical devices, which might also be strange. Do you think it is valuable to  
> > reconsider  
> > > it?  
> > 
> > From a group perspective, the cleanest solution would seem to be that
> > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > group of the iommu_device,  
> 
> A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> you mean mdev device should be added to the IOMMU group of iommu_device
> or maintain a parent and inheritor relationship within vfio? I guess you mean the
> later one? :-)

I was thinking the former, I'm not sure what the latter implies.  There
is no hierarchy within or between IOMMU groups, it's simply a set of
devices.  Maybe what you're getting at is that vfio needs to understand
that the mdev is a child of the endpoint device in its determination of
whether the group is viable.  That's true, but we can also have IOMMU
groups composed of SR-IOV VFs along with their parent PF if the root of
the IOMMU group is (for example) a downstream switch port above the PF.
So we can't simply look at the parent/child relationship within the
group, we somehow need to know that the parent device sharing the IOMMU
group is operating in host kernel space on behalf of the mdev.
 
> > but I think the barrier here is that we have
> > a difficult time determining if the group is "viable" in that case.
> > For example a group where one device is bound to a native host driver
> > and the other device bound to a vfio driver would typically be
> > considered non-viable as it breaks the isolation guarantees.  However  
> 
> yes, this is how vfio guarantee the isolation before allowing user to further
> add a group to a vfio container and so on.
> 
> > I think in this configuration, the parent device is effectively
> > participating in the isolation and "donating" its iommu group on behalf
> > of the mdev device.  I don't think we can simultaneously use that iommu
> > group for any other purpose.   
> 
> Agree. At least host cannot make use of the iommu group any more in such
> configuration.
> 
> > I'm sure we could come up with a way for
> > vfio-core to understand this relationship and add it to the white list,
> 
> The configuration is host driver still exists while we want to let mdev device
> to somehow "own" the iommu backed DMA isolation capability. So one possible
> way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> iommu group is fairly viable.

"fairly viable" ;)  It's a correct use of the term, it's a little funny
though as "fairly" can also mean reasonably/sufficiently/adequately as
well as I think the intended use here equivalent to justly. </tangent>

That's an interesting idea to do an implicit vfio_add_group_dev() on
the iommu_device in this case, if you've worked through how that could
play out, it'd be interesting to see.

> > I wonder though how confusing this might be to users who now understand
> > the group/driver requirement to be "all endpoints bound to vfio
> > drivers".  This might still be the best approach regardless of this.  
> 
> Yes, another thing I'm considering is how to prevent such a host driver from
> issuing DMA. If we finally get a device bound to vfio-pci and another device
> wrapped as mdev and passthru them to VM, the host driver is still capable to
> issue DMA. Though IOMMU can block some DMAs, but not all of them. If a
> DMA issued by host driver happens to have mapping in IOMMU side, then
> host is kind of doing things on behalf on VM. Though we may trust the host
> driver, but it looks to be a little bit awkward to me. :-(

vfio is allocating an iommu domain and placing the iommu_device into
that domain, the user therefore owns the iova context for the parent
device, how would that not manage all DMA?   The vendor driver could
theoretically also manipulate mappings within that domain, but that
driver is a host kernel driver and therefore essentially trusted like
any other host kernel driver.  The only unique thing here is that it's
part of a channel providing access for an untrusted user, so it needs
to be particularly concerned with keeping that user access within
bounds.  Thanks,

Alex
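
As an illustration only, the point about the user owning the IOVA context
can be modelled in userspace C: once the device sits in the user's domain,
every DMA address it emits is translated through that domain's mappings, and
anything without a mapping faults. This is a sketch, not kernel code; all
model_* names and the -EFAULT return value are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One IOVA -> physical mapping installed by the user through vfio. */
struct model_mapping {
	uint64_t iova;
	uint64_t phys;
	uint64_t size;
};

/* The domain the iommu_device has been attached to. */
struct model_domain {
	const struct model_mapping *maps;
	size_t nmaps;
};

/*
 * Translate a DMA address issued by the device.  It makes no difference
 * whether the DMA was programmed by the guest or by the host vendor
 * driver: only addresses covered by a mapping in this domain succeed.
 */
int model_dma_translate(const struct model_domain *d, uint64_t iova,
			uint64_t *phys)
{
	assert(d != NULL && phys != NULL);
	for (size_t i = 0; i < d->nmaps; i++) {
		const struct model_mapping *m = &d->maps[i];

		if (iova >= m->iova && iova - m->iova < m->size) {
			*phys = m->phys + (iova - m->iova);
			return 0;
		}
	}
	return -EFAULT;	/* no mapping: the DMA is blocked */
}
```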

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-11 19:08                           ` Alex Williamson
@ 2019-07-12 12:55                             ` Liu, Yi L
  2019-07-19 20:57                               ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Liu, Yi L @ 2019-07-12 12:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, July 12, 2019 3:08 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Thu, 11 Jul 2019 12:27:26 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> Behalf
> > > Of Alex Williamson
> > > Sent: Friday, July 5, 2019 11:55 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Thu, 4 Jul 2019 09:11:02 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > [...]
> > > >
> > > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > > iommu group of the iommu_device so that userspace can really understand
> > > > > this relationship.  A separate group makes sense for the aux-domain
> > > > > case, and is (I guess) not a significant issue in the case of a
> > > > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > > > this is something we should correct in design of iommu backed mdevs.
> > > >
> > > > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > > > means singleton iommu_device group. And in early time, when designing the
> > > support
> > > > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > > > iommu_device group. But this results in an iommu backed group includes mdev
> and
> > > > physical devices, which might also be strange. Do you think it is valuable to
> > > reconsider
> > > > it?
> > >
> > > From a group perspective, the cleanest solution would seem to be that
> > > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > > group of the iommu_device,
> >
> > A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> > you mean mdev device should be added to the IOMMU group of iommu_device
> > or maintain a parent and inheritor relationship within vfio? I guess you mean the
> > later one? :-)
> 
> I was thinking the former, I'm not sure what the latter implies.  There
> is no hierarchy within or between IOMMU groups, it's simply a set of
> devices.

I have a concern about adding the mdev device to the iommu_group of the
iommu_device. In such a configuration, an iommu backed group includes both
mdev devices and physical devices. It might then be necessary to advertise
the mdev info to any in-kernel software that wants to loop over all devices
within such an iommu_group. An example I can see is the virtual SVA threads
in the community. E.g. for a guest pasid bind, the changes below loop over
all the devices within an iommu_group, and each iteration calls into the
vendor iommu driver with a device structure passed in. It is quite possible
that the vendor iommu driver needs to get something behind a physical device
(e.g. the intel_iommu structure). For a physical device, that is fine. For an
mdev device, it would be a problem if no mdev info is advertised to the iommu
driver. :-( We do have agreement that PASID support should be disabled for
devices from a non-singleton group, but I'd rather not rely on such
assumptions when designing software flows. Also, it's just an example; we
have no idea whether there will be more similar flows that need to loop over
all devices in an iommu group in the future. Maybe we want to avoid adding
an mdev to an iommu backed group. :-) More replies to your response below.

+static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
+					    void __user *arg,
+					    struct vfio_iommu_type1_bind *bind)
+ ...
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		list_for_each_entry(group, &domain->group_list, next) {
+			ret = iommu_group_for_each_dev(group->iommu_group,
+			   &guest_bind, vfio_bind_gpasid_fn);
+			if (ret)
+				goto out_unbind;
+		}
+	}
+ ...
+}
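
As an illustration only, the concern can be modelled in userspace C. The
sketch below mirrors the callback-style iteration used in the diff above,
with a callback that needs per-device IOMMU hardware info and therefore
fails on a virtual (mdev) member of the group. It is not kernel code; all
model_* names and the -ENODEV return value are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the per-IOMMU hardware state a vendor driver looks up. */
struct model_iommu {
	int id;
};

struct model_dev {
	const char *name;
	const struct model_iommu *iommu;	/* NULL for a virtual (mdev) device */
};

struct model_group {
	struct model_dev *devs;
	size_t ndevs;
};

typedef int (*model_dev_fn)(struct model_dev *dev, void *data);

/* Call fn on each device in the group; stop at the first non-zero return. */
int model_group_for_each_dev(struct model_group *g, void *data,
			     model_dev_fn fn)
{
	assert(g != NULL && fn != NULL);
	for (size_t i = 0; i < g->ndevs; i++) {
		int ret = fn(&g->devs[i], data);

		if (ret)
			return ret;
	}
	return 0;
}

/* Per-device bind step: needs hardware info that an mdev does not have. */
int model_bind_fn(struct model_dev *dev, void *data)
{
	int *bound = data;

	if (!dev->iommu)
		return -ENODEV;
	(*bound)++;
	return 0;
}
```

With a group holding one physical device and one mdev, the walk binds the
physical device and then aborts on the mdev, which is the situation any
group-walking layer would have to be taught to handle.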

> Maybe what you're getting at is that vfio needs to understand
> that the mdev is a child of the endpoint device in its determination of
> whether the group is viable.

Is the group here the group of the iommu_device or the group of an mdev
device? :-) Actually, I think the group of an mdev device is always viable,
since it has only one device and mdev_driver will add the mdev device to
the vfio controlled scope to make the mdev group viable. Per my
understanding, VFIO guarantees isolation in two major parts. The first is
checking whether the group is viable before adding it to a container; the
second is preventing multiple opens of /dev/vfio/group_id via the
vfio_group->opened field maintained in vfio.c.

Back to the configuration we are talking about here (for example, a group
where one device is bound to a native host driver and the other device is
bound to a vfio driver[1]), we have two groups (an iommu backed one and an
mdev group). I think that for an iommu_device which wants to "donate" its
iommu_group, the host driver should explicitly call vfio_add_group_dev() to
add itself to the vfio controlled scope, and thus make its iommu backed
group viable. That way we have two viable iommu groups: the iommu backed
group is viable through the host driver's vfio_add_group_dev() call, and
the mdev group is naturally viable. Up to this point, we can pass the
devices (a vfio-pci device and an mdev device) in this configuration
through to a VM fine. But we cannot prevent the user from passing the
devices through to different VMs, since the two iommu groups are both
viable. If my understanding of vfio is correct up to this point, I think we
need to fail the attempt to pass through to multiple VMs in
vfio_iommu_type1_attach_group() by checking the vfio_group->opened field,
which is maintained in vfio.c. E.g. let's say that for the iommu backed
group we have vfio_group#1, and for the mdev group we have vfio_group#2 in
vfio.c; then opening vfio_group#1 requires incrementing
vfio_group#2->opened, and vice versa.

[1] The example from your previous reply.

> That's true, but we can also have IOMMU
> groups composed of SR-IOV VFs along with their parent PF if the root of
> the IOMMU group is (for example) a downstream switch port above the PF.
> So we can't simply look at the parent/child relationship within the
> group, we somehow need to know that the parent device sharing the IOMMU
> group is operating in host kernel space on behalf of the mdev.

I think for such a hardware configuration, we still have only two iommu
groups, an iommu backed one and an mdev group. The idea above may still be
applicable. :-)

> > > but I think the barrier here is that we have
> > > a difficult time determining if the group is "viable" in that case.
> > > For example a group where one device is bound to a native host driver
> > > and the other device bound to a vfio driver would typically be
> > > considered non-viable as it breaks the isolation guarantees.  However
> >
> > yes, this is how vfio guarantee the isolation before allowing user to further
> > add a group to a vfio container and so on.
> >
> > > I think in this configuration, the parent device is effectively
> > > participating in the isolation and "donating" its iommu group on behalf
> > > of the mdev device.  I don't think we can simultaneously use that iommu
> > > group for any other purpose.
> >
> > Agree. At least host cannot make use of the iommu group any more in such
> > configuration.
> >
> > > I'm sure we could come up with a way for
> > > vfio-core to understand this relationship and add it to the white list,
> >
> > The configuration is host driver still exists while we want to let mdev device
> > to somehow "own" the iommu backed DMA isolation capability. So one possible
> > way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> > for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> > iommu group is fairly viable.
> 
> "fairly viable" ;)  It's a correct use of the term, it's a little funny
> though as "fairly" can also mean reasonably/sufficiently/adequately as
> well as I think the intended use here equivalent to justly. </tangent>

Aha, a nice "lesson" for me. Honestly, I have no idea how that word came to
me when I was trying to describe my idea with a moderate term. Luckily, it
still got my meaning across. :-)

> That's an interesting idea to do an implicit vfio_add_group_dev() on
> the iommu_device in this case, if you've worked through how that could
> play out, it'd be interesting to see.

I've tried it in the probe() phase of my vfio-mdev-pci driver, and it works
well. Note that this is an explicit call. I guess we may really want the
host driver to do it explicitly rather than implicitly, since the host
driver owns the choice of whether to "donate" its group or not. As for
failing vfio_iommu_type1_attach_group() to prevent the user from passing the
vfio-pci device and the vfio-mdev-pci device (which share an iommu backed
group) through to different VMs, I'm working on some changes. If this is the
right way, I'll try to send out a new version for your further review. :-)

> > > I wonder though how confusing this might be to users who now understand
> > > the group/driver requirement to be "all endpoints bound to vfio
> > > drivers".  This might still be the best approach regardless of this.
> >
> > Yes, another thing I'm considering is how to prevent such a host driver from
> > issuing DMA. If we finally get a device bound to vfio-pci and another device
> > wrapped as mdev and passthru them to VM, the host driver is still capable to
> > issue DMA. Though IOMMU can block some DMAs, but not all of them. If a
> > DMA issued by host driver happens to have mapping in IOMMU side, then
> > host is kind of doing things on behalf on VM. Though we may trust the host
> > driver, but it looks to be a little bit awkward to me. :-(
> 
> vfio is allocating an iommu domain and placing the iommu_device into
> that domain, the user therefore owns the iova context for the parent
> device, how would that not manage all DMA?   The vendor driver could
> theoretically also manipulate mappings within that domain, but that
> driver is a host kernel driver and therefore essentially trusted like
> any other host kernel driver.  The only unique thing here is that it's
> part of a channel providing access for an untrusted user, so it needs
> to be particularly concerned with keeping that user access within
> bounds.  Thanks,

Got it, thanks for the explanation. Looks like I overplayed the concern.

> 
> Alex

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-12 12:55                             ` Liu, Yi L
@ 2019-07-19 20:57                               ` Alex Williamson
  2019-07-26  9:04                                 ` Liu, Yi L
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2019-07-19 20:57 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

On Fri, 12 Jul 2019 12:55:27 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, July 12, 2019 3:08 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 11 Jul 2019 12:27:26 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On  
> > Behalf  
> > > > Of Alex Williamson
> > > > Sent: Friday, July 5, 2019 11:55 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Thu, 4 Jul 2019 09:11:02 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver  
> > > [...]  
> > > > >  
> > > > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > > > iommu group of the iommu_device so that userspace can really understand
> > > > > > this relationship.  A separate group makes sense for the aux-domain
> > > > > > case, and is (I guess) not a significant issue in the case of a
> > > > > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > > > > this is something we should correct in design of iommu backed mdevs.  
> > > > >
> > > > > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > > > > means singleton iommu_device group. And in early time, when designing the
> > > > support  
> > > > > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > > > > iommu_device group. But this results in an iommu backed group includes mdev  
> > and  
> > > > > physical devices, which might also be strange. Do you think it is valuable to  
> > > > reconsider  
> > > > > it?  
> > > >
> > > > From a group perspective, the cleanest solution would seem to be that
> > > > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > > > group of the iommu_device,  
> > >
> > > A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> > > you mean mdev device should be added to the IOMMU group of iommu_device
> > > or maintain a parent and inheritor relationship within vfio? I guess you mean the
> > > later one? :-)  
> > 
> > I was thinking the former, I'm not sure what the latter implies.  There
> > is no hierarchy within or between IOMMU groups, it's simply a set of
> > devices.  
> 
> I have a concern on adding the mdev device to the iommu_group of
> iommu_device. In such configuration, a iommu backed group includes
> mdev devices and physical devices. Then it might be necessary to advertise
> the mdev info to the in-kernel software which want to loop all devices within
> such an iommu_group. An example I can see is the virtual SVA threads in
> community. e.g. for a guest pasid bind, the changes below loops all the
> devices within an iommu_group, and each loop will call into vendor iommu
> driver with a device structure passed in. It is quite possible that vendor
> iommu driver need to get something behind a physical device (e.g.
> intel_iommu structure). For a physical device, it is fine. While for mdev
> device, it would be a problem if no mdev info advertised to iommu driver. :-(
> Although we have agreement that PASID support should be disabled for
> devices which are from non-singleton group. But I don't feel like to rely on
> such assumptions when designing software flows. Also, it's just an example,
> we have no idea if there will be more similar flows which require to loop all
> devices in an iommu group in future. May be we want to avoid adding a mdev
> to an iommu backed group. :-) More replies to you response below.
> 
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> +					    void __user *arg,
> +					    struct vfio_iommu_type1_bind *bind)
> + ...
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		list_for_each_entry(group, &domain->group_list, next) {
> +			ret = iommu_group_for_each_dev(group->iommu_group,
> +			   &guest_bind, vfio_bind_gpasid_fn);
> +			if (ret)
> +				goto out_unbind;
> +		}
> +	}
> + ...
> +}

Sorry for the delayed response.

I think you're right, making the IOMMU code understand virtual devices
in an IOMMU group makes traversing the group difficult for any layer
that doesn't understand the relationship of these virtual devices.  I
guess we can't go that route.

> > Maybe what you're getting at is that vfio needs to understand
> > that the mdev is a child of the endpoint device in its determination of
> > whether the group is viable.  
> 
> Is the group here the group of iommu_device or a group of a mdev device?
> :-) Actually, I think the group of a mdev device is always viable since
> it has only a device and mdev_driver will add the mdev device to vfio
> controlled scope to make the mdev group viable. Per my understanding,
> VFIO guarantees the isolation by two major arts. First is checking if
> group is viable before adding it to a container, second is preventing
> multiple opens to /dev/vfio/group_id by the vfio_group->opened field
> maintained in vfio.c.

Yes, minor nit, an mdev needs to be bound to vfio-mdev for the group to
be vfio "viable", we expect that there will eventually be non-vfio
drivers for mdev devices.

> Back to the configuration we are talking here (For example a group where
> one devices is bound to a native host driver and the other device bound
> to a vfio driver[1].), we have two groups( iommu backed one and mdev group).
> I think for iommu_device which wants to "donate" its iommu_group, the
> host driver should explicitly call vfio_add_group_dev() to add itself
> to the vfio controlled scope. And thus make its iommu backed group be
> viable. So that we can have two viable iommu groups. iommu backed group
> is viable by the host driver's vfio_add_group_dev() calling, and mdev
> group is naturally viable. Until now, we can passthru the devices
> (vfio-pci device and a mdev device) under this configuration to VM well.
> But we cannot prevent user to passthru the devices to different VMs since
> the two iommu groups are both viable. If I'm still understanding vfio
> correct until this line, I think we need to fail the attempt of passthru
> to multiple VMs in vfio_iommu_type1_attach_group() by checking the
> vfio_group->opened field which is maintained in vfio.c. e.g. let's say
> for iommu backed group, we have vfio_group#1 and mdev group, we have
> vfio_group#2 in vfio.c, then opening vfio_group#1 requires to inc the
> vfio_group#2->opened. And vice versa.
> 
> [1] the example from the previous reply of you.

I think there's a problem with incrementing the group's opened count: the
user still needs to be able to open the group for devices within the group
that may be bound to vfio-pci, so I don't think this plan really works.
Also, who would be responsible for calling vfio_add_group_dev()?  The
vendor driver is just registering an mdev parent device, it doesn't
know that those devices will be used by vfio-mdev or some other mdev
bus driver.  I think that means that vfio-mdev would need to call this
for mdevs with an iommu_device after it registers the mdev itself.  The
vfio_device_ops it registers would need to essentially be stubbed out
too, in order to prevent direct vfio access to the backing device.

I wonder if the "inheritance" of a group could be isolated to vfio in
such a case.  The vfio group file for the mdev must exist for
userspace compatibility, but I wonder if we could manage to make that be
effectively an alias for the iommu device.  Using a device from a group
without actually opening the group still seems problematic too.  I'm
also wondering how much effort we want to go to in supporting this
versus mdev could essentially fail the call to register an iommu device
for an mdev if that iommu device is not in a singleton group.  It would
limit the application of vfio-mdev-pci, but it's already being proposed as a
proof of concept sample driver anyway.


> > That's true, but we can also have IOMMU
> > groups composed of SR-IOV VFs along with their parent PF if the root of
> > the IOMMU group is (for example) a downstream switch port above the PF.
> > So we can't simply look at the parent/child relationship within the
> > group, we somehow need to know that the parent device sharing the IOMMU
> > group is operating in host kernel space on behalf of the mdev.  
> 
> I think for such hardware configuration, we still have only two iommu
> group, a iommu backed one and a mdev group. May the idea above still
> applicable. :-)
> 
> > > > but I think the barrier here is that we have
> > > > a difficult time determining if the group is "viable" in that case.
> > > > For example a group where one devices is bound to a native host driver
> > > > and the other device bound to a vfio driver would typically be
> > > > considered non-viable as it breaks the isolation guarantees.  However  
> > >
> > > yes, this is how vfio guarantee the isolation before allowing user to further
> > > add a group to a vfio container and so on.
> > >  
> > > > I think in this configuration, the parent device is effectively
> > > > participating in the isolation and "donating" its iommu group on behalf
> > > > of the mdev device.  I don't think we can simultaneously use that iommu
> > > > group for any other purpose.  
> > >
> > > Agree. At least host cannot make use of the iommu group any more in such
> > > configuration.
> > >  
> > > > I'm sure we could come up with a way for
> > > > vfio-core to understand this relationship and add it to the white list,  
> > >
> > > The configuration is host driver still exists while we want to let mdev device
> > > to somehow "own" the iommu backed DMA isolation capability. So one possible
> > > way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> > > for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> > > iommu group is fairly viable.  
> > 
> > "fairly viable" ;)  It's a correct use of the term, it's a little funny
> > though as "fairly" can also mean reasonably/sufficiently/adequately as
> > well as I think the intended use here equivalent to justly. </tangent>  
> 
> Aha, a nice "lesson" for me. Honestly, I have no idea how it came to me
> when trying to describe my idea with a moderate term either. Luckily,
> it made me well understood. :-)
> 
> > That's an interesting idea to do an implicit vfio_add_group_dev() on
> > the iommu_device in this case, if you've worked through how that could
> > play out, it'd be interesting to see.  
> 
> I've tried it in my vfio-mdev-pci driver probe() phase, it works well.
> And this is an explicit calling. And I guess we may really want host driver
> to do it explicitly instead of implicitly as host driver owns the choice
> of whether "donating" group or not. While for failing the
> vfio_iommu_type1_attach_group() to prevent user passthru the vfio-pci device
> and vfio-mdev-pci device (share iommu backed group) to different VMs, I'm
> doing some changes. If it's a correct way, I'll try to send out a new version
> for your further review. :-)

I'm interested to see it, but as above, I have some reservations.  And
as I mentioned, an mdev vendor driver cannot assume the device is used
by vfio-mdev.  I know Intel vGPUs not only assume vfio-mdev, but also
KVM and fail the device open if the constraints aren't met, but I don't
think we can start introducing that sort of vfio specific dependencies
on the mdev bus interface.  Thanks,

Alex

* RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
  2019-07-19 20:57                               ` Alex Williamson
@ 2019-07-26  9:04                                 ` Liu, Yi L
  0 siblings, 0 replies; 26+ messages in thread
From: Liu, Yi L @ 2019-07-26  9:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Masahiro Yamada

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, July 20, 2019 4:58 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Fri, 12 Jul 2019 12:55:27 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, July 12, 2019 3:08 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Thu, 11 Jul 2019 12:27:26 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of Alex Williamson
> > > > > Sent: Friday, July 5, 2019 11:55 PM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > >
> > > > > On Thu, 4 Jul 2019 09:11:02 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > [...]
[...]
> > > Maybe what you're getting at is that vfio needs to understand
> > > that the mdev is a child of the endpoint device in its determination of
> > > whether the group is viable.
> >
> > Is the group here the group of iommu_device or a group of a mdev device?
> > :-) Actually, I think the group of a mdev device is always viable since
> > it has only a device and mdev_driver will add the mdev device to vfio
> > controlled scope to make the mdev group viable. Per my understanding,
> > VFIO guarantees the isolation by two major arts. First is checking if
> > group is viable before adding it to a container, second is preventing
> > multiple opens to /dev/vfio/group_id by the vfio_group->opened field
> > maintained in vfio.c.
> 
> Yes, minor nit, an mdev needs to be bound to vfio-mdev for the group to
> be vfio "viable", we expect that there will eventually be non-vfio
> drivers for mdev devices.

Then I guess the mdev group is non-viable from vfio's point of view
in that case. :-)

> > Back to the configuration we are talking here (For example a group where
> > one devices is bound to a native host driver and the other device bound
> > to a vfio driver[1].), we have two groups( iommu backed one and mdev group).
> > I think for iommu_device which wants to "donate" its iommu_group, the
> > host driver should explicitly call vfio_add_group_dev() to add itself
> > to the vfio controlled scope. And thus make its iommu backed group be
> > viable. So that we can have two viable iommu groups. iommu backed group
> > is viable by the host driver's vfio_add_group_dev() calling, and mdev
> > group is naturally viable. Until now, we can passthru the devices
> > (vfio-pci device and a mdev device) under this configuration to VM well.
> > But we cannot prevent user to passthru the devices to different VMs since
> > the two iommu groups are both viable. If I'm still understanding vfio
> > correct until this line, I think we need to fail the attempt of passthru
> > to multiple VMs in vfio_iommu_type1_attach_group() by checking the
> > vfio_group->opened field which is maintained in vfio.c. e.g. let's say
> > for iommu backed group, we have vfio_group#1 and mdev group, we have
> > vfio_group#2 in vfio.c, then opening vfio_group#1 requires to inc the
> > vfio_group#2->opened. And vice versa.
> >
> > [1] the example from the previous reply of you.
> 
> I think there's a problem with incrementing the group, the user still
> needs to be able to open the group for devices within the group that
> may be bound to vfio-pci, so I don't think this plan really works.

Perhaps I failed to make it clear. Let me explain. By incrementing the
group's open count, vfio can prevent a single iommu group from being
passed through to different QEMUs (VMs). Once a QEMU opens a group, it
will not open it again. e.g. the current QEMU implementation checks the
vfio_group_list in vfio_get_group() before opening the group for an
assigned device, so it avoids opening the group multiple times within
one QEMU process. This makes sense since kernel VFIO will attach all the
devices within a given iommu group to the allocated unmanaged domain in
vfio_iommu_type1_attach_group(). Back to my plan. :-) Say I have an
iommu group with three devices: one device bound to vfio-pci, and two
devices bound to a host driver which wraps them as mdevs (e.g. the
vfio-mdev-pci driver). So there will finally be three groups: an iommu
backed group and two mdev groups. As I mentioned, I may be able to
make the iommu backed group vfio viable with vfio_add_group_dev().
Then my plan is simple. Let the three groups share a group->opened
field, so that opening any of the three groups results in an increment
for the iommu backed group. Also, before any open of the three
groups, vfio_group_fops_open() should check the iommu backed
group first. Alternatively, this check could be done in
vfio_iommu_type1_attach_group(). It looks like it may be better to do
it in vfio_group_fops_open(), since we may need to let vfio.c
understand the "inheritance" between the three groups.
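
To make the shared-count idea concrete, below is a minimal userspace
sketch in C. It is only a model under my own assumptions: struct
open_state, struct model_group and the model_group_*() helpers are
hypothetical names, not the vfio.c structures. It assumes one "opened"
flag shared by the iommu backed group and the mdev groups, and an open
that is refused with -EBUSY while any of them is already open. How the
same QEMU opening a sibling group is told apart from a second QEMU is
deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model only; these are not the vfio.c structures. */
struct open_state {
	int opened;		/* 0 = free, 1 = held through one of the groups */
};

struct model_group {
	const char *name;
	struct open_state *shared;	/* same pointer for all aliased groups */
	int holds;			/* 1 if this group took the shared open */
};

/* Open succeeds only if no group sharing the state is already open. */
static int model_group_open(struct model_group *g)
{
	assert(g && g->shared);
	if (g->shared->opened)
		return -EBUSY;
	g->shared->opened = 1;
	g->holds = 1;
	return 0;
}

/* Only the group that took the shared open may drop it. */
static void model_group_release(struct model_group *g)
{
	assert(g && g->shared);
	if (g->holds) {
		g->holds = 0;
		g->shared->opened = 0;
	}
}
```

With the iommu backed group and one mdev group pointing at the same
open_state, the second open fails until the first holder releases.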

> Also, who would be responsible for calling vfio_add_group_dev(), the
> vendor driver is just registering an mdev parent device, it doesn't

I think it should be the vendor driver, since I believe it's the vendor
driver's duty to make this decision. It would be like the vendor driver
choosing to "donate" its iommu group to an mdev device.

> know that those devices will be used by vfio-mdev or some other mdev
> bus driver.  I think that means that vfio-mdev would need to call this
> for mdevs with an iommu_device after it registers the mdev itself.  The

Hmmm, it may be troublesome to let vfio-mdev call this for mdevs. I'm
not sure vfio-mdev can know that the mdev is backed by an iommu
device.

> vfio_device_ops it registers would need to essentially be stubbed out
> too, in order to prevent direct vfio access to the backing device.

Yes, the vfio_device_ops would be a dummy. To prevent direct vfio
access to the backing device, vfio_device_ops.open() should be
implemented to always fail the open attempt, so that no direct vfio
access can succeed.
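
A rough userspace sketch of such a dummy ops table follows. It is a
model, not kernel code: struct model_device_ops, stub_open(),
stub_release() and model_device_open() are hypothetical names, the
table is a simplified stand-in for vfio_device_ops, and -ENODEV is
just an assumed choice of error code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a device ops table. */
struct model_device_ops {
	const char *name;
	int (*open)(void *device_data);
	void (*release)(void *device_data);
};

/* Always refuse the open so the backing device is never handed out. */
static int stub_open(void *device_data)
{
	(void)device_data;
	return -ENODEV;		/* assumed error code */
}

/* Nothing to undo, since open never succeeds. */
static void stub_release(void *device_data)
{
	(void)device_data;
}

static const struct model_device_ops stub_ops = {
	.name = "stub-backing-device",
	.open = stub_open,
	.release = stub_release,
};

/* Model of the core calling into the registered ops on a device open. */
static int model_device_open(const struct model_device_ops *ops, void *data)
{
	if (!ops || !ops->open)
		return -EINVAL;
	return ops->open(data);
}
```

The point is only that the device is registered (so the group can count
as viable) while every direct open of it fails.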

> I wonder if the "inheritance" of a group could be isolated to vfio in
> such a case.  The vfio group file for the mdev must exist for
> userspace compatibility, but I wonder if we could manage to make that be
> effectively an alias for the iommu device.  Using a device from a group
> without actually opening the group still seems problematic too.  I'm

Yeah, the "inheritance" between the iommu backed group and the mdev
groups should be a kind of "alias".

> also wondering how much effort we want to go to in supporting this
> versus mdev could essentially fail the call to register an iommu device
> for an mdev if that iommu device is not in a singleton group.  It would
> limit the application of vfio-mdev-pci, but it's already being proposed as a
> proof of concept sample driver anyway.

Let me have a try and get back to you. :-)
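
For reference, the simpler alternative of failing the registration when
the iommu device is not in a singleton group could be modelled as
below. This is a userspace sketch with hypothetical names (struct
model_iommu_group, model_check_singleton()); the real check would have
to walk the actual iommu group, and -EINVAL is an assumed error code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of an iommu group: only the device count matters. */
struct model_iommu_group {
	size_t ndevs;
};

/*
 * Model of a registration-time check: accept the iommu device only if
 * it is alone in its group. -EINVAL is an assumed error code.
 */
static int model_check_singleton(const struct model_iommu_group *grp)
{
	if (!grp || grp->ndevs != 1)
		return -EINVAL;
	return 0;
}
```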
 
> > > That's true, but we can also have IOMMU
> > > groups composed of SR-IOV VFs along with their parent PF if the root of
> > > the IOMMU group is (for example) a downstream switch port above the PF.
> > > So we can't simply look at the parent/child relationship within the
> > > group, we somehow need to know that the parent device sharing the IOMMU
> > > group is operating in host kernel space on behalf of the mdev.
> >
> > I think for such hardware configuration, we still have only two iommu
> > group, a iommu backed one and a mdev group. May the idea above still
> > applicable. :-)
> >
> > > > > but I think the barrier here is that we have
> > > > > a difficult time determining if the group is "viable" in that case.
> > > > > For example a group where one devices is bound to a native host driver
> > > > > and the other device bound to a vfio driver would typically be
> > > > > considered non-viable as it breaks the isolation guarantees.  However
> > > >
> > > > yes, this is how vfio guarantee the isolation before allowing user to further
> > > > add a group to a vfio container and so on.
> > > >
> > > > > I think in this configuration, the parent device is effectively
> > > > > participating in the isolation and "donating" its iommu group on behalf
> > > > > of the mdev device.  I don't think we can simultaneously use that iommu
> > > > > group for any other purpose.
> > > >
> > > > Agree. At least host cannot make use of the iommu group any more in such
> > > > configuration.
> > > >
> > > > > I'm sure we could come up with a way for
> > > > > vfio-core to understand this relationship and add it to the white list,
> > > >
> > > > > The configuration is host driver still exists while we want to let mdev device
> > > > > to somehow "own" the iommu backed DMA isolation capability. So one possible
> > > > > way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> > > > > for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> > > > > iommu group is fairly viable.
> > >
> > > "fairly viable" ;)  It's a correct use of the term, it's a little funny
> > > though as "fairly" can also mean reasonably/sufficiently/adequately as
> > > well as I think the intended use here equivalent to justly. </tangent>
> >
> > Aha, a nice "lesson" for me. Honestly, I have no idea how it came to me
> > when trying to describe my idea with a moderate term either. Luckily,
> > it made me well understood. :-)
> >
> > > That's an interesting idea to do an implicit vfio_add_group_dev() on
> > > the iommu_device in this case, if you've worked through how that could
> > > play out, it'd be interesting to see.
> >
> > I've tried it in my vfio-mdev-pci driver probe() phase, it works well.
> > And this is an explicit calling. And I guess we may really want host driver
> > to do it explicitly instead of implicitly as host driver owns the choice
> > of whether "donating" group or not. While for failing the
> > vfio_iommu_type1_attach_group() to prevent user passthru the vfio-pci device
> > and vfio-mdev-pci device (share iommu backed group) to different VMs, I'm
> > doing some changes. If it's a correct way, I'll try to send out a new version
> > for your further review. :-)
> 
> I'm interested to see it, but as above, I have some reservations.  And
> as I mentioned, an mdev vendor driver cannot assume the device is used
> by vfio-mdev.  I know Intel vGPUs not only assume vfio-mdev, but also
> KVM and fail the device open if the constraints aren't met, but I don't
> think we can start introducing that sort of vfio specific dependencies
> on the mdev bus interface.  Thanks,

Yeah, it's always bad to introduce specific dependencies. But here, if
we let the vendor driver call vfio_add_group_dev(), it stays agnostic
to vfio-mdev and to other potential vfio-mdev-like mdev drivers in the
future. I'm not sure this is correct; please feel free to correct me. :-)

> Alex

Thanks,
Yi Liu

end of thread, other threads:[~2019-07-26  9:04 UTC | newest]

Thread overview: 26+ messages
2019-06-08 13:21 [PATCH v1 0/9] vfio_pci: wrap pci device as a mediated device Liu Yi L
2019-06-08 13:21 ` [PATCH v1 1/9] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
2019-06-08 13:21 ` [PATCH v1 2/9] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
2019-06-08 13:21 ` [PATCH v1 3/9] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
2019-06-08 13:21 ` [PATCH v1 4/9] vfio_pci: make common functions be extern Liu Yi L
2019-06-08 13:21 ` [PATCH v1 5/9] vfio_pci: duplicate vfio_pci.c Liu Yi L
2019-06-08 13:21 ` [PATCH v1 6/9] vfio_pci: shrink vfio_pci_common.c Liu Yi L
2019-06-08 13:21 ` [PATCH v1 7/9] vfio_pci: shrink vfio_pci.c Liu Yi L
2019-06-08 13:21 ` [PATCH v1 8/9] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op Liu Yi L
2019-06-08 13:21 ` [PATCH v1 9/9] smaples: add vfio-mdev-pci driver Liu Yi L
2019-06-20  4:26   ` Alex Williamson
2019-06-20 13:00     ` Liu, Yi L
2019-06-20 21:07       ` Alex Williamson
2019-06-21 10:23         ` Liu, Yi L
2019-06-21 15:57           ` Alex Williamson
2019-06-24  8:20             ` Liu, Yi L
2019-06-28 15:07               ` Alex Williamson
2019-07-03  8:25                 ` Liu, Yi L
2019-07-03 17:22                   ` Alex Williamson
2019-07-04  9:11                     ` Liu, Yi L
2019-07-05 15:55                       ` Alex Williamson
2019-07-11 12:27                         ` Liu, Yi L
2019-07-11 19:08                           ` Alex Williamson
2019-07-12 12:55                             ` Liu, Yi L
2019-07-19 20:57                               ` Alex Williamson
2019-07-26  9:04                                 ` Liu, Yi L
