* [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device
@ 2019-09-05  7:59 Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 01/13] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
                   ` (13 more replies)
  0 siblings, 14 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patchset aims to add a vfio-pci-like meta driver as a demo
user of the vfio changes introduced in the "vfio/mdev: IOMMU aware
mediated device" patchset from Baolu Lu. Besides serving as a test
vehicle, per Alex's comments, it could also be a good base driver
for experimenting with device-specific mdev migration.

Specific interface tested in this proposal:
 *) int mdev_set_iommu_device(struct device *dev,
 				struct device *iommu_device)
    introduced in the following patch:
    "[PATCH v5 6/8] vfio/mdev: Add iommu related member in mdev_device"

Patch Overview:
 *) patch 1 ~ 7: refactor the existing vfio-pci module;
                 move the common code from vfio_pci.c to
                 vfio_pci_common.c
 *) patch 8: add protection to perm_bits alloc/free
 *) patch 9: add vfio-mdev-pci sample driver
 *) patch 10: refine the sample driver
 *) patch 11 - 13: make the sample driver work for non-singleton groups
                   and for mixed vfio-pci / vfio-mdev-pci usage;
                   includes vfio-mdev-pci driver changes and vfio_iommu_type1
                   changes.

Links:
 *) Link to "vfio/mdev: IOMMU aware mediated device"
         https://lwn.net/Articles/780522/
 *) Previous versions:
         RFC v1: https://lkml.org/lkml/2019/3/4/529
         RFC v2: https://lkml.org/lkml/2019/3/13/113
         RFC v3: https://lkml.org/lkml/2019/4/24/495
         Patch v1: https://www.spinics.net/lists/kvm/msg188952.html
 *) you may try it with the code in the repo below;
    the current version is on branch "v5.3-rc7-pci-mdev":
         https://github.com/luxis1999/vfio-mdev-pci-sample-driver.git

Tests were done on two NICs which share an iommu group.

-------------------------------------------------------------------
         |  NIC0           |  NIC1      | vIOMMU  | VMs  | Passthru
-------------------------------------------------------------------
  Test#0 |  vfio-pci       |  vfio-pci  | no      |  1   | pass
-------------------------------------------------------------------
  Test#1 |  vfio-pci       |  vfio-pci  | no      |  2   | fail[1]
-------------------------------------------------------------------
  Test#2 |  vfio-pci       |  vfio-pci  | yes     |  1   | fail[2]
-------------------------------------------------------------------
  Test#3 |  vfio-mdev-pci  |  vfio-pci  | no      |  1   | pass
-------------------------------------------------------------------
  Test#4 |  vfio-mdev-pci  |  vfio-pci  | no      |  2   | fail[3]
-------------------------------------------------------------------
  Test#5 |  vfio-mdev-pci  |  vfio-pci  | yes     |  1   | fail[4]
-------------------------------------------------------------------
Tips:
[1] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6: 
     vfio 0000:01:00.1: failed to open /dev/vfio/1: Device or resource busy
[2] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6:
     vfio 0000:01:00.1: group 1 used in multiple address spaces
[3] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6:
     vfio 0000:01:00.1: failed to setup container for group 1: Failed to set
     iommu for container: Device or resource busy
[4] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6:
     vfio 0000:01:00.1: failed to setup container for group 1: Failed to set
     iommu for container: Device or resource busy
    Or
    qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/
     83b8f4f2-509f-382f-3c1e-e6bfe0fa1003: vfio 83b8f4f2-509f-382f-3c1e-
     e6bfe0fa1003: failed to setup container for group 11: Failed to set iommu
     for container: Device or resource busy
Some other tests are not listed. For example, binding NIC0 to vfio-mdev-pci
and then trying to pass it through with "vfio-pci,host=01:00.0" makes the
kernel throw a warning and fail the operation.

Please feel free to give your comments.

Thanks,
Yi Liu

Change log:
  patch v1 -> patch v2:
  - the sample driver implementation refined
  - the sample driver can work on non-singleton iommu groups
  - the sample driver can work with vfio-pci; devices from a non-singleton
    group can be bound to either vfio-mdev-pci or vfio-pci, and the
    assignment of such a group still follows the current vfio assignment rules.

  RFC v3 -> patch v1:
  - split the patchset from 3 patches to 9 patches to better demonstrate
    the changes step by step

  v2->v3:
  - use vfio-mdev-pci instead of vfio-pci-mdev
  - place the new driver under drivers/vfio/pci while defining its
    Kconfig in samples/Kconfig to clarify that it is a sample driver

  v1->v2:
  - instead of adding a kernel option to the existing vfio-pci
    module as in v1, v2 follows Alex's suggestion to add a
    separate vfio-pci-mdev module.
  - new patchset subject: "vfio/pci: wrap pci device as a mediated device"

Alex Williamson (1):
  samples: refine vfio-mdev-pci driver

Liu Yi L (12):
  vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header
  vfio_pci: refine user config reference in vfio-pci module
  vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  vfio_pci: make common functions be extern
  vfio_pci: duplicate vfio_pci.c
  vfio_pci: shrink vfio_pci_common.c
  vfio_pci: shrink vfio_pci.c
  vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op
  samples: add vfio-mdev-pci driver
  samples/vfio-mdev-pci: call vfio_add_group_dev()
  vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev
  vfio/type1: track iommu backed group attach

 drivers/vfio/pci/Makefile           |    9 +-
 drivers/vfio/pci/vfio_mdev_pci.c    |  497 ++++++++++++
 drivers/vfio/pci/vfio_pci.c         | 1449 +---------------------------------
 drivers/vfio/pci/vfio_pci_common.c  | 1455 +++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  |    9 +
 drivers/vfio/pci/vfio_pci_private.h |   36 +
 drivers/vfio/vfio_iommu_type1.c     |  185 ++++-
 samples/Kconfig                     |   11 +
 8 files changed, 2194 insertions(+), 1457 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_mdev_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_common.c

-- 
2.7.4



* [PATCH v2 01/13] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch fixes an issue regarding always_inline, e.g.:

"error: inlining failed in call to always_inline ‘vfio_pci_is_vga’:
function body not available".

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 14 --------------
 drivers/vfio/pci/vfio_pci_private.h | 14 ++++++++++++++
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 703948c..38271df 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,15 +54,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-static inline bool vfio_vga_disabled(void)
-{
-#ifdef CONFIG_VFIO_PCI_VGA
-	return disable_vga;
-#else
-	return true;
-#endif
-}
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -102,11 +93,6 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 	return decodes;
 }
 
-static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
-{
-	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-}
-
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
 	struct resource *res;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index ee6ee91..f12d92c 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -130,6 +130,20 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
+{
+	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
+}
+
+static inline bool vfio_vga_disabled(void)
+{
+#ifdef CONFIG_VFIO_PCI_VGA
+	return disable_vga;
+#else
+	return true;
+#endif
+}
+
 extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
 
-- 
2.7.4



* [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 01/13] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-26  2:36   ` Alex Williamson
  2019-09-05  7:59 ` [PATCH v2 03/13] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch adds three fields to struct vfio_pci_device to pass the user
configs of the vfio-pci module to functions which may become common in
future usage.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 24 +++++++++++++++---------
 drivers/vfio/pci/vfio_pci_private.h |  9 +++++++--
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 38271df..fed2687 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -69,7 +69,8 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 	unsigned char max_busnr;
 	unsigned int decodes;
 
-	if (single_vga || !vfio_vga_disabled() || pci_is_root_bus(pdev->bus))
+	if (single_vga || !vfio_vga_disabled(vdev) ||
+		pci_is_root_bus(pdev->bus))
 		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
 		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
 
@@ -273,7 +274,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	if (!vdev->pci_saved_state)
 		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
 
-	if (likely(!nointxmask)) {
+	if (likely(!vdev->nointxmask)) {
 		if (vfio_pci_nointx(pdev)) {
 			pci_info(pdev, "Masking broken INTx support\n");
 			vdev->nointx = true;
@@ -310,7 +311,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	} else
 		vdev->msix_bar = 0xFF;
 
-	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
+	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
 
 
@@ -436,7 +437,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 
 	vfio_pci_try_bus_reset(vdev);
 
-	if (!disable_idle_d3)
+	if (!vdev->disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
@@ -1304,6 +1305,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	spin_lock_init(&vdev->irqlock);
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret) {
@@ -1328,7 +1334,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	vfio_pci_probe_power_state(vdev);
 
-	if (!disable_idle_d3) {
+	if (!vdev->disable_idle_d3) {
 		/*
 		 * pci-core sets the device power state to an unknown value at
 		 * bootup and after being removed from a driver.  The only
@@ -1359,7 +1365,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	kfree(vdev->region);
 	mutex_destroy(&vdev->ioeventfds_lock);
 
-	if (!disable_idle_d3)
+	if (!vdev->disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D0);
 
 	kfree(vdev->pm_save);
@@ -1594,7 +1600,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 		if (!ret) {
 			tmp->needs_reset = false;
 
-			if (tmp != vdev && !disable_idle_d3)
+			if (tmp != vdev && !tmp->disable_idle_d3)
 				vfio_pci_set_power_state(tmp, PCI_D3hot);
 		}
 
@@ -1610,7 +1616,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(void)
+static void __init vfio_pci_fill_ids(char *ids)
 {
 	char *p, *id;
 	int rc;
@@ -1665,7 +1671,7 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		goto out_driver;
 
-	vfio_pci_fill_ids();
+	vfio_pci_fill_ids(&ids[0]);
 
 	return 0;
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index f12d92c..68521d2 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -122,6 +122,11 @@ struct vfio_pci_device {
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
+	bool			nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	bool			disable_vga;
+#endif
+	bool			disable_idle_d3;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -135,10 +140,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
-static inline bool vfio_vga_disabled(void)
+static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
-	return disable_vga;
+	return vdev->disable_vga;
 #else
 	return true;
 #endif
-- 
2.7.4



* [PATCH v2 03/13] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 01/13] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 04/13] vfio_pci: make common functions be extern Liu Yi L
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch replaces the vfio_pci_driver reference in vfio_pci.c with
pci_dev_driver(vdev->pdev), which helps make the functions generic
across module types.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index fed2687..722c041 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1442,24 +1442,25 @@ static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
 
 static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 {
-	struct vfio_pci_reflck **preflck = data;
+	struct vfio_pci_device *vdev = data;
+	struct vfio_pci_reflck **preflck = &vdev->reflck;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_device *tmp;
 
 	device = vfio_device_get_from_dev(&pdev->dev);
 	if (!device)
 		return 0;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
 		vfio_device_put(device);
 		return 0;
 	}
 
-	vdev = vfio_device_data(device);
+	tmp = vfio_device_data(device);
 
-	if (vdev->reflck) {
-		vfio_pci_reflck_get(vdev->reflck);
-		*preflck = vdev->reflck;
+	if (tmp->reflck) {
+		vfio_pci_reflck_get(tmp->reflck);
+		*preflck = tmp->reflck;
 		vfio_device_put(device);
 		return 1;
 	}
@@ -1476,7 +1477,7 @@ static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 
 	if (pci_is_root_bus(vdev->pdev->bus) ||
 	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
-					  &vdev->reflck, slot) <= 0)
+					  vdev, slot) <= 0)
 		vdev->reflck = vfio_pci_reflck_alloc();
 
 	mutex_unlock(&reflck_lock);
@@ -1501,6 +1502,7 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 
 struct vfio_devices {
 	struct vfio_device **devices;
+	struct vfio_pci_device *vdev;
 	int cur_index;
 	int max_index;
 };
@@ -1509,7 +1511,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_device *tmp;
 
 	if (devs->cur_index == devs->max_index)
 		return -ENOSPC;
@@ -1518,15 +1520,15 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	if (!device)
 		return -EINVAL;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
-	vdev = vfio_device_data(device);
+	tmp = vfio_device_data(device);
 
 	/* Fault if the device is not unused */
-	if (vdev->refcnt) {
+	if (tmp->refcnt) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
@@ -1572,6 +1574,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	if (!devs.devices)
 		return;
 
+	devs.vdev = vdev;
 	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
 					  vfio_pci_get_unused_devs,
 					  &devs, slot))
@@ -1616,7 +1619,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(char *ids)
+static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
 	int rc;
@@ -1644,7 +1647,7 @@ static void __init vfio_pci_fill_ids(char *ids)
 			continue;
 		}
 
-		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
+		rc = pci_add_dynid(driver, vendor, device,
 				   subvendor, subdevice, class, class_mask, 0);
 		if (rc)
 			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
@@ -1671,7 +1674,7 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		goto out_driver;
 
-	vfio_pci_fill_ids(&ids[0]);
+	vfio_pci_fill_ids(&ids[0], &vfio_pci_driver);
 
 	return 0;
 
-- 
2.7.4



* [PATCH v2 04/13] vfio_pci: make common functions be extern
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (2 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 03/13] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 05/13] vfio_pci: duplicate vfio_pci.c Liu Yi L
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch makes the common (module-agnostic) functions in vfio_pci.c
extern so that they can be moved to a common source file:
  *) vfio_pci_set_vga_decode
  *) vfio_pci_enable
  *) vfio_pci_disable
  *) vfio_pci_ioctl
  *) vfio_pci_read
  *) vfio_pci_write
  *) vfio_pci_mmap
  *) vfio_pci_request
  *) vfio_pci_fill_ids
  *) vfio_pci_reflck_attach
  *) vfio_pci_reflck_put
  *) vfio_pci_probe_power_state

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 30 +++++++++++++-----------------
 drivers/vfio/pci/vfio_pci_private.h | 15 +++++++++++++++
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 722c041..90a44a7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -62,7 +62,7 @@ MODULE_PARM_DESC(disable_idle_d3,
  * has no way to get to it and routing can be disabled externally at the
  * bridge.
  */
-static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
+unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 {
 	struct vfio_pci_device *vdev = opaque;
 	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
@@ -163,7 +163,6 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 }
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
-static void vfio_pci_disable(struct vfio_pci_device *vdev);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -194,7 +193,7 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
-static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u16 pmcsr;
@@ -245,7 +244,7 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
 	return ret;
 }
 
-static int vfio_pci_enable(struct vfio_pci_device *vdev)
+int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int ret;
@@ -352,7 +351,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	return ret;
 }
 
-static void vfio_pci_disable(struct vfio_pci_device *vdev)
+void vfio_pci_disable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
@@ -666,8 +665,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static long vfio_pci_ioctl(void *device_data,
-			   unsigned int cmd, unsigned long arg)
+long vfio_pci_ioctl(void *device_data,
+		   unsigned int cmd, unsigned long arg)
 {
 	struct vfio_pci_device *vdev = device_data;
 	unsigned long minsz;
@@ -1152,7 +1151,7 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 	return -EINVAL;
 }
 
-static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+ssize_t vfio_pci_read(void *device_data, char __user *buf,
 			     size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1161,7 +1160,7 @@ static ssize_t vfio_pci_read(void *device_data, char __user *buf,
 	return vfio_pci_rw(device_data, buf, count, ppos, false);
 }
 
-static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 			      size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1170,7 +1169,7 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
-static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1232,7 +1231,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 			       req_len, vma->vm_page_prot);
 }
 
-static void vfio_pci_request(void *device_data, unsigned int count)
+void vfio_pci_request(void *device_data, unsigned int count)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1264,9 +1263,6 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.request	= vfio_pci_request,
 };
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
-static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
-
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
@@ -1469,7 +1465,7 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 {
 	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
 
@@ -1495,7 +1491,7 @@ static void vfio_pci_reflck_release(struct kref *kref)
 	mutex_unlock(&reflck_lock);
 }
 
-static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 {
 	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
 }
@@ -1619,7 +1615,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
+void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
 	int rc;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 68521d2..53d0385 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -182,6 +182,21 @@ extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 
 extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
 				    pci_power_t state);
+extern unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga);
+extern int vfio_pci_enable(struct vfio_pci_device *vdev);
+extern void vfio_pci_disable(struct vfio_pci_device *vdev);
+extern long vfio_pci_ioctl(void *device_data,
+			unsigned int cmd, unsigned long arg);
+extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			size_t count, loff_t *ppos);
+extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
+extern void vfio_pci_request(void *device_data, unsigned int count);
+extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
+extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
 
 #ifdef CONFIG_VFIO_PCI_IGD
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
-- 
2.7.4



* [PATCH v2 05/13] vfio_pci: duplicate vfio_pci.c
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (3 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 04/13] vfio_pci: make common functions be extern Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 06/13] vfio_pci: shrink vfio_pci_common.c Liu Yi L
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch has no code change, just a file copy. In the following patches,
vfio_pci_common.c will be modified to include only the common functions
and related static functions from the original vfio_pci.c, while vfio_pci.c
will be modified to include only the vfio-pci module specific code.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_common.c | 1688 ++++++++++++++++++++++++++++++++++++
 1 file changed, 1688 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_common.c

diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
new file mode 100644
index 0000000..90a44a7
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -0,0 +1,1688 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#define dev_fmt pr_fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool nointxmask;
+module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nointxmask,
+		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
+
+#ifdef CONFIG_VFIO_PCI_VGA
+static bool disable_vga;
+module_param(disable_vga, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-pci");
+#endif
+
+static bool disable_idle_d3;
+module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_idle_d3,
+		 "Disable using the PCI D3 low power state for idle, unused devices");
+
+/*
+ * Our VGA arbiter participation is limited since we don't know anything
+ * about the device itself.  However, if the device is the only VGA device
+ * downstream of a bridge and VFIO VGA support is disabled, then we can
+ * safely return legacy VGA IO and memory as not decoded since the user
+ * has no way to get to it and routing can be disabled externally at the
+ * bridge.
+ */
+unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
+{
+	struct vfio_pci_device *vdev = opaque;
+	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
+	unsigned char max_busnr;
+	unsigned int decodes;
+
+	if (single_vga || !vfio_vga_disabled(vdev) ||
+		pci_is_root_bus(pdev->bus))
+		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
+
+	max_busnr = pci_bus_max_busnr(pdev->bus);
+	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
+
+	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
+		if (tmp == pdev ||
+		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
+		    pci_is_root_bus(tmp->bus))
+			continue;
+
+		if (tmp->bus->number >= pdev->bus->number &&
+		    tmp->bus->number <= max_busnr) {
+			pci_dev_put(tmp);
+			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
+			break;
+		}
+	}
+
+	return decodes;
+}
+
+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case that hot-add
+			 * device's bar is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.name = "vfio sub-page reserved";
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+						&dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}
+			dummy_res->index = bar;
+			list_add(&dummy_res->res_next,
+					&vdev->dummy_resources_list);
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+		/*
+		 * Here we don't handle the case when the BAR is not page
+		 * aligned because we can't expect the BAR will be
+		 * assigned into the same location in a page in guest
+		 * when we passthrough the BAR. And it's hard to access
+		 * this BAR in userspace because we have no way to get
+		 * the BAR's location in a page.
+		 */
+no_mmap:
+		vdev->bar_mmap_supported[bar] = false;
+	}
+}
+
+static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
+
+/*
+ * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
+ * _and_ the ability detect when the device is asserting INTx via PCI_STATUS.
+ * If a device implements the former but not the latter we would typically
+ * expect broken_intx_masking be set and require an exclusive interrupt.
+ * However since we do have control of the device's ability to assert INTx,
+ * we can instead pretend that the device does not implement INTx, virtualizing
+ * the pin register to report zero and maintaining DisINTx set on the host.
+ */
+static bool vfio_pci_nointx(struct pci_dev *pdev)
+{
+	switch (pdev->vendor) {
+	case PCI_VENDOR_ID_INTEL:
+		switch (pdev->device) {
+		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
+		case 0x1572:
+		case 0x1574:
+		case 0x1580 ... 0x1581:
+		case 0x1583 ... 0x158b:
+		case 0x37d0 ... 0x37d2:
+			return true;
+		default:
+			return false;
+		}
+	}
+
+	return false;
+}
+
+void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 pmcsr;
+
+	if (!pdev->pm_cap)
+		return;
+
+	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
+
+	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
+}
+
+/*
+ * pci_set_power_state() wrapper handling devices which perform a soft reset on
+ * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
+ * restore when returned to D0.  Saved separately from pci_saved_state for use
+ * by PM capability emulation and separately from pci_dev internal saved state
+ * to avoid it being overwritten and consumed around other resets.
+ */
+int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	bool needs_restore = false, needs_save = false;
+	int ret;
+
+	if (vdev->needs_pm_restore) {
+		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
+			pci_save_state(pdev);
+			needs_save = true;
+		}
+
+		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
+			needs_restore = true;
+	}
+
+	ret = pci_set_power_state(pdev, state);
+
+	if (!ret) {
+		/* D3 might be unsupported via quirk, skip unless in D3 */
+		if (needs_save && pdev->current_state >= PCI_D3hot) {
+			vdev->pm_save = pci_store_saved_state(pdev);
+		} else if (needs_restore) {
+			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
+			pci_restore_state(pdev);
+		}
+	}
+
+	return ret;
+}
+
+int vfio_pci_enable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+	u16 cmd;
+	u8 msix_pos;
+
+	vfio_pci_set_power_state(vdev, PCI_D0);
+
+	/* Don't allow our initial saved state to include busmaster */
+	pci_clear_master(pdev);
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		return ret;
+
+	/* If reset fails because of the device lock, fail this path entirely */
+	ret = pci_try_reset_function(pdev);
+	if (ret == -EAGAIN) {
+		pci_disable_device(pdev);
+		return ret;
+	}
+
+	vdev->reset_works = !ret;
+	pci_save_state(pdev);
+	vdev->pci_saved_state = pci_store_saved_state(pdev);
+	if (!vdev->pci_saved_state)
+		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
+
+	if (likely(!vdev->nointxmask)) {
+		if (vfio_pci_nointx(pdev)) {
+			pci_info(pdev, "Masking broken INTx support\n");
+			vdev->nointx = true;
+			pci_intx(pdev, 0);
+		} else
+			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+	}
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
+		cmd &= ~PCI_COMMAND_INTX_DISABLE;
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+	}
+
+	ret = vfio_config_init(vdev);
+	if (ret) {
+		kfree(vdev->pci_saved_state);
+		vdev->pci_saved_state = NULL;
+		pci_disable_device(pdev);
+		return ret;
+	}
+
+	msix_pos = pdev->msix_cap;
+	if (msix_pos) {
+		u16 flags;
+		u32 table;
+
+		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
+		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
+
+		vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
+		vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
+		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
+	} else
+		vdev->msix_bar = 0xFF;
+
+	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
+		vdev->has_vga = true;
+
+
+	if (vfio_pci_is_vga(pdev) &&
+	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
+		ret = vfio_pci_igd_init(vdev);
+		if (ret) {
+			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_ibm_npu2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
+			goto disable_exit;
+		}
+	}
+
+	vfio_pci_probe_mmaps(vdev);
+
+	return 0;
+
+disable_exit:
+	vfio_pci_disable(vdev);
+	return ret;
+}
+
+void vfio_pci_disable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_dummy_resource *dummy_res, *tmp;
+	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
+	int i, bar;
+
+	/* Stop the device from further DMA */
+	pci_clear_master(pdev);
+
+	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
+				VFIO_IRQ_SET_ACTION_TRIGGER,
+				vdev->irq_type, 0, 0, NULL);
+
+	/* Device closed, don't need mutex here */
+	list_for_each_entry_safe(ioeventfd, ioeventfd_tmp,
+				 &vdev->ioeventfds_list, next) {
+		vfio_virqfd_disable(&ioeventfd->virqfd);
+		list_del(&ioeventfd->next);
+		kfree(ioeventfd);
+	}
+	vdev->ioeventfds_nr = 0;
+
+	vdev->virq_disabled = false;
+
+	for (i = 0; i < vdev->num_regions; i++)
+		vdev->region[i].ops->release(vdev, &vdev->region[i]);
+
+	vdev->num_regions = 0;
+	kfree(vdev->region);
+	vdev->region = NULL; /* don't krealloc a freed pointer */
+
+	vfio_config_free(vdev);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		if (!vdev->barmap[bar])
+			continue;
+		pci_iounmap(pdev, vdev->barmap[bar]);
+		pci_release_selected_regions(pdev, 1 << bar);
+		vdev->barmap[bar] = NULL;
+	}
+
+	list_for_each_entry_safe(dummy_res, tmp,
+				 &vdev->dummy_resources_list, res_next) {
+		list_del(&dummy_res->res_next);
+		release_resource(&dummy_res->resource);
+		kfree(dummy_res);
+	}
+
+	vdev->needs_reset = true;
+
+	/*
+	 * If we have saved state, restore it.  If we can reset the device,
+	 * even better.  Resetting with current state seems better than
+	 * nothing, but saving and restoring current state without reset
+	 * is just busy work.
+	 */
+	if (pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)) {
+		pci_info(pdev, "%s: Couldn't reload saved state\n", __func__);
+
+		if (!vdev->reset_works)
+			goto out;
+
+		pci_save_state(pdev);
+	}
+
+	/*
+	 * Disable INTx and MSI, presumably to avoid spurious interrupts
+	 * during reset.  Stolen from pci_reset_function()
+	 */
+	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+
+	/*
+	 * Try to reset the device.  The success of this is dependent on
+	 * being able to lock the device, which is not always possible.
+	 */
+	if (vdev->reset_works && !pci_try_reset_function(pdev))
+		vdev->needs_reset = false;
+
+	pci_restore_state(pdev);
+out:
+	pci_disable_device(pdev);
+
+	vfio_pci_try_bus_reset(vdev);
+
+	if (!vdev->disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+}
+
+static void vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!(--vdev->refcnt)) {
+		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_pci_disable(vdev);
+	}
+
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_spapr_pci_eeh_open(vdev->pdev);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (ret)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+{
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
+		    vdev->nointx || vdev->pdev->is_virtfn)
+			return 0;
+
+		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+
+		return pin ? 1 : 0;
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = vdev->pdev->msi_cap;
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSI_FLAGS, &flags);
+			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = vdev->pdev->msix_cap;
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSIX_FLAGS, &flags);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
+		if (pci_is_pcie(vdev->pdev))
+			return 1;
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+		return 1;
+	}
+
+	return 0;
+}
+
+static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
+{
+	(*(int *)data)++;
+	return 0;
+}
+
+struct vfio_pci_fill_info {
+	int max;
+	int cur;
+	struct vfio_pci_dependent_device *devices;
+};
+
+static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_fill_info *fill = data;
+	struct iommu_group *iommu_group;
+
+	if (fill->cur == fill->max)
+		return -EAGAIN; /* Something changed, try again */
+
+	iommu_group = iommu_group_get(&pdev->dev);
+	if (!iommu_group)
+		return -EPERM; /* Cannot reset non-isolated devices */
+
+	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
+	fill->devices[fill->cur].bus = pdev->bus->number;
+	fill->devices[fill->cur].devfn = pdev->devfn;
+	fill->cur++;
+	iommu_group_put(iommu_group);
+	return 0;
+}
+
+struct vfio_pci_group_entry {
+	struct vfio_group *group;
+	int id;
+};
+
+struct vfio_pci_group_info {
+	int count;
+	struct vfio_pci_group_entry *groups;
+};
+
+static int vfio_pci_validate_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_group_info *info = data;
+	struct iommu_group *group;
+	int id, i;
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EPERM;
+
+	id = iommu_group_id(group);
+
+	for (i = 0; i < info->count; i++)
+		if (info->groups[i].id == id)
+			break;
+
+	iommu_group_put(group);
+
+	return (i == info->count) ? -EINVAL : 0;
+}
+
+static bool vfio_pci_dev_below_slot(struct pci_dev *pdev, struct pci_slot *slot)
+{
+	for (; pdev; pdev = pdev->bus->self)
+		if (pdev->bus == slot->bus)
+			return (pdev->slot == slot);
+	return false;
+}
+
+struct vfio_pci_walk_info {
+	int (*fn)(struct pci_dev *, void *data);
+	void *data;
+	struct pci_dev *pdev;
+	bool slot;
+	int ret;
+};
+
+static int vfio_pci_walk_wrapper(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_walk_info *walk = data;
+
+	if (!walk->slot || vfio_pci_dev_below_slot(pdev, walk->pdev->slot))
+		walk->ret = walk->fn(pdev, walk->data);
+
+	return walk->ret;
+}
+
+static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
+					 int (*fn)(struct pci_dev *,
+						   void *data), void *data,
+					 bool slot)
+{
+	struct vfio_pci_walk_info walk = {
+		.fn = fn, .data = data, .pdev = pdev, .slot = slot, .ret = 0,
+	};
+
+	pci_walk_bus(pdev->bus, vfio_pci_walk_wrapper, &walk);
+
+	return walk.ret;
+}
+
+static int msix_mmappable_cap(struct vfio_pci_device *vdev,
+			      struct vfio_info_cap *caps)
+{
+	struct vfio_info_cap_header header = {
+		.id = VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
+		.version = 1
+	};
+
+	return vfio_info_add_capability(caps, &header, sizeof(header));
+}
+
+int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+				 unsigned int type, unsigned int subtype,
+				 const struct vfio_pci_regops *ops,
+				 size_t size, u32 flags, void *data)
+{
+	struct vfio_pci_region *region;
+
+	region = krealloc(vdev->region,
+			  (vdev->num_regions + 1) * sizeof(*region),
+			  GFP_KERNEL);
+	if (!region)
+		return -ENOMEM;
+
+	vdev->region = region;
+	vdev->region[vdev->num_regions].type = type;
+	vdev->region[vdev->num_regions].subtype = subtype;
+	vdev->region[vdev->num_regions].ops = ops;
+	vdev->region[vdev->num_regions].size = size;
+	vdev->region[vdev->num_regions].flags = flags;
+	vdev->region[vdev->num_regions].data = data;
+
+	vdev->num_regions++;
+
+	return 0;
+}
+
+long vfio_pci_ioctl(void *device_data,
+		   unsigned int cmd, unsigned long arg)
+{
+	struct vfio_pci_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (vdev->reset_works)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct pci_dev *pdev = vdev->pdev;
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		int i, ret;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = pdev->cfg_size;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = pci_resource_len(pdev, info.index);
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			if (vdev->bar_mmap_supported[info.index]) {
+				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+				if (info.index == vdev->msix_bar) {
+					ret = msix_mmappable_cap(vdev, &caps);
+					if (ret)
+						return ret;
+				}
+			}
+
+			break;
+		case VFIO_PCI_ROM_REGION_INDEX:
+		{
+			void __iomem *io;
+			size_t size;
+			u16 orig_cmd;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.flags = 0;
+
+			/* Report the BAR size, not the ROM size */
+			info.size = pci_resource_len(pdev, info.index);
+			if (!info.size) {
+				/* Shadow ROMs appear as PCI option ROMs */
+				if (pdev->resource[PCI_ROM_RESOURCE].flags &
+							IORESOURCE_ROM_SHADOW)
+					info.size = 0x20000;
+				else
+					break;
+			}
+
+			/*
+			 * Is it really there?  Enable memory decode for
+			 * implicit access in pci_map_rom().
+			 */
+			pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
+			pci_write_config_word(pdev, PCI_COMMAND,
+					      orig_cmd | PCI_COMMAND_MEMORY);
+
+			io = pci_map_rom(pdev, &size);
+			if (io) {
+				info.flags = VFIO_REGION_INFO_FLAG_READ;
+				pci_unmap_rom(pdev, io);
+			} else {
+				info.size = 0;
+			}
+
+			pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
+			break;
+		}
+		case VFIO_PCI_VGA_REGION_INDEX:
+			if (!vdev->has_vga)
+				return -EINVAL;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0xc0000;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+
+			break;
+		default:
+		{
+			struct vfio_region_info_cap_type cap_type = {
+					.header.id = VFIO_REGION_INFO_CAP_TYPE,
+					.header.version = 1 };
+
+			if (info.index >=
+			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
+				return -EINVAL;
+			info.index = array_index_nospec(info.index,
+							VFIO_PCI_NUM_REGIONS +
+							vdev->num_regions);
+
+			i = info.index - VFIO_PCI_NUM_REGIONS;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->region[i].size;
+			info.flags = vdev->region[i].flags;
+
+			cap_type.type = vdev->region[i].type;
+			cap_type.subtype = vdev->region[i].subtype;
+
+			ret = vfio_info_add_capability(&caps, &cap_type.header,
+						       sizeof(cap_type));
+			if (ret)
+				return ret;
+
+			if (vdev->region[i].ops->add_capability) {
+				ret = vdev->region[i].ops->add_capability(vdev,
+						&vdev->region[i], &caps);
+				if (ret)
+					return ret;
+			}
+		}
+		}
+
+		if (caps.size) {
+			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+			if (info.argsz < sizeof(info) + caps.size) {
+				info.argsz = sizeof(info) + caps.size;
+				info.cap_offset = 0;
+			} else {
+				vfio_info_cap_shift(&caps, sizeof(info));
+				if (copy_to_user((void __user *)arg +
+						  sizeof(info), caps.buf,
+						  caps.size)) {
+					kfree(caps.buf);
+					return -EFAULT;
+				}
+				info.cap_offset = sizeof(info);
+			}
+
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+		case VFIO_PCI_ERR_IRQ_INDEX:
+			if (pci_is_pcie(vdev->pdev))
+				break;
+		/* fall through */
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+		info.count = vfio_pci_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+				       VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		int max, ret = 0;
+		size_t data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		max = vfio_pci_get_irq_count(vdev, hdr.index);
+
+		ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
+						 VFIO_PCI_NUM_IRQS, &data_size);
+		if (ret)
+			return ret;
+
+		if (data_size) {
+			data = memdup_user((void __user *)(arg + minsz),
+					    data_size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		mutex_lock(&vdev->igate);
+
+		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+					      hdr.start, hdr.count, data);
+
+		mutex_unlock(&vdev->igate);
+		kfree(data);
+
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_RESET) {
+		return vdev->reset_works ?
+			pci_try_reset_function(vdev->pdev) : -EINVAL;
+
+	} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
+		struct vfio_pci_hot_reset_info hdr;
+		struct vfio_pci_fill_info fill = { 0 };
+		struct vfio_pci_dependent_device *devices = NULL;
+		bool slot = false;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_pci_hot_reset_info, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz)
+			return -EINVAL;
+
+		hdr.flags = 0;
+
+		/* Can we do a slot or bus reset or neither? */
+		if (!pci_probe_reset_slot(vdev->pdev->slot))
+			slot = true;
+		else if (pci_probe_reset_bus(vdev->pdev->bus))
+			return -ENODEV;
+
+		/* How many devices are affected? */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_count_devs,
+						    &fill.max, slot);
+		if (ret)
+			return ret;
+
+		WARN_ON(!fill.max); /* Should always be at least one */
+
+		/*
+		 * If there's enough space, fill it now, otherwise return
+		 * -ENOSPC and the number of devices affected.
+		 */
+		if (hdr.argsz < sizeof(hdr) + (fill.max * sizeof(*devices))) {
+			ret = -ENOSPC;
+			hdr.count = fill.max;
+			goto reset_info_exit;
+		}
+
+		devices = kcalloc(fill.max, sizeof(*devices), GFP_KERNEL);
+		if (!devices)
+			return -ENOMEM;
+
+		fill.devices = devices;
+
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_fill_devs,
+						    &fill, slot);
+
+		/*
+		 * If a device was removed between counting and filling,
+		 * we may come up short of fill.max.  If a device was
+		 * added, we'll have a return of -EAGAIN above.
+		 */
+		if (!ret)
+			hdr.count = fill.cur;
+
+reset_info_exit:
+		if (copy_to_user((void __user *)arg, &hdr, minsz))
+			ret = -EFAULT;
+
+		if (!ret) {
+			if (copy_to_user((void __user *)(arg + minsz), devices,
+					 hdr.count * sizeof(*devices)))
+				ret = -EFAULT;
+		}
+
+		kfree(devices);
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_PCI_HOT_RESET) {
+		struct vfio_pci_hot_reset hdr;
+		int32_t *group_fds;
+		struct vfio_pci_group_entry *groups;
+		struct vfio_pci_group_info info;
+		bool slot = false;
+		int i, count = 0, ret = 0;
+
+		minsz = offsetofend(struct vfio_pci_hot_reset, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.flags)
+			return -EINVAL;
+
+		/* Can we do a slot or bus reset or neither? */
+		if (!pci_probe_reset_slot(vdev->pdev->slot))
+			slot = true;
+		else if (pci_probe_reset_bus(vdev->pdev->bus))
+			return -ENODEV;
+
+		/*
+		 * We can't let userspace give us an arbitrarily large
+		 * buffer to copy, so verify how many we think there
+		 * could be.  Note groups can have multiple devices so
+		 * one group per device is the max.
+		 */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_count_devs,
+						    &count, slot);
+		if (ret)
+			return ret;
+
+		/* Somewhere between 1 and count is OK */
+		if (!hdr.count || hdr.count > count)
+			return -EINVAL;
+
+		group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
+		groups = kcalloc(hdr.count, sizeof(*groups), GFP_KERNEL);
+		if (!group_fds || !groups) {
+			kfree(group_fds);
+			kfree(groups);
+			return -ENOMEM;
+		}
+
+		if (copy_from_user(group_fds, (void __user *)(arg + minsz),
+				   hdr.count * sizeof(*group_fds))) {
+			kfree(group_fds);
+			kfree(groups);
+			return -EFAULT;
+		}
+
+		/*
+		 * For each group_fd, get the group through the vfio external
+		 * user interface and store the group and iommu ID.  This
+		 * ensures the group is held across the reset.
+		 */
+		for (i = 0; i < hdr.count; i++) {
+			struct vfio_group *group;
+			struct fd f = fdget(group_fds[i]);
+			if (!f.file) {
+				ret = -EBADF;
+				break;
+			}
+
+			group = vfio_group_get_external_user(f.file);
+			fdput(f);
+			if (IS_ERR(group)) {
+				ret = PTR_ERR(group);
+				break;
+			}
+
+			groups[i].group = group;
+			groups[i].id = vfio_external_user_iommu_id(group);
+		}
+
+		kfree(group_fds);
+
+		/* release reference to groups on error */
+		if (ret)
+			goto hot_reset_release;
+
+		info.count = hdr.count;
+		info.groups = groups;
+
+		/*
+		 * Test whether all the affected devices are contained
+		 * by the set of groups provided by the user.
+		 */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_validate_devs,
+						    &info, slot);
+		if (!ret)
+			/* User has access, do the reset */
+			ret = pci_reset_bus(vdev->pdev);
+
+hot_reset_release:
+		for (i--; i >= 0; i--)
+			vfio_group_put_external_user(groups[i].group);
+
+		kfree(groups);
+		return ret;
+	} else if (cmd == VFIO_DEVICE_IOEVENTFD) {
+		struct vfio_device_ioeventfd ioeventfd;
+		int count;
+
+		minsz = offsetofend(struct vfio_device_ioeventfd, fd);
+
+		if (copy_from_user(&ioeventfd, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ioeventfd.argsz < minsz)
+			return -EINVAL;
+
+		if (ioeventfd.flags & ~VFIO_DEVICE_IOEVENTFD_SIZE_MASK)
+			return -EINVAL;
+
+		count = ioeventfd.flags & VFIO_DEVICE_IOEVENTFD_SIZE_MASK;
+
+		if (hweight8(count) != 1 || ioeventfd.fd < -1)
+			return -EINVAL;
+
+		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
+					  ioeventfd.data, count, ioeventfd.fd);
+	}
+
+	return -ENOTTY;
+}
+
+static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+		if (iswrite)
+			return -EINVAL;
+		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_VGA_REGION_INDEX:
+		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+	default:
+		index -= VFIO_PCI_NUM_REGIONS;
+		return vdev->region[index].ops->rw(vdev, buf,
+						   count, ppos, iswrite);
+	}
+
+	return -EINVAL;
+}
+
+ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	if (!count)
+		return 0;
+
+	return vfio_pci_rw(device_data, buf, count, ppos, false);
+}
+
+ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	if (!count)
+		return 0;
+
+	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
+}
+
+int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index;
+	u64 phys_len, req_len, pgoff, req_start;
+	int ret;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		int regnum = index - VFIO_PCI_NUM_REGIONS;
+		struct vfio_pci_region *region = vdev->region + regnum;
+
+		if (region && region->ops && region->ops->mmap &&
+		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+			return region->ops->mmap(vdev, region, vma);
+		return -EINVAL;
+	}
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	if (!vdev->bar_mmap_supported[index])
+		return -EINVAL;
+
+	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (req_start + req_len > phys_len)
+		return -EINVAL;
+
+	/*
+	 * Even though we don't make use of the barmap for the mmap,
+	 * we need to request the region and the barmap tracks that.
+	 */
+	if (!vdev->barmap[index]) {
+		ret = pci_request_selected_regions(pdev,
+						   1 << index, "vfio-pci");
+		if (ret)
+			return ret;
+
+		vdev->barmap[index] = pci_iomap(pdev, index, 0);
+		if (!vdev->barmap[index]) {
+			pci_release_selected_regions(pdev, 1 << index);
+			return -ENOMEM;
+		}
+	}
+
+	vma->vm_private_data = vdev;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			       req_len, vma->vm_page_prot);
+}
+
+void vfio_pci_request(void *device_data, unsigned int count)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	mutex_lock(&vdev->igate);
+
+	if (vdev->req_trigger) {
+		if (!(count % 10))
+			pci_notice_ratelimited(pdev,
+				"Relaying device request to user (#%u)\n",
+				count);
+		eventfd_signal(vdev->req_trigger, 1);
+	} else if (count == 0) {
+		pci_warn(pdev,
+			"No device request channel registered, blocked until released by user\n");
+	}
+
+	mutex_unlock(&vdev->igate);
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_open,
+	.release	= vfio_pci_release,
+	.ioctl		= vfio_pci_ioctl,
+	.read		= vfio_pci_read,
+	.write		= vfio_pci_write,
+	.mmap		= vfio_pci_mmap,
+	.request	= vfio_pci_request,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_device *vdev;
+	struct iommu_group *group;
+	int ret;
+
+	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	/*
+	 * Prevent binding to PFs with VFs enabled, this too easily allows
+	 * userspace instance with VFs and PFs from the same device, which
+	 * cannot work.  Disabling SR-IOV here would initiate removing the
+	 * VFs, which would unbind the driver, which is prone to blocking
+	 * if that VF is also in use by vfio-pci.  Just reject these PFs
+	 * and let the user sort it out.
+	 */
+	if (pci_num_vf(pdev)) {
+		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
+		return -EBUSY;
+	}
+
+	group = vfio_iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		vfio_iommu_group_put(group, &pdev->dev);
+		return -ENOMEM;
+	}
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	mutex_init(&vdev->ioeventfds_lock);
+	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
+
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (ret) {
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		vfio_del_group_dev(&pdev->dev);
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
+		vga_set_legacy_decoding(pdev,
+					vfio_pci_set_vga_decode(vdev, false));
+	}
+
+	vfio_pci_probe_power_state(vdev);
+
+	if (!vdev->disable_idle_d3) {
+		/*
+		 * pci-core sets the device power state to an unknown value at
+		 * bootup and after being removed from a driver.  The only
+		 * transition it allows from this unknown state is to D0, which
+		 * typically happens when a driver calls pci_enable_device().
+		 * We're not ready to enable the device yet, but we do want to
+		 * be able to get to D3.  Therefore first do a D0 transition
+		 * before going to D3.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	}
+
+	return ret;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = vfio_del_group_dev(&pdev->dev);
+	if (!vdev)
+		return;
+
+	vfio_pci_reflck_put(vdev->reflck);
+
+	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
+	kfree(vdev->region);
+	mutex_destroy(&vdev->ioeventfds_lock);
+
+	if (!vdev->disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	kfree(vdev->pm_save);
+	kfree(vdev);
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, NULL, NULL, NULL);
+		vga_set_legacy_decoding(pdev,
+				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
+	}
+}
+
+static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
+						  pci_channel_state_t state)
+{
+	struct vfio_pci_device *vdev;
+	struct vfio_device *device;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (device == NULL)
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	vdev = vfio_device_data(device);
+	if (vdev == NULL) {
+		vfio_device_put(device);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	mutex_lock(&vdev->igate);
+
+	if (vdev->err_trigger)
+		eventfd_signal(vdev->err_trigger, 1);
+
+	mutex_unlock(&vdev->igate);
+
+	vfio_device_put(device);
+
+	return PCI_ERS_RESULT_CAN_RECOVER;
+}
+
+static const struct pci_error_handlers vfio_err_handlers = {
+	.error_detected = vfio_pci_aer_err_detected,
+};
+
+static struct pci_driver vfio_pci_driver = {
+	.name		= "vfio-pci",
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_pci_probe,
+	.remove		= vfio_pci_remove,
+	.err_handler	= &vfio_err_handlers,
+};
+
+static DEFINE_MUTEX(reflck_lock);
+
+static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
+{
+	struct vfio_pci_reflck *reflck;
+
+	reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
+	if (!reflck)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&reflck->kref);
+	mutex_init(&reflck->lock);
+
+	return reflck;
+}
+
+static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
+{
+	kref_get(&reflck->kref);
+}
+
+static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_device *vdev = data;
+	struct vfio_pci_reflck **preflck = &vdev->reflck;
+	struct vfio_device *device;
+	struct vfio_pci_device *tmp;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return 0;
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	tmp = vfio_device_data(device);
+
+	if (tmp->reflck) {
+		vfio_pci_reflck_get(tmp->reflck);
+		*preflck = tmp->reflck;
+		vfio_device_put(device);
+		return 1;
+	}
+
+	vfio_device_put(device);
+	return 0;
+}
+
+int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+{
+	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
+
+	mutex_lock(&reflck_lock);
+
+	if (pci_is_root_bus(vdev->pdev->bus) ||
+	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
+					  vdev, slot) <= 0)
+		vdev->reflck = vfio_pci_reflck_alloc();
+
+	mutex_unlock(&reflck_lock);
+
+	return PTR_ERR_OR_ZERO(vdev->reflck);
+}
+
+static void vfio_pci_reflck_release(struct kref *kref)
+{
+	struct vfio_pci_reflck *reflck = container_of(kref,
+						      struct vfio_pci_reflck,
+						      kref);
+
+	kfree(reflck);
+	mutex_unlock(&reflck_lock);
+}
+
+void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+{
+	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
+}
+
+struct vfio_devices {
+	struct vfio_device **devices;
+	struct vfio_pci_device *vdev;
+	int cur_index;
+	int max_index;
+};
+
+static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_devices *devs = data;
+	struct vfio_device *device;
+	struct vfio_pci_device *tmp;
+
+	if (devs->cur_index == devs->max_index)
+		return -ENOSPC;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return -EINVAL;
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	tmp = vfio_device_data(device);
+
+	/* Fault if the device is not unused */
+	if (tmp->refcnt) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	devs->devices[devs->cur_index++] = device;
+	return 0;
+}
+
+/*
+ * If a bus or slot reset is available for the provided device and:
+ *  - All of the devices affected by that bus or slot reset are unused
+ *    (!refcnt)
+ *  - At least one of the affected devices is marked dirty via
+ *    needs_reset (such as by lack of FLR support)
+ * Then attempt to perform that bus or slot reset.  Callers are required
+ * to hold vdev->reflck->lock, protecting the bus/slot reset group from
+ * concurrent opens.  A vfio_device reference is acquired for each device
+ * to prevent unbinds during the reset operation.
+ *
+ * NB: vfio-core considers a group to be viable even if some devices are
+ * bound to drivers like pci-stub or pcieport.  Here we require all devices
+ * to be bound to vfio_pci since that's the only way we can be sure they
+ * stay put.
+ */
+static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
+{
+	struct vfio_devices devs = { .cur_index = 0 };
+	int i = 0, ret = -EINVAL;
+	bool slot = false;
+	struct vfio_pci_device *tmp;
+
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return;
+
+	if (vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
+					  &i, slot) || !i)
+		return;
+
+	devs.max_index = i;
+	devs.devices = kcalloc(i, sizeof(struct vfio_device *), GFP_KERNEL);
+	if (!devs.devices)
+		return;
+
+	devs.vdev = vdev;
+	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
+					  vfio_pci_get_unused_devs,
+					  &devs, slot))
+		goto put_devs;
+
+	/* Does at least one need a reset? */
+	for (i = 0; i < devs.cur_index; i++) {
+		tmp = vfio_device_data(devs.devices[i]);
+		if (tmp->needs_reset) {
+			ret = pci_reset_bus(vdev->pdev);
+			break;
+		}
+	}
+
+put_devs:
+	for (i = 0; i < devs.cur_index; i++) {
+		tmp = vfio_device_data(devs.devices[i]);
+
+		/*
+		 * If reset was successful, affected devices no longer need
+		 * a reset and we should return all the collateral devices
+		 * to low power.  If not successful, we either didn't reset
+		 * the bus or timed out waiting for it, so let's not touch
+		 * the power state.
+		 */
+		if (!ret) {
+			tmp->needs_reset = false;
+
+			if (tmp != vdev && !tmp->disable_idle_d3)
+				vfio_pci_set_power_state(tmp, PCI_D3hot);
+		}
+
+		vfio_device_put(devs.devices[i]);
+	}
+
+	kfree(devs.devices);
+}
+
+static void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+	vfio_pci_uninit_perm_bits();
+}
+
+void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
+{
+	char *p, *id;
+	int rc;
+
+	/* no ids passed actually */
+	if (ids[0] == '\0')
+		return;
+
+	/* add ids specified in the module parameter */
+	p = ids;
+	while ((id = strsep(&p, ","))) {
+		unsigned int vendor, device, subvendor = PCI_ANY_ID,
+			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
+		int fields;
+
+		if (!strlen(id))
+			continue;
+
+		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
+				&vendor, &device, &subvendor, &subdevice,
+				&class, &class_mask);
+
+		if (fields < 2) {
+			pr_warn("invalid id string \"%s\"\n", id);
+			continue;
+		}
+
+		rc = pci_add_dynid(driver, vendor, device,
+				   subvendor, subdevice, class, class_mask, 0);
+		if (rc)
+			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask, rc);
+		else
+			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask);
+	}
+}
+
+static int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permission data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	vfio_pci_fill_ids(&ids[0], &vfio_pci_driver);
+
+	return 0;
+
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.4


* [PATCH v2 06/13] vfio_pci: shrink vfio_pci_common.c
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (4 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 05/13] vfio_pci: duplicate vfio_pci.c Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 07/13] vfio_pci: shrink vfio_pci.c Liu Yi L
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch removes the vfio-pci module-specific code from vfio_pci_common.c
so that vfio_pci_common.c becomes a purely common source file.
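
The boundary this creates can be sketched roughly as follows (illustrative
only, not part of the diff; the prototypes simply mirror the now non-static
definitions in vfio_pci_common.c, and the authoritative list is whatever
vfio_pci_private.h ends up declaring):

  /* Driver-agnostic helpers that stay in vfio_pci_common.c: */
  int vfio_pci_enable(struct vfio_pci_device *vdev);
  void vfio_pci_disable(struct vfio_pci_device *vdev);
  long vfio_pci_ioctl(void *device_data, unsigned int cmd, unsigned long arg);
  ssize_t vfio_pci_read(void *device_data, char __user *buf,
                        size_t count, loff_t *ppos);
  ssize_t vfio_pci_write(void *device_data, const char __user *buf,
                         size_t count, loff_t *ppos);
  int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
  void vfio_pci_request(void *device_data, unsigned int count);
  int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
  void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
  void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver);

  /* Module-specific pieces removed here and kept per driver instead:
   * module parameters (ids, nointxmask, disable_vga, disable_idle_d3),
   * the vfio_device_ops and pci_driver instances, module_init()/
   * module_exit() and the MODULE_* macros.
   */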

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_common.c | 233 -------------------------------------
 1 file changed, 233 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index 90a44a7..11365f1 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -30,30 +30,6 @@
 
 #include "vfio_pci_private.h"
 
-#define DRIVER_VERSION  "0.2"
-#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
-#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
-
-static char ids[1024] __initdata;
-module_param_string(ids, ids, sizeof(ids), 0);
-MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
-
-static bool nointxmask;
-module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(nointxmask,
-		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
-
-#ifdef CONFIG_VFIO_PCI_VGA
-static bool disable_vga;
-module_param(disable_vga, bool, S_IRUGO);
-MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-pci");
-#endif
-
-static bool disable_idle_d3;
-module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(disable_idle_d3,
-		 "Disable using the PCI D3 low power state for idle, unused devices");
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -440,47 +416,6 @@ void vfio_pci_disable(struct vfio_pci_device *vdev)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
-static void vfio_pci_release(void *device_data)
-{
-	struct vfio_pci_device *vdev = device_data;
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!(--vdev->refcnt)) {
-		vfio_spapr_pci_eeh_release(vdev->pdev);
-		vfio_pci_disable(vdev);
-	}
-
-	mutex_unlock(&vdev->reflck->lock);
-
-	module_put(THIS_MODULE);
-}
-
-static int vfio_pci_open(void *device_data)
-{
-	struct vfio_pci_device *vdev = device_data;
-	int ret = 0;
-
-	if (!try_module_get(THIS_MODULE))
-		return -ENODEV;
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!vdev->refcnt) {
-		ret = vfio_pci_enable(vdev);
-		if (ret)
-			goto error;
-
-		vfio_spapr_pci_eeh_open(vdev->pdev);
-	}
-	vdev->refcnt++;
-error:
-	mutex_unlock(&vdev->reflck->lock);
-	if (ret)
-		module_put(THIS_MODULE);
-	return ret;
-}
-
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 {
 	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
@@ -1252,129 +1187,6 @@ void vfio_pci_request(void *device_data, unsigned int count)
 	mutex_unlock(&vdev->igate);
 }
 
-static const struct vfio_device_ops vfio_pci_ops = {
-	.name		= "vfio-pci",
-	.open		= vfio_pci_open,
-	.release	= vfio_pci_release,
-	.ioctl		= vfio_pci_ioctl,
-	.read		= vfio_pci_read,
-	.write		= vfio_pci_write,
-	.mmap		= vfio_pci_mmap,
-	.request	= vfio_pci_request,
-};
-
-static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
-{
-	struct vfio_pci_device *vdev;
-	struct iommu_group *group;
-	int ret;
-
-	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
-		return -EINVAL;
-
-	/*
-	 * Prevent binding to PFs with VFs enabled, this too easily allows
-	 * userspace instance with VFs and PFs from the same device, which
-	 * cannot work.  Disabling SR-IOV here would initiate removing the
-	 * VFs, which would unbind the driver, which is prone to blocking
-	 * if that VF is also in use by vfio-pci.  Just reject these PFs
-	 * and let the user sort it out.
-	 */
-	if (pci_num_vf(pdev)) {
-		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
-		return -EBUSY;
-	}
-
-	group = vfio_iommu_group_get(&pdev->dev);
-	if (!group)
-		return -EINVAL;
-
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev) {
-		vfio_iommu_group_put(group, &pdev->dev);
-		return -ENOMEM;
-	}
-
-	vdev->pdev = pdev;
-	vdev->irq_type = VFIO_PCI_NUM_IRQS;
-	mutex_init(&vdev->igate);
-	spin_lock_init(&vdev->irqlock);
-	mutex_init(&vdev->ioeventfds_lock);
-	INIT_LIST_HEAD(&vdev->ioeventfds_list);
-	vdev->nointxmask = nointxmask;
-#ifdef CONFIG_VFIO_PCI_VGA
-	vdev->disable_vga = disable_vga;
-#endif
-	vdev->disable_idle_d3 = disable_idle_d3;
-
-	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
-	if (ret) {
-		vfio_iommu_group_put(group, &pdev->dev);
-		kfree(vdev);
-		return ret;
-	}
-
-	ret = vfio_pci_reflck_attach(vdev);
-	if (ret) {
-		vfio_del_group_dev(&pdev->dev);
-		vfio_iommu_group_put(group, &pdev->dev);
-		kfree(vdev);
-		return ret;
-	}
-
-	if (vfio_pci_is_vga(pdev)) {
-		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
-		vga_set_legacy_decoding(pdev,
-					vfio_pci_set_vga_decode(vdev, false));
-	}
-
-	vfio_pci_probe_power_state(vdev);
-
-	if (!vdev->disable_idle_d3) {
-		/*
-		 * pci-core sets the device power state to an unknown value at
-		 * bootup and after being removed from a driver.  The only
-		 * transition it allows from this unknown state is to D0, which
-		 * typically happens when a driver calls pci_enable_device().
-		 * We're not ready to enable the device yet, but we do want to
-		 * be able to get to D3.  Therefore first do a D0 transition
-		 * before going to D3.
-		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-	}
-
-	return ret;
-}
-
-static void vfio_pci_remove(struct pci_dev *pdev)
-{
-	struct vfio_pci_device *vdev;
-
-	vdev = vfio_del_group_dev(&pdev->dev);
-	if (!vdev)
-		return;
-
-	vfio_pci_reflck_put(vdev->reflck);
-
-	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
-	kfree(vdev->region);
-	mutex_destroy(&vdev->ioeventfds_lock);
-
-	if (!vdev->disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
-
-	kfree(vdev->pm_save);
-	kfree(vdev);
-
-	if (vfio_pci_is_vga(pdev)) {
-		vga_client_register(pdev, NULL, NULL, NULL);
-		vga_set_legacy_decoding(pdev,
-				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
-				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
-	}
-}
-
 static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 						  pci_channel_state_t state)
 {
@@ -1407,14 +1219,6 @@ static const struct pci_error_handlers vfio_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
 };
 
-static struct pci_driver vfio_pci_driver = {
-	.name		= "vfio-pci",
-	.id_table	= NULL, /* only dynamic ids */
-	.probe		= vfio_pci_probe,
-	.remove		= vfio_pci_remove,
-	.err_handler	= &vfio_err_handlers,
-};
-
 static DEFINE_MUTEX(reflck_lock);
 
 static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
@@ -1609,12 +1413,6 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	kfree(devs.devices);
 }
 
-static void __exit vfio_pci_cleanup(void)
-{
-	pci_unregister_driver(&vfio_pci_driver);
-	vfio_pci_uninit_perm_bits();
-}
-
 void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
@@ -1655,34 +1453,3 @@ void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 				class, class_mask);
 	}
 }
-
-static int __init vfio_pci_init(void)
-{
-	int ret;
-
-	/* Allocate shared config space permission data used by all devices */
-	ret = vfio_pci_init_perm_bits();
-	if (ret)
-		return ret;
-
-	/* Register and scan for devices */
-	ret = pci_register_driver(&vfio_pci_driver);
-	if (ret)
-		goto out_driver;
-
-	vfio_pci_fill_ids(&ids[0], &vfio_pci_driver);
-
-	return 0;
-
-out_driver:
-	vfio_pci_uninit_perm_bits();
-	return ret;
-}
-
-module_init(vfio_pci_init);
-module_exit(vfio_pci_cleanup);
-
-MODULE_VERSION(DRIVER_VERSION);
-MODULE_LICENSE("GPL v2");
-MODULE_AUTHOR(DRIVER_AUTHOR);
-MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.4


* [PATCH v2 07/13] vfio_pci: shrink vfio_pci.c
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (5 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 06/13] vfio_pci: shrink vfio_pci_common.c Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op Liu Yi L
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch removes the common code from vfio_pci.c; the vfio-pci module
now leverages the common functions implemented in vfio_pci_common.c.
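
As a rough sketch of the end state (illustrative, not part of the diff):
vfio_pci.c keeps its module parameters, its own open/release handlers and
the driver registration, while every other vfio_device_ops callback now
points at a function defined in vfio_pci_common.c:

  static const struct vfio_device_ops vfio_pci_ops = {
          .name    = "vfio-pci",
          .open    = vfio_pci_open,     /* stays local (module refcount) */
          .release = vfio_pci_release,  /* stays local */
          .ioctl   = vfio_pci_ioctl,    /* vfio_pci_common.c */
          .read    = vfio_pci_read,     /* vfio_pci_common.c */
          .write   = vfio_pci_write,    /* vfio_pci_common.c */
          .mmap    = vfio_pci_mmap,     /* vfio_pci_common.c */
          .request = vfio_pci_request,  /* vfio_pci_common.c */
  };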

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Makefile           |    3 +-
 drivers/vfio/pci/vfio_pci.c         | 1424 -----------------------------------
 drivers/vfio/pci/vfio_pci_common.c  |    2 +-
 drivers/vfio/pci/vfio_pci_private.h |    2 +
 4 files changed, 5 insertions(+), 1426 deletions(-)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f027f8a..d94317a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
+		vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 90a44a7..c6c39e3 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,392 +54,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-/*
- * Our VGA arbiter participation is limited since we don't know anything
- * about the device itself.  However, if the device is the only VGA device
- * downstream of a bridge and VFIO VGA support is disabled, then we can
- * safely return legacy VGA IO and memory as not decoded since the user
- * has no way to get to it and routing can be disabled externally at the
- * bridge.
- */
-unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
-{
-	struct vfio_pci_device *vdev = opaque;
-	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
-	unsigned char max_busnr;
-	unsigned int decodes;
-
-	if (single_vga || !vfio_vga_disabled(vdev) ||
-		pci_is_root_bus(pdev->bus))
-		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
-		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
-
-	max_busnr = pci_bus_max_busnr(pdev->bus);
-	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
-
-	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
-		if (tmp == pdev ||
-		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
-		    pci_is_root_bus(tmp->bus))
-			continue;
-
-		if (tmp->bus->number >= pdev->bus->number &&
-		    tmp->bus->number <= max_busnr) {
-			pci_dev_put(tmp);
-			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
-			break;
-		}
-	}
-
-	return decodes;
-}
-
-static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
-{
-	struct resource *res;
-	int bar;
-	struct vfio_pci_dummy_resource *dummy_res;
-
-	INIT_LIST_HEAD(&vdev->dummy_resources_list);
-
-	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
-		res = vdev->pdev->resource + bar;
-
-		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
-			goto no_mmap;
-
-		if (!(res->flags & IORESOURCE_MEM))
-			goto no_mmap;
-
-		/*
-		 * The PCI core shouldn't set up a resource with a
-		 * type but zero size. But there may be bugs that
-		 * cause us to do that.
-		 */
-		if (!resource_size(res))
-			goto no_mmap;
-
-		if (resource_size(res) >= PAGE_SIZE) {
-			vdev->bar_mmap_supported[bar] = true;
-			continue;
-		}
-
-		if (!(res->start & ~PAGE_MASK)) {
-			/*
-			 * Add a dummy resource to reserve the remainder
-			 * of the exclusive page in case that hot-add
-			 * device's bar is assigned into it.
-			 */
-			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
-			if (dummy_res == NULL)
-				goto no_mmap;
-
-			dummy_res->resource.name = "vfio sub-page reserved";
-			dummy_res->resource.start = res->end + 1;
-			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
-			dummy_res->resource.flags = res->flags;
-			if (request_resource(res->parent,
-						&dummy_res->resource)) {
-				kfree(dummy_res);
-				goto no_mmap;
-			}
-			dummy_res->index = bar;
-			list_add(&dummy_res->res_next,
-					&vdev->dummy_resources_list);
-			vdev->bar_mmap_supported[bar] = true;
-			continue;
-		}
-		/*
-		 * Here we don't handle the case when the BAR is not page
-		 * aligned because we can't expect the BAR will be
-		 * assigned into the same location in a page in guest
-		 * when we passthrough the BAR. And it's hard to access
-		 * this BAR in userspace because we have no way to get
-		 * the BAR's location in a page.
-		 */
-no_mmap:
-		vdev->bar_mmap_supported[bar] = false;
-	}
-}
-
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
-
-/*
- * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
- * _and_ the ability detect when the device is asserting INTx via PCI_STATUS.
- * If a device implements the former but not the latter we would typically
- * expect broken_intx_masking be set and require an exclusive interrupt.
- * However since we do have control of the device's ability to assert INTx,
- * we can instead pretend that the device does not implement INTx, virtualizing
- * the pin register to report zero and maintaining DisINTx set on the host.
- */
-static bool vfio_pci_nointx(struct pci_dev *pdev)
-{
-	switch (pdev->vendor) {
-	case PCI_VENDOR_ID_INTEL:
-		switch (pdev->device) {
-		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
-		case 0x1572:
-		case 0x1574:
-		case 0x1580 ... 0x1581:
-		case 0x1583 ... 0x158b:
-		case 0x37d0 ... 0x37d2:
-			return true;
-		default:
-			return false;
-		}
-	}
-
-	return false;
-}
-
-void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	u16 pmcsr;
-
-	if (!pdev->pm_cap)
-		return;
-
-	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
-
-	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
-}
-
-/*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0.  Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
- */
-int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	bool needs_restore = false, needs_save = false;
-	int ret;
-
-	if (vdev->needs_pm_restore) {
-		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
-			pci_save_state(pdev);
-			needs_save = true;
-		}
-
-		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
-			needs_restore = true;
-	}
-
-	ret = pci_set_power_state(pdev, state);
-
-	if (!ret) {
-		/* D3 might be unsupported via quirk, skip unless in D3 */
-		if (needs_save && pdev->current_state >= PCI_D3hot) {
-			vdev->pm_save = pci_store_saved_state(pdev);
-		} else if (needs_restore) {
-			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
-			pci_restore_state(pdev);
-		}
-	}
-
-	return ret;
-}
-
-int vfio_pci_enable(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	int ret;
-	u16 cmd;
-	u8 msix_pos;
-
-	vfio_pci_set_power_state(vdev, PCI_D0);
-
-	/* Don't allow our initial saved state to include busmaster */
-	pci_clear_master(pdev);
-
-	ret = pci_enable_device(pdev);
-	if (ret)
-		return ret;
-
-	/* If reset fails because of the device lock, fail this path entirely */
-	ret = pci_try_reset_function(pdev);
-	if (ret == -EAGAIN) {
-		pci_disable_device(pdev);
-		return ret;
-	}
-
-	vdev->reset_works = !ret;
-	pci_save_state(pdev);
-	vdev->pci_saved_state = pci_store_saved_state(pdev);
-	if (!vdev->pci_saved_state)
-		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
-
-	if (likely(!vdev->nointxmask)) {
-		if (vfio_pci_nointx(pdev)) {
-			pci_info(pdev, "Masking broken INTx support\n");
-			vdev->nointx = true;
-			pci_intx(pdev, 0);
-		} else
-			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
-	}
-
-	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
-	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
-		cmd &= ~PCI_COMMAND_INTX_DISABLE;
-		pci_write_config_word(pdev, PCI_COMMAND, cmd);
-	}
-
-	ret = vfio_config_init(vdev);
-	if (ret) {
-		kfree(vdev->pci_saved_state);
-		vdev->pci_saved_state = NULL;
-		pci_disable_device(pdev);
-		return ret;
-	}
-
-	msix_pos = pdev->msix_cap;
-	if (msix_pos) {
-		u16 flags;
-		u32 table;
-
-		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
-		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
-
-		vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
-		vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
-		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
-	} else
-		vdev->msix_bar = 0xFF;
-
-	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
-		vdev->has_vga = true;
-
-
-	if (vfio_pci_is_vga(pdev) &&
-	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
-		ret = vfio_pci_igd_init(vdev);
-		if (ret) {
-			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
-			goto disable_exit;
-		}
-	}
-
-	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
-			goto disable_exit;
-		}
-	}
-
-	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_ibm_npu2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
-			goto disable_exit;
-		}
-	}
-
-	vfio_pci_probe_mmaps(vdev);
-
-	return 0;
-
-disable_exit:
-	vfio_pci_disable(vdev);
-	return ret;
-}
-
-void vfio_pci_disable(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	struct vfio_pci_dummy_resource *dummy_res, *tmp;
-	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
-	int i, bar;
-
-	/* Stop the device from further DMA */
-	pci_clear_master(pdev);
-
-	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
-				VFIO_IRQ_SET_ACTION_TRIGGER,
-				vdev->irq_type, 0, 0, NULL);
-
-	/* Device closed, don't need mutex here */
-	list_for_each_entry_safe(ioeventfd, ioeventfd_tmp,
-				 &vdev->ioeventfds_list, next) {
-		vfio_virqfd_disable(&ioeventfd->virqfd);
-		list_del(&ioeventfd->next);
-		kfree(ioeventfd);
-	}
-	vdev->ioeventfds_nr = 0;
-
-	vdev->virq_disabled = false;
-
-	for (i = 0; i < vdev->num_regions; i++)
-		vdev->region[i].ops->release(vdev, &vdev->region[i]);
-
-	vdev->num_regions = 0;
-	kfree(vdev->region);
-	vdev->region = NULL; /* don't krealloc a freed pointer */
-
-	vfio_config_free(vdev);
-
-	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
-		if (!vdev->barmap[bar])
-			continue;
-		pci_iounmap(pdev, vdev->barmap[bar]);
-		pci_release_selected_regions(pdev, 1 << bar);
-		vdev->barmap[bar] = NULL;
-	}
-
-	list_for_each_entry_safe(dummy_res, tmp,
-				 &vdev->dummy_resources_list, res_next) {
-		list_del(&dummy_res->res_next);
-		release_resource(&dummy_res->resource);
-		kfree(dummy_res);
-	}
-
-	vdev->needs_reset = true;
-
-	/*
-	 * If we have saved state, restore it.  If we can reset the device,
-	 * even better.  Resetting with current state seems better than
-	 * nothing, but saving and restoring current state without reset
-	 * is just busy work.
-	 */
-	if (pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)) {
-		pci_info(pdev, "%s: Couldn't reload saved state\n", __func__);
-
-		if (!vdev->reset_works)
-			goto out;
-
-		pci_save_state(pdev);
-	}
-
-	/*
-	 * Disable INTx and MSI, presumably to avoid spurious interrupts
-	 * during reset.  Stolen from pci_reset_function()
-	 */
-	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
-
-	/*
-	 * Try to reset the device.  The success of this is dependent on
-	 * being able to lock the device, which is not always possible.
-	 */
-	if (vdev->reset_works && !pci_try_reset_function(pdev))
-		vdev->needs_reset = false;
-
-	pci_restore_state(pdev);
-out:
-	pci_disable_device(pdev);
-
-	vfio_pci_try_bus_reset(vdev);
-
-	if (!vdev->disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-}
-
 static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -481,777 +95,6 @@ static int vfio_pci_open(void *device_data)
 	return ret;
 }
 
-static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
-{
-	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
-		u8 pin;
-
-		if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
-		    vdev->nointx || vdev->pdev->is_virtfn)
-			return 0;
-
-		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
-
-		return pin ? 1 : 0;
-	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
-		u8 pos;
-		u16 flags;
-
-		pos = vdev->pdev->msi_cap;
-		if (pos) {
-			pci_read_config_word(vdev->pdev,
-					     pos + PCI_MSI_FLAGS, &flags);
-			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
-		}
-	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
-		u8 pos;
-		u16 flags;
-
-		pos = vdev->pdev->msix_cap;
-		if (pos) {
-			pci_read_config_word(vdev->pdev,
-					     pos + PCI_MSIX_FLAGS, &flags);
-
-			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
-		}
-	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
-			return 1;
-	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
-		return 1;
-	}
-
-	return 0;
-}
-
-static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
-{
-	(*(int *)data)++;
-	return 0;
-}
-
-struct vfio_pci_fill_info {
-	int max;
-	int cur;
-	struct vfio_pci_dependent_device *devices;
-};
-
-static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_fill_info *fill = data;
-	struct iommu_group *iommu_group;
-
-	if (fill->cur == fill->max)
-		return -EAGAIN; /* Something changed, try again */
-
-	iommu_group = iommu_group_get(&pdev->dev);
-	if (!iommu_group)
-		return -EPERM; /* Cannot reset non-isolated devices */
-
-	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
-	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
-	fill->devices[fill->cur].bus = pdev->bus->number;
-	fill->devices[fill->cur].devfn = pdev->devfn;
-	fill->cur++;
-	iommu_group_put(iommu_group);
-	return 0;
-}
-
-struct vfio_pci_group_entry {
-	struct vfio_group *group;
-	int id;
-};
-
-struct vfio_pci_group_info {
-	int count;
-	struct vfio_pci_group_entry *groups;
-};
-
-static int vfio_pci_validate_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_group_info *info = data;
-	struct iommu_group *group;
-	int id, i;
-
-	group = iommu_group_get(&pdev->dev);
-	if (!group)
-		return -EPERM;
-
-	id = iommu_group_id(group);
-
-	for (i = 0; i < info->count; i++)
-		if (info->groups[i].id == id)
-			break;
-
-	iommu_group_put(group);
-
-	return (i == info->count) ? -EINVAL : 0;
-}
-
-static bool vfio_pci_dev_below_slot(struct pci_dev *pdev, struct pci_slot *slot)
-{
-	for (; pdev; pdev = pdev->bus->self)
-		if (pdev->bus == slot->bus)
-			return (pdev->slot == slot);
-	return false;
-}
-
-struct vfio_pci_walk_info {
-	int (*fn)(struct pci_dev *, void *data);
-	void *data;
-	struct pci_dev *pdev;
-	bool slot;
-	int ret;
-};
-
-static int vfio_pci_walk_wrapper(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_walk_info *walk = data;
-
-	if (!walk->slot || vfio_pci_dev_below_slot(pdev, walk->pdev->slot))
-		walk->ret = walk->fn(pdev, walk->data);
-
-	return walk->ret;
-}
-
-static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
-					 int (*fn)(struct pci_dev *,
-						   void *data), void *data,
-					 bool slot)
-{
-	struct vfio_pci_walk_info walk = {
-		.fn = fn, .data = data, .pdev = pdev, .slot = slot, .ret = 0,
-	};
-
-	pci_walk_bus(pdev->bus, vfio_pci_walk_wrapper, &walk);
-
-	return walk.ret;
-}
-
-static int msix_mmappable_cap(struct vfio_pci_device *vdev,
-			      struct vfio_info_cap *caps)
-{
-	struct vfio_info_cap_header header = {
-		.id = VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
-		.version = 1
-	};
-
-	return vfio_info_add_capability(caps, &header, sizeof(header));
-}
-
-int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
-				 unsigned int type, unsigned int subtype,
-				 const struct vfio_pci_regops *ops,
-				 size_t size, u32 flags, void *data)
-{
-	struct vfio_pci_region *region;
-
-	region = krealloc(vdev->region,
-			  (vdev->num_regions + 1) * sizeof(*region),
-			  GFP_KERNEL);
-	if (!region)
-		return -ENOMEM;
-
-	vdev->region = region;
-	vdev->region[vdev->num_regions].type = type;
-	vdev->region[vdev->num_regions].subtype = subtype;
-	vdev->region[vdev->num_regions].ops = ops;
-	vdev->region[vdev->num_regions].size = size;
-	vdev->region[vdev->num_regions].flags = flags;
-	vdev->region[vdev->num_regions].data = data;
-
-	vdev->num_regions++;
-
-	return 0;
-}
-
-long vfio_pci_ioctl(void *device_data,
-		   unsigned int cmd, unsigned long arg)
-{
-	struct vfio_pci_device *vdev = device_data;
-	unsigned long minsz;
-
-	if (cmd == VFIO_DEVICE_GET_INFO) {
-		struct vfio_device_info info;
-
-		minsz = offsetofend(struct vfio_device_info, num_irqs);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		info.flags = VFIO_DEVICE_FLAGS_PCI;
-
-		if (vdev->reset_works)
-			info.flags |= VFIO_DEVICE_FLAGS_RESET;
-
-		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
-		info.num_irqs = VFIO_PCI_NUM_IRQS;
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
-		struct pci_dev *pdev = vdev->pdev;
-		struct vfio_region_info info;
-		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
-		int i, ret;
-
-		minsz = offsetofend(struct vfio_region_info, offset);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		switch (info.index) {
-		case VFIO_PCI_CONFIG_REGION_INDEX:
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = pdev->cfg_size;
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-			break;
-		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = pci_resource_len(pdev, info.index);
-			if (!info.size) {
-				info.flags = 0;
-				break;
-			}
-
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-			if (vdev->bar_mmap_supported[info.index]) {
-				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
-				if (info.index == vdev->msix_bar) {
-					ret = msix_mmappable_cap(vdev, &caps);
-					if (ret)
-						return ret;
-				}
-			}
-
-			break;
-		case VFIO_PCI_ROM_REGION_INDEX:
-		{
-			void __iomem *io;
-			size_t size;
-			u16 orig_cmd;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.flags = 0;
-
-			/* Report the BAR size, not the ROM size */
-			info.size = pci_resource_len(pdev, info.index);
-			if (!info.size) {
-				/* Shadow ROMs appear as PCI option ROMs */
-				if (pdev->resource[PCI_ROM_RESOURCE].flags &
-							IORESOURCE_ROM_SHADOW)
-					info.size = 0x20000;
-				else
-					break;
-			}
-
-			/*
-			 * Is it really there?  Enable memory decode for
-			 * implicit access in pci_map_rom().
-			 */
-			pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
-			pci_write_config_word(pdev, PCI_COMMAND,
-					      orig_cmd | PCI_COMMAND_MEMORY);
-
-			io = pci_map_rom(pdev, &size);
-			if (io) {
-				info.flags = VFIO_REGION_INFO_FLAG_READ;
-				pci_unmap_rom(pdev, io);
-			} else {
-				info.size = 0;
-			}
-
-			pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
-			break;
-		}
-		case VFIO_PCI_VGA_REGION_INDEX:
-			if (!vdev->has_vga)
-				return -EINVAL;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = 0xc0000;
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-
-			break;
-		default:
-		{
-			struct vfio_region_info_cap_type cap_type = {
-					.header.id = VFIO_REGION_INFO_CAP_TYPE,
-					.header.version = 1 };
-
-			if (info.index >=
-			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
-				return -EINVAL;
-			info.index = array_index_nospec(info.index,
-							VFIO_PCI_NUM_REGIONS +
-							vdev->num_regions);
-
-			i = info.index - VFIO_PCI_NUM_REGIONS;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = vdev->region[i].size;
-			info.flags = vdev->region[i].flags;
-
-			cap_type.type = vdev->region[i].type;
-			cap_type.subtype = vdev->region[i].subtype;
-
-			ret = vfio_info_add_capability(&caps, &cap_type.header,
-						       sizeof(cap_type));
-			if (ret)
-				return ret;
-
-			if (vdev->region[i].ops->add_capability) {
-				ret = vdev->region[i].ops->add_capability(vdev,
-						&vdev->region[i], &caps);
-				if (ret)
-					return ret;
-			}
-		}
-		}
-
-		if (caps.size) {
-			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
-			if (info.argsz < sizeof(info) + caps.size) {
-				info.argsz = sizeof(info) + caps.size;
-				info.cap_offset = 0;
-			} else {
-				vfio_info_cap_shift(&caps, sizeof(info));
-				if (copy_to_user((void __user *)arg +
-						  sizeof(info), caps.buf,
-						  caps.size)) {
-					kfree(caps.buf);
-					return -EFAULT;
-				}
-				info.cap_offset = sizeof(info);
-			}
-
-			kfree(caps.buf);
-		}
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
-		struct vfio_irq_info info;
-
-		minsz = offsetofend(struct vfio_irq_info, count);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
-			return -EINVAL;
-
-		switch (info.index) {
-		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
-		case VFIO_PCI_REQ_IRQ_INDEX:
-			break;
-		case VFIO_PCI_ERR_IRQ_INDEX:
-			if (pci_is_pcie(vdev->pdev))
-				break;
-		/* fall through */
-		default:
-			return -EINVAL;
-		}
-
-		info.flags = VFIO_IRQ_INFO_EVENTFD;
-
-		info.count = vfio_pci_get_irq_count(vdev, info.index);
-
-		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
-			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
-				       VFIO_IRQ_INFO_AUTOMASKED);
-		else
-			info.flags |= VFIO_IRQ_INFO_NORESIZE;
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
-		struct vfio_irq_set hdr;
-		u8 *data = NULL;
-		int max, ret = 0;
-		size_t data_size = 0;
-
-		minsz = offsetofend(struct vfio_irq_set, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		max = vfio_pci_get_irq_count(vdev, hdr.index);
-
-		ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
-						 VFIO_PCI_NUM_IRQS, &data_size);
-		if (ret)
-			return ret;
-
-		if (data_size) {
-			data = memdup_user((void __user *)(arg + minsz),
-					    data_size);
-			if (IS_ERR(data))
-				return PTR_ERR(data);
-		}
-
-		mutex_lock(&vdev->igate);
-
-		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
-					      hdr.start, hdr.count, data);
-
-		mutex_unlock(&vdev->igate);
-		kfree(data);
-
-		return ret;
-
-	} else if (cmd == VFIO_DEVICE_RESET) {
-		return vdev->reset_works ?
-			pci_try_reset_function(vdev->pdev) : -EINVAL;
-
-	} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
-		struct vfio_pci_hot_reset_info hdr;
-		struct vfio_pci_fill_info fill = { 0 };
-		struct vfio_pci_dependent_device *devices = NULL;
-		bool slot = false;
-		int ret = 0;
-
-		minsz = offsetofend(struct vfio_pci_hot_reset_info, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (hdr.argsz < minsz)
-			return -EINVAL;
-
-		hdr.flags = 0;
-
-		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
-			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
-			return -ENODEV;
-
-		/* How many devices are affected? */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_count_devs,
-						    &fill.max, slot);
-		if (ret)
-			return ret;
-
-		WARN_ON(!fill.max); /* Should always be at least one */
-
-		/*
-		 * If there's enough space, fill it now, otherwise return
-		 * -ENOSPC and the number of devices affected.
-		 */
-		if (hdr.argsz < sizeof(hdr) + (fill.max * sizeof(*devices))) {
-			ret = -ENOSPC;
-			hdr.count = fill.max;
-			goto reset_info_exit;
-		}
-
-		devices = kcalloc(fill.max, sizeof(*devices), GFP_KERNEL);
-		if (!devices)
-			return -ENOMEM;
-
-		fill.devices = devices;
-
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_fill_devs,
-						    &fill, slot);
-
-		/*
-		 * If a device was removed between counting and filling,
-		 * we may come up short of fill.max.  If a device was
-		 * added, we'll have a return of -EAGAIN above.
-		 */
-		if (!ret)
-			hdr.count = fill.cur;
-
-reset_info_exit:
-		if (copy_to_user((void __user *)arg, &hdr, minsz))
-			ret = -EFAULT;
-
-		if (!ret) {
-			if (copy_to_user((void __user *)(arg + minsz), devices,
-					 hdr.count * sizeof(*devices)))
-				ret = -EFAULT;
-		}
-
-		kfree(devices);
-		return ret;
-
-	} else if (cmd == VFIO_DEVICE_PCI_HOT_RESET) {
-		struct vfio_pci_hot_reset hdr;
-		int32_t *group_fds;
-		struct vfio_pci_group_entry *groups;
-		struct vfio_pci_group_info info;
-		bool slot = false;
-		int i, count = 0, ret = 0;
-
-		minsz = offsetofend(struct vfio_pci_hot_reset, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (hdr.argsz < minsz || hdr.flags)
-			return -EINVAL;
-
-		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
-			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
-			return -ENODEV;
-
-		/*
-		 * We can't let userspace give us an arbitrarily large
-		 * buffer to copy, so verify how many we think there
-		 * could be.  Note groups can have multiple devices so
-		 * one group per device is the max.
-		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_count_devs,
-						    &count, slot);
-		if (ret)
-			return ret;
-
-		/* Somewhere between 1 and count is OK */
-		if (!hdr.count || hdr.count > count)
-			return -EINVAL;
-
-		group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
-		groups = kcalloc(hdr.count, sizeof(*groups), GFP_KERNEL);
-		if (!group_fds || !groups) {
-			kfree(group_fds);
-			kfree(groups);
-			return -ENOMEM;
-		}
-
-		if (copy_from_user(group_fds, (void __user *)(arg + minsz),
-				   hdr.count * sizeof(*group_fds))) {
-			kfree(group_fds);
-			kfree(groups);
-			return -EFAULT;
-		}
-
-		/*
-		 * For each group_fd, get the group through the vfio external
-		 * user interface and store the group and iommu ID.  This
-		 * ensures the group is held across the reset.
-		 */
-		for (i = 0; i < hdr.count; i++) {
-			struct vfio_group *group;
-			struct fd f = fdget(group_fds[i]);
-			if (!f.file) {
-				ret = -EBADF;
-				break;
-			}
-
-			group = vfio_group_get_external_user(f.file);
-			fdput(f);
-			if (IS_ERR(group)) {
-				ret = PTR_ERR(group);
-				break;
-			}
-
-			groups[i].group = group;
-			groups[i].id = vfio_external_user_iommu_id(group);
-		}
-
-		kfree(group_fds);
-
-		/* release reference to groups on error */
-		if (ret)
-			goto hot_reset_release;
-
-		info.count = hdr.count;
-		info.groups = groups;
-
-		/*
-		 * Test whether all the affected devices are contained
-		 * by the set of groups provided by the user.
-		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_validate_devs,
-						    &info, slot);
-		if (!ret)
-			/* User has access, do the reset */
-			ret = pci_reset_bus(vdev->pdev);
-
-hot_reset_release:
-		for (i--; i >= 0; i--)
-			vfio_group_put_external_user(groups[i].group);
-
-		kfree(groups);
-		return ret;
-	} else if (cmd == VFIO_DEVICE_IOEVENTFD) {
-		struct vfio_device_ioeventfd ioeventfd;
-		int count;
-
-		minsz = offsetofend(struct vfio_device_ioeventfd, fd);
-
-		if (copy_from_user(&ioeventfd, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (ioeventfd.argsz < minsz)
-			return -EINVAL;
-
-		if (ioeventfd.flags & ~VFIO_DEVICE_IOEVENTFD_SIZE_MASK)
-			return -EINVAL;
-
-		count = ioeventfd.flags & VFIO_DEVICE_IOEVENTFD_SIZE_MASK;
-
-		if (hweight8(count) != 1 || ioeventfd.fd < -1)
-			return -EINVAL;
-
-		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
-					  ioeventfd.data, count, ioeventfd.fd);
-	}
-
-	return -ENOTTY;
-}
-
-static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
-			   size_t count, loff_t *ppos, bool iswrite)
-{
-	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
-	struct vfio_pci_device *vdev = device_data;
-
-	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
-		return -EINVAL;
-
-	switch (index) {
-	case VFIO_PCI_CONFIG_REGION_INDEX:
-		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
-
-	case VFIO_PCI_ROM_REGION_INDEX:
-		if (iswrite)
-			return -EINVAL;
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
-
-	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
-
-	case VFIO_PCI_VGA_REGION_INDEX:
-		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
-	default:
-		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
-						   count, ppos, iswrite);
-	}
-
-	return -EINVAL;
-}
-
-ssize_t vfio_pci_read(void *device_data, char __user *buf,
-			     size_t count, loff_t *ppos)
-{
-	if (!count)
-		return 0;
-
-	return vfio_pci_rw(device_data, buf, count, ppos, false);
-}
-
-ssize_t vfio_pci_write(void *device_data, const char __user *buf,
-			      size_t count, loff_t *ppos)
-{
-	if (!count)
-		return 0;
-
-	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
-}
-
-int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
-{
-	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
-	unsigned int index;
-	u64 phys_len, req_len, pgoff, req_start;
-	int ret;
-
-	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
-
-	if (vma->vm_end < vma->vm_start)
-		return -EINVAL;
-	if ((vma->vm_flags & VM_SHARED) == 0)
-		return -EINVAL;
-	if (index >= VFIO_PCI_NUM_REGIONS) {
-		int regnum = index - VFIO_PCI_NUM_REGIONS;
-		struct vfio_pci_region *region = vdev->region + regnum;
-
-		if (region && region->ops && region->ops->mmap &&
-		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
-			return region->ops->mmap(vdev, region, vma);
-		return -EINVAL;
-	}
-	if (index >= VFIO_PCI_ROM_REGION_INDEX)
-		return -EINVAL;
-	if (!vdev->bar_mmap_supported[index])
-		return -EINVAL;
-
-	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
-	req_len = vma->vm_end - vma->vm_start;
-	pgoff = vma->vm_pgoff &
-		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
-	req_start = pgoff << PAGE_SHIFT;
-
-	if (req_start + req_len > phys_len)
-		return -EINVAL;
-
-	/*
-	 * Even though we don't make use of the barmap for the mmap,
-	 * we need to request the region and the barmap tracks that.
-	 */
-	if (!vdev->barmap[index]) {
-		ret = pci_request_selected_regions(pdev,
-						   1 << index, "vfio-pci");
-		if (ret)
-			return ret;
-
-		vdev->barmap[index] = pci_iomap(pdev, index, 0);
-		if (!vdev->barmap[index]) {
-			pci_release_selected_regions(pdev, 1 << index);
-			return -ENOMEM;
-		}
-	}
-
-	vma->vm_private_data = vdev;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
-
-	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       req_len, vma->vm_page_prot);
-}
-
-void vfio_pci_request(void *device_data, unsigned int count)
-{
-	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
-
-	mutex_lock(&vdev->igate);
-
-	if (vdev->req_trigger) {
-		if (!(count % 10))
-			pci_notice_ratelimited(pdev,
-				"Relaying device request to user (#%u)\n",
-				count);
-		eventfd_signal(vdev->req_trigger, 1);
-	} else if (count == 0) {
-		pci_warn(pdev,
-			"No device request channel registered, blocked until released by user\n");
-	}
-
-	mutex_unlock(&vdev->igate);
-}
-
 static const struct vfio_device_ops vfio_pci_ops = {
 	.name		= "vfio-pci",
 	.open		= vfio_pci_open,
@@ -1375,38 +218,6 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	}
 }
 
-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
-{
-	struct vfio_pci_device *vdev;
-	struct vfio_device *device;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (device == NULL)
-		return PCI_ERS_RESULT_DISCONNECT;
-
-	vdev = vfio_device_data(device);
-	if (vdev == NULL) {
-		vfio_device_put(device);
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	mutex_lock(&vdev->igate);
-
-	if (vdev->err_trigger)
-		eventfd_signal(vdev->err_trigger, 1);
-
-	mutex_unlock(&vdev->igate);
-
-	vfio_device_put(device);
-
-	return PCI_ERS_RESULT_CAN_RECOVER;
-}
-
-static const struct pci_error_handlers vfio_err_handlers = {
-	.error_detected = vfio_pci_aer_err_detected,
-};
-
 static struct pci_driver vfio_pci_driver = {
 	.name		= "vfio-pci",
 	.id_table	= NULL, /* only dynamic ids */
@@ -1415,247 +226,12 @@ static struct pci_driver vfio_pci_driver = {
 	.err_handler	= &vfio_err_handlers,
 };
 
-static DEFINE_MUTEX(reflck_lock);
-
-static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
-{
-	struct vfio_pci_reflck *reflck;
-
-	reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
-	if (!reflck)
-		return ERR_PTR(-ENOMEM);
-
-	kref_init(&reflck->kref);
-	mutex_init(&reflck->lock);
-
-	return reflck;
-}
-
-static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
-{
-	kref_get(&reflck->kref);
-}
-
-static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_device *vdev = data;
-	struct vfio_pci_reflck **preflck = &vdev->reflck;
-	struct vfio_device *device;
-	struct vfio_pci_device *tmp;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (!device)
-		return 0;
-
-	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
-		vfio_device_put(device);
-		return 0;
-	}
-
-	tmp = vfio_device_data(device);
-
-	if (tmp->reflck) {
-		vfio_pci_reflck_get(tmp->reflck);
-		*preflck = tmp->reflck;
-		vfio_device_put(device);
-		return 1;
-	}
-
-	vfio_device_put(device);
-	return 0;
-}
-
-int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
-{
-	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
-
-	mutex_lock(&reflck_lock);
-
-	if (pci_is_root_bus(vdev->pdev->bus) ||
-	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
-					  vdev, slot) <= 0)
-		vdev->reflck = vfio_pci_reflck_alloc();
-
-	mutex_unlock(&reflck_lock);
-
-	return PTR_ERR_OR_ZERO(vdev->reflck);
-}
-
-static void vfio_pci_reflck_release(struct kref *kref)
-{
-	struct vfio_pci_reflck *reflck = container_of(kref,
-						      struct vfio_pci_reflck,
-						      kref);
-
-	kfree(reflck);
-	mutex_unlock(&reflck_lock);
-}
-
-void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
-{
-	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
-}
-
-struct vfio_devices {
-	struct vfio_device **devices;
-	struct vfio_pci_device *vdev;
-	int cur_index;
-	int max_index;
-};
-
-static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_devices *devs = data;
-	struct vfio_device *device;
-	struct vfio_pci_device *tmp;
-
-	if (devs->cur_index == devs->max_index)
-		return -ENOSPC;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (!device)
-		return -EINVAL;
-
-	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
-		vfio_device_put(device);
-		return -EBUSY;
-	}
-
-	tmp = vfio_device_data(device);
-
-	/* Fault if the device is not unused */
-	if (tmp->refcnt) {
-		vfio_device_put(device);
-		return -EBUSY;
-	}
-
-	devs->devices[devs->cur_index++] = device;
-	return 0;
-}
-
-/*
- * If a bus or slot reset is available for the provided device and:
- *  - All of the devices affected by that bus or slot reset are unused
- *    (!refcnt)
- *  - At least one of the affected devices is marked dirty via
- *    needs_reset (such as by lack of FLR support)
- * Then attempt to perform that bus or slot reset.  Callers are required
- * to hold vdev->reflck->lock, protecting the bus/slot reset group from
- * concurrent opens.  A vfio_device reference is acquired for each device
- * to prevent unbinds during the reset operation.
- *
- * NB: vfio-core considers a group to be viable even if some devices are
- * bound to drivers like pci-stub or pcieport.  Here we require all devices
- * to be bound to vfio_pci since that's the only way we can be sure they
- * stay put.
- */
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
-{
-	struct vfio_devices devs = { .cur_index = 0 };
-	int i = 0, ret = -EINVAL;
-	bool slot = false;
-	struct vfio_pci_device *tmp;
-
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return;
-
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
-					  &i, slot) || !i)
-		return;
-
-	devs.max_index = i;
-	devs.devices = kcalloc(i, sizeof(struct vfio_device *), GFP_KERNEL);
-	if (!devs.devices)
-		return;
-
-	devs.vdev = vdev;
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
-					  vfio_pci_get_unused_devs,
-					  &devs, slot))
-		goto put_devs;
-
-	/* Does at least one need a reset? */
-	for (i = 0; i < devs.cur_index; i++) {
-		tmp = vfio_device_data(devs.devices[i]);
-		if (tmp->needs_reset) {
-			ret = pci_reset_bus(vdev->pdev);
-			break;
-		}
-	}
-
-put_devs:
-	for (i = 0; i < devs.cur_index; i++) {
-		tmp = vfio_device_data(devs.devices[i]);
-
-		/*
-		 * If reset was successful, affected devices no longer need
-		 * a reset and we should return all the collateral devices
-		 * to low power.  If not successful, we either didn't reset
-		 * the bus or timed out waiting for it, so let's not touch
-		 * the power state.
-		 */
-		if (!ret) {
-			tmp->needs_reset = false;
-
-			if (tmp != vdev && !tmp->disable_idle_d3)
-				vfio_pci_set_power_state(tmp, PCI_D3hot);
-		}
-
-		vfio_device_put(devs.devices[i]);
-	}
-
-	kfree(devs.devices);
-}
-
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
 	vfio_pci_uninit_perm_bits();
 }
 
-void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
-{
-	char *p, *id;
-	int rc;
-
-	/* no ids passed actually */
-	if (ids[0] == '\0')
-		return;
-
-	/* add ids specified in the module parameter */
-	p = ids;
-	while ((id = strsep(&p, ","))) {
-		unsigned int vendor, device, subvendor = PCI_ANY_ID,
-			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
-		int fields;
-
-		if (!strlen(id))
-			continue;
-
-		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
-				&vendor, &device, &subvendor, &subdevice,
-				&class, &class_mask);
-
-		if (fields < 2) {
-			pr_warn("invalid id string \"%s\"\n", id);
-			continue;
-		}
-
-		rc = pci_add_dynid(driver, vendor, device,
-				   subvendor, subdevice, class, class_mask, 0);
-		if (rc)
-			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask, rc);
-		else
-			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask);
-	}
-}
-
 static int __init vfio_pci_init(void)
 {
 	int ret;
diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index 11365f1..318ad29 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -1215,7 +1215,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
-static const struct pci_error_handlers vfio_err_handlers = {
+const struct pci_error_handlers vfio_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
 };
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 53d0385..b64c4e9 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -135,6 +135,8 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+extern const struct pci_error_handlers vfio_err_handlers;
+
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (6 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 07/13] vfio_pci: shrink vfio_pci.c Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-26  2:36   ` Alex Williamson
  2019-09-05  7:59 ` [PATCH v2 09/13] samples: add vfio-mdev-pci driver Liu Yi L
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

cap_perms and ecap_perms can be allocated and freed by different
modules, e.g. the vfio-mdev-pci sample driver in addition to vfio-pci.
To make the initialization and teardown of cap_perms and ecap_perms
safe, this patch adds an atomic variable to track the users of the
cap/ecap perm_bits: the first caller of vfio_pci_init_perm_bits()
initializes the bits, and the last caller of
vfio_pci_uninit_perm_bits() frees them.
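
The counting follows the usual first-user-initializes, last-user-frees
pattern; below is a minimal sketch for illustration only (do_init() and
do_uninit() are hypothetical helpers, not functions from this series):

static atomic_t users = ATOMIC_INIT(0);

static int perm_bits_get(void)
{
	/* only the first user performs the real initialization */
	if (atomic_inc_return(&users) != 1)
		return 0;
	return do_init();	/* hypothetical init helper */
}

static void perm_bits_put(void)
{
	/* only the last user frees the shared tables */
	if (atomic_dec_return(&users))
		return;
	do_uninit();		/* hypothetical cleanup helper */
}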

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index f0891bd..1b3e6e5 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -992,11 +992,17 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
 	return 0;
 }
 
+/* Track the number of users of the cap/ecap perm_bits */
+atomic_t vfio_pci_perm_bits_users = ATOMIC_INIT(0);
+
 /*
  * Initialize the shared permission tables
  */
 void vfio_pci_uninit_perm_bits(void)
 {
+	if (atomic_dec_return(&vfio_pci_perm_bits_users))
+		return;
+
 	free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
 
 	free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
@@ -1013,6 +1019,9 @@ int __init vfio_pci_init_perm_bits(void)
 {
 	int ret;
 
+	if (atomic_inc_return(&vfio_pci_perm_bits_users) != 1)
+		return 0;
+
 	/* Basic config space */
 	ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 09/13] samples: add vfio-mdev-pci driver
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (7 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-05  7:59 ` [PATCH v2 10/13] samples: refine " Liu Yi L
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian,
	Masahiro Yamada

This patch adds a sample driver named vfio-mdev-pci, which wraps a
PCI device as a mediated device. Once a PCI device is bound to the
vfio-mdev-pci driver, user-space access to it goes through the vfio
mdev framework, and the device follows the usual mdev management
method, e.g. the user should create an mdev before exposing the
device to user space.

The benefit of this new driver is that it acts as a sample driver for
the recent changes from the "vfio/mdev: IOMMU aware mediated device"
patchset. It could also be a good experimental driver for future
device-specific mdev migration support.

To use this driver:
a) build and load the vfio-mdev-pci.ko module
   run "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
   then load it with the following commands
   > sudo modprobe vfio
   > sudo modprobe vfio-pci
   > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko

b) unbind the original device driver
   e.g. use the following command
   > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind

c) bind vfio-mdev-pci driver to the physical device
   > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id

d) check the supported mdev instances
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
     vfio-mdev-pci-type_name
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type_name/
     available_instances  create  device_api  devices  name

e) create an mdev on this physical device (only 1 instance)
   > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
     /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type_name/create

f) passthru the mdev to the guest
   add the following line to the QEMU boot command
   -device vfio-pci,\
    sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003

g) destroy mdev
   > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
     remove

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Makefile        |   6 +
 drivers/vfio/pci/vfio_mdev_pci.c | 403 +++++++++++++++++++++++++++++++++++++++
 samples/Kconfig                  |  11 ++
 3 files changed, 420 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_mdev_pci.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d94317a..ac118ef 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,4 +5,10 @@ vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
+vfio-mdev-pci-y := vfio_mdev_pci.o vfio_pci_common.o vfio_pci_intrs.o \
+			vfio_pci_rdwr.o vfio_pci_config.o
+vfio-mdev-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-mdev-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
new file mode 100644
index 0000000..07c8067
--- /dev/null
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -0,0 +1,403 @@
+/*
+ * Copyright © 2019 Intel Corporation.
+ *     Author: Liu, Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_pci.c:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+#include <linux/mdev.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Liu, Yi L <yi.l.liu@intel.com>"
+#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
+
+#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool nointxmask;
+module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nointxmask,
+		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
+
+#ifdef CONFIG_VFIO_PCI_VGA
+static bool disable_vga;
+module_param(disable_vga, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
+#endif
+
+static bool disable_idle_d3;
+module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_idle_d3,
+		 "Disable using the PCI D3 low power state for idle, unused devices");
+
+static struct pci_driver vfio_mdev_pci_driver;
+
+static ssize_t
+name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%s-type1\n", dev_name(dev));
+}
+
+MDEV_TYPE_ATTR_RO(name);
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%d\n", 1);
+}
+
+MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+		char *buf)
+{
+	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+
+MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *vfio_mdev_pci_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_device_api.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group vfio_mdev_pci_type_group1 = {
+	.name  = "type1",
+	.attrs = vfio_mdev_pci_types_attrs,
+};
+
+struct attribute_group *vfio_mdev_pci_type_groups[] = {
+	&vfio_mdev_pci_type_group1,
+	NULL,
+};
+
+struct vfio_mdev_pci {
+	struct vfio_pci_device *vdev;
+	struct mdev_device *mdev;
+	unsigned long handle;
+};
+
+static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct device *pdev;
+	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci *pmdev;
+	int ret;
+
+	pdev = mdev_parent_dev(mdev);
+	vdev = dev_get_drvdata(pdev);
+	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
+	if (pmdev == NULL) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	pmdev->mdev = mdev;
+	pmdev->vdev = vdev;
+	mdev_set_drvdata(mdev, pmdev);
+	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
+	if (ret) {
+		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
+			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+static int vfio_mdev_pci_remove(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	kfree(pmdev);
+	pr_info("%s, succeeded for mdev: %s\n", __func__,
+		     dev_name(mdev_dev(mdev)));
+
+	return 0;
+}
+
+static int vfio_mdev_pci_open(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_spapr_pci_eeh_open(vdev->pdev);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (!ret)
+		pr_info("Succeeded to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+	else {
+		pr_info("Failed to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+		module_put(THIS_MODULE);
+	}
+	return ret;
+}
+
+static void vfio_mdev_pci_release(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+
+	pr_info("Release mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!(--vdev->refcnt)) {
+		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_pci_disable(vdev);
+	}
+
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			     unsigned long arg)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
+}
+
+static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
+				struct vm_area_struct *vma)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_mmap(pmdev->vdev, vma);
+}
+
+static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
+}
+
+static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
+				const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
+}
+
+static const struct mdev_parent_ops vfio_mdev_pci_ops = {
+	.supported_type_groups	= vfio_mdev_pci_type_groups,
+	.create			= vfio_mdev_pci_create,
+	.remove			= vfio_mdev_pci_remove,
+
+	.open			= vfio_mdev_pci_open,
+	.release		= vfio_mdev_pci_release,
+
+	.read			= vfio_mdev_pci_read,
+	.write			= vfio_mdev_pci_write,
+	.mmap			= vfio_mdev_pci_mmap,
+	.ioctl			= vfio_mdev_pci_ioctl,
+};
+
+static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
+				       const struct pci_device_id *id)
+{
+	struct vfio_pci_device *vdev;
+	int ret;
+
+	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	/*
+	 * Prevent binding to PFs with VFs enabled, this too easily allows
+	 * userspace instance with VFs and PFs from the same device, which
+	 * cannot work.  Disabling SR-IOV here would initiate removing the
+	 * VFs, which would unbind the driver, which is prone to blocking
+	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
+	 * reject these PFs and let the user sort it out.
+	 */
+	if (pci_num_vf(pdev)) {
+		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
+		return -EBUSY;
+	}
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	mutex_init(&vdev->ioeventfds_lock);
+	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
+
+	pci_set_drvdata(pdev, vdev);
+
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		pci_set_drvdata(pdev, NULL);
+		kfree(vdev);
+		return ret;
+	}
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
+		vga_set_legacy_decoding(pdev,
+					vfio_pci_set_vga_decode(vdev, false));
+	}
+
+	vfio_pci_probe_power_state(vdev);
+
+	if (!vdev->disable_idle_d3) {
+		/*
+		 * pci-core sets the device power state to an unknown value at
+		 * bootup and after being removed from a driver.  The only
+		 * transition it allows from this unknown state is to D0, which
+		 * typically happens when a driver calls pci_enable_device().
+		 * We're not ready to enable the device yet, but we do want to
+		 * be able to get to D3.  Therefore first do a D0 transition
+		 * before going to D3.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	}
+
+	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	if (ret)
+		pr_err("Cannot register mdev for device %s\n",
+			dev_name(&pdev->dev));
+	else
+		pr_info("Wrap device %s as a mdev\n", dev_name(&pdev->dev));
+
+	return ret;
+}
+
+static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = pci_get_drvdata(pdev);
+	if (!vdev)
+		return;
+
+	vfio_pci_reflck_put(vdev->reflck);
+
+	kfree(vdev->region);
+	mutex_destroy(&vdev->ioeventfds_lock);
+
+	if (!disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	kfree(vdev->pm_save);
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, NULL, NULL, NULL);
+		vga_set_legacy_decoding(pdev,
+				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
+	}
+
+	kfree(vdev);
+}
+
+static struct pci_driver vfio_mdev_pci_driver = {
+	.name		= VFIO_MDEV_PCI_NAME,
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_mdev_pci_driver_probe,
+	.remove		= vfio_mdev_pci_driver_remove,
+	.err_handler	= &vfio_err_handlers,
+};
+
+static void __exit vfio_mdev_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_mdev_pci_driver);
+	vfio_pci_uninit_perm_bits();
+}
+
+static int __init vfio_mdev_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permision data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_mdev_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	vfio_pci_fill_ids(ids, &vfio_mdev_pci_driver);
+
+	return 0;
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_mdev_pci_init);
+module_exit(vfio_mdev_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/samples/Kconfig b/samples/Kconfig
index c8dacb4..1513fef 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -169,4 +169,15 @@ config SAMPLE_VFS
 	  as mount API and statx().  Note that this is restricted to the x86
 	  arch whilst it accesses system calls that aren't yet in all arches.
 
+config SAMPLE_VFIO_MDEV_PCI
+	tristate "Sample driver for wrapping PCI device as a mdev"
+	depends on PCI && EVENTFD && VFIO_MDEV_DEVICE
+	select VFIO_VIRQFD
+	select IRQ_BYPASS_MANAGER
+	help
+	  Sample driver for wrapping a PCI device as a mdev. Once bound to
+	  this driver, device passthru should go through the mdev path.
+
+	  If you don't know what to do here, say N.
+
 endif # SAMPLES
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 10/13] samples: refine vfio-mdev-pci driver
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (8 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 09/13] samples: add vfio-mdev-pci driver Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-26  2:36   ` Alex Williamson
  2019-09-05  7:59 ` [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev() Liu Yi L
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

From: Alex Williamson <alex.williamson@redhat.com>

This patch refines the implementation of the original vfio-mdev-pci
driver.

The vfio-mdev-pci type name is now generated per the following rule:

	vmdev->attr.name = kasprintf(GFP_KERNEL,
				     "%04x:%04x:%04x:%04x:%06x:%02x",
				     pdev->vendor, pdev->device,
				     pdev->subsystem_vendor,
				     pdev->subsystem_device, pdev->class,
				     pdev->revision);

Before use, check /sys/bus/pci/devices/$bdf/mdev_supported_types/ to
confirm the resulting supported mdev type.
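
For example, a hypothetical device with vendor:device 8086:10fb,
subsystem 8086:0003, class 0x020000 and revision 0x01 would yield the
type name below, which shows up under mdev_supported_types/ prefixed
with the driver name:

	vfio-mdev-pci-8086:10fb:8086:0003:020000:01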

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_mdev_pci.c | 123 +++++++++++++++++++++++----------------
 1 file changed, 72 insertions(+), 51 deletions(-)

diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
index 07c8067..09143d3 100644
--- a/drivers/vfio/pci/vfio_mdev_pci.c
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
 
 static struct pci_driver vfio_mdev_pci_driver;
 
-static ssize_t
-name_show(struct kobject *kobj, struct device *dev, char *buf)
-{
-	return sprintf(buf, "%s-type1\n", dev_name(dev));
-}
-
-MDEV_TYPE_ATTR_RO(name);
+struct vfio_mdev_pci_device {
+	struct vfio_pci_device vdev;
+	struct mdev_parent_ops ops;
+	struct attribute_group *groups[2];
+	struct attribute_group attr;
+	atomic_t avail;
+};
 
 static ssize_t
 available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "%d\n", 1);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = pci_get_drvdata(to_pci_dev(dev));
+
+	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
 }
 
 MDEV_TYPE_ATTR_RO(available_instances);
@@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
 MDEV_TYPE_ATTR_RO(device_api);
 
 static struct attribute *vfio_mdev_pci_types_attrs[] = {
-	&mdev_type_attr_name.attr,
 	&mdev_type_attr_device_api.attr,
 	&mdev_type_attr_available_instances.attr,
 	NULL,
 };
 
-static struct attribute_group vfio_mdev_pci_type_group1 = {
-	.name  = "type1",
-	.attrs = vfio_mdev_pci_types_attrs,
-};
-
-struct attribute_group *vfio_mdev_pci_type_groups[] = {
-	&vfio_mdev_pci_type_group1,
-	NULL,
-};
-
 struct vfio_mdev_pci {
 	struct vfio_pci_device *vdev;
 	struct mdev_device *mdev;
-	unsigned long handle;
 };
 
 static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
 {
 	struct device *pdev;
-	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_mdev_pci *pmdev;
 	int ret;
 
 	pdev = mdev_parent_dev(mdev);
-	vdev = dev_get_drvdata(pdev);
+	vmdev = dev_get_drvdata(pdev);
+
+	if (atomic_dec_if_positive(&vmdev->avail) < 0)
+		return -ENOSPC;
+
 	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
-	if (pmdev == NULL) {
-		ret = -EBUSY;
-		goto out;
-	}
+	if (!pmdev)
+		return -ENOMEM;
 
 	pmdev->mdev = mdev;
-	pmdev->vdev = vdev;
+	pmdev->vdev = &vmdev->vdev;
 	mdev_set_drvdata(mdev, pmdev);
 	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
 	if (ret) {
 		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
 			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
-		goto out;
+		kfree(pmdev);
+		atomic_inc(&vmdev->avail);
+		return ret;
 	}
 
-out:
-	return ret;
+	return 0;
 }
 
 static int vfio_mdev_pci_remove(struct mdev_device *mdev)
 {
 	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
 
 	kfree(pmdev);
+	atomic_inc(&vmdev->avail);
 	pr_info("%s, succeeded for mdev: %s\n", __func__,
 		     dev_name(mdev_dev(mdev)));
 
@@ -237,24 +236,12 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
 	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
 }
 
-static const struct mdev_parent_ops vfio_mdev_pci_ops = {
-	.supported_type_groups	= vfio_mdev_pci_type_groups,
-	.create			= vfio_mdev_pci_create,
-	.remove			= vfio_mdev_pci_remove,
-
-	.open			= vfio_mdev_pci_open,
-	.release		= vfio_mdev_pci_release,
-
-	.read			= vfio_mdev_pci_read,
-	.write			= vfio_mdev_pci_write,
-	.mmap			= vfio_mdev_pci_mmap,
-	.ioctl			= vfio_mdev_pci_ioctl,
-};
-
 static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 				       const struct pci_device_id *id)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
+	const struct mdev_parent_ops *ops;
 	int ret;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -273,10 +260,38 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev)
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
 		return -ENOMEM;
 
+	vmdev->attr.name = kasprintf(GFP_KERNEL,
+				     "%04x:%04x:%04x:%04x:%06x:%02x",
+				     pdev->vendor, pdev->device,
+				     pdev->subsystem_vendor,
+				     pdev->subsystem_device, pdev->class,
+				     pdev->revision);
+	if (!vmdev->attr.name) {
+		kfree(vmdev);
+		return -ENOMEM;
+	}
+
+	atomic_set(&vmdev->avail, 1);
+
+	vmdev->attr.attrs = vfio_mdev_pci_types_attrs;
+	vmdev->groups[0] = &vmdev->attr;
+
+	vmdev->ops.supported_type_groups = vmdev->groups;
+	vmdev->ops.create = vfio_mdev_pci_create;
+	vmdev->ops.remove = vfio_mdev_pci_remove;
+	vmdev->ops.open	= vfio_mdev_pci_open;
+	vmdev->ops.release = vfio_mdev_pci_release;
+	vmdev->ops.read = vfio_mdev_pci_read;
+	vmdev->ops.write = vfio_mdev_pci_write;
+	vmdev->ops.mmap = vfio_mdev_pci_mmap;
+	vmdev->ops.ioctl = vfio_mdev_pci_ioctl;
+	ops = &vmdev->ops;
+
+	vdev = &vmdev->vdev;
 	vdev->pdev = pdev;
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	mutex_init(&vdev->igate);
@@ -289,7 +304,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 #endif
 	vdev->disable_idle_d3 = disable_idle_d3;
 
-	pci_set_drvdata(pdev, vdev);
+	pci_set_drvdata(pdev, vmdev);
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret) {
@@ -320,7 +335,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	ret = mdev_register_device(&pdev->dev, ops);
 	if (ret)
 		pr_err("Cannot register mdev for device %s\n",
 			dev_name(&pdev->dev));
@@ -332,12 +347,17 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 
 static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
 
-	vdev = pci_get_drvdata(pdev);
-	if (!vdev)
+	mdev_unregister_device(&pdev->dev);
+
+	vmdev = pci_get_drvdata(pdev);
+	if (!vmdev)
 		return;
 
+	vdev = &vmdev->vdev;
+
 	vfio_pci_reflck_put(vdev->reflck);
 
 	kfree(vdev->region);
@@ -355,7 +375,8 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
 	}
 
-	kfree(vdev);
+	kfree(vmdev->attr.name);
+	kfree(vmdev);
 }
 
 static struct pci_driver vfio_mdev_pci_driver = {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev()
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (9 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 10/13] samples: refine " Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-26  2:36   ` Alex Williamson
  2019-09-05  7:59 ` [PATCH v2 12/13] vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev Liu Yi L
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch calls vfio_add_group_dev() in probe() so that vfio-mdev-pci
works with non-singleton iommu groups. The user can bind the devices
of a non-singleton iommu group to either the vfio-pci driver or this
sample driver, and the existing passthru policy still applies to such
a group.

This is actually a policy choice. A device driver can make this call
if it wants to be vfio viable, and it then needs to provide the dummy
vfio_device_ops required by the vfio framework. To prevent the user
from opening the device through the iommu backed group fd, the open
callback of the dummy vfio_device_ops returns -ENODEV to fail the
VFIO_GROUP_GET_DEVICE_FD request from userspace.
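
From userspace the visible effect can be sketched as below (for
illustration only; the group number and BDF are made up, and the group
is assumed to be already added to a container with an IOMMU set):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int try_open_wrapped_device(void)
{
	/* hypothetical iommu backed group of the wrapped device */
	int group = open("/dev/vfio/1", O_RDWR);
	int fd;

	if (group < 0)
		return group;

	/* expected to fail with errno == ENODEV due to the dummy open */
	fd = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.1");

	close(group);
	return fd;
}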

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_mdev_pci.c | 91 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
index 09143d3..a61c20d 100644
--- a/drivers/vfio/pci/vfio_mdev_pci.c
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -107,19 +107,27 @@ struct vfio_mdev_pci {
 static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
 {
 	struct device *pdev;
+	struct vfio_device *device;
 	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_mdev_pci *pmdev;
 	int ret;
 
 	pdev = mdev_parent_dev(mdev);
-	vmdev = dev_get_drvdata(pdev);
+	device = vfio_device_get_from_dev(pdev);
+	vmdev = vfio_device_data(device);
 
-	if (atomic_dec_if_positive(&vmdev->avail) < 0)
-		return -ENOSPC;
+	if (atomic_dec_if_positive(&vmdev->avail) < 0) {
+		ret = -ENOSPC;
+		goto out;
+	}
 
+	pr_info("%s, available instance: %d\n",
+			__func__, atomic_read(&vmdev->avail));
 	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
-	if (!pmdev)
-		return -ENOMEM;
+	if (!pmdev) {
+		ret = -ENOMEM;
+		goto out;
+	}
 
 	pmdev->mdev = mdev;
 	pmdev->vdev = &vmdev->vdev;
@@ -130,10 +138,11 @@ static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
 			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
 		kfree(pmdev);
 		atomic_inc(&vmdev->avail);
-		return ret;
 	}
 
-	return 0;
+out:
+	vfio_device_put(device);
+	return ret;
 }
 
 static int vfio_mdev_pci_remove(struct mdev_device *mdev)
@@ -145,6 +154,8 @@ static int vfio_mdev_pci_remove(struct mdev_device *mdev)
 
 	kfree(pmdev);
 	atomic_inc(&vmdev->avail);
+	pr_info("%s, available instance: %d\n",
+			__func__, atomic_read(&vmdev->avail));
 	pr_info("%s, succeeded for mdev: %s\n", __func__,
 		     dev_name(mdev_dev(mdev)));
 
@@ -236,12 +247,65 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
 	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
 }
 
+static int vfio_pci_dummy_open(void *device_data)
+{
+	struct vfio_mdev_pci_device *vmdev =
+		(struct vfio_mdev_pci_device *) device_data;
+	pr_warn("Device %s is not viable for vfio-pci passthru, please follow"
+		" vfio-mdev passthru path as it has been wrapped as mdev!!!\n",
+					dev_name(&vmdev->vdev.pdev->dev));
+	return -ENODEV;
+}
+
+static void vfio_pci_dummy_release(void *device_data)
+{
+}
+
+long vfio_pci_dummy_ioctl(void *device_data,
+		   unsigned int cmd, unsigned long arg)
+{
+	return 0;
+}
+
+ssize_t vfio_pci_dummy_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	return 0;
+}
+
+ssize_t vfio_pci_dummy_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	return 0;
+}
+
+int vfio_pci_dummy_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	return 0;
+}
+
+void vfio_pci_dummy_request(void *device_data, unsigned int count)
+{
+}
+
+static const struct vfio_device_ops vfio_pci_dummy_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_dummy_open,
+	.release	= vfio_pci_dummy_release,
+	.ioctl		= vfio_pci_dummy_ioctl,
+	.read		= vfio_pci_dummy_read,
+	.write		= vfio_pci_dummy_write,
+	.mmap		= vfio_pci_dummy_mmap,
+	.request	= vfio_pci_dummy_request,
+};
+
 static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 				       const struct pci_device_id *id)
 {
 	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
 	const struct mdev_parent_ops *ops;
+	struct iommu_group *group;
 	int ret;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -260,6 +324,10 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
+	group = vfio_iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EINVAL;
+
 	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
 	if (!vmdev)
 		return -ENOMEM;
@@ -304,7 +372,12 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 #endif
 	vdev->disable_idle_d3 = disable_idle_d3;
 
-	pci_set_drvdata(pdev, vmdev);
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_dummy_ops, vmdev);
+	if (ret) {
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vmdev);
+		return ret;
+	}
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret) {
@@ -352,7 +425,7 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 
 	mdev_unregister_device(&pdev->dev);
 
-	vmdev = pci_get_drvdata(pdev);
+	vmdev = vfio_del_group_dev(&pdev->dev);
 	if (!vmdev)
 		return;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 12/13] vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (10 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev() Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-26  2:37   ` Alex Williamson
  2019-09-05  7:59 ` [PATCH v2 13/13] vfio/type1: track iommu backed group attach Liu Yi L
  2019-09-25  9:13 ` [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu, Yi L
  13 siblings, 1 reply; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

This patch uses iommu_attach_group() for the group attach in the case
of wrapping a PF/VF as an mdev, since iommu_attach_device() does not
support attaching non-singleton iommu groups. With this change,
wrapping a PF/VF as an mdev also works on non-singleton iommu groups.
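
The attach path then resolves the iommu group of the physical
iommu_device and attaches that group instead of the device; a minimal
sketch of the pattern (not the patch itself, and it assumes the group
reference is dropped once the attach is done):

static int attach_via_group(struct iommu_domain *domain,
			    struct device *dev)
{
	struct iommu_group *group = iommu_group_get(dev);
	int ret;

	if (!group)
		return -EINVAL;

	/* works for singleton and non-singleton groups alike */
	ret = iommu_attach_group(domain, group);

	/* drop the reference taken by iommu_group_get() */
	iommu_group_put(group);

	return ret;
}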

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 054391f..317430d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1312,13 +1312,20 @@ static int vfio_mdev_attach_domain(struct device *dev, void *data)
 {
 	struct iommu_domain *domain = data;
 	struct device *iommu_device;
+	struct iommu_group *group;
 
 	iommu_device = vfio_mdev_get_iommu_device(dev);
 	if (iommu_device) {
 		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
 			return iommu_aux_attach_device(domain, iommu_device);
-		else
-			return iommu_attach_device(domain, iommu_device);
+		else {
+			group = iommu_group_get(iommu_device);
+			if (!group) {
+				WARN_ON(1);
+				return -EINVAL;
+			}
+			return iommu_attach_group(domain, group);
+		}
 	}
 
 	return -EINVAL;
@@ -1328,13 +1335,20 @@ static int vfio_mdev_detach_domain(struct device *dev, void *data)
 {
 	struct iommu_domain *domain = data;
 	struct device *iommu_device;
+	struct iommu_group *group;
 
 	iommu_device = vfio_mdev_get_iommu_device(dev);
 	if (iommu_device) {
 		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
 			iommu_aux_detach_device(domain, iommu_device);
-		else
-			iommu_detach_device(domain, iommu_device);
+		else {
+			group = iommu_group_get(iommu_device);
+			if (!group) {
+				WARN_ON(1);
+				return -EINVAL;
+			}
+			iommu_detach_group(domain, group);
+		}
 	}
 
 	return 0;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 13/13] vfio/type1: track iommu backed group attach
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (11 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 12/13] vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev Liu Yi L
@ 2019-09-05  7:59 ` Liu Yi L
  2019-09-25  9:13 ` [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu, Yi L
  13 siblings, 0 replies; 24+ messages in thread
From: Liu Yi L @ 2019-09-05  7:59 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: kevin.tian, baolu.lu, yi.l.liu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

With the introduction of iommu aware mdev groups, a user may wrap a
PF/VF as an mdev; such mdevs are called wrapped PF/VF mdevs in the
following statements. If this is applied to a non-singleton iommu
group, there would be multiple domain attaches on one iommu_device
group (i.e. the iommu backed group). The reason is that an mdev group
attach is, in the end, an iommu_device group attach, and the existing
vfio_domain.group_list knows nothing about that. Thus multiple
attaches would happen.

What's more, under the default domain policy, a group attach is
allowed only when the group's current domain is its default domain,
as shown in the code below:

static int __iommu_attach_group(struct iommu_domain *domain, ..)
{
	..
	if (group->default_domain && group->domain != group->default_domain)
		return -EBUSY;
	...
}

So in the above scenario, only the first group attach on the
non-singleton iommu group succeeds; subsequent group attaches fail.
However, this is a perfectly valid usage case when the wrapped PF/VF
mdevs and the other devices are all assigned to a single VM, and we
do not want to break it. In other words, the subsequent group
attaches should return success without ever reaching
__iommu_attach_group().

However, if the user tries to assign the wrapped PF/VF mdevs and the
other devices to different VMs, the subsequent group attaches on a
single iommu_device group should fail. This means the subsequent
group attach should actually call into __iommu_attach_group() and
fail there.

To meet the above requirements, this patch introduces a
vfio_group_object structure to track the group attaches of an
iommu_device group (a.k.a. iommu backed group). Each vfio_domain has
a group_obj_list to record the vfio_group_objects. When a group is an
mdev group, the group_obj_list is searched by its iommu_device group.

	struct vfio_group_object {
		atomic_t		count;
		struct iommu_group	*iommu_group;
		struct vfio_domain	*domain;
		struct list_head	next;
	};

Each successful group attach either creates a new vfio_group_object
or increases the count of an existing instance; details can be found
in vfio_domain_attach_group_object().

A group detach decreases the count accordingly; please check
vfio_domain_detach_group_object().
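
Putting the two together, the intended lifecycle looks roughly like
the sketch below; the names are taken from this patch, while the
allocation-failure handling and detaching only on the last put are
assumptions layered on top of the code that follows:

static int group_obj_attach(struct vfio_domain *domain,
			    struct iommu_group *grp)
{
	struct vfio_group_object *obj;
	int ret;

	obj = find_iommu_group_object(domain, grp);
	if (obj) {
		/* this iommu group is already attached to the domain */
		atomic_inc(&obj->count);
		return 0;
	}

	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
	if (!obj)
		return -ENOMEM;

	vfio_init_iommu_group_object(obj, domain, grp);

	ret = iommu_attach_group(domain->domain, grp);
	if (ret) {
		list_del(&obj->next);
		kfree(obj);
	}
	return ret;
}

static void group_obj_detach(struct vfio_domain *domain,
			     struct iommu_group *grp)
{
	struct vfio_group_object *obj;

	obj = find_iommu_group_object(domain, grp);
	if (!obj)
		return;

	if (atomic_dec_return(&obj->count) == 0) {
		/* last user: detach the hardware group and drop tracking */
		list_del(&obj->next);
		kfree(obj);
		iommu_detach_group(domain->domain, grp);
	}
}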

As the vfio_domain.group_obj_list is scoped to a vfio container
(vfio_iommu), an attempt to pass a non-singleton group through to
multiple VMs fails, since the VMs use separate vfio containers. Also,
if a vIOMMU is exposed, attempts to assign multiple devices of such a
group (via vfio-pci or PF/VF wrapped mdevs) to a single VM fail as
well. This is aligned with the current vfio passthru rules.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 167 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 154 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 317430d..6a67bd6 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -75,6 +75,7 @@ struct vfio_domain {
 	struct iommu_domain	*domain;
 	struct list_head	next;
 	struct list_head	group_list;
+	struct list_head	group_obj_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
 };
@@ -97,6 +98,13 @@ struct vfio_group {
 	bool			mdev_group;	/* An mdev group */
 };
 
+struct vfio_group_object {
+	atomic_t		count;
+	struct iommu_group	*iommu_group;
+	struct vfio_domain	*domain;
+	struct list_head	next;
+};
+
 /*
  * Guest RAM pinning working set or DMA target
  */
@@ -1263,6 +1271,85 @@ static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
 	return NULL;
 }
 
+static struct vfio_group_object *find_iommu_group_object(
+		struct vfio_domain *domain, struct iommu_group *iommu_group)
+{
+	struct vfio_group_object *g;
+
+	list_for_each_entry(g, &domain->group_obj_list, next) {
+		if (g->iommu_group == iommu_group)
+			return g;
+	}
+
+	return NULL;
+}
+
+static void vfio_init_iommu_group_object(struct vfio_group_object *group_obj,
+		struct vfio_domain *domain, struct iommu_group *iommu_group)
+{
+	if (!group_obj || !domain || !iommu_group) {
+		WARN_ON(1);
+		return;
+	}
+	atomic_set(&group_obj->count, 1);
+	group_obj->iommu_group = iommu_group;
+	group_obj->domain = domain;
+	list_add(&group_obj->next, &domain->group_obj_list);
+}
+
+static int vfio_domain_attach_group_object(
+		struct vfio_domain *domain, struct iommu_group *iommu_group)
+{
+	struct vfio_group_object *group_obj;
+
+	group_obj = find_iommu_group_object(domain, iommu_group);
+	if (group_obj) {
+		atomic_inc(&group_obj->count);
+		return 0;
+	}
+	group_obj = kzalloc(sizeof(*group_obj), GFP_KERNEL);
+	vfio_init_iommu_group_object(group_obj, domain, iommu_group);
+	return iommu_attach_group(domain->domain, iommu_group);
+}
+
+static int vfio_domain_detach_group_object(
+		struct vfio_domain *domain, struct iommu_group *iommu_group)
+{
+	struct vfio_group_object *group_obj;
+
+	group_obj = find_iommu_group_object(domain, iommu_group);
+	if (!group_obj) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+	if (atomic_dec_if_positive(&group_obj->count) == 0) {
+		list_del(&group_obj->next);
+		kfree(group_obj);
+	}
+	iommu_detach_group(domain->domain, iommu_group);
+	return 0;
+}
+
+/*
+ * Check if an iommu backed group has been attached to a domain within
+ * a specific container (vfio_iommu). If yes, return the vfio_group_object
+ * which tracks the previous domain attach for this group. Caller of this
+ * function should hold vfio_iommu->lock.
+ */
+static struct vfio_group_object *vfio_iommu_group_object_check(
+		struct vfio_iommu *iommu, struct iommu_group *iommu_group)
+{
+	struct vfio_domain *d;
+	struct vfio_group_object *group_obj;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		group_obj = find_iommu_group_object(d, iommu_group);
+		if (group_obj)
+			return group_obj;
+	}
+	return NULL;
+}
+
 static bool vfio_iommu_has_sw_msi(struct iommu_group *group, phys_addr_t *base)
 {
 	struct list_head group_resv_regions;
@@ -1310,21 +1397,23 @@ static struct device *vfio_mdev_get_iommu_device(struct device *dev)
 
 static int vfio_mdev_attach_domain(struct device *dev, void *data)
 {
-	struct iommu_domain *domain = data;
+	struct vfio_domain *domain = data;
 	struct device *iommu_device;
 	struct iommu_group *group;
 
 	iommu_device = vfio_mdev_get_iommu_device(dev);
 	if (iommu_device) {
 		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
-			return iommu_aux_attach_device(domain, iommu_device);
+			return iommu_aux_attach_device(domain->domain,
+							iommu_device);
 		else {
 			group = iommu_group_get(iommu_device);
 			if (!group) {
 				WARN_ON(1);
 				return -EINVAL;
 			}
-			return iommu_attach_group(domain, group);
+			return vfio_domain_attach_group_object(
+							domain, group);
 		}
 	}
 
@@ -1333,21 +1422,22 @@ static int vfio_mdev_attach_domain(struct device *dev, void *data)
 
 static int vfio_mdev_detach_domain(struct device *dev, void *data)
 {
-	struct iommu_domain *domain = data;
+	struct vfio_domain *domain = data;
 	struct device *iommu_device;
 	struct iommu_group *group;
 
 	iommu_device = vfio_mdev_get_iommu_device(dev);
 	if (iommu_device) {
 		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
-			iommu_aux_detach_device(domain, iommu_device);
+			iommu_aux_detach_device(domain->domain, iommu_device);
 		else {
 			group = iommu_group_get(iommu_device);
 			if (!group) {
 				WARN_ON(1);
 				return -EINVAL;
 			}
-			iommu_detach_group(domain, group);
+			return vfio_domain_detach_group_object(
+							domain, group);
 		}
 	}
 
@@ -1359,20 +1449,27 @@ static int vfio_iommu_attach_group(struct vfio_domain *domain,
 {
 	if (group->mdev_group)
 		return iommu_group_for_each_dev(group->iommu_group,
-						domain->domain,
+						domain,
 						vfio_mdev_attach_domain);
 	else
-		return iommu_attach_group(domain->domain, group->iommu_group);
+		return vfio_domain_attach_group_object(domain,
+							group->iommu_group);
 }
 
 static void vfio_iommu_detach_group(struct vfio_domain *domain,
 				    struct vfio_group *group)
 {
+	int ret;
+
 	if (group->mdev_group)
-		iommu_group_for_each_dev(group->iommu_group, domain->domain,
+		iommu_group_for_each_dev(group->iommu_group, domain,
 					 vfio_mdev_detach_domain);
-	else
-		iommu_detach_group(domain->domain, group->iommu_group);
+	else {
+		ret = vfio_domain_detach_group_object(
+						domain, group->iommu_group);
+		if (ret)
+			pr_warn("%s, detach failed!! ret: %d", __func__, ret);
+	}
 }
 
 static bool vfio_bus_is_mdev(struct bus_type *bus)
@@ -1412,6 +1509,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	int ret;
 	bool resv_msi, msi_remap;
 	phys_addr_t resv_msi_base;
+	struct vfio_group_object *group_obj = NULL;
+	struct device *iommu_device = NULL;
+	struct iommu_group *iommu_device_group;
+
 
 	mutex_lock(&iommu->lock);
 
@@ -1438,14 +1539,20 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	group->iommu_group = iommu_group;
 
+	group_obj = vfio_iommu_group_object_check(iommu, group->iommu_group);
+	if (group_obj) {
+		atomic_inc(&group_obj->count);
+		list_add(&group->next, &group_obj->domain->group_list);
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
 	/* Determine bus_type in order to allocate a domain */
 	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
 	if (ret)
 		goto out_free;
 
 	if (vfio_bus_is_mdev(bus)) {
-		struct device *iommu_device = NULL;
-
 		group->mdev_group = true;
 
 		/* Determine the isolation type */
@@ -1469,6 +1576,39 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		bus = iommu_device->bus;
 	}
 
+	/*
+	 * Check if iommu backed group attached to a domain within current
+	 * container. If yes, increase the count; If no, go ahead with a
+	 * new domain attach process.
+	 */
+	group_obj = NULL;
+	if (iommu_device) {
+		iommu_device_group = iommu_group_get(iommu_device);
+		if (!iommu_device_group) {
+			WARN_ON(1);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+		group_obj = vfio_iommu_group_object_check(iommu,
+							iommu_device_group);
+	} else
+		group_obj = vfio_iommu_group_object_check(iommu,
+							group->iommu_group);
+
+	if (group_obj) {
+		atomic_inc(&group_obj->count);
+		list_add(&group->next, &group_obj->domain->group_list);
+		kfree(domain);
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
+	/*
+	 * Now we are sure we want to initialize a new vfio_domain.
+	 * First step is to alloc an iommu_domain from iommu abstract
+	 * layer.
+	 */
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -1484,6 +1624,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 			goto out_domain;
 	}
 
+	INIT_LIST_HEAD(&domain->group_obj_list);
 	ret = vfio_iommu_attach_group(domain, group);
 	if (ret)
 		goto out_domain;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device
  2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (12 preceding siblings ...)
  2019-09-05  7:59 ` [PATCH v2 13/13] vfio/type1: track iommu backed group attach Liu Yi L
@ 2019-09-25  9:13 ` Liu, Yi L
  13 siblings, 0 replies; 24+ messages in thread
From: Liu, Yi L @ 2019-09-25  9:13 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel, kvm, Zhao,
	Yan Y, He, Shaopeng, Xia, Chenbo, Tian, Jun J

Hi Alex,

Any comments on it? :-) With this version, the vfio-mdev-pci driver
can work with non-singleton groups, and it also works together with
vfio-pci.

Regards,
Yi Liu

> From: Liu, Yi L
> Sent: Thursday, September 5, 2019 3:59 PM
> To: alex.williamson@redhat.com; kwankhede@nvidia.com
> Subject: [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device
> 
> This patchset aims to add a vfio-pci-like meta driver as a demo
> user of the vfio changes introduced in "vfio/mdev: IOMMU aware
> mediated device" patchset from Baolu Lu. Besides the test purpose,
> per Alex's comments, it could also be a good base driver for
> experimenting with device specific mdev migration.
> 
> Specific interface tested in this proposal:
>  *) int mdev_set_iommu_device(struct device *dev,
>  				struct device *iommu_device)
>     introduced in the patch as below:
>     "[PATCH v5 6/8] vfio/mdev: Add iommu related member in mdev_device"
> 
> Patch Overview:
>  *) patch 1 ~ 7: code refactor for existing vfio-pci module
>                  move the common codes from vfio_pci.c to
>                  vfio_pci_common.c
>  *) patch 8: add protection to perm_bits alloc/free
>  *) patch 9: add vfio-mdev-pci sample driver
>  *) patch 10: refine the sample driver
>  *) patch 11 - 13: make the sample driver work for non-singleton groups
>                    also work for vfio-pci and vfio-mdev-pci mixed usage
>                    includes vfio-mdev-pci driver change and vfio_iommu_type1
>                    changes.
> 
> Links:
>  *) Link of "vfio/mdev: IOMMU aware mediated device"
>          https://lwn.net/Articles/780522/
>  *) Previous versions:
>          RFC v1: https://lkml.org/lkml/2019/3/4/529
>          RFC v2: https://lkml.org/lkml/2019/3/13/113
>          RFC v3: https://lkml.org/lkml/2019/4/24/495
>          Patch v1: https://www.spinics.net/lists/kvm/msg188952.html
>  *) may try it with the codes in below repo
>     current version is branch "v5.3-rc7-pci-mdev":
>          https://github.com/luxis1999/vfio-mdev-pci-sample-driver.git
> 
> Test done on two NICs which share an iommu group.
> 
> -------------------------------------------------------------------
>          |  NIC0           |  NIC1      | vIOMMU  | VMs  | Passtrhu
> -------------------------------------------------------------------
>   Test#0 |  vfio-pci       |  vfio-pci  | no      |  1   | pass
> -------------------------------------------------------------------
>   Test#1 |  vfio-pci       |  vfio-pci  | no      |  2   | fail[1]
> -------------------------------------------------------------------
>   Test#2 |  vfio-pci       |  vfio-pci  | yes     |  1   | fail[2]
> -------------------------------------------------------------------
>   Test#3 |  vfio-pci-mdev  |  vfio-pci  | no      |  1   | pass
> -------------------------------------------------------------------
>   Test#4 |  vfio-pci-mdev  |  vfio-pci  | no      |  2   | fail[3]
> -------------------------------------------------------------------
>   Test#5 |  vfio-pci-mdev  |  vfio-pci  | yes     |  1   | fail[4]
> -------------------------------------------------------------------
> Tips:
> [1] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6:
>      vfio 0000:01:00.1: failed to open /dev/vfio/1: Device or resource busy
> [2] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6:
>      vfio 0000:01:00.1: group 1 used in multiple address spaces
> [3] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6:
>      vfio 0000:01:00.1: failed to setup container for group 1: Failed to set
>      iommu for container: Device or resource busy
> [4] qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,addr=0x6:
>      vfio 0000:01:00.1: failed to setup container for group 1: Failed to set
>      iommu for container: Device or resource busy
>     Or
>     qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/
>      83b8f4f2-509f-382f-3c1e-e6bfe0fa1003: vfio 83b8f4f2-509f-382f-3c1e-
>      e6bfe0fa1003: failed to setup container for group 11: Failed to set iommu
>      for container: Device or resource busy
> Some other tests are not listed, e.g. binding NIC0 to vfio-pci-mdev and trying
> to pass it through with "vfio-pci,host=01:00.0"; the kernel will throw a warning
> and fail the operation.
> 
> Please feel free to give your comments.
> 
> Thanks,
> Yi Liu
> 
> Change log:
>   patch v1 -> patch v2:
>   - the sample driver implementation refined
>   - the sample driver can work on non-singleton iommu groups
>   - the sample driver can work with vfio-pci, devices from a non-singleton
>     group can either be bound to vfio-mdev-pci or vfio-pci, and the
>     assignment of this group still follows current vfio assignment rule.
> 
>   RFC v3 -> patch v1:
>   - split the patchset from 3 patches to 9 patches to better demonstrate
>     the changes step by step
> 
>   v2->v3:
>   - use vfio-mdev-pci instead of vfio-pci-mdev
>   - place the new driver under drivers/vfio/pci while define
>     Kconfig in samples/Kconfig to clarify it is a sample driver
> 
>   v1->v2:
>   - instead of adding kernel option to existing vfio-pci
>     module in v1, v2 follows Alex's suggestion to add a
>     separate vfio-pci-mdev module.
>   - new patchset subject: "vfio/pci: wrap pci device as a mediated device"
> 
> Alex Williamson (1):
>   samples: refine vfio-mdev-pci driver
> 
> Liu Yi L (12):
>   vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header
>   vfio_pci: refine user config reference in vfio-pci module
>   vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
>   vfio_pci: make common functions be extern
>   vfio_pci: duplicate vfio_pci.c
>   vfio_pci: shrink vfio_pci_common.c
>   vfio_pci: shrink vfio_pci.c
>   vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op
>   samples: add vfio-mdev-pci driver
>   samples/vfio-mdev-pci: call vfio_add_group_dev()
>   vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev
>   vfio/type1: track iommu backed group attach
> 
>  drivers/vfio/pci/Makefile           |    9 +-
>  drivers/vfio/pci/vfio_mdev_pci.c    |  497 ++++++++++++
>  drivers/vfio/pci/vfio_pci.c         | 1449 +---------------------------------
>  drivers/vfio/pci/vfio_pci_common.c  | 1455
> +++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_config.c  |    9 +
>  drivers/vfio/pci/vfio_pci_private.h |   36 +
>  drivers/vfio/vfio_iommu_type1.c     |  185 ++++-
>  samples/Kconfig                     |   11 +
>  8 files changed, 2194 insertions(+), 1457 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_mdev_pci.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_common.c
> 
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module
  2019-09-05  7:59 ` [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
@ 2019-09-26  2:36   ` Alex Williamson
  2019-09-30 12:38     ` Liu, Yi L
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2019-09-26  2:36 UTC (permalink / raw)
  To: Liu Yi L
  Cc: kwankhede, kevin.tian, baolu.lu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

On Thu,  5 Sep 2019 15:59:19 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds three fields in struct vfio_pci_device to pass the user
> configs of vfio-pci module to some functions which could be common in
> future usage.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         | 24 +++++++++++++++---------
>  drivers/vfio/pci/vfio_pci_private.h |  9 +++++++--
>  2 files changed, 22 insertions(+), 11 deletions(-)

A subtle behavioral difference here is that disable_idle_d3 and
nointxmask are runtime modifiable parameters, if the value is changed
in sysfs the device will adopt the new behavior on its next
transition.  After this patch, each device operates in the mode defined
at the time it was probed.  Should we maybe refresh the value at key
points, like the user opening or releasing the device so that it tracks
the module parameter?  I think we could defend not changing the
behavior of a device while it's in use by a user.  Otherwise we might
want to make the module parameter read-only to avoid the
inconsistency.
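
A minimal sketch of that refresh idea, assuming the cached fields from this
patch and a hypothetical helper called from the open path (e.g. just before
vfio_pci_enable()) while no other user holds the device:

	/* Illustrative only: re-sample the writable module parameters */
	static void vfio_pci_refresh_params(struct vfio_pci_device *vdev)
	{
		vdev->nointxmask = nointxmask;
		vdev->disable_idle_d3 = disable_idle_d3;
	#ifdef CONFIG_VFIO_PCI_VGA
		vdev->disable_vga = disable_vga;
	#endif
	}

With that, an idle device tracks the sysfs value again, while a device that
is already opened keeps the behavior it was opened with.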

> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 38271df..fed2687 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -69,7 +69,8 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
>  	unsigned char max_busnr;
>  	unsigned int decodes;
>  
> -	if (single_vga || !vfio_vga_disabled() || pci_is_root_bus(pdev->bus))
> +	if (single_vga || !vfio_vga_disabled(vdev) ||
> +		pci_is_root_bus(pdev->bus))
>  		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
>  		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
>  
> @@ -273,7 +274,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  	if (!vdev->pci_saved_state)
>  		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
>  
> -	if (likely(!nointxmask)) {
> +	if (likely(!vdev->nointxmask)) {
>  		if (vfio_pci_nointx(pdev)) {
>  			pci_info(pdev, "Masking broken INTx support\n");
>  			vdev->nointx = true;
> @@ -310,7 +311,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  	} else
>  		vdev->msix_bar = 0xFF;
>  
> -	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
> +	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
>  		vdev->has_vga = true;
>  
>  
> @@ -436,7 +437,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>  
>  	vfio_pci_try_bus_reset(vdev);
>  
> -	if (!disable_idle_d3)
> +	if (!vdev->disable_idle_d3)
>  		vfio_pci_set_power_state(vdev, PCI_D3hot);
>  }
>  
> @@ -1304,6 +1305,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	spin_lock_init(&vdev->irqlock);
>  	mutex_init(&vdev->ioeventfds_lock);
>  	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> +	vdev->nointxmask = nointxmask;
> +#ifdef CONFIG_VFIO_PCI_VGA
> +	vdev->disable_vga = disable_vga;
> +#endif
> +	vdev->disable_idle_d3 = disable_idle_d3;
>  
>  	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
>  	if (ret) {
> @@ -1328,7 +1334,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  
>  	vfio_pci_probe_power_state(vdev);
>  
> -	if (!disable_idle_d3) {
> +	if (!vdev->disable_idle_d3) {
>  		/*
>  		 * pci-core sets the device power state to an unknown value at
>  		 * bootup and after being removed from a driver.  The only
> @@ -1359,7 +1365,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>  	kfree(vdev->region);
>  	mutex_destroy(&vdev->ioeventfds_lock);
>  
> -	if (!disable_idle_d3)
> +	if (!vdev->disable_idle_d3)
>  		vfio_pci_set_power_state(vdev, PCI_D0);
>  
>  	kfree(vdev->pm_save);
> @@ -1594,7 +1600,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
>  		if (!ret) {
>  			tmp->needs_reset = false;
>  
> -			if (tmp != vdev && !disable_idle_d3)
> +			if (tmp != vdev && !tmp->disable_idle_d3)
>  				vfio_pci_set_power_state(tmp, PCI_D3hot);
>  		}
>  
> @@ -1610,7 +1616,7 @@ static void __exit vfio_pci_cleanup(void)
>  	vfio_pci_uninit_perm_bits();
>  }
>  
> -static void __init vfio_pci_fill_ids(void)
> +static void __init vfio_pci_fill_ids(char *ids)
>  {
>  	char *p, *id;
>  	int rc;
> @@ -1665,7 +1671,7 @@ static int __init vfio_pci_init(void)
>  	if (ret)
>  		goto out_driver;
>  
> -	vfio_pci_fill_ids();
> +	vfio_pci_fill_ids(&ids[0]);

Or just 'ids'.  Thanks,

Alex

>  
>  	return 0;
>  
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index f12d92c..68521d2 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -122,6 +122,11 @@ struct vfio_pci_device {
>  	struct list_head	dummy_resources_list;
>  	struct mutex		ioeventfds_lock;
>  	struct list_head	ioeventfds_list;
> +	bool			nointxmask;
> +#ifdef CONFIG_VFIO_PCI_VGA
> +	bool			disable_vga;
> +#endif
> +	bool			disable_idle_d3;
>  };
>  
>  #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> @@ -135,10 +140,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> -static inline bool vfio_vga_disabled(void)
> +static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
>  {
>  #ifdef CONFIG_VFIO_PCI_VGA
> -	return disable_vga;
> +	return vdev->disable_vga;
>  #else
>  	return true;
>  #endif


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op
  2019-09-05  7:59 ` [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op Liu Yi L
@ 2019-09-26  2:36   ` Alex Williamson
  2019-09-30 12:38     ` Liu, Yi L
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2019-09-26  2:36 UTC (permalink / raw)
  To: Liu Yi L
  Cc: kwankhede, kevin.tian, baolu.lu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

On Thu,  5 Sep 2019 15:59:25 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> There is a case in which cap_perms and ecap_perms can be reallocated
> by different modules. e.g. the vfio-mdev-pci sample driver. To secure
> the initialization of cap_perms and ecap_perms, this patch adds an
> atomic variable to track the user of cap/ecap_perms bits. First caller
> of vfio_pci_init_perm_bits() will initialize the bits. While the last
> caller of vfio_pci_uninit_perm_bits() will free the bits.

Yes, but it still allows races; we're not really protecting the data.
If driver A begins freeing the shared data in the uninit path, driver B
could start allocating shared data in the init path and we're left with
either use after free issues or memory leaks.  Probably better to hold
a semaphore around the allocation/free and a non-atomic for reference
counting.  Thanks,

Alex
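
For illustration, the suggested scheme might look roughly like this (the lock,
counter, and *_do_* helper names are made up for this sketch, not taken from
the series; the existing allocation/free bodies are assumed to be factored
into those helpers):

	static DEFINE_MUTEX(vfio_pci_perm_lock);
	static int vfio_pci_perm_users;

	int vfio_pci_init_perm_bits(void)
	{
		int ret = 0;

		mutex_lock(&vfio_pci_perm_lock);
		if (vfio_pci_perm_users++ == 0)
			ret = vfio_pci_do_init_perm_bits(); /* existing alloc path */
		if (ret)
			vfio_pci_perm_users--;
		mutex_unlock(&vfio_pci_perm_lock);

		return ret;
	}

	void vfio_pci_uninit_perm_bits(void)
	{
		mutex_lock(&vfio_pci_perm_lock);
		if (--vfio_pci_perm_users == 0)
			vfio_pci_do_uninit_perm_bits(); /* existing free path */
		mutex_unlock(&vfio_pci_perm_lock);
	}

Allocation and free can then never overlap, and the shared data lives exactly
as long as at least one module holds a reference.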
 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index f0891bd..1b3e6e5 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -992,11 +992,17 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
>  	return 0;
>  }
>  
> +/* Track the user number of the cap/ecap perm_bits */
> +atomic_t vfio_pci_perm_bits_users = ATOMIC_INIT(0);
> +
>  /*
>   * Initialize the shared permission tables
>   */
>  void vfio_pci_uninit_perm_bits(void)
>  {
> +	if (atomic_dec_return(&vfio_pci_perm_bits_users))
> +		return;
> +
>  	free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
>  
>  	free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
> @@ -1013,6 +1019,9 @@ int __init vfio_pci_init_perm_bits(void)
>  {
>  	int ret;
>  
> +	if (atomic_inc_return(&vfio_pci_perm_bits_users) != 1)
> +		return 0;
> +
>  	/* Basic config space */
>  	ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
>  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 10/13] samples: refine vfio-mdev-pci driver
  2019-09-05  7:59 ` [PATCH v2 10/13] samples: refine " Liu Yi L
@ 2019-09-26  2:36   ` Alex Williamson
  2019-09-30 12:39     ` Liu, Yi L
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2019-09-26  2:36 UTC (permalink / raw)
  To: Liu Yi L
  Cc: kwankhede, kevin.tian, baolu.lu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

On Thu,  5 Sep 2019 15:59:27 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> From: Alex Williamson <alex.williamson@redhat.com>
> 
> This patch refines the implementation of original vfio-mdev-pci driver.
> 
> And the vfio-mdev-pci-type_name will be named per the following rule:
> 
> 	vmdev->attr.name = kasprintf(GFP_KERNEL,
> 				     "%04x:%04x:%04x:%04x:%06x:%02x",
> 				     pdev->vendor, pdev->device,
> 				     pdev->subsystem_vendor,
> 				     pdev->subsystem_device, pdev->class,
> 				     pdev->revision);
> 
> Before usage, check the /sys/bus/pci/devices/$bdf/mdev_supported_types/
> to ensure the final mdev_supported_types.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_mdev_pci.c | 123 +++++++++++++++++++++++----------------
>  1 file changed, 72 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
> index 07c8067..09143d3 100644
> --- a/drivers/vfio/pci/vfio_mdev_pci.c
> +++ b/drivers/vfio/pci/vfio_mdev_pci.c
> @@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
>  
>  static struct pci_driver vfio_mdev_pci_driver;
>  
> -static ssize_t
> -name_show(struct kobject *kobj, struct device *dev, char *buf)
> -{
> -	return sprintf(buf, "%s-type1\n", dev_name(dev));
> -}
> -
> -MDEV_TYPE_ATTR_RO(name);
> +struct vfio_mdev_pci_device {
> +	struct vfio_pci_device vdev;
> +	struct mdev_parent_ops ops;
> +	struct attribute_group *groups[2];
> +	struct attribute_group attr;
> +	atomic_t avail;
> +};
>  
>  static ssize_t
>  available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
>  {
> -	return sprintf(buf, "%d\n", 1);
> +	struct vfio_mdev_pci_device *vmdev;
> +
> +	vmdev = pci_get_drvdata(to_pci_dev(dev));
> +
> +	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
>  }
>  
>  MDEV_TYPE_ATTR_RO(available_instances);
> @@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
>  MDEV_TYPE_ATTR_RO(device_api);
>  
>  static struct attribute *vfio_mdev_pci_types_attrs[] = {
> -	&mdev_type_attr_name.attr,
>  	&mdev_type_attr_device_api.attr,
>  	&mdev_type_attr_available_instances.attr,
>  	NULL,
>  };
>  
> -static struct attribute_group vfio_mdev_pci_type_group1 = {
> -	.name  = "type1",
> -	.attrs = vfio_mdev_pci_types_attrs,
> -};
> -
> -struct attribute_group *vfio_mdev_pci_type_groups[] = {
> -	&vfio_mdev_pci_type_group1,
> -	NULL,
> -};
> -
>  struct vfio_mdev_pci {
>  	struct vfio_pci_device *vdev;
>  	struct mdev_device *mdev;
> -	unsigned long handle;
>  };
>  
>  static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
>  {
>  	struct device *pdev;
> -	struct vfio_pci_device *vdev;
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_mdev_pci *pmdev;
>  	int ret;
>  
>  	pdev = mdev_parent_dev(mdev);
> -	vdev = dev_get_drvdata(pdev);
> +	vmdev = dev_get_drvdata(pdev);
> +
> +	if (atomic_dec_if_positive(&vmdev->avail) < 0)
> +		return -ENOSPC;
> +
>  	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> -	if (pmdev == NULL) {
> -		ret = -EBUSY;
> -		goto out;
> -	}
> +	if (!pmdev)
> +		return -ENOMEM;

Needs an atomic_inc(&vmdev->avail) in this error path.  Thanks,

Alex
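
The fixed error path would then look roughly like this (sketch only):

	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
	if (!pmdev) {
		atomic_inc(&vmdev->avail); /* give the instance back */
		return -ENOMEM;
	}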

>  
>  	pmdev->mdev = mdev;
> -	pmdev->vdev = vdev;
> +	pmdev->vdev = &vmdev->vdev;
>  	mdev_set_drvdata(mdev, pmdev);
>  	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
>  	if (ret) {
>  		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
>  			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> -		goto out;
> +		kfree(pmdev);
> +		atomic_inc(&vmdev->avail);
> +		return ret;
>  	}
>  
> -out:
> -	return ret;
> +	return 0;
>  }
>  
>  static int vfio_mdev_pci_remove(struct mdev_device *mdev)
>  {
>  	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +	struct vfio_mdev_pci_device *vmdev;
> +
> +	vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
>  
>  	kfree(pmdev);
> +	atomic_inc(&vmdev->avail);
>  	pr_info("%s, succeeded for mdev: %s\n", __func__,
>  		     dev_name(mdev_dev(mdev)));
>  
> @@ -237,24 +236,12 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
>  	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
>  }
>  
> -static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> -	.supported_type_groups	= vfio_mdev_pci_type_groups,
> -	.create			= vfio_mdev_pci_create,
> -	.remove			= vfio_mdev_pci_remove,
> -
> -	.open			= vfio_mdev_pci_open,
> -	.release		= vfio_mdev_pci_release,
> -
> -	.read			= vfio_mdev_pci_read,
> -	.write			= vfio_mdev_pci_write,
> -	.mmap			= vfio_mdev_pci_mmap,
> -	.ioctl			= vfio_mdev_pci_ioctl,
> -};
> -
>  static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  				       const struct pci_device_id *id)
>  {
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_pci_device *vdev;
> +	const struct mdev_parent_ops *ops;
>  	int ret;
>  
>  	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> @@ -273,10 +260,38 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  		return -EBUSY;
>  	}
>  
> -	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> -	if (!vdev)
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (!vmdev)
>  		return -ENOMEM;
>  
> +	vmdev->attr.name = kasprintf(GFP_KERNEL,
> +				     "%04x:%04x:%04x:%04x:%06x:%02x",
> +				     pdev->vendor, pdev->device,
> +				     pdev->subsystem_vendor,
> +				     pdev->subsystem_device, pdev->class,
> +				     pdev->revision);
> +	if (!vmdev->attr.name) {
> +		kfree(vmdev);
> +		return -ENOMEM;
> +	}
> +
> +	atomic_set(&vmdev->avail, 1);
> +
> +	vmdev->attr.attrs = vfio_mdev_pci_types_attrs;
> +	vmdev->groups[0] = &vmdev->attr;
> +
> +	vmdev->ops.supported_type_groups = vmdev->groups;
> +	vmdev->ops.create = vfio_mdev_pci_create;
> +	vmdev->ops.remove = vfio_mdev_pci_remove;
> +	vmdev->ops.open	= vfio_mdev_pci_open;
> +	vmdev->ops.release = vfio_mdev_pci_release;
> +	vmdev->ops.read = vfio_mdev_pci_read;
> +	vmdev->ops.write = vfio_mdev_pci_write;
> +	vmdev->ops.mmap = vfio_mdev_pci_mmap;
> +	vmdev->ops.ioctl = vfio_mdev_pci_ioctl;
> +	ops = &vmdev->ops;
> +
> +	vdev = &vmdev->vdev;
>  	vdev->pdev = pdev;
>  	vdev->irq_type = VFIO_PCI_NUM_IRQS;
>  	mutex_init(&vdev->igate);
> @@ -289,7 +304,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  #endif
>  	vdev->disable_idle_d3 = disable_idle_d3;
>  
> -	pci_set_drvdata(pdev, vdev);
> +	pci_set_drvdata(pdev, vmdev);
>  
>  	ret = vfio_pci_reflck_attach(vdev);
>  	if (ret) {
> @@ -320,7 +335,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  		vfio_pci_set_power_state(vdev, PCI_D3hot);
>  	}
>  
> -	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
> +	ret = mdev_register_device(&pdev->dev, ops);
>  	if (ret)
>  		pr_err("Cannot register mdev for device %s\n",
>  			dev_name(&pdev->dev));
> @@ -332,12 +347,17 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  
>  static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
>  {
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_pci_device *vdev;
>  
> -	vdev = pci_get_drvdata(pdev);
> -	if (!vdev)
> +	mdev_unregister_device(&pdev->dev);
> +
> +	vmdev = pci_get_drvdata(pdev);
> +	if (!vmdev)
>  		return;
>  
> +	vdev = &vmdev->vdev;
> +
>  	vfio_pci_reflck_put(vdev->reflck);
>  
>  	kfree(vdev->region);
> @@ -355,7 +375,8 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
>  				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
>  	}
>  
> -	kfree(vdev);
> +	kfree(vmdev->attr.name);
> +	kfree(vmdev);
>  }
>  
>  static struct pci_driver vfio_mdev_pci_driver = {


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev()
  2019-09-05  7:59 ` [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev() Liu Yi L
@ 2019-09-26  2:36   ` Alex Williamson
  2019-09-30 12:40     ` Liu, Yi L
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2019-09-26  2:36 UTC (permalink / raw)
  To: Liu Yi L
  Cc: kwankhede, kevin.tian, baolu.lu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

On Thu,  5 Sep 2019 15:59:28 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds vfio_add_group_dev() calling in probe() to make
> vfio-mdev-pci work well with non-singleton iommu group. User could
> bind devices from a non-singleton iommu group to either vfio-pci
> driver or this sample driver. Existing passthru policy works well
> for this non-singleton group.
> 
> This is actually a policy choice. A device driver can make this call
> if it wants to be vfio viable. And it needs to provide dummy
> vfio_device_ops which is required by vfio framework. To prevent user
> from opening the device from the iommu backed group fd, the open
> callback of the dummy vfio_device_ops should return -ENODEV to fail
> the VFIO_GET_DEVICE_FD request from userspace.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_mdev_pci.c | 91 ++++++++++++++++++++++++++++++++++++----
>  1 file changed, 82 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
> index 09143d3..a61c20d 100644
> --- a/drivers/vfio/pci/vfio_mdev_pci.c
> +++ b/drivers/vfio/pci/vfio_mdev_pci.c
> @@ -107,19 +107,27 @@ struct vfio_mdev_pci {
>  static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
>  {
>  	struct device *pdev;
> +	struct vfio_device *device;
>  	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_mdev_pci *pmdev;
>  	int ret;
>  
>  	pdev = mdev_parent_dev(mdev);
> -	vmdev = dev_get_drvdata(pdev);
> +	device = vfio_device_get_from_dev(pdev);
> +	vmdev = vfio_device_data(device);
>  
> -	if (atomic_dec_if_positive(&vmdev->avail) < 0)
> -		return -ENOSPC;
> +	if (atomic_dec_if_positive(&vmdev->avail) < 0) {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
>  
> +	pr_info("%s, available instance: %d\n",
> +			__func__, atomic_read(&vmdev->avail));
>  	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> -	if (!pmdev)
> -		return -ENOMEM;
> +	if (!pmdev) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
>  
>  	pmdev->mdev = mdev;
>  	pmdev->vdev = &vmdev->vdev;
> @@ -130,10 +138,11 @@ static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
>  			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
>  		kfree(pmdev);
>  		atomic_inc(&vmdev->avail);
> -		return ret;
>  	}
>  
> -	return 0;
> +out:
> +	vfio_device_put(device);
> +	return ret;
>  }
>  
>  static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> @@ -145,6 +154,8 @@ static int vfio_mdev_pci_remove(struct mdev_device *mdev)
>  
>  	kfree(pmdev);
>  	atomic_inc(&vmdev->avail);
> +	pr_info("%s, available instance: %d\n",
> +			__func__, atomic_read(&vmdev->avail));
>  	pr_info("%s, succeeded for mdev: %s\n", __func__,
>  		     dev_name(mdev_dev(mdev)));
>  
> @@ -236,12 +247,65 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
>  	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
>  }
>  
> +static int vfio_pci_dummy_open(void *device_data)
> +{
> +	struct vfio_mdev_pci_device *vmdev =
> +		(struct vfio_mdev_pci_device *) device_data;
> +	pr_warn("Device %s is not viable for vfio-pci passthru, please follow"
> +		" vfio-mdev passthru path as it has been wrapped as mdev!!!\n",
> +					dev_name(&vmdev->vdev.pdev->dev));
> +	return -ENODEV;
> +}
> +
> +static void vfio_pci_dummy_release(void *device_data)
> +{
> +}

Theoretically .release will never be called.  If we're paranoid, we
could keep it with a pr_warn.

> +
> +long vfio_pci_dummy_ioctl(void *device_data,
> +		   unsigned int cmd, unsigned long arg)
> +{
> +	return 0;
> +}
> +
> +ssize_t vfio_pci_dummy_read(void *device_data, char __user *buf,
> +			     size_t count, loff_t *ppos)
> +{
> +	return 0;
> +}
> +
> +ssize_t vfio_pci_dummy_write(void *device_data, const char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	return 0;
> +}
> +
> +int vfio_pci_dummy_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
> +
> +void vfio_pci_dummy_request(void *device_data, unsigned int count)
> +{
> +}

AFAICT, none of .ioctl, .read, .write, .mmap, or .request need to be
provided, only .open and only .release for paranoia.

> +
> +static const struct vfio_device_ops vfio_pci_dummy_ops = {
> +	.name		= "vfio-pci",

This is impersonating vfio-pci, shouldn't we use something like
"vfio-mdev-pci-dummy".  Thanks,

Alex
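
Taken together, the dummy ops could shrink to something like the following
sketch, with only .open (plus a paranoid .release) implemented and a name
that does not impersonate vfio-pci:

	static int vfio_mdev_pci_dummy_open(void *device_data)
	{
		/* The device must be used via its mdev, not the group fd */
		return -ENODEV;
	}

	static void vfio_mdev_pci_dummy_release(void *device_data)
	{
		/* Should never be reached since open always fails */
		pr_warn("%s called unexpectedly\n", __func__);
	}

	static const struct vfio_device_ops vfio_mdev_pci_dummy_ops = {
		.name		= "vfio-mdev-pci-dummy",
		.open		= vfio_mdev_pci_dummy_open,
		.release	= vfio_mdev_pci_dummy_release,
	};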

> +	.open		= vfio_pci_dummy_open,
> +	.release	= vfio_pci_dummy_release,
> +	.ioctl		= vfio_pci_dummy_ioctl,
> +	.read		= vfio_pci_dummy_read,
> +	.write		= vfio_pci_dummy_write,
> +	.mmap		= vfio_pci_dummy_mmap,
> +	.request	= vfio_pci_dummy_request,
> +};
> +
>  static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  				       const struct pci_device_id *id)
>  {
>  	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_pci_device *vdev;
>  	const struct mdev_parent_ops *ops;
> +	struct iommu_group *group;
>  	int ret;
>  
>  	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> @@ -260,6 +324,10 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  		return -EBUSY;
>  	}
>  
> +	group = vfio_iommu_group_get(&pdev->dev);
> +	if (!group)
> +		return -EINVAL;
> +
>  	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
>  	if (!vmdev)
>  		return -ENOMEM;
> @@ -304,7 +372,12 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  #endif
>  	vdev->disable_idle_d3 = disable_idle_d3;
>  
> -	pci_set_drvdata(pdev, vmdev);
> +	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_dummy_ops, vmdev);
> +	if (ret) {
> +		vfio_iommu_group_put(group, &pdev->dev);
> +		kfree(vmdev);
> +		return ret;
> +	}
>  
>  	ret = vfio_pci_reflck_attach(vdev);
>  	if (ret) {
> @@ -352,7 +425,7 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
>  
>  	mdev_unregister_device(&pdev->dev);
>  
> -	vmdev = pci_get_drvdata(pdev);
> +	vmdev = vfio_del_group_dev(&pdev->dev);
>  	if (!vmdev)
>  		return;
>  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 12/13] vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev
  2019-09-05  7:59 ` [PATCH v2 12/13] vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev Liu Yi L
@ 2019-09-26  2:37   ` Alex Williamson
  0 siblings, 0 replies; 24+ messages in thread
From: Alex Williamson @ 2019-09-26  2:37 UTC (permalink / raw)
  To: Liu Yi L
  Cc: kwankhede, kevin.tian, baolu.lu, yi.y.sun, joro, linux-kernel,
	kvm, yan.y.zhao, shaopeng.he, chenbo.xia, jun.j.tian

On Thu,  5 Sep 2019 15:59:29 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch uses iommu_attach_group() to do group attach when it is
> for the case of wrapping a PF/VF as a mdev. iommu_attach_device()
> doesn't support non-singleton iommu group attach. With this change,
> wrapping PF/VF as mdev can work on non-singleton iommu groups.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 054391f..317430d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1312,13 +1312,20 @@ static int vfio_mdev_attach_domain(struct device *dev, void *data)
>  {
>  	struct iommu_domain *domain = data;
>  	struct device *iommu_device;
> +	struct iommu_group *group;
>  
>  	iommu_device = vfio_mdev_get_iommu_device(dev);
>  	if (iommu_device) {
>  		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
>  			return iommu_aux_attach_device(domain, iommu_device);
> -		else
> -			return iommu_attach_device(domain, iommu_device);
> +		else {
> +			group = iommu_group_get(iommu_device);
> +			if (!group) {
> +				WARN_ON(1);

What's the value of the WARN_ON here and below?

iommu_group_get() increments the kobject reference, looks like it's
leaked.  Thanks,

Alex
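
A leak-free variant of the attach helper might look like this (sketch only;
the detach side would mirror it):

	static int vfio_mdev_attach_domain(struct device *dev, void *data)
	{
		struct iommu_domain *domain = data;
		struct device *iommu_device;
		struct iommu_group *group;
		int ret;

		iommu_device = vfio_mdev_get_iommu_device(dev);
		if (!iommu_device)
			return -EINVAL;

		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
			return iommu_aux_attach_device(domain, iommu_device);

		group = iommu_group_get(iommu_device);
		if (!group)
			return -EINVAL;

		ret = iommu_attach_group(domain, group);
		iommu_group_put(group); /* drop the reference from iommu_group_get() */

		return ret;
	}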

> +				return -EINVAL;
> +			}
> +			return iommu_attach_group(domain, group);
> +		}
>  	}
>  
>  	return -EINVAL;
> @@ -1328,13 +1335,20 @@ static int vfio_mdev_detach_domain(struct device *dev, void *data)
>  {
>  	struct iommu_domain *domain = data;
>  	struct device *iommu_device;
> +	struct iommu_group *group;
>  
>  	iommu_device = vfio_mdev_get_iommu_device(dev);
>  	if (iommu_device) {
>  		if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
>  			iommu_aux_detach_device(domain, iommu_device);
> -		else
> -			iommu_detach_device(domain, iommu_device);
> +		else {
> +			group = iommu_group_get(iommu_device);
> +			if (!group) {
> +				WARN_ON(1);
> +				return -EINVAL;
> +			}
> +			iommu_detach_group(domain, group);
> +		}
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module
  2019-09-26  2:36   ` Alex Williamson
@ 2019-09-30 12:38     ` Liu, Yi L
  0 siblings, 0 replies; 24+ messages in thread
From: Liu, Yi L @ 2019-09-30 12:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Zhao, Yan Y, He, Shaopeng, Xia, Chenbo, Tian, Jun J

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 26, 2019 10:36 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci
> module
> 
> On Thu,  5 Sep 2019 15:59:19 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds three fields in struct vfio_pci_device to pass the user
> > configs of vfio-pci module to some functions which could be common in
> > future usage.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c         | 24 +++++++++++++++---------
> >  drivers/vfio/pci/vfio_pci_private.h |  9 +++++++--
> >  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> A subtle behavioral difference here is that disable_idle_d3 and
> nointxmask are runtime modifiable parameters, if the value is changed
> in sysfs the device will adopt the new behavior on its next
> transition.  After this patch, each device operates in the mode defined
> at the time it was probed.  Should we maybe refresh the value at key
> points, like the user opening or releasing the device so that it tracks
> the module parameter?  I think we could defend not changing the
> behavior of a device while it's in use by a user.  Otherwise we might
> want to make the module parameter read-only to avoid the
> inconsistency.

Agreed. I think we can take the assumption that the parameters are not
allowed to change during an open/release cycle. Let me add the updates
in the next version.

> 
> >
[...]
> > +	vfio_pci_fill_ids(&ids[0]);
> 
> Or just 'ids'.  Thanks,

yes, let me fix it.

> Alex
> 

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op
  2019-09-26  2:36   ` Alex Williamson
@ 2019-09-30 12:38     ` Liu, Yi L
  0 siblings, 0 replies; 24+ messages in thread
From: Liu, Yi L @ 2019-09-30 12:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Zhao, Yan Y, He, Shaopeng, Xia, Chenbo, Tian, Jun J

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 26, 2019 10:36 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with
> atomic op
> 
> On Thu,  5 Sep 2019 15:59:25 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > There is a case in which cap_perms and ecap_perms can be reallocated
> > by different modules. e.g. the vfio-mdev-pci sample driver. To secure
> > the initialization of cap_perms and ecap_perms, this patch adds an
> > atomic variable to track the user of cap/ecap_perms bits. First caller
> > of vfio_pci_init_perm_bits() will initialize the bits. While the last
> > caller of vfio_pci_uninit_perm_bits() will free the bits.
> 
> Yes, but it still allows races; we're not really protecting the data.
> If driver A begins freeing the shared data in the uninit path, driver B could start
> allocating shared data in the init path and we're left with either use after free issues
> or memory leaks.  Probably better to hold a semaphore around the allocation/free
> and a non-atomic for reference counting.  Thanks,

That's true. We just want to have only one copy of the bits. As long as the
race is under control, it is acceptable. Let me make this change. Thanks.

> Alex
> 
Regards,
Yi Liu

> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_config.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c
> > b/drivers/vfio/pci/vfio_pci_config.c
> > index f0891bd..1b3e6e5 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -992,11 +992,17 @@ static int __init init_pci_ext_cap_pwr_perm(struct
> perm_bits *perm)
> >  	return 0;
> >  }
> >
> > +/* Track the user number of the cap/ecap perm_bits */ atomic_t
> > +vfio_pci_perm_bits_users = ATOMIC_INIT(0);
> > +
> >  /*
> >   * Initialize the shared permission tables
> >   */
> >  void vfio_pci_uninit_perm_bits(void)
> >  {
> > +	if (atomic_dec_return(&vfio_pci_perm_bits_users))
> > +		return;
> > +
> >  	free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
> >
> >  	free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
> > @@ -1013,6 +1019,9 @@ int __init vfio_pci_init_perm_bits(void)  {
> >  	int ret;
> >
> > +	if (atomic_inc_return(&vfio_pci_perm_bits_users) != 1)
> > +		return 0;
> > +
> >  	/* Basic config space */
> >  	ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
> >


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 10/13] samples: refine vfio-mdev-pci driver
  2019-09-26  2:36   ` Alex Williamson
@ 2019-09-30 12:39     ` Liu, Yi L
  0 siblings, 0 replies; 24+ messages in thread
From: Liu, Yi L @ 2019-09-30 12:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Zhao, Yan Y, He, Shaopeng, Xia, Chenbo, Tian, Jun J

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 26, 2019 10:37 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
>
> Subject: Re: [PATCH v2 10/13] samples: refine vfio-mdev-pci driver
> 
> On Thu,  5 Sep 2019 15:59:27 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> >
> > This patch refines the implementation of original vfio-mdev-pci driver.
> >
> > And the vfio-mdev-pci-type_name will be named per the following rule:
> >
> > 	vmdev->attr.name = kasprintf(GFP_KERNEL,
> > 				     "%04x:%04x:%04x:%04x:%06x:%02x",
> > 				     pdev->vendor, pdev->device,
> > 				     pdev->subsystem_vendor,
> > 				     pdev->subsystem_device, pdev->class,
> > 				     pdev->revision);
> >
> > Before usage, check the
> > /sys/bus/pci/devices/$bdf/mdev_supported_types/
> > to ensure the final mdev_supported_types.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_mdev_pci.c | 123
> > +++++++++++++++++++++++----------------
> >  1 file changed, 72 insertions(+), 51 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_mdev_pci.c
> > b/drivers/vfio/pci/vfio_mdev_pci.c
> > index 07c8067..09143d3 100644
> > --- a/drivers/vfio/pci/vfio_mdev_pci.c
> > +++ b/drivers/vfio/pci/vfio_mdev_pci.c
> > @@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
> >
> >  static struct pci_driver vfio_mdev_pci_driver;
> >
> > -static ssize_t
> > -name_show(struct kobject *kobj, struct device *dev, char *buf) -{
> > -	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > -}
> > -
> > -MDEV_TYPE_ATTR_RO(name);
> > +struct vfio_mdev_pci_device {
> > +	struct vfio_pci_device vdev;
> > +	struct mdev_parent_ops ops;
> > +	struct attribute_group *groups[2];
> > +	struct attribute_group attr;
> > +	atomic_t avail;
> > +};
> >
> >  static ssize_t
> >  available_instances_show(struct kobject *kobj, struct device *dev,
> > char *buf)  {
> > -	return sprintf(buf, "%d\n", 1);
> > +	struct vfio_mdev_pci_device *vmdev;
> > +
> > +	vmdev = pci_get_drvdata(to_pci_dev(dev));
> > +
> > +	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
> >  }
> >
> >  MDEV_TYPE_ATTR_RO(available_instances);
> > @@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject
> > *kobj, struct device *dev,  MDEV_TYPE_ATTR_RO(device_api);
> >
> >  static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > -	&mdev_type_attr_name.attr,
> >  	&mdev_type_attr_device_api.attr,
> >  	&mdev_type_attr_available_instances.attr,
> >  	NULL,
> >  };
> >
> > -static struct attribute_group vfio_mdev_pci_type_group1 = {
> > -	.name  = "type1",
> > -	.attrs = vfio_mdev_pci_types_attrs,
> > -};
> > -
> > -struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > -	&vfio_mdev_pci_type_group1,
> > -	NULL,
> > -};
> > -
> >  struct vfio_mdev_pci {
> >  	struct vfio_pci_device *vdev;
> >  	struct mdev_device *mdev;
> > -	unsigned long handle;
> >  };
> >
> >  static int vfio_mdev_pci_create(struct kobject *kobj, struct
> > mdev_device *mdev)  {
> >  	struct device *pdev;
> > -	struct vfio_pci_device *vdev;
> > +	struct vfio_mdev_pci_device *vmdev;
> >  	struct vfio_mdev_pci *pmdev;
> >  	int ret;
> >
> >  	pdev = mdev_parent_dev(mdev);
> > -	vdev = dev_get_drvdata(pdev);
> > +	vmdev = dev_get_drvdata(pdev);
> > +
> > +	if (atomic_dec_if_positive(&vmdev->avail) < 0)
> > +		return -ENOSPC;
> > +
> >  	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > -	if (pmdev == NULL) {
> > -		ret = -EBUSY;
> > -		goto out;
> > -	}
> > +	if (!pmdev)
> > +		return -ENOMEM;
> 
> Needs an atomic_inc(&vmdev->avail) in this error path.  Thanks,

Oops, yes it is.

> 
> Alex

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev()
  2019-09-26  2:36   ` Alex Williamson
@ 2019-09-30 12:40     ` Liu, Yi L
  0 siblings, 0 replies; 24+ messages in thread
From: Liu, Yi L @ 2019-09-30 12:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, Tian, Kevin, baolu.lu, Sun, Yi Y, joro, linux-kernel,
	kvm, Zhao, Yan Y, He, Shaopeng, Xia, Chenbo, Tian, Jun J

Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 26, 2019 10:37 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev()
> 
> On Thu,  5 Sep 2019 15:59:28 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds vfio_add_group_dev() calling in probe() to make
> > vfio-mdev-pci work well with non-singleton iommu group. User could
> > bind devices from a non-singleton iommu group to either vfio-pci
> > driver or this sample driver. Existing passthru policy works well for
> > this non-singleton group.
> >
> > This is actually a policy choice. A device driver can make this call
> > if it wants to be vfio viable. And it needs to provide dummy
> > vfio_device_ops which is required by vfio framework. To prevent user
> > from opening the device from the iommu backed group fd, the open
> > callback of the dummy vfio_device_ops should return -ENODEV to fail
> > the VFIO_GET_DEVICE_FD request from userspace.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_mdev_pci.c | 91
> > ++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 82 insertions(+), 9 deletions(-)
> >

[...]

> > +static int vfio_pci_dummy_open(void *device_data) {
> > +	struct vfio_mdev_pci_device *vmdev =
> > +		(struct vfio_mdev_pci_device *) device_data;
> > +	pr_warn("Device %s is not viable for vfio-pci passthru, please follow"
> > +		" vfio-mdev passthru path as it has been wrapped as mdev!!!\n",
> > +					dev_name(&vmdev->vdev.pdev->dev));
> > +	return -ENODEV;
> > +}
> > +
> > +static void vfio_pci_dummy_release(void *device_data) { }
> 
> Theoretically .release will never be called.  If we're paranoid, we could keep it with a
> pr_warn.

yes, it is.

> > +
> > +long vfio_pci_dummy_ioctl(void *device_data,
> > +		   unsigned int cmd, unsigned long arg) {
> > +	return 0;
> > +}
> > +
> > +ssize_t vfio_pci_dummy_read(void *device_data, char __user *buf,
> > +			     size_t count, loff_t *ppos)
> > +{
> > +	return 0;
> > +}
> > +
> > +ssize_t vfio_pci_dummy_write(void *device_data, const char __user *buf,
> > +			      size_t count, loff_t *ppos)
> > +{
> > +	return 0;
> > +}
> > +
> > +int vfio_pci_dummy_mmap(void *device_data, struct vm_area_struct
> > +*vma) {
> > +	return 0;
> > +}
> > +
> > +void vfio_pci_dummy_request(void *device_data, unsigned int count) {
> > +}
> 
> AFAICT, none of .ioctl, .read, .write, .mmap, or .request need to be provided,
> only .open and only .release for paranoia.

sure. let me fix it.

> > +
> > +static const struct vfio_device_ops vfio_pci_dummy_ops = {
> > +	.name		= "vfio-pci",
> 
> This is impersonating vfio-pci, shouldn't we use something like "vfio-mdev-pci-
> dummy".  Thanks,

Yep. will modify it.
 
> Alex

Thanks,
Yi Liu


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-09-30 12:40 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2019-09-05  7:59 [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu Yi L
2019-09-05  7:59 ` [PATCH v2 01/13] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header Liu Yi L
2019-09-05  7:59 ` [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
2019-09-26  2:36   ` Alex Williamson
2019-09-30 12:38     ` Liu, Yi L
2019-09-05  7:59 ` [PATCH v2 03/13] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
2019-09-05  7:59 ` [PATCH v2 04/13] vfio_pci: make common functions be extern Liu Yi L
2019-09-05  7:59 ` [PATCH v2 05/13] vfio_pci: duplicate vfio_pci.c Liu Yi L
2019-09-05  7:59 ` [PATCH v2 06/13] vfio_pci: shrink vfio_pci_common.c Liu Yi L
2019-09-05  7:59 ` [PATCH v2 07/13] vfio_pci: shrink vfio_pci.c Liu Yi L
2019-09-05  7:59 ` [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op Liu Yi L
2019-09-26  2:36   ` Alex Williamson
2019-09-30 12:38     ` Liu, Yi L
2019-09-05  7:59 ` [PATCH v2 09/13] samples: add vfio-mdev-pci driver Liu Yi L
2019-09-05  7:59 ` [PATCH v2 10/13] samples: refine " Liu Yi L
2019-09-26  2:36   ` Alex Williamson
2019-09-30 12:39     ` Liu, Yi L
2019-09-05  7:59 ` [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev() Liu Yi L
2019-09-26  2:36   ` Alex Williamson
2019-09-30 12:40     ` Liu, Yi L
2019-09-05  7:59 ` [PATCH v2 12/13] vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev Liu Yi L
2019-09-26  2:37   ` Alex Williamson
2019-09-05  7:59 ` [PATCH v2 13/13] vfio/type1: track iommu backed group attach Liu Yi L
2019-09-25  9:13 ` [PATCH v2 00/13] vfio_pci: wrap pci device as a mediated device Liu, Yi L
