* [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device
@ 2020-01-07 12:01 Liu Yi L
  2020-01-07 12:01 ` [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
                   ` (11 more replies)
  0 siblings, 12 replies; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu

This patchset adds a vfio-pci-like meta driver as a demo user of the
vfio changes introduced in the "vfio/mdev: IOMMU aware mediated device"
patchset from Baolu Lu. Besides serving as a test vehicle, it could
also, per Alex's comments, be a good base driver for experimenting
with device-specific mdev migration.

Interface exercised by this series:
 *) int mdev_set_iommu_device(struct device *dev,
 				struct device *iommu_device)
    introduced in the following patch:
    "[PATCH v5 6/8] vfio/mdev: Add iommu related member in mdev_device"

Patch Overview:
 *) patch 1 ~ 9: code refactoring of the existing vfio-pci module,
                 moving the common code from vfio_pci.c to
                 vfio_pci_common.c
 *) patch 10: build vfio-pci-common.ko
 *) patch 11: add initial vfio-mdev-pci sample driver
 *) patch 12: refine the sample driver

Links:
 *) Link of "vfio/mdev: IOMMU aware mediated device"
         https://lwn.net/Articles/780522/
 *) Previous versions:
         Patch v3: https://lkml.org/lkml/2019/11/22/1558
         Patch v2: https://lkml.org/lkml/2019/9/6/115
         Patch v1: https://www.spinics.net/lists/kvm/msg188952.html
         RFC v3: https://lkml.org/lkml/2019/4/24/495
         RFC v2: https://lkml.org/lkml/2019/3/13/113
         RFC v1: https://lkml.org/lkml/2019/3/4/529
 *) the code can be tried from the repo and branch below:
    https://github.com/luxis1999/vfio-mdev-pci-sample-driver.git : v5.5-rc5-pci-mdev

Please feel free to give your comments.

Thanks,
Yi Liu

Change log:
  patch v3 -> patch v4:
  - switched the sequence of
    "vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header"
    and
    "vfio_pci: refine user config reference in vfio-pci module".
  - refined "vfio_pci: refine vfio_pci_driver reference in vfio_pci.c"
    per Alex's comments.
  - split vfio_pci_private.h into two files.
  - built vfio_pci_common.c as vfio-pci-common.ko so the code can be
    shared outside of drivers/vfio/pci/.
  - moved the vfio-mdev-pci driver under samples/.
  - dropped "vfio/pci: protect cap/ecap_perm bits alloc/free" as the new
    version builds vfio_pci_common.c as a kernel module.

  patch v2 -> patch v3:
  - refresh the disable_idle_d3, disable_vga and nointxmask config
    according to user config in device open.
  - add a semaphore around the vfio-pci cap/ecap perm bits allocation/free
  - drop the non-singleton iommu group support to keep it simple as it's
    a sample driver for now.

  patch v1 -> patch v2:
  - refined the sample driver implementation
  - the sample driver can work on non-singleton iommu groups
  - the sample driver can coexist with vfio-pci: devices from a
    non-singleton group can be bound to either vfio-mdev-pci or vfio-pci,
    and assignment of the group still follows the current vfio
    assignment rules.

  RFC v3 -> patch v1:
  - split the patchset from 3 patches to 9 patches to better demonstrate
    the changes step by step

  rfc v2->v3:
  - use vfio-mdev-pci instead of vfio-pci-mdev
  - place the new driver under drivers/vfio/pci while defining its
    Kconfig in samples/Kconfig to make clear it is a sample driver

  rfc v1->v2:
  - instead of adding kernel option to existing vfio-pci
    module in v1, v2 follows Alex's suggestion to add a
    separate vfio-pci-mdev module.
  - new patchset subject: "vfio/pci: wrap pci device as a mediated device"

Alex Williamson (1):
  samples: refine vfio-mdev-pci driver

Liu Yi L (11):
  vfio_pci: refine user config reference in vfio-pci module
  vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file
  vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  vfio_pci: make common functions be extern
  vfio_pci: duplicate vfio_pci.c
  vfio_pci: shrink vfio_pci_common.c
  vfio_pci: shrink vfio_pci.c
  vfio_pci: duplicate vfio_pci_private.h to include/linux
  vfio: split vfio_pci_private.h into two files
  vfio: build vfio_pci_common.c into a kernel module
  samples: add vfio-mdev-pci driver

 drivers/vfio/pci/Kconfig              |    9 +-
 drivers/vfio/pci/Makefile             |   10 +-
 drivers/vfio/pci/vfio_pci.c           | 1477 +-------------------------------
 drivers/vfio/pci/vfio_pci_common.c    | 1512 +++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h   |   94 +-
 include/linux/vfio_pci_common.h       |  154 ++++
 samples/Kconfig                       |   10 +
 samples/Makefile                      |    1 +
 samples/vfio-mdev-pci/Makefile        |    4 +
 samples/vfio-mdev-pci/vfio_mdev_pci.c |  420 +++++++++
 10 files changed, 2128 insertions(+), 1563 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_common.c
 create mode 100644 include/linux/vfio_pci_common.h
 create mode 100644 samples/vfio-mdev-pci/Makefile
 create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c

-- 
2.7.4



* [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-09 22:48   ` Alex Williamson
  2020-01-07 12:01 ` [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file Liu Yi L
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch adds three fields to struct vfio_pci_device to pass the user
configuration of the vfio-pci.ko module to functions that are intended
to become common in later patches. The values stored in struct
vfio_pci_device are initialized in probe and refreshed at device open,
so that runtime modifications to the writable parameters
(disable_idle_d3 and nointxmask) take effect.
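A minimal userspace sketch of the idea, not kernel code: the two globals model the writable module parameters, and struct vfio_pci_device is trimmed to the new fields. model_probe(), model_open() and vdev_may_enter_d3() are hypothetical names for illustration only; vfio_pci_refresh_config() has the same shape as the helper added in the diff.

```c
#include <stdbool.h>

/*
 * Userspace model only: these globals stand in for the writable
 * nointxmask and disable_idle_d3 parameters of vfio-pci.ko, which an
 * administrator may change at runtime through sysfs.
 */
static bool nointxmask;
static bool disable_idle_d3;

/* Hypothetical stand-in for struct vfio_pci_device, trimmed to the new fields. */
struct vfio_pci_device {
	bool nointxmask;
	bool disable_idle_d3;
};

/* Same shape as the vfio_pci_refresh_config() helper added by this patch. */
static void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
				    bool nointxmask, bool disable_idle_d3)
{
	vdev->nointxmask = nointxmask;
	vdev->disable_idle_d3 = disable_idle_d3;
}

/* Shareable code consults only the per-device copy, never the globals. */
static bool vdev_may_enter_d3(const struct vfio_pci_device *vdev)
{
	return !vdev->disable_idle_d3;
}

/* Hypothetical hooks marking where the patch snapshots the parameters. */
static void model_probe(struct vfio_pci_device *vdev)
{
	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
}

static void model_open(struct vfio_pci_device *vdev)
{
	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
}
```

With this layout a runtime change to disable_idle_d3 is picked up at the next open rather than immediately, and the shared code never needs to see the vfio-pci.ko globals.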

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 37 ++++++++++++++++++++++++++-----------
 drivers/vfio/pci/vfio_pci_private.h |  8 ++++++++
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 379a02c..af507c2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,10 +54,10 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-static inline bool vfio_vga_disabled(void)
+static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
-	return disable_vga;
+	return vdev->disable_vga;
 #else
 	return true;
 #endif
@@ -78,7 +78,8 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 	unsigned char max_busnr;
 	unsigned int decodes;
 
-	if (single_vga || !vfio_vga_disabled() || pci_is_root_bus(pdev->bus))
+	if (single_vga || !vfio_vga_disabled(vdev) ||
+		pci_is_root_bus(pdev->bus))
 		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
 		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
 
@@ -289,7 +290,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	if (!vdev->pci_saved_state)
 		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
 
-	if (likely(!nointxmask)) {
+	if (likely(!vdev->nointxmask)) {
 		if (vfio_pci_nointx(pdev)) {
 			pci_info(pdev, "Masking broken INTx support\n");
 			vdev->nointx = true;
@@ -326,7 +327,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	} else
 		vdev->msix_bar = 0xFF;
 
-	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
+	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
 
 
@@ -462,10 +463,17 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 
 	vfio_pci_try_bus_reset(vdev);
 
-	if (!disable_idle_d3)
+	if (!vdev->disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
+void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
+			bool nointxmask, bool disable_idle_d3)
+{
+	vdev->nointxmask = nointxmask;
+	vdev->disable_idle_d3 = disable_idle_d3;
+}
+
 static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -490,6 +498,8 @@ static int vfio_pci_open(void *device_data)
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
 
+	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
+
 	mutex_lock(&vdev->reflck->lock);
 
 	if (!vdev->refcnt) {
@@ -1330,6 +1340,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	spin_lock_init(&vdev->irqlock);
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret) {
@@ -1354,7 +1369,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	vfio_pci_probe_power_state(vdev);
 
-	if (!disable_idle_d3) {
+	if (!vdev->disable_idle_d3) {
 		/*
 		 * pci-core sets the device power state to an unknown value at
 		 * bootup and after being removed from a driver.  The only
@@ -1385,7 +1400,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	kfree(vdev->region);
 	mutex_destroy(&vdev->ioeventfds_lock);
 
-	if (!disable_idle_d3)
+	if (!vdev->disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D0);
 
 	kfree(vdev->pm_save);
@@ -1620,7 +1635,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 		if (!ret) {
 			tmp->needs_reset = false;
 
-			if (tmp != vdev && !disable_idle_d3)
+			if (tmp != vdev && !tmp->disable_idle_d3)
 				vfio_pci_set_power_state(tmp, PCI_D3hot);
 		}
 
@@ -1636,7 +1651,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(void)
+static void __init vfio_pci_fill_ids(char *ids)
 {
 	char *p, *id;
 	int rc;
@@ -1691,7 +1706,7 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		goto out_driver;
 
-	vfio_pci_fill_ids();
+	vfio_pci_fill_ids(ids);
 
 	return 0;
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a2c760..0398608 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -122,6 +122,11 @@ struct vfio_pci_device {
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
+	bool			nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	bool			disable_vga;
+#endif
+	bool			disable_idle_d3;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -130,6 +135,9 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
+				bool nointxmask, bool disable_idle_d3);
+
 extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
 
-- 
2.7.4



* [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
  2020-01-07 12:01 ` [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-15 10:43   ` Cornelia Huck
  2020-01-07 12:01 ` [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch moves two inline functions to vfio_pci_private.h so that they
can be shared across source files. This also avoids the following
compile error once the code is split:

"error: inlining failed in call to always_inline ‘vfio_pci_is_vga’:
function body not available".
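For illustration, the two helpers can be modelled in a self-contained file. struct pci_dev and struct vfio_pci_device below are hypothetical minimal stand-ins, and PCI_CLASS_DISPLAY_VGA uses its value from include/linux/pci_ids.h. Because the body lives in the header as static inline, every .c file that includes it sees the definition, which is what the call site in the error above was missing.

```c
#include <stdbool.h>

/* Value of PCI_CLASS_DISPLAY_VGA in include/linux/pci_ids.h. */
#define PCI_CLASS_DISPLAY_VGA 0x0300

/* Hypothetical minimal stand-ins for the kernel structures. */
struct pci_dev {
	unsigned int class;	/* 24-bit class code: base class, sub-class, prog-if */
};

struct vfio_pci_device {
	struct pci_dev *pdev;
#ifdef CONFIG_VFIO_PCI_VGA
	bool disable_vga;
#endif
};

/* Header-style definition: the body is visible to every includer. */
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
{
	/* Drop the prog-if byte, then compare base class + sub-class. */
	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}

static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
{
#ifdef CONFIG_VFIO_PCI_VGA
	return vdev->disable_vga;
#else
	(void)vdev;	/* VGA support compiled out: always "disabled" */
	return true;
#endif
}
```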

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 14 --------------
 drivers/vfio/pci/vfio_pci_private.h | 14 ++++++++++++++
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index af507c2..009d2df 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,15 +54,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
-{
-#ifdef CONFIG_VFIO_PCI_VGA
-	return vdev->disable_vga;
-#else
-	return true;
-#endif
-}
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -103,11 +94,6 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 	return decodes;
 }
 
-static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
-{
-	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-}
-
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
 	struct resource *res;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 0398608..9263021 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -135,6 +135,20 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
+{
+	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
+}
+
+static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
+{
+#ifdef CONFIG_VFIO_PCI_VGA
+	return vdev->disable_vga;
+#else
+	return true;
+#endif
+}
+
 extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
 				bool nointxmask, bool disable_idle_d3);
 
-- 
2.7.4



* [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
  2020-01-07 12:01 ` [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
  2020-01-07 12:01 ` [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-09 22:48   ` Alex Williamson
  2020-01-07 12:01 ` [PATCH v4 04/12] vfio_pci: make common functions be extern Liu Yi L
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch replaces the vfio_pci_driver references in vfio_pci.c with
pci_dev_driver(vdev->pdev), so that the functions no longer depend on
which module's pci_driver the device is bound to.
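A userspace sketch of the check, assuming simplified stand-ins for struct pci_driver and struct pci_dev; model_pci_dev_driver() and same_driver_as() are hypothetical names. The point is that the reference driver is taken from the device being operated on rather than from a fixed &vfio_pci_driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct pci_driver {
	const char *name;
};

struct pci_dev {
	struct pci_driver *driver;	/* NULL when no driver is bound */
};

/* Simplified model of pci_dev_driver(): the driver the device is bound to. */
static struct pci_driver *model_pci_dev_driver(const struct pci_dev *pdev)
{
	return pdev->driver;
}

/*
 * Old check: driver == &vfio_pci_driver, which only works inside
 * vfio-pci.ko. New check: compare against the driver of the reference
 * device (assumed bound), which works for whichever module calls it.
 */
static bool same_driver_as(const struct pci_dev *pdev,
			   const struct pci_dev *ref)
{
	return model_pci_dev_driver(pdev) == model_pci_dev_driver(ref);
}
```

In the patch the reference device reaches the callbacks through the data pointer (vdev for vfio_pci_reflck_find(), devs->vdev for vfio_pci_get_unused_devs()), which is why struct vfio_devices gains a vdev member.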

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 009d2df..9140f5e5 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1463,24 +1463,25 @@ static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
 
 static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 {
-	struct vfio_pci_reflck **preflck = data;
+	struct vfio_pci_device *vdev = data;
+	struct vfio_pci_reflck **preflck = &vdev->reflck;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_device *tmp;
 
 	device = vfio_device_get_from_dev(&pdev->dev);
 	if (!device)
 		return 0;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
 		vfio_device_put(device);
 		return 0;
 	}
 
-	vdev = vfio_device_data(device);
+	tmp = vfio_device_data(device);
 
-	if (vdev->reflck) {
-		vfio_pci_reflck_get(vdev->reflck);
-		*preflck = vdev->reflck;
+	if (tmp->reflck) {
+		vfio_pci_reflck_get(tmp->reflck);
+		*preflck = tmp->reflck;
 		vfio_device_put(device);
 		return 1;
 	}
@@ -1497,7 +1498,7 @@ static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 
 	if (pci_is_root_bus(vdev->pdev->bus) ||
 	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
-					  &vdev->reflck, slot) <= 0)
+					  vdev, slot) <= 0)
 		vdev->reflck = vfio_pci_reflck_alloc();
 
 	mutex_unlock(&reflck_lock);
@@ -1522,6 +1523,7 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 
 struct vfio_devices {
 	struct vfio_device **devices;
+	struct vfio_pci_device *vdev;
 	int cur_index;
 	int max_index;
 };
@@ -1530,7 +1532,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_device *tmp;
 
 	if (devs->cur_index == devs->max_index)
 		return -ENOSPC;
@@ -1539,15 +1541,15 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	if (!device)
 		return -EINVAL;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
-	vdev = vfio_device_data(device);
+	tmp = vfio_device_data(device);
 
 	/* Fault if the device is not unused */
-	if (vdev->refcnt) {
+	if (tmp->refcnt) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
@@ -1574,7 +1576,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
  */
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 {
-	struct vfio_devices devs = { .cur_index = 0 };
+	struct vfio_devices devs = { .vdev = vdev, .cur_index = 0 };
 	int i = 0, ret = -EINVAL;
 	bool slot = false;
 	struct vfio_pci_device *tmp;
@@ -1637,7 +1639,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(char *ids)
+static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
 	int rc;
@@ -1665,7 +1667,7 @@ static void __init vfio_pci_fill_ids(char *ids)
 			continue;
 		}
 
-		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
+		rc = pci_add_dynid(driver, vendor, device,
 				   subvendor, subdevice, class, class_mask, 0);
 		if (rc)
 			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
@@ -1692,7 +1694,7 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		goto out_driver;
 
-	vfio_pci_fill_ids(ids);
+	vfio_pci_fill_ids(ids, &vfio_pci_driver);
 
 	return 0;
 
-- 
2.7.4



* [PATCH v4 04/12] vfio_pci: make common functions be extern
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (2 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-15 10:56   ` Cornelia Huck
  2020-01-07 12:01 ` [PATCH v4 05/12] vfio_pci: duplicate vfio_pci.c Liu Yi L
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch makes the common (module-agnostic) functions in vfio_pci.c
extern, so that they can be moved to a common source file:

*) vfio_pci_set_vga_decode
*) vfio_pci_probe_power_state
*) vfio_pci_set_power_state
*) vfio_pci_enable
*) vfio_pci_disable
*) vfio_pci_refresh_config
*) vfio_pci_register_dev_region
*) vfio_pci_ioctl
*) vfio_pci_read
*) vfio_pci_write
*) vfio_pci_mmap
*) vfio_pci_request
*) vfio_pci_err_handlers
*) vfio_pci_reflck_attach
*) vfio_pci_reflck_put
*) vfio_pci_fill_ids
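A minimal sketch of why dropping static matters, using a hypothetical, trimmed-down model of struct vfio_device_ops: two ops tables, one per module, point at one common function with external linkage while keeping their own module-specific hooks. All names are illustrative only; in the kernel, a function shared across separately built modules would additionally need to be exported (e.g. with EXPORT_SYMBOL_GPL).

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of a few struct vfio_device_ops members. */
struct model_device_ops {
	const char *name;
	int (*open)(void *device_data);
	ssize_t (*read)(void *device_data, char *buf, size_t count);
};

/* Common function: no "static", so it can live in a shared source file. */
ssize_t model_common_read(void *device_data, char *buf, size_t count)
{
	(void)device_data;
	(void)buf;
	/* Mirrors vfio_pci_read()'s early return for zero-length reads. */
	if (!count)
		return 0;
	return (ssize_t)count;	/* model: pretend the whole request was served */
}

/* Module-specific hooks stay static in each module. */
static int vfio_pci_open_model(void *device_data)
{
	(void)device_data;
	return 0;
}

static int vfio_mdev_pci_open_model(void *device_data)
{
	(void)device_data;
	return 0;
}

/* Two ops tables sharing the same common read path. */
const struct model_device_ops vfio_pci_ops_model = {
	.name = "vfio-pci",
	.open = vfio_pci_open_model,
	.read = model_common_read,
};

const struct model_device_ops vfio_mdev_pci_ops_model = {
	.name = "vfio-mdev-pci",
	.open = vfio_mdev_pci_open_model,
	.read = model_common_read,
};
```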

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 30 +++++++++++++-----------------
 drivers/vfio/pci/vfio_pci_private.h | 15 +++++++++++++++
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 9140f5e5..103e493 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -62,7 +62,7 @@ MODULE_PARM_DESC(disable_idle_d3,
  * has no way to get to it and routing can be disabled externally at the
  * bridge.
  */
-static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
+unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 {
 	struct vfio_pci_device *vdev = opaque;
 	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
@@ -165,7 +165,6 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 }
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
-static void vfio_pci_disable(struct vfio_pci_device *vdev);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -196,7 +195,7 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
-static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u16 pmcsr;
@@ -247,7 +246,7 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
 	return ret;
 }
 
-static int vfio_pci_enable(struct vfio_pci_device *vdev)
+int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int ret;
@@ -354,7 +353,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	return ret;
 }
 
-static void vfio_pci_disable(struct vfio_pci_device *vdev)
+void vfio_pci_disable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
@@ -687,8 +686,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static long vfio_pci_ioctl(void *device_data,
-			   unsigned int cmd, unsigned long arg)
+long vfio_pci_ioctl(void *device_data,
+		   unsigned int cmd, unsigned long arg)
 {
 	struct vfio_pci_device *vdev = device_data;
 	unsigned long minsz;
@@ -1173,7 +1172,7 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 	return -EINVAL;
 }
 
-static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+ssize_t vfio_pci_read(void *device_data, char __user *buf,
 			     size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1182,7 +1181,7 @@ static ssize_t vfio_pci_read(void *device_data, char __user *buf,
 	return vfio_pci_rw(device_data, buf, count, ppos, false);
 }
 
-static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 			      size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1191,7 +1190,7 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
-static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1253,7 +1252,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 			       req_len, vma->vm_page_prot);
 }
 
-static void vfio_pci_request(void *device_data, unsigned int count)
+void vfio_pci_request(void *device_data, unsigned int count)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1285,9 +1284,6 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.request	= vfio_pci_request,
 };
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
-static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
-
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
@@ -1490,7 +1486,7 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 {
 	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
 
@@ -1516,7 +1512,7 @@ static void vfio_pci_reflck_release(struct kref *kref)
 	mutex_unlock(&reflck_lock);
 }
 
-static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 {
 	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
 }
@@ -1639,7 +1635,7 @@ static void __exit vfio_pci_cleanup(void)
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
+void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
 	int rc;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9263021..194d487 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -185,6 +185,21 @@ extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 
 extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
 				    pci_power_t state);
+extern unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga);
+extern int vfio_pci_enable(struct vfio_pci_device *vdev);
+extern void vfio_pci_disable(struct vfio_pci_device *vdev);
+extern long vfio_pci_ioctl(void *device_data,
+			unsigned int cmd, unsigned long arg);
+extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			size_t count, loff_t *ppos);
+extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
+extern void vfio_pci_request(void *device_data, unsigned int count);
+extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
+extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
 
 #ifdef CONFIG_VFIO_PCI_IGD
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
-- 
2.7.4



* [PATCH v4 05/12] vfio_pci: duplicate vfio_pci.c
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (3 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 04/12] vfio_pci: make common functions be extern Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-15 11:03   ` Cornelia Huck
  2020-01-07 12:01 ` [PATCH v4 06/12] vfio_pci: shrink vfio_pci_common.c Liu Yi L
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch has no code change; it is just a file copy. In the following
patches, vfio_pci_common.c will be trimmed to contain only the common
functions and their related static helpers from the original vfio_pci.c,
while vfio_pci.c will be trimmed to contain only vfio-pci module specific
code.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_common.c | 1708 ++++++++++++++++++++++++++++++++++++
 1 file changed, 1708 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_common.c

diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
new file mode 100644
index 0000000..103e493
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -0,0 +1,1708 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#define dev_fmt pr_fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool nointxmask;
+module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nointxmask,
+		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
+
+#ifdef CONFIG_VFIO_PCI_VGA
+static bool disable_vga;
+module_param(disable_vga, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-pci");
+#endif
+
+static bool disable_idle_d3;
+module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_idle_d3,
+		 "Disable using the PCI D3 low power state for idle, unused devices");
+
+/*
+ * Our VGA arbiter participation is limited since we don't know anything
+ * about the device itself.  However, if the device is the only VGA device
+ * downstream of a bridge and VFIO VGA support is disabled, then we can
+ * safely return legacy VGA IO and memory as not decoded since the user
+ * has no way to get to it and routing can be disabled externally at the
+ * bridge.
+ */
+unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
+{
+	struct vfio_pci_device *vdev = opaque;
+	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
+	unsigned char max_busnr;
+	unsigned int decodes;
+
+	if (single_vga || !vfio_vga_disabled(vdev) ||
+		pci_is_root_bus(pdev->bus))
+		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
+
+	max_busnr = pci_bus_max_busnr(pdev->bus);
+	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
+
+	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
+		if (tmp == pdev ||
+		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
+		    pci_is_root_bus(tmp->bus))
+			continue;
+
+		if (tmp->bus->number >= pdev->bus->number &&
+		    tmp->bus->number <= max_busnr) {
+			pci_dev_put(tmp);
+			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
+			break;
+		}
+	}
+
+	return decodes;
+}
+
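The decision made by vfio_pci_set_vga_decode() can be modeled in plain userspace C. This is a hypothetical sketch, not kernel code. must_decode_legacy() and its parameters are invented for illustration, and the caller is assumed to have already applied the same-domain and non-root-bus filters that the pci_get_class() loop applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of vfio_pci_set_vga_decode(). `others` holds the bus
 * numbers of the other VGA devices that share our PCI domain and are not
 * on a root bus. Returns true if legacy VGA IO/MEM must still be decoded.
 */
static bool must_decode_legacy(bool single_vga, bool vga_enabled,
			       bool on_root_bus, unsigned char bus,
			       unsigned char max_busnr,
			       const unsigned char *others, size_t n)
{
	assert(others != NULL || n == 0);

	/* Same early exit as the kernel: report every resource decoded. */
	if (single_vga || vga_enabled || on_root_bus)
		return true;

	/* Another VGA device on our bus or below it keeps legacy decode. */
	for (size_t i = 0; i < n; i++)
		if (others[i] >= bus && others[i] <= max_busnr)
			return true;

	return false;
}
```

With VFIO VGA support disabled and no other VGA device in the bus range [bus, max_busnr], the model reports that legacy ranges need not be decoded, matching the comment above the kernel function.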
+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int i;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+		int bar = i + PCI_STD_RESOURCES;
+
+		res = &vdev->pdev->resource[bar];
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case a hot-added
+			 * device's BAR is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.name = "vfio sub-page reserved";
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+						&dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}
+			dummy_res->index = bar;
+			list_add(&dummy_res->res_next,
+					&vdev->dummy_resources_list);
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+		/*
+		 * Here we don't handle the case when the BAR is not page
+		 * aligned, because we can't expect the BAR to be assigned
+		 * to the same offset within a page in the guest when we
+		 * pass the BAR through.  It's also hard to access this BAR
+		 * from userspace, because we have no way to learn the
+		 * BAR's offset within a page.
+		 */
+no_mmap:
+		vdev->bar_mmap_supported[bar] = false;
+	}
+}
+
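The sub-page BAR rule in vfio_pci_probe_mmaps() reduces to a little address arithmetic. The following is a hypothetical userspace model: plan_bar_mmap() and struct bar_mmap_plan are invented names, the BAR is assumed to already be a memory BAR, a 4 KiB page size is assumed, and the kernel's allocation and request_resource() failure paths are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ 4096ULL  /* assumption: 4 KiB pages */

struct bar_mmap_plan {
	bool mmap_ok;         /* would bar_mmap_supported[] be set? */
	bool needs_dummy;     /* must the rest of the page be reserved? */
	uint64_t dummy_start;
	uint64_t dummy_end;   /* inclusive, like struct resource */
};

/* Model of the per-BAR decision for a memory BAR at [start, start+size). */
static struct bar_mmap_plan plan_bar_mmap(uint64_t start, uint64_t size)
{
	struct bar_mmap_plan p = { 0 };
	uint64_t end = start + size - 1;  /* inclusive end */

	if (size == 0)
		return p;                 /* zero-sized: never mmap */

	if (size >= PAGE_SZ) {
		p.mmap_ok = true;         /* whole pages, nothing to reserve */
		return p;
	}

	if ((start & (PAGE_SZ - 1)) == 0) {
		/* Page-aligned sub-page BAR: reserve the page remainder. */
		p.mmap_ok = true;
		p.needs_dummy = true;
		p.dummy_start = end + 1;
		p.dummy_end = start + PAGE_SZ - 1;
	}

	return p;                         /* unaligned: not mmappable */
}
```

For a 256-byte BAR at 0xfe000000 the model reserves 0xfe000100 through 0xfe000fff, so no hot-added BAR can land in the page that userspace maps.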
+static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
+
+/*
+ * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
+ * _and_ the ability to detect when the device is asserting INTx via
+ * PCI_STATUS.  If a device implements the former but not the latter, we would
+ * typically expect broken_intx_masking to be set and require an exclusive
+ * interrupt.  However, since we do have control of the device's ability to
+ * assert INTx, we can instead pretend that the device does not implement INTx,
+ * virtualizing the pin register to report zero and maintaining DisINTx set on
+ * the host.
+ */
+static bool vfio_pci_nointx(struct pci_dev *pdev)
+{
+	switch (pdev->vendor) {
+	case PCI_VENDOR_ID_INTEL:
+		switch (pdev->device) {
+		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
+		case 0x1572:
+		case 0x1574:
+		case 0x1580 ... 0x1581:
+		case 0x1583 ... 0x158b:
+		case 0x37d0 ... 0x37d2:
+			return true;
+		default:
+			return false;
+		}
+	}
+
+	return false;
+}
+
+void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 pmcsr;
+
+	if (!pdev->pm_cap)
+		return;
+
+	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
+
+	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
+}
+
+/*
+ * pci_set_power_state() wrapper handling devices which perform a soft reset on
+ * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
+ * restore when returned to D0.  Saved separately from pci_saved_state for use
+ * by PM capability emulation and separately from pci_dev internal saved state
+ * to avoid it being overwritten and consumed around other resets.
+ */
+int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	bool needs_restore = false, needs_save = false;
+	int ret;
+
+	if (vdev->needs_pm_restore) {
+		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
+			pci_save_state(pdev);
+			needs_save = true;
+		}
+
+		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
+			needs_restore = true;
+	}
+
+	ret = pci_set_power_state(pdev, state);
+
+	if (!ret) {
+		/* D3 might be unsupported via quirk, skip unless in D3 */
+		if (needs_save && pdev->current_state >= PCI_D3hot) {
+			vdev->pm_save = pci_store_saved_state(pdev);
+		} else if (needs_restore) {
+			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
+			pci_restore_state(pdev);
+		}
+	}
+
+	return ret;
+}
+
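The save/restore decision at the top of vfio_pci_set_power_state() can be isolated as a pure function. This is a hypothetical sketch: the enum only mirrors the ordering of pci_power_t (D0 < D1 < D2 < D3hot < D3cold), plan_pm_transition() is an invented name, and the kernel's extra check that the device actually reached D3 before stashing the state is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Ordering mirrors pci_power_t: D0 < D1 < D2 < D3hot < D3cold. */
enum pwr { D0, D1, D2, D3HOT, D3COLD };

struct pm_plan {
	bool save;     /* snapshot config space before entering D3 */
	bool restore;  /* reload that snapshot after returning to D0 */
};

/*
 * Model of the needs_save/needs_restore logic. needs_pm_restore is true
 * when the PM capability does not advertise No_Soft_Reset.
 */
static struct pm_plan plan_pm_transition(bool needs_pm_restore,
					 enum pwr cur, enum pwr target)
{
	struct pm_plan p = { false, false };

	if (!needs_pm_restore)
		return p;

	if (cur < D3HOT && target >= D3HOT)
		p.save = true;
	if (cur >= D3HOT && target <= D0)
		p.restore = true;

	return p;
}
```

Devices that keep their state across D3hot->D0 (No_Soft_Reset set) never need either step, which is why the whole block is gated on needs_pm_restore.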
+int vfio_pci_enable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+	u16 cmd;
+	u8 msix_pos;
+
+	vfio_pci_set_power_state(vdev, PCI_D0);
+
+	/* Don't allow our initial saved state to include busmaster */
+	pci_clear_master(pdev);
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		return ret;
+
+	/* If reset fails because of the device lock, fail this path entirely */
+	ret = pci_try_reset_function(pdev);
+	if (ret == -EAGAIN) {
+		pci_disable_device(pdev);
+		return ret;
+	}
+
+	vdev->reset_works = !ret;
+	pci_save_state(pdev);
+	vdev->pci_saved_state = pci_store_saved_state(pdev);
+	if (!vdev->pci_saved_state)
+		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
+
+	if (likely(!vdev->nointxmask)) {
+		if (vfio_pci_nointx(pdev)) {
+			pci_info(pdev, "Masking broken INTx support\n");
+			vdev->nointx = true;
+			pci_intx(pdev, 0);
+		} else
+			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+	}
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
+		cmd &= ~PCI_COMMAND_INTX_DISABLE;
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+	}
+
+	ret = vfio_config_init(vdev);
+	if (ret) {
+		kfree(vdev->pci_saved_state);
+		vdev->pci_saved_state = NULL;
+		pci_disable_device(pdev);
+		return ret;
+	}
+
+	msix_pos = pdev->msix_cap;
+	if (msix_pos) {
+		u16 flags;
+		u32 table;
+
+		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
+		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
+
+		vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
+		vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
+		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
+	} else
+		vdev->msix_bar = 0xFF;
+
+	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
+		vdev->has_vga = true;
+
+	if (vfio_pci_is_vga(pdev) &&
+	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
+		ret = vfio_pci_igd_init(vdev);
+		if (ret) {
+			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_ibm_npu2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
+			goto disable_exit;
+		}
+	}
+
+	vfio_pci_probe_mmaps(vdev);
+
+	return 0;
+
+disable_exit:
+	vfio_pci_disable(vdev);
+	return ret;
+}
+
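vfio_pci_enable() locates the MSI-X table from two config-space fields. The sketch below reproduces that decoding in userspace. The mask values are copied from include/uapi/linux/pci_regs.h, while decode_msix() and struct msix_loc are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_MSIX_FLAGS_QSIZE  0x07FF      /* Table size - 1 */
#define PCI_MSIX_TABLE_BIR    0x00000007  /* BAR holding the table */
#define PCI_MSIX_TABLE_OFFSET 0xfffffff8  /* Offset into that BAR */
#define MSIX_ENTRY_SIZE       16          /* bytes per table entry */

struct msix_loc {
	unsigned int bar;  /* corresponds to vdev->msix_bar */
	uint32_t offset;   /* corresponds to vdev->msix_offset */
	uint32_t size;     /* corresponds to vdev->msix_size */
};

/* Decode Message Control (flags) and the Table Offset/BIR dword. */
static struct msix_loc decode_msix(uint16_t flags, uint32_t table)
{
	struct msix_loc loc;

	loc.bar = table & PCI_MSIX_TABLE_BIR;
	loc.offset = table & PCI_MSIX_TABLE_OFFSET;
	loc.size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * MSIX_ENTRY_SIZE;
	return loc;
}
```

A Table Offset/BIR value of 0x00002003 with a table-size field of 0x3f places 64 entries (1024 bytes) at offset 0x2000 of BAR 3. The enable and function-mask bits in the upper part of Message Control do not affect the size.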
+void vfio_pci_disable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_dummy_resource *dummy_res, *tmp;
+	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
+	int i, bar;
+
+	/* Stop the device from further DMA */
+	pci_clear_master(pdev);
+
+	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
+				VFIO_IRQ_SET_ACTION_TRIGGER,
+				vdev->irq_type, 0, 0, NULL);
+
+	/* Device closed, don't need mutex here */
+	list_for_each_entry_safe(ioeventfd, ioeventfd_tmp,
+				 &vdev->ioeventfds_list, next) {
+		vfio_virqfd_disable(&ioeventfd->virqfd);
+		list_del(&ioeventfd->next);
+		kfree(ioeventfd);
+	}
+	vdev->ioeventfds_nr = 0;
+
+	vdev->virq_disabled = false;
+
+	for (i = 0; i < vdev->num_regions; i++)
+		vdev->region[i].ops->release(vdev, &vdev->region[i]);
+
+	vdev->num_regions = 0;
+	kfree(vdev->region);
+	vdev->region = NULL; /* don't krealloc a freed pointer */
+
+	vfio_config_free(vdev);
+
+	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+		bar = i + PCI_STD_RESOURCES;
+		if (!vdev->barmap[bar])
+			continue;
+		pci_iounmap(pdev, vdev->barmap[bar]);
+		pci_release_selected_regions(pdev, 1 << bar);
+		vdev->barmap[bar] = NULL;
+	}
+
+	list_for_each_entry_safe(dummy_res, tmp,
+				 &vdev->dummy_resources_list, res_next) {
+		list_del(&dummy_res->res_next);
+		release_resource(&dummy_res->resource);
+		kfree(dummy_res);
+	}
+
+	vdev->needs_reset = true;
+
+	/*
+	 * If we have saved state, restore it.  If we can reset the device,
+	 * even better.  Resetting with current state seems better than
+	 * nothing, but saving and restoring current state without reset
+	 * is just busy work.
+	 */
+	if (pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)) {
+		pci_info(pdev, "%s: Couldn't reload saved state\n", __func__);
+
+		if (!vdev->reset_works)
+			goto out;
+
+		pci_save_state(pdev);
+	}
+
+	/*
+	 * Disable INTx and MSI, presumably to avoid spurious interrupts
+	 * during reset.  Stolen from pci_reset_function()
+	 */
+	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+
+	/*
+	 * Try to get the locks ourselves to prevent a deadlock. The
+	 * success of this is dependent on being able to lock the device,
+	 * which is not always possible.
+	 * We cannot use the "try" reset interface here, since it would
+	 * overwrite the previously restored configuration information.
+	 */
+	if (vdev->reset_works && pci_cfg_access_trylock(pdev)) {
+		if (device_trylock(&pdev->dev)) {
+			if (!__pci_reset_function_locked(pdev))
+				vdev->needs_reset = false;
+			device_unlock(&pdev->dev);
+		}
+		pci_cfg_access_unlock(pdev);
+	}
+
+	pci_restore_state(pdev);
+out:
+	pci_disable_device(pdev);
+
+	vfio_pci_try_bus_reset(vdev);
+
+	if (!vdev->disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+}
+
+void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
+			     bool nointxmask, bool disable_idle_d3)
+{
+	vdev->nointxmask = nointxmask;
+	vdev->disable_idle_d3 = disable_idle_d3;
+}
+
+static void vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!(--vdev->refcnt)) {
+		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_pci_disable(vdev);
+	}
+
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_spapr_pci_eeh_open(vdev->pdev);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (ret)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+{
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
+		    vdev->nointx || vdev->pdev->is_virtfn)
+			return 0;
+
+		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+
+		return pin ? 1 : 0;
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = vdev->pdev->msi_cap;
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSI_FLAGS, &flags);
+			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = vdev->pdev->msix_cap;
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSIX_FLAGS, &flags);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
+		if (pci_is_pcie(vdev->pdev))
+			return 1;
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+		return 1;
+	}
+
+	return 0;
+}
+
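The MSI branch of vfio_pci_get_irq_count() relies on the Multiple Message Capable field (bits 3:1 of MSI Message Control) encoding log2 of the vector count. A minimal standalone sketch of that expression follows. PCI_MSI_FLAGS_QMASK is copied from pci_regs.h, and msi_vector_count() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Value as defined in include/uapi/linux/pci_regs.h. */
#define PCI_MSI_FLAGS_QMASK 0x000e  /* Multiple Message Capable, bits 3:1 */

/* Same expression as the kernel: 1 << log2(number of vectors). */
static unsigned int msi_vector_count(uint16_t flags)
{
	return 1u << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
}
```

For example, a Message Control value of 0x0086 (64-bit capable, field value 3) yields 8 vectors, and a field value of 5 yields the architectural maximum of 32.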
+static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
+{
+	(*(int *)data)++;
+	return 0;
+}
+
+struct vfio_pci_fill_info {
+	int max;
+	int cur;
+	struct vfio_pci_dependent_device *devices;
+};
+
+static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_fill_info *fill = data;
+	struct iommu_group *iommu_group;
+
+	if (fill->cur == fill->max)
+		return -EAGAIN; /* Something changed, try again */
+
+	iommu_group = iommu_group_get(&pdev->dev);
+	if (!iommu_group)
+		return -EPERM; /* Cannot reset non-isolated devices */
+
+	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
+	fill->devices[fill->cur].bus = pdev->bus->number;
+	fill->devices[fill->cur].devfn = pdev->devfn;
+	fill->cur++;
+	iommu_group_put(iommu_group);
+	return 0;
+}
+
+struct vfio_pci_group_entry {
+	struct vfio_group *group;
+	int id;
+};
+
+struct vfio_pci_group_info {
+	int count;
+	struct vfio_pci_group_entry *groups;
+};
+
+static int vfio_pci_validate_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_group_info *info = data;
+	struct iommu_group *group;
+	int id, i;
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EPERM;
+
+	id = iommu_group_id(group);
+
+	for (i = 0; i < info->count; i++)
+		if (info->groups[i].id == id)
+			break;
+
+	iommu_group_put(group);
+
+	return (i == info->count) ? -EINVAL : 0;
+}
+
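vfio_pci_validate_devs() is the ownership test behind VFIO_DEVICE_PCI_HOT_RESET: every device the reset would touch must belong to one of the groups the caller proved it owns by passing group fds. The sketch below models that check over plain integer group IDs. user_owns_all() is a hypothetical name, and the kernel's -EPERM case for a device with no IOMMU group is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: returns true only if every affected device's group
 * ID appears in the caller-supplied list (the kernel returns -EINVAL on
 * the first device that is not covered).
 */
static bool user_owns_all(const int *dev_group_ids, size_t ndevs,
			  const int *user_group_ids, size_t ngroups)
{
	assert(dev_group_ids != NULL || ndevs == 0);
	assert(user_group_ids != NULL || ngroups == 0);

	for (size_t d = 0; d < ndevs; d++) {
		bool found = false;

		for (size_t g = 0; g < ngroups; g++) {
			if (user_group_ids[g] == dev_group_ids[d]) {
				found = true;
				break;
			}
		}
		if (!found)
			return false;
	}
	return true;
}
```

Since several devices may share a group, a caller can legitimately pass fewer group fds than there are affected devices, which is why the ioctl only bounds hdr.count by the device count.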
+static bool vfio_pci_dev_below_slot(struct pci_dev *pdev, struct pci_slot *slot)
+{
+	for (; pdev; pdev = pdev->bus->self)
+		if (pdev->bus == slot->bus)
+			return (pdev->slot == slot);
+	return false;
+}
+
+struct vfio_pci_walk_info {
+	int (*fn)(struct pci_dev *, void *data);
+	void *data;
+	struct pci_dev *pdev;
+	bool slot;
+	int ret;
+};
+
+static int vfio_pci_walk_wrapper(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_walk_info *walk = data;
+
+	if (!walk->slot || vfio_pci_dev_below_slot(pdev, walk->pdev->slot))
+		walk->ret = walk->fn(pdev, walk->data);
+
+	return walk->ret;
+}
+
+static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
+					 int (*fn)(struct pci_dev *,
+						   void *data), void *data,
+					 bool slot)
+{
+	struct vfio_pci_walk_info walk = {
+		.fn = fn, .data = data, .pdev = pdev, .slot = slot, .ret = 0,
+	};
+
+	pci_walk_bus(pdev->bus, vfio_pci_walk_wrapper, &walk);
+
+	return walk.ret;
+}
+
+static int msix_mmappable_cap(struct vfio_pci_device *vdev,
+			      struct vfio_info_cap *caps)
+{
+	struct vfio_info_cap_header header = {
+		.id = VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
+		.version = 1
+	};
+
+	return vfio_info_add_capability(caps, &header, sizeof(header));
+}
+
+int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+				 unsigned int type, unsigned int subtype,
+				 const struct vfio_pci_regops *ops,
+				 size_t size, u32 flags, void *data)
+{
+	struct vfio_pci_region *region;
+
+	region = krealloc(vdev->region,
+			  (vdev->num_regions + 1) * sizeof(*region),
+			  GFP_KERNEL);
+	if (!region)
+		return -ENOMEM;
+
+	vdev->region = region;
+	vdev->region[vdev->num_regions].type = type;
+	vdev->region[vdev->num_regions].subtype = subtype;
+	vdev->region[vdev->num_regions].ops = ops;
+	vdev->region[vdev->num_regions].size = size;
+	vdev->region[vdev->num_regions].flags = flags;
+	vdev->region[vdev->num_regions].data = data;
+
+	vdev->num_regions++;
+
+	return 0;
+}
+
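vfio_pci_register_dev_region() grows the region array one slot at a time and only publishes the new pointer once krealloc() succeeds. The same pattern in userspace C uses realloc(). The struct layout and add_region() below are hypothetical simplifications that keep only a few of the kernel's fields.

```c
#include <assert.h>
#include <stdlib.h>

struct region {
	unsigned int type, subtype;
	size_t size;
};

struct region_table {
	struct region *region;  /* NULL when empty, like vdev->region */
	int num_regions;
};

/*
 * Append one region. On failure the old array is left untouched, which
 * is why the result goes into a temporary before being stored.
 */
static int add_region(struct region_table *t, unsigned int type,
		      unsigned int subtype, size_t size)
{
	struct region *r;

	assert(t != NULL);
	r = realloc(t->region, (t->num_regions + 1) * sizeof(*r));
	if (!r)
		return -1;  /* t->region is still valid */

	t->region = r;
	t->region[t->num_regions] = (struct region){ type, subtype, size };
	t->num_regions++;
	return 0;
}
```

This also explains the `vdev->region = NULL` line in vfio_pci_disable(): realloc(NULL, n) and krealloc(NULL, n) behave like a fresh allocation, whereas passing a freed pointer would not be safe.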
+long vfio_pci_ioctl(void *device_data,
+		    unsigned int cmd, unsigned long arg)
+{
+	struct vfio_pci_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (vdev->reset_works)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct pci_dev *pdev = vdev->pdev;
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		int i, ret;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = pdev->cfg_size;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = pci_resource_len(pdev, info.index);
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			if (vdev->bar_mmap_supported[info.index]) {
+				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+				if (info.index == vdev->msix_bar) {
+					ret = msix_mmappable_cap(vdev, &caps);
+					if (ret)
+						return ret;
+				}
+			}
+
+			break;
+		case VFIO_PCI_ROM_REGION_INDEX:
+		{
+			void __iomem *io;
+			size_t size;
+			u16 orig_cmd;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.flags = 0;
+
+			/* Report the BAR size, not the ROM size */
+			info.size = pci_resource_len(pdev, info.index);
+			if (!info.size) {
+				/* Shadow ROMs appear as PCI option ROMs */
+				if (pdev->resource[PCI_ROM_RESOURCE].flags &
+							IORESOURCE_ROM_SHADOW)
+					info.size = 0x20000;
+				else
+					break;
+			}
+
+			/*
+			 * Is it really there?  Enable memory decode for
+			 * implicit access in pci_map_rom().
+			 */
+			pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
+			pci_write_config_word(pdev, PCI_COMMAND,
+					      orig_cmd | PCI_COMMAND_MEMORY);
+
+			io = pci_map_rom(pdev, &size);
+			if (io) {
+				info.flags = VFIO_REGION_INFO_FLAG_READ;
+				pci_unmap_rom(pdev, io);
+			} else {
+				info.size = 0;
+			}
+
+			pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
+			break;
+		}
+		case VFIO_PCI_VGA_REGION_INDEX:
+			if (!vdev->has_vga)
+				return -EINVAL;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0xc0000;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+
+			break;
+		default:
+		{
+			struct vfio_region_info_cap_type cap_type = {
+					.header.id = VFIO_REGION_INFO_CAP_TYPE,
+					.header.version = 1 };
+
+			if (info.index >=
+			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
+				return -EINVAL;
+			info.index = array_index_nospec(info.index,
+							VFIO_PCI_NUM_REGIONS +
+							vdev->num_regions);
+
+			i = info.index - VFIO_PCI_NUM_REGIONS;
+
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->region[i].size;
+			info.flags = vdev->region[i].flags;
+
+			cap_type.type = vdev->region[i].type;
+			cap_type.subtype = vdev->region[i].subtype;
+
+			ret = vfio_info_add_capability(&caps, &cap_type.header,
+						       sizeof(cap_type));
+			if (ret)
+				return ret;
+
+			if (vdev->region[i].ops->add_capability) {
+				ret = vdev->region[i].ops->add_capability(vdev,
+						&vdev->region[i], &caps);
+				if (ret)
+					return ret;
+			}
+		}
+		}
+
+		if (caps.size) {
+			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+			if (info.argsz < sizeof(info) + caps.size) {
+				info.argsz = sizeof(info) + caps.size;
+				info.cap_offset = 0;
+			} else {
+				vfio_info_cap_shift(&caps, sizeof(info));
+				if (copy_to_user((void __user *)arg +
+						  sizeof(info), caps.buf,
+						  caps.size)) {
+					kfree(caps.buf);
+					return -EFAULT;
+				}
+				info.cap_offset = sizeof(info);
+			}
+
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+		case VFIO_PCI_ERR_IRQ_INDEX:
+			if (pci_is_pcie(vdev->pdev))
+				break;
+		/* fall through */
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+		info.count = vfio_pci_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+				       VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		int max, ret = 0;
+		size_t data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		max = vfio_pci_get_irq_count(vdev, hdr.index);
+
+		ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
+						 VFIO_PCI_NUM_IRQS, &data_size);
+		if (ret)
+			return ret;
+
+		if (data_size) {
+			data = memdup_user((void __user *)(arg + minsz),
+					    data_size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		mutex_lock(&vdev->igate);
+
+		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+					      hdr.start, hdr.count, data);
+
+		mutex_unlock(&vdev->igate);
+		kfree(data);
+
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_RESET) {
+		return vdev->reset_works ?
+			pci_try_reset_function(vdev->pdev) : -EINVAL;
+
+	} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
+		struct vfio_pci_hot_reset_info hdr;
+		struct vfio_pci_fill_info fill = { 0 };
+		struct vfio_pci_dependent_device *devices = NULL;
+		bool slot = false;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_pci_hot_reset_info, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz)
+			return -EINVAL;
+
+		hdr.flags = 0;
+
+		/* Can we do a slot or bus reset or neither? */
+		if (!pci_probe_reset_slot(vdev->pdev->slot))
+			slot = true;
+		else if (pci_probe_reset_bus(vdev->pdev->bus))
+			return -ENODEV;
+
+		/* How many devices are affected? */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_count_devs,
+						    &fill.max, slot);
+		if (ret)
+			return ret;
+
+		WARN_ON(!fill.max); /* Should always be at least one */
+
+		/*
+		 * If there's enough space, fill it now, otherwise return
+		 * -ENOSPC and the number of devices affected.
+		 */
+		if (hdr.argsz < sizeof(hdr) + (fill.max * sizeof(*devices))) {
+			ret = -ENOSPC;
+			hdr.count = fill.max;
+			goto reset_info_exit;
+		}
+
+		devices = kcalloc(fill.max, sizeof(*devices), GFP_KERNEL);
+		if (!devices)
+			return -ENOMEM;
+
+		fill.devices = devices;
+
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_fill_devs,
+						    &fill, slot);
+
+		/*
+		 * If a device was removed between counting and filling,
+		 * we may come up short of fill.max.  If a device was
+		 * added, we'll have a return of -EAGAIN above.
+		 */
+		if (!ret)
+			hdr.count = fill.cur;
+
+reset_info_exit:
+		if (copy_to_user((void __user *)arg, &hdr, minsz))
+			ret = -EFAULT;
+
+		if (!ret) {
+			if (copy_to_user((void __user *)(arg + minsz), devices,
+					 hdr.count * sizeof(*devices)))
+				ret = -EFAULT;
+		}
+
+		kfree(devices);
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_PCI_HOT_RESET) {
+		struct vfio_pci_hot_reset hdr;
+		int32_t *group_fds;
+		struct vfio_pci_group_entry *groups;
+		struct vfio_pci_group_info info;
+		bool slot = false;
+		int i, count = 0, ret = 0;
+
+		minsz = offsetofend(struct vfio_pci_hot_reset, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.flags)
+			return -EINVAL;
+
+		/* Can we do a slot or bus reset or neither? */
+		if (!pci_probe_reset_slot(vdev->pdev->slot))
+			slot = true;
+		else if (pci_probe_reset_bus(vdev->pdev->bus))
+			return -ENODEV;
+
+		/*
+		 * We can't let userspace give us an arbitrarily large
+		 * buffer to copy, so verify how many we think there
+		 * could be.  Note groups can have multiple devices so
+		 * one group per device is the max.
+		 */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_count_devs,
+						    &count, slot);
+		if (ret)
+			return ret;
+
+		/* Somewhere between 1 and count is OK */
+		if (!hdr.count || hdr.count > count)
+			return -EINVAL;
+
+		group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
+		groups = kcalloc(hdr.count, sizeof(*groups), GFP_KERNEL);
+		if (!group_fds || !groups) {
+			kfree(group_fds);
+			kfree(groups);
+			return -ENOMEM;
+		}
+
+		if (copy_from_user(group_fds, (void __user *)(arg + minsz),
+				   hdr.count * sizeof(*group_fds))) {
+			kfree(group_fds);
+			kfree(groups);
+			return -EFAULT;
+		}
+
+		/*
+		 * For each group_fd, get the group through the vfio external
+		 * user interface and store the group and iommu ID.  This
+		 * ensures the group is held across the reset.
+		 */
+		for (i = 0; i < hdr.count; i++) {
+			struct vfio_group *group;
+			struct fd f = fdget(group_fds[i]);
+			if (!f.file) {
+				ret = -EBADF;
+				break;
+			}
+
+			group = vfio_group_get_external_user(f.file);
+			fdput(f);
+			if (IS_ERR(group)) {
+				ret = PTR_ERR(group);
+				break;
+			}
+
+			groups[i].group = group;
+			groups[i].id = vfio_external_user_iommu_id(group);
+		}
+
+		kfree(group_fds);
+
+		/* release reference to groups on error */
+		if (ret)
+			goto hot_reset_release;
+
+		info.count = hdr.count;
+		info.groups = groups;
+
+		/*
+		 * Test whether all the affected devices are contained
+		 * by the set of groups provided by the user.
+		 */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+						    vfio_pci_validate_devs,
+						    &info, slot);
+		if (!ret)
+			/* User has access, do the reset */
+			ret = pci_reset_bus(vdev->pdev);
+
+hot_reset_release:
+		for (i--; i >= 0; i--)
+			vfio_group_put_external_user(groups[i].group);
+
+		kfree(groups);
+		return ret;
+	} else if (cmd == VFIO_DEVICE_IOEVENTFD) {
+		struct vfio_device_ioeventfd ioeventfd;
+		int count;
+
+		minsz = offsetofend(struct vfio_device_ioeventfd, fd);
+
+		if (copy_from_user(&ioeventfd, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ioeventfd.argsz < minsz)
+			return -EINVAL;
+
+		if (ioeventfd.flags & ~VFIO_DEVICE_IOEVENTFD_SIZE_MASK)
+			return -EINVAL;
+
+		count = ioeventfd.flags & VFIO_DEVICE_IOEVENTFD_SIZE_MASK;
+
+		if (hweight8(count) != 1 || ioeventfd.fd < -1)
+			return -EINVAL;
+
+		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
+					  ioeventfd.data, count, ioeventfd.fd);
+	}
+
+	return -ENOTTY;
+}
+
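VFIO_DEVICE_GET_REGION_INFO reports a capability chain through a two-call argsz handshake. If the caller's buffer is too small, the kernel raises argsz to the required size and leaves cap_offset at 0, so userspace can reallocate and retry. The following is a hypothetical model of that branch. hdr_size stands in for sizeof(struct vfio_region_info), caps_size for the length of the built chain, and reply_with_caps() is an invented name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct region_reply {
	uint32_t argsz;       /* size the caller must provide */
	uint32_t cap_offset;  /* 0 when the chain was not copied */
	bool caps_copied;
};

/* Model of the `if (caps.size)` block in the region-info ioctl. */
static struct region_reply reply_with_caps(uint32_t user_argsz,
					   uint32_t hdr_size,
					   uint32_t caps_size)
{
	struct region_reply r = { user_argsz, 0, false };

	if (caps_size == 0)
		return r;                        /* nothing extra to report */

	if (user_argsz < hdr_size + caps_size) {
		r.argsz = hdr_size + caps_size;  /* caller should retry */
	} else {
		r.cap_offset = hdr_size;         /* chain follows the header */
		r.caps_copied = true;
	}
	return r;
}
```

A caller that first passes argsz equal to the header size learns the full size from the returned argsz; the second call with that size receives the chain immediately after the header.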
+static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+		if (iswrite)
+			return -EINVAL;
+		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_VGA_REGION_INDEX:
+		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+	default:
+		index -= VFIO_PCI_NUM_REGIONS;
+		return vdev->region[index].ops->rw(vdev, buf,
+						   count, ppos, iswrite);
+	}
+
+	return -EINVAL;
+}
+
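vfio_pci_rw() dispatches on the region index encoded in the high bits of the file offset. The macros below mirror the vfio-pci private definitions (each region gets its own 2^40-byte window). The encoding is internal to the driver; userspace is expected to use the offset reported by VFIO_DEVICE_GET_REGION_INFO rather than hard-code it. split_pos() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the vfio-pci private offset macros. */
#define VFIO_PCI_OFFSET_SHIFT 40
#define VFIO_PCI_OFFSET_TO_INDEX(off) ((off) >> VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_INDEX_TO_OFFSET(index) \
	((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_OFFSET_MASK (((uint64_t)1 << VFIO_PCI_OFFSET_SHIFT) - 1)

struct region_pos {
	unsigned int index;  /* which region (BAR0..5, ROM, config, ...) */
	uint64_t off;        /* byte offset inside that region */
};

/* Split a device-fd file position into region index and offset. */
static struct region_pos split_pos(uint64_t pos)
{
	struct region_pos p;

	p.index = (unsigned int)VFIO_PCI_OFFSET_TO_INDEX(pos);
	p.off = pos & VFIO_PCI_OFFSET_MASK;
	return p;
}
```

With the standard vfio-pci layout, index 2 is BAR2 and index 7 is the config-space region, so a pread() at INDEX_TO_OFFSET(7) + 4 lands in vfio_pci_config_rw() at byte 4.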
+ssize_t vfio_pci_read(void *device_data, char __user *buf,
+		      size_t count, loff_t *ppos)
+{
+	if (!count)
+		return 0;
+
+	return vfio_pci_rw(device_data, buf, count, ppos, false);
+}
+
+ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+		       size_t count, loff_t *ppos)
+{
+	if (!count)
+		return 0;
+
+	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
+}
+
+int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index;
+	u64 phys_len, req_len, pgoff, req_start;
+	int ret;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		int regnum = index - VFIO_PCI_NUM_REGIONS;
+		struct vfio_pci_region *region = vdev->region + regnum;
+
+		if (region && region->ops && region->ops->mmap &&
+		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+			return region->ops->mmap(vdev, region, vma);
+		return -EINVAL;
+	}
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	if (!vdev->bar_mmap_supported[index])
+		return -EINVAL;
+
+	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (req_start + req_len > phys_len)
+		return -EINVAL;
+
+	/*
+	 * Even though we don't make use of the barmap for the mmap,
+	 * we need to request the region and the barmap tracks that.
+	 */
+	if (!vdev->barmap[index]) {
+		ret = pci_request_selected_regions(pdev,
+						   1 << index, "vfio-pci");
+		if (ret)
+			return ret;
+
+		vdev->barmap[index] = pci_iomap(pdev, index, 0);
+		if (!vdev->barmap[index]) {
+			pci_release_selected_regions(pdev, 1 << index);
+			return -ENOMEM;
+		}
+	}
+
+	vma->vm_private_data = vdev;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			       req_len, vma->vm_page_prot);
+}
+
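vfio_pci_mmap() applies the same 2^40 split to vma->vm_pgoff, which counts pages instead of bytes, and then checks that the request fits inside the page-aligned BAR length. This is a hypothetical model assuming 4 KiB pages (MODEL_PAGE_SHIFT is an assumption, not a kernel name). split_pgoff() and mmap_fits() are invented helpers, and the VM_SHARED and bar_mmap_supported[] checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VFIO_PCI_OFFSET_SHIFT 40
#define MODEL_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

struct mmap_pos {
	unsigned int index;  /* region selected by the high pgoff bits */
	uint64_t req_start;  /* byte offset of the mapping in the BAR */
};

/* High bits pick the region, low (40 - PAGE_SHIFT) bits are a page index. */
static struct mmap_pos split_pgoff(uint64_t vm_pgoff)
{
	const unsigned int shift = VFIO_PCI_OFFSET_SHIFT - MODEL_PAGE_SHIFT;
	struct mmap_pos m;

	m.index = (unsigned int)(vm_pgoff >> shift);
	m.req_start = (vm_pgoff & ((1ULL << shift) - 1)) << MODEL_PAGE_SHIFT;
	return m;
}

/* Equivalent of `req_start + req_len > phys_len` being false. */
static bool mmap_fits(uint64_t req_start, uint64_t req_len, uint64_t bar_len)
{
	const uint64_t page = 1ULL << MODEL_PAGE_SHIFT;
	uint64_t phys_len = (bar_len + page - 1) & ~(page - 1);  /* PAGE_ALIGN */

	return req_start + req_len <= phys_len;
}
```

Because phys_len is page aligned, a 256-byte BAR still permits a one-page mapping; that is only safe because vfio_pci_probe_mmaps() reserved the rest of the page.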
+void vfio_pci_request(void *device_data, unsigned int count)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	mutex_lock(&vdev->igate);
+
+	if (vdev->req_trigger) {
+		if (!(count % 10))
+			pci_notice_ratelimited(pdev,
+				"Relaying device request to user (#%u)\n",
+				count);
+		eventfd_signal(vdev->req_trigger, 1);
+	} else if (count == 0) {
+		pci_warn(pdev,
+			"No device request channel registered, blocked until released by user\n");
+	}
+
+	mutex_unlock(&vdev->igate);
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_open,
+	.release	= vfio_pci_release,
+	.ioctl		= vfio_pci_ioctl,
+	.read		= vfio_pci_read,
+	.write		= vfio_pci_write,
+	.mmap		= vfio_pci_mmap,
+	.request	= vfio_pci_request,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_device *vdev;
+	struct iommu_group *group;
+	int ret;
+
+	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	/*
+	 * Prevent binding to PFs with VFs enabled; this too easily allows a
+	 * userspace instance with VFs and PFs from the same device, which
+	 * cannot work.  Disabling SR-IOV here would initiate removing the
+	 * VFs, which would unbind the driver, which is prone to blocking
+	 * if that VF is also in use by vfio-pci.  Just reject these PFs
+	 * and let the user sort it out.
+	 */
+	if (pci_num_vf(pdev)) {
+		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
+		return -EBUSY;
+	}
+
+	group = vfio_iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		vfio_iommu_group_put(group, &pdev->dev);
+		return -ENOMEM;
+	}
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	mutex_init(&vdev->ioeventfds_lock);
+	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
+
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (ret) {
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		vfio_del_group_dev(&pdev->dev);
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
+		vga_set_legacy_decoding(pdev,
+					vfio_pci_set_vga_decode(vdev, false));
+	}
+
+	vfio_pci_probe_power_state(vdev);
+
+	if (!vdev->disable_idle_d3) {
+		/*
+		 * pci-core sets the device power state to an unknown value at
+		 * bootup and after being removed from a driver.  The only
+		 * transition it allows from this unknown state is to D0, which
+		 * typically happens when a driver calls pci_enable_device().
+		 * We're not ready to enable the device yet, but we do want to
+		 * be able to get to D3.  Therefore first do a D0 transition
+		 * before going to D3.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	}
+
+	return ret;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = vfio_del_group_dev(&pdev->dev);
+	if (!vdev)
+		return;
+
+	vfio_pci_reflck_put(vdev->reflck);
+
+	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
+	kfree(vdev->region);
+	mutex_destroy(&vdev->ioeventfds_lock);
+
+	if (!vdev->disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	kfree(vdev->pm_save);
+	kfree(vdev);
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, NULL, NULL, NULL);
+		vga_set_legacy_decoding(pdev,
+				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
+	}
+}
+
+static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
+						  pci_channel_state_t state)
+{
+	struct vfio_pci_device *vdev;
+	struct vfio_device *device;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (device == NULL)
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	vdev = vfio_device_data(device);
+	if (vdev == NULL) {
+		vfio_device_put(device);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	mutex_lock(&vdev->igate);
+
+	if (vdev->err_trigger)
+		eventfd_signal(vdev->err_trigger, 1);
+
+	mutex_unlock(&vdev->igate);
+
+	vfio_device_put(device);
+
+	return PCI_ERS_RESULT_CAN_RECOVER;
+}
+
+static const struct pci_error_handlers vfio_err_handlers = {
+	.error_detected = vfio_pci_aer_err_detected,
+};
+
+static struct pci_driver vfio_pci_driver = {
+	.name		= "vfio-pci",
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_pci_probe,
+	.remove		= vfio_pci_remove,
+	.err_handler	= &vfio_err_handlers,
+};
+
+static DEFINE_MUTEX(reflck_lock);
+
+static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
+{
+	struct vfio_pci_reflck *reflck;
+
+	reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
+	if (!reflck)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&reflck->kref);
+	mutex_init(&reflck->lock);
+
+	return reflck;
+}
+
+static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
+{
+	kref_get(&reflck->kref);
+}
+
+static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_device *vdev = data;
+	struct vfio_pci_reflck **preflck = &vdev->reflck;
+	struct vfio_device *device;
+	struct vfio_pci_device *tmp;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return 0;
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	tmp = vfio_device_data(device);
+
+	if (tmp->reflck) {
+		vfio_pci_reflck_get(tmp->reflck);
+		*preflck = tmp->reflck;
+		vfio_device_put(device);
+		return 1;
+	}
+
+	vfio_device_put(device);
+	return 0;
+}
+
+int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+{
+	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
+
+	mutex_lock(&reflck_lock);
+
+	if (pci_is_root_bus(vdev->pdev->bus) ||
+	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
+					  vdev, slot) <= 0)
+		vdev->reflck = vfio_pci_reflck_alloc();
+
+	mutex_unlock(&reflck_lock);
+
+	return PTR_ERR_OR_ZERO(vdev->reflck);
+}
+
+static void vfio_pci_reflck_release(struct kref *kref)
+{
+	struct vfio_pci_reflck *reflck = container_of(kref,
+						      struct vfio_pci_reflck,
+						      kref);
+
+	kfree(reflck);
+	mutex_unlock(&reflck_lock);
+}
+
+void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+{
+	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
+}
+
+struct vfio_devices {
+	struct vfio_device **devices;
+	struct vfio_pci_device *vdev;
+	int cur_index;
+	int max_index;
+};
+
+static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
+{
+	struct vfio_devices *devs = data;
+	struct vfio_device *device;
+	struct vfio_pci_device *tmp;
+
+	if (devs->cur_index == devs->max_index)
+		return -ENOSPC;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return -EINVAL;
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	tmp = vfio_device_data(device);
+
+	/* Fault if the device is not unused */
+	if (tmp->refcnt) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	devs->devices[devs->cur_index++] = device;
+	return 0;
+}
+
+/*
+ * If a bus or slot reset is available for the provided device and:
+ *  - All of the devices affected by that bus or slot reset are unused
+ *    (!refcnt)
+ *  - At least one of the affected devices is marked dirty via
+ *    needs_reset (such as by lack of FLR support)
+ * Then attempt to perform that bus or slot reset.  Callers are required
+ * to hold vdev->reflck->lock, protecting the bus/slot reset group from
+ * concurrent opens.  A vfio_device reference is acquired for each device
+ * to prevent unbinds during the reset operation.
+ *
+ * NB: vfio-core considers a group to be viable even if some devices are
+ * bound to drivers like pci-stub or pcieport.  Here we require all devices
+ * to be bound to vfio_pci since that's the only way we can be sure they
+ * stay put.
+ */
+static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
+{
+	struct vfio_devices devs = { .vdev = vdev, .cur_index = 0 };
+	int i = 0, ret = -EINVAL;
+	bool slot = false;
+	struct vfio_pci_device *tmp;
+
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return;
+
+	if (vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
+					  &i, slot) || !i)
+		return;
+
+	devs.max_index = i;
+	devs.devices = kcalloc(i, sizeof(struct vfio_device *), GFP_KERNEL);
+	if (!devs.devices)
+		return;
+
+	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
+					  vfio_pci_get_unused_devs,
+					  &devs, slot))
+		goto put_devs;
+
+	/* Does at least one need a reset? */
+	for (i = 0; i < devs.cur_index; i++) {
+		tmp = vfio_device_data(devs.devices[i]);
+		if (tmp->needs_reset) {
+			ret = pci_reset_bus(vdev->pdev);
+			break;
+		}
+	}
+
+put_devs:
+	for (i = 0; i < devs.cur_index; i++) {
+		tmp = vfio_device_data(devs.devices[i]);
+
+		/*
+		 * If reset was successful, affected devices no longer need
+		 * a reset and we should return all the collateral devices
+		 * to low power.  If not successful, we either didn't reset
+		 * the bus or timed out waiting for it, so let's not touch
+		 * the power state.
+		 */
+		if (!ret) {
+			tmp->needs_reset = false;
+
+			if (tmp != vdev && !tmp->disable_idle_d3)
+				vfio_pci_set_power_state(tmp, PCI_D3hot);
+		}
+
+		vfio_device_put(devs.devices[i]);
+	}
+
+	kfree(devs.devices);
+}
+
+static void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+	vfio_pci_uninit_perm_bits();
+}
+
+void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
+{
+	char *p, *id;
+	int rc;
+
+	/* no ids passed actually */
+	if (ids[0] == '\0')
+		return;
+
+	/* add ids specified in the module parameter */
+	p = ids;
+	while ((id = strsep(&p, ","))) {
+		unsigned int vendor, device, subvendor = PCI_ANY_ID,
+			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
+		int fields;
+
+		if (!strlen(id))
+			continue;
+
+		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
+				&vendor, &device, &subvendor, &subdevice,
+				&class, &class_mask);
+
+		if (fields < 2) {
+			pr_warn("invalid id string \"%s\"\n", id);
+			continue;
+		}
+
+		rc = pci_add_dynid(driver, vendor, device,
+				   subvendor, subdevice, class, class_mask, 0);
+		if (rc)
+			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask, rc);
+		else
+			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask);
+	}
+}
+
+static int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permision data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	vfio_pci_fill_ids(ids, &vfio_pci_driver);
+
+	return 0;
+
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.4

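vfio_pci_fill_ids() above parses the ids module parameter, which has the form "vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]" with comma-separated entries. The following is a minimal userspace sketch of the same parsing rules. ANY_ID, struct pci_id_entry, parse_pci_id() and count_valid_ids() are hypothetical names. The kernel's strsep() and pci_add_dynid() calls are replaced here by plain string handling and a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's PCI_ANY_ID wildcard (~0). */
#define ANY_ID 0xffffffffu

/* Hypothetical container for one parsed id entry. */
struct pci_id_entry {
	unsigned int vendor, device, subvendor, subdevice, class, class_mask;
};

/*
 * Parse one "vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]"
 * entry. Optional fields keep the defaults vfio_pci_fill_ids() uses.
 * Entries with fewer than two fields are rejected.
 */
static bool parse_pci_id(const char *id, struct pci_id_entry *e)
{
	e->subvendor = ANY_ID;
	e->subdevice = ANY_ID;
	e->class = 0;
	e->class_mask = 0;

	int fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
			    &e->vendor, &e->device, &e->subvendor,
			    &e->subdevice, &e->class, &e->class_mask);
	return fields >= 2;
}

/* Walk a comma-separated list; empty entries are skipped, as in the kernel. */
static int count_valid_ids(const char *ids)
{
	char buf[1024];		/* same size as the ids[] module parameter */
	int valid = 0;

	assert(strlen(ids) < sizeof(buf));
	strcpy(buf, ids);

	for (char *p = buf; p != NULL; ) {
		char *comma = strchr(p, ',');
		struct pci_id_entry e;

		if (comma)
			*comma = '\0';
		if (*p != '\0' && parse_pci_id(p, &e))
			valid++;
		p = comma ? comma + 1 : NULL;
	}
	return valid;
}
```

For example, count_valid_ids("8086:1572,,10de:1db4,zz") accepts two entries: the empty entry is skipped and "zz" yields no hex fields.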

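vfio_pci_request() above relays a device release request to userspace. Its notification policy can be isolated into a pure function. The sketch below assumes that the caller passes an increasing count on each retry while it waits for the device to be released. struct request_action and request_policy() are hypothetical names, and the eventfd and logging calls are reduced to flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what vfio_pci_request() does on one call. */
struct request_action {
	bool signal_user;	/* eventfd_signal(vdev->req_trigger, 1) */
	bool log_notice;	/* rate-limited "Relaying device request" */
	bool log_warning;	/* "No device request channel registered" */
};

static struct request_action request_policy(bool have_trigger,
					    unsigned int count)
{
	struct request_action a = { false, false, false };

	if (have_trigger) {
		/* Always signal, but only log every tenth request. */
		a.signal_user = true;
		a.log_notice = (count % 10) == 0;
	} else if (count == 0) {
		/* No channel to the user: warn once, on the first request. */
		a.log_warning = true;
	}
	return a;
}
```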

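The comment above vfio_pci_try_bus_reset() states when a bus or slot reset is attempted. Every affected device must be bound to vfio-pci and unused, and at least one must be marked needs_reset. A minimal sketch of that decision follows. struct dev_state and should_bus_reset() are hypothetical. The sketch ignores the reset_slot/reset_bus probing and the reference counting the kernel does around it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device view of what vfio_pci_try_bus_reset() checks. */
struct dev_state {
	bool same_driver;	/* has a vfio_device bound to the same driver */
	unsigned int refcnt;	/* open count; must be zero */
	bool needs_reset;	/* dirty, e.g. no FLR support */
};

/*
 * True when a reset would be attempted: all affected devices are bound to
 * the same driver and unused, and at least one of them is dirty.
 */
static bool should_bus_reset(const struct dev_state *devs, size_t n)
{
	bool any_dirty = false;

	if (n == 0)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (!devs[i].same_driver || devs[i].refcnt)
			return false;	/* the -EBUSY/-EINVAL path */
		any_dirty |= devs[i].needs_reset;
	}
	return any_dirty;
}
```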
* [PATCH v4 06/12] vfio_pci: shrink vfio_pci_common.c
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (4 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 05/12] vfio_pci: duplicate vfio_pci.c Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-07 12:01 ` [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c Liu Yi L
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch removes the vfio-pci module-specific code from vfio_pci_common.c
so that vfio_pci_common.c becomes a common source file.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_common.c | 235 -------------------------------------
 1 file changed, 235 deletions(-)

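This patch drops vfio_pci_open() and vfio_pci_release() from vfio_pci_common.c. They remain in vfio_pci.c because they take a reference on THIS_MODULE and read the module parameters. The device lifetime pattern they implement is a plain open count: the first opener calls vfio_pci_enable() and the last closer calls vfio_pci_disable(). The sketch below models this pattern. struct fake_vdev, fake_enable(), fake_disable(), sketch_open() and sketch_release() are hypothetical names. The kernel's reflck->lock mutex and the module reference are omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields of vfio_pci_device used here. */
struct fake_vdev {
	unsigned int refcnt;
	unsigned int enable_calls, disable_calls;
	bool fail_enable;	/* simulate vfio_pci_enable() failing */
};

static int fake_enable(struct fake_vdev *v)
{
	v->enable_calls++;
	return v->fail_enable ? -1 : 0;
}

static void fake_disable(struct fake_vdev *v)
{
	v->disable_calls++;
}

/* Like vfio_pci_open(): only the first opener enables the device. */
static int sketch_open(struct fake_vdev *v)
{
	if (v->refcnt == 0) {
		int ret = fake_enable(v);

		if (ret)
			return ret;	/* refcnt untouched on failure */
	}
	v->refcnt++;
	return 0;
}

/* Like vfio_pci_release(): only the last closer disables the device. */
static void sketch_release(struct fake_vdev *v)
{
	assert(v->refcnt > 0);
	if (--v->refcnt == 0)
		fake_disable(v);
}
```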
diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index 103e493..b0894dfc 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -30,30 +30,6 @@
 
 #include "vfio_pci_private.h"
 
-#define DRIVER_VERSION  "0.2"
-#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
-#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
-
-static char ids[1024] __initdata;
-module_param_string(ids, ids, sizeof(ids), 0);
-MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
-
-static bool nointxmask;
-module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(nointxmask,
-		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
-
-#ifdef CONFIG_VFIO_PCI_VGA
-static bool disable_vga;
-module_param(disable_vga, bool, S_IRUGO);
-MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-pci");
-#endif
-
-static bool disable_idle_d3;
-module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(disable_idle_d3,
-		 "Disable using the PCI D3 low power state for idle, unused devices");
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -459,49 +435,6 @@ void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
 	vdev->disable_idle_d3 = disable_idle_d3;
 }
 
-static void vfio_pci_release(void *device_data)
-{
-	struct vfio_pci_device *vdev = device_data;
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!(--vdev->refcnt)) {
-		vfio_spapr_pci_eeh_release(vdev->pdev);
-		vfio_pci_disable(vdev);
-	}
-
-	mutex_unlock(&vdev->reflck->lock);
-
-	module_put(THIS_MODULE);
-}
-
-static int vfio_pci_open(void *device_data)
-{
-	struct vfio_pci_device *vdev = device_data;
-	int ret = 0;
-
-	if (!try_module_get(THIS_MODULE))
-		return -ENODEV;
-
-	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!vdev->refcnt) {
-		ret = vfio_pci_enable(vdev);
-		if (ret)
-			goto error;
-
-		vfio_spapr_pci_eeh_open(vdev->pdev);
-	}
-	vdev->refcnt++;
-error:
-	mutex_unlock(&vdev->reflck->lock);
-	if (ret)
-		module_put(THIS_MODULE);
-	return ret;
-}
-
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 {
 	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
@@ -1273,129 +1206,6 @@ void vfio_pci_request(void *device_data, unsigned int count)
 	mutex_unlock(&vdev->igate);
 }
 
-static const struct vfio_device_ops vfio_pci_ops = {
-	.name		= "vfio-pci",
-	.open		= vfio_pci_open,
-	.release	= vfio_pci_release,
-	.ioctl		= vfio_pci_ioctl,
-	.read		= vfio_pci_read,
-	.write		= vfio_pci_write,
-	.mmap		= vfio_pci_mmap,
-	.request	= vfio_pci_request,
-};
-
-static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
-{
-	struct vfio_pci_device *vdev;
-	struct iommu_group *group;
-	int ret;
-
-	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
-		return -EINVAL;
-
-	/*
-	 * Prevent binding to PFs with VFs enabled, this too easily allows
-	 * userspace instance with VFs and PFs from the same device, which
-	 * cannot work.  Disabling SR-IOV here would initiate removing the
-	 * VFs, which would unbind the driver, which is prone to blocking
-	 * if that VF is also in use by vfio-pci.  Just reject these PFs
-	 * and let the user sort it out.
-	 */
-	if (pci_num_vf(pdev)) {
-		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
-		return -EBUSY;
-	}
-
-	group = vfio_iommu_group_get(&pdev->dev);
-	if (!group)
-		return -EINVAL;
-
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev) {
-		vfio_iommu_group_put(group, &pdev->dev);
-		return -ENOMEM;
-	}
-
-	vdev->pdev = pdev;
-	vdev->irq_type = VFIO_PCI_NUM_IRQS;
-	mutex_init(&vdev->igate);
-	spin_lock_init(&vdev->irqlock);
-	mutex_init(&vdev->ioeventfds_lock);
-	INIT_LIST_HEAD(&vdev->ioeventfds_list);
-	vdev->nointxmask = nointxmask;
-#ifdef CONFIG_VFIO_PCI_VGA
-	vdev->disable_vga = disable_vga;
-#endif
-	vdev->disable_idle_d3 = disable_idle_d3;
-
-	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
-	if (ret) {
-		vfio_iommu_group_put(group, &pdev->dev);
-		kfree(vdev);
-		return ret;
-	}
-
-	ret = vfio_pci_reflck_attach(vdev);
-	if (ret) {
-		vfio_del_group_dev(&pdev->dev);
-		vfio_iommu_group_put(group, &pdev->dev);
-		kfree(vdev);
-		return ret;
-	}
-
-	if (vfio_pci_is_vga(pdev)) {
-		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
-		vga_set_legacy_decoding(pdev,
-					vfio_pci_set_vga_decode(vdev, false));
-	}
-
-	vfio_pci_probe_power_state(vdev);
-
-	if (!vdev->disable_idle_d3) {
-		/*
-		 * pci-core sets the device power state to an unknown value at
-		 * bootup and after being removed from a driver.  The only
-		 * transition it allows from this unknown state is to D0, which
-		 * typically happens when a driver calls pci_enable_device().
-		 * We're not ready to enable the device yet, but we do want to
-		 * be able to get to D3.  Therefore first do a D0 transition
-		 * before going to D3.
-		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-	}
-
-	return ret;
-}
-
-static void vfio_pci_remove(struct pci_dev *pdev)
-{
-	struct vfio_pci_device *vdev;
-
-	vdev = vfio_del_group_dev(&pdev->dev);
-	if (!vdev)
-		return;
-
-	vfio_pci_reflck_put(vdev->reflck);
-
-	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
-	kfree(vdev->region);
-	mutex_destroy(&vdev->ioeventfds_lock);
-
-	if (!vdev->disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
-
-	kfree(vdev->pm_save);
-	kfree(vdev);
-
-	if (vfio_pci_is_vga(pdev)) {
-		vga_client_register(pdev, NULL, NULL, NULL);
-		vga_set_legacy_decoding(pdev,
-				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
-				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
-	}
-}
-
 static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 						  pci_channel_state_t state)
 {
@@ -1428,14 +1238,6 @@ static const struct pci_error_handlers vfio_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
 };
 
-static struct pci_driver vfio_pci_driver = {
-	.name		= "vfio-pci",
-	.id_table	= NULL, /* only dynamic ids */
-	.probe		= vfio_pci_probe,
-	.remove		= vfio_pci_remove,
-	.err_handler	= &vfio_err_handlers,
-};
-
 static DEFINE_MUTEX(reflck_lock);
 
 static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
@@ -1629,12 +1431,6 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	kfree(devs.devices);
 }
 
-static void __exit vfio_pci_cleanup(void)
-{
-	pci_unregister_driver(&vfio_pci_driver);
-	vfio_pci_uninit_perm_bits();
-}
-
 void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
@@ -1675,34 +1471,3 @@ void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 				class, class_mask);
 	}
 }
-
-static int __init vfio_pci_init(void)
-{
-	int ret;
-
-	/* Allocate shared config space permision data used by all devices */
-	ret = vfio_pci_init_perm_bits();
-	if (ret)
-		return ret;
-
-	/* Register and scan for devices */
-	ret = pci_register_driver(&vfio_pci_driver);
-	if (ret)
-		goto out_driver;
-
-	vfio_pci_fill_ids(ids, &vfio_pci_driver);
-
-	return 0;
-
-out_driver:
-	vfio_pci_uninit_perm_bits();
-	return ret;
-}
-
-module_init(vfio_pci_init);
-module_exit(vfio_pci_cleanup);
-
-MODULE_VERSION(DRIVER_VERSION);
-MODULE_LICENSE("GPL v2");
-MODULE_AUTHOR(DRIVER_AUTHOR);
-MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.4

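The vfio_pci_reflck object, which stays in vfio_pci_common.c together with reflck_lock, is shared by all vfio-pci devices in the same slot or bus reset domain. vfio_pci_reflck_attach() reuses a sibling's reflck when one exists and otherwise allocates a new one. vfio_pci_reflck_put() uses kref_put_mutex(), which performs the final decrement with reflck_lock held; the release callback then frees the object and drops the lock. Because attach looks up siblings and takes references under the same lock, it cannot pick up an object that is being freed. The single-threaded sketch below shows only the counting. struct reflck, reflck_alloc(), reflck_attach() and reflck_put() are hypothetical names, and the lock is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of struct vfio_pci_reflck: one per reset domain. */
struct reflck {
	int kref;	/* kref_init() starts the count at 1 */
};

static struct reflck *reflck_alloc(void)
{
	struct reflck *r = malloc(sizeof(*r));

	if (r)
		r->kref = 1;
	return r;
}

/* Reuse a sibling's reflck if there is one, else allocate a new one. */
static struct reflck *reflck_attach(struct reflck *sibling)
{
	if (sibling) {
		sibling->kref++;	/* vfio_pci_reflck_get() */
		return sibling;
	}
	return reflck_alloc();
}

/* Returns 1 if this call dropped the last reference and freed the object. */
static int reflck_put(struct reflck *r)
{
	if (--r->kref == 0) {
		free(r);
		return 1;
	}
	return 0;
}
```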

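The probe and remove paths quoted in this thread move devices between D0 and D3hot through vfio_pci_set_power_state(), which this series keeps in vfio_pci_common.c. For devices that perform a soft reset on the D3->D0 transition (needs_pm_restore), the function saves config state before entering D3hot and restores it on return to D0. The save step is skipped if the device did not actually reach D3, for example because of a quirk. The following sketch reduces that logic to a decision function. enum pstate, enum pm_action and decide_pm_action() are hypothetical, and the sketch assumes pci_set_power_state() succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical power states, ordered like pci_power_t (D0 < ... < D3hot). */
enum pstate { D0, D1, D2, D3HOT };

enum pm_action { PM_NONE, PM_SAVE, PM_RESTORE };

/*
 * 'cur' is the state before the transition, 'target' the requested state,
 * and 'reached' the state the device actually entered.
 */
static enum pm_action decide_pm_action(bool needs_pm_restore, enum pstate cur,
				       enum pstate target, enum pstate reached)
{
	if (!needs_pm_restore)
		return PM_NONE;

	/* Entering D3: stash state only if D3 was really reached. */
	if (cur < D3HOT && target >= D3HOT)
		return reached >= D3HOT ? PM_SAVE : PM_NONE;

	/* Leaving D3 for D0: reload the stashed state. */
	if (cur >= D3HOT && target <= D0)
		return PM_RESTORE;

	return PM_NONE;
}
```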

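vfio_pci_set_vga_decode(), which patch 07 removes from vfio_pci.c because it now lives in vfio_pci_common.c, reports the legacy VGA ranges as not decoded only in narrow circumstances. The device must not be the only VGA device, VFIO VGA support must be disabled, the device must not sit on a root bus, and no other VGA device may sit on a bus in the range from the device's bus number to pci_bus_max_busnr(). The sketch below captures that rule. The RSRC_* values are local copies of the VGA_RSRC_* bits from include/linux/vgaarb.h. struct vga_dev and vga_decodes() are hypothetical, and `others` is assumed to exclude the device itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local copies of the VGA_RSRC_* bits from include/linux/vgaarb.h. */
#define RSRC_LEGACY_IO	0x01
#define RSRC_LEGACY_MEM	0x02
#define RSRC_NORMAL_IO	0x04
#define RSRC_NORMAL_MEM	0x08
#define RSRC_ALL (RSRC_LEGACY_IO | RSRC_LEGACY_MEM | \
		  RSRC_NORMAL_IO | RSRC_NORMAL_MEM)

/* Hypothetical minimal view of a VGA-class PCI device. */
struct vga_dev {
	int domain;
	unsigned int busnr;
	bool on_root_bus;
};

static unsigned int vga_decodes(const struct vga_dev *self,
				unsigned int max_busnr, bool single_vga,
				bool vga_disabled,
				const struct vga_dev *others, size_t n)
{
	if (single_vga || !vga_disabled || self->on_root_bus)
		return RSRC_ALL;

	for (size_t i = 0; i < n; i++) {
		const struct vga_dev *o = &others[i];

		if (o->domain != self->domain || o->on_root_bus)
			continue;
		/* Another VGA device behind the same bridge: keep legacy. */
		if (o->busnr >= self->busnr && o->busnr <= max_busnr)
			return RSRC_ALL;
	}
	return RSRC_NORMAL_IO | RSRC_NORMAL_MEM;
}
```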
* [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (5 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 06/12] vfio_pci: shrink vfio_pci_common.c Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-08 11:24   ` kbuild test robot
  2020-01-09 22:48   ` Alex Williamson
  2020-01-07 12:01 ` [PATCH v4 08/12] vfio_pci: duplicate vfio_pci_private.h to include/linux Liu Yi L
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch removes the common code from vfio_pci.c and leaves only the
module-specific code; the new vfio_pci.c relies on the common functions
implemented in vfio_pci_common.c.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Makefile           |    3 +-
 drivers/vfio/pci/vfio_pci.c         | 1442 -----------------------------------
 drivers/vfio/pci/vfio_pci_common.c  |    2 +-
 drivers/vfio/pci/vfio_pci_private.h |    2 +
 4 files changed, 5 insertions(+), 1444 deletions(-)

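Among the functions this patch removes from vfio_pci.c (they remain in vfio_pci_common.c) are vfio_pci_get_irq_count() and vfio_pci_enable(). Both decode fields of the MSI and MSI-X capabilities. The MSI Multiple Message Capable field, bits 3:1, holds log2 of the supported vector count. The MSI-X table size field holds N - 1. The MSI-X Table register packs a BAR indicator (BIR) into its low three bits and a QWORD-aligned offset into the rest, and each table entry is 16 bytes. The sketch below uses local copies of the pci_regs.h masks; msi_vectors(), msix_vectors(), struct msix_table and msix_table_location() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of values from include/uapi/linux/pci_regs.h. */
#define MSI_FLAGS_QMASK		0x000e		/* Multiple Message Capable */
#define MSIX_FLAGS_QSIZE	0x07ff		/* table size - 1 */
#define MSIX_TABLE_BIR		0x00000007u
#define MSIX_TABLE_OFFSET	0xfffffff8u

/* MSI: bits 3:1 encode log2 of the vector count. */
static unsigned int msi_vectors(uint16_t flags)
{
	return 1u << ((flags & MSI_FLAGS_QMASK) >> 1);
}

/* MSI-X: the table size field holds N - 1. */
static unsigned int msix_vectors(uint16_t flags)
{
	return (flags & MSIX_FLAGS_QSIZE) + 1;
}

/* Where the MSI-X table lives, as computed in vfio_pci_enable(). */
struct msix_table {
	unsigned int bar;
	uint32_t offset;
	uint32_t size;		/* 16 bytes per entry */
};

static struct msix_table msix_table_location(uint16_t flags, uint32_t table)
{
	struct msix_table t = {
		.bar = table & MSIX_TABLE_BIR,
		.offset = table & MSIX_TABLE_OFFSET,
		.size = msix_vectors(flags) * 16,
	};
	return t;
}
```

For instance, a Table register value of 0x2003 with a table-size field of 7 places an 8-entry, 128-byte table at offset 0x2000 of BAR 3.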
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f027f8a..d94317a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
+		vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
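vfio_pci_probe_mmaps(), removed from vfio_pci.c in the hunk that follows, decides per BAR whether userspace may mmap it. BARs of at least a page are mmap-able as-is. Page-aligned BARs smaller than a page are mmap-able only after reserving the rest of that page with a dummy resource, so that a hot-added device's BAR cannot be assigned into it. Unaligned sub-page BARs are not mmap-able. The sketch below assumes 4 KiB pages and omits the CONFIG_VFIO_PCI_MMAP check and the request_resource() failure path. enum bar_mmap and classify_bar() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

enum bar_mmap {
	BAR_NO_MMAP,		/* not MMIO, zero-sized, or unaligned sub-page */
	BAR_MMAP,		/* at least one page: mmap as-is */
	BAR_MMAP_RESERVE_TAIL,	/* aligned sub-page: reserve rest of page */
};

/* Classify one BAR; for BAR_MMAP_RESERVE_TAIL, report the dummy range. */
static enum bar_mmap classify_bar(bool is_mem, uint64_t start, uint64_t size,
				  uint64_t *rsv_start, uint64_t *rsv_end)
{
	if (!is_mem || size == 0)
		return BAR_NO_MMAP;

	if (size >= SKETCH_PAGE_SIZE)
		return BAR_MMAP;

	if ((start & (SKETCH_PAGE_SIZE - 1)) == 0) {
		*rsv_start = start + size;		/* res->end + 1 */
		*rsv_end = start + SKETCH_PAGE_SIZE - 1;
		return BAR_MMAP_RESERVE_TAIL;
	}
	return BAR_NO_MMAP;
}
```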
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 103e493..7e24da2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,411 +54,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-/*
- * Our VGA arbiter participation is limited since we don't know anything
- * about the device itself.  However, if the device is the only VGA device
- * downstream of a bridge and VFIO VGA support is disabled, then we can
- * safely return legacy VGA IO and memory as not decoded since the user
- * has no way to get to it and routing can be disabled externally at the
- * bridge.
- */
-unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
-{
-	struct vfio_pci_device *vdev = opaque;
-	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
-	unsigned char max_busnr;
-	unsigned int decodes;
-
-	if (single_vga || !vfio_vga_disabled(vdev) ||
-		pci_is_root_bus(pdev->bus))
-		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
-		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
-
-	max_busnr = pci_bus_max_busnr(pdev->bus);
-	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
-
-	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
-		if (tmp == pdev ||
-		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
-		    pci_is_root_bus(tmp->bus))
-			continue;
-
-		if (tmp->bus->number >= pdev->bus->number &&
-		    tmp->bus->number <= max_busnr) {
-			pci_dev_put(tmp);
-			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
-			break;
-		}
-	}
-
-	return decodes;
-}
-
-static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
-{
-	struct resource *res;
-	int i;
-	struct vfio_pci_dummy_resource *dummy_res;
-
-	INIT_LIST_HEAD(&vdev->dummy_resources_list);
-
-	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
-		int bar = i + PCI_STD_RESOURCES;
-
-		res = &vdev->pdev->resource[bar];
-
-		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
-			goto no_mmap;
-
-		if (!(res->flags & IORESOURCE_MEM))
-			goto no_mmap;
-
-		/*
-		 * The PCI core shouldn't set up a resource with a
-		 * type but zero size. But there may be bugs that
-		 * cause us to do that.
-		 */
-		if (!resource_size(res))
-			goto no_mmap;
-
-		if (resource_size(res) >= PAGE_SIZE) {
-			vdev->bar_mmap_supported[bar] = true;
-			continue;
-		}
-
-		if (!(res->start & ~PAGE_MASK)) {
-			/*
-			 * Add a dummy resource to reserve the remainder
-			 * of the exclusive page in case that hot-add
-			 * device's bar is assigned into it.
-			 */
-			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
-			if (dummy_res == NULL)
-				goto no_mmap;
-
-			dummy_res->resource.name = "vfio sub-page reserved";
-			dummy_res->resource.start = res->end + 1;
-			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
-			dummy_res->resource.flags = res->flags;
-			if (request_resource(res->parent,
-						&dummy_res->resource)) {
-				kfree(dummy_res);
-				goto no_mmap;
-			}
-			dummy_res->index = bar;
-			list_add(&dummy_res->res_next,
-					&vdev->dummy_resources_list);
-			vdev->bar_mmap_supported[bar] = true;
-			continue;
-		}
-		/*
-		 * Here we don't handle the case when the BAR is not page
-		 * aligned because we can't expect the BAR will be
-		 * assigned into the same location in a page in guest
-		 * when we passthrough the BAR. And it's hard to access
-		 * this BAR in userspace because we have no way to get
-		 * the BAR's location in a page.
-		 */
-no_mmap:
-		vdev->bar_mmap_supported[bar] = false;
-	}
-}
-
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
-
-/*
- * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
- * _and_ the ability detect when the device is asserting INTx via PCI_STATUS.
- * If a device implements the former but not the latter we would typically
- * expect broken_intx_masking be set and require an exclusive interrupt.
- * However since we do have control of the device's ability to assert INTx,
- * we can instead pretend that the device does not implement INTx, virtualizing
- * the pin register to report zero and maintaining DisINTx set on the host.
- */
-static bool vfio_pci_nointx(struct pci_dev *pdev)
-{
-	switch (pdev->vendor) {
-	case PCI_VENDOR_ID_INTEL:
-		switch (pdev->device) {
-		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
-		case 0x1572:
-		case 0x1574:
-		case 0x1580 ... 0x1581:
-		case 0x1583 ... 0x158b:
-		case 0x37d0 ... 0x37d2:
-			return true;
-		default:
-			return false;
-		}
-	}
-
-	return false;
-}
-
-void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	u16 pmcsr;
-
-	if (!pdev->pm_cap)
-		return;
-
-	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
-
-	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
-}
-
-/*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0.  Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
- */
-int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	bool needs_restore = false, needs_save = false;
-	int ret;
-
-	if (vdev->needs_pm_restore) {
-		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
-			pci_save_state(pdev);
-			needs_save = true;
-		}
-
-		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
-			needs_restore = true;
-	}
-
-	ret = pci_set_power_state(pdev, state);
-
-	if (!ret) {
-		/* D3 might be unsupported via quirk, skip unless in D3 */
-		if (needs_save && pdev->current_state >= PCI_D3hot) {
-			vdev->pm_save = pci_store_saved_state(pdev);
-		} else if (needs_restore) {
-			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
-			pci_restore_state(pdev);
-		}
-	}
-
-	return ret;
-}
-
-int vfio_pci_enable(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	int ret;
-	u16 cmd;
-	u8 msix_pos;
-
-	vfio_pci_set_power_state(vdev, PCI_D0);
-
-	/* Don't allow our initial saved state to include busmaster */
-	pci_clear_master(pdev);
-
-	ret = pci_enable_device(pdev);
-	if (ret)
-		return ret;
-
-	/* If reset fails because of the device lock, fail this path entirely */
-	ret = pci_try_reset_function(pdev);
-	if (ret == -EAGAIN) {
-		pci_disable_device(pdev);
-		return ret;
-	}
-
-	vdev->reset_works = !ret;
-	pci_save_state(pdev);
-	vdev->pci_saved_state = pci_store_saved_state(pdev);
-	if (!vdev->pci_saved_state)
-		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
-
-	if (likely(!vdev->nointxmask)) {
-		if (vfio_pci_nointx(pdev)) {
-			pci_info(pdev, "Masking broken INTx support\n");
-			vdev->nointx = true;
-			pci_intx(pdev, 0);
-		} else
-			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
-	}
-
-	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
-	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
-		cmd &= ~PCI_COMMAND_INTX_DISABLE;
-		pci_write_config_word(pdev, PCI_COMMAND, cmd);
-	}
-
-	ret = vfio_config_init(vdev);
-	if (ret) {
-		kfree(vdev->pci_saved_state);
-		vdev->pci_saved_state = NULL;
-		pci_disable_device(pdev);
-		return ret;
-	}
-
-	msix_pos = pdev->msix_cap;
-	if (msix_pos) {
-		u16 flags;
-		u32 table;
-
-		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
-		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
-
-		vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
-		vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
-		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
-	} else
-		vdev->msix_bar = 0xFF;
-
-	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
-		vdev->has_vga = true;
-
-
-	if (vfio_pci_is_vga(pdev) &&
-	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
-		ret = vfio_pci_igd_init(vdev);
-		if (ret) {
-			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
-			goto disable_exit;
-		}
-	}
-
-	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
-			goto disable_exit;
-		}
-	}
-
-	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_ibm_npu2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
-			goto disable_exit;
-		}
-	}
-
-	vfio_pci_probe_mmaps(vdev);
-
-	return 0;
-
-disable_exit:
-	vfio_pci_disable(vdev);
-	return ret;
-}
-
-void vfio_pci_disable(struct vfio_pci_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	struct vfio_pci_dummy_resource *dummy_res, *tmp;
-	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
-	int i, bar;
-
-	/* Stop the device from further DMA */
-	pci_clear_master(pdev);
-
-	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
-				VFIO_IRQ_SET_ACTION_TRIGGER,
-				vdev->irq_type, 0, 0, NULL);
-
-	/* Device closed, don't need mutex here */
-	list_for_each_entry_safe(ioeventfd, ioeventfd_tmp,
-				 &vdev->ioeventfds_list, next) {
-		vfio_virqfd_disable(&ioeventfd->virqfd);
-		list_del(&ioeventfd->next);
-		kfree(ioeventfd);
-	}
-	vdev->ioeventfds_nr = 0;
-
-	vdev->virq_disabled = false;
-
-	for (i = 0; i < vdev->num_regions; i++)
-		vdev->region[i].ops->release(vdev, &vdev->region[i]);
-
-	vdev->num_regions = 0;
-	kfree(vdev->region);
-	vdev->region = NULL; /* don't krealloc a freed pointer */
-
-	vfio_config_free(vdev);
-
-	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
-		bar = i + PCI_STD_RESOURCES;
-		if (!vdev->barmap[bar])
-			continue;
-		pci_iounmap(pdev, vdev->barmap[bar]);
-		pci_release_selected_regions(pdev, 1 << bar);
-		vdev->barmap[bar] = NULL;
-	}
-
-	list_for_each_entry_safe(dummy_res, tmp,
-				 &vdev->dummy_resources_list, res_next) {
-		list_del(&dummy_res->res_next);
-		release_resource(&dummy_res->resource);
-		kfree(dummy_res);
-	}
-
-	vdev->needs_reset = true;
-
-	/*
-	 * If we have saved state, restore it.  If we can reset the device,
-	 * even better.  Resetting with current state seems better than
-	 * nothing, but saving and restoring current state without reset
-	 * is just busy work.
-	 */
-	if (pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)) {
-		pci_info(pdev, "%s: Couldn't reload saved state\n", __func__);
-
-		if (!vdev->reset_works)
-			goto out;
-
-		pci_save_state(pdev);
-	}
-
-	/*
-	 * Disable INTx and MSI, presumably to avoid spurious interrupts
-	 * during reset.  Stolen from pci_reset_function()
-	 */
-	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
-
-	/*
-	 * Try to get the locks ourselves to prevent a deadlock. The
-	 * success of this is dependent on being able to lock the device,
-	 * which is not always possible.
-	 * We can not use the "try" reset interface here, which will
-	 * overwrite the previously restored configuration information.
-	 */
-	if (vdev->reset_works && pci_cfg_access_trylock(pdev)) {
-		if (device_trylock(&pdev->dev)) {
-			if (!__pci_reset_function_locked(pdev))
-				vdev->needs_reset = false;
-			device_unlock(&pdev->dev);
-		}
-		pci_cfg_access_unlock(pdev);
-	}
-
-	pci_restore_state(pdev);
-out:
-	pci_disable_device(pdev);
-
-	vfio_pci_try_bus_reset(vdev);
-
-	if (!vdev->disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-}
-
-void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
-			bool nointxmask, bool disable_idle_d3)
-{
-	vdev->nointxmask = nointxmask;
-	vdev->disable_idle_d3 = disable_idle_d3;
-}
-
 static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -502,777 +97,6 @@ static int vfio_pci_open(void *device_data)
 	return ret;
 }
 
-static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
-{
-	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
-		u8 pin;
-
-		if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
-		    vdev->nointx || vdev->pdev->is_virtfn)
-			return 0;
-
-		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
-
-		return pin ? 1 : 0;
-	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
-		u8 pos;
-		u16 flags;
-
-		pos = vdev->pdev->msi_cap;
-		if (pos) {
-			pci_read_config_word(vdev->pdev,
-					     pos + PCI_MSI_FLAGS, &flags);
-			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
-		}
-	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
-		u8 pos;
-		u16 flags;
-
-		pos = vdev->pdev->msix_cap;
-		if (pos) {
-			pci_read_config_word(vdev->pdev,
-					     pos + PCI_MSIX_FLAGS, &flags);
-
-			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
-		}
-	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
-			return 1;
-	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
-		return 1;
-	}
-
-	return 0;
-}
-
-static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
-{
-	(*(int *)data)++;
-	return 0;
-}
-
-struct vfio_pci_fill_info {
-	int max;
-	int cur;
-	struct vfio_pci_dependent_device *devices;
-};
-
-static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_fill_info *fill = data;
-	struct iommu_group *iommu_group;
-
-	if (fill->cur == fill->max)
-		return -EAGAIN; /* Something changed, try again */
-
-	iommu_group = iommu_group_get(&pdev->dev);
-	if (!iommu_group)
-		return -EPERM; /* Cannot reset non-isolated devices */
-
-	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
-	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
-	fill->devices[fill->cur].bus = pdev->bus->number;
-	fill->devices[fill->cur].devfn = pdev->devfn;
-	fill->cur++;
-	iommu_group_put(iommu_group);
-	return 0;
-}
-
-struct vfio_pci_group_entry {
-	struct vfio_group *group;
-	int id;
-};
-
-struct vfio_pci_group_info {
-	int count;
-	struct vfio_pci_group_entry *groups;
-};
-
-static int vfio_pci_validate_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_group_info *info = data;
-	struct iommu_group *group;
-	int id, i;
-
-	group = iommu_group_get(&pdev->dev);
-	if (!group)
-		return -EPERM;
-
-	id = iommu_group_id(group);
-
-	for (i = 0; i < info->count; i++)
-		if (info->groups[i].id == id)
-			break;
-
-	iommu_group_put(group);
-
-	return (i == info->count) ? -EINVAL : 0;
-}
-
-static bool vfio_pci_dev_below_slot(struct pci_dev *pdev, struct pci_slot *slot)
-{
-	for (; pdev; pdev = pdev->bus->self)
-		if (pdev->bus == slot->bus)
-			return (pdev->slot == slot);
-	return false;
-}
-
-struct vfio_pci_walk_info {
-	int (*fn)(struct pci_dev *, void *data);
-	void *data;
-	struct pci_dev *pdev;
-	bool slot;
-	int ret;
-};
-
-static int vfio_pci_walk_wrapper(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_walk_info *walk = data;
-
-	if (!walk->slot || vfio_pci_dev_below_slot(pdev, walk->pdev->slot))
-		walk->ret = walk->fn(pdev, walk->data);
-
-	return walk->ret;
-}
-
-static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
-					 int (*fn)(struct pci_dev *,
-						   void *data), void *data,
-					 bool slot)
-{
-	struct vfio_pci_walk_info walk = {
-		.fn = fn, .data = data, .pdev = pdev, .slot = slot, .ret = 0,
-	};
-
-	pci_walk_bus(pdev->bus, vfio_pci_walk_wrapper, &walk);
-
-	return walk.ret;
-}
-
-static int msix_mmappable_cap(struct vfio_pci_device *vdev,
-			      struct vfio_info_cap *caps)
-{
-	struct vfio_info_cap_header header = {
-		.id = VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
-		.version = 1
-	};
-
-	return vfio_info_add_capability(caps, &header, sizeof(header));
-}
-
-int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
-				 unsigned int type, unsigned int subtype,
-				 const struct vfio_pci_regops *ops,
-				 size_t size, u32 flags, void *data)
-{
-	struct vfio_pci_region *region;
-
-	region = krealloc(vdev->region,
-			  (vdev->num_regions + 1) * sizeof(*region),
-			  GFP_KERNEL);
-	if (!region)
-		return -ENOMEM;
-
-	vdev->region = region;
-	vdev->region[vdev->num_regions].type = type;
-	vdev->region[vdev->num_regions].subtype = subtype;
-	vdev->region[vdev->num_regions].ops = ops;
-	vdev->region[vdev->num_regions].size = size;
-	vdev->region[vdev->num_regions].flags = flags;
-	vdev->region[vdev->num_regions].data = data;
-
-	vdev->num_regions++;
-
-	return 0;
-}
-
-long vfio_pci_ioctl(void *device_data,
-		   unsigned int cmd, unsigned long arg)
-{
-	struct vfio_pci_device *vdev = device_data;
-	unsigned long minsz;
-
-	if (cmd == VFIO_DEVICE_GET_INFO) {
-		struct vfio_device_info info;
-
-		minsz = offsetofend(struct vfio_device_info, num_irqs);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		info.flags = VFIO_DEVICE_FLAGS_PCI;
-
-		if (vdev->reset_works)
-			info.flags |= VFIO_DEVICE_FLAGS_RESET;
-
-		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
-		info.num_irqs = VFIO_PCI_NUM_IRQS;
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
-		struct pci_dev *pdev = vdev->pdev;
-		struct vfio_region_info info;
-		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
-		int i, ret;
-
-		minsz = offsetofend(struct vfio_region_info, offset);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		switch (info.index) {
-		case VFIO_PCI_CONFIG_REGION_INDEX:
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = pdev->cfg_size;
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-			break;
-		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = pci_resource_len(pdev, info.index);
-			if (!info.size) {
-				info.flags = 0;
-				break;
-			}
-
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-			if (vdev->bar_mmap_supported[info.index]) {
-				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
-				if (info.index == vdev->msix_bar) {
-					ret = msix_mmappable_cap(vdev, &caps);
-					if (ret)
-						return ret;
-				}
-			}
-
-			break;
-		case VFIO_PCI_ROM_REGION_INDEX:
-		{
-			void __iomem *io;
-			size_t size;
-			u16 orig_cmd;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.flags = 0;
-
-			/* Report the BAR size, not the ROM size */
-			info.size = pci_resource_len(pdev, info.index);
-			if (!info.size) {
-				/* Shadow ROMs appear as PCI option ROMs */
-				if (pdev->resource[PCI_ROM_RESOURCE].flags &
-							IORESOURCE_ROM_SHADOW)
-					info.size = 0x20000;
-				else
-					break;
-			}
-
-			/*
-			 * Is it really there?  Enable memory decode for
-			 * implicit access in pci_map_rom().
-			 */
-			pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd);
-			pci_write_config_word(pdev, PCI_COMMAND,
-					      orig_cmd | PCI_COMMAND_MEMORY);
-
-			io = pci_map_rom(pdev, &size);
-			if (io) {
-				info.flags = VFIO_REGION_INFO_FLAG_READ;
-				pci_unmap_rom(pdev, io);
-			} else {
-				info.size = 0;
-			}
-
-			pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
-			break;
-		}
-		case VFIO_PCI_VGA_REGION_INDEX:
-			if (!vdev->has_vga)
-				return -EINVAL;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = 0xc0000;
-			info.flags = VFIO_REGION_INFO_FLAG_READ |
-				     VFIO_REGION_INFO_FLAG_WRITE;
-
-			break;
-		default:
-		{
-			struct vfio_region_info_cap_type cap_type = {
-					.header.id = VFIO_REGION_INFO_CAP_TYPE,
-					.header.version = 1 };
-
-			if (info.index >=
-			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
-				return -EINVAL;
-			info.index = array_index_nospec(info.index,
-							VFIO_PCI_NUM_REGIONS +
-							vdev->num_regions);
-
-			i = info.index - VFIO_PCI_NUM_REGIONS;
-
-			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
-			info.size = vdev->region[i].size;
-			info.flags = vdev->region[i].flags;
-
-			cap_type.type = vdev->region[i].type;
-			cap_type.subtype = vdev->region[i].subtype;
-
-			ret = vfio_info_add_capability(&caps, &cap_type.header,
-						       sizeof(cap_type));
-			if (ret)
-				return ret;
-
-			if (vdev->region[i].ops->add_capability) {
-				ret = vdev->region[i].ops->add_capability(vdev,
-						&vdev->region[i], &caps);
-				if (ret)
-					return ret;
-			}
-		}
-		}
-
-		if (caps.size) {
-			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
-			if (info.argsz < sizeof(info) + caps.size) {
-				info.argsz = sizeof(info) + caps.size;
-				info.cap_offset = 0;
-			} else {
-				vfio_info_cap_shift(&caps, sizeof(info));
-				if (copy_to_user((void __user *)arg +
-						  sizeof(info), caps.buf,
-						  caps.size)) {
-					kfree(caps.buf);
-					return -EFAULT;
-				}
-				info.cap_offset = sizeof(info);
-			}
-
-			kfree(caps.buf);
-		}
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
-		struct vfio_irq_info info;
-
-		minsz = offsetofend(struct vfio_irq_info, count);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
-			return -EINVAL;
-
-		switch (info.index) {
-		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
-		case VFIO_PCI_REQ_IRQ_INDEX:
-			break;
-		case VFIO_PCI_ERR_IRQ_INDEX:
-			if (pci_is_pcie(vdev->pdev))
-				break;
-		/* fall through */
-		default:
-			return -EINVAL;
-		}
-
-		info.flags = VFIO_IRQ_INFO_EVENTFD;
-
-		info.count = vfio_pci_get_irq_count(vdev, info.index);
-
-		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
-			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
-				       VFIO_IRQ_INFO_AUTOMASKED);
-		else
-			info.flags |= VFIO_IRQ_INFO_NORESIZE;
-
-		return copy_to_user((void __user *)arg, &info, minsz) ?
-			-EFAULT : 0;
-
-	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
-		struct vfio_irq_set hdr;
-		u8 *data = NULL;
-		int max, ret = 0;
-		size_t data_size = 0;
-
-		minsz = offsetofend(struct vfio_irq_set, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		max = vfio_pci_get_irq_count(vdev, hdr.index);
-
-		ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
-						 VFIO_PCI_NUM_IRQS, &data_size);
-		if (ret)
-			return ret;
-
-		if (data_size) {
-			data = memdup_user((void __user *)(arg + minsz),
-					    data_size);
-			if (IS_ERR(data))
-				return PTR_ERR(data);
-		}
-
-		mutex_lock(&vdev->igate);
-
-		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
-					      hdr.start, hdr.count, data);
-
-		mutex_unlock(&vdev->igate);
-		kfree(data);
-
-		return ret;
-
-	} else if (cmd == VFIO_DEVICE_RESET) {
-		return vdev->reset_works ?
-			pci_try_reset_function(vdev->pdev) : -EINVAL;
-
-	} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
-		struct vfio_pci_hot_reset_info hdr;
-		struct vfio_pci_fill_info fill = { 0 };
-		struct vfio_pci_dependent_device *devices = NULL;
-		bool slot = false;
-		int ret = 0;
-
-		minsz = offsetofend(struct vfio_pci_hot_reset_info, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (hdr.argsz < minsz)
-			return -EINVAL;
-
-		hdr.flags = 0;
-
-		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
-			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
-			return -ENODEV;
-
-		/* How many devices are affected? */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_count_devs,
-						    &fill.max, slot);
-		if (ret)
-			return ret;
-
-		WARN_ON(!fill.max); /* Should always be at least one */
-
-		/*
-		 * If there's enough space, fill it now, otherwise return
-		 * -ENOSPC and the number of devices affected.
-		 */
-		if (hdr.argsz < sizeof(hdr) + (fill.max * sizeof(*devices))) {
-			ret = -ENOSPC;
-			hdr.count = fill.max;
-			goto reset_info_exit;
-		}
-
-		devices = kcalloc(fill.max, sizeof(*devices), GFP_KERNEL);
-		if (!devices)
-			return -ENOMEM;
-
-		fill.devices = devices;
-
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_fill_devs,
-						    &fill, slot);
-
-		/*
-		 * If a device was removed between counting and filling,
-		 * we may come up short of fill.max.  If a device was
-		 * added, we'll have a return of -EAGAIN above.
-		 */
-		if (!ret)
-			hdr.count = fill.cur;
-
-reset_info_exit:
-		if (copy_to_user((void __user *)arg, &hdr, minsz))
-			ret = -EFAULT;
-
-		if (!ret) {
-			if (copy_to_user((void __user *)(arg + minsz), devices,
-					 hdr.count * sizeof(*devices)))
-				ret = -EFAULT;
-		}
-
-		kfree(devices);
-		return ret;
-
-	} else if (cmd == VFIO_DEVICE_PCI_HOT_RESET) {
-		struct vfio_pci_hot_reset hdr;
-		int32_t *group_fds;
-		struct vfio_pci_group_entry *groups;
-		struct vfio_pci_group_info info;
-		bool slot = false;
-		int i, count = 0, ret = 0;
-
-		minsz = offsetofend(struct vfio_pci_hot_reset, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (hdr.argsz < minsz || hdr.flags)
-			return -EINVAL;
-
-		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
-			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
-			return -ENODEV;
-
-		/*
-		 * We can't let userspace give us an arbitrarily large
-		 * buffer to copy, so verify how many we think there
-		 * could be.  Note groups can have multiple devices so
-		 * one group per device is the max.
-		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_count_devs,
-						    &count, slot);
-		if (ret)
-			return ret;
-
-		/* Somewhere between 1 and count is OK */
-		if (!hdr.count || hdr.count > count)
-			return -EINVAL;
-
-		group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
-		groups = kcalloc(hdr.count, sizeof(*groups), GFP_KERNEL);
-		if (!group_fds || !groups) {
-			kfree(group_fds);
-			kfree(groups);
-			return -ENOMEM;
-		}
-
-		if (copy_from_user(group_fds, (void __user *)(arg + minsz),
-				   hdr.count * sizeof(*group_fds))) {
-			kfree(group_fds);
-			kfree(groups);
-			return -EFAULT;
-		}
-
-		/*
-		 * For each group_fd, get the group through the vfio external
-		 * user interface and store the group and iommu ID.  This
-		 * ensures the group is held across the reset.
-		 */
-		for (i = 0; i < hdr.count; i++) {
-			struct vfio_group *group;
-			struct fd f = fdget(group_fds[i]);
-			if (!f.file) {
-				ret = -EBADF;
-				break;
-			}
-
-			group = vfio_group_get_external_user(f.file);
-			fdput(f);
-			if (IS_ERR(group)) {
-				ret = PTR_ERR(group);
-				break;
-			}
-
-			groups[i].group = group;
-			groups[i].id = vfio_external_user_iommu_id(group);
-		}
-
-		kfree(group_fds);
-
-		/* release reference to groups on error */
-		if (ret)
-			goto hot_reset_release;
-
-		info.count = hdr.count;
-		info.groups = groups;
-
-		/*
-		 * Test whether all the affected devices are contained
-		 * by the set of groups provided by the user.
-		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-						    vfio_pci_validate_devs,
-						    &info, slot);
-		if (!ret)
-			/* User has access, do the reset */
-			ret = pci_reset_bus(vdev->pdev);
-
-hot_reset_release:
-		for (i--; i >= 0; i--)
-			vfio_group_put_external_user(groups[i].group);
-
-		kfree(groups);
-		return ret;
-	} else if (cmd == VFIO_DEVICE_IOEVENTFD) {
-		struct vfio_device_ioeventfd ioeventfd;
-		int count;
-
-		minsz = offsetofend(struct vfio_device_ioeventfd, fd);
-
-		if (copy_from_user(&ioeventfd, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (ioeventfd.argsz < minsz)
-			return -EINVAL;
-
-		if (ioeventfd.flags & ~VFIO_DEVICE_IOEVENTFD_SIZE_MASK)
-			return -EINVAL;
-
-		count = ioeventfd.flags & VFIO_DEVICE_IOEVENTFD_SIZE_MASK;
-
-		if (hweight8(count) != 1 || ioeventfd.fd < -1)
-			return -EINVAL;
-
-		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
-					  ioeventfd.data, count, ioeventfd.fd);
-	}
-
-	return -ENOTTY;
-}
-
-static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
-			   size_t count, loff_t *ppos, bool iswrite)
-{
-	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
-	struct vfio_pci_device *vdev = device_data;
-
-	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
-		return -EINVAL;
-
-	switch (index) {
-	case VFIO_PCI_CONFIG_REGION_INDEX:
-		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
-
-	case VFIO_PCI_ROM_REGION_INDEX:
-		if (iswrite)
-			return -EINVAL;
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
-
-	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
-
-	case VFIO_PCI_VGA_REGION_INDEX:
-		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
-	default:
-		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
-						   count, ppos, iswrite);
-	}
-
-	return -EINVAL;
-}
-
-ssize_t vfio_pci_read(void *device_data, char __user *buf,
-			     size_t count, loff_t *ppos)
-{
-	if (!count)
-		return 0;
-
-	return vfio_pci_rw(device_data, buf, count, ppos, false);
-}
-
-ssize_t vfio_pci_write(void *device_data, const char __user *buf,
-			      size_t count, loff_t *ppos)
-{
-	if (!count)
-		return 0;
-
-	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
-}
-
-int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
-{
-	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
-	unsigned int index;
-	u64 phys_len, req_len, pgoff, req_start;
-	int ret;
-
-	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
-
-	if (vma->vm_end < vma->vm_start)
-		return -EINVAL;
-	if ((vma->vm_flags & VM_SHARED) == 0)
-		return -EINVAL;
-	if (index >= VFIO_PCI_NUM_REGIONS) {
-		int regnum = index - VFIO_PCI_NUM_REGIONS;
-		struct vfio_pci_region *region = vdev->region + regnum;
-
-		if (region && region->ops && region->ops->mmap &&
-		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
-			return region->ops->mmap(vdev, region, vma);
-		return -EINVAL;
-	}
-	if (index >= VFIO_PCI_ROM_REGION_INDEX)
-		return -EINVAL;
-	if (!vdev->bar_mmap_supported[index])
-		return -EINVAL;
-
-	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
-	req_len = vma->vm_end - vma->vm_start;
-	pgoff = vma->vm_pgoff &
-		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
-	req_start = pgoff << PAGE_SHIFT;
-
-	if (req_start + req_len > phys_len)
-		return -EINVAL;
-
-	/*
-	 * Even though we don't make use of the barmap for the mmap,
-	 * we need to request the region and the barmap tracks that.
-	 */
-	if (!vdev->barmap[index]) {
-		ret = pci_request_selected_regions(pdev,
-						   1 << index, "vfio-pci");
-		if (ret)
-			return ret;
-
-		vdev->barmap[index] = pci_iomap(pdev, index, 0);
-		if (!vdev->barmap[index]) {
-			pci_release_selected_regions(pdev, 1 << index);
-			return -ENOMEM;
-		}
-	}
-
-	vma->vm_private_data = vdev;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
-
-	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       req_len, vma->vm_page_prot);
-}
-
-void vfio_pci_request(void *device_data, unsigned int count)
-{
-	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
-
-	mutex_lock(&vdev->igate);
-
-	if (vdev->req_trigger) {
-		if (!(count % 10))
-			pci_notice_ratelimited(pdev,
-				"Relaying device request to user (#%u)\n",
-				count);
-		eventfd_signal(vdev->req_trigger, 1);
-	} else if (count == 0) {
-		pci_warn(pdev,
-			"No device request channel registered, blocked until released by user\n");
-	}
-
-	mutex_unlock(&vdev->igate);
-}
-
 static const struct vfio_device_ops vfio_pci_ops = {
 	.name		= "vfio-pci",
 	.open		= vfio_pci_open,
@@ -1396,38 +220,6 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	}
 }
 
-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
-{
-	struct vfio_pci_device *vdev;
-	struct vfio_device *device;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (device == NULL)
-		return PCI_ERS_RESULT_DISCONNECT;
-
-	vdev = vfio_device_data(device);
-	if (vdev == NULL) {
-		vfio_device_put(device);
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	mutex_lock(&vdev->igate);
-
-	if (vdev->err_trigger)
-		eventfd_signal(vdev->err_trigger, 1);
-
-	mutex_unlock(&vdev->igate);
-
-	vfio_device_put(device);
-
-	return PCI_ERS_RESULT_CAN_RECOVER;
-}
-
-static const struct pci_error_handlers vfio_err_handlers = {
-	.error_detected = vfio_pci_aer_err_detected,
-};
-
 static struct pci_driver vfio_pci_driver = {
 	.name		= "vfio-pci",
 	.id_table	= NULL, /* only dynamic ids */
@@ -1436,246 +228,12 @@ static struct pci_driver vfio_pci_driver = {
 	.err_handler	= &vfio_err_handlers,
 };
 
-static DEFINE_MUTEX(reflck_lock);
-
-static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
-{
-	struct vfio_pci_reflck *reflck;
-
-	reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
-	if (!reflck)
-		return ERR_PTR(-ENOMEM);
-
-	kref_init(&reflck->kref);
-	mutex_init(&reflck->lock);
-
-	return reflck;
-}
-
-static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
-{
-	kref_get(&reflck->kref);
-}
-
-static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
-{
-	struct vfio_pci_device *vdev = data;
-	struct vfio_pci_reflck **preflck = &vdev->reflck;
-	struct vfio_device *device;
-	struct vfio_pci_device *tmp;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (!device)
-		return 0;
-
-	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
-		vfio_device_put(device);
-		return 0;
-	}
-
-	tmp = vfio_device_data(device);
-
-	if (tmp->reflck) {
-		vfio_pci_reflck_get(tmp->reflck);
-		*preflck = tmp->reflck;
-		vfio_device_put(device);
-		return 1;
-	}
-
-	vfio_device_put(device);
-	return 0;
-}
-
-int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
-{
-	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
-
-	mutex_lock(&reflck_lock);
-
-	if (pci_is_root_bus(vdev->pdev->bus) ||
-	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
-					  vdev, slot) <= 0)
-		vdev->reflck = vfio_pci_reflck_alloc();
-
-	mutex_unlock(&reflck_lock);
-
-	return PTR_ERR_OR_ZERO(vdev->reflck);
-}
-
-static void vfio_pci_reflck_release(struct kref *kref)
-{
-	struct vfio_pci_reflck *reflck = container_of(kref,
-						      struct vfio_pci_reflck,
-						      kref);
-
-	kfree(reflck);
-	mutex_unlock(&reflck_lock);
-}
-
-void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
-{
-	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
-}
-
-struct vfio_devices {
-	struct vfio_device **devices;
-	struct vfio_pci_device *vdev;
-	int cur_index;
-	int max_index;
-};
-
-static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
-{
-	struct vfio_devices *devs = data;
-	struct vfio_device *device;
-	struct vfio_pci_device *tmp;
-
-	if (devs->cur_index == devs->max_index)
-		return -ENOSPC;
-
-	device = vfio_device_get_from_dev(&pdev->dev);
-	if (!device)
-		return -EINVAL;
-
-	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
-		vfio_device_put(device);
-		return -EBUSY;
-	}
-
-	tmp = vfio_device_data(device);
-
-	/* Fault if the device is not unused */
-	if (tmp->refcnt) {
-		vfio_device_put(device);
-		return -EBUSY;
-	}
-
-	devs->devices[devs->cur_index++] = device;
-	return 0;
-}
-
-/*
- * If a bus or slot reset is available for the provided device and:
- *  - All of the devices affected by that bus or slot reset are unused
- *    (!refcnt)
- *  - At least one of the affected devices is marked dirty via
- *    needs_reset (such as by lack of FLR support)
- * Then attempt to perform that bus or slot reset.  Callers are required
- * to hold vdev->reflck->lock, protecting the bus/slot reset group from
- * concurrent opens.  A vfio_device reference is acquired for each device
- * to prevent unbinds during the reset operation.
- *
- * NB: vfio-core considers a group to be viable even if some devices are
- * bound to drivers like pci-stub or pcieport.  Here we require all devices
- * to be bound to vfio_pci since that's the only way we can be sure they
- * stay put.
- */
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
-{
-	struct vfio_devices devs = { .vdev = vdev, .cur_index = 0 };
-	int i = 0, ret = -EINVAL;
-	bool slot = false;
-	struct vfio_pci_device *tmp;
-
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return;
-
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
-					  &i, slot) || !i)
-		return;
-
-	devs.max_index = i;
-	devs.devices = kcalloc(i, sizeof(struct vfio_device *), GFP_KERNEL);
-	if (!devs.devices)
-		return;
-
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
-					  vfio_pci_get_unused_devs,
-					  &devs, slot))
-		goto put_devs;
-
-	/* Does at least one need a reset? */
-	for (i = 0; i < devs.cur_index; i++) {
-		tmp = vfio_device_data(devs.devices[i]);
-		if (tmp->needs_reset) {
-			ret = pci_reset_bus(vdev->pdev);
-			break;
-		}
-	}
-
-put_devs:
-	for (i = 0; i < devs.cur_index; i++) {
-		tmp = vfio_device_data(devs.devices[i]);
-
-		/*
-		 * If reset was successful, affected devices no longer need
-		 * a reset and we should return all the collateral devices
-		 * to low power.  If not successful, we either didn't reset
-		 * the bus or timed out waiting for it, so let's not touch
-		 * the power state.
-		 */
-		if (!ret) {
-			tmp->needs_reset = false;
-
-			if (tmp != vdev && !tmp->disable_idle_d3)
-				vfio_pci_set_power_state(tmp, PCI_D3hot);
-		}
-
-		vfio_device_put(devs.devices[i]);
-	}
-
-	kfree(devs.devices);
-}
-
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
 	vfio_pci_uninit_perm_bits();
 }
 
-void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
-{
-	char *p, *id;
-	int rc;
-
-	/* no ids passed actually */
-	if (ids[0] == '\0')
-		return;
-
-	/* add ids specified in the module parameter */
-	p = ids;
-	while ((id = strsep(&p, ","))) {
-		unsigned int vendor, device, subvendor = PCI_ANY_ID,
-			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
-		int fields;
-
-		if (!strlen(id))
-			continue;
-
-		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
-				&vendor, &device, &subvendor, &subdevice,
-				&class, &class_mask);
-
-		if (fields < 2) {
-			pr_warn("invalid id string \"%s\"\n", id);
-			continue;
-		}
-
-		rc = pci_add_dynid(driver, vendor, device,
-				   subvendor, subdevice, class, class_mask, 0);
-		if (rc)
-			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask, rc);
-		else
-			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask);
-	}
-}
-
 static int __init vfio_pci_init(void)
 {
 	int ret;
diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index b0894dfc..15d8b55 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -1234,7 +1234,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
-static const struct pci_error_handlers vfio_err_handlers = {
+const struct pci_error_handlers vfio_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
 };
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 194d487..499dd04 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -135,6 +135,8 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
+extern const struct pci_error_handlers vfio_err_handlers;
+
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-- 
2.7.4

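For readers following the code moved out of vfio_pci.c: the removed vfio_pci_get_irq_count() derives MSI and MSI-X vector counts from the capability's Message Control word. The userspace sketch below redoes only that arithmetic. MSI_FLAGS_QMASK and MSIX_FLAGS_QSIZE are local copies of PCI_MSI_FLAGS_QMASK and PCI_MSIX_FLAGS_QSIZE from include/uapi/linux/pci_regs.h. The helper names are hypothetical and do not exist in the kernel.

```c
#include <stdint.h>

/* Local copies of the masks in include/uapi/linux/pci_regs.h. */
#define MSI_FLAGS_QMASK   0x000e /* Multiple Message Capable, bits 3:1 */
#define MSIX_FLAGS_QSIZE  0x07ff /* Table Size, encoded as N - 1 */

/* Hypothetical helper: MSI advertises log2 of its vector count. */
int msi_vector_count(uint16_t flags)
{
	return 1 << ((flags & MSI_FLAGS_QMASK) >> 1);
}

/* Hypothetical helper: MSI-X encodes the table size minus one. */
int msix_vector_count(uint16_t flags)
{
	return (flags & MSIX_FLAGS_QSIZE) + 1;
}
```

For example, an MSI flags value of 0x0086 yields 8 vectors, and an MSI-X flags value of 0x8003 (enable bit set, table size field 3) yields 4.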

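The removed vfio_pci_fill_ids() accepts a comma-separated list of vendor:device[:subvendor[:subdevice[:class[:class_mask]]]] entries. It skips empty entries and rejects entries with fewer than two fields. The following is a minimal userspace sketch of the same parsing rules. The struct pci_id_entry type, ANY_ID and parse_ids() are hypothetical stand-ins; the kernel instead passes each entry to pci_add_dynid().

```c
#include <stdio.h>
#include <string.h>

#define ANY_ID (~0U) /* stand-in for PCI_ANY_ID */

/* Hypothetical container for one parsed id entry. */
struct pci_id_entry {
	unsigned int vendor, device, subvendor, subdevice, class, class_mask;
};

/*
 * Parse ids in place (commas are overwritten, like strsep() in the
 * kernel) and store up to max valid entries in out[].
 */
int parse_ids(char *ids, struct pci_id_entry *out, int max)
{
	char *p = ids;
	int n = 0;

	while (p && n < max) {
		char *id = p;
		char *comma = strchr(p, ',');
		struct pci_id_entry e = { .subvendor = ANY_ID,
					  .subdevice = ANY_ID };
		int fields;

		/* Split off the next token, as strsep(&p, ",") would. */
		if (comma) {
			*comma = '\0';
			p = comma + 1;
		} else {
			p = NULL;
		}

		if (!strlen(id))
			continue; /* empty entry, e.g. ",," */

		fields = sscanf(id, "%x:%x:%x:%x:%x:%x", &e.vendor, &e.device,
				&e.subvendor, &e.subdevice, &e.class,
				&e.class_mask);
		if (fields < 2) {
			fprintf(stderr, "invalid id string \"%s\"\n", id);
			continue;
		}
		out[n++] = e;
	}
	return n;
}
```

With "8086:10fb,,zz" only the first entry is kept. Its subvendor and subdevice stay wildcards, and its class and class_mask stay 0, matching the defaults in vfio_pci_fill_ids().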

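The removed vfio_pci_mmap() splits vma->vm_pgoff into a region index (upper bits) and a page offset within that region (lower bits). It then rejects mappings that extend past the page-aligned BAR length. The sketch below reproduces only that range check in userspace. It assumes 4 KiB pages, and bar_mmap_in_range() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_SHIFT       40 /* VFIO_PCI_OFFSET_SHIFT */
#define SKETCH_PAGE_SHIFT  12 /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE   (1ULL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_ALIGN(x) \
	(((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/*
 * Hypothetical helper mirroring the checks in vfio_pci_mmap():
 * decode the region index from vm_pgoff and verify that
 * [req_start, req_start + req_len) lies within the page-aligned BAR.
 */
bool bar_mmap_in_range(uint64_t vm_pgoff, uint64_t req_len,
		       uint64_t bar_len, unsigned int *index)
{
	uint64_t pgoff, req_start;

	*index = vm_pgoff >> (OFFSET_SHIFT - SKETCH_PAGE_SHIFT);
	pgoff = vm_pgoff & ((1ULL << (OFFSET_SHIFT - SKETCH_PAGE_SHIFT)) - 1);
	req_start = pgoff << SKETCH_PAGE_SHIFT;

	return req_start + req_len <= SKETCH_PAGE_ALIGN(bar_len);
}
```

Because of PAGE_ALIGN(), a 2 KiB BAR can still be mapped as one full page. This is also why the kernel only sets bar_mmap_supported for BARs it considers safe to expose at page granularity.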
* [PATCH v4 08/12] vfio_pci: duplicate vfio_pci_private.h to include/linux
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch copies drivers/vfio/pci/vfio_pci_private.h to
include/linux/vfio_pci_common.h in preparation for splitting
vfio_pci_private.h into a private header and a common header, so that
the common vfio_pci code can be shared outside drivers/vfio/pci/. The
file is copied as-is, with no code changes.
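One thing the shared header carries is the region offset encoding used by vfio-pci. The region index sits above bit 40 of the file offset, and the offset within the region sits below it. The sketch below copies those three macros verbatim and adds a hypothetical region_file_offset() helper to show the round trip. Note that this layout is a vfio-pci implementation detail. Userspace should rely on the offset reported by VFIO_DEVICE_GET_REGION_INFO rather than computing it.

```c
#include <stdint.h>

typedef uint64_t u64; /* stand-in for the kernel type */

/* Copied unchanged from the header. */
#define VFIO_PCI_OFFSET_SHIFT   40
#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

/* Hypothetical helper: combine a region index and an in-region offset. */
u64 region_file_offset(unsigned int index, u64 off_in_region)
{
	return VFIO_PCI_INDEX_TO_OFFSET(index) |
	       (off_in_region & VFIO_PCI_OFFSET_MASK);
}
```

For the config space region (VFIO_PCI_CONFIG_REGION_INDEX, 7 in the uapi enum), offset 0x10 encodes to 0x70000000010. VFIO_PCI_OFFSET_TO_INDEX() recovers 7 from it, and VFIO_PCI_OFFSET_MASK recovers 0x10.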

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 include/linux/vfio_pci_common.h | 228 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 228 insertions(+)
 create mode 100644 include/linux/vfio_pci_common.h

diff --git a/include/linux/vfio_pci_common.h b/include/linux/vfio_pci_common.h
new file mode 100644
index 0000000..499dd04
--- /dev/null
+++ b/include/linux/vfio_pci_common.h
@@ -0,0 +1,228 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/irqbypass.h>
+#include <linux/types.h>
+
+#ifndef VFIO_PCI_PRIVATE_H
+#define VFIO_PCI_PRIVATE_H
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+/* Special capability IDs predefined access */
+#define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
+#define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
+
+/* Cap maximum number of ioeventfds per device (arbitrary) */
+#define VFIO_PCI_IOEVENTFD_MAX		1000
+
+struct vfio_pci_ioeventfd {
+	struct list_head	next;
+	struct virqfd		*virqfd;
+	void __iomem		*addr;
+	uint64_t		data;
+	loff_t			pos;
+	int			bar;
+	int			count;
+};
+
+struct vfio_pci_irq_ctx {
+	struct eventfd_ctx	*trigger;
+	struct virqfd		*unmask;
+	struct virqfd		*mask;
+	char			*name;
+	bool			masked;
+	struct irq_bypass_producer	producer;
+};
+
+struct vfio_pci_device;
+struct vfio_pci_region;
+
+struct vfio_pci_regops {
+	size_t	(*rw)(struct vfio_pci_device *vdev, char __user *buf,
+		      size_t count, loff_t *ppos, bool iswrite);
+	void	(*release)(struct vfio_pci_device *vdev,
+			   struct vfio_pci_region *region);
+	int	(*mmap)(struct vfio_pci_device *vdev,
+			struct vfio_pci_region *region,
+			struct vm_area_struct *vma);
+	int	(*add_capability)(struct vfio_pci_device *vdev,
+				  struct vfio_pci_region *region,
+				  struct vfio_info_cap *caps);
+};
+
+struct vfio_pci_region {
+	u32				type;
+	u32				subtype;
+	const struct vfio_pci_regops	*ops;
+	void				*data;
+	size_t				size;
+	u32				flags;
+};
+
+struct vfio_pci_dummy_resource {
+	struct resource		resource;
+	int			index;
+	struct list_head	res_next;
+};
+
+struct vfio_pci_reflck {
+	struct kref		kref;
+	struct mutex		lock;
+};
+
+struct vfio_pci_device {
+	struct pci_dev		*pdev;
+	void __iomem		*barmap[PCI_STD_NUM_BARS];
+	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
+	u8			*pci_config_map;
+	u8			*vconfig;
+	struct perm_bits	*msi_perm;
+	spinlock_t		irqlock;
+	struct mutex		igate;
+	struct vfio_pci_irq_ctx	*ctx;
+	int			num_ctx;
+	int			irq_type;
+	int			num_regions;
+	struct vfio_pci_region	*region;
+	u8			msi_qmax;
+	u8			msix_bar;
+	u16			msix_size;
+	u32			msix_offset;
+	u32			rbar[7];
+	bool			pci_2_3;
+	bool			virq_disabled;
+	bool			reset_works;
+	bool			extended_caps;
+	bool			bardirty;
+	bool			has_vga;
+	bool			needs_reset;
+	bool			nointx;
+	bool			needs_pm_restore;
+	struct pci_saved_state	*pci_saved_state;
+	struct pci_saved_state	*pm_save;
+	struct vfio_pci_reflck	*reflck;
+	int			refcnt;
+	int			ioeventfds_nr;
+	struct eventfd_ctx	*err_trigger;
+	struct eventfd_ctx	*req_trigger;
+	struct list_head	dummy_resources_list;
+	struct mutex		ioeventfds_lock;
+	struct list_head	ioeventfds_list;
+	bool			nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	bool			disable_vga;
+#endif
+	bool			disable_idle_d3;
+};
+
+#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
+#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
+#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
+#define irq_is(vdev, type) (vdev->irq_type == type)
+
+extern const struct pci_error_handlers vfio_err_handlers;
+
+static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
+{
+	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
+}
+
+static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
+{
+#ifdef CONFIG_VFIO_PCI_VGA
+	return vdev->disable_vga;
+#else
+	return true;
+#endif
+}
+
+extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
+				bool nointxmask, bool disable_idle_d3);
+
+extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
+extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
+
+extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
+				   uint32_t flags, unsigned index,
+				   unsigned start, unsigned count, void *data);
+
+extern ssize_t vfio_pci_config_rw(struct vfio_pci_device *vdev,
+				  char __user *buf, size_t count,
+				  loff_t *ppos, bool iswrite);
+
+extern ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
+			       size_t count, loff_t *ppos, bool iswrite);
+
+extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
+			       size_t count, loff_t *ppos, bool iswrite);
+
+extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
+			       uint64_t data, int count, int fd);
+
+extern int vfio_pci_init_perm_bits(void);
+extern void vfio_pci_uninit_perm_bits(void);
+
+extern int vfio_config_init(struct vfio_pci_device *vdev);
+extern void vfio_config_free(struct vfio_pci_device *vdev);
+
+extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+					unsigned int type, unsigned int subtype,
+					const struct vfio_pci_regops *ops,
+					size_t size, u32 flags, void *data);
+
+extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
+				    pci_power_t state);
+extern unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga);
+extern int vfio_pci_enable(struct vfio_pci_device *vdev);
+extern void vfio_pci_disable(struct vfio_pci_device *vdev);
+extern long vfio_pci_ioctl(void *device_data,
+			unsigned int cmd, unsigned long arg);
+extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			size_t count, loff_t *ppos);
+extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
+extern void vfio_pci_request(void *device_data, unsigned int count);
+extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
+extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
+
+#ifdef CONFIG_VFIO_PCI_IGD
+extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
+#ifdef CONFIG_VFIO_PCI_NVLINK2
+extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
+extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+
+static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
+#endif /* VFIO_PCI_PRIVATE_H */
-- 
2.7.4



* [PATCH v4 09/12] vfio: split vfio_pci_private.h into two files
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (7 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 08/12] vfio_pci: duplicate vfio_pci_private.h to include/linux Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-09 22:48   ` Alex Williamson
  2020-01-07 12:01 ` [PATCH v4 10/12] vfio: build vfio_pci_common.c into a kernel module Liu Yi L
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

This patch splits vfio_pci_private.h into a private header under
drivers/vfio/pci/ and a common header under include/linux/. This
prepares for sharing the vfio_pci common code outside
drivers/vfio/pci/.

The common header is trimmed down from the vfio_pci_common.h copy
added by the previous patch, and the original vfio_pci_private.h is
trimmed accordingly.
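
After this split the VFIO_PCI_OFFSET_* macros remain in the private
drivers/vfio/pci/vfio_pci_private.h: vfio-pci encodes the region index
above bit 40 of the file offset and keeps the position within the
region in the low 40 bits. The sketch below is a userspace copy of
those macros (with 'off' parenthesized) plus two hypothetical helpers,
region_offset() and split_offset(), that are not part of this series.
Index 7 is used only as an example; the uapi linux/vfio.h assigns it
to VFIO_PCI_CONFIG_REGION_INDEX.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace copies of the private vfio-pci offset macros. The region
 * index lives above bit 40 of the file offset; the low 40 bits are the
 * position within that region.
 */
#define VFIO_PCI_OFFSET_SHIFT		40
#define VFIO_PCI_OFFSET_TO_INDEX(off)	((off) >> VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_INDEX_TO_OFFSET(index)	((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_OFFSET_MASK		(((uint64_t)1 << VFIO_PCI_OFFSET_SHIFT) - 1)

/* Hypothetical helper: file offset for byte 'pos' of region 'index'. */
static uint64_t region_offset(unsigned int index, uint64_t pos)
{
	return VFIO_PCI_INDEX_TO_OFFSET(index) | (pos & VFIO_PCI_OFFSET_MASK);
}

/* Hypothetical helper: split a file offset back into index and position. */
static void split_offset(uint64_t off, unsigned int *index, uint64_t *pos)
{
	*index = (unsigned int)VFIO_PCI_OFFSET_TO_INDEX(off);
	*pos = off & VFIO_PCI_OFFSET_MASK;
}
```

Because this encoding is a vfio-pci implementation detail, userspace
should use the region offset reported by VFIO_DEVICE_GET_REGION_INFO
instead of hardcoding it. That is consistent with keeping these macros
out of the common header.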

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_private.h | 133 +-----------------------------------
 include/linux/vfio_pci_common.h     |  86 ++---------------------
 2 files changed, 7 insertions(+), 212 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 499dd04..c4976a9 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -12,6 +12,7 @@
 #include <linux/pci.h>
 #include <linux/irqbypass.h>
 #include <linux/types.h>
+#include <linux/vfio_pci_common.h>
 
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
@@ -39,121 +40,12 @@ struct vfio_pci_ioeventfd {
 	int			count;
 };
 
-struct vfio_pci_irq_ctx {
-	struct eventfd_ctx	*trigger;
-	struct virqfd		*unmask;
-	struct virqfd		*mask;
-	char			*name;
-	bool			masked;
-	struct irq_bypass_producer	producer;
-};
-
-struct vfio_pci_device;
-struct vfio_pci_region;
-
-struct vfio_pci_regops {
-	size_t	(*rw)(struct vfio_pci_device *vdev, char __user *buf,
-		      size_t count, loff_t *ppos, bool iswrite);
-	void	(*release)(struct vfio_pci_device *vdev,
-			   struct vfio_pci_region *region);
-	int	(*mmap)(struct vfio_pci_device *vdev,
-			struct vfio_pci_region *region,
-			struct vm_area_struct *vma);
-	int	(*add_capability)(struct vfio_pci_device *vdev,
-				  struct vfio_pci_region *region,
-				  struct vfio_info_cap *caps);
-};
-
-struct vfio_pci_region {
-	u32				type;
-	u32				subtype;
-	const struct vfio_pci_regops	*ops;
-	void				*data;
-	size_t				size;
-	u32				flags;
-};
-
 struct vfio_pci_dummy_resource {
 	struct resource		resource;
 	int			index;
 	struct list_head	res_next;
 };
 
-struct vfio_pci_reflck {
-	struct kref		kref;
-	struct mutex		lock;
-};
-
-struct vfio_pci_device {
-	struct pci_dev		*pdev;
-	void __iomem		*barmap[PCI_STD_NUM_BARS];
-	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
-	u8			*pci_config_map;
-	u8			*vconfig;
-	struct perm_bits	*msi_perm;
-	spinlock_t		irqlock;
-	struct mutex		igate;
-	struct vfio_pci_irq_ctx	*ctx;
-	int			num_ctx;
-	int			irq_type;
-	int			num_regions;
-	struct vfio_pci_region	*region;
-	u8			msi_qmax;
-	u8			msix_bar;
-	u16			msix_size;
-	u32			msix_offset;
-	u32			rbar[7];
-	bool			pci_2_3;
-	bool			virq_disabled;
-	bool			reset_works;
-	bool			extended_caps;
-	bool			bardirty;
-	bool			has_vga;
-	bool			needs_reset;
-	bool			nointx;
-	bool			needs_pm_restore;
-	struct pci_saved_state	*pci_saved_state;
-	struct pci_saved_state	*pm_save;
-	struct vfio_pci_reflck	*reflck;
-	int			refcnt;
-	int			ioeventfds_nr;
-	struct eventfd_ctx	*err_trigger;
-	struct eventfd_ctx	*req_trigger;
-	struct list_head	dummy_resources_list;
-	struct mutex		ioeventfds_lock;
-	struct list_head	ioeventfds_list;
-	bool			nointxmask;
-#ifdef CONFIG_VFIO_PCI_VGA
-	bool			disable_vga;
-#endif
-	bool			disable_idle_d3;
-};
-
-#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
-#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
-#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
-#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
-#define irq_is(vdev, type) (vdev->irq_type == type)
-
-extern const struct pci_error_handlers vfio_err_handlers;
-
-static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
-{
-	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-}
-
-static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
-{
-#ifdef CONFIG_VFIO_PCI_VGA
-	return vdev->disable_vga;
-#else
-	return true;
-#endif
-}
-
-extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
-				bool nointxmask, bool disable_idle_d3);
-
 extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
 
@@ -180,29 +72,6 @@ extern void vfio_pci_uninit_perm_bits(void);
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
 
-extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
-					unsigned int type, unsigned int subtype,
-					const struct vfio_pci_regops *ops,
-					size_t size, u32 flags, void *data);
-
-extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
-				    pci_power_t state);
-extern unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga);
-extern int vfio_pci_enable(struct vfio_pci_device *vdev);
-extern void vfio_pci_disable(struct vfio_pci_device *vdev);
-extern long vfio_pci_ioctl(void *device_data,
-			unsigned int cmd, unsigned long arg);
-extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
-			size_t count, loff_t *ppos);
-extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
-			size_t count, loff_t *ppos);
-extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
-extern void vfio_pci_request(void *device_data, unsigned int count);
-extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
-extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
-extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
-extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
-
 #ifdef CONFIG_VFIO_PCI_IGD
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
 #else
diff --git a/include/linux/vfio_pci_common.h b/include/linux/vfio_pci_common.h
index 499dd04..862cd80 100644
--- a/include/linux/vfio_pci_common.h
+++ b/include/linux/vfio_pci_common.h
@@ -1,5 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 /*
+ * VFIO PCI API definition
+ *
+ * Derived from original vfio/pci/vfio_pci_private.h:
  * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
  *     Author: Alex Williamson <alex.williamson@redhat.com>
  *
@@ -13,31 +16,8 @@
 #include <linux/irqbypass.h>
 #include <linux/types.h>
 
-#ifndef VFIO_PCI_PRIVATE_H
-#define VFIO_PCI_PRIVATE_H
-
-#define VFIO_PCI_OFFSET_SHIFT   40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
-/* Special capability IDs predefined access */
-#define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
-#define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
-
-/* Cap maximum number of ioeventfds per device (arbitrary) */
-#define VFIO_PCI_IOEVENTFD_MAX		1000
-
-struct vfio_pci_ioeventfd {
-	struct list_head	next;
-	struct virqfd		*virqfd;
-	void __iomem		*addr;
-	uint64_t		data;
-	loff_t			pos;
-	int			bar;
-	int			count;
-};
+#ifndef VFIO_PCI_COMMON_H
+#define VFIO_PCI_COMMON_H
 
 struct vfio_pci_irq_ctx {
 	struct eventfd_ctx	*trigger;
@@ -73,12 +53,6 @@ struct vfio_pci_region {
 	u32				flags;
 };
 
-struct vfio_pci_dummy_resource {
-	struct resource		resource;
-	int			index;
-	struct list_head	res_next;
-};
-
 struct vfio_pci_reflck {
 	struct kref		kref;
 	struct mutex		lock;
@@ -154,32 +128,6 @@ static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
 extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
 				bool nointxmask, bool disable_idle_d3);
 
-extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
-extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
-
-extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
-				   uint32_t flags, unsigned index,
-				   unsigned start, unsigned count, void *data);
-
-extern ssize_t vfio_pci_config_rw(struct vfio_pci_device *vdev,
-				  char __user *buf, size_t count,
-				  loff_t *ppos, bool iswrite);
-
-extern ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
-			       size_t count, loff_t *ppos, bool iswrite);
-
-extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
-			       size_t count, loff_t *ppos, bool iswrite);
-
-extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
-			       uint64_t data, int count, int fd);
-
-extern int vfio_pci_init_perm_bits(void);
-extern void vfio_pci_uninit_perm_bits(void);
-
-extern int vfio_config_init(struct vfio_pci_device *vdev);
-extern void vfio_config_free(struct vfio_pci_device *vdev);
-
 extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 					unsigned int type, unsigned int subtype,
 					const struct vfio_pci_regops *ops,
@@ -203,26 +151,4 @@ extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
 extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
 extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
 
-#ifdef CONFIG_VFIO_PCI_IGD
-extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
-#else
-static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
-{
-	return -ENODEV;
-}
-#endif
-#ifdef CONFIG_VFIO_PCI_NVLINK2
-extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
-extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
-#else
-static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
-{
-	return -ENODEV;
-}
-
-static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
-{
-	return -ENODEV;
-}
-#endif
-#endif /* VFIO_PCI_PRIVATE_H */
+#endif /* VFIO_PCI_COMMON_H */
-- 
2.7.4



* [PATCH v4 10/12] vfio: build vfio_pci_common.c into a kernel module
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (8 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 09/12] vfio: split vfio_pci_private.h into two files Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-07 12:01 ` [PATCH v4 11/12] samples: add vfio-mdev-pci driver Liu Yi L
  2020-01-07 12:01 ` [PATCH v4 12/12] samples: refine " Liu Yi L
  11 siblings, 0 replies; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

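This patch builds vfio_pci_common.c into a standalone kernel module,
vfio-pci-common.ko, in preparation for sharing the vfio_pci common
code outside drivers/vfio/pci/. The shared config space permission
data is now allocated by this module's init rather than by vfio-pci.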
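
This patch also drops __init from vfio_pci_fill_ids() and exports it.
Once the function lives in a separate module, its __init text would be
freed after vfio-pci-common loads, before another module's init could
call it. The helper parses an ids string of comma-separated
"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]" entries.
The following is a minimal userspace sketch of that parsing. It assumes
the kernel helper's behaviour of skipping empty entries and entries
with fewer than two fields. parse_ids(), struct pci_id_entry and ANY_ID
are hypothetical names, and the real helper passes each entry to
pci_add_dynid() instead of storing it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ANY_ID (~0u)	/* stands in for the kernel's PCI_ANY_ID */

struct pci_id_entry {
	unsigned int vendor, device, subvendor, subdevice, class, class_mask;
};

/*
 * Parse a comma-separated ids string into at most 'max' entries and
 * return how many were accepted. Entries with fewer than the
 * vendor:device fields are skipped; strtok() also skips empty entries.
 */
static int parse_ids(const char *ids, struct pci_id_entry *out, int max)
{
	char buf[1024];	/* same size as the ids module parameter */
	char *tok;
	int n = 0;

	/* strtok() modifies its input, so work on a bounded copy. */
	snprintf(buf, sizeof(buf), "%s", ids);

	for (tok = strtok(buf, ","); tok && n < max; tok = strtok(NULL, ",")) {
		/* Unspecified fields default as in the kernel helper. */
		struct pci_id_entry e = { 0, 0, ANY_ID, ANY_ID, 0, 0 };
		int fields = sscanf(tok, "%x:%x:%x:%x:%x:%x",
				    &e.vendor, &e.device, &e.subvendor,
				    &e.subdevice, &e.class, &e.class_mask);

		if (fields < 2)
			continue;	/* the kernel prints a warning here */
		out[n++] = e;
	}
	return n;
}
```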

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Kconfig           |  9 ++++++--
 drivers/vfio/pci/Makefile          |  9 +++++---
 drivers/vfio/pci/vfio_pci.c        | 14 ++-----------
 drivers/vfio/pci/vfio_pci_common.c | 43 ++++++++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_common.h    |  2 +-
 5 files changed, 57 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index ac3c1dd..1a1fb3b 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,9 +1,14 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-	tristate "VFIO support for PCI devices"
+
+config VFIO_PCI_COMMON
 	depends on VFIO && PCI && EVENTFD
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
+	tristate
+
+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	select VFIO_PCI_COMMON
 	help
 	  Support for the PCI VFIO bus driver.  This is required to make
 	  use of PCI drivers using the VFIO framework.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d94317a..ad60cfd 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
+vfio-pci-common-y := vfio_pci_common.o vfio_pci_intrs.o \
 		vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-common-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-common-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
+vfio-pci-y := vfio_pci.o
+
+obj-$(CONFIG_VFIO_PCI_COMMON) += vfio-pci-common.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7e24da2..7047667 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -225,36 +225,26 @@ static struct pci_driver vfio_pci_driver = {
 	.id_table	= NULL, /* only dynamic ids */
 	.probe		= vfio_pci_probe,
 	.remove		= vfio_pci_remove,
-	.err_handler	= &vfio_err_handlers,
+	.err_handler	= &vfio_pci_err_handlers,
 };
 
 static void __exit vfio_pci_cleanup(void)
 {
 	pci_unregister_driver(&vfio_pci_driver);
-	vfio_pci_uninit_perm_bits();
 }
 
 static int __init vfio_pci_init(void)
 {
 	int ret;
 
-	/* Allocate shared config space permision data used by all devices */
-	ret = vfio_pci_init_perm_bits();
-	if (ret)
-		return ret;
-
 	/* Register and scan for devices */
 	ret = pci_register_driver(&vfio_pci_driver);
 	if (ret)
-		goto out_driver;
+		return ret;
 
 	vfio_pci_fill_ids(ids, &vfio_pci_driver);
 
 	return 0;
-
-out_driver:
-	vfio_pci_uninit_perm_bits();
-	return ret;
 }
 
 module_init(vfio_pci_init);
diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index 15d8b55..edda7e4 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -27,9 +27,14 @@
 #include <linux/vfio.h>
 #include <linux/vgaarb.h>
 #include <linux/nospec.h>
+#include <linux/vfio_pci_common.h>
 
 #include "vfio_pci_private.h"
 
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI Common"
+
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -69,6 +74,7 @@ unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 
 	return decodes;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_set_vga_decode);
 
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
@@ -183,6 +189,7 @@ void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
 
 	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_probe_power_state);
 
 /*
  * pci_set_power_state() wrapper handling devices which perform a soft reset on
@@ -221,6 +228,7 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_set_power_state);
 
 int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
@@ -328,6 +336,7 @@ int vfio_pci_enable(struct vfio_pci_device *vdev)
 	vfio_pci_disable(vdev);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_enable);
 
 void vfio_pci_disable(struct vfio_pci_device *vdev)
 {
@@ -427,6 +436,7 @@ void vfio_pci_disable(struct vfio_pci_device *vdev)
 	if (!vdev->disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_disable);
 
 void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
 			bool nointxmask, bool disable_idle_d3)
@@ -434,6 +444,7 @@ void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
 	vdev->nointxmask = nointxmask;
 	vdev->disable_idle_d3 = disable_idle_d3;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_refresh_config);
 
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 {
@@ -618,6 +629,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
 
 long vfio_pci_ioctl(void *device_data,
 		   unsigned int cmd, unsigned long arg)
@@ -1072,6 +1084,7 @@ long vfio_pci_ioctl(void *device_data,
 
 	return -ENOTTY;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_ioctl);
 
 static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
@@ -1113,6 +1126,7 @@ ssize_t vfio_pci_read(void *device_data, char __user *buf,
 
 	return vfio_pci_rw(device_data, buf, count, ppos, false);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_read);
 
 ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 			      size_t count, loff_t *ppos)
@@ -1122,6 +1136,7 @@ ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_write);
 
 int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
@@ -1184,6 +1199,7 @@ int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
 			       req_len, vma->vm_page_prot);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_mmap);
 
 void vfio_pci_request(void *device_data, unsigned int count)
 {
@@ -1205,6 +1221,7 @@ void vfio_pci_request(void *device_data, unsigned int count)
 
 	mutex_unlock(&vdev->igate);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_request);
 
 static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 						  pci_channel_state_t state)
@@ -1234,9 +1251,10 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
-const struct pci_error_handlers vfio_err_handlers = {
+const struct pci_error_handlers vfio_pci_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
 };
+EXPORT_SYMBOL_GPL(vfio_pci_err_handlers);
 
 static DEFINE_MUTEX(reflck_lock);
 
@@ -1303,6 +1321,7 @@ int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 
 	return PTR_ERR_OR_ZERO(vdev->reflck);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_reflck_attach);
 
 static void vfio_pci_reflck_release(struct kref *kref)
 {
@@ -1318,6 +1337,7 @@ void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 {
 	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_reflck_put);
 
 struct vfio_devices {
 	struct vfio_device **devices;
@@ -1431,7 +1451,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	kfree(devs.devices);
 }
 
-void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
+void vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 {
 	char *p, *id;
 	int rc;
@@ -1471,3 +1491,22 @@ void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
 				class, class_mask);
 	}
 }
+EXPORT_SYMBOL_GPL(vfio_pci_fill_ids);
+
+static int __init vfio_pci_common_init(void)
+{
+	/* Allocate shared config space permission data used by all devices */
+	return vfio_pci_init_perm_bits();
+}
+module_init(vfio_pci_common_init);
+
+static void __exit vfio_pci_common_exit(void)
+{
+	vfio_pci_uninit_perm_bits();
+}
+module_exit(vfio_pci_common_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio_pci_common.h b/include/linux/vfio_pci_common.h
index 862cd80..439666a 100644
--- a/include/linux/vfio_pci_common.h
+++ b/include/linux/vfio_pci_common.h
@@ -109,7 +109,7 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
-extern const struct pci_error_handlers vfio_err_handlers;
+extern const struct pci_error_handlers vfio_pci_err_handlers;
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
-- 
2.7.4



* [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (9 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 10/12] vfio: build vfio_pci_common.c into a kernel module Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  2020-01-09 22:48   ` Alex Williamson
  2020-01-15 12:30   ` Cornelia Huck
  2020-01-07 12:01 ` [PATCH v4 12/12] samples: refine " Liu Yi L
  11 siblings, 2 replies; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC (permalink / raw)
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu,
	Masahiro Yamada, Liu Yi L

This patch adds a sample driver named vfio-mdev-pci, which wraps a
PCI device as a mediated device. Once a PCI device is bound to the
vfio-mdev-pci driver, user space accesses to the device go through
the vfio mdev framework, and the device is managed like any other
mdev: e.g. the user must create a mdev before the device can be
exposed to user space.

This driver serves as a sample user of the recent changes in the
"vfio/mdev: IOMMU aware mediated device" patchset. It could also be
a useful base for experimenting with future device specific mdev
migration support. The sample driver supports only singleton IOMMU
groups; assigning a non-singleton IOMMU group to a VM will fail.
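
One way to check the singleton restriction before binding a device is
to count the entries in the device's iommu_group/devices directory in
sysfs (e.g. /sys/bus/pci/devices/$dev_bdf/iommu_group/devices), the
directory the VFIO documentation lists when inspecting a group. This
is a minimal userspace sketch that assumes that sysfs layout.
iommu_group_device_count() and iommu_group_is_singleton() are
hypothetical helpers and are not part of this series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <string.h>

/*
 * Count the device entries in an IOMMU group "devices" directory,
 * ignoring "." and "..". Returns -1 if the directory cannot be opened.
 */
static int iommu_group_device_count(const char *devices_dir)
{
	DIR *dir = opendir(devices_dir);
	struct dirent *ent;
	int count = 0;

	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		if (strcmp(ent->d_name, ".") != 0 &&
		    strcmp(ent->d_name, "..") != 0)
			count++;
	}
	closedir(dir);
	return count;
}

/* Non-zero only if the group contains exactly one device. */
static int iommu_group_is_singleton(const char *devices_dir)
{
	return iommu_group_device_count(devices_dir) == 1;
}
```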

To use this driver:
a) build and load the vfio-mdev-pci.ko module
   run "make menuconfig", enable CONFIG_SAMPLE_VFIO_MDEV_PCI and
   build, then load it with the following commands:
   > sudo modprobe vfio
   > sudo modprobe vfio-pci
   > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko

b) unbind the device from its original driver
   e.g. with the following command:
   > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind

c) bind the vfio-mdev-pci driver to the physical device
   > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id

d) check the supported mdev types
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
     vfio-mdev-pci-type_name
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type_name/
     available_instances  create  device_api  devices  name

e) create a mdev on this physical device (only one instance is supported)
   > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
     /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type_name/create

f) pass the mdev through to the guest
   add the following to the QEMU command line:
    -device vfio-pci,\
     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003

g) destroy the mdev
   > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
     remove
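
The sysfs writes in the steps above can also be driven from a
program. This is a minimal sketch that assumes the paths shown in
steps c) to g). mdev_create_path() and sysfs_write() are hypothetical
helpers. The type directory name should come from the listing in
step d) rather than being hardcoded.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs "create" path from step e). Returns 0 on success. */
static int mdev_create_path(char *buf, size_t len, const char *bdf,
			    const char *type_name)
{
	int n = snprintf(buf, len,
			 "/sys/bus/pci/devices/%s/mdev_supported_types/%s/create",
			 bdf, type_name);

	/* Treat truncation as an error rather than using a partial path. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write one value to a sysfs attribute, like "echo value > path". */
static int sysfs_write(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ret;

	if (!f)
		return -1;
	ret = (fputs(value, f) < 0) ? -1 : 0;
	/* stdio buffers the write, so a sysfs store error may surface here. */
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

For example, after a successful mdev_create_path(path, sizeof(path),
bdf, type), the caller would invoke sysfs_write(path,
"83b8f4f2-509f-382f-3c1e-e6bfe0fa1003") with root privileges.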

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 samples/Kconfig                       |  10 +
 samples/Makefile                      |   1 +
 samples/vfio-mdev-pci/Makefile        |   4 +
 samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
 4 files changed, 412 insertions(+)
 create mode 100644 samples/vfio-mdev-pci/Makefile
 create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c

diff --git a/samples/Kconfig b/samples/Kconfig
index 9d236c3..50d207c 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
 	help
 	  Build a sample program to work with mei device.
 
+config SAMPLE_VFIO_MDEV_PCI
+	tristate "Sample driver for wrapping PCI device as a mdev"
+	select VFIO_PCI_COMMON
+	select VFIO_PCI
+	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
+	help
+	  Sample driver for wrapping a PCI device as a mdev. Once bound to
+	  this driver, device passthrough goes through the mdev path.
+
+	  If you don't know what to do here, say N.
 
 endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index 5ce50ef..84faced 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
 obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
 obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
 obj-y					+= vfio-mdev/
+obj-y					+= vfio-mdev-pci/
 subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
 obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
new file mode 100644
index 0000000..41b2139
--- /dev/null
+++ b/samples/vfio-mdev-pci/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+vfio-mdev-pci-y := vfio_mdev_pci.o
+
+obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
new file mode 100644
index 0000000..b180356
--- /dev/null
+++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
@@ -0,0 +1,397 @@
+/*
+ * Copyright © 2020 Intel Corporation.
+ *     Author: Liu Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_pci.c:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+#include <linux/mdev.h>
+#include <linux/vfio_pci_common.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
+#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
+
+#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool nointxmask;
+module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nointxmask,
+		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
+
+#ifdef CONFIG_VFIO_PCI_VGA
+static bool disable_vga;
+module_param(disable_vga, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
+#endif
+
+static bool disable_idle_d3;
+module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_idle_d3,
+		 "Disable using the PCI D3 low power state for idle, unused devices");
+
+static struct pci_driver vfio_mdev_pci_driver;
+
+static ssize_t
+name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%s-type1\n", dev_name(dev));
+}
+
+MDEV_TYPE_ATTR_RO(name);
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%d\n", 1);
+}
+
+MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+		char *buf)
+{
+	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+
+MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *vfio_mdev_pci_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_device_api.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group vfio_mdev_pci_type_group1 = {
+	.name  = "type1",
+	.attrs = vfio_mdev_pci_types_attrs,
+};
+
+struct attribute_group *vfio_mdev_pci_type_groups[] = {
+	&vfio_mdev_pci_type_group1,
+	NULL,
+};
+
+struct vfio_mdev_pci {
+	struct vfio_pci_device *vdev;
+	struct mdev_device *mdev;
+	unsigned long handle;
+};
+
+static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct device *pdev;
+	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci *pmdev;
+	int ret;
+
+	pdev = mdev_parent_dev(mdev);
+	vdev = dev_get_drvdata(pdev);
+	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
+	if (pmdev == NULL) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	pmdev->mdev = mdev;
+	pmdev->vdev = vdev;
+	mdev_set_drvdata(mdev, pmdev);
+	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
+	if (ret) {
+		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
+			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
+		goto out;
+	}
+
+	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
+		     dev_name(mdev_dev(mdev)));
+out:
+	return ret;
+}
+
+static int vfio_mdev_pci_remove(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	kfree(pmdev);
+	pr_info("%s, succeeded for mdev: %s\n", __func__,
+		     dev_name(mdev_dev(mdev)));
+
+	return 0;
+}
+
+static int vfio_mdev_pci_open(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_spapr_pci_eeh_open(vdev->pdev);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (!ret)
+		pr_info("Succeeded to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+	else {
+		pr_info("Failed to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+		module_put(THIS_MODULE);
+	}
+	return ret;
+}
+
+static void vfio_mdev_pci_release(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+
+	pr_info("Release mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!(--vdev->refcnt)) {
+		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_pci_disable(vdev);
+	}
+
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			     unsigned long arg)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
+}
+
+static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
+				struct vm_area_struct *vma)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_mmap(pmdev->vdev, vma);
+}
+
+static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
+}
+
+static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
+				const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
+}
+
+static const struct mdev_parent_ops vfio_mdev_pci_ops = {
+	.supported_type_groups	= vfio_mdev_pci_type_groups,
+	.create			= vfio_mdev_pci_create,
+	.remove			= vfio_mdev_pci_remove,
+
+	.open			= vfio_mdev_pci_open,
+	.release		= vfio_mdev_pci_release,
+
+	.read			= vfio_mdev_pci_read,
+	.write			= vfio_mdev_pci_write,
+	.mmap			= vfio_mdev_pci_mmap,
+	.ioctl			= vfio_mdev_pci_ioctl,
+};
+
+static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
+				       const struct pci_device_id *id)
+{
+	struct vfio_pci_device *vdev;
+	int ret;
+
+	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	/*
+	 * Prevent binding to PFs with VFs enabled, this too easily allows
+	 * userspace instance with VFs and PFs from the same device, which
+	 * cannot work.  Disabling SR-IOV here would initiate removing the
+	 * VFs, which would unbind the driver, which is prone to blocking
+	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
+	 * reject these PFs and let the user sort it out.
+	 */
+	if (pci_num_vf(pdev)) {
+		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
+		return -EBUSY;
+	}
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	mutex_init(&vdev->ioeventfds_lock);
+	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
+
+	pci_set_drvdata(pdev, vdev);
+
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		pci_set_drvdata(pdev, NULL);
+		kfree(vdev);
+		return ret;
+	}
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
+		vga_set_legacy_decoding(pdev,
+					vfio_pci_set_vga_decode(vdev, false));
+	}
+
+	vfio_pci_probe_power_state(vdev);
+
+	if (!vdev->disable_idle_d3) {
+		/*
+		 * pci-core sets the device power state to an unknown value at
+		 * bootup and after being removed from a driver.  The only
+		 * transition it allows from this unknown state is to D0, which
+		 * typically happens when a driver calls pci_enable_device().
+		 * We're not ready to enable the device yet, but we do want to
+		 * be able to get to D3.  Therefore first do a D0 transition
+		 * before going to D3.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	}
+
+	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	if (ret)
+		pr_err("Cannot register mdev for device %s\n",
+			dev_name(&pdev->dev));
+	else
+		pr_info("Wrap device %s as a mdev\n", dev_name(&pdev->dev));
+
+	return ret;
+}
+
+static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = pci_get_drvdata(pdev);
+	if (!vdev)
+		return;
+
+	vfio_pci_reflck_put(vdev->reflck);
+
+	kfree(vdev->region);
+	mutex_destroy(&vdev->ioeventfds_lock);
+
+	if (!disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	kfree(vdev->pm_save);
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, NULL, NULL, NULL);
+		vga_set_legacy_decoding(pdev,
+				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
+	}
+
+	kfree(vdev);
+}
+
+static struct pci_driver vfio_mdev_pci_driver = {
+	.name		= VFIO_MDEV_PCI_NAME,
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_mdev_pci_driver_probe,
+	.remove		= vfio_mdev_pci_driver_remove,
+	.err_handler	= &vfio_pci_err_handlers,
+};
+
+static void __exit vfio_mdev_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_mdev_pci_driver);
+}
+
+static int __init vfio_mdev_pci_init(void)
+{
+	int ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_mdev_pci_driver);
+	if (ret)
+		return ret;
+
+	vfio_pci_fill_ids(ids, &vfio_mdev_pci_driver);
+
+	return 0;
+}
+
+module_init(vfio_mdev_pci_init);
+module_exit(vfio_mdev_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.4
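
In the sample driver above, vfio_mdev_pci_open() and vfio_mdev_pci_release() keep a reference count on the shared vfio_pci_device. As a result, vfio_pci_enable() runs only on the first open and vfio_pci_disable() only on the last release. The sketch below is a minimal user-space model of that pattern, under these assumptions:

- all model_* names are hypothetical;
- the EEH hooks and try_module_get()/module_put() are left out;
- the reflck lock is omitted because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct vfio_pci_device, reduced to the fields
 * needed to model open/release reference counting. The kernel code holds
 * vdev->reflck->lock around these updates; this single-threaded model
 * omits the lock. */
struct model_open_dev {
	unsigned int refcnt;	/* number of outstanding opens */
	bool enabled;		/* true between first open and last release */
	int enable_calls;	/* how many times the "enable" step ran */
};

/* Stand-in for vfio_pci_enable(); 'fail' simulates an enable error. */
static int model_enable(struct model_open_dev *v, bool fail)
{
	if (fail)
		return -1;
	v->enabled = true;
	v->enable_calls++;
	return 0;
}

/* Mirrors vfio_mdev_pci_open(): only the first open enables the device,
 * and a failed enable leaves refcnt untouched. */
static int model_open(struct model_open_dev *v, bool fail_enable)
{
	if (!v->refcnt) {
		int ret = model_enable(v, fail_enable);

		if (ret)
			return ret;
	}
	v->refcnt++;
	return 0;
}

/* Mirrors vfio_mdev_pci_release(): the last release disables the device. */
static void model_release(struct model_open_dev *v)
{
	assert(v->refcnt > 0);
	if (!(--v->refcnt))
		v->enabled = false;	/* stands in for vfio_pci_disable() */
}
```

Keeping refcnt unchanged on a failed enable is what lets the next open retry the enable step from a clean state.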



* [PATCH v4 12/12] samples: refine vfio-mdev-pci driver
  2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
                   ` (10 preceding siblings ...)
  2020-01-07 12:01 ` [PATCH v4 11/12] samples: add vfio-mdev-pci driver Liu Yi L
@ 2020-01-07 12:01 ` Liu Yi L
  11 siblings, 0 replies; 44+ messages in thread
From: Liu Yi L @ 2020-01-07 12:01 UTC
  To: alex.williamson, kwankhede
  Cc: linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu, Liu Yi L

From: Alex Williamson <alex.williamson@redhat.com>

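This patch refines the original vfio-mdev-pci sample driver. The changes are:

- each parent device now gets its own mdev_parent_ops and a single mdev type group;
- available_instances is backed by a per-device atomic counter, so at most one mdev can be created per parent device;
- the driver unregisters the mdev parent on remove.

The mdev type name is derived from the parent device's PCI IDs, using the
following rule: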

	vmdev->attr.name = kasprintf(GFP_KERNEL,
				     "%04x:%04x:%04x:%04x:%06x:%02x",
				     pdev->vendor, pdev->device,
				     pdev->subsystem_vendor,
				     pdev->subsystem_device, pdev->class,
				     pdev->revision);
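
The naming rule above can be sketched in user space with the same format string, using snprintf() into a caller-supplied buffer instead of kasprintf(). struct model_pci_ids and model_type_name() are hypothetical stand-ins for the struct pci_dev fields involved. The IDs used to check the sketch are arbitrary example values.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Hypothetical subset of struct pci_dev holding the IDs used in the name. */
struct model_pci_ids {
	unsigned short vendor, device;
	unsigned short subsystem_vendor, subsystem_device;
	unsigned int class;	/* 24-bit class code: base, sub-class, prog-if */
	unsigned char revision;
};

/* Same format string as the kasprintf() call above, written into buf.
 * Returns snprintf()'s result: the full name length, excluding the NUL. */
static int model_type_name(const struct model_pci_ids *p, char *buf, size_t len)
{
	return snprintf(buf, len, "%04x:%04x:%04x:%04x:%06x:%02x",
			p->vendor, p->device, p->subsystem_vendor,
			p->subsystem_device, p->class, p->revision);
}
```

For example, take vendor 0x8086, device 0x10fb, subsystem 0x8086:0x000c, class 0x020000 and revision 0x01. These values produce the 29-character name 8086:10fb:8086:000c:020000:01.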

Before creating an mdev, list /sys/bus/pci/devices/$bdf/mdev_supported_types/
to confirm the resulting type name.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 samples/vfio-mdev-pci/vfio_mdev_pci.c | 123 ++++++++++++++++++++--------------
 1 file changed, 73 insertions(+), 50 deletions(-)
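
The available_instances accounting in this patch works in two steps:

- vfio_mdev_pci_create() claims a slot with atomic_dec_if_positive();
- the slot is returned with atomic_inc(), either on remove or when create fails later.

The sketch below shows the same idea in user space with C11 atomics. model_dec_if_positive() is a hypothetical stand-in written for this illustration, not the kernel helper. Like the kernel helper, it returns the old value minus one even when it does not decrement.

```c
#include <stdatomic.h>
#include <errno.h>
#include <assert.h>

/* Decrement *v only if the result stays >= 0; return old value - 1 either
 * way, so a negative return means no slot was claimed. */
static int model_dec_if_positive(atomic_int *v)
{
	int old = atomic_load(v);

	do {
		if (old - 1 < 0)
			return old - 1;	/* no slot left; *v is unchanged */
		/* on failure, 'old' is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(v, &old, old - 1));
	return old - 1;
}

/* Mirrors the slot accounting in vfio_mdev_pci_create(). */
static int model_create(atomic_int *avail)
{
	if (model_dec_if_positive(avail) < 0)
		return -ENOSPC;
	return 0;
}

/* Mirrors the atomic_inc() in vfio_mdev_pci_remove(). */
static void model_remove(atomic_int *avail)
{
	int prev = atomic_fetch_add(avail, 1);

	assert(prev >= 0);
}
```

The compare-and-exchange commits a decrement only when the counter stays non-negative. So with the counter at 1, two concurrent creates cannot both succeed.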

diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
index b180356..0377b5b 100644
--- a/samples/vfio-mdev-pci/vfio_mdev_pci.c
+++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
@@ -64,18 +64,22 @@ MODULE_PARM_DESC(disable_idle_d3,
 
 static struct pci_driver vfio_mdev_pci_driver;
 
-static ssize_t
-name_show(struct kobject *kobj, struct device *dev, char *buf)
-{
-	return sprintf(buf, "%s-type1\n", dev_name(dev));
-}
-
-MDEV_TYPE_ATTR_RO(name);
+struct vfio_mdev_pci_device {
+	struct vfio_pci_device vdev;
+	struct mdev_parent_ops ops;
+	struct attribute_group *groups[2];
+	struct attribute_group attr;
+	atomic_t avail;
+};
 
 static ssize_t
 available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "%d\n", 1);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = pci_get_drvdata(to_pci_dev(dev));
+
+	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
 }
 
 MDEV_TYPE_ATTR_RO(available_instances);
@@ -89,64 +93,61 @@ static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
 MDEV_TYPE_ATTR_RO(device_api);
 
 static struct attribute *vfio_mdev_pci_types_attrs[] = {
-	&mdev_type_attr_name.attr,
 	&mdev_type_attr_device_api.attr,
 	&mdev_type_attr_available_instances.attr,
 	NULL,
 };
 
-static struct attribute_group vfio_mdev_pci_type_group1 = {
-	.name  = "type1",
-	.attrs = vfio_mdev_pci_types_attrs,
-};
-
-struct attribute_group *vfio_mdev_pci_type_groups[] = {
-	&vfio_mdev_pci_type_group1,
-	NULL,
-};
-
 struct vfio_mdev_pci {
 	struct vfio_pci_device *vdev;
 	struct mdev_device *mdev;
-	unsigned long handle;
 };
 
 static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
 {
 	struct device *pdev;
-	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_mdev_pci *pmdev;
 	int ret;
 
 	pdev = mdev_parent_dev(mdev);
-	vdev = dev_get_drvdata(pdev);
+	vmdev = dev_get_drvdata(pdev);
+
+	if (atomic_dec_if_positive(&vmdev->avail) < 0)
+		return -ENOSPC;
+
 	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
-	if (pmdev == NULL) {
-		ret = -EBUSY;
-		goto out;
+	if (!pmdev) {
+		atomic_inc(&vmdev->avail);
+		return -ENOMEM;
 	}
 
 	pmdev->mdev = mdev;
-	pmdev->vdev = vdev;
+	pmdev->vdev = &vmdev->vdev;
 	mdev_set_drvdata(mdev, pmdev);
 	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
 	if (ret) {
 		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
 			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
-		goto out;
+		kfree(pmdev);
+		atomic_inc(&vmdev->avail);
+		return ret;
 	}
 
 	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
 		     dev_name(mdev_dev(mdev)));
-out:
-	return ret;
+	return 0;
 }
 
 static int vfio_mdev_pci_remove(struct mdev_device *mdev)
 {
 	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
 
 	kfree(pmdev);
+	atomic_inc(&vmdev->avail);
 	pr_info("%s, succeeded for mdev: %s\n", __func__,
 		     dev_name(mdev_dev(mdev)));
 
@@ -240,24 +241,12 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
 	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
 }
 
-static const struct mdev_parent_ops vfio_mdev_pci_ops = {
-	.supported_type_groups	= vfio_mdev_pci_type_groups,
-	.create			= vfio_mdev_pci_create,
-	.remove			= vfio_mdev_pci_remove,
-
-	.open			= vfio_mdev_pci_open,
-	.release		= vfio_mdev_pci_release,
-
-	.read			= vfio_mdev_pci_read,
-	.write			= vfio_mdev_pci_write,
-	.mmap			= vfio_mdev_pci_mmap,
-	.ioctl			= vfio_mdev_pci_ioctl,
-};
-
 static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 				       const struct pci_device_id *id)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
+	const struct mdev_parent_ops *ops;
 	int ret;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -276,10 +265,38 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev)
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
 		return -ENOMEM;
 
+	vmdev->attr.name = kasprintf(GFP_KERNEL,
+				     "%04x:%04x:%04x:%04x:%06x:%02x",
+				     pdev->vendor, pdev->device,
+				     pdev->subsystem_vendor,
+				     pdev->subsystem_device, pdev->class,
+				     pdev->revision);
+	if (!vmdev->attr.name) {
+		kfree(vmdev);
+		return -ENOMEM;
+	}
+
+	atomic_set(&vmdev->avail, 1);
+
+	vmdev->attr.attrs = vfio_mdev_pci_types_attrs;
+	vmdev->groups[0] = &vmdev->attr;
+
+	vmdev->ops.supported_type_groups = vmdev->groups;
+	vmdev->ops.create = vfio_mdev_pci_create;
+	vmdev->ops.remove = vfio_mdev_pci_remove;
+	vmdev->ops.open	= vfio_mdev_pci_open;
+	vmdev->ops.release = vfio_mdev_pci_release;
+	vmdev->ops.read = vfio_mdev_pci_read;
+	vmdev->ops.write = vfio_mdev_pci_write;
+	vmdev->ops.mmap = vfio_mdev_pci_mmap;
+	vmdev->ops.ioctl = vfio_mdev_pci_ioctl;
+	ops = &vmdev->ops;
+
+	vdev = &vmdev->vdev;
 	vdev->pdev = pdev;
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	mutex_init(&vdev->igate);
@@ -292,7 +309,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 #endif
 	vdev->disable_idle_d3 = disable_idle_d3;
 
-	pci_set_drvdata(pdev, vdev);
+	pci_set_drvdata(pdev, vmdev);
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret) {
@@ -323,7 +340,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	ret = mdev_register_device(&pdev->dev, ops);
 	if (ret)
 		pr_err("Cannot register mdev for device %s\n",
 			dev_name(&pdev->dev));
@@ -335,12 +352,17 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 
 static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
 
-	vdev = pci_get_drvdata(pdev);
-	if (!vdev)
+	mdev_unregister_device(&pdev->dev);
+
+	vmdev = pci_get_drvdata(pdev);
+	if (!vmdev)
 		return;
 
+	vdev = &vmdev->vdev;
+
 	vfio_pci_reflck_put(vdev->reflck);
 
 	kfree(vdev->region);
@@ -358,7 +380,8 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
 	}
 
-	kfree(vdev);
+	kfree(vmdev->attr.name);
+	kfree(vmdev);
 }
 
 static struct pci_driver vfio_mdev_pci_driver = {
-- 
2.7.4



* Re: [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c
  2020-01-07 12:01 ` [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c Liu Yi L
@ 2020-01-08 11:24   ` kbuild test robot
  2020-01-09 22:48   ` Alex Williamson
  1 sibling, 0 replies; 44+ messages in thread
From: kbuild test robot @ 2020-01-08 11:24 UTC
  To: Liu Yi L
  Cc: kbuild-all, alex.williamson, kwankhede, linux-kernel, kvm,
	kevin.tian, joro, peterx, baolu.lu, Liu Yi L

Hi Liu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v5.5-rc5]
[also build test WARNING on next-20200106]
[cannot apply to vfio/next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest using the '--base' option to specify the
base tree in git format-patch; please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/vfio_pci-wrap-pci-device-as-a-mediated-device/20200108-020930
base:    c79f46a282390e0f5b306007bf7b11a46d529538
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-129-g341daf20-dirty
        make ARCH=x86_64 allmodconfig
        make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> drivers/vfio/pci/vfio_pci_common.c:201:25: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:201:43: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:201:56: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:201:65: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:206:25: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:206:44: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:206:57: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:206:66: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:214:39: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_common.c:214:58: sparse: sparse: restricted pci_power_t degrades to integer

vim +201 drivers/vfio/pci/vfio_pci_common.c

8db581db521ec0 Liu Yi L 2020-01-07  186  
8db581db521ec0 Liu Yi L 2020-01-07  187  /*
8db581db521ec0 Liu Yi L 2020-01-07  188   * pci_set_power_state() wrapper handling devices which perform a soft reset on
8db581db521ec0 Liu Yi L 2020-01-07  189   * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
8db581db521ec0 Liu Yi L 2020-01-07  190   * restore when returned to D0.  Saved separately from pci_saved_state for use
8db581db521ec0 Liu Yi L 2020-01-07  191   * by PM capability emulation and separately from pci_dev internal saved state
8db581db521ec0 Liu Yi L 2020-01-07  192   * to avoid it being overwritten and consumed around other resets.
8db581db521ec0 Liu Yi L 2020-01-07  193   */
8db581db521ec0 Liu Yi L 2020-01-07  194  int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
8db581db521ec0 Liu Yi L 2020-01-07  195  {
8db581db521ec0 Liu Yi L 2020-01-07  196  	struct pci_dev *pdev = vdev->pdev;
8db581db521ec0 Liu Yi L 2020-01-07  197  	bool needs_restore = false, needs_save = false;
8db581db521ec0 Liu Yi L 2020-01-07  198  	int ret;
8db581db521ec0 Liu Yi L 2020-01-07  199  
8db581db521ec0 Liu Yi L 2020-01-07  200  	if (vdev->needs_pm_restore) {
8db581db521ec0 Liu Yi L 2020-01-07 @201  		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
8db581db521ec0 Liu Yi L 2020-01-07  202  			pci_save_state(pdev);
8db581db521ec0 Liu Yi L 2020-01-07  203  			needs_save = true;
8db581db521ec0 Liu Yi L 2020-01-07  204  		}
8db581db521ec0 Liu Yi L 2020-01-07  205  
8db581db521ec0 Liu Yi L 2020-01-07  206  		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
8db581db521ec0 Liu Yi L 2020-01-07  207  			needs_restore = true;
8db581db521ec0 Liu Yi L 2020-01-07  208  	}
8db581db521ec0 Liu Yi L 2020-01-07  209  
8db581db521ec0 Liu Yi L 2020-01-07  210  	ret = pci_set_power_state(pdev, state);
8db581db521ec0 Liu Yi L 2020-01-07  211  
8db581db521ec0 Liu Yi L 2020-01-07  212  	if (!ret) {
8db581db521ec0 Liu Yi L 2020-01-07  213  		/* D3 might be unsupported via quirk, skip unless in D3 */
8db581db521ec0 Liu Yi L 2020-01-07  214  		if (needs_save && pdev->current_state >= PCI_D3hot) {
8db581db521ec0 Liu Yi L 2020-01-07  215  			vdev->pm_save = pci_store_saved_state(pdev);
8db581db521ec0 Liu Yi L 2020-01-07  216  		} else if (needs_restore) {
8db581db521ec0 Liu Yi L 2020-01-07  217  			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
8db581db521ec0 Liu Yi L 2020-01-07  218  			pci_restore_state(pdev);
8db581db521ec0 Liu Yi L 2020-01-07  219  		}
8db581db521ec0 Liu Yi L 2020-01-07  220  	}
8db581db521ec0 Liu Yi L 2020-01-07  221  
8db581db521ec0 Liu Yi L 2020-01-07  222  	return ret;
8db581db521ec0 Liu Yi L 2020-01-07  223  }
8db581db521ec0 Liu Yi L 2020-01-07  224  

:::::: The code at line 201 was first introduced by commit
:::::: 8db581db521ec047e12946a9c933f085c4d680ba vfio_pci: duplicate vfio_pci.c

:::::: TO: Liu Yi L <yi.l.liu@intel.com>
:::::: CC: 0day robot <lkp@intel.com>
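
The decision in vfio_pci_set_power_state() reduces to two comparisons between the current and target power states. Every warning above is reported on those relational comparisons, because pci_power_t is a sparse-restricted (__bitwise) type in the kernel.

The user-space sketch below models only the save/restore decision, using a plain enum. It does not show how the sparse warnings should be resolved. The M_D* values are assumptions that follow the D0 < D1 < D2 < D3hot < D3cold ordering the code relies on.

```c
#include <stdbool.h>
#include <assert.h>

/* Plain-int stand-ins for PCI_D0 ... PCI_D3cold (assumed ordering). */
enum model_power { M_D0, M_D1, M_D2, M_D3HOT, M_D3COLD };

struct model_pm_plan {
	bool needs_save;	/* stash config space before entering D3 */
	bool needs_restore;	/* reload stashed state when returning to D0 */
};

/* Mirrors the needs_save/needs_restore decision made before
 * pci_set_power_state() is called. */
static struct model_pm_plan model_plan(bool needs_pm_restore,
				       enum model_power cur,
				       enum model_power target)
{
	struct model_pm_plan p = { false, false };

	assert(cur <= M_D3COLD && target <= M_D3COLD);
	if (!needs_pm_restore)
		return p;
	if (cur < M_D3HOT && target >= M_D3HOT)
		p.needs_save = true;
	if (cur >= M_D3HOT && target <= M_D0)
		p.needs_restore = true;
	return p;
}
```

With needs_pm_restore set, the D0 then D3hot sequence done at probe time saves state on the second step, and a later return to D0 restores it.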

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation


* Re: [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  2020-01-07 12:01 ` [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
@ 2020-01-09 22:48   ` Alex Williamson
  2020-01-10  7:35     ` Liu, Yi L
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-09 22:48 UTC
  To: Liu Yi L; +Cc: kwankhede, linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu

On Tue,  7 Jan 2020 20:01:40 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch replaces the vfio_pci_driver reference in vfio_pci.c with
> pci_dev_driver(vdev->pdev), which makes these functions independent of
> the module that binds the device.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci.c | 34 ++++++++++++++++++----------------
>  1 file changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 009d2df..9140f5e5 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1463,24 +1463,25 @@ static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
>  
>  static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
>  {
> -	struct vfio_pci_reflck **preflck = data;
> +	struct vfio_pci_device *vdev = data;
> +	struct vfio_pci_reflck **preflck = &vdev->reflck;
>  	struct vfio_device *device;
> -	struct vfio_pci_device *vdev;
> +	struct vfio_pci_device *tmp;
>  
>  	device = vfio_device_get_from_dev(&pdev->dev);
>  	if (!device)
>  		return 0;
>  
> -	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
> +	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
>  		vfio_device_put(device);
>  		return 0;
>  	}
>  
> -	vdev = vfio_device_data(device);
> +	tmp = vfio_device_data(device);
>  
> -	if (vdev->reflck) {
> -		vfio_pci_reflck_get(vdev->reflck);
> -		*preflck = vdev->reflck;
> +	if (tmp->reflck) {
> +		vfio_pci_reflck_get(tmp->reflck);
> +		*preflck = tmp->reflck;

It seems we can do away with preflck entirely with this refactor; this
simply becomes vdev->reflck = tmp->reflck.  Thanks,

Alex
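
The simplification suggested above can be sketched in user space. Once the callback receives the device being attached as data, it can assign vdev->reflck directly instead of writing through a separate preflck pointer.

All model_* types and the reduced callback signature are hypothetical. The real callback also looks up the vfio_device, and it drops that reference on every exit path.

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical stand-ins for struct vfio_pci_reflck and vfio_pci_device. */
struct model_reflck { int refs; };

struct model_reflck_dev {
	const void *driver;		/* stands in for pci_dev_driver(pdev) */
	struct model_reflck *reflck;
};

/* 'other' is a device found on the same slot/bus; 'data' is the device
 * being attached. Returns 1 to stop the walk once a reflck is shared. */
static int model_reflck_find(struct model_reflck_dev *other, void *data)
{
	struct model_reflck_dev *vdev = data;

	/* only share with devices bound to the same driver */
	if (other->driver != vdev->driver || !other->reflck)
		return 0;

	other->reflck->refs++;		/* vfio_pci_reflck_get() */
	vdev->reflck = other->reflck;	/* no separate preflck needed */
	return 1;
}
```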

>  		vfio_device_put(device);
>  		return 1;
>  	}
> @@ -1497,7 +1498,7 @@ static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
>  
>  	if (pci_is_root_bus(vdev->pdev->bus) ||
>  	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
> -					  &vdev->reflck, slot) <= 0)
> +					  vdev, slot) <= 0)
>  		vdev->reflck = vfio_pci_reflck_alloc();
>  
>  	mutex_unlock(&reflck_lock);
> @@ -1522,6 +1523,7 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
>  
>  struct vfio_devices {
>  	struct vfio_device **devices;
> +	struct vfio_pci_device *vdev;
>  	int cur_index;
>  	int max_index;
>  };
> @@ -1530,7 +1532,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_devices *devs = data;
>  	struct vfio_device *device;
> -	struct vfio_pci_device *vdev;
> +	struct vfio_pci_device *tmp;
>  
>  	if (devs->cur_index == devs->max_index)
>  		return -ENOSPC;
> @@ -1539,15 +1541,15 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
>  	if (!device)
>  		return -EINVAL;
>  
> -	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
> +	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
>  		vfio_device_put(device);
>  		return -EBUSY;
>  	}
>  
> -	vdev = vfio_device_data(device);
> +	tmp = vfio_device_data(device);
>  
>  	/* Fault if the device is not unused */
> -	if (vdev->refcnt) {
> +	if (tmp->refcnt) {
>  		vfio_device_put(device);
>  		return -EBUSY;
>  	}
> @@ -1574,7 +1576,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
>   */
>  static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
>  {
> -	struct vfio_devices devs = { .cur_index = 0 };
> +	struct vfio_devices devs = { .vdev = vdev, .cur_index = 0 };
>  	int i = 0, ret = -EINVAL;
>  	bool slot = false;
>  	struct vfio_pci_device *tmp;
> @@ -1637,7 +1639,7 @@ static void __exit vfio_pci_cleanup(void)
>  	vfio_pci_uninit_perm_bits();
>  }
>  
> -static void __init vfio_pci_fill_ids(char *ids)
> +static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
>  {
>  	char *p, *id;
>  	int rc;
> @@ -1665,7 +1667,7 @@ static void __init vfio_pci_fill_ids(char *ids)
>  			continue;
>  		}
>  
> -		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
> +		rc = pci_add_dynid(driver, vendor, device,
>  				   subvendor, subdevice, class, class_mask, 0);
>  		if (rc)
>  			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
> @@ -1692,7 +1694,7 @@ static int __init vfio_pci_init(void)
>  	if (ret)
>  		goto out_driver;
>  
> -	vfio_pci_fill_ids(ids);
> +	vfio_pci_fill_ids(ids, &vfio_pci_driver);
>  
>  	return 0;
>  



* Re: [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module
  2020-01-07 12:01 ` [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
@ 2020-01-09 22:48   ` Alex Williamson
  2020-01-16 12:19     ` Liu, Yi L
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-09 22:48 UTC
  To: Liu Yi L; +Cc: kwankhede, linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu

On Tue,  7 Jan 2020 20:01:38 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds three fields to struct vfio_pci_device so that the
> vfio-pci.ko module parameters can be passed to functions that may later
> become common code. The values stored in struct vfio_pci_device are
> initialized in probe and refreshed when the device is opened, so that
> runtime changes to writable parameters (e.g. disable_idle_d3 and
> nointxmask) take effect.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         | 37 ++++++++++++++++++++++++++-----------
>  drivers/vfio/pci/vfio_pci_private.h |  8 ++++++++
>  2 files changed, 34 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 379a02c..af507c2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -54,10 +54,10 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(disable_idle_d3,
>  		 "Disable using the PCI D3 low power state for idle, unused devices");
>  
> -static inline bool vfio_vga_disabled(void)
> +static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
>  {
>  #ifdef CONFIG_VFIO_PCI_VGA
> -	return disable_vga;
> +	return vdev->disable_vga;
>  #else
>  	return true;
>  #endif
> @@ -78,7 +78,8 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
>  	unsigned char max_busnr;
>  	unsigned int decodes;
>  
> -	if (single_vga || !vfio_vga_disabled() || pci_is_root_bus(pdev->bus))
> +	if (single_vga || !vfio_vga_disabled(vdev) ||
> +		pci_is_root_bus(pdev->bus))
>  		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
>  		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
>  
> @@ -289,7 +290,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  	if (!vdev->pci_saved_state)
>  		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
>  
> -	if (likely(!nointxmask)) {
> +	if (likely(!vdev->nointxmask)) {
>  		if (vfio_pci_nointx(pdev)) {
>  			pci_info(pdev, "Masking broken INTx support\n");
>  			vdev->nointx = true;
> @@ -326,7 +327,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  	} else
>  		vdev->msix_bar = 0xFF;
>  
> -	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
> +	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
>  		vdev->has_vga = true;
>  
>  
> @@ -462,10 +463,17 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>  
>  	vfio_pci_try_bus_reset(vdev);
>  
> -	if (!disable_idle_d3)
> +	if (!vdev->disable_idle_d3)
>  		vfio_pci_set_power_state(vdev, PCI_D3hot);
>  }
>  
> +void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
> +			bool nointxmask, bool disable_idle_d3)
> +{
> +	vdev->nointxmask = nointxmask;
> +	vdev->disable_idle_d3 = disable_idle_d3;

These two are selected (not disable_vga) because they're the only
writable module options, correct?

> +}
> +
>  static void vfio_pci_release(void *device_data)
>  {
>  	struct vfio_pci_device *vdev = device_data;
> @@ -490,6 +498,8 @@ static int vfio_pci_open(void *device_data)
>  	if (!try_module_get(THIS_MODULE))
>  		return -ENODEV;
>  
> +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> +
>  	mutex_lock(&vdev->reflck->lock);
>  
>  	if (!vdev->refcnt) {
> @@ -1330,6 +1340,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	spin_lock_init(&vdev->irqlock);
>  	mutex_init(&vdev->ioeventfds_lock);
>  	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> +	vdev->nointxmask = nointxmask;
> +#ifdef CONFIG_VFIO_PCI_VGA
> +	vdev->disable_vga = disable_vga;
> +#endif
> +	vdev->disable_idle_d3 = disable_idle_d3;

But this could still use vfio_pci_refresh_config() for those writable
options and set disable_vga separately, couldn't it?  Also, since
disable_idle_d3 is related to power handling of the device while it is
not opened by the user, shouldn't the config also be refreshed when the
device is released by the user?
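A minimal userspace sketch of the split suggested above. The two writable options are re-read through one helper at probe, open and release, and the read-only disable_vga is latched once at probe. All names (model_params, model_vdev, model_*()) are hypothetical stand-ins for the module parameters and struct vfio_pci_device. The release path only reports whether the idle device would be moved to D3hot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the module parameters. */
struct model_params {
	bool nointxmask;      /* writable (S_IWUSR) */
	bool disable_idle_d3; /* writable (S_IWUSR) */
	bool disable_vga;     /* read-only (S_IRUGO) */
};

/* Hypothetical per-device copy of the options. */
struct model_vdev {
	bool nointxmask;
	bool disable_idle_d3;
	bool disable_vga;
	int refcnt;
};

/* Re-read only the options that can change at runtime. */
static void model_refresh_config(struct model_vdev *vdev,
				 const struct model_params *p)
{
	vdev->nointxmask = p->nointxmask;
	vdev->disable_idle_d3 = p->disable_idle_d3;
}

static void model_probe(struct model_vdev *vdev, const struct model_params *p)
{
	model_refresh_config(vdev, p);
	vdev->disable_vga = p->disable_vga; /* latched once, never refreshed */
	vdev->refcnt = 0;
}

static void model_open(struct model_vdev *vdev, const struct model_params *p)
{
	model_refresh_config(vdev, p);
	vdev->refcnt++;
}

/*
 * Refresh on release too: disable_idle_d3 decides whether the now-idle
 * device goes to D3hot. Returns true if it would be moved to D3hot.
 */
static bool model_release(struct model_vdev *vdev, const struct model_params *p)
{
	assert(vdev->refcnt > 0);
	model_refresh_config(vdev, p);
	return --vdev->refcnt == 0 && !vdev->disable_idle_d3;
}
```

With a refresh at release time, a change to disable_idle_d3 made while the device is open takes effect as soon as the device goes idle, not only on the next open.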

>  
>  	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
>  	if (ret) {
> @@ -1354,7 +1369,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  
>  	vfio_pci_probe_power_state(vdev);
>  
> -	if (!disable_idle_d3) {
> +	if (!vdev->disable_idle_d3) {
>  		/*
>  		 * pci-core sets the device power state to an unknown value at
>  		 * bootup and after being removed from a driver.  The only
> @@ -1385,7 +1400,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>  	kfree(vdev->region);
>  	mutex_destroy(&vdev->ioeventfds_lock);
>  
> -	if (!disable_idle_d3)
> +	if (!vdev->disable_idle_d3)
>  		vfio_pci_set_power_state(vdev, PCI_D0);
>  
>  	kfree(vdev->pm_save);
> @@ -1620,7 +1635,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
>  		if (!ret) {
>  			tmp->needs_reset = false;
>  
> -			if (tmp != vdev && !disable_idle_d3)
> +			if (tmp != vdev && !tmp->disable_idle_d3)
>  				vfio_pci_set_power_state(tmp, PCI_D3hot);
>  		}
>  
> @@ -1636,7 +1651,7 @@ static void __exit vfio_pci_cleanup(void)
>  	vfio_pci_uninit_perm_bits();
>  }
>  
> -static void __init vfio_pci_fill_ids(void)
> +static void __init vfio_pci_fill_ids(char *ids)

This might be more clear if the global was also renamed vfio_pci_ids.
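Once the string is passed in as a parameter, the parsing can be exercised on its own. Below is a userspace sketch of parsing the `ids` format documented for the module parameter: comma-separated `vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]` entries. model_parse_ids(), struct model_dynid and MODEL_PCI_ANY_ID are hypothetical stand-ins. Registration through pci_add_dynid() is replaced by filling an array.

```c
#define _DEFAULT_SOURCE /* expose strsep() in glibc */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_PCI_ANY_ID (~0u) /* stand-in for PCI_ANY_ID */

struct model_dynid {
	unsigned int vendor, device, subvendor, subdevice, class, class_mask;
};

/*
 * Parse comma-separated id entries, skipping empty or malformed ones.
 * Modifies ids in place. Returns the number of entries stored in out.
 */
static int model_parse_ids(char *ids, struct model_dynid *out, int max)
{
	char *p = ids, *id;
	int n = 0;

	assert(ids && out);
	while ((id = strsep(&p, ",")) && n < max) {
		/* Optional fields default to "match anything". */
		struct model_dynid d = {
			.subvendor = MODEL_PCI_ANY_ID,
			.subdevice = MODEL_PCI_ANY_ID,
			.class = 0,
			.class_mask = 0,
		};

		if (!strlen(id))
			continue;

		/* vendor and device are mandatory */
		if (sscanf(id, "%x:%x:%x:%x:%x:%x", &d.vendor, &d.device,
			   &d.subvendor, &d.subdevice, &d.class,
			   &d.class_mask) < 2) {
			fprintf(stderr, "invalid id string \"%s\"\n", id);
			continue;
		}
		out[n++] = d;
	}
	return n;
}
```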

>  {
>  	char *p, *id;
>  	int rc;
> @@ -1691,7 +1706,7 @@ static int __init vfio_pci_init(void)
>  	if (ret)
>  		goto out_driver;
>  
> -	vfio_pci_fill_ids();
> +	vfio_pci_fill_ids(ids);
>  
>  	return 0;
>  
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 8a2c760..0398608 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -122,6 +122,11 @@ struct vfio_pci_device {
>  	struct list_head	dummy_resources_list;
>  	struct mutex		ioeventfds_lock;
>  	struct list_head	ioeventfds_list;
> +	bool			nointxmask;
> +#ifdef CONFIG_VFIO_PCI_VGA
> +	bool			disable_vga;
> +#endif
> +	bool			disable_idle_d3;

It seems like there are more relevant places these could be within this
structure, e.g. nointxmask next to nointx, disable_vga near has_vga,
disable_idle_d3 maybe near needs_pm_restore (even though those aren't
conceptually related).  Not necessarily related to this series, it
might be time to convert the existing bools to bit fields, but even
before that, grouping these new bools with the existing bools is
probably better for alignment.  Thanks,

Alex
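Below is a userspace sketch of the two layouts discussed above. In the first, the new options are placed next to related flags as plain bools. In the second, the same flags are packed as 1-bit fields. The struct names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* New options placed next to the state they relate to. */
struct flags_as_bools {
	bool nointx;
	bool nointxmask;      /* next to nointx */
	bool has_vga;
	bool disable_vga;     /* next to has_vga */
	bool needs_pm_restore;
	bool disable_idle_d3; /* grouped with the PM-related flag */
};

/* The same flags as single-bit fields sharing one storage unit. */
struct flags_as_bits {
	unsigned int nointx:1;
	unsigned int nointxmask:1;
	unsigned int has_vga:1;
	unsigned int disable_vga:1;
	unsigned int needs_pm_restore:1;
	unsigned int disable_idle_d3:1;
};

_Static_assert(sizeof(struct flags_as_bits) <= sizeof(unsigned int),
	       "six 1-bit flags fit in one unsigned int");
```

One trade-off to note: adjacent bit fields share a memory location. Concurrent updates to different flags therefore need common locking, which separate bools do not require.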

>  };
>  
>  #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> @@ -130,6 +135,9 @@ struct vfio_pci_device {
>  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
>  #define irq_is(vdev, type) (vdev->irq_type == type)
>  
> +extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
> +				bool nointxmask, bool disable_idle_d3);
> +
>  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>  


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c
  2020-01-07 12:01 ` [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c Liu Yi L
  2020-01-08 11:24   ` kbuild test robot
@ 2020-01-09 22:48   ` Alex Williamson
  2020-01-16 12:42     ` Liu, Yi L
  1 sibling, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-09 22:48 UTC (permalink / raw)
  To: Liu Yi L; +Cc: kwankhede, linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu

On Tue,  7 Jan 2020 20:01:44 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch removes the common code from vfio_pci.c and keeps only the
> module-specific code; the new vfio_pci.c uses the common functions
> implemented in vfio_pci_common.c.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/Makefile           |    3 +-
>  drivers/vfio/pci/vfio_pci.c         | 1442 -----------------------------------
>  drivers/vfio/pci/vfio_pci_common.c  |    2 +-
>  drivers/vfio/pci/vfio_pci_private.h |    2 +
>  4 files changed, 5 insertions(+), 1444 deletions(-)
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index f027f8a..d94317a 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,6 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  
> -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> +vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
> +		vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 103e493..7e24da2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c

I think there are a bunch of headers that are no longer needed here
too.  It at least compiles without these:

-#include <linux/eventfd.h>
-#include <linux/file.h>
-#include <linux/interrupt.h>
-#include <linux/notifier.h>
-#include <linux/pm_runtime.h>
-#include <linux/uaccess.h>
-#include <linux/nospec.h>


> @@ -54,411 +54,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(disable_idle_d3,
>  		 "Disable using the PCI D3 low power state for idle, unused devices");
>  
> -/*
> - * Our VGA arbiter participation is limited since we don't know anything
> - * about the device itself.  However, if the device is the only VGA device
> - * downstream of a bridge and VFIO VGA support is disabled, then we can
> - * safely return legacy VGA IO and memory as not decoded since the user
> - * has no way to get to it and routing can be disabled externally at the
> - * bridge.
> - */
> -unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
> -{
> -	struct vfio_pci_device *vdev = opaque;
> -	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
> -	unsigned char max_busnr;
> -	unsigned int decodes;
> -
> -	if (single_vga || !vfio_vga_disabled(vdev) ||
> -		pci_is_root_bus(pdev->bus))
> -		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
> -		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
> -
> -	max_busnr = pci_bus_max_busnr(pdev->bus);
> -	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
> -
> -	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
> -		if (tmp == pdev ||
> -		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
> -		    pci_is_root_bus(tmp->bus))
> -			continue;
> -
> -		if (tmp->bus->number >= pdev->bus->number &&
> -		    tmp->bus->number <= max_busnr) {
> -			pci_dev_put(tmp);
> -			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
> -			break;
> -		}
> -	}
> -
> -	return decodes;
> -}
> -
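The decision described in the comment above can be sketched as a pure userspace function. The resource bits and the list of other VGA devices are hypothetical stand-ins; the driver itself walks devices with pci_get_class(). `others` is assumed to exclude the device itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the vgaarb resource bits. */
enum {
	MODEL_RSRC_NORMAL_IO  = 1 << 0,
	MODEL_RSRC_NORMAL_MEM = 1 << 1,
	MODEL_RSRC_LEGACY_IO  = 1 << 2,
	MODEL_RSRC_LEGACY_MEM = 1 << 3,
};

/* Minimal description of another VGA-class device in the system. */
struct model_vga_dev {
	int domain;
	int busnr;
	bool on_root_bus;
};

/*
 * Decide whether our device must keep decoding the legacy VGA ranges.
 * [self_busnr, max_busnr] is the bus range at and below our device's bus.
 */
static unsigned int model_vga_decode(bool single_vga, bool vga_disabled,
				     bool self_on_root_bus, int self_domain,
				     int self_busnr, int max_busnr,
				     const struct model_vga_dev *others,
				     size_t nr_others)
{
	unsigned int decodes = MODEL_RSRC_NORMAL_IO | MODEL_RSRC_NORMAL_MEM;
	unsigned int legacy = MODEL_RSRC_LEGACY_IO | MODEL_RSRC_LEGACY_MEM;
	size_t i;

	if (single_vga || !vga_disabled || self_on_root_bus)
		return decodes | legacy;

	for (i = 0; i < nr_others; i++) {
		const struct model_vga_dev *o = &others[i];

		if (o->domain != self_domain || o->on_root_bus)
			continue;
		/* Another VGA device sits in our bus range: keep legacy. */
		if (o->busnr >= self_busnr && o->busnr <= max_busnr)
			return decodes | legacy;
	}
	return decodes;
}
```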
> -static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> -{
> -	struct resource *res;
> -	int i;
> -	struct vfio_pci_dummy_resource *dummy_res;
> -
> -	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> -
> -	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> -		int bar = i + PCI_STD_RESOURCES;
> -
> -		res = &vdev->pdev->resource[bar];
> -
> -		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> -			goto no_mmap;
> -
> -		if (!(res->flags & IORESOURCE_MEM))
> -			goto no_mmap;
> -
> -		/*
> -		 * The PCI core shouldn't set up a resource with a
> -		 * type but zero size. But there may be bugs that
> -		 * cause us to do that.
> -		 */
> -		if (!resource_size(res))
> -			goto no_mmap;
> -
> -		if (resource_size(res) >= PAGE_SIZE) {
> -			vdev->bar_mmap_supported[bar] = true;
> -			continue;
> -		}
> -
> -		if (!(res->start & ~PAGE_MASK)) {
> -			/*
> -			 * Add a dummy resource to reserve the remainder
> -			 * of the exclusive page in case that hot-add
> -			 * device's bar is assigned into it.
> -			 */
> -			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> -			if (dummy_res == NULL)
> -				goto no_mmap;
> -
> -			dummy_res->resource.name = "vfio sub-page reserved";
> -			dummy_res->resource.start = res->end + 1;
> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> -			dummy_res->resource.flags = res->flags;
> -			if (request_resource(res->parent,
> -						&dummy_res->resource)) {
> -				kfree(dummy_res);
> -				goto no_mmap;
> -			}
> -			dummy_res->index = bar;
> -			list_add(&dummy_res->res_next,
> -					&vdev->dummy_resources_list);
> -			vdev->bar_mmap_supported[bar] = true;
> -			continue;
> -		}
> -		/*
> -		 * Here we don't handle the case when the BAR is not page
> -		 * aligned because we can't expect the BAR will be
> -		 * assigned into the same location in a page in guest
> -		 * when we passthrough the BAR. And it's hard to access
> -		 * this BAR in userspace because we have no way to get
> -		 * the BAR's location in a page.
> -		 */
> -no_mmap:
> -		vdev->bar_mmap_supported[bar] = false;
> -	}
> -}
> -
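Below is a userspace sketch of the per-BAR mmap decision made by vfio_pci_probe_mmaps() above, assuming 4 KiB pages. model_bar_mmap() is a hypothetical helper that returns the range the dummy resource would reserve. It does not model CONFIG_VFIO_PCI_MMAP, and it leaves allocation and request_resource() failures out of scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Outcome of the per-BAR mmap decision. */
enum model_mmap {
	MODEL_NO_MMAP,           /* not mmap-able */
	MODEL_MMAP,              /* BAR spans at least one page */
	MODEL_MMAP_RESERVE_TAIL, /* sub-page but page-aligned: reserve rest */
};

/* is_mem stands for IORESOURCE_MEM; size == 0 is a bogus resource. */
static enum model_mmap model_bar_mmap(bool is_mem, uint64_t start,
				      uint64_t size, uint64_t *tail_start,
				      uint64_t *tail_end)
{
	if (!is_mem || size == 0)
		return MODEL_NO_MMAP;
	if (size >= MODEL_PAGE_SIZE)
		return MODEL_MMAP;
	if (start & (MODEL_PAGE_SIZE - 1))
		return MODEL_NO_MMAP; /* unaligned sub-page BAR */

	/* Dummy resource covers [res->end + 1, res->start + PAGE_SIZE - 1]. */
	*tail_start = start + size;
	*tail_end = start + MODEL_PAGE_SIZE - 1;
	return MODEL_MMAP_RESERVE_TAIL;
}
```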
> -static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
> -
> -/*
> - * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> - * _and_ the ability detect when the device is asserting INTx via PCI_STATUS.
> - * If a device implements the former but not the latter we would typically
> - * expect broken_intx_masking be set and require an exclusive interrupt.
> - * However since we do have control of the device's ability to assert INTx,
> - * we can instead pretend that the device does not implement INTx, virtualizing
> - * the pin register to report zero and maintaining DisINTx set on the host.
> - */
> -static bool vfio_pci_nointx(struct pci_dev *pdev)
> -{
> -	switch (pdev->vendor) {
> -	case PCI_VENDOR_ID_INTEL:
> -		switch (pdev->device) {
> -		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
> -		case 0x1572:
> -		case 0x1574:
> -		case 0x1580 ... 0x1581:
> -		case 0x1583 ... 0x158b:
> -		case 0x37d0 ... 0x37d2:
> -			return true;
> -		default:
> -			return false;
> -		}
> -	}
> -
> -	return false;
> -}
> -
> -void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
> -{
> -	struct pci_dev *pdev = vdev->pdev;
> -	u16 pmcsr;
> -
> -	if (!pdev->pm_cap)
> -		return;
> -
> -	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
> -
> -	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
> -}
> -
> -/*
> - * pci_set_power_state() wrapper handling devices which perform a soft reset on
> - * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
> - * restore when returned to D0.  Saved separately from pci_saved_state for use
> - * by PM capability emulation and separately from pci_dev internal saved state
> - * to avoid it being overwritten and consumed around other resets.
> - */
> -int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
> -{
> -	struct pci_dev *pdev = vdev->pdev;
> -	bool needs_restore = false, needs_save = false;
> -	int ret;
> -
> -	if (vdev->needs_pm_restore) {
> -		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
> -			pci_save_state(pdev);
> -			needs_save = true;
> -		}
> -
> -		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
> -			needs_restore = true;
> -	}
> -
> -	ret = pci_set_power_state(pdev, state);
> -
> -	if (!ret) {
> -		/* D3 might be unsupported via quirk, skip unless in D3 */
> -		if (needs_save && pdev->current_state >= PCI_D3hot) {
> -			vdev->pm_save = pci_store_saved_state(pdev);
> -		} else if (needs_restore) {
> -			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
> -			pci_restore_state(pdev);
> -		}
> -	}


This gets a bit ugly; vfio_pci_remove() retains:

kfree(vdev->pm_save)

But vfio_pci.c otherwise has no use of this field on the
vfio_pci_device.  I'm afraid we're really just doing a pretty rough
splitting of the code rather than massaging the callbacks between the
modules into an actual API, for example maybe there should be init and
exit callbacks into the common code to handle such things.
ioeventfds_{list,lock} are similar, vfio_pci.c inits and destroys them,
but otherwise doesn't know what they're for.  I wonder how many more
such things exist.  Thanks,

Alex
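Below is a minimal userspace sketch of the init/exit split suggested above. The common code initialises and frees the state that only it uses, modelled here on pm_save and the ioeventfds list, and the leaf driver only calls the pair. model_common_init(), model_common_exit() and the other names are hypothetical. The real mutex and the eventfd handling are omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued ioeventfd (stand-in for struct vfio_pci_ioeventfd). */
struct model_ioeventfd {
	struct model_ioeventfd *next;
};

/* Hypothetical state that only the common code actually uses. */
struct model_common {
	struct model_ioeventfd *ioeventfds; /* cf. ioeventfds_list */
	int ioeventfds_nr;
	void *pm_save;                      /* cf. vdev->pm_save */
};

/* Common init: sets up every field the common code will touch. */
static void model_common_init(struct model_common *c)
{
	c->ioeventfds = NULL;
	c->ioeventfds_nr = 0;
	c->pm_save = NULL;
}

/* Common helper that populates common-owned state. */
static int model_common_add_ioeventfd(struct model_common *c)
{
	struct model_ioeventfd *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->next = c->ioeventfds;
	c->ioeventfds = e;
	c->ioeventfds_nr++;
	return 0;
}

/* Common exit: frees what the common code allocated; safe to repeat. */
static void model_common_exit(struct model_common *c)
{
	while (c->ioeventfds) {
		struct model_ioeventfd *e = c->ioeventfds;

		c->ioeventfds = e->next;
		free(e);
		c->ioeventfds_nr--;
	}
	free(c->pm_save);
	c->pm_save = NULL;
	assert(c->ioeventfds_nr == 0);
}

/* Leaf driver probe/remove: no knowledge of pm_save or ioeventfds. */
static struct model_common *model_leaf_probe(void)
{
	struct model_common *c = malloc(sizeof(*c));

	if (c)
		model_common_init(c);
	return c;
}

static void model_leaf_remove(struct model_common *c)
{
	model_common_exit(c);
	free(c);
}
```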



* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-07 12:01 ` [PATCH v4 11/12] samples: add vfio-mdev-pci driver Liu Yi L
@ 2020-01-09 22:48   ` Alex Williamson
  2020-01-16 12:33     ` Liu, Yi L
  2020-01-15 12:30   ` Cornelia Huck
  1 sibling, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-09 22:48 UTC (permalink / raw)
  To: Liu Yi L
  Cc: kwankhede, linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu,
	Masahiro Yamada

On Tue,  7 Jan 2020 20:01:48 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds a sample driver named vfio-mdev-pci, which wraps a
> PCI device as a mediated device. Once a PCI device is bound to the
> vfio-mdev-pci driver, user-space access to it goes through the vfio
> mdev framework, and the device follows the mdev management model,
> e.g. the user must create an mdev before exposing the device to
> user space.
> 
> This driver serves as a sample for the recent changes from the
> "vfio/mdev: IOMMU aware mediated device" patchset, and could also be
> a good base for experimenting with future device-specific mdev
> migration support. It only supports singleton iommu groups:
> assigning a non-singleton iommu group to a VM will fail.
> 
> To use this driver:
> a) build and load the vfio-mdev-pci.ko module
>    run "make menuconfig", enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
>    then load it with the following commands:
>    > sudo modprobe vfio
>    > sudo modprobe vfio-pci
>    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko  
> 
> b) unbind the original device driver
>    e.g. use the following command to unbind it:
>    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> 
> c) bind vfio-mdev-pci driver to the physical device
>    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> 
> d) check the supported mdev types
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
>      vfio-mdev-pci-type_name
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
>      vfio-mdev-pci-type_name/
>      available_instances  create  device_api  devices  name
> 
> e) create an mdev on this physical device (only 1 instance is supported)
>    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
>      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
>      vfio-mdev-pci-type_name/create
> 
> f) pass the mdev through to the guest
>    by adding the following option to the QEMU command line:
>     -device vfio-pci,\
>      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> 
> g) destroy mdev
>    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
>      remove
> 
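The sysfs steps above can also be driven from a program. This minimal sketch assumes that the mdev core names each type directory `<parent driver name>-<type group name>`. Under that assumption, the `type1` group registered by this driver would appear as `vfio-mdev-pci-type1`, which the steps above write as `vfio-mdev-pci-type_name`. model_mdev_create_path() and model_mdev_create() are hypothetical helpers, and writing to the real attribute requires root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build .../mdev_supported_types/<driver>-<group>/create for a PCI BDF.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int model_mdev_create_path(char *buf, size_t len, const char *bdf,
				  const char *driver, const char *group)
{
	int n = snprintf(buf, len,
			 "/sys/bus/pci/devices/%s/mdev_supported_types/%s-%s/create",
			 bdf, driver, group);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Equivalent of step e): echo "<uuid>" > <create_path>.
 * Rejects strings that are not 36 characters (canonical UUID length).
 */
static int model_mdev_create(const char *create_path, const char *uuid)
{
	FILE *f;
	int ret = 0;

	assert(create_path && uuid);
	if (strlen(uuid) != 36)
		return -1;

	f = fopen(create_path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%s\n", uuid) < 0)
		ret = -1;
	/* stdio buffers the write; a failing store shows up at fclose() */
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```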
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  samples/Kconfig                       |  10 +
>  samples/Makefile                      |   1 +
>  samples/vfio-mdev-pci/Makefile        |   4 +
>  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
>  4 files changed, 412 insertions(+)
>  create mode 100644 samples/vfio-mdev-pci/Makefile
>  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> 
> diff --git a/samples/Kconfig b/samples/Kconfig
> index 9d236c3..50d207c 100644
> --- a/samples/Kconfig
> +++ b/samples/Kconfig
> @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
>  	help
>  	  Build a sample program to work with mei device.
>  
> +config SAMPLE_VFIO_MDEV_PCI
> +	tristate "Sample driver for wrapping PCI device as a mdev"
> +	select VFIO_PCI_COMMON
> +	select VFIO_PCI
> +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> +	help
> +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> +	  this driver, device passthrough goes through the mdev path.
> +
> +	  If you don't know what to do here, say N.
>  
>  endif # SAMPLES
> diff --git a/samples/Makefile b/samples/Makefile
> index 5ce50ef..84faced 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
>  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
>  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
>  obj-y					+= vfio-mdev/
> +obj-y					+= vfio-mdev-pci/

I think we could just lump this into vfio-mdev rather than making
another directory.

>  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
>  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> new file mode 100644
> index 0000000..41b2139
> --- /dev/null
> +++ b/samples/vfio-mdev-pci/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +vfio-mdev-pci-y := vfio_mdev_pci.o
> +
> +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> new file mode 100644
> index 0000000..b180356
> --- /dev/null
> +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> @@ -0,0 +1,397 @@
> +/*
> + * Copyright © 2020 Intel Corporation.
> + *     Author: Liu Yi L <yi.l.liu@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_pci.c:
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/vgaarb.h>
> +#include <linux/nospec.h>
> +#include <linux/mdev.h>
> +#include <linux/vfio_pci_common.h>
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> +
> +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> +
> +static char ids[1024] __initdata;
> +module_param_string(ids, ids, sizeof(ids), 0);
> +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> +
> +static bool nointxmask;
> +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(nointxmask,
> +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> +
> +#ifdef CONFIG_VFIO_PCI_VGA
> +static bool disable_vga;
> +module_param(disable_vga, bool, S_IRUGO);
> +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> +#endif
> +
> +static bool disable_idle_d3;
> +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(disable_idle_d3,
> +		 "Disable using the PCI D3 low power state for idle, unused devices");
> +
> +static struct pci_driver vfio_mdev_pci_driver;
> +
> +static ssize_t
> +name_show(struct kobject *kobj, struct device *dev, char *buf)
> +{
> +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> +}
> +
> +MDEV_TYPE_ATTR_RO(name);
> +
> +static ssize_t
> +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> +{
> +	return sprintf(buf, "%d\n", 1);
> +}
> +
> +MDEV_TYPE_ATTR_RO(available_instances);
> +
> +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> +		char *buf)
> +{
> +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> +}
> +
> +MDEV_TYPE_ATTR_RO(device_api);
> +
> +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> +	&mdev_type_attr_name.attr,
> +	&mdev_type_attr_device_api.attr,
> +	&mdev_type_attr_available_instances.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group vfio_mdev_pci_type_group1 = {
> +	.name  = "type1",
> +	.attrs = vfio_mdev_pci_types_attrs,
> +};
> +
> +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> +	&vfio_mdev_pci_type_group1,
> +	NULL,
> +};
> +
> +struct vfio_mdev_pci {
> +	struct vfio_pci_device *vdev;
> +	struct mdev_device *mdev;
> +	unsigned long handle;
> +};
> +
> +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> +{
> +	struct device *pdev;
> +	struct vfio_pci_device *vdev;
> +	struct vfio_mdev_pci *pmdev;
> +	int ret;
> +
> +	pdev = mdev_parent_dev(mdev);
> +	vdev = dev_get_drvdata(pdev);
> +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> +	if (pmdev == NULL) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	pmdev->mdev = mdev;
> +	pmdev->vdev = vdev;
> +	mdev_set_drvdata(mdev, pmdev);
> +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> +	if (ret) {
> +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> +		mdev_set_drvdata(mdev, NULL);
> +		kfree(pmdev);
> +		goto out;
> +	}
> +
> +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> +		     dev_name(mdev_dev(mdev)));
> +out:
> +	return ret;
> +}
> +
> +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> +{
> +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +
> +	kfree(pmdev);
> +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> +		     dev_name(mdev_dev(mdev)));
> +
> +	return 0;
> +}
> +
> +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> +{
> +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +	struct vfio_pci_device *vdev = pmdev->vdev;
> +	int ret = 0;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> +
> +	mutex_lock(&vdev->reflck->lock);
> +
> +	if (!vdev->refcnt) {
> +		ret = vfio_pci_enable(vdev);
> +		if (ret)
> +			goto error;
> +
> +		vfio_spapr_pci_eeh_open(vdev->pdev);
> +	}
> +	vdev->refcnt++;
> +error:
> +	mutex_unlock(&vdev->reflck->lock);
> +	if (!ret)
> +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> +	else {
> +		pr_info("Failed to open mdev: %s on pf: %s\n",
> +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> +		module_put(THIS_MODULE);
> +	}
> +	return ret;
> +}
> +
> +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> +{
> +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +	struct vfio_pci_device *vdev = pmdev->vdev;
> +
> +	pr_info("Release mdev: %s on pf: %s\n",
> +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> +
> +	mutex_lock(&vdev->reflck->lock);
> +
> +	if (!(--vdev->refcnt)) {
> +		vfio_spapr_pci_eeh_release(vdev->pdev);
> +		vfio_pci_disable(vdev);
> +	}
> +
> +	mutex_unlock(&vdev->reflck->lock);
> +
> +	module_put(THIS_MODULE);
> +}

open() and release() here are almost identical between vfio_pci and
vfio_mdev_pci, which suggests maybe there should be common functions to
call into like we do for the below.
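Below is a minimal userspace sketch of such shared helpers. The enable-on-first-open and disable-on-last-release logic moves into common functions. Each leaf module keeps its own try_module_get(THIS_MODULE)/module_put() and logging, because THIS_MODULE must refer to the leaf module. model_common_open() and model_common_release() are hypothetical names. Locking (vdev->reflck->lock) and the EEH hooks are omitted here; a real version would presumably do both inside the common helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state shared by every leaf driver. */
struct model_vdev {
	int refcnt;
	int enable_calls;
	int disable_calls;
	bool fail_enable; /* test knob: make enable fail */
};

/* Stand-ins for vfio_pci_enable()/vfio_pci_disable(). */
static int model_enable(struct model_vdev *v)
{
	v->enable_calls++;
	return v->fail_enable ? -1 : 0;
}

static void model_disable(struct model_vdev *v)
{
	v->disable_calls++;
}

/* Common open: the first opener enables the device. */
static int model_common_open(struct model_vdev *v)
{
	if (!v->refcnt) {
		int ret = model_enable(v);

		if (ret)
			return ret; /* refcnt untouched on failure */
	}
	v->refcnt++;
	return 0;
}

/* Common release: the last closer disables the device. */
static void model_common_release(struct model_vdev *v)
{
	assert(v->refcnt > 0);
	if (!--v->refcnt)
		model_disable(v);
}
```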

> +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> +			     unsigned long arg)
> +{
> +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +
> +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> +}
> +
> +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> +				struct vm_area_struct *vma)
> +{
> +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +
> +	return vfio_pci_mmap(pmdev->vdev, vma);
> +}
> +
> +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +
> +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> +}
> +
> +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> +				const char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +
> +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> +}
> +
> +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> +	.create			= vfio_mdev_pci_create,
> +	.remove			= vfio_mdev_pci_remove,
> +
> +	.open			= vfio_mdev_pci_open,
> +	.release		= vfio_mdev_pci_release,
> +
> +	.read			= vfio_mdev_pci_read,
> +	.write			= vfio_mdev_pci_write,
> +	.mmap			= vfio_mdev_pci_mmap,
> +	.ioctl			= vfio_mdev_pci_ioctl,
> +};
> +
> +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> +				       const struct pci_device_id *id)
> +{
> +	struct vfio_pci_device *vdev;
> +	int ret;
> +
> +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> +		return -EINVAL;
> +
> +	/*
> +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> +	 * userspace instance with VFs and PFs from the same device, which
> +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> +	 * VFs, which would unbind the driver, which is prone to blocking
> +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> +	 * reject these PFs and let the user sort it out.
> +	 */
> +	if (pci_num_vf(pdev)) {
> +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> +		return -EBUSY;
> +	}
> +
> +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->pdev = pdev;
> +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> +	mutex_init(&vdev->igate);
> +	spin_lock_init(&vdev->irqlock);
> +	mutex_init(&vdev->ioeventfds_lock);
> +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> +	vdev->nointxmask = nointxmask;
> +#ifdef CONFIG_VFIO_PCI_VGA
> +	vdev->disable_vga = disable_vga;
> +#endif
> +	vdev->disable_idle_d3 = disable_idle_d3;
> +
> +	pci_set_drvdata(pdev, vdev);
> +
> +	ret = vfio_pci_reflck_attach(vdev);
> +	if (ret) {
> +		pci_set_drvdata(pdev, NULL);
> +		kfree(vdev);
> +		return ret;
> +	}
> +
> +	if (vfio_pci_is_vga(pdev)) {
> +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> +		vga_set_legacy_decoding(pdev,
> +					vfio_pci_set_vga_decode(vdev, false));
> +	}
> +
> +	vfio_pci_probe_power_state(vdev);
> +
> +	if (!vdev->disable_idle_d3) {
> +		/*
> +		 * pci-core sets the device power state to an unknown value at
> +		 * bootup and after being removed from a driver.  The only
> +		 * transition it allows from this unknown state is to D0, which
> +		 * typically happens when a driver calls pci_enable_device().
> +		 * We're not ready to enable the device yet, but we do want to
> +		 * be able to get to D3.  Therefore first do a D0 transition
> +		 * before going to D3.
> +		 */
> +		vfio_pci_set_power_state(vdev, PCI_D0);
> +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> +	}

Ditto here and remove below, this seems like boilerplate that shouldn't
be duplicated per leaf module.  Thanks,

Alex
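Below is a minimal userspace sketch of the probe and remove power handling as shared helpers. As the comment above states, it assumes that the only transition allowed from the unknown post-boot state is to D0. All model_* names are hypothetical. Because both helpers receive the same per-device disable_idle_d3 value, remove stays consistent with probe.

```c
#include <assert.h>
#include <stdbool.h>

enum model_pstate { MODEL_UNKNOWN, MODEL_D0, MODEL_D3HOT };

struct model_pdev {
	enum model_pstate state;
};

/* Assumed rule: from the unknown state, only D0 is reachable. */
static int model_set_power(struct model_pdev *p, enum model_pstate s)
{
	if (p->state == MODEL_UNKNOWN && s != MODEL_D0)
		return -1;
	p->state = s;
	return 0;
}

/* Hypothetical common probe tail shared by vfio-pci and vfio-mdev-pci. */
static void model_common_probe_power(struct model_pdev *p, bool disable_idle_d3)
{
	if (disable_idle_d3)
		return;
	model_set_power(p, MODEL_D0);    /* leave the unknown state first */
	model_set_power(p, MODEL_D3HOT); /* then park the idle device */
}

/* Hypothetical common remove: hand the device back in D0. */
static void model_common_remove_power(struct model_pdev *p, bool disable_idle_d3)
{
	if (!disable_idle_d3)
		model_set_power(p, MODEL_D0);
}
```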


> +
> +	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
> +	if (ret)
> +		pr_err("Cannot register mdev for device %s\n",
> +			dev_name(&pdev->dev));
> +	else
> +		pr_info("Wrap device %s as a mdev\n", dev_name(&pdev->dev));
> +
> +	return ret;
> +}
> +
> +static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_device *vdev;
> +
> +	vdev = pci_get_drvdata(pdev);
> +	if (!vdev)
> +		return;
> +
> +	mdev_unregister_device(&pdev->dev);
> +
> +	vfio_pci_reflck_put(vdev->reflck);
> +
> +	kfree(vdev->region);
> +	mutex_destroy(&vdev->ioeventfds_lock);
> +
> +	if (!vdev->disable_idle_d3)
> +		vfio_pci_set_power_state(vdev, PCI_D0);
> +
> +	kfree(vdev->pm_save);
> +
> +	if (vfio_pci_is_vga(pdev)) {
> +		vga_client_register(pdev, NULL, NULL, NULL);
> +		vga_set_legacy_decoding(pdev,
> +				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
> +				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
> +	}
> +
> +	kfree(vdev);
> +}
> +
> +static struct pci_driver vfio_mdev_pci_driver = {
> +	.name		= VFIO_MDEV_PCI_NAME,
> +	.id_table	= NULL, /* only dynamic ids */
> +	.probe		= vfio_mdev_pci_driver_probe,
> +	.remove		= vfio_mdev_pci_driver_remove,
> +	.err_handler	= &vfio_pci_err_handlers,
> +};
> +
> +static void __exit vfio_mdev_pci_cleanup(void)
> +{
> +	pci_unregister_driver(&vfio_mdev_pci_driver);
> +}
> +
> +static int __init vfio_mdev_pci_init(void)
> +{
> +	int ret;
> +
> +	/* Register and scan for devices */
> +	ret = pci_register_driver(&vfio_mdev_pci_driver);
> +	if (ret)
> +		return ret;
> +
> +	vfio_pci_fill_ids(ids, &vfio_mdev_pci_driver);
> +
> +	return 0;
> +}
> +
> +module_init(vfio_mdev_pci_init);
> +module_exit(vfio_mdev_pci_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);



* Re: [PATCH v4 09/12] vfio: split vfio_pci_private.h into two files
  2020-01-07 12:01 ` [PATCH v4 09/12] vfio: split vfio_pci_private.h into two files Liu Yi L
@ 2020-01-09 22:48   ` Alex Williamson
  2020-01-16 11:59     ` Liu, Yi L
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-09 22:48 UTC (permalink / raw)
  To: Liu Yi L; +Cc: kwankhede, linux-kernel, kvm, kevin.tian, joro, peterx, baolu.lu

[-- Attachment #1: Type: text/plain, Size: 10691 bytes --]

On Tue,  7 Jan 2020 20:01:46 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch splits vfio_pci_private.h into a private header in
> drivers/vfio/pci and a common header under include/linux/, in
> preparation for sharing the vfio_pci common code outside
> drivers/vfio/pci/.
> 
> The common header is shrunk from the previously copied
> vfio_pci_common.h, and the original vfio_pci_private.h is shrunk
> accordingly.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 133 +-----------------------------------
>  include/linux/vfio_pci_common.h     |  86 ++---------------------
>  2 files changed, 7 insertions(+), 212 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 499dd04..c4976a9 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -12,6 +12,7 @@
>  #include <linux/pci.h>
>  #include <linux/irqbypass.h>
>  #include <linux/types.h>
> +#include <linux/vfio_pci_common.h>
>  
>  #ifndef VFIO_PCI_PRIVATE_H
>  #define VFIO_PCI_PRIVATE_H
> @@ -39,121 +40,12 @@ struct vfio_pci_ioeventfd {
>  	int			count;
>  };
>  
> -struct vfio_pci_irq_ctx {
> -	struct eventfd_ctx	*trigger;
> -	struct virqfd		*unmask;
> -	struct virqfd		*mask;
> -	char			*name;
> -	bool			masked;
> -	struct irq_bypass_producer	producer;
> -};

I think this can stay here, vfio_pci_common.h just needs a forward
declaration.
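struct vfio_pci_device only holds a `struct vfio_pci_irq_ctx *ctx` pointer. A forward declaration in the shared header is therefore enough, provided only code under drivers/vfio/pci/ dereferences it. Below is a single-file userspace sketch of the pattern, using hypothetical model_* names, with comments marking the two halves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* --- would live in the shared header --- */
struct model_irq_ctx; /* forward declaration: layout stays private */

struct model_vdev {
	struct model_irq_ctx *ctx; /* pointer to an incomplete type is fine */
	int num_ctx;
};

int model_irq_ctx_alloc(struct model_vdev *v, int n);
bool model_irq_ctx_masked(const struct model_vdev *v, int i);
void model_irq_ctx_free(struct model_vdev *v);

/* --- would live in a file under drivers/vfio/pci/ --- */
struct model_irq_ctx {
	bool masked;
	const char *name;
};

int model_irq_ctx_alloc(struct model_vdev *v, int n)
{
	v->ctx = calloc(n, sizeof(*v->ctx));
	if (!v->ctx)
		return -1;
	v->num_ctx = n;
	return 0;
}

bool model_irq_ctx_masked(const struct model_vdev *v, int i)
{
	assert(i >= 0 && i < v->num_ctx);
	return v->ctx[i].masked;
}

void model_irq_ctx_free(struct model_vdev *v)
{
	free(v->ctx);
	v->ctx = NULL;
	v->num_ctx = 0;
}
```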

> -
> -struct vfio_pci_device;
> -struct vfio_pci_region;
> -
> -struct vfio_pci_regops {
> -	size_t	(*rw)(struct vfio_pci_device *vdev, char __user *buf,
> -		      size_t count, loff_t *ppos, bool iswrite);
> -	void	(*release)(struct vfio_pci_device *vdev,
> -			   struct vfio_pci_region *region);
> -	int	(*mmap)(struct vfio_pci_device *vdev,
> -			struct vfio_pci_region *region,
> -			struct vm_area_struct *vma);
> -	int	(*add_capability)(struct vfio_pci_device *vdev,
> -				  struct vfio_pci_region *region,
> -				  struct vfio_info_cap *caps);
> -};
> -
> -struct vfio_pci_region {
> -	u32				type;
> -	u32				subtype;
> -	const struct vfio_pci_regops	*ops;
> -	void				*data;
> -	size_t				size;
> -	u32				flags;
> -};
> -
>  struct vfio_pci_dummy_resource {
>  	struct resource		resource;
>  	int			index;
>  	struct list_head	res_next;
>  };
>  
> -struct vfio_pci_reflck {
> -	struct kref		kref;
> -	struct mutex		lock;
> -};

I think we can abstract this a little further to make it unnecessary to
put this in common as well.  See attached.
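
A userspace sketch of the idea behind the attached abstract-reflck patch follows. Callers lock and unlock through helpers that take the device, so only the file that defines the refcount/lock structure needs its layout. Several assumptions apply: pthread mutexes stand in for kernel mutexes, and a plain int stands in for `struct kref`. The real code drops the reference with `kref_put_mutex()` under a global `reflck_lock`, which is omitted here. All names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Private definition: callers outside this file never touch these fields. */
struct reflck {
	int kref;               /* simplified, unsynchronized reference count */
	pthread_mutex_t lock;
};

/* Hypothetical stand-in for struct vfio_pci_device. */
struct pdev_state {
	struct reflck *reflck;
	int refcnt;
	int enabled;
};

struct reflck *reflck_alloc(void)
{
	struct reflck *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	r->kref = 1;
	pthread_mutex_init(&r->lock, NULL);
	return r;
}

/* Accessors taking the device, mirroring vfio_pci_reflck_lock/unlock(). */
void pdev_reflck_lock(struct pdev_state *vdev)
{
	pthread_mutex_lock(&vdev->reflck->lock);
}

void pdev_reflck_unlock(struct pdev_state *vdev)
{
	pthread_mutex_unlock(&vdev->reflck->lock);
}

/* Drop this device's reference; free the object on the last put. */
void pdev_reflck_put(struct pdev_state *vdev)
{
	struct reflck *r = vdev->reflck;

	assert(r && r->kref > 0);
	if (--r->kref == 0) {
		pthread_mutex_destroy(&r->lock);
		free(r);
	}
	vdev->reflck = NULL;
}

/* open(): enable on the first user, as in vfio_pci_open(). */
int pdev_open(struct pdev_state *vdev)
{
	pdev_reflck_lock(vdev);
	if (!vdev->refcnt)
		vdev->enabled = 1;      /* stands in for vfio_pci_enable() */
	vdev->refcnt++;
	pdev_reflck_unlock(vdev);
	return 0;
}

/* release(): disable on the last user, as in vfio_pci_release(). */
void pdev_release(struct pdev_state *vdev)
{
	pdev_reflck_lock(vdev);
	if (!(--vdev->refcnt))
		vdev->enabled = 0;      /* stands in for vfio_pci_disable() */
	pdev_reflck_unlock(vdev);
}
```

In this design, `pdev_open()` and `pdev_release()` compile against an incomplete `struct reflck`. This is the property that lets the structure drop out of the common header.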

> -
> -struct vfio_pci_device {
> -	struct pci_dev		*pdev;
> -	void __iomem		*barmap[PCI_STD_NUM_BARS];
> -	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
> -	u8			*pci_config_map;
> -	u8			*vconfig;
> -	struct perm_bits	*msi_perm;
> -	spinlock_t		irqlock;
> -	struct mutex		igate;
> -	struct vfio_pci_irq_ctx	*ctx;
> -	int			num_ctx;
> -	int			irq_type;
> -	int			num_regions;
> -	struct vfio_pci_region	*region;
> -	u8			msi_qmax;
> -	u8			msix_bar;
> -	u16			msix_size;
> -	u32			msix_offset;
> -	u32			rbar[7];
> -	bool			pci_2_3;
> -	bool			virq_disabled;
> -	bool			reset_works;
> -	bool			extended_caps;
> -	bool			bardirty;
> -	bool			has_vga;
> -	bool			needs_reset;
> -	bool			nointx;
> -	bool			needs_pm_restore;
> -	struct pci_saved_state	*pci_saved_state;
> -	struct pci_saved_state	*pm_save;
> -	struct vfio_pci_reflck	*reflck;
> -	int			refcnt;
> -	int			ioeventfds_nr;
> -	struct eventfd_ctx	*err_trigger;
> -	struct eventfd_ctx	*req_trigger;
> -	struct list_head	dummy_resources_list;
> -	struct mutex		ioeventfds_lock;
> -	struct list_head	ioeventfds_list;
> -	bool			nointxmask;
> -#ifdef CONFIG_VFIO_PCI_VGA
> -	bool			disable_vga;
> -#endif
> -	bool			disable_idle_d3;
> -};
> -
> -#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> -#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
> -#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> -#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
> -#define irq_is(vdev, type) (vdev->irq_type == type)

I think these can stay in the private header too.

> -
> -extern const struct pci_error_handlers vfio_err_handlers;
> -
> -static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> -{
> -	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> -}
> -
> -static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
> -{
> -#ifdef CONFIG_VFIO_PCI_VGA
> -	return vdev->disable_vga;
> -#else
> -	return true;
> -#endif
> -}

vfio_vga_disabled() is only used in vfio_pci_common.c; I think it can
remain in the private header.

> -
> -extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
> -				bool nointxmask, bool disable_idle_d3);
> -
>  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>  
> @@ -180,29 +72,6 @@ extern void vfio_pci_uninit_perm_bits(void);
>  extern int vfio_config_init(struct vfio_pci_device *vdev);
>  extern void vfio_config_free(struct vfio_pci_device *vdev);
>  
> -extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
> -					unsigned int type, unsigned int subtype,
> -					const struct vfio_pci_regops *ops,
> -					size_t size, u32 flags, void *data);
> -
> -extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
> -				    pci_power_t state);
> -extern unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga);
> -extern int vfio_pci_enable(struct vfio_pci_device *vdev);
> -extern void vfio_pci_disable(struct vfio_pci_device *vdev);
> -extern long vfio_pci_ioctl(void *device_data,
> -			unsigned int cmd, unsigned long arg);
> -extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
> -			size_t count, loff_t *ppos);
> -extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
> -			size_t count, loff_t *ppos);
> -extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
> -extern void vfio_pci_request(void *device_data, unsigned int count);
> -extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
> -extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
> -extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
> -extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
> -
>  #ifdef CONFIG_VFIO_PCI_IGD
>  extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
>  #else
> diff --git a/include/linux/vfio_pci_common.h b/include/linux/vfio_pci_common.h
> index 499dd04..862cd80 100644
> --- a/include/linux/vfio_pci_common.h
> +++ b/include/linux/vfio_pci_common.h
> @@ -1,5 +1,8 @@
>  /* SPDX-License-Identifier: GPL-2.0-only */
>  /*
> + * VFIO PCI API definition
> + *
> + * Derived from original vfio/pci/vfio_pci_private.h:
>   * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
>   *     Author: Alex Williamson <alex.williamson@redhat.com>
>   *
> @@ -13,31 +16,8 @@
>  #include <linux/irqbypass.h>
>  #include <linux/types.h>
>  
> -#ifndef VFIO_PCI_PRIVATE_H
> -#define VFIO_PCI_PRIVATE_H
> -
> -#define VFIO_PCI_OFFSET_SHIFT   40
> -
> -#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> -
> -/* Special capability IDs predefined access */
> -#define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
> -#define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
> -
> -/* Cap maximum number of ioeventfds per device (arbitrary) */
> -#define VFIO_PCI_IOEVENTFD_MAX		1000
> -
> -struct vfio_pci_ioeventfd {
> -	struct list_head	next;
> -	struct virqfd		*virqfd;
> -	void __iomem		*addr;
> -	uint64_t		data;
> -	loff_t			pos;
> -	int			bar;
> -	int			count;
> -};
> +#ifndef VFIO_PCI_COMMON_H
> +#define VFIO_PCI_COMMON_H
>  
>  struct vfio_pci_irq_ctx {
>  	struct eventfd_ctx	*trigger;
> @@ -73,12 +53,6 @@ struct vfio_pci_region {
>  	u32				flags;
>  };
>  
> -struct vfio_pci_dummy_resource {
> -	struct resource		resource;
> -	int			index;
> -	struct list_head	res_next;
> -};
> -
>  struct vfio_pci_reflck {
>  	struct kref		kref;
>  	struct mutex		lock;
> @@ -154,32 +128,6 @@ static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
>  extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
>  				bool nointxmask, bool disable_idle_d3);
>  
> -extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
> -extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
> -
> -extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
> -				   uint32_t flags, unsigned index,
> -				   unsigned start, unsigned count, void *data);
> -
> -extern ssize_t vfio_pci_config_rw(struct vfio_pci_device *vdev,
> -				  char __user *buf, size_t count,
> -				  loff_t *ppos, bool iswrite);
> -
> -extern ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
> -			       size_t count, loff_t *ppos, bool iswrite);
> -
> -extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
> -			       size_t count, loff_t *ppos, bool iswrite);
> -
> -extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
> -			       uint64_t data, int count, int fd);
> -
> -extern int vfio_pci_init_perm_bits(void);
> -extern void vfio_pci_uninit_perm_bits(void);
> -
> -extern int vfio_config_init(struct vfio_pci_device *vdev);
> -extern void vfio_config_free(struct vfio_pci_device *vdev);
> -
>  extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
>  					unsigned int type, unsigned int subtype,
>  					const struct vfio_pci_regops *ops,
> @@ -203,26 +151,4 @@ extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
>  extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
>  extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
>  
> -#ifdef CONFIG_VFIO_PCI_IGD
> -extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
> -#else
> -static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
> -{
> -	return -ENODEV;
> -}
> -#endif
> -#ifdef CONFIG_VFIO_PCI_NVLINK2
> -extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
> -extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
> -#else
> -static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
> -{
> -	return -ENODEV;
> -}
> -
> -static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
> -{
> -	return -ENODEV;
> -}
> -#endif
> -#endif /* VFIO_PCI_PRIVATE_H */
> +#endif /* VFIO_PCI_COMMON_H */


[-- Attachment #2: return-to-private --]
[-- Type: application/octet-stream, Size: 3419 bytes --]

These don't seem to be necessary in the common header

From: Alex Williamson <alex.williamson@redhat.com>


---
 drivers/vfio/pci/vfio_pci_private.h |   29 +++++++++++++++++++++++++++++
 include/linux/vfio_pci_common.h     |   31 ++-----------------------------
 2 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index c4976a948aaa..bf1995cf417d 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -40,12 +40,41 @@ struct vfio_pci_ioeventfd {
 	int			count;
 };
 
+struct vfio_pci_irq_ctx {
+	struct eventfd_ctx	*trigger;
+	struct virqfd		*unmask;
+	struct virqfd		*mask;
+	char			*name;
+	bool			masked;
+	struct irq_bypass_producer	producer;
+};
+
 struct vfio_pci_dummy_resource {
 	struct resource		resource;
 	int			index;
 	struct list_head	res_next;
 };
 
+struct vfio_pci_reflck {
+	struct kref		kref;
+	struct mutex		lock;
+};
+
+#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
+#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
+#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
+#define irq_is(vdev, type) (vdev->irq_type == type)
+
+static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
+{
+#ifdef CONFIG_VFIO_PCI_VGA
+	return vdev->disable_vga;
+#else
+	return true;
+#endif
+}
+
 extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
 
diff --git a/include/linux/vfio_pci_common.h b/include/linux/vfio_pci_common.h
index 439666a8ce7a..fa572d388111 100644
--- a/include/linux/vfio_pci_common.h
+++ b/include/linux/vfio_pci_common.h
@@ -19,17 +19,10 @@
 #ifndef VFIO_PCI_COMMON_H
 #define VFIO_PCI_COMMON_H
 
-struct vfio_pci_irq_ctx {
-	struct eventfd_ctx	*trigger;
-	struct virqfd		*unmask;
-	struct virqfd		*mask;
-	char			*name;
-	bool			masked;
-	struct irq_bypass_producer	producer;
-};
-
+struct vfio_pci_irq_ctx;
 struct vfio_pci_device;
 struct vfio_pci_region;
+struct vfio_pci_reflck;
 
 struct vfio_pci_regops {
 	size_t	(*rw)(struct vfio_pci_device *vdev, char __user *buf,
@@ -53,11 +46,6 @@ struct vfio_pci_region {
 	u32				flags;
 };
 
-struct vfio_pci_reflck {
-	struct kref		kref;
-	struct mutex		lock;
-};
-
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
@@ -103,12 +91,6 @@ struct vfio_pci_device {
 	bool			disable_idle_d3;
 };
 
-#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
-#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
-#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
-#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
-#define irq_is(vdev, type) (vdev->irq_type == type)
-
 extern const struct pci_error_handlers vfio_pci_err_handlers;
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
@@ -116,15 +98,6 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
-static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
-{
-#ifdef CONFIG_VFIO_PCI_VGA
-	return vdev->disable_vga;
-#else
-	return true;
-#endif
-}
-
 extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
 				bool nointxmask, bool disable_idle_d3);
 

[-- Attachment #3: abstract-reflck --]
[-- Type: application/octet-stream, Size: 5195 bytes --]

Make it (more) abstract

From: Alex Williamson <alex.williamson@redhat.com>


---
 drivers/vfio/pci/vfio_pci.c           |   10 +++++-----
 drivers/vfio/pci/vfio_pci_common.c    |   17 +++++++++++++++--
 include/linux/vfio_pci_common.h       |    4 +++-
 samples/vfio-mdev-pci/vfio_mdev_pci.c |   10 +++++-----
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 704766714c11..1e9d6e4e9c81 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -58,14 +58,14 @@ static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 
-	mutex_lock(&vdev->reflck->lock);
+	vfio_pci_reflck_lock(vdev);
 
 	if (!(--vdev->refcnt)) {
 		vfio_spapr_pci_eeh_release(vdev->pdev);
 		vfio_pci_disable(vdev);
 	}
 
-	mutex_unlock(&vdev->reflck->lock);
+	vfio_pci_reflck_unlock(vdev);
 
 	module_put(THIS_MODULE);
 }
@@ -80,7 +80,7 @@ static int vfio_pci_open(void *device_data)
 
 	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
 
-	mutex_lock(&vdev->reflck->lock);
+	vfio_pci_reflck_lock(vdev);
 
 	if (!vdev->refcnt) {
 		ret = vfio_pci_enable(vdev);
@@ -91,7 +91,7 @@ static int vfio_pci_open(void *device_data)
 	}
 	vdev->refcnt++;
 error:
-	mutex_unlock(&vdev->reflck->lock);
+	vfio_pci_reflck_unlock(vdev);
 	if (ret)
 		module_put(THIS_MODULE);
 	return ret;
@@ -200,7 +200,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	if (!vdev)
 		return;
 
-	vfio_pci_reflck_put(vdev->reflck);
+	vfio_pci_reflck_put(vdev);
 
 	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
 	kfree(vdev->region);
diff --git a/drivers/vfio/pci/vfio_pci_common.c b/drivers/vfio/pci/vfio_pci_common.c
index edda7e4dc2e7..c0462799fc8d 100644
--- a/drivers/vfio/pci/vfio_pci_common.c
+++ b/drivers/vfio/pci/vfio_pci_common.c
@@ -1258,6 +1258,18 @@ EXPORT_SYMBOL_GPL(vfio_pci_err_handlers);
 
 static DEFINE_MUTEX(reflck_lock);
 
+void vfio_pci_reflck_lock(struct vfio_pci_device *vdev)
+{
+	mutex_lock(&vdev->reflck->lock);
+}
+EXPORT_SYMBOL(vfio_pci_reflck_lock);
+
+void vfio_pci_reflck_unlock(struct vfio_pci_device *vdev)
+{
+	mutex_unlock(&vdev->reflck->lock);
+}
+EXPORT_SYMBOL(vfio_pci_reflck_unlock);
+
 static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
 {
 	struct vfio_pci_reflck *reflck;
@@ -1333,9 +1345,10 @@ static void vfio_pci_reflck_release(struct kref *kref)
 	mutex_unlock(&reflck_lock);
 }
 
-void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+void vfio_pci_reflck_put(struct vfio_pci_device *vdev)
 {
-	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
+	kref_put_mutex(&vdev->reflck->kref,
+		       vfio_pci_reflck_release, &reflck_lock);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_reflck_put);
 
diff --git a/include/linux/vfio_pci_common.h b/include/linux/vfio_pci_common.h
index fa572d388111..8090d5469183 100644
--- a/include/linux/vfio_pci_common.h
+++ b/include/linux/vfio_pci_common.h
@@ -120,8 +120,10 @@ extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
 extern void vfio_pci_request(void *device_data, unsigned int count);
 extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
+extern void vfio_pci_reflck_lock(struct vfio_pci_device *vdev);
+extern void vfio_pci_reflck_unlock(struct vfio_pci_device *vdev);
 extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
-extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+extern void vfio_pci_reflck_put(struct vfio_pci_device *vdev);
 extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
 
 #endif /* VFIO_PCI_COMMON_H */
diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
index b180356bb4ee..c98328cb4e3f 100644
--- a/samples/vfio-mdev-pci/vfio_mdev_pci.c
+++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
@@ -164,7 +164,7 @@ static int vfio_mdev_pci_open(struct mdev_device *mdev)
 
 	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
 
-	mutex_lock(&vdev->reflck->lock);
+	vfio_pci_reflck_lock(vdev);
 
 	if (!vdev->refcnt) {
 		ret = vfio_pci_enable(vdev);
@@ -175,7 +175,7 @@ static int vfio_mdev_pci_open(struct mdev_device *mdev)
 	}
 	vdev->refcnt++;
 error:
-	mutex_unlock(&vdev->reflck->lock);
+	vfio_pci_reflck_unlock(vdev);
 	if (!ret)
 		pr_info("Succeeded to open mdev: %s on pf: %s\n",
 		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
@@ -195,14 +195,14 @@ static void vfio_mdev_pci_release(struct mdev_device *mdev)
 	pr_info("Release mdev: %s on pf: %s\n",
 		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
 
-	mutex_lock(&vdev->reflck->lock);
+	vfio_pci_reflck_lock(vdev);
 
 	if (!(--vdev->refcnt)) {
 		vfio_spapr_pci_eeh_release(vdev->pdev);
 		vfio_pci_disable(vdev);
 	}
 
-	mutex_unlock(&vdev->reflck->lock);
+	vfio_pci_reflck_unlock(vdev);
 
 	module_put(THIS_MODULE);
 }
@@ -341,7 +341,7 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 	if (!vdev)
 		return;
 
-	vfio_pci_reflck_put(vdev->reflck);
+	vfio_pci_reflck_put(vdev);
 
 	kfree(vdev->region);
 	mutex_destroy(&vdev->ioeventfds_lock);

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* RE: [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
  2020-01-09 22:48   ` Alex Williamson
@ 2020-01-10  7:35     ` Liu, Yi L
  0 siblings, 0 replies; 44+ messages in thread
From: Liu, Yi L @ 2020-01-10  7:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx, baolu.lu

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, January 10, 2020 6:48 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c
> 
> On Tue,  7 Jan 2020 20:01:40 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch replaces the vfio_pci_driver reference in vfio_pci.c with
> > pci_dev_driver(vdev->pdev) which is more helpful to make the functions
> > be generic to module types.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 34 ++++++++++++++++++----------------
> >  1 file changed, 18 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 009d2df..9140f5e5 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1463,24 +1463,25 @@ static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
> >
> >  static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
> >  {
> > -	struct vfio_pci_reflck **preflck = data;
> > +	struct vfio_pci_device *vdev = data;
> > +	struct vfio_pci_reflck **preflck = &vdev->reflck;
> >  	struct vfio_device *device;
> > -	struct vfio_pci_device *vdev;
> > +	struct vfio_pci_device *tmp;
> >
> >  	device = vfio_device_get_from_dev(&pdev->dev);
> >  	if (!device)
> >  		return 0;
> >
> > -	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
> > +	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
> >  		vfio_device_put(device);
> >  		return 0;
> >  	}
> >
> > -	vdev = vfio_device_data(device);
> > +	tmp = vfio_device_data(device);
> >
> > -	if (vdev->reflck) {
> > -		vfio_pci_reflck_get(vdev->reflck);
> > -		*preflck = vdev->reflck;
> > +	if (tmp->reflck) {
> > +		vfio_pci_reflck_get(tmp->reflck);
> > +		*preflck = tmp->reflck;
> 
> Seems we can do away with preflck entirely with this refactor, this
> simply becomes vdev->reflck = tmp->reflck.  Thanks,

Yes, it is. Will modify it.

> Alex

Thanks,
Yi Liu
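
A userspace sketch of the simplification Alex suggests is shown below. The walk callback now receives the device being attached as its data pointer, so it can assign `vdev->reflck` directly and the `struct vfio_pci_reflck **preflck` indirection goes away. The walker, the struct layouts, and the int reference count are hypothetical stand-ins for `vfio_pci_for_each_slot_or_bus()`, `struct vfio_pci_device`, and `struct kref`. Comparing driver pointers mirrors the `pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)` check in the patch.

```c
#include <assert.h>
#include <stddef.h>

struct driver { const char *name; };   /* stand-in for struct pci_driver */
struct reflck { int kref; };           /* stand-in for struct vfio_pci_reflck */

struct dev {                           /* stand-in for struct vfio_pci_device */
	const struct driver *drv;      /* what pci_dev_driver(pdev) would return */
	struct reflck *reflck;
};

/*
 * Walk callback in the style of vfio_pci_reflck_find(): @data is the
 * device being attached, so no separate struct reflck ** is needed.
 */
int reflck_find(struct dev *tmp, void *data)
{
	struct dev *vdev = data;

	assert(!vdev->reflck);         /* attach runs before a reflck is set */

	/* Only share with devices bound to the same driver as @vdev. */
	if (tmp->drv != vdev->drv)
		return 0;

	/* @vdev itself has no reflck yet, so it is skipped here as well. */
	if (tmp->reflck) {
		tmp->reflck->kref++;   /* vfio_pci_reflck_get() */
		vdev->reflck = tmp->reflck;
		return 1;              /* stop the walk */
	}
	return 0;
}

/* Hypothetical walker: stops at the first nonzero callback result. */
int for_each_dev(struct dev **devs, size_t n,
		 int (*fn)(struct dev *, void *), void *data)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = fn(devs[i], data);

		if (ret)
			return ret;
	}
	return 0;
}
```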

> >  		vfio_device_put(device);
> >  		return 1;
> >  	}
> > @@ -1497,7 +1498,7 @@ static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
> >
> >  	if (pci_is_root_bus(vdev->pdev->bus) ||
> >  	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
> > -					  &vdev->reflck, slot) <= 0)
> > +					  vdev, slot) <= 0)
> >  		vdev->reflck = vfio_pci_reflck_alloc();
> >
> >  	mutex_unlock(&reflck_lock);
> > @@ -1522,6 +1523,7 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
> >
> >  struct vfio_devices {
> >  	struct vfio_device **devices;
> > +	struct vfio_pci_device *vdev;
> >  	int cur_index;
> >  	int max_index;
> >  };
> > @@ -1530,7 +1532,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
> >  {
> >  	struct vfio_devices *devs = data;
> >  	struct vfio_device *device;
> > -	struct vfio_pci_device *vdev;
> > +	struct vfio_pci_device *tmp;
> >
> >  	if (devs->cur_index == devs->max_index)
> >  		return -ENOSPC;
> > @@ -1539,15 +1541,15 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
> >  	if (!device)
> >  		return -EINVAL;
> >
> > -	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
> > +	if (pci_dev_driver(pdev) != pci_dev_driver(devs->vdev->pdev)) {
> >  		vfio_device_put(device);
> >  		return -EBUSY;
> >  	}
> >
> > -	vdev = vfio_device_data(device);
> > +	tmp = vfio_device_data(device);
> >
> >  	/* Fault if the device is not unused */
> > -	if (vdev->refcnt) {
> > +	if (tmp->refcnt) {
> >  		vfio_device_put(device);
> >  		return -EBUSY;
> >  	}
> > @@ -1574,7 +1576,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
> >   */
> >  static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
> >  {
> > -	struct vfio_devices devs = { .cur_index = 0 };
> > +	struct vfio_devices devs = { .vdev = vdev, .cur_index = 0 };
> >  	int i = 0, ret = -EINVAL;
> >  	bool slot = false;
> >  	struct vfio_pci_device *tmp;
> > @@ -1637,7 +1639,7 @@ static void __exit vfio_pci_cleanup(void)
> >  	vfio_pci_uninit_perm_bits();
> >  }
> >
> > -static void __init vfio_pci_fill_ids(char *ids)
> > +static void __init vfio_pci_fill_ids(char *ids, struct pci_driver *driver)
> >  {
> >  	char *p, *id;
> >  	int rc;
> > @@ -1665,7 +1667,7 @@ static void __init vfio_pci_fill_ids(char *ids)
> >  			continue;
> >  		}
> >
> > -		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
> > +		rc = pci_add_dynid(driver, vendor, device,
> >  				   subvendor, subdevice, class, class_mask, 0);
> >  		if (rc)
> >  			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
> > @@ -1692,7 +1694,7 @@ static int __init vfio_pci_init(void)
> >  	if (ret)
> >  		goto out_driver;
> >
> > -	vfio_pci_fill_ids(ids);
> > +	vfio_pci_fill_ids(ids, &vfio_pci_driver);
> >
> >  	return 0;
> >
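
With the `vfio_pci_fill_ids()` change, the target driver is passed explicitly instead of the function referring to the file-scope `vfio_pci_driver`. Below is a userspace sketch of that shape. It assumes a hypothetical `struct driver` with a fixed-size ID table in place of `struct pci_driver` and `pci_add_dynid()`. The `vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]` entry format, with entries separated by commas, follows vfio-pci's `ids=` module parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ANY_ID     (~0u)       /* stand-in for PCI_ANY_ID */
#define MAX_DYNIDS 8

struct dynid {
	unsigned int vendor, device, subvendor, subdevice, class, class_mask;
};

/* Hypothetical stand-in for struct pci_driver's dynamic ID list. */
struct driver {
	const char *name;
	struct dynid ids[MAX_DYNIDS];
	int nr_ids;
};

/* Stand-in for pci_add_dynid(): records the ID on the given driver. */
static int add_dynid(struct driver *drv, const struct dynid *id)
{
	if (drv->nr_ids == MAX_DYNIDS)
		return -1;
	drv->ids[drv->nr_ids++] = *id;
	return 0;
}

/*
 * Parse comma-separated ID entries and add each one to @drv rather than
 * to a file-scope driver object.  @ids is modified in place.
 */
void fill_ids(char *ids, struct driver *drv)
{
	char *id;

	assert(ids && drv);
	for (id = strtok(ids, ","); id; id = strtok(NULL, ",")) {
		struct dynid d = { .subvendor = ANY_ID, .subdevice = ANY_ID };
		int fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
				    &d.vendor, &d.device, &d.subvendor,
				    &d.subdevice, &d.class, &d.class_mask);

		if (fields < 2) {
			fprintf(stderr, "invalid id string \"%s\"\n", id);
			continue;
		}
		if (add_dynid(drv, &d))
			fprintf(stderr, "%s: failed to add dynamic id %04x:%04x\n",
				drv->name, d.vendor, d.device);
	}
}
```

The same parser can then serve both vfio-pci and the mdev sample driver, each passing its own driver object.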


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file
  2020-01-07 12:01 ` [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file Liu Yi L
@ 2020-01-15 10:43   ` Cornelia Huck
  2020-01-16 12:46     ` Liu, Yi L
  0 siblings, 1 reply; 44+ messages in thread
From: Cornelia Huck @ 2020-01-15 10:43 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, kevin.tian, joro,
	peterx, baolu.lu

On Tue,  7 Jan 2020 20:01:39 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch moves two inline functions to vfio_pci_private.h for further
> sharing across source files. Also avoids below compiling error in further
> code split.
> 
> "error: inlining failed in call to always_inline ‘vfio_pci_is_vga’:
> function body not available".

"We want to use these functions from other files, so move them to a
header" seems to be justification enough; why mention the compilation
error?

> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         | 14 --------------
>  drivers/vfio/pci/vfio_pci_private.h | 14 ++++++++++++++
>  2 files changed, 14 insertions(+), 14 deletions(-)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 04/12] vfio_pci: make common functions be extern
  2020-01-07 12:01 ` [PATCH v4 04/12] vfio_pci: make common functions be extern Liu Yi L
@ 2020-01-15 10:56   ` Cornelia Huck
  2020-01-16 12:48     ` Liu, Yi L
  0 siblings, 1 reply; 44+ messages in thread
From: Cornelia Huck @ 2020-01-15 10:56 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, kevin.tian, joro,
	peterx, baolu.lu

On Tue,  7 Jan 2020 20:01:41 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch makes the common functions (module agnostic functions) in
> vfio_pci.c to be extern. So that such functions could be moved to a
> common source file.
> 
> *) vfio_pci_set_vga_decode
> *) vfio_pci_probe_power_state
> *) vfio_pci_set_power_state
> *) vfio_pci_enable
> *) vfio_pci_disable
> *) vfio_pci_refresh_config
> *) vfio_pci_register_dev_region
> *) vfio_pci_ioctl
> *) vfio_pci_read
> *) vfio_pci_write
> *) vfio_pci_mmap
> *) vfio_pci_request
> *) vfio_pci_err_handlers
> *) vfio_pci_reflck_attach
> *) vfio_pci_reflck_put
> *) vfio_pci_fill_ids

I find it a bit hard to understand what "module agnostic functions" are
supposed to be. The functions you want to move seem to be some "basic"
functions that can be shared between normal vfio-pci and
vfio-mdev-pci... maybe talk about "functions that provide basic vfio
functionality for pci devices" and also mention the mdev part?

[My rationale behind complaining about the commit messages is that if I
look at this change in a year from now, I want to be able to know why
and to what end that change was made.]

> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         | 30 +++++++++++++-----------------
>  drivers/vfio/pci/vfio_pci_private.h | 15 +++++++++++++++
>  2 files changed, 28 insertions(+), 17 deletions(-)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 05/12] vfio_pci: duplicate vfio_pci.c
  2020-01-07 12:01 ` [PATCH v4 05/12] vfio_pci: duplicate vfio_pci.c Liu Yi L
@ 2020-01-15 11:03   ` Cornelia Huck
  2020-01-15 15:12     ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Cornelia Huck @ 2020-01-15 11:03 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, kevin.tian, joro,
	peterx, baolu.lu

On Tue,  7 Jan 2020 20:01:42 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch has no code change, just a file copy. In following patches,
> vfio_pci_common.c will be modified to only include the common functions
> and related static functions in original vfio_pci.c. Meanwhile, vfio_pci.c
> will be modified to only include vfio-pci module specific codes.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_common.c | 1708 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 1708 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_common.c

This whole procedure of "let's copy the file and rip out unneeded stuff
later" looks very ugly to me, especially if I'd come across it in the
future, e.g. during a bisect. This patch only adds a file that is not
compiled, and later changes will be "rip out unwanted stuff from
vfio_pci_common.c" instead of the more positive "move common stuff to
vfio_pci_common.c". I think refactoring/moving interfaces/code that it
makes sense to share makes this more reviewable, both now and in the
future.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-07 12:01 ` [PATCH v4 11/12] samples: add vfio-mdev-pci driver Liu Yi L
  2020-01-09 22:48   ` Alex Williamson
@ 2020-01-15 12:30   ` Cornelia Huck
  2020-01-16 13:23     ` Liu, Yi L
  1 sibling, 1 reply; 44+ messages in thread
From: Cornelia Huck @ 2020-01-15 12:30 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, kevin.tian, joro,
	peterx, baolu.lu, Masahiro Yamada

On Tue,  7 Jan 2020 20:01:48 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds sample driver named vfio-mdev-pci. It is to wrap
> a PCI device as a mediated device. For a pci device, once bound
> to vfio-mdev-pci driver, user space access of this device will
> go through vfio mdev framework. The usage of the device follows
> mdev management method. e.g. user should create a mdev before
> exposing the device to user-space.
> 
> Benefit of this new driver would be acting as a sample driver
> for recent changes from "vfio/mdev: IOMMU aware mediated device"
> patchset. Also it could be a good experiment driver for future
> device specific mdev migration support. This sample driver only
> supports singleton iommu groups, for non-singleton iommu groups,
> this sample driver doesn't work. It will fail when trying to assign
> the non-singleton iommu group to VMs.
> 
> To use this driver:
> a) build and load vfio-mdev-pci.ko module
>    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
>    then load it with following command:
>    > sudo modprobe vfio
>    > sudo modprobe vfio-pci
>    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko
> 
> b) unbind original device driver
>    e.g. use following command to unbind its original driver
>    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> 
> c) bind vfio-mdev-pci driver to the physical device
>    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> 
> d) check the supported mdev instances
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
>      vfio-mdev-pci-type_name
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
>      vfio-mdev-pci-type_name/
>      available_instances  create  device_api  devices  name
> 
> e) create mdev on this physical device (only 1 instance)
>    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
>      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
>      vfio-mdev-pci-type_name/create
> 
> f) passthru the mdev to guest
>    add the following line in QEMU boot command
>     -device vfio-pci,\
>      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> 
> g) destroy mdev
>    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
>      remove

I think much/most of those instructions should go (additionally) into
the sample driver source. Otherwise, it's not clear to the reader why
they should wrap the device in mdev instead of simply using a normal
vfio-pci device.
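
Step e) of the instructions can also be performed from C rather than the shell, as sketched below. The sketch assumes the sysfs layout shown in the steps, and the helper names are hypothetical. The write only succeeds as root and with vfio-mdev-pci bound to the parent device; otherwise `mdev_create()` simply returns -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs "create" path for parent device @bdf and mdev type
 * @type.  Returns 0 on success, -1 if @buf is too small.
 */
int mdev_create_path(char *buf, size_t len, const char *bdf, const char *type)
{
	int n;

	assert(buf && bdf && type);
	n = snprintf(buf, len,
		     "/sys/bus/pci/devices/%s/mdev_supported_types/%s/create",
		     bdf, type);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Equivalent of: echo $uuid > .../mdev_supported_types/$type/create */
int mdev_create(const char *bdf, const char *type, const char *uuid)
{
	char path[256];
	FILE *f;
	int ret = 0;

	/* A canonical textual UUID is 36 characters long. */
	if (strlen(uuid) != 36 || mdev_create_path(path, sizeof(path), bdf, type))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%s\n", uuid) < 0)
		ret = -1;
	/* sysfs store errors surface when the buffered write is flushed */
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

For example, `mdev_create("0000:01:00.0", "vfio-mdev-pci-type_name", "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003")` mirrors step e). The BDF here is only an example.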

> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  samples/Kconfig                       |  10 +
>  samples/Makefile                      |   1 +
>  samples/vfio-mdev-pci/Makefile        |   4 +
>  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
>  4 files changed, 412 insertions(+)
>  create mode 100644 samples/vfio-mdev-pci/Makefile
>  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> 
> diff --git a/samples/Kconfig b/samples/Kconfig
> index 9d236c3..50d207c 100644
> --- a/samples/Kconfig
> +++ b/samples/Kconfig
> @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
>  	help
>  	  Build a sample program to work with mei device.
>  
> +config SAMPLE_VFIO_MDEV_PCI
> +	tristate "Sample driver for wrapping PCI device as a mdev"
> +	select VFIO_PCI_COMMON
> +	select VFIO_PCI

Why does this still need to select VFIO_PCI? Shouldn't all needed
infrastructure rather be covered by VFIO_PCI_COMMON already?

> +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE

VFIO_MDEV_DEVICE already depends on VFIO_MDEV. But maybe also make this
depend on PCI?

> +	help
> +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> +	  this driver, device passthru should through mdev path.

"A PCI device bound to this driver will be assigned through the
mediated device framework."

?

> +
> +	  If you don't know what to do here, say N.
>  
>  endif # SAMPLES


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 05/12] vfio_pci: duplicate vfio_pci.c
  2020-01-15 11:03   ` Cornelia Huck
@ 2020-01-15 15:12     ` Alex Williamson
  0 siblings, 0 replies; 44+ messages in thread
From: Alex Williamson @ 2020-01-15 15:12 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Liu Yi L, kwankhede, linux-kernel, kvm, kevin.tian, joro, peterx,
	baolu.lu

On Wed, 15 Jan 2020 12:03:00 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue,  7 Jan 2020 20:01:42 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch has no code change, just a file copy. In following patches,
> > vfio_pci_common.c will be modified to only include the common functions
> > and related static functions in original vfio_pci.c. Meanwhile, vfio_pci.c
> > will be modified to only include vfio-pci module specific codes.
> > 
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_common.c | 1708 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 1708 insertions(+)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_common.c  
> 
> This whole procedure of "let's copy the file and rip out unneeded stuff
> later" looks very ugly to me, especially if I'd come across it in the
> future, e.g. during a bisect. This patch only adds a file that is not
> compiled, and later changes will be "rip out unwanted stuff from
> vfio_pci_common.c" instead of the more positive "move common stuff to
> vfio_pci_common.c". I think refactoring/moving interfaces/code that it
> makes sense to share makes this more reviewable, both now and in the
> future.

I think this comes largely at my request from previous reviews.  It's
very easy to apply this patch and diff the files to see that nothing
has changed, then review the subsequent patch to see that code is only
added or removed to check that there are no actual code changes.  If we
just selectively move code then I think it's left to our inspection to
verify nothing has changed.  Maybe this is a dummy step in a bisect,
but I don't see that you lose any granularity.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v4 09/12] vfio: split vfio_pci_private.h into two files
  2020-01-09 22:48   ` Alex Williamson
@ 2020-01-16 11:59     ` Liu, Yi L
  0 siblings, 0 replies; 44+ messages in thread
From: Liu, Yi L @ 2020-01-16 11:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx, baolu.lu

Thanks, Alex. All four comments accepted. :-) Will apply your suggested
patch in the new version. :-)

Regards,
Yi Liu

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, January 10, 2020 6:49 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 09/12] vfio: split vfio_pci_private.h into two files
> 
> On Tue,  7 Jan 2020 20:01:46 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch splits the vfio_pci_private.h to be a private file in
> > drivers/vfio/pci and a common file under include/linux/. It is a
> > preparation for supporting vfio_pci common code sharing outside
> > drivers/vfio/pci/.
> >
> > The common header file is shrunk from the previous copied
> > vfio_pci_common.h. The original vfio_pci_private.h is shrunk
> > accordingly as well.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_private.h | 133 +-----------------------------------
> >  include/linux/vfio_pci_common.h     |  86 ++---------------------
> >  2 files changed, 7 insertions(+), 212 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> > index 499dd04..c4976a9 100644
> > --- a/drivers/vfio/pci/vfio_pci_private.h
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -12,6 +12,7 @@
> >  #include <linux/pci.h>
> >  #include <linux/irqbypass.h>
> >  #include <linux/types.h>
> > +#include <linux/vfio_pci_common.h>
> >
> >  #ifndef VFIO_PCI_PRIVATE_H
> >  #define VFIO_PCI_PRIVATE_H
> > @@ -39,121 +40,12 @@ struct vfio_pci_ioeventfd {
> >  	int			count;
> >  };
> >
> > -struct vfio_pci_irq_ctx {
> > -	struct eventfd_ctx	*trigger;
> > -	struct virqfd		*unmask;
> > -	struct virqfd		*mask;
> > -	char			*name;
> > -	bool			masked;
> > -	struct irq_bypass_producer	producer;
> > -};
> 
> I think this can stay here, vfio_pci_common.h just needs a forward declaration.
> 
> > -
> > -struct vfio_pci_device;
> > -struct vfio_pci_region;
> > -
> > -struct vfio_pci_regops {
> > -	size_t	(*rw)(struct vfio_pci_device *vdev, char __user *buf,
> > -		      size_t count, loff_t *ppos, bool iswrite);
> > -	void	(*release)(struct vfio_pci_device *vdev,
> > -			   struct vfio_pci_region *region);
> > -	int	(*mmap)(struct vfio_pci_device *vdev,
> > -			struct vfio_pci_region *region,
> > -			struct vm_area_struct *vma);
> > -	int	(*add_capability)(struct vfio_pci_device *vdev,
> > -				  struct vfio_pci_region *region,
> > -				  struct vfio_info_cap *caps);
> > -};
> > -
> > -struct vfio_pci_region {
> > -	u32				type;
> > -	u32				subtype;
> > -	const struct vfio_pci_regops	*ops;
> > -	void				*data;
> > -	size_t				size;
> > -	u32				flags;
> > -};
> > -
> >  struct vfio_pci_dummy_resource {
> >  	struct resource		resource;
> >  	int			index;
> >  	struct list_head	res_next;
> >  };
> >
> > -struct vfio_pci_reflck {
> > -	struct kref		kref;
> > -	struct mutex		lock;
> > -};
> 
> I think we can abstract this a little further to make it unnecessary to put this in
> common as well.  See attached.
> 
> > -
> > -struct vfio_pci_device {
> > -	struct pci_dev		*pdev;
> > -	void __iomem		*barmap[PCI_STD_NUM_BARS];
> > -	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
> > -	u8			*pci_config_map;
> > -	u8			*vconfig;
> > -	struct perm_bits	*msi_perm;
> > -	spinlock_t		irqlock;
> > -	struct mutex		igate;
> > -	struct vfio_pci_irq_ctx	*ctx;
> > -	int			num_ctx;
> > -	int			irq_type;
> > -	int			num_regions;
> > -	struct vfio_pci_region	*region;
> > -	u8			msi_qmax;
> > -	u8			msix_bar;
> > -	u16			msix_size;
> > -	u32			msix_offset;
> > -	u32			rbar[7];
> > -	bool			pci_2_3;
> > -	bool			virq_disabled;
> > -	bool			reset_works;
> > -	bool			extended_caps;
> > -	bool			bardirty;
> > -	bool			has_vga;
> > -	bool			needs_reset;
> > -	bool			nointx;
> > -	bool			needs_pm_restore;
> > -	struct pci_saved_state	*pci_saved_state;
> > -	struct pci_saved_state	*pm_save;
> > -	struct vfio_pci_reflck	*reflck;
> > -	int			refcnt;
> > -	int			ioeventfds_nr;
> > -	struct eventfd_ctx	*err_trigger;
> > -	struct eventfd_ctx	*req_trigger;
> > -	struct list_head	dummy_resources_list;
> > -	struct mutex		ioeventfds_lock;
> > -	struct list_head	ioeventfds_list;
> > -	bool			nointxmask;
> > -#ifdef CONFIG_VFIO_PCI_VGA
> > -	bool			disable_vga;
> > -#endif
> > -	bool			disable_idle_d3;
> > -};
> > -
> > -#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> > -#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
> > -#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> > -#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
> > -#define irq_is(vdev, type) (vdev->irq_type == type)
> 
> I think these can stay in the private header too.
> 
> > -
> > -extern const struct pci_error_handlers vfio_err_handlers;
> > -
> > -static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> > -{
> > -	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> > -}
> > -
> > -static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
> > -{
> > -#ifdef CONFIG_VFIO_PCI_VGA
> > -	return vdev->disable_vga;
> > -#else
> > -	return true;
> > -#endif
> > -}
> 
> vfio_vga_disabled() is only used in vfio_pci_common.c, I think it can remain in
> private.
> 
> > -
> > -extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
> > -				bool nointxmask, bool disable_idle_d3);
> > -
> >  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
> >  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
> >
> > @@ -180,29 +72,6 @@ extern void vfio_pci_uninit_perm_bits(void);
> >  extern int vfio_config_init(struct vfio_pci_device *vdev);
> >  extern void vfio_config_free(struct vfio_pci_device *vdev);
> >
> > -extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
> > -					unsigned int type, unsigned int subtype,
> > -					const struct vfio_pci_regops *ops,
> > -					size_t size, u32 flags, void *data);
> > -
> > -extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
> > -				    pci_power_t state);
> > -extern unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga);
> > -extern int vfio_pci_enable(struct vfio_pci_device *vdev);
> > -extern void vfio_pci_disable(struct vfio_pci_device *vdev);
> > -extern long vfio_pci_ioctl(void *device_data,
> > -			unsigned int cmd, unsigned long arg);
> > -extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
> > -			size_t count, loff_t *ppos);
> > -extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
> > -			size_t count, loff_t *ppos);
> > -extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
> > -extern void vfio_pci_request(void *device_data, unsigned int count);
> > -extern void vfio_pci_fill_ids(char *ids, struct pci_driver *driver);
> > -extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
> > -extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
> > -extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
> > -
> >  #ifdef CONFIG_VFIO_PCI_IGD
> >  extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
> >  #else
> > diff --git a/include/linux/vfio_pci_common.h b/include/linux/vfio_pci_common.h
> > index 499dd04..862cd80 100644
> > --- a/include/linux/vfio_pci_common.h
> > +++ b/include/linux/vfio_pci_common.h
> > @@ -1,5 +1,8 @@
> >  /* SPDX-License-Identifier: GPL-2.0-only */
> >  /*
> > + * VFIO PCI API definition
> > + *
> > + * Derived from original vfio/pci/vfio_pci_private.h:
> >   * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> >   *     Author: Alex Williamson <alex.williamson@redhat.com>
> >   *
> > @@ -13,31 +16,8 @@
> >  #include <linux/irqbypass.h>
> >  #include <linux/types.h>
> >
> > -#ifndef VFIO_PCI_PRIVATE_H
> > -#define VFIO_PCI_PRIVATE_H
> > -
> > -#define VFIO_PCI_OFFSET_SHIFT   40
> > -
> > -#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
> > -#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> > -#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> > -
> > -/* Special capability IDs predefined access */
> > -#define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
> > -#define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
> > -
> > -/* Cap maximum number of ioeventfds per device (arbitrary) */
> > -#define VFIO_PCI_IOEVENTFD_MAX		1000
> > -
> > -struct vfio_pci_ioeventfd {
> > -	struct list_head	next;
> > -	struct virqfd		*virqfd;
> > -	void __iomem		*addr;
> > -	uint64_t		data;
> > -	loff_t			pos;
> > -	int			bar;
> > -	int			count;
> > -};
> > +#ifndef VFIO_PCI_COMMON_H
> > +#define VFIO_PCI_COMMON_H
> >
> >  struct vfio_pci_irq_ctx {
> >  	struct eventfd_ctx	*trigger;
> > @@ -73,12 +53,6 @@ struct vfio_pci_region {
> >  	u32				flags;
> >  };
> >
> > -struct vfio_pci_dummy_resource {
> > -	struct resource		resource;
> > -	int			index;
> > -	struct list_head	res_next;
> > -};
> > -
> >  struct vfio_pci_reflck {
> >  	struct kref		kref;
> >  	struct mutex		lock;
> > @@ -154,32 +128,6 @@ static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
> >  extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
> >  				bool nointxmask, bool disable_idle_d3);
> >
> > -extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
> > -extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
> > -
> > -extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
> > -				   uint32_t flags, unsigned index,
> > -				   unsigned start, unsigned count, void *data);
> > -
> > -extern ssize_t vfio_pci_config_rw(struct vfio_pci_device *vdev,
> > -				  char __user *buf, size_t count,
> > -				  loff_t *ppos, bool iswrite);
> > -
> > -extern ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
> > -			       size_t count, loff_t *ppos, bool iswrite);
> > -
> > -extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
> > -			       size_t count, loff_t *ppos, bool iswrite);
> > -
> > -extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
> > -			       uint64_t data, int count, int fd);
> > -
> > -extern int vfio_pci_init_perm_bits(void);
> > -extern void vfio_pci_uninit_perm_bits(void);
> > -
> > -extern int vfio_config_init(struct vfio_pci_device *vdev);
> > -extern void vfio_config_free(struct vfio_pci_device *vdev);
> > -
> >  extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
> >  					unsigned int type, unsigned int subtype,
> >  					const struct vfio_pci_regops *ops,
> > @@ -203,26 +151,4 @@ extern int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
> >  extern void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
> >  extern void vfio_pci_probe_power_state(struct vfio_pci_device *vdev);
> >
> > -#ifdef CONFIG_VFIO_PCI_IGD
> > -extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
> > -#else
> > -static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
> > -{
> > -	return -ENODEV;
> > -}
> > -#endif
> > -#ifdef CONFIG_VFIO_PCI_NVLINK2
> > -extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
> > -extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
> > -#else
> > -static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
> > -{
> > -	return -ENODEV;
> > -}
> > -
> > -static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
> > -{
> > -	return -ENODEV;
> > -}
> > -#endif
> > -#endif /* VFIO_PCI_PRIVATE_H */
> > +#endif /* VFIO_PCI_COMMON_H */
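Alex's first comment (keep struct vfio_pci_irq_ctx in the private header and give vfio_pci_common.h only a forward declaration) works because the shared struct vfio_pci_device only stores a pointer to the IRQ contexts. A minimal userspace sketch of that split, assuming simplified stand-ins for both structures; sketch_alloc_irq_ctx() is a hypothetical helper, not part of the patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * What vfio_pci_common.h would carry: only a forward (incomplete)
 * declaration. That is enough because the shared structure holds
 * nothing but a pointer to the IRQ contexts.
 */
struct vfio_pci_irq_ctx;

struct vfio_pci_device {
	struct vfio_pci_irq_ctx *ctx;	/* opaque outside the private code */
	int num_ctx;
};

/*
 * What vfio_pci_private.h would keep: the full definition, needed only
 * by code that allocates or dereferences the contexts (fields trimmed).
 */
struct vfio_pci_irq_ctx {
	char *name;
	bool masked;
};

/* Hypothetical private helper: legal here because the type is complete. */
static int sketch_alloc_irq_ctx(struct vfio_pci_device *vdev, int nvec)
{
	vdev->ctx = calloc(nvec, sizeof(*vdev->ctx));
	if (!vdev->ctx)
		return -ENOMEM;
	vdev->num_ctx = nvec;
	return 0;
}
```

Any file that includes only the common header can pass vdev->ctx around but cannot take sizeof(struct vfio_pci_irq_ctx) or touch its fields, which keeps the layout private to drivers/vfio/pci.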


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module
  2020-01-09 22:48   ` Alex Williamson
@ 2020-01-16 12:19     ` Liu, Yi L
  0 siblings, 0 replies; 44+ messages in thread
From: Liu, Yi L @ 2020-01-16 12:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx, baolu.lu

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, January 10, 2020 6:48 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module
> 
> On Tue,  7 Jan 2020 20:01:38 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds three fields in struct vfio_pci_device to pass the user
> > configurations of vfio-pci.ko module to some functions which could be
> > common in future usage. The values stored in struct vfio_pci_device will
> > be initiated in probe and refreshed in device open phase to allow runtime
> > modifications to parameters. e.g. disable_idle_d3 and nointxmask.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c         | 37 ++++++++++++++++++++++++++-----------
> >  drivers/vfio/pci/vfio_pci_private.h |  8 ++++++++
> >  2 files changed, 34 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 379a02c..af507c2 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -54,10 +54,10 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> >  MODULE_PARM_DESC(disable_idle_d3,
> >  		 "Disable using the PCI D3 low power state for idle, unused devices");
> >
> > -static inline bool vfio_vga_disabled(void)
> > +static inline bool vfio_vga_disabled(struct vfio_pci_device *vdev)
> >  {
> >  #ifdef CONFIG_VFIO_PCI_VGA
> > -	return disable_vga;
> > +	return vdev->disable_vga;
> >  #else
> >  	return true;
> >  #endif
> > @@ -78,7 +78,8 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
> >  	unsigned char max_busnr;
> >  	unsigned int decodes;
> >
> > -	if (single_vga || !vfio_vga_disabled() || pci_is_root_bus(pdev->bus))
> > +	if (single_vga || !vfio_vga_disabled(vdev) ||
> > +		pci_is_root_bus(pdev->bus))
> >  		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
> >  		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
> >
> > @@ -289,7 +290,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >  	if (!vdev->pci_saved_state)
> >  		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
> >
> > -	if (likely(!nointxmask)) {
> > +	if (likely(!vdev->nointxmask)) {
> >  		if (vfio_pci_nointx(pdev)) {
> >  			pci_info(pdev, "Masking broken INTx support\n");
> >  			vdev->nointx = true;
> > @@ -326,7 +327,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >  	} else
> >  		vdev->msix_bar = 0xFF;
> >
> > -	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
> > +	if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
> >  		vdev->has_vga = true;
> >
> >
> > @@ -462,10 +463,17 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
> >
> >  	vfio_pci_try_bus_reset(vdev);
> >
> > -	if (!disable_idle_d3)
> > +	if (!vdev->disable_idle_d3)
> >  		vfio_pci_set_power_state(vdev, PCI_D3hot);
> >  }
> >
> > +void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
> > +			bool nointxmask, bool disable_idle_d3)
> > +{
> > +	vdev->nointxmask = nointxmask;
> > +	vdev->disable_idle_d3 = disable_idle_d3;
> 
> These two are selected (not disable_vga) because they're the only
> writable module options, correct?

Yep. These were selected per your previous review comments. I also
checked the code: of the four existing module options (declared as
below), only nointxmask and disable_idle_d3 are writable (S_IWUSR),
which is why those two are refreshed in vfio_pci_refresh_config().

static char ids[1024] __initdata;
module_param_string(ids, ids, sizeof(ids), 0);

static bool nointxmask;
module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);

static bool disable_vga;
module_param(disable_vga, bool, S_IRUGO);

static bool disable_idle_d3;
module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);

> 
> > +}
> > +
> >  static void vfio_pci_release(void *device_data)
> >  {
> >  	struct vfio_pci_device *vdev = device_data;
> > @@ -490,6 +498,8 @@ static int vfio_pci_open(void *device_data)
> >  	if (!try_module_get(THIS_MODULE))
> >  		return -ENODEV;
> >
> > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > +
> >  	mutex_lock(&vdev->reflck->lock);
> >
> >  	if (!vdev->refcnt) {
> > @@ -1330,6 +1340,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	spin_lock_init(&vdev->irqlock);
> >  	mutex_init(&vdev->ioeventfds_lock);
> >  	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > +	vdev->nointxmask = nointxmask;
> > +#ifdef CONFIG_VFIO_PCI_VGA
> > +	vdev->disable_vga = disable_vga;
> > +#endif
> > +	vdev->disable_idle_d3 = disable_idle_d3;
> 
> But this could still use vfio_pci_refresh_config() for those writable
> options and set disable_vga separately, couldn't it? 

Right, will modify it. Thanks.

> Also, since
> disable_idle_d3 is related to power handling of the device while it is
> not opened by the user, shouldn't the config also be refreshed when the
> device is released by the user?

Oh, yes. You did point this out before, but I assumed we only cared
about the config used during one open()/release() cycle and missed
that it also affects power management. Let me add the config refresh
at release() as well. Thanks.
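Taken together, the two suggestions above amount to calling vfio_pci_refresh_config() at probe, open and release, with the read-only disable_vga copied once at probe. A userspace sketch of that flow, assuming simplified stand-ins for the module parameters and struct vfio_pci_device; the sketch_* functions are hypothetical, and the D3hot decision made on the last release is reduced to a return value:

```c
#include <stdbool.h>

/*
 * Stand-ins for the module parameters: nointxmask and disable_idle_d3
 * are writable at runtime (S_IWUSR), disable_vga is read-only.
 */
static bool nointxmask;
static bool disable_vga;
static bool disable_idle_d3;

struct vfio_pci_device {
	bool nointxmask;
	bool disable_vga;
	bool disable_idle_d3;
	int refcnt;
};

/* Same shape as the helper added in patch 01. */
void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
			     bool nointxmask, bool disable_idle_d3)
{
	vdev->nointxmask = nointxmask;
	vdev->disable_idle_d3 = disable_idle_d3;
}

/* probe: reuse the refresh helper, set the read-only option once. */
static void sketch_probe(struct vfio_pci_device *vdev)
{
	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
	vdev->disable_vga = disable_vga;
}

static void sketch_open(struct vfio_pci_device *vdev)
{
	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
	vdev->refcnt++;
}

/*
 * release: refresh again so the idle power decision taken when the last
 * user goes away sees the current disable_idle_d3 value. Returns true
 * if the device would be put into D3hot.
 */
static bool sketch_release(struct vfio_pci_device *vdev)
{
	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
	return --vdev->refcnt == 0 && !vdev->disable_idle_d3;
}
```

Without the refresh in sketch_release(), a disable_idle_d3 value written while the device was open would only take effect at the next open.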

> 
> >
> >  	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
> >  	if (ret) {
> > @@ -1354,7 +1369,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >
> >  	vfio_pci_probe_power_state(vdev);
> >
> > -	if (!disable_idle_d3) {
> > +	if (!vdev->disable_idle_d3) {
> >  		/*
> >  		 * pci-core sets the device power state to an unknown value at
> >  		 * bootup and after being removed from a driver.  The only
> > @@ -1385,7 +1400,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
> >  	kfree(vdev->region);
> >  	mutex_destroy(&vdev->ioeventfds_lock);
> >
> > -	if (!disable_idle_d3)
> > +	if (!vdev->disable_idle_d3)
> >  		vfio_pci_set_power_state(vdev, PCI_D0);
> >
> >  	kfree(vdev->pm_save);
> > @@ -1620,7 +1635,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
> >  		if (!ret) {
> >  			tmp->needs_reset = false;
> >
> > -			if (tmp != vdev && !disable_idle_d3)
> > +			if (tmp != vdev && !tmp->disable_idle_d3)
> >  				vfio_pci_set_power_state(tmp, PCI_D3hot);
> >  		}
> >
> > @@ -1636,7 +1651,7 @@ static void __exit vfio_pci_cleanup(void)
> >  	vfio_pci_uninit_perm_bits();
> >  }
> >
> > -static void __init vfio_pci_fill_ids(void)
> > +static void __init vfio_pci_fill_ids(char *ids)
> 
> This might be more clear if the global was also renamed vfio_pci_ids.

Yep. Let me rename it later.
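For reference, the string handed to vfio_pci_fill_ids() follows the format documented for the ids module parameter, "vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]", with comma-separated entries. A userspace sketch of that parsing, with the buffer renamed to vfio_pci_ids as suggested. The in-kernel function hands each entry to pci_add_dynid(); this sketch collects entries into an array instead so the parsing can be checked on its own. struct pci_id and vfio_pci_parse_ids() are hypothetical names:

```c
#define _DEFAULT_SOURCE	/* strsep() on glibc */
#include <stdio.h>
#include <string.h>

#define PCI_ANY_ID (~0u)

struct pci_id {
	unsigned int vendor, device, subvendor, subdevice, class, class_mask;
};

/*
 * Split vfio_pci_ids on commas and parse each entry with sscanf().
 * Empty entries are skipped; entries without at least vendor:device
 * are reported and skipped. Returns the number of entries stored.
 */
static int vfio_pci_parse_ids(char *vfio_pci_ids, struct pci_id *out, int max)
{
	char *p = vfio_pci_ids, *id;
	int n = 0;

	while ((id = strsep(&p, ",")) && n < max) {
		struct pci_id e = {
			.subvendor = PCI_ANY_ID,	/* defaults when omitted */
			.subdevice = PCI_ANY_ID,
		};
		int fields;

		if (!strlen(id))
			continue;

		fields = sscanf(id, "%x:%x:%x:%x:%x:%x", &e.vendor, &e.device,
				&e.subvendor, &e.subdevice, &e.class,
				&e.class_mask);
		if (fields < 2) {
			fprintf(stderr, "invalid id string \"%s\"\n", id);
			continue;
		}
		out[n++] = e;
	}
	return n;
}
```

Passing the buffer in as a parameter is what lets vfio-pci and the sample driver each keep their own ids module parameter while sharing the parser.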

> >  {
> >  	char *p, *id;
> >  	int rc;
> > @@ -1691,7 +1706,7 @@ static int __init vfio_pci_init(void)
> >  	if (ret)
> >  		goto out_driver;
> >
> > -	vfio_pci_fill_ids();
> > +	vfio_pci_fill_ids(ids);
> >
> >  	return 0;
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> > index 8a2c760..0398608 100644
> > --- a/drivers/vfio/pci/vfio_pci_private.h
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -122,6 +122,11 @@ struct vfio_pci_device {
> >  	struct list_head	dummy_resources_list;
> >  	struct mutex		ioeventfds_lock;
> >  	struct list_head	ioeventfds_list;
> > +	bool			nointxmask;
> > +#ifdef CONFIG_VFIO_PCI_VGA
> > +	bool			disable_vga;
> > +#endif
> > +	bool			disable_idle_d3;
> 
> It seems like there are more relevant places these could be within this
> structure, ex. nointxmask next to nointx, disable_vga near has_vga,
> disable_idle_d3 maybe near needs_pm_restore (even though those aren't
> conceptually related).  Not necessarily related to this series, it
> might be time to convert the existing bools to bit fields, but even
> before that the alignment of adding these as bools grouped with the
> existing bools is probably better.  Thanks,

Agreed. Will place the new bools in better places (next to proper neighbors :-))
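Alex's aside about eventually converting the existing bools to bit fields, with each new option next to its related flag, could look roughly like the sketch below. The struct names are hypothetical and only the flag members of struct vfio_pci_device are reproduced (disable_vga sits under #ifdef CONFIG_VFIO_PCI_VGA in the patch):

```c
#include <stdbool.h>

/* Today: one full bool per flag, new options appended at the end. */
struct vfio_pci_flags_bools {
	bool pci_2_3, virq_disabled, reset_works, extended_caps, bardirty,
	     has_vga, needs_reset, nointx, needs_pm_restore,
	     nointxmask, disable_vga, disable_idle_d3;
};

/* Bit-field variant, new options grouped with related flags. */
struct vfio_pci_flags {
	bool pci_2_3:1;
	bool virq_disabled:1;
	bool reset_works:1;
	bool extended_caps:1;
	bool bardirty:1;
	bool has_vga:1;
	bool disable_vga:1;		/* next to has_vga */
	bool needs_reset:1;
	bool nointx:1;
	bool nointxmask:1;		/* next to nointx */
	bool needs_pm_restore:1;
	bool disable_idle_d3:1;		/* next to needs_pm_restore */
};
```

One trade-off to keep in mind: adjacent bit fields share a storage unit and are updated with a read-modify-write, so flags that may be written concurrently from different paths need to be covered by the same lock. Exact struct sizes are implementation-defined.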

> Alex

Thanks,
Yi Liu

> 
> >  };
> >
> >  #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> > @@ -130,6 +135,9 @@ struct vfio_pci_device {
> >  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
> >  #define irq_is(vdev, type) (vdev->irq_type == type)
> >
> > +extern void vfio_pci_refresh_config(struct vfio_pci_device *vdev,
> > +				bool nointxmask, bool disable_idle_d3);
> > +
> >  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
> >  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
> >


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-09 22:48   ` Alex Williamson
@ 2020-01-16 12:33     ` Liu, Yi L
  2020-01-16 21:24       ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Liu, Yi L @ 2020-01-16 12:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx,
	baolu.lu, Masahiro Yamada

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, January 10, 2020 6:49 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> 
> On Tue,  7 Jan 2020 20:01:48 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > a PCI device as a mediated device. For a pci device, once bound
> > to vfio-mdev-pci driver, user space access of this device will
> > go through vfio mdev framework. The usage of the device follows
> > mdev management method. e.g. user should create a mdev before
> > exposing the device to user-space.
> >
> > Benefit of this new driver would be acting as a sample driver
> > for recent changes from "vfio/mdev: IOMMU aware mediated device"
> > patchset. Also it could be a good experiment driver for future
> > device specific mdev migration support. This sample driver only
> > supports singleton iommu groups, for non-singleton iommu groups,
> > this sample driver doesn't work. It will fail when trying to assign
> > the non-singleton iommu group to VMs.
> >
> > To use this driver:
> > a) build and load vfio-mdev-pci.ko module
> >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> >    then load it with following command:
> >    > sudo modprobe vfio
> >    > sudo modprobe vfio-pci
> >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko
> >
> > b) unbind original device driver
> >    e.g. use following command to unbind its original driver
> >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> >
> > c) bind vfio-mdev-pci driver to the physical device
> >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> >
> > d) check the supported mdev instances
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
> >      vfio-mdev-pci-type_name
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type_name/
> >      available_instances  create  device_api  devices  name
> >
> > e)  create mdev on this physical device (only 1 instance)
> >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type_name/create
> >
> > f) passthru the mdev to guest
> >    add the following line in QEMU boot command
> >     -device vfio-pci,\
> >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> >
> > g) destroy mdev
> >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> >      remove
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  samples/Kconfig                       |  10 +
> >  samples/Makefile                      |   1 +
> >  samples/vfio-mdev-pci/Makefile        |   4 +
> >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> >  4 files changed, 412 insertions(+)
> >  create mode 100644 samples/vfio-mdev-pci/Makefile
> >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> >
> > diff --git a/samples/Kconfig b/samples/Kconfig
> > index 9d236c3..50d207c 100644
> > --- a/samples/Kconfig
> > +++ b/samples/Kconfig
> > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> >  	help
> >  	  Build a sample program to work with mei device.
> >
> > +config SAMPLE_VFIO_MDEV_PCI
> > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > +	select VFIO_PCI_COMMON
> > +	select VFIO_PCI
> > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > +	help
> > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > +	  this driver, device passthru should through mdev path.
> > +
> > +	  If you don't know what to do here, say N.
> >
> >  endif # SAMPLES
> > diff --git a/samples/Makefile b/samples/Makefile
> > index 5ce50ef..84faced 100644
> > --- a/samples/Makefile
> > +++ b/samples/Makefile
> > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> >  obj-y					+= vfio-mdev/
> > +obj-y					+= vfio-mdev-pci/
> 
> I think we could just lump this into vfio-mdev rather than making
> another directory.

Sure, will move it. :-)

> 
> >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > new file mode 100644
> > index 0000000..41b2139
> > --- /dev/null
> > +++ b/samples/vfio-mdev-pci/Makefile
> > @@ -0,0 +1,4 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > +
> > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > new file mode 100644
> > index 0000000..b180356
> > --- /dev/null
> > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > @@ -0,0 +1,397 @@
> > +/*
> > + * Copyright © 2020 Intel Corporation.
> > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio_pci.c:
> > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include <linux/device.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/file.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/iommu.h>
> > +#include <linux/module.h>
> > +#include <linux/mutex.h>
> > +#include <linux/notifier.h>
> > +#include <linux/pci.h>
> > +#include <linux/pm_runtime.h>
> > +#include <linux/slab.h>
> > +#include <linux/types.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/vgaarb.h>
> > +#include <linux/nospec.h>
> > +#include <linux/mdev.h>
> > +#include <linux/vfio_pci_common.h>
> > +
> > +#define DRIVER_VERSION  "0.1"
> > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > +
> > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > +
> > +static char ids[1024] __initdata;
> > +module_param_string(ids, ids, sizeof(ids), 0);
> > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > +
> > +static bool nointxmask;
> > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > +MODULE_PARM_DESC(nointxmask,
> > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > +
> > +#ifdef CONFIG_VFIO_PCI_VGA
> > +static bool disable_vga;
> > +module_param(disable_vga, bool, S_IRUGO);
> > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > +#endif
> > +
> > +static bool disable_idle_d3;
> > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > +MODULE_PARM_DESC(disable_idle_d3,
> > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > +
> > +static struct pci_driver vfio_mdev_pci_driver;
> > +
> > +static ssize_t
> > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > +{
> > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > +}
> > +
> > +MDEV_TYPE_ATTR_RO(name);
> > +
> > +static ssize_t
> > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > +{
> > +	return sprintf(buf, "%d\n", 1);
> > +}
> > +
> > +MDEV_TYPE_ATTR_RO(available_instances);
> > +
> > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > +		char *buf)
> > +{
> > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > +}
> > +
> > +MDEV_TYPE_ATTR_RO(device_api);
> > +
> > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > +	&mdev_type_attr_name.attr,
> > +	&mdev_type_attr_device_api.attr,
> > +	&mdev_type_attr_available_instances.attr,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > +	.name  = "type1",
> > +	.attrs = vfio_mdev_pci_types_attrs,
> > +};
> > +
> > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > +	&vfio_mdev_pci_type_group1,
> > +	NULL,
> > +};
> > +
> > +struct vfio_mdev_pci {
> > +	struct vfio_pci_device *vdev;
> > +	struct mdev_device *mdev;
> > +	unsigned long handle;
> > +};
> > +
> > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > +{
> > +	struct device *pdev;
> > +	struct vfio_pci_device *vdev;
> > +	struct vfio_mdev_pci *pmdev;
> > +	int ret;
> > +
> > +	pdev = mdev_parent_dev(mdev);
> > +	vdev = dev_get_drvdata(pdev);
> > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > +	if (pmdev == NULL) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	pmdev->mdev = mdev;
> > +	pmdev->vdev = vdev;
> > +	mdev_set_drvdata(mdev, pmdev);
> > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > +	if (ret) {
> > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > +		mdev_set_drvdata(mdev, NULL);
> > +		kfree(pmdev);
> > +		goto out;
> > +	}
> > +
> > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > +		     dev_name(mdev_dev(mdev)));
> > +out:
> > +	return ret;
> > +}
> > +
> > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > +{
> > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > +
> > +	kfree(pmdev);
> > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > +		     dev_name(mdev_dev(mdev)));
> > +
> > +	return 0;
> > +}
> > +
> > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > +{
> > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > +	int ret = 0;
> > +
> > +	if (!try_module_get(THIS_MODULE))
> > +		return -ENODEV;
> > +
> > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > +
> > +	mutex_lock(&vdev->reflck->lock);
> > +
> > +	if (!vdev->refcnt) {
> > +		ret = vfio_pci_enable(vdev);
> > +		if (ret)
> > +			goto error;
> > +
> > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > +	}
> > +	vdev->refcnt++;
> > +error:
> > +	mutex_unlock(&vdev->reflck->lock);
> > +	if (!ret)
> > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > +	else {
> > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > +		module_put(THIS_MODULE);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > +{
> > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > +
> > +	pr_info("Release mdev: %s on pf: %s\n",
> > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > +
> > +	mutex_lock(&vdev->reflck->lock);
> > +
> > +	if (!(--vdev->refcnt)) {
> > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > +		vfio_pci_disable(vdev);
> > +	}
> > +
> > +	mutex_unlock(&vdev->reflck->lock);
> > +
> > +	module_put(THIS_MODULE);
> > +}
> 
> open() and release() here are almost identical between vfio_pci and
> vfio_mdev_pci, which suggests maybe there should be common functions to
> call into like we do for the below.

Yes, let me study this further and provide a better abstraction in the next version. :-)
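A minimal userspace sketch of the kind of shared open/release helper suggested above, assuming the common code would own the refcount and lock that vfio_mdev_pci_open()/vfio_mdev_pci_release() manipulate today. `struct pci_core`, `core_open()`, `core_release()` and the `enable`/`disable` callbacks are hypothetical names, and a pthread mutex stands in for `vdev->reflck->lock`; this is not kernel code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the shared state in struct vfio_pci_device. */
struct pci_core {
	pthread_mutex_t lock;	/* plays the role of vdev->reflck->lock */
	int refcnt;		/* number of current openers */
	int enabled;		/* 1 while the device is "enabled" */
	int (*enable)(struct pci_core *core);	/* e.g. vfio_pci_enable() */
	void (*disable)(struct pci_core *core);	/* e.g. vfio_pci_disable() */
};

/* Shared open: only the first opener enables the device. */
static int core_open(struct pci_core *core)
{
	int ret = 0;

	pthread_mutex_lock(&core->lock);
	if (core->refcnt == 0) {
		ret = core->enable(core);
		if (ret)
			goto out;	/* refcnt stays 0 on failure */
	}
	core->refcnt++;
out:
	pthread_mutex_unlock(&core->lock);
	return ret;
}

/* Shared release: only the last closer disables the device. */
static void core_release(struct pci_core *core)
{
	pthread_mutex_lock(&core->lock);
	if (--core->refcnt == 0)
		core->disable(core);
	pthread_mutex_unlock(&core->lock);
}

/* Trivial callbacks for illustration only. */
static int fake_enable(struct pci_core *core)
{
	core->enabled = 1;
	return 0;
}

static void fake_disable(struct pci_core *core)
{
	core->enabled = 0;
}
```

Each leaf driver would then only wrap its own work (module refcounting, logging) around these two calls.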

> > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > +			     unsigned long arg)
> > +{
> > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > +
> > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > +}
> > +
> > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > +
> > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > +}
> > +
> > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > +			size_t count, loff_t *ppos)
> > +{
> > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > +
> > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > +}
> > +
> > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > +				const char __user *buf,
> > +				size_t count, loff_t *ppos)
> > +{
> > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > +
> > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > +}
> > +
> > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > +	.create			= vfio_mdev_pci_create,
> > +	.remove			= vfio_mdev_pci_remove,
> > +
> > +	.open			= vfio_mdev_pci_open,
> > +	.release		= vfio_mdev_pci_release,
> > +
> > +	.read			= vfio_mdev_pci_read,
> > +	.write			= vfio_mdev_pci_write,
> > +	.mmap			= vfio_mdev_pci_mmap,
> > +	.ioctl			= vfio_mdev_pci_ioctl,
> > +};
> > +
> > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > +				       const struct pci_device_id *id)
> > +{
> > +	struct vfio_pci_device *vdev;
> > +	int ret;
> > +
> > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > +	 * userspace instance with VFs and PFs from the same device, which
> > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > +	 * VFs, which would unbind the driver, which is prone to blocking
> > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > +	 * reject these PFs and let the user sort it out.
> > +	 */
> > +	if (pci_num_vf(pdev)) {
> > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > +		return -EBUSY;
> > +	}
> > +
> > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > +	if (!vdev)
> > +		return -ENOMEM;
> > +
> > +	vdev->pdev = pdev;
> > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > +	mutex_init(&vdev->igate);
> > +	spin_lock_init(&vdev->irqlock);
> > +	mutex_init(&vdev->ioeventfds_lock);
> > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > +	vdev->nointxmask = nointxmask;
> > +#ifdef CONFIG_VFIO_PCI_VGA
> > +	vdev->disable_vga = disable_vga;
> > +#endif
> > +	vdev->disable_idle_d3 = disable_idle_d3;
> > +
> > +	pci_set_drvdata(pdev, vdev);
> > +
> > +	ret = vfio_pci_reflck_attach(vdev);
> > +	if (ret) {
> > +		pci_set_drvdata(pdev, NULL);
> > +		kfree(vdev);
> > +		return ret;
> > +	}
> > +
> > +	if (vfio_pci_is_vga(pdev)) {
> > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > +		vga_set_legacy_decoding(pdev,
> > +					vfio_pci_set_vga_decode(vdev, false));
> > +	}
> > +
> > +	vfio_pci_probe_power_state(vdev);
> > +
> > +	if (!vdev->disable_idle_d3) {
> > +		/*
> > +		 * pci-core sets the device power state to an unknown value at
> > +		 * bootup and after being removed from a driver.  The only
> > +		 * transition it allows from this unknown state is to D0, which
> > +		 * typically happens when a driver calls pci_enable_device().
> > +		 * We're not ready to enable the device yet, but we do want to
> > +		 * be able to get to D3.  Therefore first do a D0 transition
> > +		 * before going to D3.
> > +		 */
> > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > +	}
> 
> Ditto here and remove below, this seems like boilerplate that shouldn't
> be duplicated per leaf module.  Thanks,

Sure, the code snippet above can also be abstracted into a common API
provided by vfio-pci-common.ko. :-)

There is one point I'd like to confirm with you. Do you also want the
code snippet below to be placed in vfio-pci-common.ko and exposed as a
wrapped API, so that it can be used by the sample driver and by future
drivers that want to wrap a PCI device as an mdev? Maybe I misunderstood
your comment. :-(

> 
> Alex

Thanks,
Yi Liu
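The D0-then-D3hot sequence in the quoted probe code, which Alex suggests sharing rather than duplicating, can be modeled as a tiny state machine. This is a userspace sketch that assumes, as the code comment states, that the only transition allowed out of the unknown state is to D0; `enum pstate`, `set_state()` and `park_idle_device()` are hypothetical, and the real transitions go through vfio_pci_set_power_state().

```c
#include <assert.h>

/* Hypothetical model of the power states involved in the probe path. */
enum pstate { PS_UNKNOWN, PS_D0, PS_D3HOT };

/*
 * Models the rule from the patch comment: out of the unknown state,
 * the only permitted transition is to D0.
 */
static int set_state(enum pstate *cur, enum pstate next)
{
	if (*cur == PS_UNKNOWN && next != PS_D0)
		return -1;	/* transition rejected, state unchanged */
	*cur = next;
	return 0;
}

/* Hypothetical shared helper: park an idle, unused device in D3hot. */
static int park_idle_device(enum pstate *cur, int disable_idle_d3)
{
	if (disable_idle_d3)
		return 0;	/* user asked us not to use D3 */

	/* Leave the unknown state first, then drop to low power. */
	if (set_state(cur, PS_D0))
		return -1;
	return set_state(cur, PS_D3HOT);
}
```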

> > +
> > +	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
> > +	if (ret)
> > +		pr_err("Cannot register mdev for device %s\n",
> > +			dev_name(&pdev->dev));
> > +	else
> > +		pr_info("Wrap device %s as a mdev\n", dev_name(&pdev->dev));
> > +
> > +	return ret;
> > +}
> > +
> > +static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
> > +{
> > +	struct vfio_pci_device *vdev;
> > +
> > +	vdev = pci_get_drvdata(pdev);
> > +	if (!vdev)
> > +		return;
> > +
> > +	vfio_pci_reflck_put(vdev->reflck);
> > +
> > +	kfree(vdev->region);
> > +	mutex_destroy(&vdev->ioeventfds_lock);
> > +
> > +	if (!disable_idle_d3)
> > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > +
> > +	kfree(vdev->pm_save);
> > +
> > +	if (vfio_pci_is_vga(pdev)) {
> > +		vga_client_register(pdev, NULL, NULL, NULL);
> > +		vga_set_legacy_decoding(pdev,
> > +				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
> > +				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
> > +	}
> > +
> > +	kfree(vdev);
> > +}
> > +
> > +static struct pci_driver vfio_mdev_pci_driver = {
> > +	.name		= VFIO_MDEV_PCI_NAME,
> > +	.id_table	= NULL, /* only dynamic ids */
> > +	.probe		= vfio_mdev_pci_driver_probe,
> > +	.remove		= vfio_mdev_pci_driver_remove,
> > +	.err_handler	= &vfio_pci_err_handlers,
> > +};
> > +
> > +static void __exit vfio_mdev_pci_cleanup(void)
> > +{
> > +	pci_unregister_driver(&vfio_mdev_pci_driver);
> > +}
> > +
> > +static int __init vfio_mdev_pci_init(void)
> > +{
> > +	int ret;
> > +
> > +	/* Register and scan for devices */
> > +	ret = pci_register_driver(&vfio_mdev_pci_driver);
> > +	if (ret)
> > +		return ret;
> > +
> > +	vfio_pci_fill_ids(ids, &vfio_mdev_pci_driver);
> > +
> > +	return 0;
> > +}
> > +
> > +module_init(vfio_mdev_pci_init);
> > +module_exit(vfio_mdev_pci_cleanup);
> > +
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > +MODULE_DESCRIPTION(DRIVER_DESC);


* RE: [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c
  2020-01-09 22:48   ` Alex Williamson
@ 2020-01-16 12:42     ` Liu, Yi L
  0 siblings, 0 replies; 44+ messages in thread
From: Liu, Yi L @ 2020-01-16 12:42 UTC
  To: Alex Williamson
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx, baolu.lu

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, January 10, 2020 6:48 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c
> 
> On Tue,  7 Jan 2020 20:01:44 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch removes the common code from vfio_pci.c and leaves only the
> > module-specific code; the new vfio_pci.c leverages the common functions
> > implemented in vfio_pci_common.c.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/Makefile           |    3 +-
> >  drivers/vfio/pci/vfio_pci.c         | 1442 -----------------------------------
> >  drivers/vfio/pci/vfio_pci_common.c  |    2 +-
> >  drivers/vfio/pci/vfio_pci_private.h |    2 +
> >  4 files changed, 5 insertions(+), 1444 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> > index f027f8a..d94317a 100644
> > --- a/drivers/vfio/pci/Makefile
> > +++ b/drivers/vfio/pci/Makefile
> > @@ -1,6 +1,7 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> >
> > -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> > +vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
> > +		vfio_pci_rdwr.o vfio_pci_config.o
> >  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> >  vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 103e493..7e24da2 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> 
> I think there are a bunch of headers that are no longer needed here
> too.  It at least compiles without these:
> 
> -#include <linux/eventfd.h>
> -#include <linux/file.h>
> -#include <linux/interrupt.h>
> -#include <linux/notifier.h>
> -#include <linux/pm_runtime.h>
> -#include <linux/uaccess.h>
> -#include <linux/nospec.h>

Got it, let me remove them.

> 
> 
> > @@ -54,411 +54,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> >  MODULE_PARM_DESC(disable_idle_d3,
> >  		 "Disable using the PCI D3 low power state for idle, unused devices");
> >
> > -/*
> > - * Our VGA arbiter participation is limited since we don't know anything
> > - * about the device itself.  However, if the device is the only VGA device
> > - * downstream of a bridge and VFIO VGA support is disabled, then we can
> > - * safely return legacy VGA IO and memory as not decoded since the user
> > - * has no way to get to it and routing can be disabled externally at the
> > - * bridge.
> > - */
> > -unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
> > -{
> > -	struct vfio_pci_device *vdev = opaque;
> > -	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
> > -	unsigned char max_busnr;
> > -	unsigned int decodes;
> > -
> > -	if (single_vga || !vfio_vga_disabled(vdev) ||
> > -		pci_is_root_bus(pdev->bus))
> > -		return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
> > -		       VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
> > -
> > -	max_busnr = pci_bus_max_busnr(pdev->bus);
> > -	decodes = VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM;
> > -
> > -	while ((tmp = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, tmp)) != NULL) {
> > -		if (tmp == pdev ||
> > -		    pci_domain_nr(tmp->bus) != pci_domain_nr(pdev->bus) ||
> > -		    pci_is_root_bus(tmp->bus))
> > -			continue;
> > -
> > -		if (tmp->bus->number >= pdev->bus->number &&
> > -		    tmp->bus->number <= max_busnr) {
> > -			pci_dev_put(tmp);
> > -			decodes |= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
> > -			break;
> > -		}
> > -	}
> > -
> > -	return decodes;
> > -}
> > -
> > -static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > -{
> > -	struct resource *res;
> > -	int i;
> > -	struct vfio_pci_dummy_resource *dummy_res;
> > -
> > -	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > -
> > -	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> > -		int bar = i + PCI_STD_RESOURCES;
> > -
> > -		res = &vdev->pdev->resource[bar];
> > -
> > -		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> > -			goto no_mmap;
> > -
> > -		if (!(res->flags & IORESOURCE_MEM))
> > -			goto no_mmap;
> > -
> > -		/*
> > -		 * The PCI core shouldn't set up a resource with a
> > -		 * type but zero size. But there may be bugs that
> > -		 * cause us to do that.
> > -		 */
> > -		if (!resource_size(res))
> > -			goto no_mmap;
> > -
> > -		if (resource_size(res) >= PAGE_SIZE) {
> > -			vdev->bar_mmap_supported[bar] = true;
> > -			continue;
> > -		}
> > -
> > -		if (!(res->start & ~PAGE_MASK)) {
> > -			/*
> > -			 * Add a dummy resource to reserve the remainder
> > -			 * of the exclusive page in case that hot-add
> > -			 * device's bar is assigned into it.
> > -			 */
> > -			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> > -			if (dummy_res == NULL)
> > -				goto no_mmap;
> > -
> > -			dummy_res->resource.name = "vfio sub-page reserved";
> > -			dummy_res->resource.start = res->end + 1;
> > -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> > -			dummy_res->resource.flags = res->flags;
> > -			if (request_resource(res->parent,
> > -						&dummy_res->resource)) {
> > -				kfree(dummy_res);
> > -				goto no_mmap;
> > -			}
> > -			dummy_res->index = bar;
> > -			list_add(&dummy_res->res_next,
> > -					&vdev->dummy_resources_list);
> > -			vdev->bar_mmap_supported[bar] = true;
> > -			continue;
> > -		}
> > -		/*
> > -		 * Here we don't handle the case when the BAR is not page
> > -		 * aligned because we can't expect the BAR will be
> > -		 * assigned into the same location in a page in guest
> > -		 * when we passthrough the BAR. And it's hard to access
> > -		 * this BAR in userspace because we have no way to get
> > -		 * the BAR's location in a page.
> > -		 */
> > -no_mmap:
> > -		vdev->bar_mmap_supported[bar] = false;
> > -	}
> > -}
> > -
> > -static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
> > -
> > -/*
> > - * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> > - * _and_ the ability detect when the device is asserting INTx via PCI_STATUS.
> > - * If a device implements the former but not the latter we would typically
> > - * expect broken_intx_masking be set and require an exclusive interrupt.
> > - * However since we do have control of the device's ability to assert INTx,
> > - * we can instead pretend that the device does not implement INTx, virtualizing
> > - * the pin register to report zero and maintaining DisINTx set on the host.
> > - */
> > -static bool vfio_pci_nointx(struct pci_dev *pdev)
> > -{
> > -	switch (pdev->vendor) {
> > -	case PCI_VENDOR_ID_INTEL:
> > -		switch (pdev->device) {
> > -		/* All i40e (XL710/X710/XXV710) 10/20/25/40GbE NICs */
> > -		case 0x1572:
> > -		case 0x1574:
> > -		case 0x1580 ... 0x1581:
> > -		case 0x1583 ... 0x158b:
> > -		case 0x37d0 ... 0x37d2:
> > -			return true;
> > -		default:
> > -			return false;
> > -		}
> > -	}
> > -
> > -	return false;
> > -}
> > -
> > -void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
> > -{
> > -	struct pci_dev *pdev = vdev->pdev;
> > -	u16 pmcsr;
> > -
> > -	if (!pdev->pm_cap)
> > -		return;
> > -
> > -	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > -
> > -	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
> > -}
> > -
> > -/*
> > - * pci_set_power_state() wrapper handling devices which perform a soft reset on
> > - * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
> > - * restore when returned to D0.  Saved separately from pci_saved_state for use
> > - * by PM capability emulation and separately from pci_dev internal saved state
> > - * to avoid it being overwritten and consumed around other resets.
> > - */
> > -int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
> > -{
> > -	struct pci_dev *pdev = vdev->pdev;
> > -	bool needs_restore = false, needs_save = false;
> > -	int ret;
> > -
> > -	if (vdev->needs_pm_restore) {
> > -		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
> > -			pci_save_state(pdev);
> > -			needs_save = true;
> > -		}
> > -
> > -		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
> > -			needs_restore = true;
> > -	}
> > -
> > -	ret = pci_set_power_state(pdev, state);
> > -
> > -	if (!ret) {
> > -		/* D3 might be unsupported via quirk, skip unless in D3 */
> > -		if (needs_save && pdev->current_state >= PCI_D3hot) {
> > -			vdev->pm_save = pci_store_saved_state(pdev);
> > -		} else if (needs_restore) {
> > -			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
> > -			pci_restore_state(pdev);
> > -		}
> > -	}
> 
> 
> This gets a bit ugly, vfio_pci_remove() retains:
> 
> kfree(vdev->pm_save)
> 
> But vfio_pci.c otherwise has no use of this field on the
> vfio_pci_device.  I'm afraid we're really just doing a pretty rough
> splitting of the code rather than massaging the callbacks between the
> modules into an actual API, for example maybe there should be init and
> exit callbacks into the common code to handle such things.
> ioeventfds_{list,lock} are similar, vfio_pci.c inits and destroys them,
> but otherwise doesn't know what they're for.  I wonder how many more
> such things exist.  Thanks,

Yeah, I tried to keep the code as it looks today, so for now it is
really just a code split. But I agree we need to make it more thorough.
I have been considering how to make the code work as you described
since I saw your comment last week. I need a more detailed investigation
along the lines you suggested to answer your question properly.

Thanks very much for your guidance.

Regards,
Yi Liu

> 
> Alex
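One way to read the init/exit suggestion above, sketched in userspace C: the common code initializes and frees the fields only it uses (such as `pm_save` and the ioeventfd list), so a leaf driver's probe/remove no longer touches them. `struct pci_common`, `pci_common_init()`, `pci_common_exit()`, `leaf_probe()` and `leaf_remove()` are hypothetical names, and an integer counter stands in for `ioeventfds_{list,lock}`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the common part of struct vfio_pci_device. */
struct pci_common {
	void *pm_save;		/* saved config state, used only by common code */
	int nr_ioeventfds;	/* stands in for ioeventfds_{list,lock} */
	int initialized;
};

/* Hypothetical common init: sets up every field the common code owns. */
static int pci_common_init(struct pci_common *c)
{
	c->pm_save = NULL;
	c->nr_ioeventfds = 0;
	c->initialized = 1;
	return 0;
}

/* Hypothetical common exit: releases whatever the common code allocated. */
static void pci_common_exit(struct pci_common *c)
{
	free(c->pm_save);	/* free(NULL) is a no-op */
	c->pm_save = NULL;
	c->nr_ioeventfds = 0;
	c->initialized = 0;
}

/* A leaf driver's probe/remove then only pairs init with exit. */
static struct pci_common *leaf_probe(void)
{
	struct pci_common *c = calloc(1, sizeof(*c));

	if (c && pci_common_init(c)) {
		free(c);
		c = NULL;
	}
	return c;
}

static void leaf_remove(struct pci_common *c)
{
	if (!c)
		return;
	pci_common_exit(c);
	free(c);
}
```

With this split, vfio_pci.c would no longer need to know that `pm_save` exists at all.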


* RE: [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file
  2020-01-15 10:43   ` Cornelia Huck
@ 2020-01-16 12:46     ` Liu, Yi L
  0 siblings, 0 replies; 44+ messages in thread
From: Liu, Yi L @ 2020-01-16 12:46 UTC
  To: Cornelia Huck
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, Tian, Kevin, joro,
	peterx, baolu.lu

> From: Cornelia Huck [mailto:cohuck@redhat.com]
> Sent: Wednesday, January 15, 2020 6:43 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file
> 
> On Tue,  7 Jan 2020 20:01:39 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch moves two inline functions to vfio_pci_private.h for
> > further sharing across source files. Also avoids below compiling error
> > in further code split.
> >
> > "error: inlining failed in call to always_inline ‘vfio_pci_is_vga’:
> > function body not available".
> 
> "We want to use these functions from other files, so move them to a header" seems
> to be justification enough; why mention the compilation error?

Exactly. What a silly commit message I wrote. Thanks very much. I hit
this compilation error at one step of my development, so I added it to
the commit message. I agree it is not necessary.

Thanks,
Yi Liu

> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c         | 14 --------------
> >  drivers/vfio/pci/vfio_pci_private.h | 14 ++++++++++++++
> >  2 files changed, 14 insertions(+), 14 deletions(-)
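For reference, the pattern this patch relies on, sketched as a userspace header fragment: a `static inline` function defined in a shared header is compiled into every .c file that includes it, which is what lets both vfio_pci.c and vfio_pci_common.c call it. `struct fake_pci_dev`, `struct fake_vdev`, `is_vga()` and `vga_disabled()` are hypothetical stand-ins modeled on vfio_pci_is_vga() and vfio_vga_disabled(); PCI_CLASS_DISPLAY_VGA is 0x0300 as in <linux/pci_ids.h>.

```c
#include <assert.h>
#include <stdbool.h>

#define PCI_CLASS_DISPLAY_VGA 0x0300	/* base class 0x03, subclass 0x00 */

/* Hypothetical minimal stand-ins for struct pci_dev / vfio_pci_device. */
struct fake_pci_dev {
	unsigned int class;	/* 24-bit class code: base, sub, prog-if */
};

struct fake_vdev {
	struct fake_pci_dev *pdev;
	bool disable_vga;
};

/* Defined in a shared header, each including .c file gets its own copy. */
static inline bool is_vga(const struct fake_pci_dev *pdev)
{
	/* Drop the programming interface byte before comparing. */
	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}

static inline bool vga_disabled(const struct fake_vdev *vdev)
{
#ifdef CONFIG_VFIO_PCI_VGA
	return vdev->disable_vga;
#else
	(void)vdev;
	return true;	/* VGA support not built in: always disabled */
#endif
}
```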


* RE: [PATCH v4 04/12] vfio_pci: make common functions be extern
  2020-01-15 10:56   ` Cornelia Huck
@ 2020-01-16 12:48     ` Liu, Yi L
  0 siblings, 0 replies; 44+ messages in thread
From: Liu, Yi L @ 2020-01-16 12:48 UTC
  To: Cornelia Huck
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, Tian, Kevin, joro,
	peterx, baolu.lu

> From: Cornelia Huck [mailto:cohuck@redhat.com]
> Sent: Wednesday, January 15, 2020 6:56 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 04/12] vfio_pci: make common functions be extern
> 
> On Tue,  7 Jan 2020 20:01:41 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch makes the common functions (module agnostic functions) in
> > vfio_pci.c to be extern. So that such functions could be moved to a
> > common source file.
> >
> > *) vfio_pci_set_vga_decode
> > *) vfio_pci_probe_power_state
> > *) vfio_pci_set_power_state
> > *) vfio_pci_enable
> > *) vfio_pci_disable
> > *) vfio_pci_refresh_config
> > *) vfio_pci_register_dev_region
> > *) vfio_pci_ioctl
> > *) vfio_pci_read
> > *) vfio_pci_write
> > *) vfio_pci_mmap
> > *) vfio_pci_request
> > *) vfio_pci_err_handlers
> > *) vfio_pci_reflck_attach
> > *) vfio_pci_reflck_put
> > *) vfio_pci_fill_ids
> 
> I find it a bit hard to understand what "module agnostic functions" are supposed to
> be. The functions you want to move seem to be some "basic"
> functions that can be shared between normal vfio-pci and vfio-mdev-pci... maybe
> talk about "functions that provide basic vfio functionality for pci devices" and also
> mention the mdev part?
> 
> [My rationale behind complaining about the commit messages is that if I look at this
> change in a year from now, I want to be able to know why and to what end that
> change was made.]

Right, I agree with your comments. I'll change the commit message
according to your suggestion.

Thanks,
Yi Liu

> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c         | 30 +++++++++++++-----------------
> >  drivers/vfio/pci/vfio_pci_private.h | 15 +++++++++++++++
> >  2 files changed, 28 insertions(+), 17 deletions(-)


* RE: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-15 12:30   ` Cornelia Huck
@ 2020-01-16 13:23     ` Liu, Yi L
  2020-01-16 17:40       ` Cornelia Huck
  0 siblings, 1 reply; 44+ messages in thread
From: Liu, Yi L @ 2020-01-16 13:23 UTC
  To: Cornelia Huck
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, Tian, Kevin, joro,
	peterx, baolu.lu, Masahiro Yamada

> From: Cornelia Huck [mailto:cohuck@redhat.com]
> Sent: Wednesday, January 15, 2020 8:30 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> 
> On Tue,  7 Jan 2020 20:01:48 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds a sample driver named vfio-mdev-pci. It wraps
> > a PCI device as a mediated device. Once a PCI device is bound
> > to the vfio-mdev-pci driver, user-space access to it goes
> > through the vfio mdev framework, and the device is managed with
> > the usual mdev method, e.g. the user must create an mdev before
> > exposing the device to user space.
> >
> > This driver serves as a sample driver for the recent changes in
> > the "vfio/mdev: IOMMU aware mediated device" patchset. It could
> > also be a good experimental driver for future device-specific
> > mdev migration support. The sample driver supports only singleton
> > iommu groups; with a non-singleton iommu group it does not work
> > and fails when the group is assigned to a VM.
> >
> > To use this driver:
> > a) build and load vfio-mdev-pci.ko module
> >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> >    then load it with following command:
> >    > sudo modprobe vfio
> >    > sudo modprobe vfio-pci
> >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko
> >
> > b) unbind original device driver
> >    e.g. use following command to unbind its original driver
> >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> >
> > c) bind vfio-mdev-pci driver to the physical device
> >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> >
> > d) check the supported mdev instances
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
> >      vfio-mdev-pci-type_name
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type_name/
> >      available_instances  create  device_api  devices  name
> >
> > e)  create mdev on this physical device (only 1 instance)
> >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type_name/create
> >
> > f) passthru the mdev to guest
> >    add the following line in QEMU boot command
> >     -device vfio-pci,\
> >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> >
> > g) destroy mdev
> >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> >      remove
> 
> I think much/most of those instructions should go (additionally) into
> the sample driver source.

Yes, it would be helpful to add it to a doc.
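As a starting point for such a doc, steps d) and e) can be sketched in userspace C: build the `create` path under `mdev_supported_types` and write a UUID to it, as the `echo` command does. `mdev_create_path()` and `mdev_create()` are hypothetical helpers; the BDF and type name are whatever the host reports, so any values passed in are placeholders.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs "create" path used in step e). bdf and type_name
 * come from the caller; nothing about them is assumed here.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int mdev_create_path(char *buf, size_t len,
			    const char *bdf, const char *type_name)
{
	int n = snprintf(buf, len,
			 "/sys/bus/pci/devices/%s/mdev_supported_types/%s/create",
			 bdf, type_name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a UUID to the create file, like `echo $uuid > .../create`. */
static int mdev_create(const char *path, const char *uuid)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fputs(uuid, f) == EOF)
		ret = -1;
	/* The sysfs write happens on flush, so check fclose() too. */
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

Run as root, calling mdev_create() on that path with "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" would correspond to step e).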

> Otherwise, it's not clear to the reader why
> they should wrap the device in mdev instead of simply using a normal
> vfio-pci device.

Actually, the reason for wrapping a device in an mdev instead of simply
using a normal vfio-pci device is to let a vendor-specific driver
intercept device accesses that vfio-pci does not let it handle.
vfio-pci only intercepts PCI config space accesses and a few other
special accesses. For vendor-specific handling, it would be nice to
have a way for the vendor-specific driver to step in, and mdev allows it.
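The interception described above can be illustrated with a small userspace model: a vendor read op emulates one register itself and forwards every other access to the common passthrough path, much as vfio_mdev_pci_read() forwards to vfio_pci_read() in the sample driver. `common_read()`, `vendor_read()` and the emulated index and value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical passthrough read, standing in for vfio_pci_read(). */
static uint32_t common_read(const uint32_t *regs, size_t idx)
{
	return regs[idx];
}

#define VENDOR_EMULATED_IDX 2		/* hypothetical register to trap */
#define VENDOR_EMULATED_VAL 0xdeadbeefu	/* hypothetical emulated value */

/*
 * Hypothetical vendor read op: emulate one register, forward the rest
 * to the common passthrough path unchanged.
 */
static uint32_t vendor_read(const uint32_t *regs, size_t idx)
{
	if (idx == VENDOR_EMULATED_IDX)
		return VENDOR_EMULATED_VAL;	/* intercepted by the vendor driver */
	return common_read(regs, idx);		/* passed through */
}
```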

As for the purpose of this sample driver, it is meant to test
IOMMU-capable mdevs. Since such hardware is not yet available on the
market, there is otherwise no way to test the VFIO extensions for
IOMMU-capable mdevs. Wrapping a PCI device as an mdev exercises these
extensions well because the device has hardware-enforced DMA isolation,
which makes it possible to test the extensions in VFIO.

> 
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  samples/Kconfig                       |  10 +
> >  samples/Makefile                      |   1 +
> >  samples/vfio-mdev-pci/Makefile        |   4 +
> >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> >  4 files changed, 412 insertions(+)
> >  create mode 100644 samples/vfio-mdev-pci/Makefile
> >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> >
> > diff --git a/samples/Kconfig b/samples/Kconfig
> > index 9d236c3..50d207c 100644
> > --- a/samples/Kconfig
> > +++ b/samples/Kconfig
> > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> >  	help
> >  	  Build a sample program to work with mei device.
> >
> > +config SAMPLE_VFIO_MDEV_PCI
> > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > +	select VFIO_PCI_COMMON
> > +	select VFIO_PCI
> 
> Why does this still need to select VFIO_PCI? Shouldn't all needed
> infrastructure rather be covered by VFIO_PCI_COMMON already?

VFIO_PCI_COMMON is supposed to be a dependency of both VFIO_PCI and
SAMPLE_VFIO_MDEV_PCI. However, the VFIO_PCI_COMMON source code lives
under drivers/vfio/pci, which is only built when VFIO_PCI is configured.
Besides letting SAMPLE_VFIO_MDEV_PCI select VFIO_PCI, I could also add
a line to drivers/vfio/Makefile so that drivers/vfio/pci is built when
either VFIO_PCI or VFIO_PCI_COMMON is configured. But I'm afraid that is
a bit ugly, so I chose to let SAMPLE_VFIO_MDEV_PCI select VFIO_PCI. If
you have another idea, I would be pleased to hear it. :-)

> 
> > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> 
> VFIO_MDEV_DEVICE already depends on VFIO_MDEV. But maybe also make this
> depend on PCI?
> 
> > +	help
> > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > +	  this driver, device passthru should through mdev path.
> 
> "A PCI device bound to this driver will be assigned through the
> mediated device framework."
> 
> ?

Maybe I should phrase it as "A PCI device bound to this sample driver
should follow the passthrough steps for mdevs shown in
Documentation/driver-api/vfio-mediated-device.rst."

Does that make more sense?

Thanks,
Yi Liu

> 
> > +
> > +	  If you don't know what to do here, say N.
> >
> >  endif # SAMPLES


* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-16 13:23     ` Liu, Yi L
@ 2020-01-16 17:40       ` Cornelia Huck
  2020-01-18 14:23         ` Liu, Yi L
  0 siblings, 1 reply; 44+ messages in thread
From: Cornelia Huck @ 2020-01-16 17:40 UTC
  To: Liu, Yi L
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, Tian, Kevin, joro,
	peterx, baolu.lu, Masahiro Yamada

On Thu, 16 Jan 2020 13:23:28 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Cornelia Huck [mailto:cohuck@redhat.com]
> > Sent: Wednesday, January 15, 2020 8:30 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > 
> > On Tue,  7 Jan 2020 20:01:48 +0800
> > Liu Yi L <yi.l.liu@intel.com> wrote:

> > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > index 9d236c3..50d207c 100644
> > > --- a/samples/Kconfig
> > > +++ b/samples/Kconfig
> > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > >  	help
> > >  	  Build a sample program to work with mei device.
> > >
> > > +config SAMPLE_VFIO_MDEV_PCI
> > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > +	select VFIO_PCI_COMMON
> > > +	select VFIO_PCI  
> > 
> > Why does this still need to select VFIO_PCI? Shouldn't all needed
> > infrastructure rather be covered by VFIO_PCI_COMMON already?  
> 
> VFIO_PCI_COMMON is supposed to be a dependency of both VFIO_PCI and
> SAMPLE_VFIO_MDEV_PCI. However, the source code of VFIO_PCI_COMMON is
> under drivers/vfio/pci, which is only compiled when VFIO_PCI is configured.
> As an alternative to letting SAMPLE_VFIO_MDEV_PCI select VFIO_PCI, I could
> also add a line to drivers/vfio/Makefile so that the source code under
> drivers/vfio/pci is compiled when either VFIO_PCI or VFIO_PCI_COMMON is
> configured. But I'm afraid that is a bit ugly, so I chose to let
> SAMPLE_VFIO_MDEV_PCI select VFIO_PCI. If you have another idea, I would
> be pleased to know it. :-)

Shouldn't building drivers/vfio/pci/ for CONFIG_VFIO_PCI_COMMON already
be enough (the Makefile changes look fine to me)? Or am I missing
something obvious?

> 
> >   
> > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE  
> > 
> > VFIO_MDEV_DEVICE already depends on VFIO_MDEV. But maybe also make this
> > depend on PCI?
> >   
> > > +	help
> > > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > > +	  this driver, device passthru should through mdev path.  
> > 
> > "A PCI device bound to this driver will be assigned through the
> > mediated device framework."
> > 
> > ?  
> 
> Maybe I should phrase it as "A PCI device bound to this
> sample driver should follow the passthru steps for mdevs as shown
> in Documentation/driver-api/vfio-mediated-device.rst."
> 
> Does that make more sense?

Yes, it does :)



* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-16 12:33     ` Liu, Yi L
@ 2020-01-16 21:24       ` Alex Williamson
  2020-01-18 14:25         ` Liu, Yi L
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-16 21:24 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx,
	baolu.lu, Masahiro Yamada

On Thu, 16 Jan 2020 12:33:06 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, January 10, 2020 6:49 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > 
> > On Tue,  7 Jan 2020 20:01:48 +0800
> > Liu Yi L <yi.l.liu@intel.com> wrote:
> >   
> > > This patch adds a sample driver named vfio-mdev-pci. It wraps
> > > a PCI device as a mediated device. Once a PCI device is bound
> > > to the vfio-mdev-pci driver, user space accesses to the device
> > > go through the vfio mdev framework. The device is managed the
> > > same way as other mdevs, e.g. the user must create an mdev before
> > > exposing the device to user space.
> > >
> > > This driver serves as a sample driver for the recent changes in
> > > the "vfio/mdev: IOMMU aware mediated device" patchset. It could
> > > also be a good driver for experimenting with future device-specific
> > > mdev migration support. This sample driver only supports singleton
> > > IOMMU groups; trying to assign a non-singleton IOMMU group to a VM
> > > will fail.
> > >
> > > To use this driver:
> > > a) build and load vfio-mdev-pci.ko module
> > >    execute "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
> > >    then load it with the following commands:
> > >    > sudo modprobe vfio
> > >    > sudo modprobe vfio-pci
> > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko  
> > >
> > > b) unbind original device driver
> > >    e.g. use the following command to unbind its original driver:
> > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> > >
> > > c) bind vfio-mdev-pci driver to the physical device  
> > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> > >
> > > d) check the supported mdev types
> > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
> > >      vfio-mdev-pci-type_name  
> > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
> > >      vfio-mdev-pci-type_name/
> > >      available_instances  create  device_api  devices  name
> > >
> > > e)  create mdev on this physical device (only 1 instance)  
> > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
> > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > >      vfio-mdev-pci-type_name/create
> > >
> > > f) pass the mdev through to the guest
> > >    add the following to the QEMU command line
> > >     -device vfio-pci,\
> > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > >
> > > g) destroy mdev  
> > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
> > >      remove
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > ---
> > >  samples/Kconfig                       |  10 +
> > >  samples/Makefile                      |   1 +
> > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > >  4 files changed, 412 insertions(+)
> > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > >
> > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > index 9d236c3..50d207c 100644
> > > --- a/samples/Kconfig
> > > +++ b/samples/Kconfig
> > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > >  	help
> > >  	  Build a sample program to work with mei device.
> > >
> > > +config SAMPLE_VFIO_MDEV_PCI
> > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > +	select VFIO_PCI_COMMON
> > > +	select VFIO_PCI
> > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > +	help
> > > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > > +	  this driver, device passthru should through mdev path.
> > > +
> > > +	  If you don't know what to do here, say N.
> > >
> > >  endif # SAMPLES
> > > diff --git a/samples/Makefile b/samples/Makefile
> > > index 5ce50ef..84faced 100644
> > > --- a/samples/Makefile
> > > +++ b/samples/Makefile
> > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > >  obj-y					+= vfio-mdev/
> > > +obj-y					+= vfio-mdev-pci/  
> > 
> > I think we could just lump this into vfio-mdev rather than making
> > another directory.  
> 
> sure. will move it. :-)
> 
> >   
> > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > new file mode 100644
> > > index 0000000..41b2139
> > > --- /dev/null
> > > +++ b/samples/vfio-mdev-pci/Makefile
> > > @@ -0,0 +1,4 @@
> > > +# SPDX-License-Identifier: GPL-2.0-only
> > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > +
> > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > new file mode 100644
> > > index 0000000..b180356
> > > --- /dev/null
> > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > @@ -0,0 +1,397 @@
> > > +/*
> > > + * Copyright © 2020 Intel Corporation.
> > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * Derived from original vfio_pci.c:
> > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > + *
> > > + * Derived from original vfio:
> > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > + * Author: Tom Lyon, pugs@cisco.com
> > > + */
> > > +
> > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > +
> > > +#include <linux/device.h>
> > > +#include <linux/eventfd.h>
> > > +#include <linux/file.h>
> > > +#include <linux/interrupt.h>
> > > +#include <linux/iommu.h>
> > > +#include <linux/module.h>
> > > +#include <linux/mutex.h>
> > > +#include <linux/notifier.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/pm_runtime.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/types.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/vfio.h>
> > > +#include <linux/vgaarb.h>
> > > +#include <linux/nospec.h>
> > > +#include <linux/mdev.h>
> > > +#include <linux/vfio_pci_common.h>
> > > +
> > > +#define DRIVER_VERSION  "0.1"
> > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > +
> > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > +
> > > +static char ids[1024] __initdata;
> > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > +
> > > +static bool nointxmask;
> > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > +MODULE_PARM_DESC(nointxmask,
> > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > +
> > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > +static bool disable_vga;
> > > +module_param(disable_vga, bool, S_IRUGO);
> > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > +#endif
> > > +
> > > +static bool disable_idle_d3;
> > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > +MODULE_PARM_DESC(disable_idle_d3,
> > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > +
> > > +static struct pci_driver vfio_mdev_pci_driver;
> > > +
> > > +static ssize_t
> > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > +{
> > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > +}
> > > +
> > > +MDEV_TYPE_ATTR_RO(name);
> > > +
> > > +static ssize_t
> > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > +{
> > > +	return sprintf(buf, "%d\n", 1);
> > > +}
> > > +
> > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > +
> > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > +		char *buf)
> > > +{
> > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > +}
> > > +
> > > +MDEV_TYPE_ATTR_RO(device_api);
> > > +
> > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > +	&mdev_type_attr_name.attr,
> > > +	&mdev_type_attr_device_api.attr,
> > > +	&mdev_type_attr_available_instances.attr,
> > > +	NULL,
> > > +};
> > > +
> > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > +	.name  = "type1",
> > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > +};
> > > +
> > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > +	&vfio_mdev_pci_type_group1,
> > > +	NULL,
> > > +};
> > > +
> > > +struct vfio_mdev_pci {
> > > +	struct vfio_pci_device *vdev;
> > > +	struct mdev_device *mdev;
> > > +	unsigned long handle;
> > > +};
> > > +
> > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > +{
> > > +	struct device *pdev;
> > > +	struct vfio_pci_device *vdev;
> > > +	struct vfio_mdev_pci *pmdev;
> > > +	int ret;
> > > +
> > > +	pdev = mdev_parent_dev(mdev);
> > > +	vdev = dev_get_drvdata(pdev);
> > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > +	if (pmdev == NULL) {
> > > +		ret = -EBUSY;
> > > +		goto out;
> > > +	}
> > > +
> > > +	pmdev->mdev = mdev;
> > > +	pmdev->vdev = vdev;
> > > +	mdev_set_drvdata(mdev, pmdev);
> > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > +	if (ret) {
> > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > +		goto out;
> > > +	}
> > > +
> > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > +		     dev_name(mdev_dev(mdev)));
> > > +out:
> > > +	return ret;
> > > +}
> > > +
> > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > +{
> > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > +
> > > +	kfree(pmdev);
> > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > +		     dev_name(mdev_dev(mdev)));
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > +{
> > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > +	int ret = 0;
> > > +
> > > +	if (!try_module_get(THIS_MODULE))
> > > +		return -ENODEV;
> > > +
> > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > +
> > > +	mutex_lock(&vdev->reflck->lock);
> > > +
> > > +	if (!vdev->refcnt) {
> > > +		ret = vfio_pci_enable(vdev);
> > > +		if (ret)
> > > +			goto error;
> > > +
> > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > +	}
> > > +	vdev->refcnt++;
> > > +error:
> > > +	mutex_unlock(&vdev->reflck->lock);
> > > +	if (!ret)
> > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > +	else {
> > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > +		module_put(THIS_MODULE);
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > +{
> > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > +
> > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > +
> > > +	mutex_lock(&vdev->reflck->lock);
> > > +
> > > +	if (!(--vdev->refcnt)) {
> > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > +		vfio_pci_disable(vdev);
> > > +	}
> > > +
> > > +	mutex_unlock(&vdev->reflck->lock);
> > > +
> > > +	module_put(THIS_MODULE);
> > > +}  
> > 
> > open() and release() here are almost identical between vfio_pci and
> > vfio_mdev_pci, which suggests maybe there should be common functions to
> > call into like we do for the below.  
> 
> Yes, let me study it more and do a better abstraction in the next version. :-)
> 
> > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > +			     unsigned long arg)
> > > +{
> > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > +
> > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > +}
> > > +
> > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > +				struct vm_area_struct *vma)
> > > +{
> > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > +
> > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > +}
> > > +
> > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > +			size_t count, loff_t *ppos)
> > > +{
> > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > +
> > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > +}
> > > +
> > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > +				const char __user *buf,
> > > +				size_t count, loff_t *ppos)
> > > +{
> > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > +
> > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > +}
> > > +
> > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > +	.create			= vfio_mdev_pci_create,
> > > +	.remove			= vfio_mdev_pci_remove,
> > > +
> > > +	.open			= vfio_mdev_pci_open,
> > > +	.release		= vfio_mdev_pci_release,
> > > +
> > > +	.read			= vfio_mdev_pci_read,
> > > +	.write			= vfio_mdev_pci_write,
> > > +	.mmap			= vfio_mdev_pci_mmap,
> > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > +};
> > > +
> > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > +				       const struct pci_device_id *id)
> > > +{
> > > +	struct vfio_pci_device *vdev;
> > > +	int ret;
> > > +
> > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > +		return -EINVAL;
> > > +
> > > +	/*
> > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > +	 * userspace instance with VFs and PFs from the same device, which
> > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > +	 * reject these PFs and let the user sort it out.
> > > +	 */
> > > +	if (pci_num_vf(pdev)) {
> > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > +		return -EBUSY;
> > > +	}
> > > +
> > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > +	if (!vdev)
> > > +		return -ENOMEM;
> > > +
> > > +	vdev->pdev = pdev;
> > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > +	mutex_init(&vdev->igate);
> > > +	spin_lock_init(&vdev->irqlock);
> > > +	mutex_init(&vdev->ioeventfds_lock);
> > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > +	vdev->nointxmask = nointxmask;
> > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > +	vdev->disable_vga = disable_vga;
> > > +#endif
> > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > +
> > > +	pci_set_drvdata(pdev, vdev);
> > > +
> > > +	ret = vfio_pci_reflck_attach(vdev);
> > > +	if (ret) {
> > > +		pci_set_drvdata(pdev, NULL);
> > > +		kfree(vdev);
> > > +		return ret;
> > > +	}
> > > +
> > > +	if (vfio_pci_is_vga(pdev)) {
> > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > +		vga_set_legacy_decoding(pdev,
> > > +					vfio_pci_set_vga_decode(vdev, false));
> > > +	}
> > > +
> > > +	vfio_pci_probe_power_state(vdev);
> > > +
> > > +	if (!vdev->disable_idle_d3) {
> > > +		/*
> > > +		 * pci-core sets the device power state to an unknown value at
> > > +		 * bootup and after being removed from a driver.  The only
> > > +		 * transition it allows from this unknown state is to D0, which
> > > +		 * typically happens when a driver calls pci_enable_device().
> > > +		 * We're not ready to enable the device yet, but we do want to
> > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > +		 * before going to D3.
> > > +		 */
> > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > +	}  
> > 
> > Ditto here and remove below, this seems like boilerplate that shouldn't
> > be duplicated per leaf module.  Thanks,  
> 
> Sure, the code snippet above may also be abstracted into a common API
> provided by vfio-pci-common.ko. :-)
> 
> I have a question I'd like to confirm with you. Do you also want the
> code snippet below to be placed in vfio-pci-common.ko and exposed as a
> wrapped API, so that it can be used by the sample driver and other
> future drivers that want to wrap a PCI device as an mdev? Maybe I
> misunderstood your comment. :-(


I think some sort of vfio_pci_common_{probe,remove}() would be a
reasonable starting point, where the respective module _{probe,remove}
functions would call into these and add their module-specific code
around them.  That would at least give us a place to clean up, in the
common code, the things that are only used by the common code.
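A rough userspace model of that split, for illustration only. The
vfio_pci_common_{probe,remove}() names follow the suggestion above, but
they and every type below are hypothetical stand-ins, not an existing
kernel API; mdev registration is simulated by a flag. The common helper
absorbs the header-type check, the SR-IOV refusal, the allocation and
the D0-then-D3hot transition from the quoted probe, while the leaf
driver adds only its own step and unwinds the common part on failure.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace model only: all types and functions here are hypothetical. */
enum power_state { PS_UNKNOWN, PS_D0, PS_D3HOT };

struct mock_pci_dev {
	bool normal_header;   /* PCI_HEADER_TYPE_NORMAL in the real driver */
	int num_vf;           /* number of enabled SR-IOV VFs */
	bool fail_register;   /* test knob: make the leaf-specific step fail */
	void *drvdata;
};

struct vfio_pci_device {
	struct mock_pci_dev *pdev;
	bool disable_idle_d3;
	enum power_state power;
};

static int set_power(struct vfio_pci_device *vdev, enum power_state s)
{
	/* From the unknown boot-time state only D0 is allowed. */
	if (vdev->power == PS_UNKNOWN && s != PS_D0)
		return -EINVAL;
	vdev->power = s;
	return 0;
}

/* Boilerplate every leaf driver would otherwise duplicate. */
static int vfio_pci_common_probe(struct mock_pci_dev *pdev, bool disable_idle_d3)
{
	struct vfio_pci_device *vdev;

	if (!pdev->normal_header)
		return -EINVAL;
	if (pdev->num_vf)
		return -EBUSY;        /* refuse PFs with SR-IOV enabled */

	vdev = calloc(1, sizeof(*vdev));
	if (!vdev)
		return -ENOMEM;
	vdev->pdev = pdev;
	vdev->disable_idle_d3 = disable_idle_d3;
	pdev->drvdata = vdev;

	if (!disable_idle_d3) {
		set_power(vdev, PS_D0);     /* leave the unknown state first */
		set_power(vdev, PS_D3HOT);  /* then idle the unused device */
	}
	return 0;
}

static void vfio_pci_common_remove(struct mock_pci_dev *pdev)
{
	/* The real remove path also returns the device to D0; omitted here. */
	free(pdev->drvdata);
	pdev->drvdata = NULL;
}

/* Leaf driver: common probe first, then its module-specific step. */
static int mdev_pci_probe(struct mock_pci_dev *pdev)
{
	int ret = vfio_pci_common_probe(pdev, false);

	if (ret)
		return ret;
	/* Stand-in for registering the mdev parent device. */
	ret = pdev->fail_register ? -EIO : 0;
	if (ret)
		vfio_pci_common_remove(pdev);  /* unwind the common part */
	return ret;
}
```

A vfio-pci leaf would look the same, with its own registration step in
place of the mdev one.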

I'm still struggling with how we would make this user-consumable, should
we accept this and progress beyond a proof-of-concept sample driver.  For
example, if a vendor actually implements an mdev wrapper driver, or even
just a device-specific vfio-pci wrapper, to enable, for example,
migration support, how does a user know which driver to use for each
particular feature?  The best I can come up with so far is something
like what was done for the vfio-platform reset modules.  For instance, a
module that extends features for a given device in vfio-pci might
register an ops structure and id table with vfio-pci, along with
creating a module alias (or aliases) for the devices it supports.  When
a device is probed by vfio-pci, it could try to match against the
registered id tables to find a device-specific ops structure.  If one is
not found, it could do a request_module() using the PCI vendor and
device IDs and some unique vfio-pci string, check again, and use the
default ops if device-specific ops are still not present.  That would
solve the problem on the vfio-pci side.  For mdevs, I tend to assume
that this vfio-mdev-pci meta driver is an anomaly, only for the purpose
of creating a generic test device for IOMMU-backed mdevs, and that
"real" mdev vendor drivers will just be mdev-enlightened host drivers,
like i915 and nvidia are now.  Thanks,

Alex
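The id-table lookup with a request_module() fallback described above
could look roughly like this userspace sketch. All names
(vfio_pci_register_ops(), vfio_pci_find_ops(), the load callback
standing in for request_module(), and the alias format in the comment)
are assumptions for illustration, not an existing vfio-pci interface; a
real version would also need locking and module reference counting.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of device-specific ops registered with vfio-pci. */
struct pci_id { unsigned short vendor, device; };

struct vfio_pci_ops {
	const char *name;
	const struct pci_id *ids;   /* terminated by a { 0, 0 } entry */
};

#define MAX_OPS 8
static const struct vfio_pci_ops *registered[MAX_OPS];
static size_t nr_registered;

static int vfio_pci_register_ops(const struct vfio_pci_ops *ops)
{
	if (nr_registered == MAX_OPS)
		return -ENOSPC;
	registered[nr_registered++] = ops;
	return 0;
}

/* Walk every registered id table looking for an exact vendor/device match. */
static const struct vfio_pci_ops *match_ops(struct pci_id dev)
{
	for (size_t i = 0; i < nr_registered; i++)
		for (const struct pci_id *id = registered[i]->ids; id->vendor; id++)
			if (id->vendor == dev.vendor && id->device == dev.device)
				return registered[i];
	return NULL;
}

static const struct pci_id no_ids[] = { { 0, 0 } };
static const struct vfio_pci_ops default_ops = { "vfio-pci default", no_ids };

/*
 * Stand-in for something like request_module("vfio-pci:v%04Xd%04X", ...)
 * (hypothetical alias format); it may register additional ops.
 */
typedef void (*load_module_fn)(struct pci_id dev);

static const struct vfio_pci_ops *vfio_pci_find_ops(struct pci_id dev,
						    load_module_fn load)
{
	const struct vfio_pci_ops *ops = match_ops(dev);

	if (!ops && load) {
		load(dev);          /* try to autoload a device-specific module */
		ops = match_ops(dev);
	}
	return ops ? ops : &default_ops;
}
```

Checking the registered tables before asking for a module keeps the
common path cheap, and falling back to the default ops means devices
without a vendor module behave exactly as they do with plain vfio-pci.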



* RE: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-16 17:40       ` Cornelia Huck
@ 2020-01-18 14:23         ` Liu, Yi L
  2020-01-20  8:55           ` Cornelia Huck
  0 siblings, 1 reply; 44+ messages in thread
From: Liu, Yi L @ 2020-01-18 14:23 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, Tian, Kevin, joro,
	peterx, baolu.lu, Masahiro Yamada

> From: Cornelia Huck [mailto:cohuck@redhat.com]
> Sent: Friday, January 17, 2020 1:40 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> 
> On Thu, 16 Jan 2020 13:23:28 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Cornelia Huck [mailto:cohuck@redhat.com]
> > > Sent: Wednesday, January 15, 2020 8:30 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > >
> > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > > > diff --git a/samples/Kconfig b/samples/Kconfig index
> > > > 9d236c3..50d207c 100644
> > > > --- a/samples/Kconfig
> > > > +++ b/samples/Kconfig
> > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > >  	help
> > > >  	  Build a sample program to work with mei device.
> > > >
> > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > +	select VFIO_PCI_COMMON
> > > > +	select VFIO_PCI
> > >
> > > Why does this still need to select VFIO_PCI? Shouldn't all needed
> > > infrastructure rather be covered by VFIO_PCI_COMMON already?
> >
> > VFIO_PCI_COMMON is supposed to be a dependency of both VFIO_PCI and
> > SAMPLE_VFIO_MDEV_PCI. However, the source code of VFIO_PCI_COMMON is
> > under drivers/vfio/pci, which is only compiled when VFIO_PCI is configured.
> > As an alternative to letting SAMPLE_VFIO_MDEV_PCI select VFIO_PCI, I could
> > also add a line to drivers/vfio/Makefile so that the source code under
> > drivers/vfio/pci is compiled when either VFIO_PCI or VFIO_PCI_COMMON is
> > configured. But I'm afraid that is a bit ugly, so I chose to let
> > SAMPLE_VFIO_MDEV_PCI select VFIO_PCI. If you have another idea, I would
> > be pleased to know it. :-)
> 
> Shouldn't building drivers/vfio/pci/ for CONFIG_VFIO_PCI_COMMON already be
> enough (the Makefile changes look fine to me)? Or am I missing something obvious?

The problem is in drivers/vfio/Makefile: if CONFIG_VFIO_PCI is not
selected, the pci/ directory is not compiled at all. Even with
CONFIG_VFIO_PCI=m, the build fails if SAMPLE_VFIO_MDEV_PCI=y. So I let
SAMPLE_VFIO_MDEV_PCI select VFIO_PCI all the same. I'm not sure this is
good; maybe there is a better way to ensure pci/ is compiled. For
reference, drivers/vfio/Makefile currently reads:

# SPDX-License-Identifier: GPL-2.0
vfio_virqfd-y := virqfd.o

obj-$(CONFIG_VFIO) += vfio.o
obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
obj-$(CONFIG_VFIO_PCI) += pci/
obj-$(CONFIG_VFIO_PLATFORM) += platform/
obj-$(CONFIG_VFIO_MDEV) += mdev/

Thanks,
Yi Liu



* RE: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-16 21:24       ` Alex Williamson
@ 2020-01-18 14:25         ` Liu, Yi L
  2020-01-20 21:07           ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Liu, Yi L @ 2020-01-18 14:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx,
	baolu.lu, Masahiro Yamada

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, January 17, 2020 5:24 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> 
> On Thu, 16 Jan 2020 12:33:06 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, January 10, 2020 6:49 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > >
> > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > >
> > > > This patch adds a sample driver named vfio-mdev-pci. It wraps
> > > > a PCI device as a mediated device. Once a PCI device is bound
> > > > to the vfio-mdev-pci driver, user space accesses to the device
> > > > go through the vfio mdev framework. The device is managed the
> > > > same way as other mdevs, e.g. the user must create an mdev before
> > > > exposing the device to user space.
> > > >
> > > > This driver serves as a sample driver for the recent changes in
> > > > the "vfio/mdev: IOMMU aware mediated device" patchset. It could
> > > > also be a good driver for experimenting with future device-specific
> > > > mdev migration support. This sample driver only supports singleton
> > > > IOMMU groups; trying to assign a non-singleton IOMMU group to a VM
> > > > will fail.
> > > >
> > > > To use this driver:
> > > > a) build and load vfio-mdev-pci.ko module
> > > >    execute "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
> > > >    then load it with the following commands:
> > > >    > sudo modprobe vfio
> > > >    > sudo modprobe vfio-pci
> > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko
> > > >
> > > > b) unbind original device driver
> > > >    e.g. use the following command to unbind its original driver:
> > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> > > >
> > > > c) bind vfio-mdev-pci driver to the physical device
> > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> > > >
> > > > d) check the supported mdev types
> > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
> > > >      vfio-mdev-pci-type_name
> > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > >      vfio-mdev-pci-type_name/
> > > >      available_instances  create  device_api  devices  name
> > > >
> > > > e)  create mdev on this physical device (only 1 instance)
> > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > >      vfio-mdev-pci-type_name/create
> > > >
> > > > f) pass the mdev through to the guest
> > > >    add the following to the QEMU command line
> > > >     -device vfio-pci,\
> > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > >
> > > > g) destroy mdev
> > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> > > >      remove
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > ---
> > > >  samples/Kconfig                       |  10 +
> > > >  samples/Makefile                      |   1 +
> > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > >  4 files changed, 412 insertions(+)
> > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > >
> > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > index 9d236c3..50d207c 100644
> > > > --- a/samples/Kconfig
> > > > +++ b/samples/Kconfig
> > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > >  	help
> > > >  	  Build a sample program to work with mei device.
> > > >
> > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > +	select VFIO_PCI_COMMON
> > > > +	select VFIO_PCI
> > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > +	help
> > > > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > > > +	  this driver, device passthru should through mdev path.
> > > > +
> > > > +	  If you don't know what to do here, say N.
> > > >
> > > >  endif # SAMPLES
> > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > index 5ce50ef..84faced 100644
> > > > --- a/samples/Makefile
> > > > +++ b/samples/Makefile
> > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > >  obj-y					+= vfio-mdev/
> > > > +obj-y					+= vfio-mdev-pci/
> > >
> > > I think we could just lump this into vfio-mdev rather than making
> > > another directory.
> >
> > sure. will move it. :-)
> >
> > >
> > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > new file mode 100644
> > > > index 0000000..41b2139
> > > > --- /dev/null
> > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > @@ -0,0 +1,4 @@
> > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > +
> > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > new file mode 100644
> > > > index 0000000..b180356
> > > > --- /dev/null
> > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > @@ -0,0 +1,397 @@
> > > > +/*
> > > > + * Copyright © 2020 Intel Corporation.
> > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License version 2 as
> > > > + * published by the Free Software Foundation.
> > > > + *
> > > > + * Derived from original vfio_pci.c:
> > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > + *
> > > > + * Derived from original vfio:
> > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > + */
> > > > +
> > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > +
> > > > +#include <linux/device.h>
> > > > +#include <linux/eventfd.h>
> > > > +#include <linux/file.h>
> > > > +#include <linux/interrupt.h>
> > > > +#include <linux/iommu.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/mutex.h>
> > > > +#include <linux/notifier.h>
> > > > +#include <linux/pci.h>
> > > > +#include <linux/pm_runtime.h>
> > > > +#include <linux/slab.h>
> > > > +#include <linux/types.h>
> > > > +#include <linux/uaccess.h>
> > > > +#include <linux/vfio.h>
> > > > +#include <linux/vgaarb.h>
> > > > +#include <linux/nospec.h>
> > > > +#include <linux/mdev.h>
> > > > +#include <linux/vfio_pci_common.h>
> > > > +
> > > > +#define DRIVER_VERSION  "0.1"
> > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > +
> > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > +
> > > > +static char ids[1024] __initdata;
> > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > +
> > > > +static bool nointxmask;
> > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > +MODULE_PARM_DESC(nointxmask,
> > > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > > +
> > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > +static bool disable_vga;
> > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > +#endif
> > > > +
> > > > +static bool disable_idle_d3;
> > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > > +
> > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > +
> > > > +static ssize_t
> > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > +{
> > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > +}
> > > > +
> > > > +MDEV_TYPE_ATTR_RO(name);
> > > > +
> > > > +static ssize_t
> > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > +{
> > > > +	return sprintf(buf, "%d\n", 1);
> > > > +}
> > > > +
> > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > +
> > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > +		char *buf)
> > > > +{
> > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > +}
> > > > +
> > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > +
> > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > +	&mdev_type_attr_name.attr,
> > > > +	&mdev_type_attr_device_api.attr,
> > > > +	&mdev_type_attr_available_instances.attr,
> > > > +	NULL,
> > > > +};
> > > > +
> > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > +	.name  = "type1",
> > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > +};
> > > > +
> > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > +	&vfio_mdev_pci_type_group1,
> > > > +	NULL,
> > > > +};
> > > > +
> > > > +struct vfio_mdev_pci {
> > > > +	struct vfio_pci_device *vdev;
> > > > +	struct mdev_device *mdev;
> > > > +	unsigned long handle;
> > > > +};
> > > > +
> > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > +{
> > > > +	struct device *pdev;
> > > > +	struct vfio_pci_device *vdev;
> > > > +	struct vfio_mdev_pci *pmdev;
> > > > +	int ret;
> > > > +
> > > > +	pdev = mdev_parent_dev(mdev);
> > > > +	vdev = dev_get_drvdata(pdev);
> > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > +	if (pmdev == NULL) {
> > > > +		ret = -EBUSY;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	pmdev->mdev = mdev;
> > > > +	pmdev->vdev = vdev;
> > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > +	if (ret) {
> > > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > +		     dev_name(mdev_dev(mdev)));
> > > > +out:
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > +{
> > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > +
> > > > +	kfree(pmdev);
> > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > +		     dev_name(mdev_dev(mdev)));
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > +{
> > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > +	int ret = 0;
> > > > +
> > > > +	if (!try_module_get(THIS_MODULE))
> > > > +		return -ENODEV;
> > > > +
> > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > +
> > > > +	mutex_lock(&vdev->reflck->lock);
> > > > +
> > > > +	if (!vdev->refcnt) {
> > > > +		ret = vfio_pci_enable(vdev);
> > > > +		if (ret)
> > > > +			goto error;
> > > > +
> > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > +	}
> > > > +	vdev->refcnt++;
> > > > +error:
> > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > +	if (!ret)
> > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > +	else {
> > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > +		module_put(THIS_MODULE);
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > +{
> > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > +
> > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > +
> > > > +	mutex_lock(&vdev->reflck->lock);
> > > > +
> > > > +	if (!(--vdev->refcnt)) {
> > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > +		vfio_pci_disable(vdev);
> > > > +	}
> > > > +
> > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > +
> > > > +	module_put(THIS_MODULE);
> > > > +}
> > >
> > > open() and release() here are almost identical between vfio_pci and
> > > vfio_mdev_pci, which suggests maybe there should be common functions to
> > > call into like we do for the below.
> >
> > Yes, let me study this further and do a better abstraction in the next version. :-)
> >
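Alex's point that open() and release() are nearly identical in vfio_pci and vfio_mdev_pci can be illustrated with a small user-space model of the logic they share: enable the device on the first open, disable it on the last release, and protect the reference count with a lock. This is a hedged sketch under stated assumptions, not vfio-pci code: `struct common_dev`, `common_open()` and `common_release()` are hypothetical names, a pthread mutex stands in for `vdev->reflck->lock`, and `dev_enable()`/`dev_disable()` are stubs standing in for `vfio_pci_enable()`/`vfio_pci_disable()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the per-device state shared by vfio-pci and
 * vfio-mdev-pci; not a real kernel structure. */
struct common_dev {
	pthread_mutex_t lock;   /* stands in for vdev->reflck->lock */
	unsigned int refcnt;    /* stands in for vdev->refcnt */
	bool enabled;
	int enable_calls;
	int disable_calls;
	bool fail_enable;       /* lets a caller simulate an enable failure */
};

/* Stub for vfio_pci_enable(): 0 on success, negative on error. */
static int dev_enable(struct common_dev *d)
{
	if (d->fail_enable)
		return -1;
	d->enabled = true;
	d->enable_calls++;
	return 0;
}

/* Stub for vfio_pci_disable(). */
static void dev_disable(struct common_dev *d)
{
	d->enabled = false;
	d->disable_calls++;
}

/* Enable the device only on the first open; on failure the reference
 * count is left untouched, mirroring the "goto error" path in the patch. */
int common_open(struct common_dev *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->lock);
	if (!d->refcnt) {
		ret = dev_enable(d);
		if (ret)
			goto out;
	}
	d->refcnt++;
out:
	pthread_mutex_unlock(&d->lock);
	return ret;
}

/* Drop one reference and disable the device when the last one goes away. */
void common_release(struct common_dev *d)
{
	pthread_mutex_lock(&d->lock);
	assert(d->refcnt > 0);
	if (!--d->refcnt)
		dev_disable(d);
	pthread_mutex_unlock(&d->lock);
}
```

With helpers of this shape, each leaf module's open()/release() would shrink to its own try_module_get()/module_put() and EEH handling around a single call into the common code.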
> > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > +			     unsigned long arg)
> > > > +{
> > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > +
> > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > +}
> > > > +
> > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > +				struct vm_area_struct *vma)
> > > > +{
> > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > +
> > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > +}
> > > > +
> > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > +			size_t count, loff_t *ppos)
> > > > +{
> > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > +
> > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > +}
> > > > +
> > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > +				const char __user *buf,
> > > > +				size_t count, loff_t *ppos)
> > > > +{
> > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > +
> > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > > +}
> > > > +
> > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > > +	.create			= vfio_mdev_pci_create,
> > > > +	.remove			= vfio_mdev_pci_remove,
> > > > +
> > > > +	.open			= vfio_mdev_pci_open,
> > > > +	.release		= vfio_mdev_pci_release,
> > > > +
> > > > +	.read			= vfio_mdev_pci_read,
> > > > +	.write			= vfio_mdev_pci_write,
> > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > +};
> > > > +
> > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > +				       const struct pci_device_id *id)
> > > > +{
> > > > +	struct vfio_pci_device *vdev;
> > > > +	int ret;
> > > > +
> > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > +		return -EINVAL;
> > > > +
> > > > +	/*
> > > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > > +	 * userspace instance with VFs and PFs from the same device, which
> > > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > +	 * reject these PFs and let the user sort it out.
> > > > +	 */
> > > > +	if (pci_num_vf(pdev)) {
> > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > > +		return -EBUSY;
> > > > +	}
> > > > +
> > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > +	if (!vdev)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	vdev->pdev = pdev;
> > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > +	mutex_init(&vdev->igate);
> > > > +	spin_lock_init(&vdev->irqlock);
> > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > +	vdev->nointxmask = nointxmask;
> > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > +	vdev->disable_vga = disable_vga;
> > > > +#endif
> > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > +
> > > > +	pci_set_drvdata(pdev, vdev);
> > > > +
> > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > +	if (ret) {
> > > > +		pci_set_drvdata(pdev, NULL);
> > > > +		kfree(vdev);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > > +		vga_set_legacy_decoding(pdev,
> > > > +					vfio_pci_set_vga_decode(vdev, false));
> > > > +	}
> > > > +
> > > > +	vfio_pci_probe_power_state(vdev);
> > > > +
> > > > +	if (!vdev->disable_idle_d3) {
> > > > +		/*
> > > > +		 * pci-core sets the device power state to an unknown value at
> > > > +		 * bootup and after being removed from a driver.  The only
> > > > +		 * transition it allows from this unknown state is to D0, which
> > > > +		 * typically happens when a driver calls pci_enable_device().
> > > > +		 * We're not ready to enable the device yet, but we do want to
> > > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > > +		 * before going to D3.
> > > > +		 */
> > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > +	}
> > >
> > > Ditto here and remove below, this seems like boilerplate that shouldn't
> > > be duplicated per leaf module.  Thanks,
> >
> > Sure, the code snippet above may also be abstracted to be a common API
> > provided by vfio-pci-common.ko. :-)
> >
> > One thing I'd like to confirm with you: do you also want the code
> > snippet below placed in vfio-pci-common.ko and exported as a wrapper
> > API, so that it can be used by the sample driver and by future drivers
> > that want to wrap a PCI device as an mdev? Maybe I misunderstood
> > your comment. :-(
> 
> 
> I think some sort of vfio_pci_common_{probe,remove}() would be a
> reasonable starting point where the respective module _{probe,remove}
> functions would call into these and add their module specific code
> around it.  That would at least give us a point to cleanup things that
> are only used by the common code in the common code.

Sure, I can start from there if we continue in this direction. :-)
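The split Alex describes, where each module's _probe()/_remove() calls a shared vfio_pci_common_{probe,remove}() and adds its own code around it, might look roughly like the user-space model below. Everything here is hypothetical: `common_probe()`/`common_remove()` model the shared setup (allocation, reflck attach, idle D3 handling) and `leaf_probe()`/`leaf_remove()` model a module such as vfio-mdev-pci adding a step like registering as an mdev parent. The event log exists only to make the ordering visible: leaf code runs after the common setup on probe, and before the common teardown on remove.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Records the order in which setup/teardown steps run (demo only). */
#define LOG_MAX 8
static const char *event_log[LOG_MAX];
static int nevents;

static void log_event(const char *e)
{
	assert(nevents < LOG_MAX);
	event_log[nevents++] = e;
}

/* Hypothetical state owned by the common module. */
struct common_dev {
	bool idle_d3;   /* device parked in D3hot while unused */
};

/* Shared setup every leaf driver needs; NULL on allocation failure. */
struct common_dev *common_probe(bool disable_idle_d3)
{
	struct common_dev *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->idle_d3 = !disable_idle_d3;
	log_event("common_probe");
	return c;
}

void common_remove(struct common_dev *c)
{
	log_event("common_remove");
	free(c);
}

/* Hypothetical leaf driver wrapping the common helpers. */
struct leaf_dev {
	struct common_dev *common;
	bool registered;   /* models e.g. registering as an mdev parent */
};

int leaf_probe(struct leaf_dev *l, bool disable_idle_d3)
{
	l->common = common_probe(disable_idle_d3);
	if (!l->common)
		return -ENOMEM;
	l->registered = true;           /* leaf-specific step after common setup */
	log_event("leaf_register");
	return 0;
}

void leaf_remove(struct leaf_dev *l)
{
	l->registered = false;          /* undo the leaf-specific step first */
	log_event("leaf_unregister");
	common_remove(l->common);
	l->common = NULL;
}
```

Keeping teardown in strict reverse order of setup is the main design constraint; it also gives the common code a single place to clean up state that only it uses.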

> I'm still struggling with how we make this user consumable, should we accept
> this and progress beyond a proof of concept sample driver though.  For
> example, if a vendor actually implements an mdev wrapper driver or even
> just a device specific vfio-pci wrapper, to enable for example
> migration support, how does a user know which driver to use for each
> particular feature?  The best I can come up with so far is something
> like was done for vfio-platform reset modules.  For instance a module
> that extends features for a given device in vfio-pci might register an
> ops structure and id table with vfio-pci, along with creating a module
> alias (or aliases) for the devices it supports.  When a device is
> probed by vfio-pci it could try to match against registered id tables
> to find a device specific ops structure, if one is not found it could
> do a request_module using the PCI vendor and device IDs and some unique
> vfio-pci string, check again, and use the default ops if device
> specific ops are still not present.  That would solve the problem on
> the vfio-pci side. 

Yeah, this lets vfio-pci invoke ops provided by vendor drivers/modules.
I think this is what Yan is trying to do.
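The lookup order Alex outlines (match registered id tables; otherwise request a module by alias and check again; otherwise fall back to default ops) can be sketched as a small user-space model. All names are hypothetical: `register_ops()` models a vendor module registering an id table and ops structure with vfio-pci, `fake_request_module()` is a test double standing in for the kernel's `request_module()`, and the `vfio-pci:v%08Xd%08X` alias format is only loosely modelled on the PCI modalias, since the thread does not fix a format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct pci_id {
	unsigned short vendor, device;   /* table ends with a zero vendor */
};

/* Hypothetical device-specific ops; only a name for this demo. */
struct dev_ops {
	const char *name;
};

struct ops_entry {
	const struct pci_id *ids;
	const struct dev_ops *ops;
};

#define MAX_ENTRIES 4
static struct ops_entry registry[MAX_ENTRIES];
static size_t nregistered;
static const struct dev_ops default_ops = { "default" };

/* Models a vendor module registering its id table and ops with vfio-pci. */
int register_ops(const struct pci_id *ids, const struct dev_ops *ops)
{
	assert(ids && ops);
	if (nregistered == MAX_ENTRIES)
		return -1;
	registry[nregistered].ids = ids;
	registry[nregistered].ops = ops;
	nregistered++;
	return 0;
}

static const struct dev_ops *match_ops(unsigned short vendor,
				       unsigned short device)
{
	for (size_t i = 0; i < nregistered; i++)
		for (const struct pci_id *id = registry[i].ids; id->vendor; id++)
			if (id->vendor == vendor && id->device == device)
				return registry[i].ops;
	return NULL;
}

/* Test double for request_module(): a "module" that registers itself when
 * its alias is requested. The IDs 0xabcd:0x0001 are made up. */
static const struct pci_id ondemand_ids[] = { { 0xabcd, 0x0001 }, { 0, 0 } };
static const struct dev_ops ondemand_ops = { "on-demand" };

static void fake_request_module(const char *alias)
{
	if (strcmp(alias, "vfio-pci:v0000ABCDd00000001") == 0)
		register_ops(ondemand_ids, &ondemand_ops);
}

/* Probe-time lookup: registered tables, then module load, then default. */
const struct dev_ops *find_ops(unsigned short vendor, unsigned short device)
{
	const struct dev_ops *ops = match_ops(vendor, device);
	char alias[64];

	if (ops)
		return ops;
	snprintf(alias, sizeof(alias), "vfio-pci:v%08Xd%08X",
		 (unsigned int)vendor, (unsigned int)device);
	fake_request_module(alias);
	ops = match_ops(vendor, device);
	return ops ? ops : &default_ops;
}
```

A device with no matching module simply ends up with the default ops, so plain vfio-pci behaviour is preserved; a matching vendor module is loaded at most once, because later lookups hit the registry directly.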

> For mdevs, I tend to assume that this vfio-mdev-pci
> meta driver is an anomaly only for the purpose of creating a generic
> test device for IOMMU backed mdevs and that "real" mdev vendor drivers
> will just be mdev enlightened host drivers, like i915 and nvidia are
> now.  Thanks,

Yes, this vfio-mdev-pci meta driver just creates a test device.
Should we continue in the current direction, or is there another way
that would make it easier to add this meta driver?

Compared with the "real" mdev vendor drivers, it is like a
"vfio-pci + dummy mdev ops" driver, where "dummy mdev ops" means no
vendor-specific handling: each op passes straight through to the vfio-pci code.

I think this meta driver is even lighter than the "real" mdev vendor
drivers, right? Could this driver also follow the approach of
registering an ops structure and id table with vfio-pci? The obstacle
I see is that the meta driver is generic, which means it has no id
table, whereas the "real" mdev vendor drivers naturally have such info.
If vfio-mdev-pci could get the id info without binding to a device,
it might be possible. Thoughts? :-)

Thanks,
Yi Liu

> 
> Alex

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-18 14:23         ` Liu, Yi L
@ 2020-01-20  8:55           ` Cornelia Huck
  0 siblings, 0 replies; 44+ messages in thread
From: Cornelia Huck @ 2020-01-20  8:55 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, kwankhede, linux-kernel, kvm, Tian, Kevin, joro,
	peterx, baolu.lu, Masahiro Yamada

On Sat, 18 Jan 2020 14:23:45 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Cornelia Huck [mailto:cohuck@redhat.com]
> > Sent: Friday, January 17, 2020 1:40 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > 
> > On Thu, 16 Jan 2020 13:23:28 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Cornelia Huck [mailto:cohuck@redhat.com]
> > > > Sent: Wednesday, January 15, 2020 8:30 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > >
> > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > Liu Yi L <yi.l.liu@intel.com> wrote:  
> >   
> > > > > diff --git a/samples/Kconfig b/samples/Kconfig index
> > > > > 9d236c3..50d207c 100644
> > > > > --- a/samples/Kconfig
> > > > > +++ b/samples/Kconfig
> > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > >  	help
> > > > >  	  Build a sample program to work with mei device.
> > > > >
> > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > +	select VFIO_PCI_COMMON
> > > > > +	select VFIO_PCI  
> > > >
> > > > Why does this still need to select VFIO_PCI? Shouldn't all needed
> > > > infrastructure rather be covered by VFIO_PCI_COMMON already?  
> > >
> > > VFIO_PCI_COMMON is supposed to be a dependency of both VFIO_PCI and
> > > SAMPLE_VFIO_MDEV_PCI. However, the VFIO_PCI_COMMON source code lives
> > > under drivers/vfio/pci, which is only compiled when VFIO_PCI is configured.
> > > Instead of letting SAMPLE_VFIO_MDEV_PCI select VFIO_PCI, I could also
> > > add a line in drivers/vfio/Makefile so that drivers/vfio/pci is
> > > compiled when either VFIO_PCI or VFIO_PCI_COMMON is configured. But
> > > I'm afraid that is a bit ugly, so I chose to let SAMPLE_VFIO_MDEV_PCI
> > > select VFIO_PCI. If you have another idea, I would be pleased to hear
> > > it. :-)
> > 
> > Shouldn't building drivers/vfio/pci/ for CONFIG_VFIO_PCI_COMMON already be
> > enough (the Makefile changes look fine to me)? Or am I missing something obvious?  
> 
> The problem is in drivers/vfio/Makefile. If CONFIG_VFIO_PCI is not
> selected, the pci/ directory is not compiled. Even with CONFIG_VFIO_PCI=m,
> the build fails if SAMPLE_VFIO_MDEV_PCI=y. So I let SAMPLE_VFIO_MDEV_PCI
> select VFIO_PCI all the same. I'm not sure this is good; maybe there is
> a better way to ensure pci/ is compiled.
> 
> # SPDX-License-Identifier: GPL-2.0
> vfio_virqfd-y := virqfd.o
> 
> obj-$(CONFIG_VFIO) += vfio.o
> obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
> obj-$(CONFIG_VFIO_PCI) += pci/

That's actually what I meant: s/CONFIG_VFIO_PCI/CONFIG_VFIO_PCI_COMMON/ here.

> obj-$(CONFIG_VFIO_PLATFORM) += platform/
> obj-$(CONFIG_VFIO_MDEV) += mdev/
> 
> Thanks,
> Yi Liu
> 



* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-18 14:25         ` Liu, Yi L
@ 2020-01-20 21:07           ` Alex Williamson
  2020-01-21  7:43             ` Tian, Kevin
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-20 21:07 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kwankhede, linux-kernel, kvm, Tian, Kevin, joro, peterx,
	baolu.lu, Masahiro Yamada

On Sat, 18 Jan 2020 14:25:11 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, January 17, 2020 5:24 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > 
> > On Thu, 16 Jan 2020 12:33:06 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, January 10, 2020 6:49 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > >
> > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > This patch adds a sample driver named vfio-mdev-pci, which wraps
> > > > > a PCI device as a mediated device. Once a PCI device is bound to
> > > > > the vfio-mdev-pci driver, user space access to it goes through the
> > > > > vfio mdev framework, and the device is managed the mdev way, e.g.
> > > > > the user must create an mdev before exposing the device to user
> > > > > space.
> > > > >
> > > > > This driver is intended as a sample driver for the recent changes
> > > > > in the "vfio/mdev: IOMMU aware mediated device" patchset, and it
> > > > > could also be a good experimental driver for future device-specific
> > > > > mdev migration support. It only supports singleton IOMMU groups;
> > > > > it does not work with non-singleton IOMMU groups, and assigning
> > > > > such a group to a VM will fail.
> > > > >
> > > > > To use this driver:
> > > > > a) build and load the vfio-mdev-pci.ko module
> > > > >    run "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
> > > > >    then load it with the following commands:
> > > > >    > sudo modprobe vfio
> > > > >    > sudo modprobe vfio-pci
> > > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko  
> > > > >
> > > > > b) unbind original device driver
> > > > >    e.g. use following command to unbind its original driver  
> > > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> > > > >
> > > > > c) bind vfio-mdev-pci driver to the physical device  
> > > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> > > > >
> > > > > d) check the supported mdev instances  
> > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
> > > > >      vfio-mdev-pci-type_name  
> > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
> > > > >      vfio-mdev-pci-type_name/
> > > > >      available_instances  create  device_api  devices  name
> > > > >
> > > > > e) create an mdev on this physical device (only 1 instance)
> > > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
> > > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > >      vfio-mdev-pci-type_name/create
> > > > >
> > > > > f) pass the mdev through to the guest
> > > > >    add the following option to the QEMU command line:
> > > > >     -device vfio-pci,\
> > > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > > >
> > > > > g) destroy mdev  
> > > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
> > > > >      remove
> > > > >
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > ---
> > > > >  samples/Kconfig                       |  10 +
> > > > >  samples/Makefile                      |   1 +
> > > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > > >  4 files changed, 412 insertions(+)
> > > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > >
> > > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > > index 9d236c3..50d207c 100644
> > > > > --- a/samples/Kconfig
> > > > > +++ b/samples/Kconfig
> > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > >  	help
> > > > >  	  Build a sample program to work with mei device.
> > > > >
> > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > +	select VFIO_PCI_COMMON
> > > > > +	select VFIO_PCI
> > > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > > +	help
> > > > > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > > > > +	  this driver, device passthrough goes through the mdev path.
> > > > > +
> > > > > +	  If you don't know what to do here, say N.
> > > > >
> > > > >  endif # SAMPLES
> > > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > > index 5ce50ef..84faced 100644
> > > > > --- a/samples/Makefile
> > > > > +++ b/samples/Makefile
> > > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > > >  obj-y					+= vfio-mdev/
> > > > > +obj-y					+= vfio-mdev-pci/  
> > > >
> > > > I think we could just lump this into vfio-mdev rather than making
> > > > another directory.  
> > >
> > > sure. will move it. :-)
> > >  
> > > >  
> > > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > > new file mode 100644
> > > > > index 0000000..41b2139
> > > > > --- /dev/null
> > > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > > @@ -0,0 +1,4 @@
> > > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > > +
> > > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > new file mode 100644
> > > > > index 0000000..b180356
> > > > > --- /dev/null
> > > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > @@ -0,0 +1,397 @@
> > > > > +/*
> > > > > + * Copyright © 2020 Intel Corporation.
> > > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > > + *
> > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > + * published by the Free Software Foundation.
> > > > > + *
> > > > > + * Derived from original vfio_pci.c:
> > > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > > + *
> > > > > + * Derived from original vfio:
> > > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > > + */
> > > > > +
> > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > +
> > > > > +#include <linux/device.h>
> > > > > +#include <linux/eventfd.h>
> > > > > +#include <linux/file.h>
> > > > > +#include <linux/interrupt.h>
> > > > > +#include <linux/iommu.h>
> > > > > +#include <linux/module.h>
> > > > > +#include <linux/mutex.h>
> > > > > +#include <linux/notifier.h>
> > > > > +#include <linux/pci.h>
> > > > > +#include <linux/pm_runtime.h>
> > > > > +#include <linux/slab.h>
> > > > > +#include <linux/types.h>
> > > > > +#include <linux/uaccess.h>
> > > > > +#include <linux/vfio.h>
> > > > > +#include <linux/vgaarb.h>
> > > > > +#include <linux/nospec.h>
> > > > > +#include <linux/mdev.h>
> > > > > +#include <linux/vfio_pci_common.h>
> > > > > +
> > > > > +#define DRIVER_VERSION  "0.1"
> > > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > > +
> > > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > > +
> > > > > +static char ids[1024] __initdata;
> > > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > > +
> > > > > +static bool nointxmask;
> > > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > > +MODULE_PARM_DESC(nointxmask,
> > > > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > > > +
> > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > +static bool disable_vga;
> > > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > > +#endif
> > > > > +
> > > > > +static bool disable_idle_d3;
> > > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > > > +
> > > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > > +
> > > > > +static ssize_t
> > > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > +{
> > > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > > +}
> > > > > +
> > > > > +MDEV_TYPE_ATTR_RO(name);
> > > > > +
> > > > > +static ssize_t
> > > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > +{
> > > > > +	return sprintf(buf, "%d\n", 1);
> > > > > +}
> > > > > +
> > > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > > +
> > > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > > +		char *buf)
> > > > > +{
> > > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > > +}
> > > > > +
> > > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > > +
> > > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > > +	&mdev_type_attr_name.attr,
> > > > > +	&mdev_type_attr_device_api.attr,
> > > > > +	&mdev_type_attr_available_instances.attr,
> > > > > +	NULL,
> > > > > +};
> > > > > +
> > > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > > +	.name  = "type1",
> > > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > > +};
> > > > > +
> > > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > > +	&vfio_mdev_pci_type_group1,
> > > > > +	NULL,
> > > > > +};
> > > > > +
> > > > > +struct vfio_mdev_pci {
> > > > > +	struct vfio_pci_device *vdev;
> > > > > +	struct mdev_device *mdev;
> > > > > +	unsigned long handle;
> > > > > +};
> > > > > +
> > > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > > +{
> > > > > +	struct device *pdev;
> > > > > +	struct vfio_pci_device *vdev;
> > > > > +	struct vfio_mdev_pci *pmdev;
> > > > > +	int ret;
> > > > > +
> > > > > +	pdev = mdev_parent_dev(mdev);
> > > > > +	vdev = dev_get_drvdata(pdev);
> > > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > > +	if (pmdev == NULL) {
> > > > > +		ret = -EBUSY;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	pmdev->mdev = mdev;
> > > > > +	pmdev->vdev = vdev;
> > > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > > +	if (ret) {
> > > > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > +out:
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > > +{
> > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > +
> > > > > +	kfree(pmdev);
> > > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > > +{
> > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > +		return -ENODEV;
> > > > > +
> > > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > > +
> > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > +
> > > > > +	if (!vdev->refcnt) {
> > > > > +		ret = vfio_pci_enable(vdev);
> > > > > +		if (ret)
> > > > > +			goto error;
> > > > > +
> > > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > +	}
> > > > > +	vdev->refcnt++;
> > > > > +error:
> > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > +	if (!ret)
> > > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > +	else {
> > > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > +		module_put(THIS_MODULE);
> > > > > +	}
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > > +{
> > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > +
> > > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > +
> > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > +
> > > > > +	if (!(--vdev->refcnt)) {
> > > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > +		vfio_pci_disable(vdev);
> > > > > +	}
> > > > > +
> > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > +
> > > > > +	module_put(THIS_MODULE);
> > > > > +}  
> > > >
> > > > open() and release() here are almost identical between vfio_pci and
> > > > vfio_mdev_pci, which suggests maybe there should be common functions to
> > > > call into like we do for the below.  
> > >
> > > yes, let me study this more and do a better abstraction in the next version. :-)
> > >  
> > > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > > +			     unsigned long arg)
> > > > > +{
> > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > +
> > > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > > +}
> > > > > +
> > > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > > +				struct vm_area_struct *vma)
> > > > > +{
> > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > +
> > > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > > +}
> > > > > +
> > > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > > +			size_t count, loff_t *ppos)
> > > > > +{
> > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > +
> > > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > > +}
> > > > > +
> > > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > > +				const char __user *buf,
> > > > > +				size_t count, loff_t *ppos)
> > > > > +{
> > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > +
> > > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > > > +}
> > > > > +
> > > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > > > +	.create			= vfio_mdev_pci_create,
> > > > > +	.remove			= vfio_mdev_pci_remove,
> > > > > +
> > > > > +	.open			= vfio_mdev_pci_open,
> > > > > +	.release		= vfio_mdev_pci_release,
> > > > > +
> > > > > +	.read			= vfio_mdev_pci_read,
> > > > > +	.write			= vfio_mdev_pci_write,
> > > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > > +};
> > > > > +
> > > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > > +				       const struct pci_device_id *id)
> > > > > +{
> > > > > +	struct vfio_pci_device *vdev;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	/*
> > > > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > > > +	 * userspace instance with VFs and PFs from the same device, which
> > > > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > > +	 * reject these PFs and let the user sort it out.
> > > > > +	 */
> > > > > +	if (pci_num_vf(pdev)) {
> > > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > > > +		return -EBUSY;
> > > > > +	}
> > > > > +
> > > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > > +	if (!vdev)
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	vdev->pdev = pdev;
> > > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > > +	mutex_init(&vdev->igate);
> > > > > +	spin_lock_init(&vdev->irqlock);
> > > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > > +	vdev->nointxmask = nointxmask;
> > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > +	vdev->disable_vga = disable_vga;
> > > > > +#endif
> > > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > > +
> > > > > +	pci_set_drvdata(pdev, vdev);
> > > > > +
> > > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > > +	if (ret) {
> > > > > +		pci_set_drvdata(pdev, NULL);
> > > > > +		kfree(vdev);
> > > > > +		return ret;
> > > > > +	}
> > > > > +
> > > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > > > +		vga_set_legacy_decoding(pdev,
> > > > > +					vfio_pci_set_vga_decode(vdev, false));
> > > > > +	}
> > > > > +
> > > > > +	vfio_pci_probe_power_state(vdev);
> > > > > +
> > > > > +	if (!vdev->disable_idle_d3) {
> > > > > +		/*
> > > > > +		 * pci-core sets the device power state to an unknown value at
> > > > > +		 * bootup and after being removed from a driver.  The only
> > > > > +		 * transition it allows from this unknown state is to D0, which
> > > > > +		 * typically happens when a driver calls pci_enable_device().
> > > > > +		 * We're not ready to enable the device yet, but we do want to
> > > > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > > > +		 * before going to D3.
> > > > > +		 */
> > > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > > +	}  
> > > >
> > > > Ditto here and remove below, this seems like boilerplate that shouldn't
> > > > be duplicated per leaf module.  Thanks,  
> > >
> > > Sure, the code snippet above may also be abstracted to be a common API
> > > provided by vfio-pci-common.ko. :-)
> > >
> > > I have a question I'd like to confirm with you. Do you also want the
> > > below code snippet to be placed in vfio-pci-common.ko and exposed
> > > as a wrapped API? Then it could be used by the sample driver and other
> > > future drivers that want to wrap a PCI device as an mdev. Maybe I
> > > misunderstood your comment. :-(
> > 
> > 
> > I think some sort of vfio_pci_common_{probe,remove}() would be a
> > reasonable starting point where the respective module _{probe,remove}
> > functions would call into these and add their module specific code
> > around it.  That would at least give us a point to cleanup things that
> > are only used by the common code in the common code.  
> 
> sure, I can start from here if we are still going with this direction. :-)
> 
> > I'm still struggling with how we make this user consumable should we accept
> > this and progress beyond a proof of concept sample driver though.  For
> > example, if a vendor actually implements an mdev wrapper driver or even
> > just a device specific vfio-pci wrapper, to enable for example
> > migration support, how does a user know which driver to use for each
> > particular feature?  The best I can come up with so far is something
> > like was done for vfio-platform reset modules.  For instance a module
> > that extends features for a given device in vfio-pci might register an
> > ops structure and id table with vfio-pci, along with creating a module
> > alias (or aliases) for the devices it supports.  When a device is
> > probed by vfio-pci it could try to match against registered id tables
> > to find a device specific ops structure, if one is not found it could
> > do a request_module using the PCI vendor and device IDs and some unique
> > vfio-pci string, check again, and use the default ops if device
> > specific ops are still not present.  That would solve the problem on
> > the vfio-pci side.   
> 
> yeah, this lets vfio-pci invoke the ops from vendor drivers/modules.
> I think this is what Yan is trying to do.

I think I'm suggesting a callback ops structure a level above what Yan
previously proposed.  For example, could we have device specific
vfio_device_ops where the vendor module can call out to common code
rather than requiring common code to test for and optionally call out
to device specific code.
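A minimal userspace sketch of the inversion described here, in which a vendor module's ops call into shared helpers instead of the common code testing for and calling out to vendor hooks. Every sketch_* name is hypothetical, not existing vfio or kernel API; the refcounting mirrors the open()/release() code quoted earlier in this thread, with the reflck lock omitted because the model is single threaded.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for struct vfio_pci_device. */
struct sketch_pci_device {
	int refcnt;       /* number of active opens */
	int enabled;      /* stands in for the state set by vfio_pci_enable() */
	int vendor_state; /* extra state owned by a vendor module */
};

/* Hypothetical, cut-down stand-in for a vfio_device_ops-like table. */
struct sketch_device_ops {
	int (*open)(struct sketch_pci_device *vdev);
	void (*release)(struct sketch_pci_device *vdev);
};

/*
 * Common helpers: the refcounting that vfio_pci and vfio_mdev_pci both
 * carry today. The kernel code holds vdev->reflck->lock around this;
 * this single threaded sketch omits the lock.
 */
static int sketch_common_open(struct sketch_pci_device *vdev)
{
	if (vdev->refcnt == 0)
		vdev->enabled = 1;	/* first open enables the device */
	vdev->refcnt++;
	return 0;
}

static void sketch_common_release(struct sketch_pci_device *vdev)
{
	assert(vdev->refcnt > 0);
	if (--vdev->refcnt == 0)
		vdev->enabled = 0;	/* last release disables the device */
}

/* Vendor module: calls out to the common code, then adds its own work. */
static int sketch_vendor_open(struct sketch_pci_device *vdev)
{
	int ret = sketch_common_open(vdev);

	if (ret)
		return ret;
	vdev->vendor_state++;	/* e.g. device specific migration setup */
	return 0;
}

static void sketch_vendor_release(struct sketch_pci_device *vdev)
{
	vdev->vendor_state--;
	sketch_common_release(vdev);
}

/* Plain vfio-pci behaviour: the common helpers used directly. */
static const struct sketch_device_ops sketch_default_ops = {
	.open = sketch_common_open,
	.release = sketch_common_release,
};

/* Device specific ops provided by a hypothetical vendor module. */
static const struct sketch_device_ops sketch_vendor_ops = {
	.open = sketch_vendor_open,
	.release = sketch_vendor_release,
};
```

Plain vfio-pci would install sketch_default_ops while a vendor module installs its own table; the common helpers never need to know that vendor code exists.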
 
> > For mdevs, I tend to assume that this vfio-mdev-pci
> > meta driver is an anomaly only for the purpose of creating a generic
> > test device for IOMMU backed mdevs and that "real" mdev vendor
> > drivers will just be mdev enlightened host drivers, like i915 and
> > nvidia are now.  Thanks,  
> 
> yes, this vfio-mdev-pci meta driver just creates a test device.
> Should we continue in the current direction, or is there an easier
> way to add this meta driver?

I think if the code split allows us to create an environment where
vendor drivers can re-use much of vfio-pci while creating a
vfio_device_ops that supports additional features for their device and
we bring that all together with a request module interface and module
aliases to make that work seamlessly, then it has value.  A concern I
have in only doing this split in order to create the vfio-mdev-pci
module is that it leaves open the question and groundwork for forking
vfio-pci into multiple vendor specific modules that would become a mess
for users to manage.
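The probe-time flow proposed in the quoted text above (match registered id tables, otherwise request a module by vendor/device ID and check again, then fall back to the default ops) could be modelled roughly as follows. This is a userspace sketch under stated assumptions: the registry, the sketch_* names and the "vfio-pci:v%08Xd%08X" alias format are hypothetical, and sketch_request_module() only simulates what request_module() plus a module alias would achieve in the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical id entry; a {0, 0} entry terminates a table. */
struct sketch_id {
	unsigned int vendor, device;
};

struct sketch_ops {
	const char *name;
};

/* What a vendor module would register with vfio-pci at init time. */
struct sketch_driver {
	const struct sketch_id *ids;
	const struct sketch_ops *ops;
};

#define SKETCH_MAX_DRIVERS 8
static const struct sketch_driver *sketch_registry[SKETCH_MAX_DRIVERS];
static int sketch_nr_drivers;

static const struct sketch_ops sketch_default_ops = { "vfio-pci default" };

static int sketch_register(const struct sketch_driver *drv)
{
	assert(drv && drv->ids && drv->ops);
	if (sketch_nr_drivers == SKETCH_MAX_DRIVERS)
		return -1;
	sketch_registry[sketch_nr_drivers++] = drv;
	return 0;
}

/* Search every registered id table for an exact vendor/device match. */
static const struct sketch_ops *sketch_match(unsigned int vendor,
					     unsigned int device)
{
	for (int i = 0; i < sketch_nr_drivers; i++) {
		const struct sketch_id *id;

		for (id = sketch_registry[i]->ids; id->vendor; id++)
			if (id->vendor == vendor && id->device == device)
				return sketch_registry[i]->ops;
	}
	return NULL;
}

/* A fake vendor module that "loads" when its alias is requested. */
static const struct sketch_id sketch_vendor_ids[] = { { 0x8086, 0x1234 }, { 0, 0 } };
static const struct sketch_ops sketch_vendor_ops = { "vendor specific" };
static const struct sketch_driver sketch_vendor_drv = {
	sketch_vendor_ids, &sketch_vendor_ops
};
static char sketch_last_alias[64];

/* Stand-in for request_module(): records the alias, "loads" a match. */
static int sketch_request_module(const char *alias)
{
	snprintf(sketch_last_alias, sizeof(sketch_last_alias), "%s", alias);
	if (strcmp(alias, "vfio-pci:v00008086d00001234") == 0)
		return sketch_register(&sketch_vendor_drv);
	return -1;
}

/* Probe-time lookup: registry, then module request and retry, then default. */
static const struct sketch_ops *sketch_probe_ops(unsigned int vendor,
						 unsigned int device)
{
	const struct sketch_ops *ops = sketch_match(vendor, device);
	char alias[64];

	if (ops)
		return ops;
	snprintf(alias, sizeof(alias), "vfio-pci:v%08Xd%08X", vendor, device);
	if (sketch_request_module(alias) == 0) {
		ops = sketch_match(vendor, device);
		if (ops)
			return ops;
	}
	return &sketch_default_ops;
}
```

Because the lookup always ends in the default ops, devices without a vendor module keep today's vfio-pci behaviour, and the user only has to bind vfio-pci rather than choose among forked modules.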
 
> Compared with the "real" mdev vendor drivers, it is like a
> "vfio-pci + dummy mdev ops" driver, where dummy mdev ops means
> no vendor-specific handling, just passing through to the vfio-pci
> code directly.
> 
> I think this meta driver is even lighter than the "real" mdev vendor
> drivers, right? Is it possible to let this driver follow the approach of
> registering an ops structure and id table with vfio-pci? The obstacle
> I see is that the meta driver is generic, which means it has
> no id table. The "real" mdev vendor drivers naturally have
> such info. If vfio-mdev-pci could also get the id info without binding
> to a device, it might be possible. Thoughts? :-)

IDs could be provided via a module option or potentially with
build-time options.  That might allow us to test all aspects of the
above proposal, ie. allowing sub-modules to provide vfio_device_ops for
specific devices, allowing those vendor vfio_device_ops to re-use much
of the existing vfio-pci code in that implementation, and a mechanism
for generically testing IOMMU backed mdevs.  That's starting to sound a
lot more worthwhile than moving a bunch of code around only to
implement a sample driver for the latter.  Thoughts?  Thanks,
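A hedged sketch of supplying IDs through a module option, reusing the leading vendor:device fields of the ids= format that vfio-pci and this sample already accept. The sketch_* names are hypothetical, and the optional subvendor, subdevice and class fields are deliberately rejected rather than parsed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical id entry, covering only the vendor:device subset. */
struct sketch_id {
	unsigned int vendor, device;
};

/*
 * Parse a comma separated list of hex "vendor:device" pairs into @table.
 * Returns the number of entries, or -1 on a malformed entry or a full
 * table.
 */
static int sketch_parse_ids(const char *opt, struct sketch_id *table, int max)
{
	const char *p = opt;
	int n = 0;

	assert(opt && table && max >= 0);
	while (*p) {
		unsigned int vendor, device;
		int consumed = 0;

		if (n == max)
			return -1;
		if (sscanf(p, "%x:%x%n", &vendor, &device, &consumed) != 2)
			return -1;
		p += consumed;
		if (*p == ',')
			p++;		/* move on to the next entry */
		else if (*p != '\0')
			return -1;	/* e.g. an unsupported subvendor field */
		table[n].vendor = vendor;
		table[n].device = device;
		n++;
	}
	return n;
}

/* Return 1 if vendor/device appears in the first @n entries of @table. */
static int sketch_id_match(const struct sketch_id *table, int n,
			   unsigned int vendor, unsigned int device)
{
	for (int i = 0; i < n; i++)
		if (table[i].vendor == vendor && table[i].device == device)
			return 1;
	return 0;
}
```

A table built this way would let a sub-module know which devices it serves without first binding to one, which is the obstacle raised above for a generic meta driver.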

Alex



* RE: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-20 21:07           ` Alex Williamson
@ 2020-01-21  7:43             ` Tian, Kevin
  2020-01-21  8:43               ` Yan Zhao
  0 siblings, 1 reply; 44+ messages in thread
From: Tian, Kevin @ 2020-01-21  7:43 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: kwankhede, linux-kernel, kvm, joro, peterx, baolu.lu, Masahiro Yamada

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, January 21, 2020 5:08 AM
> 
> On Sat, 18 Jan 2020 14:25:11 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, January 17, 2020 5:24 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > >
> > > On Thu, 16 Jan 2020 12:33:06 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, January 10, 2020 6:49 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > >
> > > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > This patch adds a sample driver named vfio-mdev-pci. It wraps
> > > > > > a PCI device as a mediated device. For a PCI device, once bound
> > > > > > to the vfio-mdev-pci driver, user space access to this device will
> > > > > > go through the vfio mdev framework. The usage of the device follows
> > > > > > the mdev management method, e.g. the user should create an mdev
> > > > > > before exposing the device to user space.
> > > > > >
> > > > > > The benefit of this new driver is that it acts as a sample driver
> > > > > > for the recent changes from the "vfio/mdev: IOMMU aware mediated device"
> > > > > > patchset. It could also be a good experimental driver for future
> > > > > > device specific mdev migration support. This sample driver only
> > > > > > supports singleton iommu groups; it does not work for non-singleton
> > > > > > iommu groups and will fail when trying to assign a non-singleton
> > > > > > iommu group to VMs.
> > > > > >
> > > > > > To use this driver:
> > > > > > a) build and load the vfio-mdev-pci.ko module
> > > > > >    run "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
> > > > > >    then load it with the following commands:
> > > > > >    > sudo modprobe vfio
> > > > > >    > sudo modprobe vfio-pci
> > > > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko
> > > > > >
> > > > > > b) unbind the original device driver
> > > > > >    e.g. use the following command to unbind its original driver
> > > > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> > > > > >
> > > > > > c) bind the vfio-mdev-pci driver to the physical device
> > > > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> > > > > >
> > > > > > d) check the supported mdev instances
> > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
> > > > > >      vfio-mdev-pci-type_name
> > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > >      vfio-mdev-pci-type_name/
> > > > > >      available_instances  create  device_api  devices  name
> > > > > >
> > > > > > e) create an mdev on this physical device (only 1 instance)
> > > > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> > > > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > >      vfio-mdev-pci-type_name/create
> > > > > >
> > > > > > f) pass the mdev through to the guest
> > > > > >    add the following line to the QEMU command line
> > > > > >     -device vfio-pci,\
> > > > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > > > >
> > > > > > g) destroy the mdev
> > > > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> > > > > >      remove
> > > > > >
> > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > ---
> > > > > >  samples/Kconfig                       |  10 +
> > > > > >  samples/Makefile                      |   1 +
> > > > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > > > >  4 files changed, 412 insertions(+)
> > > > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > >
> > > > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > > > index 9d236c3..50d207c 100644
> > > > > > --- a/samples/Kconfig
> > > > > > +++ b/samples/Kconfig
> > > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > > >  	help
> > > > > >  	  Build a sample program to work with mei device.
> > > > > >
> > > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > > +	select VFIO_PCI_COMMON
> > > > > > +	select VFIO_PCI
> > > > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > > > +	help
> > > > > > +	  Sample driver for wrapping a PCI device as an mdev. Once bound to
> > > > > > +	  this driver, device passthrough goes through the mdev path.
> > > > > > +
> > > > > > +	  If you don't know what to do here, say N.
> > > > > >
> > > > > >  endif # SAMPLES
> > > > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > > > index 5ce50ef..84faced 100644
> > > > > > --- a/samples/Makefile
> > > > > > +++ b/samples/Makefile
> > > > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > > > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > > > >  obj-y					+= vfio-mdev/
> > > > > > +obj-y					+= vfio-mdev-pci/
> > > > >
> > > > > I think we could just lump this into vfio-mdev rather than making
> > > > > another directory.
> > > >
> > > > sure. will move it. :-)
> > > >
> > > > >
> > > > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > > > new file mode 100644
> > > > > > index 0000000..41b2139
> > > > > > --- /dev/null
> > > > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > > > @@ -0,0 +1,4 @@
> > > > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > > > +
> > > > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > new file mode 100644
> > > > > > index 0000000..b180356
> > > > > > --- /dev/null
> > > > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > @@ -0,0 +1,397 @@
> > > > > > +/*
> > > > > > + * Copyright © 2020 Intel Corporation.
> > > > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > > > + *
> > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > + * published by the Free Software Foundation.
> > > > > > + *
> > > > > > + * Derived from original vfio_pci.c:
> > > > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > > > + *
> > > > > > + * Derived from original vfio:
> > > > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > > > + */
> > > > > > +
> > > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > > +
> > > > > > +#include <linux/device.h>
> > > > > > +#include <linux/eventfd.h>
> > > > > > +#include <linux/file.h>
> > > > > > +#include <linux/interrupt.h>
> > > > > > +#include <linux/iommu.h>
> > > > > > +#include <linux/module.h>
> > > > > > +#include <linux/mutex.h>
> > > > > > +#include <linux/notifier.h>
> > > > > > +#include <linux/pci.h>
> > > > > > +#include <linux/pm_runtime.h>
> > > > > > +#include <linux/slab.h>
> > > > > > +#include <linux/types.h>
> > > > > > +#include <linux/uaccess.h>
> > > > > > +#include <linux/vfio.h>
> > > > > > +#include <linux/vgaarb.h>
> > > > > > +#include <linux/nospec.h>
> > > > > > +#include <linux/mdev.h>
> > > > > > +#include <linux/vfio_pci_common.h>
> > > > > > +
> > > > > > +#define DRIVER_VERSION  "0.1"
> > > > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > > > +
> > > > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > > > +
> > > > > > +static char ids[1024] __initdata;
> > > > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > > > +
> > > > > > +static bool nointxmask;
> > > > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > > > +MODULE_PARM_DESC(nointxmask,
> > > > > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > > > > +
> > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > +static bool disable_vga;
> > > > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > > > +#endif
> > > > > > +
> > > > > > +static bool disable_idle_d3;
> > > > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > > > > +
> > > > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > > > +
> > > > > > +static ssize_t
> > > > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > +{
> > > > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > > > +}
> > > > > > +
> > > > > > +MDEV_TYPE_ATTR_RO(name);
> > > > > > +
> > > > > > +static ssize_t
> > > > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > +{
> > > > > > +	return sprintf(buf, "%d\n", 1);
> > > > > > +}
> > > > > > +
> > > > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > > > +
> > > > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > > > +		char *buf)
> > > > > > +{
> > > > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > > > +}
> > > > > > +
> > > > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > > > +
> > > > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > > > +	&mdev_type_attr_name.attr,
> > > > > > +	&mdev_type_attr_device_api.attr,
> > > > > > +	&mdev_type_attr_available_instances.attr,
> > > > > > +	NULL,
> > > > > > +};
> > > > > > +
> > > > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > > > +	.name  = "type1",
> > > > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > > > +};
> > > > > > +
> > > > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > > > +	&vfio_mdev_pci_type_group1,
> > > > > > +	NULL,
> > > > > > +};
> > > > > > +
> > > > > > +struct vfio_mdev_pci {
> > > > > > +	struct vfio_pci_device *vdev;
> > > > > > +	struct mdev_device *mdev;
> > > > > > +	unsigned long handle;
> > > > > > +};
> > > > > > +
> > > > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > > > +{
> > > > > > +	struct device *pdev;
> > > > > > +	struct vfio_pci_device *vdev;
> > > > > > +	struct vfio_mdev_pci *pmdev;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	pdev = mdev_parent_dev(mdev);
> > > > > > +	vdev = dev_get_drvdata(pdev);
> > > > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > > > +	if (pmdev == NULL) {
> > > > > > +		ret = -EBUSY;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +
> > > > > > +	pmdev->mdev = mdev;
> > > > > > +	pmdev->vdev = vdev;
> > > > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > > > +	if (ret) {
> > > > > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > > > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +
> > > > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > +out:
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > > > +{
> > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > +
> > > > > > +	kfree(pmdev);
> > > > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > > > +{
> > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > +	int ret = 0;
> > > > > > +
> > > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > > +		return -ENODEV;
> > > > > > +
> > > > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > > > +
> > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > +
> > > > > > +	if (!vdev->refcnt) {
> > > > > > +		ret = vfio_pci_enable(vdev);
> > > > > > +		if (ret)
> > > > > > +			goto error;
> > > > > > +
> > > > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > > +	}
> > > > > > +	vdev->refcnt++;
> > > > > > +error:
> > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > +	if (!ret)
> > > > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > +	else {
> > > > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > +		module_put(THIS_MODULE);
> > > > > > +	}
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > > > +{
> > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > +
> > > > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > +
> > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > +
> > > > > > +	if (!(--vdev->refcnt)) {
> > > > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > > +		vfio_pci_disable(vdev);
> > > > > > +	}
> > > > > > +
> > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > +
> > > > > > +	module_put(THIS_MODULE);
> > > > > > +}
> > > > >
> > > > > open() and release() here are almost identical between vfio_pci and
> > > > > vfio_mdev_pci, which suggests maybe there should be common functions to
> > > > > call into like we do for the below.
> > > >
> > > > yes, let me study this more and do a better abstraction in the next version. :-)
> > > >
> > > > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > > > +			     unsigned long arg)
> > > > > > +{
> > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > +
> > > > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > > > +}
> > > > > > +
> > > > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > > > +				struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > +
> > > > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > > > +}
> > > > > > +
> > > > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > > > +			size_t count, loff_t *ppos)
> > > > > > +{
> > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > +
> > > > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > > > +}
> > > > > > +
> > > > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > > > +				const char __user *buf,
> > > > > > +				size_t count, loff_t *ppos)
> > > > > > +{
> > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > +
> > > > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > > > > +}
> > > > > > +
> > > > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > > > > +	.create			= vfio_mdev_pci_create,
> > > > > > +	.remove			= vfio_mdev_pci_remove,
> > > > > > +
> > > > > > +	.open			= vfio_mdev_pci_open,
> > > > > > +	.release		= vfio_mdev_pci_release,
> > > > > > +
> > > > > > +	.read			= vfio_mdev_pci_read,
> > > > > > +	.write			= vfio_mdev_pci_write,
> > > > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > > > +};
> > > > > > +
> > > > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > > > +				       const struct pci_device_id *id)
> > > > > > +{
> > > > > > +	struct vfio_pci_device *vdev;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > > > > +	 * userspace instance with VFs and PFs from the same device, which
> > > > > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > > > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > > > +	 * reject these PFs and let the user sort it out.
> > > > > > +	 */
> > > > > > +	if (pci_num_vf(pdev)) {
> > > > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > > > > +		return -EBUSY;
> > > > > > +	}
> > > > > > +
> > > > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > > > +	if (!vdev)
> > > > > > +		return -ENOMEM;
> > > > > > +
> > > > > > +	vdev->pdev = pdev;
> > > > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > > > +	mutex_init(&vdev->igate);
> > > > > > +	spin_lock_init(&vdev->irqlock);
> > > > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > > > +	vdev->nointxmask = nointxmask;
> > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > +	vdev->disable_vga = disable_vga;
> > > > > > +#endif
> > > > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > > > +
> > > > > > +	pci_set_drvdata(pdev, vdev);
> > > > > > +
> > > > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > > > +	if (ret) {
> > > > > > +		pci_set_drvdata(pdev, NULL);
> > > > > > +		kfree(vdev);
> > > > > > +		return ret;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > > > > +		vga_set_legacy_decoding(pdev,
> > > > > > +					vfio_pci_set_vga_decode(vdev, false));
> > > > > > +	}
> > > > > > +
> > > > > > +	vfio_pci_probe_power_state(vdev);
> > > > > > +
> > > > > > +	if (!vdev->disable_idle_d3) {
> > > > > > +		/*
> > > > > > +		 * pci-core sets the device power state to an unknown value at
> > > > > > +		 * bootup and after being removed from a driver.  The only
> > > > > > +		 * transition it allows from this unknown state is to D0, which
> > > > > > +		 * typically happens when a driver calls pci_enable_device().
> > > > > > +		 * We're not ready to enable the device yet, but we do want to
> > > > > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > > > > +		 * before going to D3.
> > > > > > +		 */
> > > > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > > > +	}
> > > > >
> > > > > Ditto here and remove below, this seems like boilerplate that shouldn't
> > > > > be duplicated per leaf module.  Thanks,
> > > >
> > > > Sure, the code snippet above may also be abstracted into a common API
> > > > provided by vfio-pci-common.ko. :-)
> > > >
> > > > I'm a bit confused and would like to confirm with you. Do you also
> > > > want the code snippet below to be placed in vfio-pci-common.ko and
> > > > exposed as a wrapped API, so that it can be used by the sample driver
> > > > and other future drivers which want to wrap a PCI device as an mdev?
> > > > Maybe I misunderstood your comment. :-(
> > >
> > >
> > > I think some sort of vfio_pci_common_{probe,remove}() would be a
> > > reasonable starting point where the respective module _{probe,remove}
> > > functions would call into these and add their module specific code
> > > around it.  That would at least give us a point to cleanup things that
> > > are only used by the common code in the common code.
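To illustrate the common probe/remove split suggested above, here is a minimal userspace model. It is not kernel code and not from the patchset: a shared probe/remove pair does the allocation and the D0-then-D3hot parking, and a leaf module wraps it with its own setup. All names (`model_vdev`, `common_probe()`, `leaf_probe()`) are hypothetical stand-ins for a `vfio_pci_common_{probe,remove}()` pair and the per-module callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; not kernel code. */
enum model_pstate { MODEL_UNKNOWN, MODEL_D0, MODEL_D3HOT };

struct model_vdev {
	bool nointxmask;
	bool disable_idle_d3;
	enum model_pstate state;
	int transitions;	/* number of power state changes made */
	bool leaf_ready;	/* set by the leaf module's own probe code */
};

static void model_set_power_state(struct model_vdev *vdev, enum model_pstate s)
{
	vdev->state = s;
	vdev->transitions++;
}

/* Shared boilerplate that each leaf module currently duplicates. */
static struct model_vdev *common_probe(bool nointxmask, bool disable_idle_d3)
{
	struct model_vdev *vdev = calloc(1, sizeof(*vdev));

	if (!vdev)
		return NULL;
	vdev->nointxmask = nointxmask;
	vdev->disable_idle_d3 = disable_idle_d3;
	vdev->state = MODEL_UNKNOWN;
	if (!vdev->disable_idle_d3) {
		/* Only unknown -> D0 is allowed, so reach D3hot via D0. */
		model_set_power_state(vdev, MODEL_D0);
		model_set_power_state(vdev, MODEL_D3HOT);
	}
	return vdev;
}

static void common_remove(struct model_vdev *vdev)
{
	free(vdev);
}

/* Leaf module probe/remove: module specific code around the common path. */
static struct model_vdev *leaf_probe(bool nointxmask, bool disable_idle_d3)
{
	struct model_vdev *vdev = common_probe(nointxmask, disable_idle_d3);

	if (vdev)
		vdev->leaf_ready = true;	/* e.g. mdev parent registration */
	return vdev;
}

static void leaf_remove(struct model_vdev *vdev)
{
	vdev->leaf_ready = false;	/* undo leaf setup before common teardown */
	common_remove(vdev);
}
```

With this shape, state that only the common code touches can be initialised and torn down in one place, and each leaf only adds what is specific to it.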
> >
> > Sure, I can start from here if we are still going in this direction. :-)
> >
> > > I'm still struggling with how we make this user consumable, should we
> > > accept this and progress beyond a proof of concept sample driver,
> > > though.  For example, if a vendor actually implements an mdev wrapper
> > > driver, or even just a device specific vfio-pci wrapper, to enable for
> > > example migration support, how does a user know which driver to use
> > > for each particular feature?  The best I can come up with so far is
> > > something like what was done for vfio-platform reset modules.  For
> > > instance, a module that extends features for a given device in vfio-pci
> > > might register an ops structure and id table with vfio-pci, along with
> > > creating a module alias (or aliases) for the devices it supports.  When
> > > a device is probed by vfio-pci, it could try to match against
> > > registered id tables to find a device specific ops structure.  If one
> > > is not found, it could do a request_module using the PCI vendor and
> > > device IDs and some unique vfio-pci string, check again, and use the
> > > default ops if device specific ops are still not present.  That would
> > > solve the problem on the vfio-pci side.
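The lookup order proposed above (registered id tables first, then request_module, then default ops) can be modelled in userspace as below. This is a sketch under assumptions, not kernel code: `model_request_module_fake()` stands in for the kernel's `request_module()`, and the `vfio-pci-%04x-%04x` alias format and the 8086:1234 IDs are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model of the proposed ops lookup. */
struct model_ops { const char *name; };

struct model_entry {
	unsigned int vendor, device;
	const struct model_ops *ops;
};

#define MODEL_MAX_ENTRIES 8
static struct model_entry model_registry[MODEL_MAX_ENTRIES];
static int model_nr_entries;

static const struct model_ops model_default_ops = { "vfio-pci default" };
static const struct model_ops model_vendor_ops = { "vendor specific" };

/* What a device specific module would call from its module_init(). */
static int model_register_ops(unsigned int vendor, unsigned int device,
			      const struct model_ops *ops)
{
	if (model_nr_entries == MODEL_MAX_ENTRIES)
		return -1;
	model_registry[model_nr_entries].vendor = vendor;
	model_registry[model_nr_entries].device = device;
	model_registry[model_nr_entries].ops = ops;
	model_nr_entries++;
	return 0;
}

static const struct model_ops *model_match(unsigned int vendor, unsigned int device)
{
	for (int i = 0; i < model_nr_entries; i++)
		if (model_registry[i].vendor == vendor &&
		    model_registry[i].device == device)
			return model_registry[i].ops;
	return NULL;
}

/* Stand-in for request_module(): only an 8086:1234 module "exists". */
static int model_request_module_fake(const char *alias)
{
	if (strcmp(alias, "vfio-pci-8086-1234") == 0)
		return model_register_ops(0x8086, 0x1234, &model_vendor_ops);
	return -1;
}

/* Match registered tables, else load by alias and retry, else default ops. */
static const struct model_ops *model_find_ops(unsigned int vendor, unsigned int device)
{
	char alias[32];
	const struct model_ops *ops = model_match(vendor, device);

	if (ops)
		return ops;
	snprintf(alias, sizeof(alias), "vfio-pci-%04x-%04x", vendor, device);
	if (model_request_module_fake(alias) == 0)
		ops = model_match(vendor, device);
	return ops ? ops : &model_default_ops;
}
```

After the first successful load, later probes of the same device hit the registered table directly and no further module request is made.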
> >
> > Yeah, this lets vfio-pci invoke the ops from vendor drivers/modules.
> > I think this is what Yan is trying to do.
> 
> I think I'm suggesting a callback ops structure a level above what Yan
> previously proposed.  For example, could we have device specific
> vfio_device_ops where the vendor module can call out to common code
> rather than requiring common code to test for and optionally call out
> to device specific code.
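A minimal model of the layering described above, where the device specific ops call out to common code rather than common code testing for device specific hooks. This is an illustrative userspace sketch; the names and the `MODEL_VENDOR_CMD` command are hypothetical, and `common_ioctl()` only stands in for something like `vfio_pci_ioctl()`.

```c
#include <assert.h>

/* Hypothetical userspace model; names are illustrative only. */
struct model_dev {
	int common_calls;
	int vendor_calls;
};

struct model_device_ops {
	long (*ioctl)(struct model_dev *d, unsigned int cmd);
};

/* Generic handler shared by all devices (think vfio_pci_ioctl()). */
static long common_ioctl(struct model_dev *d, unsigned int cmd)
{
	(void)cmd;
	d->common_calls++;
	return 0;
}

#define MODEL_VENDOR_CMD 0x100u	/* hypothetical device specific command */

/* Vendor ops handle what they add and call out to common code otherwise. */
static long vendor_ioctl(struct model_dev *d, unsigned int cmd)
{
	if (cmd == MODEL_VENDOR_CMD) {
		d->vendor_calls++;
		return 0;
	}
	return common_ioctl(d, cmd);
}

static const struct model_device_ops common_dev_ops = { .ioctl = common_ioctl };
static const struct model_device_ops vendor_dev_ops = { .ioctl = vendor_ioctl };
```

The common handler never needs to know that the vendor command exists; only the vendor ops choose when to delegate.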
> 
> > > For mdevs, I tend to assume that this vfio-mdev-pci
> > > meta driver is an anomaly only for the purpose of creating a generic
> > > test device for IOMMU backed mdevs and that "real" mdev vendor
> > > drivers will just be mdev enlightened host drivers, like i915 and
> > > nvidia are now.  Thanks,
> >
> > Yes, this vfio-mdev-pci meta driver just creates a test device.
> > Should we still go with the current direction, or find another way
> > that may make adding this meta driver easier?
> 
> I think if the code split allows us to create an environment where
> vendor drivers can re-use much of vfio-pci while creating a
> vfio_device_ops that supports additional features for their device and
> we bring that all together with a request module interface and module
> aliases to make that work seamlessly, then it has value.  A concern I
> have in only doing this split in order to create the vfio-mdev-pci
> module is that it leaves open the question and groundwork for forking
> vfio-pci into multiple vendor specific modules that would become a mess
> for users to manage.
> 
> > Compared with the "real" mdev vendor drivers, it is like a
> > "vfio-pci + dummy mdev ops" driver, where dummy mdev ops means
> > no vendor specific handling: calls pass through to the vfio-pci
> > code directly.
> >
> > I think this meta driver is even lighter than the "real" mdev vendor
> > drivers, right? Is it possible to let this driver follow the way of
> > registering an ops structure and id table with vfio-pci? The obstacle
> > I can see is that the meta driver is a generic driver, which means it
> > has no id table... The "real" mdev vendor drivers naturally have such
> > info. If vfio-mdev-pci could also get the id info without binding
> > to a device, it may be possible. Thoughts? :-)
> 
> IDs could be provided via a module option or potentially with
> build-time options.  That might allow us to test all aspects of the
> above proposal, ie. allowing sub-modules to provide vfio_device_ops for
> specific devices, allowing those vendor vfio_device_ops to re-use much
> of the existing vfio-pci code in that implementation, and a mechanism
> for generically testing IOMMU backed mdevs.  That's starting to sound a
> lot more worthwhile than moving a bunch of code around only to
> implement a sample driver for the latter.  Thoughts?  Thanks,
> 

Sounds like a good idea. If feasible, I suppose Yan's mediate_ops series
can also be largely avoided. The vendor driver can directly register its
own vfio_device_ops and selectively introduce proprietary logic
(e.g. for tracking dirty pages) on top of the generic vfio_pci code.

Thanks
Kevin


* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-21  7:43             ` Tian, Kevin
@ 2020-01-21  8:43               ` Yan Zhao
  2020-01-21 20:04                 ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Yan Zhao @ 2020-01-21  8:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, kwankhede, linux-kernel, kvm, joro,
	peterx, baolu.lu, Masahiro Yamada

On Tue, Jan 21, 2020 at 03:43:02PM +0800, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, January 21, 2020 5:08 AM
> > 
> > On Sat, 18 Jan 2020 14:25:11 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, January 17, 2020 5:24 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > >
> > > > On Thu, 16 Jan 2020 12:33:06 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, January 10, 2020 6:49 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > >
> > > > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > >
> > > > > > > This patch adds a sample driver named vfio-mdev-pci, which wraps
> > > > > > > a PCI device as a mediated device. Once a PCI device is bound to
> > > > > > > the vfio-mdev-pci driver, user space access to the device goes
> > > > > > > through the vfio mdev framework. The device is managed with the
> > > > > > > usual mdev management method, e.g. the user must create an mdev
> > > > > > > before exposing the device to user space.
> > > > > > >
> > > > > > > This new driver can serve as a sample driver for the recent changes
> > > > > > > from the "vfio/mdev: IOMMU aware mediated device" patchset. It could
> > > > > > > also be a good experimental driver for future device specific mdev
> > > > > > > migration support. This sample driver only supports singleton iommu
> > > > > > > groups; it does not work with non-singleton iommu groups, and
> > > > > > > assigning a non-singleton iommu group to a VM will fail.
> > > > > > >
> > > > > > > To use this driver:
> > > > > > > a) build and load the vfio-mdev-pci.ko module
> > > > > > >    run "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
> > > > > > >    then load it with the following commands:
> > > > > > >    > sudo modprobe vfio
> > > > > > >    > sudo modprobe vfio-pci
> > > > > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko
> > > > > > >
> > > > > > > b) unbind the original device driver
> > > > > > >    e.g. use the following command to unbind its original driver
> > > > > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> > > > > > >
> > > > > > > c) bind the vfio-mdev-pci driver to the physical device
> > > > > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> > > > > > >
> > > > > > > d) check the supported mdev instances
> > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
> > > > > > >      vfio-mdev-pci-type_name
> > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > > >      vfio-mdev-pci-type_name/
> > > > > > >      available_instances  create  device_api  devices  name
> > > > > > >
> > > > > > > e) create an mdev on this physical device (only 1 instance)
> > > > > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> > > > > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > > >      vfio-mdev-pci-type_name/create
> > > > > > >
> > > > > > > f) pass the mdev through to the guest
> > > > > > >    add the following line to the QEMU boot command
> > > > > > >     -device vfio-pci,\
> > > > > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > > > > >
> > > > > > > g) destroy the mdev
> > > > > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/remove
> > > > > > >
> > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > ---
> > > > > > >  samples/Kconfig                       |  10 +
> > > > > > >  samples/Makefile                      |   1 +
> > > > > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > > > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > > > > >  4 files changed, 412 insertions(+)
> > > > > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > > > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > >
> > > > > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > > > > index 9d236c3..50d207c 100644
> > > > > > > --- a/samples/Kconfig
> > > > > > > +++ b/samples/Kconfig
> > > > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > > > >  	help
> > > > > > >  	  Build a sample program to work with mei device.
> > > > > > >
> > > > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > > > +	select VFIO_PCI_COMMON
> > > > > > > +	select VFIO_PCI
> > > > > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > > > > +	help
> > > > > > > +	  Sample driver for wrapping a PCI device as an mdev. Once bound to
> > > > > > > +	  this driver, device passthrough goes through the mdev path.
> > > > > > > +
> > > > > > > +	  If you don't know what to do here, say N.
> > > > > > >
> > > > > > >  endif # SAMPLES
> > > > > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > > > > index 5ce50ef..84faced 100644
> > > > > > > --- a/samples/Makefile
> > > > > > > +++ b/samples/Makefile
> > > > > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > > > > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > > > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > > > > >  obj-y					+= vfio-mdev/
> > > > > > > +obj-y					+= vfio-mdev-pci/
> > > > > >
> > > > > > I think we could just lump this into vfio-mdev rather than making
> > > > > > another directory.
> > > > >
> > > > > sure. will move it. :-)
> > > > >
> > > > > >
> > > > > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > > > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > > > > new file mode 100644
> > > > > > > index 0000000..41b2139
> > > > > > > --- /dev/null
> > > > > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > > > > @@ -0,0 +1,4 @@
> > > > > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > > > > +
> > > > > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000..b180356
> > > > > > > --- /dev/null
> > > > > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > @@ -0,0 +1,397 @@
> > > > > > > +/*
> > > > > > > + * Copyright © 2020 Intel Corporation.
> > > > > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > + *
> > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > > + * published by the Free Software Foundation.
> > > > > > > + *
> > > > > > > + * Derived from original vfio_pci.c:
> > > > > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > + *
> > > > > > > + * Derived from original vfio:
> > > > > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > > > > + */
> > > > > > > +
> > > > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > > > +
> > > > > > > +#include <linux/device.h>
> > > > > > > +#include <linux/eventfd.h>
> > > > > > > +#include <linux/file.h>
> > > > > > > +#include <linux/interrupt.h>
> > > > > > > +#include <linux/iommu.h>
> > > > > > > +#include <linux/module.h>
> > > > > > > +#include <linux/mutex.h>
> > > > > > > +#include <linux/notifier.h>
> > > > > > > +#include <linux/pci.h>
> > > > > > > +#include <linux/pm_runtime.h>
> > > > > > > +#include <linux/slab.h>
> > > > > > > +#include <linux/types.h>
> > > > > > > +#include <linux/uaccess.h>
> > > > > > > +#include <linux/vfio.h>
> > > > > > > +#include <linux/vgaarb.h>
> > > > > > > +#include <linux/nospec.h>
> > > > > > > +#include <linux/mdev.h>
> > > > > > > +#include <linux/vfio_pci_common.h>
> > > > > > > +
> > > > > > > +#define DRIVER_VERSION  "0.1"
> > > > > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > > > > +
> > > > > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > > > > +
> > > > > > > +static char ids[1024] __initdata;
> > > > > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > > > > +
> > > > > > > +static bool nointxmask;
> > > > > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > > > > +MODULE_PARM_DESC(nointxmask,
> > > > > > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > > > > > +
> > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > +static bool disable_vga;
> > > > > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +static bool disable_idle_d3;
> > > > > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > > > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > > > > > +
> > > > > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > > > > +
> > > > > > > +static ssize_t
> > > > > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > +{
> > > > > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > > > > +}
> > > > > > > +
> > > > > > > +MDEV_TYPE_ATTR_RO(name);
> > > > > > > +
> > > > > > > +static ssize_t
> > > > > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > +{
> > > > > > > +	return sprintf(buf, "%d\n", 1);
> > > > > > > +}
> > > > > > > +
> > > > > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > > > > +
> > > > > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > > > > +		char *buf)
> > > > > > > +{
> > > > > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > > > > +}
> > > > > > > +
> > > > > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > > > > +
> > > > > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > > > > +	&mdev_type_attr_name.attr,
> > > > > > > +	&mdev_type_attr_device_api.attr,
> > > > > > > +	&mdev_type_attr_available_instances.attr,
> > > > > > > +	NULL,
> > > > > > > +};
> > > > > > > +
> > > > > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > > > > +	.name  = "type1",
> > > > > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > > > > +};
> > > > > > > +
> > > > > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > > > > +	&vfio_mdev_pci_type_group1,
> > > > > > > +	NULL,
> > > > > > > +};
> > > > > > > +
> > > > > > > +struct vfio_mdev_pci {
> > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > +	struct mdev_device *mdev;
> > > > > > > +	unsigned long handle;
> > > > > > > +};
> > > > > > > +
> > > > > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > > > > +{
> > > > > > > +	struct device *pdev;
> > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > +	struct vfio_mdev_pci *pmdev;
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	pdev = mdev_parent_dev(mdev);
> > > > > > > +	vdev = dev_get_drvdata(pdev);
> > > > > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > > > > +	if (pmdev == NULL) {
> > > > > > > +		ret = -ENOMEM;
> > > > > > > +		goto out;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	pmdev->mdev = mdev;
> > > > > > > +	pmdev->vdev = vdev;
> > > > > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > > > > +	if (ret) {
> > > > > > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > > > > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > > > > > +		goto out;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > +out:
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > > > > +{
> > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > +
> > > > > > > +	kfree(pmdev);
> > > > > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > +
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > > > > +{
> > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > +	int ret = 0;
> > > > > > > +
> > > > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > > > +		return -ENODEV;
> > > > > > > +
> > > > > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > > > > +
> > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > +
> > > > > > > +	if (!vdev->refcnt) {
> > > > > > > +		ret = vfio_pci_enable(vdev);
> > > > > > > +		if (ret)
> > > > > > > +			goto error;
> > > > > > > +
> > > > > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > > > +	}
> > > > > > > +	vdev->refcnt++;
> > > > > > > +error:
> > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > +	if (!ret)
> > > > > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > +	else {
> > > > > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > +		module_put(THIS_MODULE);
> > > > > > > +	}
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > > > > +{
> > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > +
> > > > > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > +
> > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > +
> > > > > > > +	if (!(--vdev->refcnt)) {
> > > > > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > > > +		vfio_pci_disable(vdev);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > +
> > > > > > > +	module_put(THIS_MODULE);
> > > > > > > +}
> > > > > >
> > > > > > open() and release() here are almost identical between vfio_pci and
> > > > > > vfio_mdev_pci, which suggests maybe there should be common functions
> > > > > > to call into like we do for the below.
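One possible shape for the common open/release helpers suggested above is sketched below as a userspace model: the first open enables the device, the last release disables it, and the reference count only changes under the lock and only on success. This is not code from the patchset; `pthread_mutex_t` stands in for `vdev->reflck->lock`, and the helper names and the `simulate_enable_failure` flag are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace model of shared open/release helpers. */
struct model_refdev {
	pthread_mutex_t lock;	/* stands in for vdev->reflck->lock */
	int refcnt;
	int enabled;		/* 1 between first open and last release */
	int enable_calls;
};

/* First opener enables the device; refcnt only grows on success. */
static int model_common_open(struct model_refdev *vdev, int simulate_enable_failure)
{
	int ret = 0;

	pthread_mutex_lock(&vdev->lock);
	if (!vdev->refcnt) {
		vdev->enable_calls++;
		if (simulate_enable_failure) {
			ret = -1;
			goto out;
		}
		vdev->enabled = 1;	/* e.g. vfio_pci_enable() + EEH open */
	}
	vdev->refcnt++;
out:
	pthread_mutex_unlock(&vdev->lock);
	return ret;
}

/* Last releaser disables the device. */
static void model_common_release(struct model_refdev *vdev)
{
	pthread_mutex_lock(&vdev->lock);
	if (!--vdev->refcnt)
		vdev->enabled = 0;	/* e.g. EEH release + vfio_pci_disable() */
	pthread_mutex_unlock(&vdev->lock);
}
```

Each module's own open/release would then only add its try_module_get()/module_put() and logging around these calls.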
> > > > >
> > > > > Yes, let me study this more and do a better abstraction in the next version. :-)
> > > > >
> > > > > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > > > > +			     unsigned long arg)
> > > > > > > +{
> > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > +
> > > > > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > > > > +				struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > +
> > > > > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > > > > +			size_t count, loff_t *ppos)
> > > > > > > +{
> > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > +
> > > > > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > > > > +				const char __user *buf,
> > > > > > > +				size_t count, loff_t *ppos)
> > > > > > > +{
> > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > +
> > > > > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > > > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > > > > > +	.create			= vfio_mdev_pci_create,
> > > > > > > +	.remove			= vfio_mdev_pci_remove,
> > > > > > > +
> > > > > > > +	.open			= vfio_mdev_pci_open,
> > > > > > > +	.release		= vfio_mdev_pci_release,
> > > > > > > +
> > > > > > > +	.read			= vfio_mdev_pci_read,
> > > > > > > +	.write			= vfio_mdev_pci_write,
> > > > > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > > > > +};
> > > > > > > +
> > > > > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > > > > +				       const struct pci_device_id *id)
> > > > > > > +{
> > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > > > > +		return -EINVAL;
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > > > > > +	 * userspace instance with VFs and PFs from the same device, which
> > > > > > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > > > > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > > > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > > > > +	 * reject these PFs and let the user sort it out.
> > > > > > > +	 */
> > > > > > > +	if (pci_num_vf(pdev)) {
> > > > > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > > > > > +		return -EBUSY;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > > > > +	if (!vdev)
> > > > > > > +		return -ENOMEM;
> > > > > > > +
> > > > > > > +	vdev->pdev = pdev;
> > > > > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > > > > +	mutex_init(&vdev->igate);
> > > > > > > +	spin_lock_init(&vdev->irqlock);
> > > > > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > > > > +	vdev->nointxmask = nointxmask;
> > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > +	vdev->disable_vga = disable_vga;
> > > > > > > +#endif
> > > > > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > > > > +
> > > > > > > +	pci_set_drvdata(pdev, vdev);
> > > > > > > +
> > > > > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > > > > +	if (ret) {
> > > > > > > +		pci_set_drvdata(pdev, NULL);
> > > > > > > +		kfree(vdev);
> > > > > > > +		return ret;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > > > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > > > > > +		vga_set_legacy_decoding(pdev,
> > > > > > > +					vfio_pci_set_vga_decode(vdev, false));
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	vfio_pci_probe_power_state(vdev);
> > > > > > > +
> > > > > > > +	if (!vdev->disable_idle_d3) {
> > > > > > > +		/*
> > > > > > > +		 * pci-core sets the device power state to an unknown value at
> > > > > > > +		 * bootup and after being removed from a driver.  The only
> > > > > > > +		 * transition it allows from this unknown state is to D0, which
> > > > > > > +		 * typically happens when a driver calls pci_enable_device().
> > > > > > > +		 * We're not ready to enable the device yet, but we do want to
> > > > > > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > > > > > +		 * before going to D3.
> > > > > > > +		 */
> > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > > > > +	}
> > > > > >
> > > > > > Ditto here and remove below, this seems like boilerplate that
> > > > > > shouldn't be duplicated per leaf module.  Thanks,
> > > > >
> > > > > Sure, the code snippet above may also be abstracted into a common API
> > > > > provided by vfio-pci-common.ko. :-)
> > > > >
> > > > > I'm a bit confused and would like to confirm with you. Do you also
> > > > > want the code snippet below to be placed in vfio-pci-common.ko and
> > > > > exposed as a wrapped API, so that it can be used by the sample driver
> > > > > and other future drivers which want to wrap a PCI device as an mdev?
> > > > > Maybe I misunderstood your comment. :-(
> > > >
> > > >
> > > > I think some sort of vfio_pci_common_{probe,remove}() would be a
> > > > reasonable starting point where the respective module _{probe,remove}
> > > > functions would call into these and add their module specific code
> > > > around it.  That would at least give us a point to cleanup things that
> > > > are only used by the common code in the common code.
> > >
> > > Sure, I can start from here if we are still going in this direction. :-)
> > >
> > > > I'm still struggling with how we make this user consumable, should we
> > > > accept this and progress beyond a proof of concept sample driver,
> > > > though.  For example, if a vendor actually implements an mdev wrapper
> > > > driver, or even just a device specific vfio-pci wrapper, to enable for
> > > > example migration support, how does a user know which driver to use
> > > > for each particular feature?  The best I can come up with so far is
> > > > something like what was done for vfio-platform reset modules.  For
> > > > instance, a module that extends features for a given device in vfio-pci
> > > > might register an ops structure and id table with vfio-pci, along with
> > > > creating a module alias (or aliases) for the devices it supports.  When
> > > > a device is probed by vfio-pci, it could try to match against
> > > > registered id tables to find a device specific ops structure.  If one
> > > > is not found, it could do a request_module using the PCI vendor and
> > > > device IDs and some unique vfio-pci string, check again, and use the
> > > > default ops if device specific ops are still not present.  That would
> > > > solve the problem on the vfio-pci side.
> > >
> > > Yeah, this lets vfio-pci invoke the ops from vendor drivers/modules.
> > > I think this is what Yan is trying to do.
> > 
> > I think I'm suggesting a callback ops structure a level above what Yan
> > previously proposed.  For example, could we have device specific
> > vfio_device_ops where the vendor module can call out to common code
> > rather than requiring common code to test for and optionally call out
> > to device specific code.
> > 
> > > > For mdevs, I tend to assume that this vfio-mdev-pci
> > > > meta driver is an anomaly only for the purpose of creating a generic
> > > > test device for IOMMU backed mdevs and that "real" mdev vendor
> > > > drivers will just be mdev enlightened host drivers, like i915 and
> > > > nvidia are now.  Thanks,
> > >
> > > Yes, this vfio-mdev-pci meta driver just creates a test device.
> > > Should we still go with the current direction, or find another way
> > > that may make adding this meta driver easier?
> > 
> > I think if the code split allows us to create an environment where
> > vendor drivers can re-use much of vfio-pci while creating a
> > vfio_device_ops that supports additional features for their device and
> > we bring that all together with a request module interface and module
> > aliases to make that work seamlessly, then it has value.  A concern I
> > have in only doing this split in order to create the vfio-mdev-pci
> > module is that it leaves open the question and groundwork for forking
> > vfio-pci into multiple vendor specific modules that would become a mess
> > for users to manage.
> > 
> > > Compared with the "real" mdev vendor drivers, it is like a
> > > "vfio-pci + dummy mdev ops" driver, where dummy mdev ops means
> > > no vendor specific handling: calls pass through to the vfio-pci
> > > code directly.
> > >
> > > I think this meta driver is even lighter than the "real" mdev vendor
> > > drivers, right? Is it possible to let this driver follow the way of
> > > registering an ops structure and id table with vfio-pci? The obstacle
> > > I can see is that the meta driver is a generic driver, which means it
> > > has no id table... The "real" mdev vendor drivers naturally have such
> > > info. If vfio-mdev-pci could also get the id info without binding
> > > to a device, it may be possible. Thoughts? :-)
> > 
> > IDs could be provided via a module option or potentially with
> > build-time options.  That might allow us to test all aspects of the
> > above proposal, ie. allowing sub-modules to provide vfio_device_ops for
> > specific devices, allowing those vendor vfio_device_ops to re-use much
> > of the existing vfio-pci code in that implementation, and a mechanism
> > for generically testing IOMMU backed mdevs.  That's starting to sound a
> > lot more worthwhile than moving a bunch of code around only to
> > implement a sample driver for the latter.  Thoughts?  Thanks,
> > 
> 
> Sounds like a good idea. If feasible, I suppose Yan's mediate_ops series
> can also be largely avoided. The vendor driver can directly register its
> own vfio_device_ops and selectively introduce proprietary logic
> (e.g. for tracking dirty pages) on top of the generic vfio_pci code.

Hi Alex,
As we previously discussed, I'm preparing to implement my v2 this
way:

1. On vfio-pci binding to a device, it will modprobe modules with the alias
"vfio-pci-(vendor_id)-(device_id)", as a way to notify vendor drivers to
register their vendor ops. (I renamed mediate_ops to vendor_ops in
v2.)
2. A module with the alias "vfio-pci-(vendor_id)-(device_id)" registers
its vendor ops with vfio-pci in its module_init.
If two modules have the same alias and both register vendor
ops at the same time, they are chained according to the prio in
their vendor ops.
3. vfio-pci would ask for region_infos from all vendor ops of a vdev in
vfio_pci_open, and init regions for vendor drivers. The current code in
vfio_pci_igd.c and vfio_pci_nvlink2.c would be
wrapped into separate modules, so the current vfio_pci_register_dev_region()
would be removed accordingly. vfio_pci_rw would now be directed to
vendor_ops->region[i].rw, and the higher-priority module's ops win.
For example, module vfio_pci_igd may register for regions of index 10,
11 and 12 for its opregion and two cfg regions. Still, a vendor driver can
provide a module named i915_migration to register for regions of index 0
and 13 for BAR0 and migration.

Thanks
Yan
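A minimal userspace sketch of the v2 flow proposed above, assuming hypothetical names throughout: the alias format, `struct vendor_ops`, and the helpers `vendor_alias()`, `alias_matches()`, `vendor_ops_register()` and `region_owner()` are illustrations, not existing kernel APIs. In the kernel, step 1 would go through request_module() and step 2 would run from the module's module_init(); this sketch only models the alias match and the priority chain in which the higher-priority module wins a contested region.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the proposed vendor_ops; not a kernel API. */
struct vendor_ops {
	const char *name;        /* module providing the ops */
	int prio;                /* higher value wins a contested region */
	unsigned long regions;   /* bitmask of region indexes it handles */
	struct vendor_ops *next; /* chain sorted by descending prio */
};

static struct vendor_ops *chain;

/* Step 1: the alias vfio-pci would request on bind (format assumed). */
static void vendor_alias(char *buf, size_t len,
			 unsigned short vendor, unsigned short device)
{
	int n = snprintf(buf, len, "vfio-pci-%04x-%04x",
			 (unsigned int)vendor, (unsigned int)device);

	assert(n > 0 && (size_t)n < len); /* reject truncation */
}

/* Step 2: does a module declaring @modalias answer this request? */
static int alias_matches(const char *modalias,
			 unsigned short vendor, unsigned short device)
{
	char want[32];

	vendor_alias(want, sizeof(want), vendor, device);
	return strcmp(modalias, want) == 0;
}

/* Step 2: called from the module's init; keeps the chain sorted so
 * that among equal priorities the first registered stays first. */
static void vendor_ops_register(struct vendor_ops *ops)
{
	struct vendor_ops **pos = &chain;

	while (*pos && (*pos)->prio >= ops->prio)
		pos = &(*pos)->next;
	ops->next = *pos;
	*pos = ops;
}

/* Step 3: route region @index to the highest-priority claimant;
 * NULL means vfio-pci's default handling applies. */
static struct vendor_ops *region_owner(unsigned int index)
{
	struct vendor_ops *ops;

	if (index >= 8 * sizeof(unsigned long))
		return NULL;
	for (ops = chain; ops; ops = ops->next)
		if (ops->regions & (1UL << index))
			return ops;
	return NULL;
}
```

For instance, registering vfio_pci_igd (prio 1, regions 10 to 12) and i915_migration (prio 2, regions 0 and 13) makes region_owner(13) return the i915_migration entry even if a lower-priority module also claims index 13. The device IDs one would pass to vendor_alias() are arbitrary examples.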



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-21  8:43               ` Yan Zhao
@ 2020-01-21 20:04                 ` Alex Williamson
  2020-01-21 21:54                   ` Yan Zhao
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-21 20:04 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, Liu, Yi L, kwankhede, linux-kernel, kvm, joro,
	peterx, baolu.lu, Masahiro Yamada

On Tue, 21 Jan 2020 03:43:51 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, Jan 21, 2020 at 03:43:02PM +0800, Tian, Kevin wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, January 21, 2020 5:08 AM
> > > 
> > > On Sat, 18 Jan 2020 14:25:11 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >   
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, January 17, 2020 5:24 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > >
> > > > > On Thu, 16 Jan 2020 12:33:06 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >  
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Friday, January 10, 2020 6:49 AM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > > >
> > > > > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > > >  
> > > > > > > > This patch adds a sample driver named vfio-mdev-pci. It wraps
> > > > > > > > a PCI device as a mediated device. Once a PCI device is bound
> > > > > > > > to the vfio-mdev-pci driver, user space access to the device
> > > > > > > > goes through the vfio mdev framework. Usage of the device follows
> > > > > > > > the mdev management method, e.g. the user should create an mdev
> > > > > > > > before exposing the device to user space.
> > > > > > > >
> > > > > > > > The benefit of this new driver is that it acts as a sample driver
> > > > > > > > for the recent changes from the "vfio/mdev: IOMMU aware mediated device"
> > > > > > > > patchset. It could also be a good experimental driver for future
> > > > > > > > device-specific mdev migration support. This sample driver only
> > > > > > > > supports singleton iommu groups; it fails when trying to assign
> > > > > > > > a non-singleton iommu group to VMs.
> > > > > > > >
> > > > > > > > To use this driver:
> > > > > > > > a) build and load the vfio-mdev-pci.ko module
> > > > > > > >    run "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
> > > > > > > >    then load it with the following commands:
> > > > > > > >    > sudo modprobe vfio
> > > > > > > >    > sudo modprobe vfio-pci
> > > > > > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko  
> > > > > > > >
> > > > > > > > b) unbind the original device driver
> > > > > > > >    e.g. use the following command to unbind it:
> > > > > > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> > > > > > > >
> > > > > > > > c) bind the vfio-mdev-pci driver to the physical device
> > > > > > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> > > > > > > >
> > > > > > > > d) check the supported mdev instances  
> > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
> > > > > > > >      vfio-mdev-pci-type_name  
> > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
> > > > > > > >      vfio-mdev-pci-type_name/
> > > > > > > >      available_instances  create  device_api  devices  name
> > > > > > > >
> > > > > > > > e) create an mdev on this physical device (only 1 instance)
> > > > > > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> > > > > > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > > > >      vfio-mdev-pci-type_name/create
> > > > > > > >
> > > > > > > > f) pass the mdev through to the guest
> > > > > > > >    add the following to the QEMU command line
> > > > > > > >     -device vfio-pci,\
> > > > > > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > > > > > >
> > > > > > > > g) destroy the mdev
> > > > > > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> > > > > > > >      remove
> > > > > > > >
> > > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > > > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > > > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > ---
> > > > > > > >  samples/Kconfig                       |  10 +
> > > > > > > >  samples/Makefile                      |   1 +
> > > > > > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > > > > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > > > > > >  4 files changed, 412 insertions(+)
> > > > > > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > > > > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > >
> > > > > > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > > > > > index 9d236c3..50d207c 100644
> > > > > > > > --- a/samples/Kconfig
> > > > > > > > +++ b/samples/Kconfig
> > > > > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > > > > >  	help
> > > > > > > >  	  Build a sample program to work with mei device.
> > > > > > > >
> > > > > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > > > > +	select VFIO_PCI_COMMON
> > > > > > > > +	select VFIO_PCI
> > > > > > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > > > > > +	help
> > > > > > > > +	  Sample driver for wrapping a PCI device as an mdev. Once bound to
> > > > > > > > +	  this driver, device passthrough goes through the mdev path.
> > > > > > > > +
> > > > > > > > +	  If you don't know what to do here, say N.
> > > > > > > >
> > > > > > > >  endif # SAMPLES
> > > > > > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > > > > > index 5ce50ef..84faced 100644
> > > > > > > > --- a/samples/Makefile
> > > > > > > > +++ b/samples/Makefile
> > > > > > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > > > > > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > > > > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > > > > > >  obj-y					+= vfio-mdev/
> > > > > > > > +obj-y					+= vfio-mdev-pci/  
> > > > > > >
> > > > > > > I think we could just lump this into vfio-mdev rather than making
> > > > > > > another directory.  
> > > > > >
> > > > > > sure. will move it. :-)
> > > > > >  
> > > > > > >  
> > > > > > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > > > > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > > > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > > > > > new file mode 100644
> > > > > > > > index 0000000..41b2139
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > > > > > @@ -0,0 +1,4 @@
> > > > > > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > > > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > > > > > +
> > > > > > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > > > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > new file mode 100644
> > > > > > > > index 0000000..b180356
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > @@ -0,0 +1,397 @@
> > > > > > > > +/*
> > > > > > > > + * Copyright © 2020 Intel Corporation.
> > > > > > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > + *
> > > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > > > + * published by the Free Software Foundation.
> > > > > > > > + *
> > > > > > > > + * Derived from original vfio_pci.c:
> > > > > > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > > > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > + *
> > > > > > > > + * Derived from original vfio:
> > > > > > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > > > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > > > > +
> > > > > > > > +#include <linux/device.h>
> > > > > > > > +#include <linux/eventfd.h>
> > > > > > > > +#include <linux/file.h>
> > > > > > > > +#include <linux/interrupt.h>
> > > > > > > > +#include <linux/iommu.h>
> > > > > > > > +#include <linux/module.h>
> > > > > > > > +#include <linux/mutex.h>
> > > > > > > > +#include <linux/notifier.h>
> > > > > > > > +#include <linux/pci.h>
> > > > > > > > +#include <linux/pm_runtime.h>
> > > > > > > > +#include <linux/slab.h>
> > > > > > > > +#include <linux/types.h>
> > > > > > > > +#include <linux/uaccess.h>
> > > > > > > > +#include <linux/vfio.h>
> > > > > > > > +#include <linux/vgaarb.h>
> > > > > > > > +#include <linux/nospec.h>
> > > > > > > > +#include <linux/mdev.h>
> > > > > > > > +#include <linux/vfio_pci_common.h>
> > > > > > > > +
> > > > > > > > +#define DRIVER_VERSION  "0.1"
> > > > > > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > > > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > > > > > +
> > > > > > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > > > > > +
> > > > > > > > +static char ids[1024] __initdata;
> > > > > > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > > > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > > > > > +
> > > > > > > > +static bool nointxmask;
> > > > > > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > > > > > +MODULE_PARM_DESC(nointxmask,
> > > > > > > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > > > > > > +
> > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > +static bool disable_vga;
> > > > > > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > > > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > > > > > +#endif
> > > > > > > > +
> > > > > > > > +static bool disable_idle_d3;
> > > > > > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > > > > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > > > > > > +
> > > > > > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > > > > > +
> > > > > > > > +static ssize_t
> > > > > > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > +{
> > > > > > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +MDEV_TYPE_ATTR_RO(name);
> > > > > > > > +
> > > > > > > > +static ssize_t
> > > > > > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > +{
> > > > > > > > +	return sprintf(buf, "%d\n", 1);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > > > > > +
> > > > > > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > > > > > +		char *buf)
> > > > > > > > +{
> > > > > > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > > > > > +
> > > > > > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > > > > > +	&mdev_type_attr_name.attr,
> > > > > > > > +	&mdev_type_attr_device_api.attr,
> > > > > > > > +	&mdev_type_attr_available_instances.attr,
> > > > > > > > +	NULL,
> > > > > > > > +};
> > > > > > > > +
> > > > > > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > > > > > +	.name  = "type1",
> > > > > > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > > > > > +};
> > > > > > > > +
> > > > > > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > > > > > +	&vfio_mdev_pci_type_group1,
> > > > > > > > +	NULL,
> > > > > > > > +};
> > > > > > > > +
> > > > > > > > +struct vfio_mdev_pci {
> > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > +	struct mdev_device *mdev;
> > > > > > > > +	unsigned long handle;
> > > > > > > > +};
> > > > > > > > +
> > > > > > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > > > > > +{
> > > > > > > > +	struct device *pdev;
> > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > +	struct vfio_mdev_pci *pmdev;
> > > > > > > > +	int ret;
> > > > > > > > +
> > > > > > > > +	pdev = mdev_parent_dev(mdev);
> > > > > > > > +	vdev = dev_get_drvdata(pdev);
> > > > > > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > > > > > +	if (pmdev == NULL) {
> > > > > > > > +		ret = -ENOMEM;
> > > > > > > > +		goto out;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	pmdev->mdev = mdev;
> > > > > > > > +	pmdev->vdev = vdev;
> > > > > > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > > > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > > > > > +	if (ret) {
> > > > > > > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > > > > > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > > > > > > +		mdev_set_drvdata(mdev, NULL);
> > > > > > > > +		kfree(pmdev);
> > > > > > > > +		goto out;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > +out:
> > > > > > > > +	return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > > > > > +{
> > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > +
> > > > > > > > +	kfree(pmdev);
> > > > > > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > +
> > > > > > > > +	return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > > > > > +{
> > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > +	int ret = 0;
> > > > > > > > +
> > > > > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > > > > +		return -ENODEV;
> > > > > > > > +
> > > > > > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > > > > > +
> > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > +
> > > > > > > > +	if (!vdev->refcnt) {
> > > > > > > > +		ret = vfio_pci_enable(vdev);
> > > > > > > > +		if (ret)
> > > > > > > > +			goto error;
> > > > > > > > +
> > > > > > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > > > > +	}
> > > > > > > > +	vdev->refcnt++;
> > > > > > > > +error:
> > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > +	if (!ret)
> > > > > > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > +	else {
> > > > > > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > +		module_put(THIS_MODULE);
> > > > > > > > +	}
> > > > > > > > +	return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > > > > > +{
> > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > +
> > > > > > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > +
> > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > +
> > > > > > > > +	if (!(--vdev->refcnt)) {
> > > > > > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > > > > +		vfio_pci_disable(vdev);
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > +
> > > > > > > > +	module_put(THIS_MODULE);
> > > > > > > > +}  
> > > > > > >
> > > > > > > open() and release() here are almost identical between vfio_pci and
> > > > > > > vfio_mdev_pci, which suggests maybe there should be common functions to
> > > > > > > call into like we do for the below.
> > > > > >
> > > > > > Yes, let me study this more and do a better abstraction in the next version. :-)
> > > > > >  
> > > > > > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > > > > > +			     unsigned long arg)
> > > > > > > > +{
> > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > +
> > > > > > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > > > > > +				struct vm_area_struct *vma)
> > > > > > > > +{
> > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > +
> > > > > > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > > > > > +			size_t count, loff_t *ppos)
> > > > > > > > +{
> > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > +
> > > > > > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > > > > > +				const char __user *buf,
> > > > > > > > +				size_t count, loff_t *ppos)
> > > > > > > > +{
> > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > +
> > > > > > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > > > > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > > > > > > +	.create			= vfio_mdev_pci_create,
> > > > > > > > +	.remove			= vfio_mdev_pci_remove,
> > > > > > > > +
> > > > > > > > +	.open			= vfio_mdev_pci_open,
> > > > > > > > +	.release		= vfio_mdev_pci_release,
> > > > > > > > +
> > > > > > > > +	.read			= vfio_mdev_pci_read,
> > > > > > > > +	.write			= vfio_mdev_pci_write,
> > > > > > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > > > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > > > > > +};
> > > > > > > > +
> > > > > > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > > > > > +				       const struct pci_device_id *id)
> > > > > > > > +{
> > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > +	int ret;
> > > > > > > > +
> > > > > > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > > > > > +		return -EINVAL;
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > > > > > > +	 * userspace instance with VFs and PFs from the same device, which
> > > > > > > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > > > > > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > > > > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > > > > > +	 * reject these PFs and let the user sort it out.
> > > > > > > > +	 */
> > > > > > > > +	if (pci_num_vf(pdev)) {
> > > > > > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > > > > > > +		return -EBUSY;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > > > > > +	if (!vdev)
> > > > > > > > +		return -ENOMEM;
> > > > > > > > +
> > > > > > > > +	vdev->pdev = pdev;
> > > > > > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > > > > > +	mutex_init(&vdev->igate);
> > > > > > > > +	spin_lock_init(&vdev->irqlock);
> > > > > > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > > > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > > > > > +	vdev->nointxmask = nointxmask;
> > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > +	vdev->disable_vga = disable_vga;
> > > > > > > > +#endif
> > > > > > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > > > > > +
> > > > > > > > +	pci_set_drvdata(pdev, vdev);
> > > > > > > > +
> > > > > > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > > > > > +	if (ret) {
> > > > > > > > +		pci_set_drvdata(pdev, NULL);
> > > > > > > > +		kfree(vdev);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > > > > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > > > > > > +		vga_set_legacy_decoding(pdev,
> > > > > > > > +					vfio_pci_set_vga_decode(vdev, false));
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	vfio_pci_probe_power_state(vdev);
> > > > > > > > +
> > > > > > > > +	if (!vdev->disable_idle_d3) {
> > > > > > > > +		/*
> > > > > > > > +		 * pci-core sets the device power state to an unknown value at
> > > > > > > > +		 * bootup and after being removed from a driver.  The only
> > > > > > > > +		 * transition it allows from this unknown state is to D0, which
> > > > > > > > +		 * typically happens when a driver calls pci_enable_device().
> > > > > > > > +		 * We're not ready to enable the device yet, but we do want to
> > > > > > > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > > > > > > +		 * before going to D3.
> > > > > > > > +		 */
> > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > > > > > +	}  
> > > > > > >
> > > > > > > Ditto here and remove below, this seems like boilerplate that shouldn't
> > > > > > > be duplicated per leaf module.  Thanks,
> > > > > >
> > > > > > Sure, the code snippet above may also be abstracted into a common API
> > > > > > provided by vfio-pci-common.ko. :-)
> > > > > >
> > > > > > I have one point of confusion to confirm with you. Do you also want the
> > > > > > below code snippet to be placed in vfio-pci-common.ko and exposed
> > > > > > as a wrapped API, so that it can be used by the sample driver and other
> > > > > > future drivers that want to wrap a PCI device as an mdev? Maybe I
> > > > > > misunderstood your comment. :-(
> > > > >
> > > > >
> > > > > I think some sort of vfio_pci_common_{probe,remove}() would be a
> > > > > reasonable starting point where the respective module _{probe,remove}
> > > > > functions would call into these and add their module specific code
> > > > > around it.  That would at least give us a point to cleanup things that
> > > > > are only used by the common code in the common code.  
> > > >
> > > > sure, I can start from here if we are still going with this direction. :-)
> > > >  
> > > > > I'm still struggling how we make this user consumable should we accept
> > > > > this and progress beyond a proof of concept sample driver though.  For
> > > > > example, if a vendor actually implements an mdev wrapper driver or even
> > > > > just a device specific vfio-pci wrapper, to enable for example
> > > > > migration support, how does a user know which driver to use for each
> > > > > particular feature?  The best I can come up with so far is something
> > > > > like was done for vfio-platform reset modules.  For instance a module
> > > > > that extends features for a given device in vfio-pci might register an
> > > > > ops structure and id table with vfio-pci, along with creating a module
> > > > > alias (or aliases) for the devices it supports.  When a device is
> > > > > probed by vfio-pci it could try to match against registered id tables
> > > > > to find a device specific ops structure, if one is not found it could
> > > > > do a request_module using the PCI vendor and device IDs and some unique
> > > > > vfio-pci string, check again, and use the default ops if device
> > > > > specific ops are still not present.  That would solve the problem on
> > > > > the vfio-pci side.  
> > > >
> > > > Yeah, this lets vfio-pci invoke the ops from vendor drivers/modules.
> > > > I think this is what Yan is trying to do.  
> > > 
> > > I think I'm suggesting a callback ops structure a level above what Yan
> > > previously proposed.  For example, could we have device specific
> > > vfio_device_ops where the vendor module can call out to common code
> > > rather than requiring common code to test for and optionally call out
> > > to device specific code.
> > >   
> > > > > For mdevs, I tend to assume that this vfio-mdev-pci
> > > > > meta driver is an anomaly only for the purpose of creating a generic
> > > > > test device for IOMMU backed mdevs and that "real" mdev vendor
> > > > > drivers will just be mdev enlightened host drivers, like i915 and
> > > > > nvidia are now.  Thanks,  
> > > >
> > > > yes, this vfio-mdev-pci meta driver is just creating a test device.
> > > > Do we still go with the current direction, or find any other way
> > > > which may be easier for adding this meta driver?  
> > > 
> > > I think if the code split allows us to create an environment where
> > > vendor drivers can re-use much of vfio-pci while creating a
> > > vfio_device_ops that supports additional features for their device and
> > > we bring that all together with a request module interface and module
> > > aliases to make that work seamlessly, then it has value.  A concern I
> > > have in only doing this split in order to create the vfio-mdev-pci
> > > module is that it leaves open the question and groundwork for forking
> > > vfio-pci into multiple vendor-specific modules that would become a mess
> > > for users to manage.
> > >   
> > > > Compared with the "real" mdev vendor drivers, it is like a
> > > > "vfio-pci + dummy mdev ops" driver. "Dummy mdev ops" means no
> > > > vendor-specific handling; everything passes straight through to vfio-pci code.
> > > >
> > > > I think this meta driver is even lighter than the "real" mdev vendor
> > > > drivers, right? Is it possible to let this driver follow the approach of
> > > > registering an ops structure and id table with vfio-pci? The obstacle
> > > > I can see is that the meta driver is a generic driver, which means it has
> > > > no id table... The "real" mdev vendor drivers naturally have
> > > > such info. If vfio-mdev-pci can also get the id info without binding
> > > > to a device, it may be possible. Thoughts? :-)
> > > 
> > > IDs could be provided via a module option or potentially with
> > > build-time options.  That might allow us to test all aspects of the
> > > above proposal, i.e. allowing sub-modules to provide vfio_device_ops for
> > > specific devices, allowing those vendor vfio_device_ops to re-use much
> > > of the existing vfio-pci code in that implementation, and a mechanism
> > > for generically testing IOMMU backed mdevs.  That's starting to sound a
> > > lot more worthwhile than moving a bunch of code around only to
> > > implement a sample driver for the latter.  Thoughts?  Thanks,
> > >   
> > 
> > Sounds like a good idea. If feasible, I suppose Yan's mediate_ops series
> > can also be largely avoided. The vendor driver can directly register its
> > own vfio_device_ops and selectively introduce proprietary logic
> > (e.g. for tracking dirty pages) on top of the generic vfio_pci code.
> 
> hi Alex
> as our previously discussed, I'm preparing to implement my v2 as this
> way:
> 
> 1. on vfio-pci binding to a device, it will modprobe modules of alias
> "vfio-pci-(vendorid)-(deviceid)", as a way to notify vendor drivers of
> registering their vendor ops. (I renamed mediate_ops to vendor_ops in
> v2)
> 2. in a module aliasing to "vfio-pci-(vendor_id)-(devivce_id)", in its
> module_init, it will register a vendor ops to vfio-pci.
> If there are two modules of the same alias and both registering vendor
> ops at the same time, they are chained according to the prio in
> its vendor ops.
> 3. vfio-pci would ask for region_infos for all vendor ops of a vdev in
> vfio_pci_open, and init regions for vendor drivers. Current code in
> vfio_pci_igd.c, vfio_pci_nvlink2.c, vfio_pci_nvlink2.c would all be
> wrapped into separate modules, so the current vfio_pci_register_dev_region()
> would be removed accordingly. vfio_pci_rw would now be directed to
> vendor_ops->region[i].rw; the higher priority module's ops win.
> For example, module vfio_pci_igd may register to regions of index 10,
> 11, 12 for its opregion and two cfg regions. Still, a vendor driver can
> provide a module named i915_migration to register for regions of index 0
> and 13 for BAR0 and migration.
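The priority chaining in point 3 can be modelled in a few lines. This is a hypothetical userspace illustration, not vfio-pci code: `struct vendor_ops`, `region_rw_fn`, `pick_rw()` and the two handler stubs are invented names, and `NUM_REGIONS` is an arbitrary bound. For each region index it returns the rw handler registered by the highest-priority module, or NULL when no module claims that region.

```c
/*
 * Hypothetical model of priority-chained vendor ops: for each region
 * index, the handler from the highest-priority module is used.
 * None of these names exist in vfio-pci.
 */
#include <assert.h>
#include <stddef.h>

#define NUM_REGIONS 16	/* arbitrary bound for the sketch */

typedef int (*region_rw_fn)(unsigned int index);

struct vendor_ops {
	const char *name;
	int prio;			/* larger value wins */
	region_rw_fn rw[NUM_REGIONS];	/* NULL: region not handled */
};

/* Placeholder handlers standing in for two vendor modules. */
static int igd_rw(unsigned int index) { return 100 + (int)index; }
static int migration_rw(unsigned int index) { return 200 + (int)index; }

/* Pick the rw handler for one region; strict '>' keeps the first on ties. */
static region_rw_fn pick_rw(struct vendor_ops *const *chain, size_t n,
			    unsigned int index)
{
	region_rw_fn best = NULL;
	int best_prio = 0;

	assert(index < NUM_REGIONS);
	for (size_t i = 0; i < n; i++) {
		region_rw_fn fn = chain[i]->rw[index];

		if (fn && (!best || chain[i]->prio > best_prio)) {
			best = fn;
			best_prio = chain[i]->prio;
		}
	}
	return best;
}
```

With the module names from the example above, giving i915_migration the higher prio and letting it also claim region 10 makes it win over vfio_pci_igd for that region, while regions 11 and 12 stay with vfio_pci_igd. Because the comparison is strict, ties go to whichever module appears first in the chain.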

My major complaint with the previous version was that sprinkling random
vendor ops call-outs everywhere in vfio-pci is ugly and hard to
maintain.  The idea I'm proposing here is that sub-modules (loaded via
alias) would provide the entire vfio_device_ops for a device.  Yi's
series here would split out common code to make it trivial for vendor
modules to implement those device ops using pieces of vfio-pci if they
wish to do so.  Having multiple modules implement features of a device
based on their loading priority sounds powerful, but also difficult to
maintain and debug.  Do we need that functionality if a vendor
vfio_device_ops can implement it themselves in a handful of lines of
code?  Thanks,

Alex
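A minimal userspace sketch of the lookup flow discussed in this thread: vendor sub-modules register an id table together with a complete ops structure, vfio-pci matches a newly bound device against the registered tables, would then try to load a module by alias, and otherwise falls back to its default ops. Everything here is a hypothetical model: `struct dev_ops` stands in for vfio_device_ops, `register_vendor_ops()`, `lookup_ops()` and the `vfio-pci-%04x-%04x` alias format are assumptions, and the point where the kernel would call request_module() is only marked by a comment.

```c
/*
 * Hypothetical userspace model of per-device ops lookup in vfio-pci.
 * None of these names exist in the kernel; they only illustrate the flow.
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct pci_ids {
	unsigned short vendor;
	unsigned short device;
};

/* Stand-in for a complete vfio_device_ops supplied by one module. */
struct dev_ops {
	const char *name;
	int (*open)(void);
};

struct vendor_entry {
	const struct pci_ids *ids;	/* table terminated by vendor == 0 */
	const struct dev_ops *ops;
};

#define MAX_VENDORS 8
static struct vendor_entry registry[MAX_VENDORS];
static size_t nr_registered;

static int default_open(void) { return 0; }
static const struct dev_ops default_ops = { "vfio-pci-default", default_open };

/* What a vendor sub-module would call from its module_init(). */
static int register_vendor_ops(const struct pci_ids *ids,
			       const struct dev_ops *ops)
{
	assert(ids && ops);
	if (nr_registered == MAX_VENDORS)
		return -1;
	registry[nr_registered].ids = ids;
	registry[nr_registered].ops = ops;
	nr_registered++;
	return 0;
}

/* Return the ops of the first registered table that lists this device. */
static const struct dev_ops *match_ops(unsigned short vendor,
				       unsigned short device)
{
	for (size_t i = 0; i < nr_registered; i++)
		for (const struct pci_ids *id = registry[i].ids; id->vendor; id++)
			if (id->vendor == vendor && id->device == device)
				return registry[i].ops;
	return NULL;
}

/* Assumed alias format; returns the length written, as snprintf() does. */
static int format_alias(char *buf, size_t len,
			unsigned short vendor, unsigned short device)
{
	return snprintf(buf, len, "vfio-pci-%04x-%04x",
			(unsigned int)vendor, (unsigned int)device);
}

static const struct dev_ops *lookup_ops(unsigned short vendor,
					unsigned short device)
{
	const struct dev_ops *ops = match_ops(vendor, device);
	char alias[32];

	if (ops)
		return ops;
	format_alias(alias, sizeof(alias), vendor, device);
	/* Kernel: request_module("%s", alias) here, then match again. */
	ops = match_ops(vendor, device);
	return ops ? ops : &default_ops;
}
```

In this shape vfio-pci's own side stays a table lookup, and any device-specific behaviour lives entirely in the ops a vendor module supplies, which can in turn call into shared vfio-pci helpers.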



* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-21 20:04                 ` Alex Williamson
@ 2020-01-21 21:54                   ` Yan Zhao
  2020-01-23 23:33                     ` Alex Williamson
  0 siblings, 1 reply; 44+ messages in thread
From: Yan Zhao @ 2020-01-21 21:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, kwankhede, linux-kernel, kvm, joro,
	peterx, baolu.lu, Masahiro Yamada

On Wed, Jan 22, 2020 at 04:04:38AM +0800, Alex Williamson wrote:
> On Tue, 21 Jan 2020 03:43:51 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Jan 21, 2020 at 03:43:02PM +0800, Tian, Kevin wrote:
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Tuesday, January 21, 2020 5:08 AM
> > > > 
> > > > On Sat, 18 Jan 2020 14:25:11 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >   
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, January 17, 2020 5:24 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > >
> > > > > > On Thu, 16 Jan 2020 12:33:06 +0000
> > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Friday, January 10, 2020 6:49 AM
> > > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > > > >
> > > > > > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > > > >  
> > > > > > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > > > > > go through vfio mdev framework. The usage of the device follows
> > > > > > > > > mdev management method. e.g. user should create a mdev before
> > > > > > > > > exposing the device to user-space.
> > > > > > > > >
> > > > > > > > > Benefit of this new driver would be acting as a sample driver
> > > > > > > > > for recent changes from "vfio/mdev: IOMMU aware mediated device"
> > > > > > > > > patchset. Also it could be a good experiment driver for future
> > > > > > > > > device specific mdev migration support. This sample driver only
> > > > > > > > > supports singleton iommu groups, for non-singleton iommu groups,
> > > > > > > > > this sample driver doesn't work. It will fail when trying to assign
> > > > > > > > > the non-singleton iommu group to VMs.
> > > > > > > > >
> > > > > > > > > To use this driver:
> > > > > > > > > a) build and load vfio-mdev-pci.ko module
> > > > > > > > >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> > > > > > > > >    then load it with following command:  
> > > > > > > > >    > sudo modprobe vfio
> > > > > > > > >    > sudo modprobe vfio-pci
> > > > > > > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko  
> > > > > > > > >
> > > > > > > > > b) unbind original device driver
> > > > > > > > >    e.g. use following command to unbind its original driver  
> > > > > > > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> > > > > > > > >
> > > > > > > > > c) bind vfio-mdev-pci driver to the physical device  
> > > > > > > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> > > > > > > > >
> > > > > > > > > d) check the supported mdev instances  
> > > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
> > > > > > > > >      vfio-mdev-pci-type_name  
> > > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
> > > > > > > > >      vfio-mdev-pci-type_name/
> > > > > > > > >      available_instances  create  device_api  devices  name
> > > > > > > > >
> > > > > > > > > e)  create mdev on this physical device (only 1 instance)  
> > > > > > > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
> > > > > > > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > > > > >      vfio-mdev-pci-type_name/create
> > > > > > > > >
> > > > > > > > > f) passthru the mdev to guest
> > > > > > > > >    add the following line in QEMU boot command
> > > > > > > > >     -device vfio-pci,\
> > > > > > > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > > > > > > >
> > > > > > > > > g) destroy mdev  
> > > > > > > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> > > > > > > > >      remove
> > > > > > > > >
> > > > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > > > > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > > > > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > > ---
> > > > > > > > >  samples/Kconfig                       |  10 +
> > > > > > > > >  samples/Makefile                      |   1 +
> > > > > > > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > > > > > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > > > > > > >  4 files changed, 412 insertions(+)
> > > > > > > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > > > > > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > >
> > > > > > > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > > > > > > index 9d236c3..50d207c 100644
> > > > > > > > > --- a/samples/Kconfig
> > > > > > > > > +++ b/samples/Kconfig
> > > > > > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > > > > > >  	help
> > > > > > > > >  	  Build a sample program to work with mei device.
> > > > > > > > >
> > > > > > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > > > > > +	select VFIO_PCI_COMMON
> > > > > > > > > +	select VFIO_PCI
> > > > > > > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > > > > > > +	help
> > > > > > > > > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > > > > > > > > +	  this driver, device passthru should through mdev path.
> > > > > > > > > +
> > > > > > > > > +	  If you don't know what to do here, say N.
> > > > > > > > >
> > > > > > > > >  endif # SAMPLES
> > > > > > > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > > > > > > index 5ce50ef..84faced 100644
> > > > > > > > > --- a/samples/Makefile
> > > > > > > > > +++ b/samples/Makefile
> > > > > > > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > > > > > > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > > > > > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > > > > > > >  obj-y					+= vfio-mdev/
> > > > > > > > > +obj-y					+= vfio-mdev-pci/  
> > > > > > > >
> > > > > > > > I think we could just lump this into vfio-mdev rather than making
> > > > > > > > another directory.  
> > > > > > >
> > > > > > > sure. will move it. :-)
> > > > > > >  
> > > > > > > >  
> > > > > > > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > > > > > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > > > > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000..41b2139
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > > > > > > @@ -0,0 +1,4 @@
> > > > > > > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > > > > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > > > > > > +
> > > > > > > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > > > > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000..b180356
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > @@ -0,0 +1,397 @@
> > > > > > > > > +/*
> > > > > > > > > + * Copyright © 2020 Intel Corporation.
> > > > > > > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > > + *
> > > > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > > > > + * published by the Free Software Foundation.
> > > > > > > > > + *
> > > > > > > > > + * Derived from original vfio_pci.c:
> > > > > > > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > > > > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > + *
> > > > > > > > > + * Derived from original vfio:
> > > > > > > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > > > > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > > > > > > + */
> > > > > > > > > +
> > > > > > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > > > > > +
> > > > > > > > > +#include <linux/device.h>
> > > > > > > > > +#include <linux/eventfd.h>
> > > > > > > > > +#include <linux/file.h>
> > > > > > > > > +#include <linux/interrupt.h>
> > > > > > > > > +#include <linux/iommu.h>
> > > > > > > > > +#include <linux/module.h>
> > > > > > > > > +#include <linux/mutex.h>
> > > > > > > > > +#include <linux/notifier.h>
> > > > > > > > > +#include <linux/pci.h>
> > > > > > > > > +#include <linux/pm_runtime.h>
> > > > > > > > > +#include <linux/slab.h>
> > > > > > > > > +#include <linux/types.h>
> > > > > > > > > +#include <linux/uaccess.h>
> > > > > > > > > +#include <linux/vfio.h>
> > > > > > > > > +#include <linux/vgaarb.h>
> > > > > > > > > +#include <linux/nospec.h>
> > > > > > > > > +#include <linux/mdev.h>
> > > > > > > > > +#include <linux/vfio_pci_common.h>
> > > > > > > > > +
> > > > > > > > > +#define DRIVER_VERSION  "0.1"
> > > > > > > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > > > > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > > > > > > +
> > > > > > > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > > > > > > +
> > > > > > > > > +static char ids[1024] __initdata;
> > > > > > > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > > > > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > > > > > > +
> > > > > > > > > +static bool nointxmask;
> > > > > > > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > > > > > > +MODULE_PARM_DESC(nointxmask,
> > > > > > > > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > > > > > > > +
> > > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > > +static bool disable_vga;
> > > > > > > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > > > > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > > > > > > +#endif
> > > > > > > > > +
> > > > > > > > > +static bool disable_idle_d3;
> > > > > > > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > > > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > > > > > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > > > > > > > +
> > > > > > > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > > > > > > +
> > > > > > > > > +static ssize_t
> > > > > > > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > > +{
> > > > > > > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +MDEV_TYPE_ATTR_RO(name);
> > > > > > > > > +
> > > > > > > > > +static ssize_t
> > > > > > > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > > +{
> > > > > > > > > +	return sprintf(buf, "%d\n", 1);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > > > > > > +
> > > > > > > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > > > > > > +		char *buf)
> > > > > > > > > +{
> > > > > > > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > > > > > > +
> > > > > > > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > > > > > > +	&mdev_type_attr_name.attr,
> > > > > > > > > +	&mdev_type_attr_device_api.attr,
> > > > > > > > > +	&mdev_type_attr_available_instances.attr,
> > > > > > > > > +	NULL,
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > > > > > > +	.name  = "type1",
> > > > > > > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > > > > > > +	&vfio_mdev_pci_type_group1,
> > > > > > > > > +	NULL,
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > > +struct vfio_mdev_pci {
> > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > +	struct mdev_device *mdev;
> > > > > > > > > +	unsigned long handle;
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > > > > > > +{
> > > > > > > > > +	struct device *pdev;
> > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > +	struct vfio_mdev_pci *pmdev;
> > > > > > > > > +	int ret;
> > > > > > > > > +
> > > > > > > > > +	pdev = mdev_parent_dev(mdev);
> > > > > > > > > +	vdev = dev_get_drvdata(pdev);
> > > > > > > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > > > > > > +	if (pmdev == NULL) {
> > > > > > > > > +		ret = -EBUSY;
> > > > > > > > > +		goto out;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	pmdev->mdev = mdev;
> > > > > > > > > +	pmdev->vdev = vdev;
> > > > > > > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > > > > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > > > > > > +	if (ret) {
> > > > > > > > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > > > > > > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > > > > > > > +		goto out;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > > +out:
> > > > > > > > > +	return ret;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > +
> > > > > > > > > +	kfree(pmdev);
> > > > > > > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > > +
> > > > > > > > > +	return 0;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > > +	int ret = 0;
> > > > > > > > > +
> > > > > > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > > > > > +		return -ENODEV;
> > > > > > > > > +
> > > > > > > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > > > > > > +
> > > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > > +
> > > > > > > > > +	if (!vdev->refcnt) {
> > > > > > > > > +		ret = vfio_pci_enable(vdev);
> > > > > > > > > +		if (ret)
> > > > > > > > > +			goto error;
> > > > > > > > > +
> > > > > > > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > > > > > +	}
> > > > > > > > > +	vdev->refcnt++;
> > > > > > > > > +error:
> > > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > > +	if (!ret)
> > > > > > > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > > +	else {
> > > > > > > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > > +		module_put(THIS_MODULE);
> > > > > > > > > +	}
> > > > > > > > > +	return ret;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > > +
> > > > > > > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > > +
> > > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > > +
> > > > > > > > > +	if (!(--vdev->refcnt)) {
> > > > > > > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > > > > > +		vfio_pci_disable(vdev);
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > > +
> > > > > > > > > +	module_put(THIS_MODULE);
> > > > > > > > > +}  
> > > > > > > >
> > > > > > > > open() and release() here are almost identical between vfio_pci and
> > > > > > > > vfio_mdev_pci, which suggests maybe there should be common functions to
> > > > > > > > call into like we do for the below.
> > > > > > >
> > > > > > > yes, let me have more study and do better abstract in next version. :-)
> > > > > > >  
> > > > > > > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > > > > > > +			     unsigned long arg)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > +
> > > > > > > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > > > > > > +				struct vm_area_struct *vma)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > +
> > > > > > > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > > > > > > +			size_t count, loff_t *ppos)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > +
> > > > > > > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > > > > > > +				const char __user *buf,
> > > > > > > > > +				size_t count, loff_t *ppos)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > +
> > > > > > > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > > > > > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > > > > > > > +	.create			= vfio_mdev_pci_create,
> > > > > > > > > +	.remove			= vfio_mdev_pci_remove,
> > > > > > > > > +
> > > > > > > > > +	.open			= vfio_mdev_pci_open,
> > > > > > > > > +	.release		= vfio_mdev_pci_release,
> > > > > > > > > +
> > > > > > > > > +	.read			= vfio_mdev_pci_read,
> > > > > > > > > +	.write			= vfio_mdev_pci_write,
> > > > > > > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > > > > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > > > > > > +				       const struct pci_device_id *id)
> > > > > > > > > +{
> > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > +	int ret;
> > > > > > > > > +
> > > > > > > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > > > > > > +		return -EINVAL;
> > > > > > > > > +
> > > > > > > > > +	/*
> > > > > > > > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > > > > > > > +	 * userspace instance with VFs and PFs from the same device, which
> > > > > > > > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > > > > > > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > > > > > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > > > > > > +	 * reject these PFs and let the user sort it out.
> > > > > > > > > +	 */
> > > > > > > > > +	if (pci_num_vf(pdev)) {
> > > > > > > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > > > > > > > +		return -EBUSY;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > > > > > > +	if (!vdev)
> > > > > > > > > +		return -ENOMEM;
> > > > > > > > > +
> > > > > > > > > +	vdev->pdev = pdev;
> > > > > > > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > > > > > > +	mutex_init(&vdev->igate);
> > > > > > > > > +	spin_lock_init(&vdev->irqlock);
> > > > > > > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > > > > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > > > > > > +	vdev->nointxmask = nointxmask;
> > > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > > +	vdev->disable_vga = disable_vga;
> > > > > > > > > +#endif
> > > > > > > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > > > > > > +
> > > > > > > > > +	pci_set_drvdata(pdev, vdev);
> > > > > > > > > +
> > > > > > > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > > > > > > +	if (ret) {
> > > > > > > > > +		pci_set_drvdata(pdev, NULL);
> > > > > > > > > +		kfree(vdev);
> > > > > > > > > +		return ret;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > > > > > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > > > > > > > +		vga_set_legacy_decoding(pdev,
> > > > > > > > > +					vfio_pci_set_vga_decode(vdev, false));
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	vfio_pci_probe_power_state(vdev);
> > > > > > > > > +
> > > > > > > > > +	if (!vdev->disable_idle_d3) {
> > > > > > > > > +		/*
> > > > > > > > > +		 * pci-core sets the device power state to an unknown value at
> > > > > > > > > +		 * bootup and after being removed from a driver.  The only
> > > > > > > > > +		 * transition it allows from this unknown state is to D0, which
> > > > > > > > > +		 * typically happens when a driver calls pci_enable_device().
> > > > > > > > > +		 * We're not ready to enable the device yet, but we do want to
> > > > > > > > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > > > > > > > +		 * before going to D3.
> > > > > > > > > +		 */
> > > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > > > > > > +	}  
> > > > > > > >
> > > > > > > > Ditto here and remove below, this seems like boilerplate that shouldn't
> > > > > > > > be duplicated per leaf module.  Thanks,
> > > > > > >
> > > > > > > Sure, the code snippet above may also be abstracted into a common API
> > > > > > > provided by vfio-pci-common.ko. :-)
> > > > > > >
> > > > > > > One point I need to confirm with you: do you also want the
> > > > > > > below code snippet to be placed in vfio-pci-common.ko and exposed
> > > > > > > as a wrapped API, so that it can be used by the sample driver and other
> > > > > > > future drivers which want to wrap a PCI device as a mdev? Maybe I
> > > > > > > misunderstood your comment. :-(
> > > > > >
> > > > > >
> > > > > > I think some sort of vfio_pci_common_{probe,remove}() would be a
> > > > > > reasonable starting point where the respective module _{probe,remove}
> > > > > > functions would call into these and add their module specific code
> > > > > > around it.  That would at least give us a point to cleanup things that
> > > > > > are only used by the common code in the common code.  
> > > > >
> > > > > sure, I can start from here if we are still going with this direction. :-)
> > > > >  
> > > > > > I'm still struggling with how we make this user consumable should we accept
> > > > > > this and progress beyond a proof of concept sample driver though.  For
> > > > > > example, if a vendor actually implements an mdev wrapper driver or even
> > > > > > just a device specific vfio-pci wrapper, to enable for example
> > > > > > migration support, how does a user know which driver to use for each
> > > > > > particular feature?  The best I can come up with so far is something
> > > > > > like was done for vfio-platform reset modules.  For instance a module
> > > > > > that extends features for a given device in vfio-pci might register an
> > > > > > ops structure and id table with vfio-pci, along with creating a module
> > > > > > alias (or aliases) for the devices it supports.  When a device is
> > > > > > probed by vfio-pci it could try to match against registered id tables
> > > > > > to find a device specific ops structure, if one is not found it could
> > > > > > do a request_module using the PCI vendor and device IDs and some unique
> > > > > > vfio-pci string, check again, and use the default ops if device
> > > > > > specific ops are still not present.  That would solve the problem on
> > > > > > the vfio-pci side.  
> > > > >
> > > > > yeah, this is letting vfio-pci invoke the ops from vendor drivers/modules.
> > > > > I think this is what Yan is trying to do.  
> > > > 
> > > > I think I'm suggesting a callback ops structure a level above what Yan
> > > > previously proposed.  For example, could we have device specific
> > > > vfio_device_ops where the vendor module can call out to common code
> > > > rather than requiring common code to test for and optionally call out
> > > > to device specific code.
> > > >   
> > > > > > For mdevs, I tend to assume that this vfio-mdev-pci
> > > > > > meta driver is an anomaly only for the purpose of creating a generic
> > > > > > test device for IOMMU backed mdevs and that "real" mdev vendor
> > > > > > drivers will just be mdev enlightened host drivers, like i915 and
> > > > > > nvidia are now.  Thanks,  
> > > > >
> > > > > yes, this vfio-mdev-pci meta driver is just creating a test device.
> > > > > Do we still go with the current direction, or find any other way
> > > > > which may be easier for adding this meta driver?  
> > > > 
> > > > I think if the code split allows us to create an environment where
> > > > vendor drivers can re-use much of vfio-pci while creating a
> > > > vfio_device_ops that supports additional features for their device and
> > > > we bring that all together with a request module interface and module
> > > > aliases to make that work seamlessly, then it has value.  A concern I
> > > > have in only doing this split in order to create the vfio-mdev-pci
> > > > module is that it leaves open the question and groundwork for forking
> > > > vfio-pci into multiple vendor specific modules that would become a mess
> > > > for users to manage.
> > > >   
> > > > > Compared with the "real" mdev vendor drivers, it is like a
> > > > > "vfio-pci + dummy mdev ops" driver. dummy mdev ops means
> > > > > no vendor specific handling and passthru to vfio-pci codes directly.
> > > > >
> > > > > I think this meta driver is even lighter than the "real" mdev vendor
> > > > > drivers. right? Is it possible to let this driver follow the way of
> > > > > registering ops structure and id table with vfio-pci? The obstacle
> > > > > I can see is the meta driver is a generic driver, which means it has
> > > > > no id table... For the "real" mdev vendor drivers, they naturally have
> > > > > such info. If vfio-mdev-pci can also get the id info without binding
> > > > > to a device, it may be possible. thoughts? :-)  
> > > > 
> > > > IDs could be provided via a module option or potentially with
> > > > build-time options.  That might allow us to test all aspects of the
> > > > above proposal, ie. allowing sub-modules to provide vfio_device_ops for
> > > > specific devices, allowing those vendor vfio_device_ops to re-use much
> > > > of the existing vfio-pci code in that implementation, and a mechanism
> > > > for generically testing IOMMU backed mdevs.  That's starting to sound a
> > > > lot more worthwhile than moving a bunch of code around only to
> > > > implement a sample driver for the latter.  Thoughts?  Thanks,
> > > >   
> > > 
> > > sounds like a good idea. If feasible, I suppose Yan's mediate_ops series
> > > can also be largely avoided. The vendor driver can directly register its
> > > own vfio_device_ops and selectively introduce proprietary logic
> > > (e.g. for tracking dirty pages) on top of the generic vfio_pci code.
> > 
> > hi Alex
> > as we previously discussed, I'm preparing to implement my v2 this
> > way:
> > 
> > 1. on vfio-pci binding to a device, it will modprobe modules of alias
> > "vfio-pci-(vendorid)-(deviceid)", as a way to notify vendor drivers of
> > registering their vendor ops. (I renamed mediate_ops to vendor_ops in
> > v2)
> > 2. in a module aliasing to "vfio-pci-(vendor_id)-(device_id)", in its
> > module_init, it will register a vendor ops to vfio-pci.
> > If there are two modules with the same alias that both register vendor
> > ops at the same time, they are chained according to the prio in
> > their vendor ops.
> > 3. vfio-pci would ask for region_infos for all vendor ops of a vdev in
> > vfio_pci_open, and init regions for vendor drivers. Current code in
> > vfio_pci_igd.c, vfio_pci_nvlink2.c, vfio_pci_nvlink2.c would all be
> > wrapped into separate modules. so current vfio_pci_register_dev_region()
> > would be removed accordingly. vfio_pci_rw would now be direct to 
> > vendor_ops->region[i].rw. higher priority module's ops wins.
> > For example, module vfio_pci_igd may register to regions of index 10,
> > 11, 12 for its opregion, and two cfg regions. still, vendor driver can
> > provide a module named i915_migration to register for regions of index 0
> > and 13 for BAR0 and migration.
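A minimal user-space sketch of steps 1 to 3. It assumes each vendor ops carries a priority and a bitmask of the region indexes it claims. For a given region, the highest-priority ops that claims it wins, and NULL means the default vfio-pci handling applies. All names (vendor_ops, register_vendor_ops, find_region_owner, build_alias) and the exact alias format are hypothetical illustrations of the plan. They are not existing vfio-pci interfaces.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical vendor ops: a name, a priority and a bitmask of claimed regions. */
struct vendor_ops {
	const char *name;
	int prio;                  /* higher value wins */
	unsigned long regions;     /* bit i set => claims region index i */
	struct vendor_ops *next;
};

static struct vendor_ops *chain;   /* kept sorted by descending prio */

/* Insert ops into the chain, keeping it ordered by descending priority. */
static void register_vendor_ops(struct vendor_ops *ops)
{
	struct vendor_ops **pp = &chain;

	while (*pp && (*pp)->prio >= ops->prio)
		pp = &(*pp)->next;
	ops->next = *pp;
	*pp = ops;
}

/* Highest-priority ops claiming region @index, or NULL for default handling. */
static struct vendor_ops *find_region_owner(unsigned int index)
{
	struct vendor_ops *ops;

	for (ops = chain; ops; ops = ops->next)
		if (ops->regions & (1UL << index))
			return ops;
	return NULL;
}

/* Build the alias vfio-pci would modprobe (format assumed), e.g. "vfio-pci-8086-3e92". */
static void build_alias(char *buf, size_t len, unsigned int vendor,
			unsigned int device)
{
	snprintf(buf, len, "vfio-pci-%04x-%04x", vendor, device);
}
```

Suppose vfio_pci_igd registers with prio 10 and regions 10 to 12, and i915_migration registers with prio 20 and regions 0 and 13. Region 0 then routes to i915_migration and region 11 to vfio_pci_igd, which matches the example in step 3.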
> 
> My major complaint with the previous version was that sprinkling random
> vendor ops call-outs everywhere in vfio-pci is ugly and hard to
> maintain.  The idea I'm proposing here is that sub-modules (loaded via
> alias) would provide the entire vfio_device_ops for a device.  Yi's
> series here would split out common code to make it trivial for vendor
> modules to implement those device ops using pieces of vfio-pci if they
> wish to do so.  Having multiple modules implement features of a device
> based on their loading priority sounds powerful, but also difficult to
> maintain and debug.  Do we need that functionality if a vendor
> vfio_device_ops can implement it themselves in a handful of lines of
> code?  Thanks,

The main purpose of providing multiple modules is to let each module
focus on implementing the regions it is interested in. If a vendor module
has to provide vfio_device_ops, I don't think it's only a handful of
lines of code for them.
For example, in vfio_device_ops.open(), they at least have to hold
&vdev->reflck->lock and call vfio_pci_enable and
vfio_spapr_pci_eeh_open. Also, vdev is private inside vfio_pci; do we
really want to export this structure?
The same goes for vfio_device_ops.ioctl(): if a vendor driver needs to do
something a little different from vfio_pci_ioctl(), e.g. init a new region, it has
to decode the region index and know the internals of vdev->region[i].
When it comes to vfio_device_ops.remove(), in Yi's code, it even has to
free each lock and region... in vdev.

Besides that, one thing I don't understand is that Yi's sample code is
an mdev driver, so rather than binding to vfio-pci, a PCI device would
bind to Yi's driver directly. How, then, would this register-with-vfio-pci
approach work for him?

Thanks
Yan


* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-21 21:54                   ` Yan Zhao
@ 2020-01-23 23:33                     ` Alex Williamson
  2020-01-31  2:26                       ` Yan Zhao
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2020-01-23 23:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, Liu, Yi L, kwankhede, linux-kernel, kvm, joro,
	peterx, baolu.lu, Masahiro Yamada

On Tue, 21 Jan 2020 16:54:45 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, Jan 22, 2020 at 04:04:38AM +0800, Alex Williamson wrote:
> > On Tue, 21 Jan 2020 03:43:51 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Tue, Jan 21, 2020 at 03:43:02PM +0800, Tian, Kevin wrote:  
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Tuesday, January 21, 2020 5:08 AM
> > > > > 
> > > > > On Sat, 18 Jan 2020 14:25:11 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >     
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Friday, January 17, 2020 5:24 AM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > > >
> > > > > > > On Thu, 16 Jan 2020 12:33:06 +0000
> > > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > > > >    
> > > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > > Sent: Friday, January 10, 2020 6:49 AM
> > > > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > > > > >
> > > > > > > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > > > > >    
> > > > > > > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > > > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > > > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > > > > > > go through vfio mdev framework. The usage of the device follows
> > > > > > > > > > mdev management method. e.g. user should create a mdev before
> > > > > > > > > > exposing the device to user-space.
> > > > > > > > > >
> > > > > > > > > > Benefit of this new driver would be acting as a sample driver
> > > > > > > > > > > for recent changes from "vfio/mdev: IOMMU aware mediated device"
> > > > > > > > > > patchset. Also it could be a good experiment driver for future
> > > > > > > > > > device specific mdev migration support. This sample driver only
> > > > > > > > > > supports singleton iommu groups, for non-singleton iommu groups,
> > > > > > > > > > this sample driver doesn't work. It will fail when trying to assign
> > > > > > > > > > the non-singleton iommu group to VMs.
> > > > > > > > > >
> > > > > > > > > > To use this driver:
> > > > > > > > > > a) build and load vfio-mdev-pci.ko module
> > > > > > > > > > >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> > > > > > > > > >    then load it with following command:    
> > > > > > > > > >    > sudo modprobe vfio
> > > > > > > > > >    > sudo modprobe vfio-pci
> > > > > > > > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko    
> > > > > > > > > >
> > > > > > > > > > b) unbind original device driver
> > > > > > > > > >    e.g. use following command to unbind its original driver    
> > > > > > > > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind    
> > > > > > > > > >
> > > > > > > > > > c) bind vfio-mdev-pci driver to the physical device    
> > > > > > > > > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> > > > > > > > > >
> > > > > > > > > > d) check the supported mdev instances    
> > > > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/    
> > > > > > > > > >      vfio-mdev-pci-type_name    
> > > > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\    
> > > > > > > > > >      vfio-mdev-pci-type_name/
> > > > > > > > > >      available_instances  create  device_api  devices  name
> > > > > > > > > >
> > > > > > > > > > e)  create mdev on this physical device (only 1 instance)    
> > > > > > > > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \    
> > > > > > > > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > > > > > >      vfio-mdev-pci-type_name/create
> > > > > > > > > >
> > > > > > > > > > f) passthru the mdev to guest
> > > > > > > > > >    add the following line in QEMU boot command
> > > > > > > > > >     -device vfio-pci,\
> > > > > > > > > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > > > > > > > >
> > > > > > > > > > g) destroy mdev    
> > > > > > > > > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> > > > > > > > > >      remove
> > > > > > > > > >
> > > > > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > > > > > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > > > > > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > > > ---
> > > > > > > > > >  samples/Kconfig                       |  10 +
> > > > > > > > > >  samples/Makefile                      |   1 +
> > > > > > > > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > > > > > > > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > > > > > > > >  4 files changed, 412 insertions(+)
> > > > > > > > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > > > > > > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > >
> > > > > > > > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > > > > > > > index 9d236c3..50d207c 100644
> > > > > > > > > > --- a/samples/Kconfig
> > > > > > > > > > +++ b/samples/Kconfig
> > > > > > > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > > > > > > >  	help
> > > > > > > > > >  	  Build a sample program to work with mei device.
> > > > > > > > > >
> > > > > > > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > > > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > > > > > > +	select VFIO_PCI_COMMON
> > > > > > > > > > +	select VFIO_PCI
> > > > > > > > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > > > > > > > +	help
> > > > > > > > > > +	  Sample driver for wrapping a PCI device as a mdev. Once    
> > > > > bound to    
> > > > > > > > > > +	  this driver, device passthru should through mdev path.
> > > > > > > > > > +
> > > > > > > > > > +	  If you don't know what to do here, say N.
> > > > > > > > > >
> > > > > > > > > >  endif # SAMPLES
> > > > > > > > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > > > > > > > index 5ce50ef..84faced 100644
> > > > > > > > > > --- a/samples/Makefile
> > > > > > > > > > +++ b/samples/Makefile
> > > > > > > > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)    
> > > > > 	+= ftrace/    
> > > > > > > > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > > > > > > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > > > > > > > >  obj-y					+= vfio-mdev/
> > > > > > > > > > +obj-y					+= vfio-mdev-pci/    
> > > > > > > > >
> > > > > > > > > I think we could just lump this into vfio-mdev rather than making
> > > > > > > > > another directory.    
> > > > > > > >
> > > > > > > > sure. will move it. :-)
> > > > > > > >    
> > > > > > > > >    
> > > > > > > > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > > > > > > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > > > > > > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 0000000..41b2139
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > > > > > > > @@ -0,0 +1,4 @@
> > > > > > > > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > > > > > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > > > > > > > +
> > > > > > > > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > > > > > > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 0000000..b180356
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > > @@ -0,0 +1,397 @@
> > > > > > > > > > +/*
> > > > > > > > > > + * Copyright © 2020 Intel Corporation.
> > > > > > > > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > > > + *
> > > > > > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > > > > > + * published by the Free Software Foundation.
> > > > > > > > > > + *
> > > > > > > > > > + * Derived from original vfio_pci.c:
> > > > > > > > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > > > > > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > + *
> > > > > > > > > > + * Derived from original vfio:
> > > > > > > > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > > > > > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > > > > > > > + */
> > > > > > > > > > +
> > > > > > > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > > > > > > +
> > > > > > > > > > +#include <linux/device.h>
> > > > > > > > > > +#include <linux/eventfd.h>
> > > > > > > > > > +#include <linux/file.h>
> > > > > > > > > > +#include <linux/interrupt.h>
> > > > > > > > > > +#include <linux/iommu.h>
> > > > > > > > > > +#include <linux/module.h>
> > > > > > > > > > +#include <linux/mutex.h>
> > > > > > > > > > +#include <linux/notifier.h>
> > > > > > > > > > +#include <linux/pci.h>
> > > > > > > > > > +#include <linux/pm_runtime.h>
> > > > > > > > > > +#include <linux/slab.h>
> > > > > > > > > > +#include <linux/types.h>
> > > > > > > > > > +#include <linux/uaccess.h>
> > > > > > > > > > +#include <linux/vfio.h>
> > > > > > > > > > +#include <linux/vgaarb.h>
> > > > > > > > > > +#include <linux/nospec.h>
> > > > > > > > > > +#include <linux/mdev.h>
> > > > > > > > > > +#include <linux/vfio_pci_common.h>
> > > > > > > > > > +
> > > > > > > > > > +#define DRIVER_VERSION  "0.1"
> > > > > > > > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > > > > > > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > > > > > > > +
> > > > > > > > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > > > > > > > +
> > > > > > > > > > +static char ids[1024] __initdata;
> > > > > > > > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > > > > > > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > > > > > > > +
> > > > > > > > > > +static bool nointxmask;
> > > > > > > > > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > > > > > > > +MODULE_PARM_DESC(nointxmask,
> > > > > > > > > > +		  "Disable support for PCI 2.3 style INTx masking.  If    
> > > > > this resolves    
> > > > > > > > > problems for specific devices, report lspci -vvvxxx to linux-    
> > > > > pci@vger.kernel.org    
> > > > > > > so    
> > > > > > > > > the device can be fixed automatically via the broken_intx_masking    
> > > > > flag.");    
> > > > > > > > > > +
> > > > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > > > +static bool disable_vga;
> > > > > > > > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > > > > > > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > > > > > > > +#endif
> > > > > > > > > > +
> > > > > > > > > > +static bool disable_idle_d3;
> > > > > > > > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > > > > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > > > > > > > +		 "Disable using the PCI D3 low power state for idle,    
> > > > > unused devices");    
> > > > > > > > > > +
> > > > > > > > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > > > > > > > +
> > > > > > > > > > +static ssize_t
> > > > > > > > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > > > +{
> > > > > > > > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +MDEV_TYPE_ATTR_RO(name);
> > > > > > > > > > +
> > > > > > > > > > +static ssize_t
> > > > > > > > > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > > > +{
> > > > > > > > > > +	return sprintf(buf, "%d\n", 1);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > > > > > > > +
> > > > > > > > > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > > > > > > > +		char *buf)
> > > > > > > > > > +{
> > > > > > > > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > > > > > > > +
> > > > > > > > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > > > > > > > +	&mdev_type_attr_name.attr,
> > > > > > > > > > +	&mdev_type_attr_device_api.attr,
> > > > > > > > > > +	&mdev_type_attr_available_instances.attr,
> > > > > > > > > > +	NULL,
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > > > > > > > +	.name  = "type1",
> > > > > > > > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > > > > > > > +	&vfio_mdev_pci_type_group1,
> > > > > > > > > > +	NULL,
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > > +struct vfio_mdev_pci {
> > > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > > +	struct mdev_device *mdev;
> > > > > > > > > > +	unsigned long handle;
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > > > > > > > +{
> > > > > > > > > > +	struct device *pdev;
> > > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev;
> > > > > > > > > > +	int ret;
> > > > > > > > > > +
> > > > > > > > > > +	pdev = mdev_parent_dev(mdev);
> > > > > > > > > > +	vdev = dev_get_drvdata(pdev);
> > > > > > > > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > > > > > > > +	if (pmdev == NULL) {
> > > > > > > > > > +		ret = -EBUSY;
> > > > > > > > > > +		goto out;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	pmdev->mdev = mdev;
> > > > > > > > > > +	pmdev->vdev = vdev;
> > > > > > > > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > > > > > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > > > > > > > +	if (ret) {
> > > > > > > > > > +		pr_info("%s, failed to config iommu isolation for    
> > > > > mdev: %s on    
> > > > > > > > > pf: %s\n",    
> > > > > > > > > > +			__func__, dev_name(mdev_dev(mdev)),    
> > > > > dev_name(pdev));    
> > > > > > > > > > +		goto out;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > > > +out:
> > > > > > > > > > +	return ret;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > +
> > > > > > > > > > +	kfree(pmdev);
> > > > > > > > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > > > +
> > > > > > > > > > +	return 0;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > > > +	int ret = 0;
> > > > > > > > > > +
> > > > > > > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > > > > > > +		return -ENODEV;
> > > > > > > > > > +
> > > > > > > > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > > > > > > > +
> > > > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > > > +
> > > > > > > > > > +	if (!vdev->refcnt) {
> > > > > > > > > > +		ret = vfio_pci_enable(vdev);
> > > > > > > > > > +		if (ret)
> > > > > > > > > > +			goto error;
> > > > > > > > > > +
> > > > > > > > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > > > > > > +	}
> > > > > > > > > > +	vdev->refcnt++;
> > > > > > > > > > +error:
> > > > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > > > +	if (!ret)
> > > > > > > > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev-    
> > > > > >vdev->pdev-    
> > > > > > > > > >dev));
> > > > > > > > > > +	else {
> > > > > > > > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev-    
> > > > > >vdev->pdev-    
> > > > > > > > > >dev));
> > > > > > > > > > +		module_put(THIS_MODULE);
> > > > > > > > > > +	}
> > > > > > > > > > +	return ret;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > > > +
> > > > > > > > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev-    
> > > > > >vdev->pdev-    
> > > > > > > > > >dev));
> > > > > > > > > > +
> > > > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > > > +
> > > > > > > > > > +	if (!(--vdev->refcnt)) {
> > > > > > > > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > > > > > > +		vfio_pci_disable(vdev);
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > > > +
> > > > > > > > > > +	module_put(THIS_MODULE);
> > > > > > > > > > +}    
> > > > > > > > >
> > > > > > > > > open() and release() here are almost identical between vfio_pci and
> > > > > > > > > vfio_mdev_pci, which suggests maybe there should be common functions to
> > > > > > > > > call into like we do for the below.    
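A minimal user-space sketch of such a common open/release pair. It factors out the refcounted enable/disable sequence that vfio_mdev_pci_open()/vfio_mdev_pci_release() above share with vfio_pci. The names (common_dev, common_open, common_release) are hypothetical. A pthread mutex stands in for vdev->reflck->lock, and the enable/disable callbacks stand in for vfio_pci_enable()/vfio_pci_disable() plus the vfio_spapr_pci_eeh_* hooks.

```c
#include <pthread.h>

/* Hypothetical stand-in for the shared state in struct vfio_pci_device. */
struct common_dev {
	pthread_mutex_t lock;                    /* role of vdev->reflck->lock */
	int refcnt;
	int (*enable)(struct common_dev *dev);   /* called on first open only */
	void (*disable)(struct common_dev *dev); /* called on last release only */
};

/* Enable the device on the first open; later opens only take a reference. */
static int common_open(struct common_dev *dev)
{
	int ret = 0;

	pthread_mutex_lock(&dev->lock);
	if (!dev->refcnt)
		ret = dev->enable(dev);
	if (!ret)
		dev->refcnt++;   /* a failed enable takes no reference */
	pthread_mutex_unlock(&dev->lock);
	return ret;
}

/* Drop a reference and disable the device when the last user goes away. */
static void common_release(struct common_dev *dev)
{
	pthread_mutex_lock(&dev->lock);
	if (!--dev->refcnt)
		dev->disable(dev);
	pthread_mutex_unlock(&dev->lock);
}
```

With a helper like this, each leaf module's open()/release() would shrink to its own module reference handling around a single call.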
> > > > > > > >
> > > > > > > > yes, let me have more study and do better abstract in next version. :-)
> > > > > > > >    
> > > > > > > > > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > > > > > > > +			     unsigned long arg)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > +
> > > > > > > > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > > > > > > > +				struct vm_area_struct *vma)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > +
> > > > > > > > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > > > > > > > +			size_t count, loff_t *ppos)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > +
> > > > > > > > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > > > > > > > +				const char __user *buf,
> > > > > > > > > > +				size_t count, loff_t *ppos)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > +
> > > > > > > > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf,    
> > > > > count, ppos);    
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > > > > > > > +	.supported_type_groups	=    
> > > > > vfio_mdev_pci_type_groups,    
> > > > > > > > > > +	.create			= vfio_mdev_pci_create,
> > > > > > > > > > +	.remove			= vfio_mdev_pci_remove,
> > > > > > > > > > +
> > > > > > > > > > +	.open			= vfio_mdev_pci_open,
> > > > > > > > > > +	.release		= vfio_mdev_pci_release,
> > > > > > > > > > +
> > > > > > > > > > +	.read			= vfio_mdev_pci_read,
> > > > > > > > > > +	.write			= vfio_mdev_pci_write,
> > > > > > > > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > > > > > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > > > > > > > +				       const struct pci_device_id *id)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > > +	int ret;
> > > > > > > > > > +
> > > > > > > > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > > > > > > > +		return -EINVAL;
> > > > > > > > > > +
> > > > > > > > > > +	/*
> > > > > > > > > > +	 * Prevent binding to PFs with VFs enabled, this too easily    
> > > > > allows    
> > > > > > > > > > +	 * userspace instance with VFs and PFs from the same device,    
> > > > > which    
> > > > > > > > > > +	 * cannot work.  Disabling SR-IOV here would initiate    
> > > > > removing the    
> > > > > > > > > > +	 * VFs, which would unbind the driver, which is prone to    
> > > > > blocking    
> > > > > > > > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > > > > > > > +	 * reject these PFs and let the user sort it out.
> > > > > > > > > > +	 */
> > > > > > > > > > +	if (pci_num_vf(pdev)) {
> > > > > > > > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV    
> > > > > enabled\n");    
> > > > > > > > > > +		return -EBUSY;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > > > > > > > +	if (!vdev)
> > > > > > > > > > +		return -ENOMEM;
> > > > > > > > > > +
> > > > > > > > > > +	vdev->pdev = pdev;
> > > > > > > > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > > > > > > > +	mutex_init(&vdev->igate);
> > > > > > > > > > +	spin_lock_init(&vdev->irqlock);
> > > > > > > > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > > > > > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > > > > > > > +	vdev->nointxmask = nointxmask;
> > > > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > > > +	vdev->disable_vga = disable_vga;
> > > > > > > > > > +#endif
> > > > > > > > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > > > > > > > +
> > > > > > > > > > +	pci_set_drvdata(pdev, vdev);
> > > > > > > > > > +
> > > > > > > > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > > > > > > > +	if (ret) {
> > > > > > > > > > +		pci_set_drvdata(pdev, NULL);
> > > > > > > > > > +		kfree(vdev);
> > > > > > > > > > +		return ret;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > > > > > > > +		vga_client_register(pdev, vdev, NULL,    
> > > > > vfio_pci_set_vga_decode);    
> > > > > > > > > > +		vga_set_legacy_decoding(pdev,
> > > > > > > > > > +    
> > > > > 	vfio_pci_set_vga_decode(vdev, false));    
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	vfio_pci_probe_power_state(vdev);
> > > > > > > > > > +
> > > > > > > > > > +	if (!vdev->disable_idle_d3) {
> > > > > > > > > > +		/*
> > > > > > > > > > +		 * pci-core sets the device power state to an    
> > > > > unknown value at    
> > > > > > > > > > +		 * bootup and after being removed from a driver.    
> > > > > The only    
> > > > > > > > > > +		 * transition it allows from this unknown state is to    
> > > > > D0, which    
> > > > > > > > > > +		 * typically happens when a driver calls    
> > > > > pci_enable_device().    
> > > > > > > > > > +		 * We're not ready to enable the device yet, but we    
> > > > > do want to    
> > > > > > > > > > +		 * be able to get to D3.  Therefore first do a D0    
> > > > > transition    
> > > > > > > > > > +		 * before going to D3.
> > > > > > > > > > +		 */
> > > > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > > > > > > > +	}    
> > > > > > > > >
> > > > > > > > > Ditto here and remove below, this seems like boilerplate that shouldn't
> > > > > > > > > be duplicated per leaf module.  Thanks,    
> > > > > > > >
> > > > > > > > Sure, the code snippet above may also be abstracted into a common API
> > > > > > > > provided by vfio-pci-common.ko. :-)
> > > > > > > >
> > > > > > > > I have a question I'd like to confirm with you. Do you also want the
> > > > > > > > code snippet below to be placed in vfio-pci-common.ko and exposed
> > > > > > > > as a wrapped API? Then it could be used by the sample driver and other
> > > > > > > > future drivers which want to wrap a PCI device as an mdev. Maybe I
> > > > > > > > misunderstood your comment. :-(
> > > > > > >
> > > > > > >
> > > > > > > I think some sort of vfio_pci_common_{probe,remove}() would be a
> > > > > > > reasonable starting point where the respective module _{probe,remove}
> > > > > > > functions would call into these and add their module specific code
> > > > > > > around it.  That would at least give us a point to cleanup things that
> > > > > > > are only used by the common code in the common code.    
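A minimal user-space sketch of that layering. It assumes hypothetical common_probe()/common_remove() helpers that own the shared setup. The leaf module adds its own registration step after the common part and unwinds in reverse order on failure. None of these names exist in vfio-pci. calloc()/free() stand in for the kzalloc()/kfree() and lock initialization done in the real probe path.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-device state owned by the common code. */
struct common_state {
	int initialized;   /* set by common_probe() */
	int registered;    /* set by the leaf module, never by common code */
};

/* Shared setup that every leaf driver would otherwise duplicate. */
static int common_probe(struct common_state **out)
{
	struct common_state *s = calloc(1, sizeof(*s));

	if (!s)
		return -ENOMEM;
	s->initialized = 1;
	*out = s;
	return 0;
}

static void common_remove(struct common_state *s)
{
	free(s);
}

/* Hypothetical leaf-specific step, e.g. registering as an mdev parent. */
static int leaf_register(struct common_state *s, int fail)
{
	if (fail)
		return -EBUSY;
	s->registered = 1;
	return 0;
}

/* Leaf probe: common setup first, module-specific work after, unwind on error. */
static int leaf_probe(struct common_state **out, int fail_register)
{
	struct common_state *s;
	int ret = common_probe(&s);

	if (ret)
		return ret;
	ret = leaf_register(s, fail_register);
	if (ret) {
		common_remove(s);   /* undo only what succeeded */
		return ret;
	}
	*out = s;
	return 0;
}

/* Leaf remove: undo module-specific work first, then the common teardown. */
static void leaf_remove(struct common_state *s)
{
	s->registered = 0;
	common_remove(s);
}
```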
> > > > > >
> > > > > > sure, I can start from here if we are still going with this direction. :-)
> > > > > >    
> > > > > > > I'm still struggling how we make this user consumable should we accept
> > > > > > > this and progress beyond a proof of concept sample driver though.  For
> > > > > > > example, if a vendor actually implements an mdev wrapper driver or even
> > > > > > > just a device specific vfio-pci wrapper, to enable for example
> > > > > > > migration support, how does a user know which driver to use for each
> > > > > > > particular feature?  The best I can come up with so far is something
> > > > > > > like was done for vfio-platform reset modules.  For instance a module
> > > > > > > that extends features for a given device in vfio-pci might register an
> > > > > > > ops structure and id table with vfio-pci, along with creating a module
> > > > > > > alias (or aliases) for the devices it supports.  When a device is
> > > > > > > probed by vfio-pci it could try to match against registered id tables
> > > > > > > to find a device specific ops structure, if one is not found it could
> > > > > > > do a request_module using the PCI vendor and device IDs and some unique
> > > > > > > vfio-pci string, check again, and use the default ops if device
> > > > > > > specific ops are still not present.  That would solve the problem on
> > > > > > > the vfio-pci side.    
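A minimal user-space sketch of the lookup order described here. It matches the registered id tables first. On a miss, it requests a module by alias and checks again, then falls back to the default ops. The table layout, the function names and the alias format are assumptions, since the text only asks for "some unique vfio-pci string". The fake_request_module pointer stands in for the kernel's request_module().

```c
#include <stdio.h>
#include <stddef.h>

struct dev_ops { const char *name; };

/* Hypothetical registered (vendor, device) -> ops mapping. */
struct ops_entry {
	unsigned int vendor, device;
	const struct dev_ops *ops;
};

static const struct dev_ops default_ops = { "vfio-pci default" };
static struct ops_entry registry[8];
static int nr_entries;

/* What a device-specific module would call from its init function. */
static void register_ops(unsigned int vendor, unsigned int device,
			 const struct dev_ops *ops)
{
	if (nr_entries < (int)(sizeof(registry) / sizeof(registry[0])))
		registry[nr_entries++] = (struct ops_entry){ vendor, device, ops };
}

static const struct dev_ops *match_ops(unsigned int vendor, unsigned int device)
{
	for (int i = 0; i < nr_entries; i++)
		if (registry[i].vendor == vendor && registry[i].device == device)
			return registry[i].ops;
	return NULL;
}

/* Stand-in for request_module(); returns 0 if a module was loaded. */
static int (*fake_request_module)(const char *alias);

static const struct dev_ops *lookup_ops(unsigned int vendor, unsigned int device)
{
	const struct dev_ops *ops = match_ops(vendor, device);
	char alias[32];

	if (ops)
		return ops;
	snprintf(alias, sizeof(alias), "vfio-pci-%04x-%04x", vendor, device);
	if (fake_request_module && fake_request_module(alias) == 0)
		ops = match_ops(vendor, device);   /* check again after loading */
	return ops ? ops : &default_ops;
}
```

A device with no matching module still gets the default ops. Only devices whose module registers an entry diverge from plain vfio-pci behavior.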
> > > > > >
> > > > > > yeah, this lets vfio-pci invoke the ops from vendor drivers/modules.
> > > > > > I think this is what Yan is trying to do.
> > > > > 
> > > > > I think I'm suggesting a callback ops structure a level above what Yan
> > > > > previously proposed.  For example, could we have device specific
> > > > > vfio_device_ops where the vendor module can call out to common code
> > > > > rather than requiring common code to test for and optionally call out
> > > > > to device specific code.
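A minimal user-space sketch of the inversion being suggested. The vendor module owns the whole ops table, overrides only the callback it needs (ioctl here) and delegates the rest to common helpers. Common code therefore never tests for optional vendor hooks. struct dev_ops is a hypothetical two-callback stand-in for vfio_device_ops, and the vendor command number is made up.

```c
#include <errno.h>

struct device_state { int opened; };

/* Hypothetical subset of a vfio_device_ops-like table. */
struct dev_ops {
	int  (*open)(struct device_state *s);
	long (*ioctl)(struct device_state *s, unsigned int cmd);
};

/* Stand-ins for helpers a common vfio-pci module would export. */
static int common_open(struct device_state *s)
{
	s->opened = 1;
	return 0;
}

static long common_ioctl(struct device_state *s, unsigned int cmd)
{
	(void)s;
	(void)cmd;
	return -ENOTTY;   /* the common code does not know this command */
}

#define VENDOR_CMD_MIGRATION 0x100u   /* hypothetical vendor-only command */

/* Vendor ioctl: handle its own command, delegate everything else. */
static long vendor_ioctl(struct device_state *s, unsigned int cmd)
{
	if (cmd == VENDOR_CMD_MIGRATION)
		return 0;
	return common_ioctl(s, cmd);
}

/* The vendor module supplies the whole table, reusing common_open as-is. */
static const struct dev_ops vendor_ops = {
	.open  = common_open,
	.ioctl = vendor_ioctl,
};
```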
> > > > >     
> > > > > > > For mdevs, I tend to assume that this vfio-mdev-pci
> > > > > > > meta driver is an anomaly only for the purpose of creating a generic
> > > > > > > test device for IOMMU backed mdevs and that "real" mdev vendor
> > > > > > > drivers will just be mdev enlightened host drivers, like i915 and
> > > > > > > nvidia are now.  Thanks,    
> > > > > >
> > > > > > yes, this vfio-mdev-pci meta driver is just creating a test device.
> > > > > > Do we still go with the current direction, or find any other way
> > > > > > which may be easier for adding this meta driver?    
> > > > > 
> > > > > I think if the code split allows us to create an environment where
> > > > > vendor drivers can re-use much of vfio-pci while creating a
> > > > > vfio_device_ops that supports additional features for their device and
> > > > > we bring that all together with a request module interface and module
> > > > > aliases to make that work seamlessly, then it has value.  My concern
> > > > > with doing this split only to create the vfio-mdev-pci module is that
> > > > > it leaves open the question of, and lays the groundwork for, forking
> > > > > vfio-pci into multiple vendor-specific modules that would become a mess
> > > > > for users to manage.
> > > > >     
> > > > > > Compared with the "real" mdev vendor drivers, it is like a
> > > > > > "vfio-pci + dummy mdev ops" driver, where dummy mdev ops means
> > > > > > no vendor-specific handling, passing through to the vfio-pci code directly.
> > > > > >
> > > > > > I think this meta driver is even lighter than the "real" mdev vendor
> > > > > > drivers, right? Is it possible to let this driver follow the approach of
> > > > > > registering an ops structure and id table with vfio-pci? The obstacle
> > > > > > I can see is that the meta driver is generic, which means it has
> > > > > > no id table. The "real" mdev vendor drivers naturally have
> > > > > > such info. If vfio-mdev-pci could also get the id info without binding
> > > > > > to a device, it may be possible. Thoughts? :-)
> > > > > 
> > > > > IDs could be provided via a module option or potentially with
> > > > > build-time options.  That might allow us to test all aspects of the
> > > > > above proposal, ie. allowing sub-modules to provide vfio_device_ops for
> > > > > specific devices, allowing those vendor vfio_device_ops to re-use much
> > > > > of the existing vfio-pci code in that implementation, and a mechanism
> > > > > for generically testing IOMMU backed mdevs.  That's starting to sound a
> > > > > lot more worthwhile than moving a bunch of code around only to
> > > > > implement a sample driver for the latter.  Thoughts?  Thanks,
> > > > >     
> > > > 
> > > > Sounds like a good idea. If feasible, I suppose Yan's mediate_ops series
> > > > can also be largely avoided. The vendor driver can directly register its
> > > > own vfio_device_ops and selectively introduce proprietary logic
> > > > (e.g. for tracking dirty pages) on top of the generic vfio_pci code.
> > > 
> > > Hi Alex,
> > > as we previously discussed, I'm preparing to implement my v2 this
> > > way:
> > > 
> > > 1. When vfio-pci binds to a device, it will modprobe modules with the alias
> > > "vfio-pci-(vendor_id)-(device_id)", as a way to notify vendor drivers to
> > > register their vendor ops. (I renamed mediate_ops to vendor_ops in
> > > v2.)
> > > 2. A module with the alias "vfio-pci-(vendor_id)-(device_id)" registers
> > > its vendor ops with vfio-pci in its module_init.
> > > If two modules have the same alias and both register vendor
> > > ops at the same time, they are chained according to the priority in
> > > their vendor ops.
> > > 3. vfio-pci would query region_infos from all vendor ops of a vdev in
> > > vfio_pci_open, and init regions for vendor drivers. Current code in
> > > vfio_pci_igd.c, vfio_pci_nvlink2.c, vfio_pci_nvlink2.c would all be
> > > wrapped into separate modules, so the current vfio_pci_register_dev_region()
> > > would be removed accordingly. vfio_pci_rw would then be dispatched directly
> > > to vendor_ops->region[i].rw, and the higher-priority module's ops win.
> > > For example, module vfio_pci_igd may register regions of index 10,
> > > 11 and 12 for its opregion and two cfg regions, while a vendor driver can
> > > still provide a module named i915_migration to register regions of index 0
> > > and 13 for BAR0 and migration.
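A minimal sketch of the per-region "higher priority wins" rule from step 3, written as userspace C. It assumes each region index has exactly one owner and that, on equal priority, the module that registered first keeps the region; the proposal does not specify the tie case. vendor_register_region(), region_rw() and the two example handlers are made-up names, and the region indexes only mirror the igd/i915_migration example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGIONS 16

/* Hypothetical per-region access handler; returns a value for testing. */
typedef int (*region_rw_fn)(unsigned int index);

struct region_slot {
	region_rw_fn rw;	/* handler of the module that owns the region */
	int prio;		/* priority of that module */
};

static struct region_slot regions[NUM_REGIONS];

/*
 * Returns 0 if the handler was installed, 1 if an owner of equal or
 * higher priority keeps the region, -1 on invalid arguments.
 */
int vendor_register_region(unsigned int index, region_rw_fn rw, int prio)
{
	if (index >= NUM_REGIONS || rw == NULL)
		return -1;
	if (regions[index].rw != NULL && regions[index].prio >= prio)
		return 1;
	regions[index].rw = rw;
	regions[index].prio = prio;
	return 0;
}

/* Dispatch an access to whichever module currently owns the region. */
int region_rw(unsigned int index)
{
	assert(index < NUM_REGIONS);
	return regions[index].rw ? regions[index].rw(index) : -1;
}

/* Example handlers standing in for vfio_pci_igd and i915_migration. */
static int igd_rw(unsigned int index)
{
	return 100 + (int)index;
}

static int migration_rw(unsigned int index)
{
	return 200 + (int)index;
}
```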
> > 
> > My major complaint with the previous version was that sprinkling random
> > vendor ops call-outs everywhere in vfio-pci is ugly and hard to
> > maintain.  The idea I'm proposing here is that sub-modules (loaded via
> > alias) would provide the entire vfio_device_ops for a device.  Yi's
> > series here would split out common code to make it trivial for vendor
> > modules to implement those device ops using pieces of vfio-pci if they
> > wish to do so.  Having multiple modules implement features of a device
> > based on their loading priority sounds powerful, but also difficult to
> > maintain and debug.  Do we need that functionality if a vendor
> > vfio_device_ops can implement it themselves in a handful of lines of
> > code?  Thanks,  
> 
> The main purpose of providing multiple modules is to let each module
> focus on implementing the regions it is interested in. If a vendor module
> has to provide the whole vfio_device_ops, I don't think it's only a handful
> of lines of code for them.
> For example, in vfio_device_ops.open(), they at least have to hold
> &vdev->reflck->lock and call vfio_pci_enable() and
> vfio_spapr_pci_eeh_open(). Also, vdev is private to vfio_pci; do we
> really want to export this structure?
> The same applies to vfio_device_ops.ioctl(). If a vendor driver needs
> something slightly different from vfio_pci_ioctl(), e.g. to init a new
> region, it has to decode the region index and know the internals of
> vdev->region[i].
> When it comes to vfio_device_ops.remove(), in Yi's code it even has to
> free each lock and region... in vdev.

You make some good points, replacing vfio_device_ops altogether per
vendor module might be too simplistic.  However, we also can't create a
special case for vendor module handling on every interface.  For
example, why would vfio_pci_rw() test for and call out to
mediate_ops->rw() when we've already got per region rw() handlers via
vfio_pci_regops?  Seems we need to make use of vfio_pci_regops
ubiquitous for all regions and create an API for a vendor module to
register new regions with ops (ie. expose vfio_pci_register_dev_region)
and also manipulate the ops of existing regions.  When a vendor module
registers, it might just need to provide an open function callback and
an id table, and perhaps everything else is handled via registering new
regions and dynamically changing existing regions when we call the open
callback for a device.  Something about that series needs to change; I
can't handle the proposed mediated device ops being tested and called
everywhere.
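A rough userspace model of this idea, assuming a vendor module registers only an id table (reduced to one vendor/device pair here) plus an open callback, and that every region, default ones included, is reached through a per-region ops pointer. The open callback then adds new regions and swaps the ops of existing ones. All names (struct regops, add_region(), replace_region_ops(), device_open(), demo_vendor_open()) are hypothetical and are not the existing vfio_pci_regops interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REGIONS 8

struct vdev;

/* Hypothetical per-region ops: every region is accessed through one. */
struct regops {
	int (*rw)(struct vdev *vdev, unsigned int index);
};

/* Hypothetical, heavily simplified per-device state. */
struct vdev {
	unsigned short vendor, device;
	const struct regops *region_ops[MAX_REGIONS];
	unsigned int num_regions;
};

/* What a vendor module registers: an id (table) and an open callback. */
struct vendor_module {
	unsigned short vendor, device;
	int (*open)(struct vdev *vdev);
};

static int default_rw(struct vdev *vdev, unsigned int index)
{
	return (int)index;		/* pretend access to a default region */
}
static const struct regops default_regops = { default_rw };

/* Append a region; returns its index, or -1 if the table is full. */
int add_region(struct vdev *vdev, const struct regops *ops)
{
	if (vdev->num_regions == MAX_REGIONS)
		return -1;
	vdev->region_ops[vdev->num_regions] = ops;
	return (int)vdev->num_regions++;
}

/* Swap the ops of an existing region. */
int replace_region_ops(struct vdev *vdev, unsigned int index,
		       const struct regops *ops)
{
	if (index >= vdev->num_regions)
		return -1;
	vdev->region_ops[index] = ops;
	return 0;
}

/* Common open: three default regions, then let a matching module adjust them. */
int device_open(struct vdev *vdev, const struct vendor_module *mods, size_t nr)
{
	vdev->num_regions = 0;
	for (int i = 0; i < 3; i++)
		add_region(vdev, &default_regops);

	for (size_t i = 0; i < nr; i++)
		if (mods[i].vendor == vdev->vendor && mods[i].device == vdev->device)
			return mods[i].open(vdev);
	return 0;
}

/* Common read/write path: no vendor special-casing, just the region's ops. */
int device_rw(struct vdev *vdev, unsigned int index)
{
	if (index >= vdev->num_regions)
		return -1;
	assert(vdev->region_ops[index] != NULL);
	return vdev->region_ops[index]->rw(vdev, index);
}

/* Example vendor module: trap region 0 and add a migration region. */
static int trap_rw(struct vdev *vdev, unsigned int index)
{
	return 1000 + (int)index;
}
static int migration_rw(struct vdev *vdev, unsigned int index)
{
	return 2000 + (int)index;
}
static const struct regops trap_ops = { trap_rw };
static const struct regops migration_ops = { migration_rw };

static int demo_vendor_open(struct vdev *vdev)
{
	if (replace_region_ops(vdev, 0, &trap_ops))
		return -1;
	return add_region(vdev, &migration_ops) < 0 ? -1 : 0;
}

const struct vendor_module demo_modules[] = {
	{ 0x8086, 0x1234, demo_vendor_open },
};
```

With this shape the common read/write path never tests for a vendor module; it simply calls whatever ops the region currently has.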
 
> Besides that, one thing I don't understand is that Yi's sample code is
> an mdev driver, so rather than binding to vfio-pci, a PCI device would
> bind to Yi's driver directly. Then how would this approach of registering
> with vfio-pci work for him?

It wouldn't, I was trying to justify the code rework that Yi is trying
to do as also usable to these vfio-pci vendor extension modules.  You
may have poked a hole in that proposal though, which again puts in
doubt whether we should really pursue it for a sample driver.  Thanks,

Alex



* Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
  2020-01-23 23:33                     ` Alex Williamson
@ 2020-01-31  2:26                       ` Yan Zhao
  0 siblings, 0 replies; 44+ messages in thread
From: Yan Zhao @ 2020-01-31  2:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, kwankhede, linux-kernel, kvm, joro,
	peterx, baolu.lu, Masahiro Yamada

On Fri, Jan 24, 2020 at 07:33:22AM +0800, Alex Williamson wrote:
> On Tue, 21 Jan 2020 16:54:45 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Wed, Jan 22, 2020 at 04:04:38AM +0800, Alex Williamson wrote:
> > > On Tue, 21 Jan 2020 03:43:51 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Tue, Jan 21, 2020 at 03:43:02PM +0800, Tian, Kevin wrote:  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Tuesday, January 21, 2020 5:08 AM
> > > > > > 
> > > > > > On Sat, 18 Jan 2020 14:25:11 +0000
> > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > > >     
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Friday, January 17, 2020 5:24 AM
> > > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > > > >
> > > > > > > > On Thu, 16 Jan 2020 12:33:06 +0000
> > > > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > > > > >    
> > > > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > > > Sent: Friday, January 10, 2020 6:49 AM
> > > > > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > > > Subject: Re: [PATCH v4 11/12] samples: add vfio-mdev-pci driver
> > > > > > > > > >
> > > > > > > > > > On Tue,  7 Jan 2020 20:01:48 +0800
> > > > > > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > > > > > >    
> > > > > > > > > > > This patch adds a sample driver named vfio-mdev-pci. It wraps
> > > > > > > > > > > a PCI device as a mediated device. Once a PCI device is bound
> > > > > > > > > > > to the vfio-mdev-pci driver, user space access to the device will
> > > > > > > > > > > go through the vfio mdev framework. The device is managed with the
> > > > > > > > > > > mdev management method, e.g. the user should create an mdev before
> > > > > > > > > > > exposing the device to user space.
> > > > > > > > > > >
> > > > > > > > > > > This new driver can act as a sample driver for the recent changes
> > > > > > > > > > > from the "vfio/mdev: IOMMU aware mediated device" patchset. It could
> > > > > > > > > > > also be a good experimental driver for future device-specific mdev
> > > > > > > > > > > migration support. This sample driver only supports singleton IOMMU
> > > > > > > > > > > groups; it does not work for non-singleton IOMMU groups and will fail
> > > > > > > > > > > when trying to assign a non-singleton IOMMU group to a VM.
> > > > > > > > > > >
> > > > > > > > > > > To use this driver:
> > > > > > > > > > > a) build and load the vfio-mdev-pci.ko module
> > > > > > > > > > >    run "make menuconfig" and enable CONFIG_SAMPLE_VFIO_MDEV_PCI,
> > > > > > > > > > >    then load it with the following commands:
> > > > > > > > > > >    > sudo modprobe vfio
> > > > > > > > > > >    > sudo modprobe vfio-pci
> > > > > > > > > > >    > sudo insmod samples/vfio-mdev-pci/vfio-mdev-pci.ko    
> > > > > > > > > > >
> > > > > > > > > > > b) unbind the original device driver
> > > > > > > > > > >    e.g. use the following command to unbind its original driver
> > > > > > > > > > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind    
> > > > > > > > > > >
> > > > > > > > > > > c) bind the vfio-mdev-pci driver to the physical device
> > > > > > > > > > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> > > > > > > > > > >
> > > > > > > > > > > d) check the supported mdev instances    
> > > > > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/    
> > > > > > > > > > >      vfio-mdev-pci-type_name    
> > > > > > > > > > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\    
> > > > > > > > > > >      vfio-mdev-pci-type_name/
> > > > > > > > > > >      available_instances  create  device_api  devices  name
> > > > > > > > > > >
> > > > > > > > > > > e) create an mdev on this physical device (only 1 instance)
> > > > > > > > > > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \    
> > > > > > > > > > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > > > > > > > > > >      vfio-mdev-pci-type_name/create
> > > > > > > > > > >
> > > > > > > > > > > f) pass the mdev through to the guest
> > > > > > > > > > >    add the following to the QEMU command line
> > > > > > > > > > >     -device vfio-pci,\
> > > > > > > > > > >      sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > > > > > > > > > >
> > > > > > > > > > > g) destroy the mdev
> > > > > > > > > > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> > > > > > > > > > >      remove
> > > > > > > > > > >
> > > > > > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > > > > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > > > > > > > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > > > > > > > > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  samples/Kconfig                       |  10 +
> > > > > > > > > > >  samples/Makefile                      |   1 +
> > > > > > > > > > >  samples/vfio-mdev-pci/Makefile        |   4 +
> > > > > > > > > > >  samples/vfio-mdev-pci/vfio_mdev_pci.c | 397 ++++++++++++++++++++++++++++++++++
> > > > > > > > > > >  4 files changed, 412 insertions(+)
> > > > > > > > > > >  create mode 100644 samples/vfio-mdev-pci/Makefile
> > > > > > > > > > >  create mode 100644 samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/samples/Kconfig b/samples/Kconfig
> > > > > > > > > > > index 9d236c3..50d207c 100644
> > > > > > > > > > > --- a/samples/Kconfig
> > > > > > > > > > > +++ b/samples/Kconfig
> > > > > > > > > > > @@ -190,5 +190,15 @@ config SAMPLE_INTEL_MEI
> > > > > > > > > > >  	help
> > > > > > > > > > >  	  Build a sample program to work with mei device.
> > > > > > > > > > >
> > > > > > > > > > > +config SAMPLE_VFIO_MDEV_PCI
> > > > > > > > > > > +	tristate "Sample driver for wrapping PCI device as a mdev"
> > > > > > > > > > > +	select VFIO_PCI_COMMON
> > > > > > > > > > > +	select VFIO_PCI
> > > > > > > > > > > +	depends on VFIO_MDEV && VFIO_MDEV_DEVICE
> > > > > > > > > > > +	help
> > > > > > > > > > > +	  Sample driver for wrapping a PCI device as a mdev. Once bound to
> > > > > > > > > > > +	  this driver, device passthrough goes through the mdev path.
> > > > > > > > > > > +
> > > > > > > > > > > +	  If you don't know what to do here, say N.
> > > > > > > > > > >
> > > > > > > > > > >  endif # SAMPLES
> > > > > > > > > > > diff --git a/samples/Makefile b/samples/Makefile
> > > > > > > > > > > index 5ce50ef..84faced 100644
> > > > > > > > > > > --- a/samples/Makefile
> > > > > > > > > > > +++ b/samples/Makefile
> > > > > > > > > > > @@ -21,5 +21,6 @@ obj-$(CONFIG_SAMPLE_FTRACE_DIRECT)	+= ftrace/
> > > > > > > > > > >  obj-$(CONFIG_SAMPLE_TRACE_ARRAY)	+= ftrace/
> > > > > > > > > > >  obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
> > > > > > > > > > >  obj-y					+= vfio-mdev/
> > > > > > > > > > > +obj-y					+= vfio-mdev-pci/    
> > > > > > > > > >
> > > > > > > > > > I think we could just lump this into vfio-mdev rather than making
> > > > > > > > > > another directory.    
> > > > > > > > >
> > > > > > > > > Sure, will move it. :-)
> > > > > > > > >    
> > > > > > > > > >    
> > > > > > > > > > >  subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
> > > > > > > > > > >  obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
> > > > > > > > > > > diff --git a/samples/vfio-mdev-pci/Makefile b/samples/vfio-mdev-pci/Makefile
> > > > > > > > > > > new file mode 100644
> > > > > > > > > > > index 0000000..41b2139
> > > > > > > > > > > --- /dev/null
> > > > > > > > > > > +++ b/samples/vfio-mdev-pci/Makefile
> > > > > > > > > > > @@ -0,0 +1,4 @@
> > > > > > > > > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > > > > > > > > +vfio-mdev-pci-y := vfio_mdev_pci.o
> > > > > > > > > > > +
> > > > > > > > > > > +obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
> > > > > > > > > > > diff --git a/samples/vfio-mdev-pci/vfio_mdev_pci.c b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > > > new file mode 100644
> > > > > > > > > > > index 0000000..b180356
> > > > > > > > > > > --- /dev/null
> > > > > > > > > > > +++ b/samples/vfio-mdev-pci/vfio_mdev_pci.c
> > > > > > > > > > > @@ -0,0 +1,397 @@
> > > > > > > > > > > +/*
> > > > > > > > > > > + * Copyright © 2020 Intel Corporation.
> > > > > > > > > > > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > > > > + *
> > > > > > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > > > > > > + * published by the Free Software Foundation.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Derived from original vfio_pci.c:
> > > > > > > > > > > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > > > > > > > > > > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > > + *
> > > > > > > > > > > + * Derived from original vfio:
> > > > > > > > > > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > > > > > > > > > > + * Author: Tom Lyon, pugs@cisco.com
> > > > > > > > > > > + */
> > > > > > > > > > > +
> > > > > > > > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > > > > > > > > +
> > > > > > > > > > > +#include <linux/device.h>
> > > > > > > > > > > +#include <linux/eventfd.h>
> > > > > > > > > > > +#include <linux/file.h>
> > > > > > > > > > > +#include <linux/interrupt.h>
> > > > > > > > > > > +#include <linux/iommu.h>
> > > > > > > > > > > +#include <linux/module.h>
> > > > > > > > > > > +#include <linux/mutex.h>
> > > > > > > > > > > +#include <linux/notifier.h>
> > > > > > > > > > > +#include <linux/pci.h>
> > > > > > > > > > > +#include <linux/pm_runtime.h>
> > > > > > > > > > > +#include <linux/slab.h>
> > > > > > > > > > > +#include <linux/types.h>
> > > > > > > > > > > +#include <linux/uaccess.h>
> > > > > > > > > > > +#include <linux/vfio.h>
> > > > > > > > > > > +#include <linux/vgaarb.h>
> > > > > > > > > > > +#include <linux/nospec.h>
> > > > > > > > > > > +#include <linux/mdev.h>
> > > > > > > > > > > +#include <linux/vfio_pci_common.h>
> > > > > > > > > > > +
> > > > > > > > > > > +#define DRIVER_VERSION  "0.1"
> > > > > > > > > > > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > > > > > > > > > > +#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
> > > > > > > > > > > +
> > > > > > > > > > > +#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
> > > > > > > > > > > +
> > > > > > > > > > > +static char ids[1024] __initdata;
> > > > > > > > > > > +module_param_string(ids, ids, sizeof(ids), 0);
> > > > > > > > > > > +MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
> > > > > > > > > > > +
> > > > > > > > > > > +static bool nointxmask;
> > > > > > > > > > > +module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
> > > > > > > > > > > +MODULE_PARM_DESC(nointxmask,
> > > > > > > > > > > +		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
> > > > > > > > > > > +
> > > > > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > > > > +static bool disable_vga;
> > > > > > > > > > > +module_param(disable_vga, bool, S_IRUGO);
> > > > > > > > > > > +MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
> > > > > > > > > > > +#endif
> > > > > > > > > > > +
> > > > > > > > > > > +static bool disable_idle_d3;
> > > > > > > > > > > +module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > > > > > > > +MODULE_PARM_DESC(disable_idle_d3,
> > > > > > > > > > > +		 "Disable using the PCI D3 low power state for idle, unused devices");
> > > > > > > > > > > +
> > > > > > > > > > > +static struct pci_driver vfio_mdev_pci_driver;
> > > > > > > > > > > +
> > > > > > > > > > > +static ssize_t
> > > > > > > > > > > +name_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > > > > +{
> > > > > > > > > > > +	return sprintf(buf, "%s-type1\n", dev_name(dev));
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +MDEV_TYPE_ATTR_RO(name);
> > > > > > > > > > > +
> > > > > > > > > > > +static ssize_t
> > > > > > > > > > > +available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
> > > > > > > > > > > +{
> > > > > > > > > > > +	return sprintf(buf, "%d\n", 1);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +MDEV_TYPE_ATTR_RO(available_instances);
> > > > > > > > > > > +
> > > > > > > > > > > +static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
> > > > > > > > > > > +		char *buf)
> > > > > > > > > > > +{
> > > > > > > > > > > +	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +MDEV_TYPE_ATTR_RO(device_api);
> > > > > > > > > > > +
> > > > > > > > > > > +static struct attribute *vfio_mdev_pci_types_attrs[] = {
> > > > > > > > > > > +	&mdev_type_attr_name.attr,
> > > > > > > > > > > +	&mdev_type_attr_device_api.attr,
> > > > > > > > > > > +	&mdev_type_attr_available_instances.attr,
> > > > > > > > > > > +	NULL,
> > > > > > > > > > > +};
> > > > > > > > > > > +
> > > > > > > > > > > +static struct attribute_group vfio_mdev_pci_type_group1 = {
> > > > > > > > > > > +	.name  = "type1",
> > > > > > > > > > > +	.attrs = vfio_mdev_pci_types_attrs,
> > > > > > > > > > > +};
> > > > > > > > > > > +
> > > > > > > > > > > +struct attribute_group *vfio_mdev_pci_type_groups[] = {
> > > > > > > > > > > +	&vfio_mdev_pci_type_group1,
> > > > > > > > > > > +	NULL,
> > > > > > > > > > > +};
> > > > > > > > > > > +
> > > > > > > > > > > +struct vfio_mdev_pci {
> > > > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > > > +	struct mdev_device *mdev;
> > > > > > > > > > > +	unsigned long handle;
> > > > > > > > > > > +};
> > > > > > > > > > > +
> > > > > > > > > > > +static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct device *pdev;
> > > > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev;
> > > > > > > > > > > +	int ret;
> > > > > > > > > > > +
> > > > > > > > > > > +	pdev = mdev_parent_dev(mdev);
> > > > > > > > > > > +	vdev = dev_get_drvdata(pdev);
> > > > > > > > > > > +	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> > > > > > > > > > > +	if (pmdev == NULL) {
> > > > > > > > > > > +		ret = -ENOMEM;
> > > > > > > > > > > +		goto out;
> > > > > > > > > > > +	}
> > > > > > > > > > > +
> > > > > > > > > > > +	pmdev->mdev = mdev;
> > > > > > > > > > > +	pmdev->vdev = vdev;
> > > > > > > > > > > +	mdev_set_drvdata(mdev, pmdev);
> > > > > > > > > > > +	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
> > > > > > > > > > > +	if (ret) {
> > > > > > > > > > > +		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
> > > > > > > > > > > +			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> > > > > > > > > > > +		goto out;
> > > > > > > > > > > +	}
> > > > > > > > > > > +
> > > > > > > > > > > +	pr_info("%s, creation succeeded for mdev: %s\n", __func__,
> > > > > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > > > > +out:
> > > > > > > > > > > +	return ret;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > > +
> > > > > > > > > > > +	kfree(pmdev);
> > > > > > > > > > > +	pr_info("%s, succeeded for mdev: %s\n", __func__,
> > > > > > > > > > > +		     dev_name(mdev_dev(mdev)));
> > > > > > > > > > > +
> > > > > > > > > > > +	return 0;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static int vfio_mdev_pci_open(struct mdev_device *mdev)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > > > > +	int ret = 0;
> > > > > > > > > > > +
> > > > > > > > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > > > > > > > +		return -ENODEV;
> > > > > > > > > > > +
> > > > > > > > > > > +	vfio_pci_refresh_config(vdev, nointxmask, disable_idle_d3);
> > > > > > > > > > > +
> > > > > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > > > > +
> > > > > > > > > > > +	if (!vdev->refcnt) {
> > > > > > > > > > > +		ret = vfio_pci_enable(vdev);
> > > > > > > > > > > +		if (ret)
> > > > > > > > > > > +			goto error;
> > > > > > > > > > > +
> > > > > > > > > > > +		vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > > > > > > > +	}
> > > > > > > > > > > +	vdev->refcnt++;
> > > > > > > > > > > +error:
> > > > > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > > > > +	if (!ret)
> > > > > > > > > > > +		pr_info("Succeeded to open mdev: %s on pf: %s\n",
> > > > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > > > > +	else {
> > > > > > > > > > > +		pr_info("Failed to open mdev: %s on pf: %s\n",
> > > > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > > > > +		module_put(THIS_MODULE);
> > > > > > > > > > > +	}
> > > > > > > > > > > +	return ret;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static void vfio_mdev_pci_release(struct mdev_device *mdev)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > > +	struct vfio_pci_device *vdev = pmdev->vdev;
> > > > > > > > > > > +
> > > > > > > > > > > +	pr_info("Release mdev: %s on pf: %s\n",
> > > > > > > > > > > +		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
> > > > > > > > > > > +
> > > > > > > > > > > +	mutex_lock(&vdev->reflck->lock);
> > > > > > > > > > > +
> > > > > > > > > > > +	if (!(--vdev->refcnt)) {
> > > > > > > > > > > +		vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > > > > > > > +		vfio_pci_disable(vdev);
> > > > > > > > > > > +	}
> > > > > > > > > > > +
> > > > > > > > > > > +	mutex_unlock(&vdev->reflck->lock);
> > > > > > > > > > > +
> > > > > > > > > > > +	module_put(THIS_MODULE);
> > > > > > > > > > > +}    
> > > > > > > > > >
> > > > > > > > > > open() and release() here are almost identical between vfio_pci and
> > > > > > > > > > vfio_mdev_pci, which suggests maybe there should be common functions to
> > > > > > > > > > call into like we do for the below.    
> > > > > > > > >
> > > > > > > > > Yes, let me study this more and do a better abstraction in the next version. :-)
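One possible shape for such a shared open/release path, sketched as single-threaded userspace C rather than kernel code. The first opener enables the device and the last closer disables it, with a boolean standing in for vdev->reflck->lock. The struct and function names and the enable/disable hooks are assumptions for illustration; in the driver the hooks would wrap vfio_pci_enable()/vfio_spapr_pci_eeh_open() and vfio_spapr_pci_eeh_release()/vfio_pci_disable(), and each leaf driver would still take and drop its own module reference.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct vfio_pci_device. */
struct pci_vdev {
	int refcnt;
	bool locked;			/* models vdev->reflck->lock */
	bool enabled;
	int enable_calls;
	int (*enable)(struct pci_vdev *vdev);	/* e.g. wraps vfio_pci_enable() */
	void (*disable)(struct pci_vdev *vdev);	/* e.g. wraps vfio_pci_disable() */
};

/* Single-threaded lock model: asserts catch unbalanced lock/unlock. */
static void dev_lock(struct pci_vdev *vdev)
{
	assert(!vdev->locked);
	vdev->locked = true;
}

static void dev_unlock(struct pci_vdev *vdev)
{
	assert(vdev->locked);
	vdev->locked = false;
}

/* Shared open path: only the first opener enables the device. */
int common_open(struct pci_vdev *vdev)
{
	int ret = 0;

	dev_lock(vdev);
	if (!vdev->refcnt) {
		ret = vdev->enable(vdev);
		if (ret)
			goto out;
	}
	vdev->refcnt++;
out:
	dev_unlock(vdev);
	return ret;
}

/* Shared release path: only the last closer disables the device. */
void common_release(struct pci_vdev *vdev)
{
	dev_lock(vdev);
	assert(vdev->refcnt > 0);
	if (!--vdev->refcnt)
		vdev->disable(vdev);
	dev_unlock(vdev);
}

/* Example hooks used to exercise the helpers. */
static int demo_enable(struct pci_vdev *vdev)
{
	vdev->enable_calls++;
	vdev->enabled = true;
	return 0;
}

static void demo_disable(struct pci_vdev *vdev)
{
	vdev->enabled = false;
}

static int failing_enable(struct pci_vdev *vdev)
{
	(void)vdev;
	return -EIO;
}
```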
> > > > > > > > >    
> > > > > > > > > > > +static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > > > > > > > > > > +			     unsigned long arg)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > > +
> > > > > > > > > > > +	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
> > > > > > > > > > > +				struct vm_area_struct *vma)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > > +
> > > > > > > > > > > +	return vfio_pci_mmap(pmdev->vdev, vma);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
> > > > > > > > > > > +			size_t count, loff_t *ppos)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > > +
> > > > > > > > > > > +	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
> > > > > > > > > > > +				const char __user *buf,
> > > > > > > > > > > +				size_t count, loff_t *ppos)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> > > > > > > > > > > +
> > > > > > > > > > > +	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> > > > > > > > > > > +	.supported_type_groups	= vfio_mdev_pci_type_groups,
> > > > > > > > > > > +	.create			= vfio_mdev_pci_create,
> > > > > > > > > > > +	.remove			= vfio_mdev_pci_remove,
> > > > > > > > > > > +
> > > > > > > > > > > +	.open			= vfio_mdev_pci_open,
> > > > > > > > > > > +	.release		= vfio_mdev_pci_release,
> > > > > > > > > > > +
> > > > > > > > > > > +	.read			= vfio_mdev_pci_read,
> > > > > > > > > > > +	.write			= vfio_mdev_pci_write,
> > > > > > > > > > > +	.mmap			= vfio_mdev_pci_mmap,
> > > > > > > > > > > +	.ioctl			= vfio_mdev_pci_ioctl,
> > > > > > > > > > > +};
> > > > > > > > > > > +
> > > > > > > > > > > +static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > > > > > > > > +				       const struct pci_device_id *id)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct vfio_pci_device *vdev;
> > > > > > > > > > > +	int ret;
> > > > > > > > > > > +
> > > > > > > > > > > +	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> > > > > > > > > > > +		return -EINVAL;
> > > > > > > > > > > +
> > > > > > > > > > > +	/*
> > > > > > > > > > > +	 * Prevent binding to PFs with VFs enabled, this too easily allows
> > > > > > > > > > > +	 * userspace instance with VFs and PFs from the same device, which
> > > > > > > > > > > +	 * cannot work.  Disabling SR-IOV here would initiate removing the
> > > > > > > > > > > +	 * VFs, which would unbind the driver, which is prone to blocking
> > > > > > > > > > > +	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
> > > > > > > > > > > +	 * reject these PFs and let the user sort it out.
> > > > > > > > > > > +	 */
> > > > > > > > > > > +	if (pci_num_vf(pdev)) {
> > > > > > > > > > > +		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
> > > > > > > > > > > +		return -EBUSY;
> > > > > > > > > > > +	}
> > > > > > > > > > > +
> > > > > > > > > > > +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > > > > > > > > +	if (!vdev)
> > > > > > > > > > > +		return -ENOMEM;
> > > > > > > > > > > +
> > > > > > > > > > > +	vdev->pdev = pdev;
> > > > > > > > > > > +	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > > > > > > > > +	mutex_init(&vdev->igate);
> > > > > > > > > > > +	spin_lock_init(&vdev->irqlock);
> > > > > > > > > > > +	mutex_init(&vdev->ioeventfds_lock);
> > > > > > > > > > > +	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > > > > > > > > +	vdev->nointxmask = nointxmask;
> > > > > > > > > > > +#ifdef CONFIG_VFIO_PCI_VGA
> > > > > > > > > > > +	vdev->disable_vga = disable_vga;
> > > > > > > > > > > +#endif
> > > > > > > > > > > +	vdev->disable_idle_d3 = disable_idle_d3;
> > > > > > > > > > > +
> > > > > > > > > > > +	pci_set_drvdata(pdev, vdev);
> > > > > > > > > > > +
> > > > > > > > > > > +	ret = vfio_pci_reflck_attach(vdev);
> > > > > > > > > > > +	if (ret) {
> > > > > > > > > > > +		pci_set_drvdata(pdev, NULL);
> > > > > > > > > > > +		kfree(vdev);
> > > > > > > > > > > +		return ret;
> > > > > > > > > > > +	}
> > > > > > > > > > > +
> > > > > > > > > > > +	if (vfio_pci_is_vga(pdev)) {
> > > > > > > > > > > +		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> > > > > > > > > > > +		vga_set_legacy_decoding(pdev,
> > > > > > > > > > > +					vfio_pci_set_vga_decode(vdev, false));
> > > > > > > > > > > +	}
> > > > > > > > > > > +
> > > > > > > > > > > +	vfio_pci_probe_power_state(vdev);
> > > > > > > > > > > +
> > > > > > > > > > > +	if (!vdev->disable_idle_d3) {
> > > > > > > > > > > +		/*
> > > > > > > > > > > +		 * pci-core sets the device power state to an unknown value at
> > > > > > > > > > > +		 * bootup and after being removed from a driver.  The only
> > > > > > > > > > > +		 * transition it allows from this unknown state is to D0, which
> > > > > > > > > > > +		 * typically happens when a driver calls pci_enable_device().
> > > > > > > > > > > +		 * We're not ready to enable the device yet, but we do want to
> > > > > > > > > > > +		 * be able to get to D3.  Therefore first do a D0 transition
> > > > > > > > > > > +		 * before going to D3.
> > > > > > > > > > > +		 */
> > > > > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > > > > > > > > +		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > > > > > > > > +	}    
> > > > > > > > > >
> > > > > > > > > > Ditto here and remove below, this seems like boilerplate that shouldn't
> > > > > > > > > > be duplicated per leaf module.  Thanks,
> > > > > > > > >
> > > > > > > > > Sure, the code snippet above may also be abstracted to be a common API
> > > > > > > > > provided by vfio-pci-common.ko. :-)
> > > > > > > > >
> > > > > > > > > I have a question which I need to confirm with you. Do you also want the
> > > > > > > > > below code snippet to be placed in vfio-pci-common.ko and exposed
> > > > > > > > > as a wrapped API? Then it could be used by the sample driver and other future
> > > > > > > > > drivers which want to wrap a PCI device as an mdev. Maybe I misunderstood
> > > > > > > > > your comment. :-(
> > > > > > > >
> > > > > > > >
> > > > > > > > I think some sort of vfio_pci_common_{probe,remove}() would be a
> > > > > > > > reasonable starting point where the respective module _{probe,remove}
> > > > > > > > functions would call into these and add their module specific code
> > > > > > > > around it.  That would at least give us a point to cleanup things that
> > > > > > > > are only used by the common code in the common code.    
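A minimal userspace sketch of the vfio_pci_common_{probe,remove}() split described above, showing only the calling order. The names struct vdev_sketch, leaf_probe() and leaf_remove() are hypothetical, and the quoted probe boilerplate (kzalloc, lock init, reflck attach, power-state setup) is reduced to a single flag; this is not the kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_pci_device; the real one has many more fields. */
struct vdev_sketch {
	int initialized;   /* set by the common probe (locks, lists, power state) */
	int leaf_ready;    /* set by the leaf module's own probe step */
};

/* Hypothetical common probe: the boilerplate every leaf module would share. */
static struct vdev_sketch *vfio_pci_common_probe(void)
{
	struct vdev_sketch *vdev = calloc(1, sizeof(*vdev));

	if (!vdev)
		return NULL;
	vdev->initialized = 1;   /* stands in for mutex_init(), reflck attach, D3 entry */
	return vdev;
}

/* Hypothetical common remove: undoes exactly what the common probe did. */
static void vfio_pci_common_remove(struct vdev_sketch *vdev)
{
	free(vdev);
}

/* Leaf module probe (vfio-pci or vfio-mdev-pci) wrapped around the common call.
 * leaf_err simulates the leaf-specific step failing (0 = success). */
static int leaf_probe(struct vdev_sketch **out, int leaf_err)
{
	struct vdev_sketch *vdev = vfio_pci_common_probe();

	if (!vdev)
		return -ENOMEM;
	if (leaf_err) {
		vfio_pci_common_remove(vdev);   /* unwind the common part on failure */
		return leaf_err;
	}
	vdev->leaf_ready = 1;
	*out = vdev;
	return 0;
}

static void leaf_remove(struct vdev_sketch *vdev)
{
	vdev->leaf_ready = 0;            /* leaf-specific teardown first */
	vfio_pci_common_remove(vdev);    /* then the shared cleanup */
}
```

The point of the shape is that cleanup for state only the common code uses stays inside vfio_pci_common_remove(), so leaf modules never touch it.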
> > > > > > >
> > > > > > > Sure, I can start from here if we are still going in this direction. :-)
> > > > > > >    
> > > > > > > > I'm still struggling with how we make this user consumable should we accept
> > > > > > > > this and progress beyond a proof of concept sample driver though.  For
> > > > > > > > example, if a vendor actually implements an mdev wrapper driver or even
> > > > > > > > just a device specific vfio-pci wrapper, to enable for example
> > > > > > > > migration support, how does a user know which driver to use for each
> > > > > > > > particular feature?  The best I can come up with so far is something
> > > > > > > > like was done for vfio-platform reset modules.  For instance a module
> > > > > > > > that extends features for a given device in vfio-pci might register an
> > > > > > > > ops structure and id table with vfio-pci, along with creating a module
> > > > > > > > alias (or aliases) for the devices it supports.  When a device is
> > > > > > > > probed by vfio-pci it could try to match against registered id tables
> > > > > > > > to find a device specific ops structure, if one is not found it could
> > > > > > > > do a request_module using the PCI vendor and device IDs and some unique
> > > > > > > > vfio-pci string, check again, and use the default ops if device
> > > > > > > > specific ops are still not present.  That would solve the problem on
> > > > > > > > the vfio-pci side.    
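A hedged userspace model of the lookup order proposed above: match registered id tables first, then try loading a module by alias, check again, and fall back to the default ops. register_dev_ops(), find_ops() and the loader callback standing in for request_module() are hypothetical, and the "vfio-pci-%04x-%04x" alias format is only an assumption (any string unique to vfio-pci would do).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical ops structure a device-specific module would provide. */
struct dev_ops_sketch {
	const char *name;
};

struct id_entry {
	unsigned short vendor, device;
	const struct dev_ops_sketch *ops;
};

static const struct dev_ops_sketch default_ops = { "vfio-pci default" };

#define MAX_ENTRIES 8
static struct id_entry registry[MAX_ENTRIES];
static int nr_entries;

/* Called by a device-specific module when it loads (e.g. from its module_init). */
static int register_dev_ops(unsigned short vendor, unsigned short device,
			    const struct dev_ops_sketch *ops)
{
	if (nr_entries == MAX_ENTRIES)
		return -1;
	registry[nr_entries++] = (struct id_entry){ vendor, device, ops };
	return 0;
}

static const struct dev_ops_sketch *match(unsigned short vendor, unsigned short device)
{
	for (int i = 0; i < nr_entries; i++)
		if (registry[i].vendor == vendor && registry[i].device == device)
			return registry[i].ops;
	return NULL;
}

/* Stand-in for request_module(): the loader may register ops for the alias. */
typedef void (*loader_fn)(const char *alias);

static const struct dev_ops_sketch *find_ops(unsigned short vendor, unsigned short device,
					     loader_fn load)
{
	const struct dev_ops_sketch *ops = match(vendor, device);
	char alias[32];

	if (ops)
		return ops;
	snprintf(alias, sizeof(alias), "vfio-pci-%04x-%04x", vendor, device);
	if (load)
		load(alias);
	ops = match(vendor, device);   /* check again after the load attempt */
	return ops ? ops : &default_ops;
}

/* Demo loader: pretends only a module aliased "vfio-pci-8086-1234" exists. */
static const struct dev_ops_sketch demo_vendor_ops = { "demo vendor" };

static void demo_loader(const char *alias)
{
	if (strcmp(alias, "vfio-pci-8086-1234") == 0)
		register_dev_ops(0x8086, 0x1234, &demo_vendor_ops);
}
```

Once a module has registered, later probes hit the first match() and never trigger another load attempt.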
> > > > > > >
> > > > > > > Yeah, this lets vfio-pci invoke the ops from vendor drivers/modules.
> > > > > > > I think this is what Yan is trying to do.    
> > > > > > 
> > > > > > I think I'm suggesting a callback ops structure a level above what Yan
> > > > > > previously proposed.  For example, could we have device specific
> > > > > > vfio_device_ops where the vendor module can call out to common code
> > > > > > rather than requiring common code to test for and optionally call out
> > > > > > to device specific code.
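A sketch, under assumptions, of a device-specific ops structure that calls out to common code rather than the other way around. struct ops_sketch mirrors only the open and ioctl members of vfio_device_ops in simplified userspace form; common_open(), common_ioctl() and VENDOR_CMD_MIGRATION are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Simplified mirror of two vfio_device_ops callbacks; the real struct has more. */
struct ops_sketch {
	int (*open)(void *device_data);
	long (*ioctl)(void *device_data, unsigned int cmd, unsigned long arg);
};

struct dev_state {
	int enabled;
	int vendor_init;
};

/* Hypothetical common implementations exported by a vfio-pci-common module. */
static int common_open(void *data)
{
	((struct dev_state *)data)->enabled = 1;   /* stands in for vfio_pci_enable() etc. */
	return 0;
}

static long common_ioctl(void *data, unsigned int cmd, unsigned long arg)
{
	(void)data;
	(void)arg;
	return cmd == 1 ? 0 : -EINVAL;   /* pretend only cmd 1 is a known command */
}

#define VENDOR_CMD_MIGRATION 100u    /* hypothetical vendor-specific command */

/* Vendor module: reuses the common code and adds only its own logic. */
static int vendor_open(void *data)
{
	int ret = common_open(data);

	if (ret)
		return ret;
	((struct dev_state *)data)->vendor_init = 1;
	return 0;
}

static long vendor_ioctl(void *data, unsigned int cmd, unsigned long arg)
{
	if (cmd == VENDOR_CMD_MIGRATION)
		return 0;                        /* device-specific handling */
	return common_ioctl(data, cmd, arg);     /* everything else falls through */
}

static const struct ops_sketch vendor_ops = {
	.open  = vendor_open,
	.ioctl = vendor_ioctl,
};
```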
> > > > > >     
> > > > > > > > For mdevs, I tend to assume that this vfio-mdev-pci
> > > > > > > > meta driver is an anomaly only for the purpose of creating a generic
> > > > > > > > test device for IOMMU backed mdevs and that "real" mdev vendor
> > > > > > > > drivers will just be mdev enlightened host drivers, like i915 and
> > > > > > > > nvidia are now.  Thanks,    
> > > > > > >
> > > > > > > yes, this vfio-mdev-pci meta driver is just creating a test device.
> > > > > > > Should we still go with the current direction, or find some other way
> > > > > > > that may be easier for adding this meta driver?
> > > > > > 
> > > > > > I think if the code split allows us to create an environment where
> > > > > > vendor drivers can re-use much of vfio-pci while creating a
> > > > > > vfio_device_ops that supports additional features for their device and
> > > > > > we bring that all together with a request module interface and module
> > > > > > aliases to make that work seamlessly, then it has value.  A concern I
> > > > > > have in only doing this split in order to create the vfio-mdev-pci
> > > > > > module is that it leaves open the question and groundwork for forking
> > > > > > vfio-pci into multiple vendor specific modules that would become a mess
> > > > > > for users to manage.
> > > > > >     
> > > > > > > Compared with the "real" mdev vendor drivers, it is like a
> > > > > > > "vfio-pci + dummy mdev ops" driver. "Dummy mdev ops" means no
> > > > > > > vendor specific handling, just passthrough to the vfio-pci code directly.
> > > > > > >
> > > > > > > I think this meta driver is even lighter than the "real" mdev vendor
> > > > > > > drivers, right? Is it possible for this driver to follow the approach of
> > > > > > > registering an ops structure and id table with vfio-pci? The obstacle
> > > > > > > I can see is the meta driver is a generic driver, which means it has
> > > > > > > no id table... For the "real" mdev vendor drivers, they naturally have
> > > > > > > such info. If vfio-mdev-pci can also get the id info without binding
> > > > > > > to a device, it may be possible. thoughts? :-)    
> > > > > > 
> > > > > > IDs could be provided via a module option or potentially with
> > > > > > build-time options.  That might allow us to test all aspects of the
> > > > > > above proposal, ie. allowing sub-modules to provide vfio_device_ops for
> > > > > > specific devices, allowing those vendor vfio_device_ops to re-use much
> > > > > > of the existing vfio-pci code in that implementation, and a mechanism
> > > > > > for generically testing IOMMU backed mdevs.  That's starting to sound a
> > > > > > lot more worthwhile than moving a bunch of code around only to
> > > > > > implement a sample driver for the latter.  Thoughts?  Thanks,
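vfio-pci already has an ids= module parameter taking comma-separated "vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]" entries. A hypothetical parser for a simpler vendor:device-only option, which a test driver like vfio-mdev-pci might accept, could look like this; the option and parse_ids() are assumptions, not part of the patchset.

```c
#include <assert.h>
#include <stdio.h>

struct pci_id_sketch {
	unsigned int vendor, device;
};

/* Parse a comma-separated "vendor:device" list such as "8086:1234,10de:0001"
 * into at most max entries. Returns the count parsed, or -1 on a malformed entry. */
static int parse_ids(const char *param, struct pci_id_sketch *ids, int max)
{
	const char *p = param;
	int n = 0;

	while (*p) {
		unsigned int v, d;
		int used = 0;

		/* %4x limits each field to 16 bits; %n records how far we got */
		if (n == max || sscanf(p, "%4x:%4x%n", &v, &d, &used) != 2)
			return -1;
		ids[n++] = (struct pci_id_sketch){ v, d };
		p += used;
		if (*p == ',')
			p++;
		else if (*p != '\0')
			return -1;   /* trailing junk after an entry */
	}
	return n;
}
```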
> > > > > >     
> > > > > 
> > > > > Sounds like a good idea. If feasible, I suppose Yan's mediate_ops series
> > > > > can also be largely avoided. The vendor driver can directly register its
> > > > > own vfio_device_ops and selectively introduce proprietary logic
> > > > > (e.g. for tracking dirty pages) on top of the generic vfio_pci code.
> > > > 
> > > > hi Alex
> > > > as we previously discussed, I'm preparing to implement my v2 this way:
> > > > 
> > > > 1. When vfio-pci binds to a device, it will modprobe modules with alias
> > > > "vfio-pci-(vendorid)-(deviceid)", as a way to notify vendor drivers to
> > > > register their vendor ops. (I renamed mediate_ops to vendor_ops in
> > > > v2)
> > > > 2. A module with alias "vfio-pci-(vendor_id)-(device_id)" will, in its
> > > > module_init, register its vendor ops with vfio-pci.
> > > > If two modules share the same alias and both register vendor ops at
> > > > the same time, they are chained according to the prio in their vendor ops.
> > > > 3. In vfio_pci_open, vfio-pci would ask all vendor ops of a vdev for
> > > > their region_infos and init regions for the vendor drivers. Current code
> > > > in vfio_pci_igd.c and vfio_pci_nvlink2.c would all be wrapped into
> > > > separate modules, so the current vfio_pci_register_dev_region() would be
> > > > removed accordingly. vfio_pci_rw would now be directed to
> > > > vendor_ops->region[i].rw, and the higher-priority module's ops win.
> > > > For example, module vfio_pci_igd may register regions of index 10,
> > > > 11 and 12 for its opregion and two cfg regions. A vendor driver can
> > > > still provide a module named i915_migration to register regions of
> > > > index 0 and 13 for BAR0 and migration.
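A rough userspace model of the priority chaining in steps 2 and 3, assuming each vendor ops structure carries a prio and a bitmask of claimed region indices. The priority values, the bitmask representation and region_owner() are assumptions; the region indices follow the vfio_pci_igd / i915_migration example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each vendor module claims region indices and has a priority. */
struct vendor_ops_sketch {
	const char *name;
	int prio;
	unsigned long regions;   /* bit i set => claims region index i (i < bits in long) */
};

#define MAX_VENDOR_OPS 4
static const struct vendor_ops_sketch *chain[MAX_VENDOR_OPS];
static int nr_ops;

static int register_vendor_ops(const struct vendor_ops_sketch *ops)
{
	if (nr_ops == MAX_VENDOR_OPS)
		return -1;
	chain[nr_ops++] = ops;
	return 0;
}

/* Owner of region `index`, or NULL meaning "use the default vfio-pci handler". */
static const struct vendor_ops_sketch *region_owner(unsigned int index)
{
	const struct vendor_ops_sketch *best = NULL;

	for (int i = 0; i < nr_ops; i++) {
		if (!(chain[i]->regions & (1UL << index)))
			continue;
		if (!best || chain[i]->prio > best->prio)
			best = chain[i];   /* higher priority wins */
	}
	return best;
}

/* igd: opregion plus two cfg regions; migration module: BAR0 and region 13. */
static const struct vendor_ops_sketch igd_ops = {
	"vfio_pci_igd", 1, (1UL << 10) | (1UL << 11) | (1UL << 12)
};
static const struct vendor_ops_sketch mig_ops = {
	"i915_migration", 2, (1UL << 0) | (1UL << 13)
};
```

Even in this small model, answering "who handles region N" requires walking the whole chain, which hints at the debugging concern raised in the reply below.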
> > > 
> > > My major complaint with the previous version was that sprinkling random
> > > vendor ops call-outs everywhere in vfio-pci is ugly and hard to
> > > maintain.  The idea I'm proposing here is that sub-modules (loaded via
> > > alias) would provide the entire vfio_device_ops for a device.  Yi's
> > > series here would split out common code to make it trivial for vendor
> > > modules to implement those device ops using pieces of vfio-pci if they
> > > wish to do so.  Having multiple modules implement features of a device
> > > based on their loading priority sounds powerful, but also difficult to
> > > maintain and debug.  Do we need that functionality if a vendor
> > > vfio_device_ops can implement it themselves in a handful of lines of
> > > code?  Thanks,  
> > 
> > The main purpose of providing multiple modules is to let each module
> > focus on implementing the regions it is interested in. If a vendor module
> > has to provide the whole vfio_device_ops, I don't think it's only a
> > handful of lines of code for them.
> > For example, in vfio_device_ops.open(), they at least have to hold
> > &vdev->reflck->lock and call vfio_pci_enable and
> > vfio_spapr_pci_eeh_open. Also, vdev is private inside vfio_pci; do we
> > really want to export this structure?
> > The same goes for vfio_device_ops.ioctl(). If a vendor driver has to
> > implement something a little different from vfio_pci_ioctl(), e.g. init
> > a new region, it has to decode the region index and know the internals
> > of vdev->region[i].
> > When it comes to vfio_device_ops.remove(), in Yi's code, it even has to
> > free each lock and region... in vdev.
> 
> You make some good points, replacing vfio_device_ops altogether per
> vendor module might be too simplistic.  However, we also can't create a
> special case for vendor module handling on every interface.  For
> example, why would vfio_pci_rw() test for and call out to
> mediate_ops->rw() when we've already got per region rw() handlers via
> vfio_pci_regops?  Seems we need to make use of vfio_pci_regops
> ubiquitous for all regions and create an API for a vendor module to
> register new regions with ops (i.e. expose vfio_pci_register_dev_region)
> and also manipulate the ops of existing regions.  When a vendor module
> registers, it might just need to provide an open function callback and
> an id table, and perhaps everything else is handled via registering new
> regions and dynamically changing existing regions when we call the open
> callback for a device.  Something about that series needs to change; I
> can't handle the proposed mediated device ops being tested and called
> everywhere.
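A minimal sketch of the regops-everywhere idea, assuming simplified stand-ins (struct regops_sketch, struct region_sketch) for vfio_pci_regops and the per-device region array. dev_register_region() is a hypothetical analogue of exposing vfio_pci_register_dev_region(), and vendor_open() both replaces an existing region's ops and adds a new region, so dev_rw() never needs a vendor special case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct region_sketch;

/* Simplified analogue of struct vfio_pci_regops: one rw handler per region. */
struct regops_sketch {
	long (*rw)(struct region_sketch *r, char *buf, size_t count, int iswrite);
};

struct region_sketch {
	const struct regops_sketch *ops;
	int vendor_hits;   /* demo counter: accesses seen by the vendor handler */
};

#define NUM_REGIONS 8
struct dev_sketch {
	struct region_sketch regions[NUM_REGIONS];
	int nr_regions;
};

/* Generic rw path: no vendor special case, just dispatch through the region's ops. */
static long dev_rw(struct dev_sketch *d, int index, char *buf, size_t count, int iswrite)
{
	if (index < 0 || index >= d->nr_regions || !d->regions[index].ops)
		return -EINVAL;
	return d->regions[index].ops->rw(&d->regions[index], buf, count, iswrite);
}

/* Hypothetical registration API: append a region with its ops, return its index. */
static int dev_register_region(struct dev_sketch *d, const struct regops_sketch *ops)
{
	if (d->nr_regions == NUM_REGIONS)
		return -ENOSPC;
	d->regions[d->nr_regions].ops = ops;
	d->regions[d->nr_regions].vendor_hits = 0;
	return d->nr_regions++;
}

static long default_rw(struct region_sketch *r, char *buf, size_t count, int iswrite)
{
	(void)r;
	(void)buf;
	(void)iswrite;
	return (long)count;   /* pretend the whole access succeeded */
}

static long vendor_bar0_rw(struct region_sketch *r, char *buf, size_t count, int iswrite)
{
	r->vendor_hits++;     /* e.g. where dirty-page tracking could hook in */
	return default_rw(r, buf, count, iswrite);
}

static const struct regops_sketch default_regops = { default_rw };
static const struct regops_sketch vendor_bar0_regops = { vendor_bar0_rw };

/* Base device: every region, including BARs, goes through regops. */
static void dev_init(struct dev_sketch *d, int nr_default)
{
	d->nr_regions = 0;
	for (int i = 0; i < nr_default; i++)
		dev_register_region(d, &default_regops);
}

/* Vendor open callback: swap BAR0's ops and add a new (e.g. migration) region. */
static int vendor_open(struct dev_sketch *d)
{
	d->regions[0].ops = &vendor_bar0_regops;
	return dev_register_region(d, &default_regops);
}
```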
>  
> > Besides that, one thing I don't understand is that Yi's sample code is
> > an mdev driver, so rather than binding to vfio-pci, a PCI device would
> > bind to Yi's driver directly. How, then, would this way of registering
> > with vfio-pci work for him?
> 
> It wouldn't, I was trying to justify the code rework that Yi is trying
> to do as also usable to these vfio-pci vendor extension modules.  You
> may have poked a hole in that proposal though, which again puts in
> doubt whether we should really pursue it for a sample driver.  Thanks,
>

hi Alex
I've sent v2 of the series introducing vendor_ops to vfio-pci.
(https://lkml.org/lkml/2020/1/30/956)
By making vfio_pci_device partly public and incorporating vendor driver
data, it now lets a vendor driver register its own vfio_device_ops and
call vfio_pci_ops as the default implementations.

Making vfio_pci_regops ubiquitous for all regions and creating an API
for a vendor module to register new regions with ops is also a good idea,
but its drawback is that its use is limited to region customization.
I'd like to send my current v2 implementation to you first and see if you
think it's good.

Thanks
Yan



end of thread, other threads:[~2020-01-31  2:35 UTC | newest]

Thread overview: 44+ messages
2020-01-07 12:01 [PATCH v4 00/12] vfio_pci: wrap pci device as a mediated device Liu Yi L
2020-01-07 12:01 ` [PATCH v4 01/12] vfio_pci: refine user config reference in vfio-pci module Liu Yi L
2020-01-09 22:48   ` Alex Williamson
2020-01-16 12:19     ` Liu, Yi L
2020-01-07 12:01 ` [PATCH v4 02/12] vfio_pci: move vfio_pci_is_vga/vfio_vga_disabled to header file Liu Yi L
2020-01-15 10:43   ` Cornelia Huck
2020-01-16 12:46     ` Liu, Yi L
2020-01-07 12:01 ` [PATCH v4 03/12] vfio_pci: refine vfio_pci_driver reference in vfio_pci.c Liu Yi L
2020-01-09 22:48   ` Alex Williamson
2020-01-10  7:35     ` Liu, Yi L
2020-01-07 12:01 ` [PATCH v4 04/12] vfio_pci: make common functions be extern Liu Yi L
2020-01-15 10:56   ` Cornelia Huck
2020-01-16 12:48     ` Liu, Yi L
2020-01-07 12:01 ` [PATCH v4 05/12] vfio_pci: duplicate vfio_pci.c Liu Yi L
2020-01-15 11:03   ` Cornelia Huck
2020-01-15 15:12     ` Alex Williamson
2020-01-07 12:01 ` [PATCH v4 06/12] vfio_pci: shrink vfio_pci_common.c Liu Yi L
2020-01-07 12:01 ` [PATCH v4 07/12] vfio_pci: shrink vfio_pci.c Liu Yi L
2020-01-08 11:24   ` kbuild test robot
2020-01-09 22:48   ` Alex Williamson
2020-01-16 12:42     ` Liu, Yi L
2020-01-07 12:01 ` [PATCH v4 08/12] vfio_pci: duplicate vfio_pci_private.h to include/linux Liu Yi L
2020-01-07 12:01 ` [PATCH v4 09/12] vfio: split vfio_pci_private.h into two files Liu Yi L
2020-01-09 22:48   ` Alex Williamson
2020-01-16 11:59     ` Liu, Yi L
2020-01-07 12:01 ` [PATCH v4 10/12] vfio: build vfio_pci_common.c into a kernel module Liu Yi L
2020-01-07 12:01 ` [PATCH v4 11/12] samples: add vfio-mdev-pci driver Liu Yi L
2020-01-09 22:48   ` Alex Williamson
2020-01-16 12:33     ` Liu, Yi L
2020-01-16 21:24       ` Alex Williamson
2020-01-18 14:25         ` Liu, Yi L
2020-01-20 21:07           ` Alex Williamson
2020-01-21  7:43             ` Tian, Kevin
2020-01-21  8:43               ` Yan Zhao
2020-01-21 20:04                 ` Alex Williamson
2020-01-21 21:54                   ` Yan Zhao
2020-01-23 23:33                     ` Alex Williamson
2020-01-31  2:26                       ` Yan Zhao
2020-01-15 12:30   ` Cornelia Huck
2020-01-16 13:23     ` Liu, Yi L
2020-01-16 17:40       ` Cornelia Huck
2020-01-18 14:23         ` Liu, Yi L
2020-01-20  8:55           ` Cornelia Huck
2020-01-07 12:01 ` [PATCH v4 12/12] samples: refine " Liu Yi L
