* [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci
@ 2020-05-18  2:42 Yan Zhao
  2020-05-18  2:43 ` [RFC PATCH v4 01/10] vfio/pci: register/unregister vfio_pci_vendor_driver_ops Yan Zhao
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:42 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

When using vfio-pci to pass through devices, its default implementations
are desired most of the time, but it is sometimes necessary to call
vendor-specific operations.
For example, to support device live migration, the way dirty pages are
detected and device state is saved/restored may vary from device to
device.
Vendors may want to add a vendor device region or to intercept writes to
a BAR region.
So, in this series, we introduce a way for vendors to provide vendor-specific
ops for VFIO devices, and meanwhile export several vfio-pci
interfaces as default implementations to simplify vendor driver code
and avoid duplication.

Vendor driver registration/unregistration goes like this:
(1) Macros are provided to let vendor drivers register/unregister a
vfio_pci_vendor_driver_ops with vfio_pci in their module_init() and
module_exit().
vfio_pci_vendor_driver_ops contains probe() and remove() callbacks and a
pointer to a vfio_device_ops.

(2) Vendor drivers define their module aliases as
"vfio-pci:$vendor_id-$device_id".
E.g. a vendor module for VF devices of the Intel(R) Ethernet Controller XL710
family can define its module alias as MODULE_ALIAS("vfio-pci:8086-154c").

(3) When module vfio_pci is bound to a device, it calls modprobe in
user space for modules with alias "vfio-pci:$vendor_id-$device_id", which
triggers not-yet-loaded vendor drivers to register their
vfio_pci_vendor_driver_ops with vfio_pci.
vfio_pci then walks the list of registered ops and calls probe() to test
whether each vendor driver supports the physical device.
A successful probe() binds the vfio device to the vendor-provided
vfio_device_ops, which may call the exported default implementations in
vfio_pci_ops where appropriate.


                                        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                                  
 __________   (un)register vendor ops  |  ___________    ___________   |
|          |<----------------------------|    VF    |   |           |   
| vfio-pci |                           | |  vendor  |   | PF driver |  |
|__________|---------------------------->|  driver  |   |___________|   
     |           probe/remove()        |  -----------          |       |
     |                                                         |         
     |                                 |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _|
    \|/                                                       \|/
-----------                                              ------------
|    VF   |                                              |    PF    |
-----------                                              ------------
                   a typical usage in SRIOV



Ref counts:
(1) Vendor drivers must be modules and be compiled to depend on module
vfio_pci.
(2) In vfio_pci, a successful register adds a reference to vfio_pci itself,
and a successful unregister drops that reference.
(3) In vfio_pci, a successful probe() of a vendor driver adds a reference to
the vendor module. The reference is dropped after calling remove().
(4) A macro is provided to make sure a vendor module always unregisters
itself in its module_exit().

Those rules prevent the conditions below:
a. vfio_pci is unloaded after a successful register from a vendor driver.
   Though vfio_pci would later call modprobe to ask the vendor module to
   register again, that cannot help if the vendor driver remains loaded
   across an unload-reload cycle of vfio_pci.
b. A vendor driver unregisters itself after being successfully probed by
   vfio_pci.
c. A circular dependency between vfio_pci and the vendor driver.
   If vfio_pci added references to both vfio_pci and the vendor driver on a
   successful register, and the vendor driver only did the unregistration in
   its module_exit, the unregistration would never get a chance to run.


Patch Overview
Patches 1-2 provide register/unregister interfaces for vendor drivers.
Patch 3     exports several members in vdev, including vendor_data, and
            exports functions in vfio_pci_ops so they are accessible
            from vendor drivers.
Patches 4-5 export some more vdev members to vendor drivers to simplify
            their implementations.
Patch 6     is from Tina Zhang to define a vendor-specific irq type
            capability.
Patch 7     introduces a new vendor-defined irq type
            VFIO_IRQ_TYPE_REMAP_BAR_REGION.
Patches 8-10
            use a VF live migration driver for Intel's 710 SRIOV devices
            as an example of how to implement this vendor ops interface.
    Patch 8 first lets the vendor ops pass through VFs.
    Patch 9 implements a migration region based on the migration protocol
            defined in [1][2].
            (Some dirty page tracking functions are intentionally
            commented out and will be sent out later.)
    Patch 10 serves as an example of how to define a vendor-specific irq
            type. This irq triggers QEMU to dynamically map BAR regions
            in order to implement software-based dirty page tracking.

Changelog:
RFC v3 -> RFC v4:
- use exported functions to let vendor drivers access internal fields of
  vdev rather than making struct vfio_pci_device public. (Alex)
- add a new interface vfio_pci_get_barmap() to call vfio_pci_setup_barmap()
  and let vfio_pci_setup_barmap() still be able to return a detailed errno.
  (Alex)
- removed the sample code for passing through igd devices; instead, use the
  first patch (patch 8/10) of i40e VF migration as a mere pass-through
  example.
- rebased code onto 5.7, VFIO migration kernel patches v17, and qemu
  patches v16.
- added a demo of a vendor-defined irq type.

RFC v2 -> RFC v3:
- embedded struct vfio_pci_device into struct vfio_pci_device_private.
  (Alex)

RFC v1 -> RFC v2:
- renamed mediate ops to vendor ops
- used request_module() and module aliases to manage vendor driver loading
  (Alex)
- changed from vfio_pci_ops calling vendor ops
  to vendor ops calling default vfio_pci_ops (Alex)
- dropped patches for dynamic traps of BARs; will submit them later.

Links:
[1] VFIO migration kernel v17:
    https://patchwork.kernel.org/cover/11466129/
[2] VFIO migration qemu v16:
    https://patchwork.kernel.org/cover/11456557/

Previous versions:
RFC v3: https://lkml.org/lkml/2020/2/11/142

RFC v2: https://lkml.org/lkml/2020/1/30/956

RFC v1:
kernel part: https://www.spinics.net/lists/kernel/msg3337337.html.
qemu part: https://www.spinics.net/lists/kernel/msg3337337.html.


Tina Zhang (1):
  vfio: Define device specific irq type capability

Yan Zhao (9):
  vfio/pci: register/unregister vfio_pci_vendor_driver_ops
  vfio/pci: macros to generate module_init and module_exit for vendor
    modules
  vfio/pci: export vendor_data, irq_type, num_regions, pdev and
    functions in vfio_pci_ops
  vfio/pci: let vfio_pci know number of vendor regions and vendor irqs
  vfio/pci: export vfio_pci_get_barmap
  vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  i40e/vf_migration: VF live migration - pass-through VF first
  i40e/vf_migration: register a migration vendor region
  i40e/vf_migration: vendor defined irq_type to support dynamic bar map

 drivers/net/ethernet/intel/Kconfig            |  10 +
 drivers/net/ethernet/intel/i40e/Makefile      |   2 +
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 904 ++++++++++++++++++
 .../ethernet/intel/i40e/i40e_vf_migration.h   | 119 +++
 drivers/vfio/pci/vfio_pci.c                   | 181 +++-
 drivers/vfio/pci/vfio_pci_private.h           |   9 +
 drivers/vfio/pci/vfio_pci_rdwr.c              |  10 +
 include/linux/vfio.h                          |  58 ++
 include/uapi/linux/vfio.h                     |  30 +-
 9 files changed, 1311 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH v4 01/10] vfio/pci: register/unregister vfio_pci_vendor_driver_ops
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
@ 2020-05-18  2:43 ` Yan Zhao
  2020-05-18  2:45 ` [RFC PATCH v4 02/10] vfio/pci: macros to generate module_init and module_exit for vendor modules Yan Zhao
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:43 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

vfio_pci_vendor_driver_ops includes two parts:
(1) .probe() and .remove() interfaces to be called by vfio_pci_probe()
and vfio_pci_remove().
(2) a pointer to a struct vfio_device_ops, which will be registered as the
ops of the vfio device if .probe() succeeds.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 102 +++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |   7 ++
 include/linux/vfio.h                |   9 +++
 3 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c6b37b5c04e..43d10d34cbc2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -68,6 +68,11 @@ static inline bool vfio_vga_disabled(void)
 #endif
 }
 
+static struct vfio_pci {
+	struct  mutex		vendor_drivers_lock;
+	struct  list_head	vendor_drivers_list;
+} vfio_pci;
+
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -1570,6 +1575,35 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
 	return 0;
 }
 
+static int probe_vendor_drivers(struct vfio_pci_device *vdev)
+{
+	struct vfio_pci_vendor_driver *driver;
+	int ret = -ENODEV;
+
+	request_module("vfio-pci:%x-%x", vdev->pdev->vendor,
+					 vdev->pdev->device);
+
+	mutex_lock(&vfio_pci.vendor_drivers_lock);
+	list_for_each_entry(driver, &vfio_pci.vendor_drivers_list, next) {
+		void *data;
+
+		if (!try_module_get(driver->ops->owner))
+			continue;
+
+		data = driver->ops->probe(vdev->pdev);
+		if (IS_ERR(data)) {
+			module_put(driver->ops->owner);
+			continue;
+		}
+		vdev->vendor_driver = driver;
+		vdev->vendor_data = data;
+		ret = 0;
+		break;
+	}
+	mutex_unlock(&vfio_pci.vendor_drivers_lock);
+	return ret;
+}
+
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
@@ -1609,7 +1643,12 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
 
-	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (probe_vendor_drivers(vdev))
+		ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	else
+		ret = vfio_add_group_dev(&pdev->dev,
+					 vdev->vendor_driver->ops->device_ops,
+					 vdev);
 	if (ret)
 		goto out_free;
 
@@ -1698,6 +1737,11 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	if (!disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D0);
 
+	if (vdev->vendor_driver) {
+		vdev->vendor_driver->ops->remove(vdev->vendor_data);
+		module_put(vdev->vendor_driver->ops->owner);
+	}
+
 	kfree(vdev->pm_save);
 	kfree(vdev);
 
@@ -2035,6 +2079,8 @@ static int __init vfio_pci_init(void)
 
 	vfio_pci_fill_ids();
 
+	mutex_init(&vfio_pci.vendor_drivers_lock);
+	INIT_LIST_HEAD(&vfio_pci.vendor_drivers_list);
 	return 0;
 
 out_driver:
@@ -2042,6 +2088,60 @@ static int __init vfio_pci_init(void)
 	return ret;
 }
 
+int __vfio_pci_register_vendor_driver(struct vfio_pci_vendor_driver_ops *ops)
+{
+	struct vfio_pci_vendor_driver *driver, *tmp;
+
+	if (!ops || !ops->device_ops)
+		return -EINVAL;
+
+	driver = kzalloc(sizeof(*driver), GFP_KERNEL);
+	if (!driver)
+		return -ENOMEM;
+
+	driver->ops = ops;
+
+	mutex_lock(&vfio_pci.vendor_drivers_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vfio_pci.vendor_drivers_list, next) {
+		if (tmp->ops->device_ops == ops->device_ops) {
+			mutex_unlock(&vfio_pci.vendor_drivers_lock);
+			kfree(driver);
+			return -EINVAL;
+		}
+	}
+
+	list_add(&driver->next, &vfio_pci.vendor_drivers_list);
+
+	mutex_unlock(&vfio_pci.vendor_drivers_lock);
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(__vfio_pci_register_vendor_driver);
+
+void vfio_pci_unregister_vendor_driver(struct vfio_device_ops *device_ops)
+{
+	struct vfio_pci_vendor_driver *driver, *tmp;
+
+	mutex_lock(&vfio_pci.vendor_drivers_lock);
+	list_for_each_entry_safe(driver, tmp,
+				 &vfio_pci.vendor_drivers_list, next) {
+		if (driver->ops->device_ops == device_ops) {
+			list_del(&driver->next);
+			mutex_unlock(&vfio_pci.vendor_drivers_lock);
+			kfree(driver);
+			module_put(THIS_MODULE);
+			return;
+		}
+	}
+	mutex_unlock(&vfio_pci.vendor_drivers_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_pci_unregister_vendor_driver);
+
 module_init(vfio_pci_init);
 module_exit(vfio_pci_cleanup);
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 36ec69081ecd..7758a20546fa 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -92,6 +92,11 @@ struct vfio_pci_vf_token {
 	int			users;
 };
 
+struct vfio_pci_vendor_driver {
+	const struct vfio_pci_vendor_driver_ops *ops;
+	struct list_head			next;
+};
+
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
@@ -132,6 +137,8 @@ struct vfio_pci_device {
 	struct list_head	ioeventfds_list;
 	struct vfio_pci_vf_token	*vf_token;
 	struct notifier_block	nb;
+	void			*vendor_data;
+	struct vfio_pci_vendor_driver	*vendor_driver;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 38d3c6a8dc7e..3e53deb012b6 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -214,4 +214,13 @@ extern int vfio_virqfd_enable(void *opaque,
 			      void *data, struct virqfd **pvirqfd, int fd);
 extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
+struct vfio_pci_vendor_driver_ops {
+	char			*name;
+	struct module		*owner;
+	void			*(*probe)(struct pci_dev *pdev);
+	void			(*remove)(void *vendor_data);
+	struct vfio_device_ops *device_ops;
+};
+int __vfio_pci_register_vendor_driver(struct vfio_pci_vendor_driver_ops *ops);
+void vfio_pci_unregister_vendor_driver(struct vfio_device_ops *device_ops);
 #endif /* VFIO_H */
-- 
2.17.1



* [RFC PATCH v4 02/10] vfio/pci: macros to generate module_init and module_exit for vendor modules
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
  2020-05-18  2:43 ` [RFC PATCH v4 01/10] vfio/pci: register/unregister vfio_pci_vendor_driver_ops Yan Zhao
@ 2020-05-18  2:45 ` Yan Zhao
  2020-06-04 15:01   ` Cornelia Huck
  2020-05-18  2:49 ` [RFC PATCH v4 03/10] vfio/pci: export vendor_data, irq_type, num_regions, pdev and functions in vfio_pci_ops Yan Zhao
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:45 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

Vendor modules call macro module_vfio_pci_register_vendor_handler to
generate their module_init and module_exit.
It is necessary to ensure that vendor modules always call
vfio_pci_register_vendor_driver() on driver loading and
vfio_pci_unregister_vendor_driver() on driver unloading,
because
(1) at compile time, there is only a dependency of vendor modules on
vfio_pci.
(2) at runtime,
- vendor modules take a reference to vfio_pci on a successful call of
  vfio_pci_register_vendor_driver() and drop it on a successful call of
  vfio_pci_unregister_vendor_driver().
- vfio_pci only takes a reference to a vendor module on a successful probe
  of the vendor driver, and drops it when unbinding from a device.

So, after vfio_pci is unbound from a device, the vendor module for that
device is free to be unloaded. However, if that vendor module does not
call vfio_pci_unregister_vendor_driver() in its module_exit, vfio_pci may
hold a stale pointer to the vendor module.

Cc: Kevin Tian <kevin.tian@intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 include/linux/vfio.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 3e53deb012b6..f3746608c2d9 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -223,4 +223,31 @@ struct vfio_pci_vendor_driver_ops {
 };
 int __vfio_pci_register_vendor_driver(struct vfio_pci_vendor_driver_ops *ops);
 void vfio_pci_unregister_vendor_driver(struct vfio_device_ops *device_ops);
+
+#define vfio_pci_register_vendor_driver(__name, __probe, __remove,	\
+					__device_ops)			\
+static struct vfio_pci_vendor_driver_ops  __ops ## _node = {		\
+	.owner		= THIS_MODULE,					\
+	.name		= __name,					\
+	.probe		= __probe,					\
+	.remove		= __remove,					\
+	.device_ops	= __device_ops,					\
+};									\
+__vfio_pci_register_vendor_driver(&__ops ## _node)
+
+#define module_vfio_pci_register_vendor_handler(name, probe, remove,	\
+						device_ops)		\
+static int __init device_ops ## _module_init(void)			\
+{									\
+	vfio_pci_register_vendor_driver(name, probe, remove,		\
+					device_ops);			\
+	return 0;							\
+};									\
+static void __exit device_ops ## _module_exit(void)			\
+{									\
+	vfio_pci_unregister_vendor_driver(device_ops);			\
+};									\
+module_init(device_ops ## _module_init);				\
+module_exit(device_ops ## _module_exit)
+
 #endif /* VFIO_H */
-- 
2.17.1



* [RFC PATCH v4 03/10] vfio/pci: export vendor_data, irq_type, num_regions, pdev and functions in vfio_pci_ops
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
  2020-05-18  2:43 ` [RFC PATCH v4 01/10] vfio/pci: register/unregister vfio_pci_vendor_driver_ops Yan Zhao
  2020-05-18  2:45 ` [RFC PATCH v4 02/10] vfio/pci: macros to generate module_init and module_exit for vendor modules Yan Zhao
@ 2020-05-18  2:49 ` Yan Zhao
  2020-05-18  2:49 ` [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs Yan Zhao
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

Export functions vfio_pci_vendor_data(), vfio_pci_irq_type(),
vfio_pci_num_regions(), vfio_pci_pdev(), and the functions in vfio_pci_ops,
so that they can be called from other modules and, in effect, inherited by
the vfio_device_ops provided by vendor modules.

Cc: Kevin Tian <kevin.tian@intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci.c | 56 +++++++++++++++++++++++++++++++------
 include/linux/vfio.h        | 18 ++++++++++++
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 43d10d34cbc2..290b7ab55ecf 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -73,6 +73,38 @@ static struct vfio_pci {
 	struct  list_head	vendor_drivers_list;
 } vfio_pci;
 
+struct pci_dev *vfio_pci_pdev(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	return vdev->pdev;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_pdev);
+
+int vfio_pci_num_regions(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	return vdev->num_regions;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_num_regions);
+
+int vfio_pci_irq_type(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	return vdev->irq_type;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_irq_type);
+
+void *vfio_pci_vendor_data(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	return vdev->vendor_data;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_vendor_data);
+
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -514,7 +546,7 @@ static void vfio_pci_vf_token_user_add(struct vfio_pci_device *vdev, int val)
 	vfio_device_put(pf_dev);
 }
 
-static void vfio_pci_release(void *device_data)
+void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 
@@ -530,8 +562,9 @@ static void vfio_pci_release(void *device_data)
 
 	module_put(THIS_MODULE);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_release);
 
-static int vfio_pci_open(void *device_data)
+int vfio_pci_open(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 	int ret = 0;
@@ -556,6 +589,7 @@ static int vfio_pci_open(void *device_data)
 		module_put(THIS_MODULE);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_open);
 
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 {
@@ -741,7 +775,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static long vfio_pci_ioctl(void *device_data,
+long vfio_pci_ioctl(void *device_data,
 			   unsigned int cmd, unsigned long arg)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -1253,6 +1287,7 @@ static long vfio_pci_ioctl(void *device_data,
 
 	return -ENOTTY;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_ioctl);
 
 static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
@@ -1286,7 +1321,7 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 	return -EINVAL;
 }
 
-static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+ssize_t vfio_pci_read(void *device_data, char __user *buf,
 			     size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1294,8 +1329,9 @@ static ssize_t vfio_pci_read(void *device_data, char __user *buf,
 
 	return vfio_pci_rw(device_data, buf, count, ppos, false);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_read);
 
-static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 			      size_t count, loff_t *ppos)
 {
 	if (!count)
@@ -1303,8 +1339,9 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_write);
 
-static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1365,8 +1402,9 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
 			       req_len, vma->vm_page_prot);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_mmap);
 
-static void vfio_pci_request(void *device_data, unsigned int count)
+void vfio_pci_request(void *device_data, unsigned int count)
 {
 	struct vfio_pci_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1386,6 +1424,7 @@ static void vfio_pci_request(void *device_data, unsigned int count)
 
 	mutex_unlock(&vdev->igate);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_request);
 
 static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 				      bool vf_token, uuid_t *uuid)
@@ -1482,7 +1521,7 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 
 #define VF_TOKEN_ARG "vf_token="
 
-static int vfio_pci_match(void *device_data, char *buf)
+int vfio_pci_match(void *device_data, char *buf)
 {
 	struct vfio_pci_device *vdev = device_data;
 	bool vf_token = false;
@@ -1530,6 +1569,7 @@ static int vfio_pci_match(void *device_data, char *buf)
 
 	return 1; /* Match */
 }
+EXPORT_SYMBOL_GPL(vfio_pci_match);
 
 static const struct vfio_device_ops vfio_pci_ops = {
 	.name		= "vfio-pci",
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index f3746608c2d9..6ededceb1964 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -214,6 +214,24 @@ extern int vfio_virqfd_enable(void *opaque,
 			      void *data, struct virqfd **pvirqfd, int fd);
 extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
+extern int vfio_pci_irq_type(void *device_data);
+extern int vfio_pci_num_regions(void *device_data);
+extern struct pci_dev *vfio_pci_pdev(void *device_data);
+
+extern long vfio_pci_ioctl(void *device_data,
+			   unsigned int cmd, unsigned long arg);
+extern ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos);
+extern ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos);
+extern int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma);
+extern void vfio_pci_request(void *device_data, unsigned int count);
+extern int vfio_pci_open(void *device_data);
+extern void vfio_pci_release(void *device_data);
+extern int vfio_pci_match(void *device_data, char *buf);
+
+extern void *vfio_pci_vendor_data(void *device_data);
+
 struct vfio_pci_vendor_driver_ops {
 	char			*name;
 	struct module		*owner;
-- 
2.17.1



* [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
                   ` (2 preceding siblings ...)
  2020-05-18  2:49 ` [RFC PATCH v4 03/10] vfio/pci: export vendor_data, irq_type, num_regions, pdev and functions in vfio_pci_ops Yan Zhao
@ 2020-05-18  2:49 ` Yan Zhao
  2020-06-04 15:25   ` Cornelia Huck
  2020-05-18  2:50 ` [RFC PATCH v4 05/10] vfio/pci: export vfio_pci_get_barmap Yan Zhao
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

This allows a simpler VFIO_DEVICE_GET_INFO ioctl in the vendor driver.

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 23 +++++++++++++++++++++--
 drivers/vfio/pci/vfio_pci_private.h |  2 ++
 include/linux/vfio.h                |  3 +++
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 290b7ab55ecf..30137c1c5308 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -105,6 +105,24 @@ void *vfio_pci_vendor_data(void *device_data)
 }
 EXPORT_SYMBOL_GPL(vfio_pci_vendor_data);
 
+int vfio_pci_set_vendor_regions(void *device_data, int num_vendor_regions)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	vdev->num_vendor_regions = num_vendor_regions;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_set_vendor_regions);
+
+
+int vfio_pci_set_vendor_irqs(void *device_data, int num_vendor_irqs)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	vdev->num_vendor_irqs = num_vendor_irqs;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_set_vendor_irqs);
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -797,8 +815,9 @@ long vfio_pci_ioctl(void *device_data,
 		if (vdev->reset_works)
 			info.flags |= VFIO_DEVICE_FLAGS_RESET;
 
-		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
-		info.num_irqs = VFIO_PCI_NUM_IRQS;
+		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions +
+						vdev->num_vendor_regions;
+		info.num_irqs = VFIO_PCI_NUM_IRQS + vdev->num_vendor_irqs;
 
 		return copy_to_user((void __user *)arg, &info, minsz) ?
 			-EFAULT : 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 7758a20546fa..c6cfc4605987 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -110,6 +110,8 @@ struct vfio_pci_device {
 	int			num_ctx;
 	int			irq_type;
 	int			num_regions;
+	int			num_vendor_regions;
+	int			num_vendor_irqs;
 	struct vfio_pci_region	*region;
 	u8			msi_qmax;
 	u8			msix_bar;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 6ededceb1964..6310c53f9d36 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -231,6 +231,9 @@ extern void vfio_pci_release(void *device_data);
 extern int vfio_pci_match(void *device_data, char *buf);
 
 extern void *vfio_pci_vendor_data(void *device_data);
+extern int vfio_pci_set_vendor_regions(void *device_data,
+				       int num_vendor_regions);
+extern int vfio_pci_set_vendor_irqs(void *device_data, int num_vendor_irqs);
 
 struct vfio_pci_vendor_driver_ops {
 	char			*name;
-- 
2.17.1



* [RFC PATCH v4 05/10] vfio/pci: export vfio_pci_get_barmap
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
                   ` (3 preceding siblings ...)
  2020-05-18  2:49 ` [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs Yan Zhao
@ 2020-05-18  2:50 ` Yan Zhao
  2020-05-18  6:37   ` kbuild test robot
  2020-05-18  2:50 ` [RFC PATCH v4 06/10] vfio: Define device specific irq type capability Yan Zhao
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:50 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

This allows the vendor driver to read/write BARs directly, which is useful
for security checks.

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci_rdwr.c | 10 ++++++++++
 include/linux/vfio.h             |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index a87992892a9f..e4085311ab28 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -153,6 +153,16 @@ static int vfio_pci_setup_barmap(struct vfio_pci_device *vdev, int bar)
 	return 0;
 }
 
+void __iomem *vfio_pci_get_barmap(void *device_data, int bar)
+{
+	int ret;
+	struct vfio_pci_device *vdev = device_data;
+
+	ret = vfio_pci_setup_barmap(vdev, bar);
+	return ret ? ERR_PTR(ret) : vdev->barmap[bar];
+}
+EXPORT_SYMBOL_GPL(vfio_pci_get_barmap);
+
 ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 			size_t count, loff_t *ppos, bool iswrite)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 6310c53f9d36..0c786fec4602 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -234,6 +234,7 @@ extern void *vfio_pci_vendor_data(void *device_data);
 extern int vfio_pci_set_vendor_regions(void *device_data,
 				       int num_vendor_regions);
 extern int vfio_pci_set_vendor_irqs(void *device_data, int num_vendor_irqs);
+extern void __iomem *vfio_pci_get_barmap(void *device_data, int bar);
 
 struct vfio_pci_vendor_driver_ops {
 	char			*name;
-- 
2.17.1



* [RFC PATCH v4 06/10] vfio: Define device specific irq type capability
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
                   ` (4 preceding siblings ...)
  2020-05-18  2:50 ` [RFC PATCH v4 05/10] vfio/pci: export vfio_pci_get_barmap Yan Zhao
@ 2020-05-18  2:50 ` Yan Zhao
  2020-05-18  2:52 ` [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION Yan Zhao
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:50 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Tina Zhang,
	Eric Auger

From: Tina Zhang <tina.zhang@intel.com>

Cap the number of irqs with fixed indexes, and use capability chains
to describe device specific irqs.

v2:
- Irq capability index starts from 1.

Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/uapi/linux/vfio.h | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0fe7c9a6f211..2d0d85c7c4d4 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -683,11 +683,27 @@ struct vfio_irq_info {
 #define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
 #define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
 #define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+#define VFIO_IRQ_INFO_FLAG_CAPS		(1 << 4) /* Info supports caps */
 	__u32	index;		/* IRQ index */
 	__u32	count;		/* Number of IRQs within this index */
+	__u32	cap_offset;	/* Offset within info struct of first cap */
 };
 #define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
 
+/*
+ * The irq type capability allows irqs unique to a specific device or
+ * class of devices to be exposed.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_IRQ_INFO_CAP_TYPE      1
+
+struct vfio_irq_info_cap_type {
+	struct vfio_info_cap_header header;
+	__u32 type;     /* global per bus driver */
+	__u32 subtype;  /* type specific */
+};
+
 /**
  * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
  *
@@ -789,7 +805,8 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
-	VFIO_PCI_NUM_IRQS
+	VFIO_PCI_NUM_IRQS = 5	/* Fixed user ABI, IRQ indexes >=5 use   */
+				/* device specific cap to define content */
 };
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
                   ` (5 preceding siblings ...)
  2020-05-18  2:50 ` [RFC PATCH v4 06/10] vfio: Define device specific irq type capability Yan Zhao
@ 2020-05-18  2:52 ` Yan Zhao
  2020-05-18  2:56   ` [QEMU RFC PATCH v4] hw/vfio/pci: remap bar region irq Yan Zhao
  2020-05-29 21:45   ` [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION Alex Williamson
  2020-05-18  2:53 ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Yan Zhao
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:52 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

This is a virtual irq type.
The vendor driver triggers this irq when it needs to notify userspace to
remap PCI BARs.

1. the vendor driver triggers this irq and packs the target bar number
   into the ctx count, i.e. "1 << bar_number".
   If a bit is set, the corresponding bar is to be remapped.

2. userspace re-queries the specified PCI BAR from the kernel and, if the
flags of the bar region have changed, removes the old subregions and
attaches new subregions according to the new flags.

3. userspace notifies the kernel of completion by writing one to the
eventfd of this irq.

Please check the corresponding QEMU implementation in the reply to this
patch, and a sample usage in the vendor driver in patch [10/10].

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 include/uapi/linux/vfio.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2d0d85c7c4d4..55895f75d720 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
 	__u32 subtype;  /* type specific */
 };
 
+/* Bar Region Query IRQ TYPE */
+#define VFIO_IRQ_TYPE_REMAP_BAR_REGION			(1)
+
+/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
+/*
+ * This irq notifies userspace to re-query the BAR region and remap
+ * its subregions.
+ */
+#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION	(0)
+
+
 /**
  * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
  *
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
                   ` (6 preceding siblings ...)
  2020-05-18  2:52 ` [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION Yan Zhao
@ 2020-05-18  2:53 ` Yan Zhao
  2020-05-18  8:49   ` kbuild test robot
                     ` (2 more replies)
  2020-05-18  2:54 ` [RFC PATCH v4 09/10] i40e/vf_migration: register a migration vendor region Yan Zhao
  2020-05-18  2:54 ` [RFC PATCH v4 10/10] i40e/vf_migration: vendor defined irq_type to support dynamic bar map Yan Zhao
  9 siblings, 3 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:53 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

This driver intercepts all device operations once it has been
successfully probed by the vfio-pci driver.

It processes the regions and irqs it is interested in and can then
forward operations to the default handlers exported from vfio-pci.

In this patch, the driver does nothing but pass VFs through to the
guest by calling the handlers exported from vfio-pci.

Cc: Shaopeng He <shaopeng.he@intel.com>

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/net/ethernet/intel/Kconfig            |  10 ++
 drivers/net/ethernet/intel/i40e/Makefile      |   2 +
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 165 ++++++++++++++++++
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  59 +++++++
 4 files changed, 236 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index ad34e4335df2..31780d9a59f1 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -264,6 +264,16 @@ config I40E_DCB
 
 	  If unsure, say N.
 
+config I40E_VF_MIGRATION
+	tristate "XL710 Family VF live migration support -- loadable modules only"
+	depends on I40E && VFIO_PCI && m
+	help
+	  Say m if you want to enable live migration of
+	  Virtual Functions of the Intel(R) Ethernet Controller
+	  XL710 Family of devices. It must be a module.
+	  This module serves as a vendor module of vfio_pci;
+	  VFs bind to vfio_pci directly.
+
 # this is here to allow seamless migration from I40EVF --> IAVF name
 # so that CONFIG_IAVF symbol will always mirror the state of CONFIG_I40EVF
 config IAVF
diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 2f21b3e89fd0..b80c224c2602 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -27,3 +27,5 @@ i40e-objs := i40e_main.o \
 	i40e_xsk.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
+
+obj-$(CONFIG_I40E_VF_MIGRATION) += i40e_vf_migration.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
new file mode 100644
index 000000000000..96026dcf5c9d
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2013 - 2019 Intel Corporation. */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/vfio.h>
+#include <linux/pci.h>
+#include <linux/eventfd.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sysfs.h>
+#include <linux/file.h>
+#include <linux/pci.h>
+
+#include "i40e.h"
+#include "i40e_vf_migration.h"
+
+#define VERSION_STRING  "0.1"
+#define DRIVER_AUTHOR   "Intel Corporation"
+
+static int i40e_vf_open(void *device_data)
+{
+	struct i40e_vf_migration *i40e_vf_dev =
+		vfio_pci_vendor_data(device_data);
+	int ret;
+	struct vfio_device_migration_info *mig_ctl = NULL;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&i40e_vf_dev->reflock);
+	if (!i40e_vf_dev->refcnt) {
+		vfio_pci_set_vendor_regions(device_data, 0);
+		vfio_pci_set_vendor_irqs(device_data, 0);
+	}
+
+	ret = vfio_pci_open(device_data);
+	if (ret)
+		goto error;
+
+	i40e_vf_dev->refcnt++;
+	mutex_unlock(&i40e_vf_dev->reflock);
+	return 0;
+error:
+	if (!i40e_vf_dev->refcnt) {
+		vfio_pci_set_vendor_regions(device_data, 0);
+		vfio_pci_set_vendor_irqs(device_data, 0);
+	}
+	module_put(THIS_MODULE);
+	mutex_unlock(&i40e_vf_dev->reflock);
+	return ret;
+}
+
+void i40e_vf_release(void *device_data)
+{
+	struct i40e_vf_migration *i40e_vf_dev =
+		vfio_pci_vendor_data(device_data);
+
+	mutex_lock(&i40e_vf_dev->reflock);
+	if (!--i40e_vf_dev->refcnt) {
+		vfio_pci_set_vendor_regions(device_data, 0);
+		vfio_pci_set_vendor_irqs(device_data, 0);
+	}
+	vfio_pci_release(device_data);
+	mutex_unlock(&i40e_vf_dev->reflock);
+	module_put(THIS_MODULE);
+}
+
+static long i40e_vf_ioctl(void *device_data,
+			  unsigned int cmd, unsigned long arg)
+{
+	return vfio_pci_ioctl(device_data, cmd, arg);
+}
+
+static ssize_t i40e_vf_read(void *device_data, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	return vfio_pci_read(device_data, buf, count, ppos);
+}
+
+static ssize_t i40e_vf_write(void *device_data, const char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	return vfio_pci_write(device_data, buf, count, ppos);
+}
+
+static int i40e_vf_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	return vfio_pci_mmap(device_data, vma);
+}
+
+static void i40e_vf_request(void *device_data, unsigned int count)
+{
+	vfio_pci_request(device_data, count);
+}
+
+static struct vfio_device_ops i40e_vf_device_ops_node = {
+	.name		= "i40e_vf",
+	.open		= i40e_vf_open,
+	.release	= i40e_vf_release,
+	.ioctl		= i40e_vf_ioctl,
+	.read		= i40e_vf_read,
+	.write		= i40e_vf_write,
+	.mmap		= i40e_vf_mmap,
+	.request	= i40e_vf_request,
+};
+
+void *i40e_vf_probe(struct pci_dev *pdev)
+{
+	struct i40e_vf_migration *i40e_vf_dev = NULL;
+	struct pci_dev *pf_dev, *vf_dev;
+	struct i40e_pf *pf;
+	struct i40e_vf *vf;
+	unsigned int vf_devfn, devfn;
+	int vf_id = -1;
+	int i;
+
+	pf_dev = pdev->physfn;
+	pf = pci_get_drvdata(pf_dev);
+	vf_dev = pdev;
+	vf_devfn = vf_dev->devfn;
+
+	for (i = 0; i < pci_num_vf(pf_dev); i++) {
+		devfn = (pf_dev->devfn + pf_dev->sriov->offset +
+			 pf_dev->sriov->stride * i) & 0xff;
+		if (devfn == vf_devfn) {
+			vf_id = i;
+			break;
+		}
+	}
+
+	if (vf_id == -1)
+		return ERR_PTR(-EINVAL);
+
+	i40e_vf_dev = kzalloc(sizeof(*i40e_vf_dev), GFP_KERNEL);
+
+	if (!i40e_vf_dev)
+		return ERR_PTR(-ENOMEM);
+
+	i40e_vf_dev->vf_id = vf_id;
+	i40e_vf_dev->vf_vendor = pdev->vendor;
+	i40e_vf_dev->vf_device = pdev->device;
+	i40e_vf_dev->pf_dev = pf_dev;
+	i40e_vf_dev->vf_dev = vf_dev;
+	mutex_init(&i40e_vf_dev->reflock);
+
+	vf = &pf->vf[vf_id];
+
+	return i40e_vf_dev;
+}
+
+static void i40e_vf_remove(void *vendor_data)
+{
+	kfree(vendor_data);
+}
+
+#define i40e_vf_device_ops (&i40e_vf_device_ops_node)
+module_vfio_pci_register_vendor_handler("I40E VF", i40e_vf_probe,
+					i40e_vf_remove, i40e_vf_device_ops);
+
+MODULE_ALIAS("vfio-pci:8086-154c");
+MODULE_LICENSE("GPL v2");
+MODULE_INFO(supported, "Vendor driver of vfio pci to support VF live migration");
+MODULE_VERSION(VERSION_STRING);
+MODULE_AUTHOR(DRIVER_AUTHOR);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
new file mode 100644
index 000000000000..696d40601ec3
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2013 - 2019 Intel Corporation. */
+
+#ifndef I40E_MIG_H
+#define I40E_MIG_H
+
+#include <linux/pci.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+
+#include "i40e.h"
+#include "i40e_txrx.h"
+
+/* helper macros copied from vfio-pci */
+#define VFIO_PCI_OFFSET_SHIFT   40
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+/* Single Root I/O Virtualization */
+struct pci_sriov {
+	int		pos;		/* Capability position */
+	int		nres;		/* Number of resources */
+	u32		cap;		/* SR-IOV Capabilities */
+	u16		ctrl;		/* SR-IOV Control */
+	u16		total_VFs;	/* Total VFs associated with the PF */
+	u16		initial_VFs;	/* Initial VFs associated with the PF */
+	u16		num_VFs;	/* Number of VFs available */
+	u16		offset;		/* First VF Routing ID offset */
+	u16		stride;		/* Following VF stride */
+	u16		vf_device;	/* VF device ID */
+	u32		pgsz;		/* Page size for BAR alignment */
+	u8		link;		/* Function Dependency Link */
+	u8		max_VF_buses;	/* Max buses consumed by VFs */
+	u16		driver_max_VFs;	/* Max num VFs driver supports */
+	struct pci_dev	*dev;		/* Lowest numbered PF */
+	struct pci_dev	*self;		/* This PF */
+	u32		cfg_size;	/* VF config space size */
+	u32		class;		/* VF device */
+	u8		hdr_type;	/* VF header type */
+	u16		subsystem_vendor; /* VF subsystem vendor */
+	u16		subsystem_device; /* VF subsystem device */
+	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
+	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+};
+
+struct i40e_vf_migration {
+	__u32				vf_vendor;
+	__u32				vf_device;
+	__u32				handle;
+	struct pci_dev			*pf_dev;
+	struct pci_dev			*vf_dev;
+	int				vf_id;
+	int				refcnt;
+	struct				mutex reflock; /* mutex protecting refcnt */
+};
+
+#endif /* I40E_MIG_H */
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v4 09/10] i40e/vf_migration: register a migration vendor region
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
                   ` (7 preceding siblings ...)
  2020-05-18  2:53 ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Yan Zhao
@ 2020-05-18  2:54 ` Yan Zhao
  2020-05-18  6:47   ` kbuild test robot
  2020-05-18  2:54 ` [RFC PATCH v4 10/10] i40e/vf_migration: vendor defined irq_type to support dynamic bar map Yan Zhao
  9 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:54 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

This patch lets the vendor driver register a migration region, so that
the migration detection code in userspace is able to see this region
and trigger the migration flow according to the VFIO migration
protocol.

This migration region is based on the VFIO migration series, with some
minor fixes:
[1] kernel v17: https://patchwork.kernel.org/cover/11466129/
[2] qemu v16: https://patchwork.kernel.org/cover/11456557/

Cc: Shaopeng He <shaopeng.he@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 429 +++++++++++++++++-
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  34 ++
 2 files changed, 460 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index 96026dcf5c9d..107a291909b3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -17,6 +17,351 @@
 
 #define VERSION_STRING  "0.1"
 #define DRIVER_AUTHOR   "Intel Corporation"
+#define TEST_DIRTY_IOVA_PFN 0
+
+static int i40e_vf_iommu_notifier(struct notifier_block *nb,
+				  unsigned long action, void *data)
+{
+	if (action == VFIO_IOMMU_NOTIFY_DMA_UNMAP) {
+		struct vfio_iommu_type1_dma_unmap *unmap = data;
+		unsigned long iova_pfn, end_iova_pfn;
+
+		iova_pfn = unmap->iova >> PAGE_SHIFT;
+		end_iova_pfn = iova_pfn + unmap->size / PAGE_SIZE;
+
+		pr_info("DMA UNMAP iova_pfn=%lx, end=%lx\n", iova_pfn,
+			end_iova_pfn);
+	}
+
+	return NOTIFY_OK;
+}
+
+/* transiently pin and unpin a page to report it as dirty
+ */
+static int i40e_vf_set_page_dirty(struct i40e_vf_migration *i40e_vf_dev,
+				  unsigned long dirty_iova_pfn)
+{
+	unsigned long dirty_pfn, cnt = 1;
+	int ret;
+
+	ret = vfio_group_pin_pages(i40e_vf_dev->vfio_group,
+				   &dirty_iova_pfn, cnt,
+				   IOMMU_READ | IOMMU_WRITE, &dirty_pfn);
+	if (ret != cnt) {
+		pr_err("failed to track dirty of page of iova pfn %lx\n",
+		       dirty_iova_pfn);
+		return ret < 0 ? ret : -EFAULT;
+	}
+
+	vfio_group_unpin_pages(i40e_vf_dev->vfio_group, &dirty_iova_pfn, cnt);
+
+	return 0;
+}
+
+/* alloc dirty page tracking resources and
+ * do the first round dirty page scanning
+ */
+static int i40e_vf_prepare_dirty_track(struct i40e_vf_migration *i40e_vf_dev)
+{
+	struct vfio_group *vfio_group;
+	unsigned long events;
+	int ret;
+	struct device *dev = &i40e_vf_dev->vf_dev->dev;
+
+	if (i40e_vf_dev->in_dirty_track) {
+		pr_warn("%s, previous dirty track resources found\n",
+			__func__);
+		return 0;
+	}
+
+	i40e_vf_dev->iommu_notifier.notifier_call = i40e_vf_iommu_notifier;
+
+	events = VFIO_IOMMU_NOTIFY_DMA_UNMAP;
+	ret = vfio_register_notifier(dev, VFIO_IOMMU_NOTIFY, &events,
+				     &i40e_vf_dev->iommu_notifier);
+	if (ret) {
+		pr_err("failed to register vfio iommu notifier\n");
+		return ret;
+	}
+
+	vfio_group = vfio_group_get_external_user_from_dev(dev);
+	if (IS_ERR_OR_NULL(vfio_group)) {
+		ret = vfio_group ? PTR_ERR(vfio_group) : -ENODEV;
+		pr_err("failed to get vfio group from dev\n");
+		goto out;
+	}
+
+	i40e_vf_dev->vfio_group = vfio_group;
+
+	ret = i40e_vf_set_page_dirty(i40e_vf_dev, TEST_DIRTY_IOVA_PFN);
+
+	if (ret) {
+		pr_err("failed to set dirty for test page\n");
+		goto out_group;
+	}
+
+	i40e_vf_dev->in_dirty_track = true;
+	return 0;
+
+out_group:
+	vfio_group_put_external_user(i40e_vf_dev->vfio_group);
+out:
+	vfio_unregister_notifier(dev, VFIO_IOMMU_NOTIFY,
+				 &i40e_vf_dev->iommu_notifier);
+	return ret;
+}
+
+static void i40e_vf_stop_dirty_track(struct i40e_vf_migration *i40e_vf_dev)
+{
+	if (!i40e_vf_dev->in_dirty_track)
+		return;
+
+	vfio_unregister_notifier(&i40e_vf_dev->vf_dev->dev,
+				 VFIO_IOMMU_NOTIFY,
+				 &i40e_vf_dev->iommu_notifier);
+	vfio_group_put_external_user(i40e_vf_dev->vfio_group);
+	i40e_vf_dev->in_dirty_track = false;
+}
+
+static int i40e_vf_set_device_state(struct i40e_vf_migration *i40e_vf_dev,
+				       u32 state)
+{
+	int ret = 0;
+	struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
+
+	if (state == mig_ctl->device_state)
+		return 0;
+
+	switch (state) {
+	case VFIO_DEVICE_STATE_RUNNING:
+		break;
+	case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+		ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
+		break;
+	case VFIO_DEVICE_STATE_SAVING:
+		// do the last round of dirty page scanning
+		break;
+	case VFIO_DEVICE_STATE_STOP:
+		// release dirty page tracking resources
+		if (mig_ctl->device_state == VFIO_DEVICE_STATE_SAVING)
+			i40e_vf_stop_dirty_track(i40e_vf_dev);
+		break;
+	case VFIO_DEVICE_STATE_RESUMING:
+		break;
+	default:
+		ret = -EFAULT;
+	}
+
+	if (!ret)
+		mig_ctl->device_state = state;
+
+	return ret;
+}
+
+static
+ssize_t i40e_vf_region_migration_rw(struct i40e_vf_migration *i40e_vf_dev,
+				    char __user *buf, size_t count,
+				    loff_t *ppos, bool iswrite)
+{
+#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
+	struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int ret = 0;
+
+	switch (pos) {
+	case VDM_OFFSET(device_state):
+		if (count != sizeof(mig_ctl->device_state)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (iswrite) {
+			u32 device_state;
+
+			if (copy_from_user(&device_state, buf, count)) {
+				ret = -EFAULT;
+				break;
+			}
+
+			ret = i40e_vf_set_device_state(i40e_vf_dev,
+						       device_state);
+			ret = ret ? ret : count;
+		} else {
+			ret = copy_to_user(buf, &mig_ctl->device_state,
+					   count) ? -EFAULT : count;
+		}
+		break;
+
+	case VDM_OFFSET(reserved):
+		ret = -EFAULT;
+		break;
+
+	case VDM_OFFSET(pending_bytes):
+		{
+			if (count != sizeof(mig_ctl->pending_bytes)) {
+				ret = -EINVAL;
+				break;
+			}
+
+			if (iswrite)
+				ret = -EFAULT;
+			else
+				ret = copy_to_user(buf,
+						   &mig_ctl->pending_bytes,
+						   count) ? -EFAULT : count;
+
+			break;
+		}
+
+	case VDM_OFFSET(data_offset):
+		{
+			/* as we don't support device internal dirty data
+			 * and our pending_bytes is always 0,
+			 * return error here.
+			 */
+			ret = -EFAULT;
+			break;
+		}
+	case VDM_OFFSET(data_size):
+		if (count != sizeof(mig_ctl->data_size)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (iswrite)
+			ret = copy_from_user(&mig_ctl->data_size, buf, count) ?
+					     -EFAULT : count;
+		else
+			ret = copy_to_user(buf, &mig_ctl->data_size, count) ?
+					   -EFAULT : count;
+		break;
+
+	default:
+		ret = -EFAULT;
+		break;
+	}
+	return ret;
+}
+
+static
+int i40e_vf_region_migration_mmap(struct i40e_vf_migration *i40e_vf_dev,
+				  struct i40e_vf_region *region,
+				  struct vm_area_struct *vma)
+{
+	return -EFAULT;
+}
+
+static
+void i40e_vf_region_migration_release(struct i40e_vf_migration *i40e_vf_dev,
+				      struct i40e_vf_region *region)
+{
+	kfree(i40e_vf_dev->mig_ctl);
+	i40e_vf_dev->mig_ctl = NULL;
+}
+
+static const struct i40e_vf_region_ops i40e_vf_region_ops_migration = {
+	.rw		= i40e_vf_region_migration_rw,
+	.release	= i40e_vf_region_migration_release,
+	.mmap		= i40e_vf_region_migration_mmap,
+};
+
+static int i40e_vf_register_region(struct i40e_vf_migration *i40e_vf_dev,
+				   unsigned int type, unsigned int subtype,
+				   const struct i40e_vf_region_ops *ops,
+				   size_t size, u32 flags, void *data)
+{
+	struct i40e_vf_region *regions;
+
+	regions = krealloc(i40e_vf_dev->regions,
+			   (i40e_vf_dev->num_regions + 1) * sizeof(*regions),
+			   GFP_KERNEL);
+	if (!regions)
+		return -ENOMEM;
+
+	i40e_vf_dev->regions = regions;
+	regions[i40e_vf_dev->num_regions].type = type;
+	regions[i40e_vf_dev->num_regions].subtype = subtype;
+	regions[i40e_vf_dev->num_regions].ops = ops;
+	regions[i40e_vf_dev->num_regions].size = size;
+	regions[i40e_vf_dev->num_regions].flags = flags;
+	regions[i40e_vf_dev->num_regions].data = data;
+	i40e_vf_dev->num_regions++;
+	return 0;
+}
+
+static long i40e_vf_get_region_info(void *device_data,
+				    unsigned int cmd, unsigned long arg)
+{
+	struct vfio_region_info info;
+	struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+	int index, ret;
+	struct vfio_region_info_cap_type cap_type = {
+		.header.id = VFIO_REGION_INFO_CAP_TYPE,
+		.header.version = 1 };
+	struct i40e_vf_region *regions;
+	int num_vdev_regions = vfio_pci_num_regions(device_data);
+	unsigned long minsz;
+	struct i40e_vf_migration *i40e_vf_dev =
+		vfio_pci_vendor_data(device_data);
+
+	minsz = offsetofend(struct vfio_region_info, offset);
+
+	if (cmd != VFIO_DEVICE_GET_REGION_INFO)
+		return -EINVAL;
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+	if (info.argsz < minsz)
+		return -EINVAL;
+	if (info.index < VFIO_PCI_NUM_REGIONS + num_vdev_regions)
+		goto default_handle;
+
+	index = info.index - VFIO_PCI_NUM_REGIONS - num_vdev_regions;
+	if (index >= i40e_vf_dev->num_regions)
+		return -EINVAL;
+
+	info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+	regions = i40e_vf_dev->regions;
+	info.size = regions[index].size;
+	info.flags = regions[index].flags;
+	cap_type.type = regions[index].type;
+	cap_type.subtype = regions[index].subtype;
+
+	ret = vfio_info_add_capability(&caps, &cap_type.header,
+				       sizeof(cap_type));
+	if (ret)
+		return ret;
+
+	if (regions[index].ops->add_cap) {
+		ret = regions[index].ops->add_cap(i40e_vf_dev,
+				&regions[index], &caps);
+		if (ret)
+			return ret;
+	}
+
+	if (caps.size) {
+		info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+		if (info.argsz < sizeof(info) + caps.size) {
+			info.argsz = sizeof(info) + caps.size;
+			info.cap_offset = 0;
+		} else {
+			vfio_info_cap_shift(&caps, sizeof(info));
+			if (copy_to_user((void __user *)arg + sizeof(info),
+					 caps.buf, caps.size)) {
+				kfree(caps.buf);
+				return -EFAULT;
+			}
+			info.cap_offset = sizeof(info);
+		}
+
+		kfree(caps.buf);
+	}
+
+	return copy_to_user((void __user *)arg, &info, minsz) ?
+		-EFAULT : 0;
+
+default_handle:
+	return vfio_pci_ioctl(device_data, cmd, arg);
+}
 
 static int i40e_vf_open(void *device_data)
 {
@@ -30,7 +375,26 @@ static int i40e_vf_open(void *device_data)
 
 	mutex_lock(&i40e_vf_dev->reflock);
 	if (!i40e_vf_dev->refcnt) {
-		vfio_pci_set_vendor_regions(device_data, 0);
+		mig_ctl = kzalloc(sizeof(*mig_ctl), GFP_KERNEL);
+		if (!mig_ctl) {
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		ret = i40e_vf_register_region(i40e_vf_dev,
+					      VFIO_REGION_TYPE_MIGRATION,
+					      VFIO_REGION_SUBTYPE_MIGRATION,
+					      &i40e_vf_region_ops_migration,
+					      MIGRATION_REGION_SZ,
+					      VFIO_REGION_INFO_FLAG_READ |
+					      VFIO_REGION_INFO_FLAG_WRITE,
+					      NULL);
+		if (ret)
+			goto error;
+
+		i40e_vf_dev->mig_ctl = mig_ctl;
+		vfio_pci_set_vendor_regions(device_data,
+					    i40e_vf_dev->num_regions);
 		vfio_pci_set_vendor_irqs(device_data, 0);
 	}
 
@@ -43,6 +407,10 @@ static int i40e_vf_open(void *device_data)
 	return 0;
 error:
 	if (!i40e_vf_dev->refcnt) {
+		kfree(mig_ctl);
+		kfree(i40e_vf_dev->regions);
+		i40e_vf_dev->num_regions = 0;
+		i40e_vf_dev->regions = NULL;
 		vfio_pci_set_vendor_regions(device_data, 0);
 		vfio_pci_set_vendor_irqs(device_data, 0);
 	}
@@ -56,8 +424,17 @@ void i40e_vf_release(void *device_data)
 	struct i40e_vf_migration *i40e_vf_dev =
 		vfio_pci_vendor_data(device_data);
 
+	i40e_vf_stop_dirty_track(i40e_vf_dev);
 	mutex_lock(&i40e_vf_dev->reflock);
 	if (!--i40e_vf_dev->refcnt) {
+		int i;
+
+		for (i = 0; i < i40e_vf_dev->num_regions; i++)
+			i40e_vf_dev->regions[i].ops->release(i40e_vf_dev,
+						&i40e_vf_dev->regions[i]);
+		i40e_vf_dev->num_regions = 0;
+		kfree(i40e_vf_dev->regions);
+		i40e_vf_dev->regions = NULL;
 		vfio_pci_set_vendor_regions(device_data, 0);
 		vfio_pci_set_vendor_irqs(device_data, 0);
 	}
@@ -69,19 +446,65 @@ void i40e_vf_release(void *device_data)
 static long i40e_vf_ioctl(void *device_data,
 			  unsigned int cmd, unsigned long arg)
 {
+	if (cmd == VFIO_DEVICE_GET_REGION_INFO)
+		return i40e_vf_get_region_info(device_data, cmd, arg);
+
 	return vfio_pci_ioctl(device_data, cmd, arg);
 }
 
 static ssize_t i40e_vf_read(void *device_data, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-	return vfio_pci_read(device_data, buf, count, ppos);
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct i40e_vf_migration *i40e_vf_dev =
+		vfio_pci_vendor_data(device_data);
+	struct i40e_vf_region *region;
+	int num_vdev_regions = vfio_pci_num_regions(device_data);
+	int num_vendor_region = i40e_vf_dev->num_regions;
+
+	if (index < VFIO_PCI_NUM_REGIONS + num_vdev_regions)
+		return vfio_pci_read(device_data, buf, count, ppos);
+	else if (index >= VFIO_PCI_NUM_REGIONS + num_vdev_regions +
+			num_vendor_region)
+		return -EINVAL;
+
+	index -= VFIO_PCI_NUM_REGIONS + num_vdev_regions;
+
+	region = &i40e_vf_dev->regions[index];
+	if (!region->ops->rw)
+		return -EINVAL;
+
+	return region->ops->rw(i40e_vf_dev, buf, count, ppos, false);
 }
 
 static ssize_t i40e_vf_write(void *device_data, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	return vfio_pci_write(device_data, buf, count, ppos);
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct i40e_vf_migration *i40e_vf_dev =
+		vfio_pci_vendor_data(device_data);
+	struct i40e_vf_region *region;
+	int num_vdev_regions = vfio_pci_num_regions(device_data);
+	int num_vendor_region = i40e_vf_dev->num_regions;
+
+	if (index == VFIO_PCI_BAR0_REGION_INDEX)
+		; /* placeholder: scan dirty pages on trapped BAR0 writes */
+
+	if (index < VFIO_PCI_NUM_REGIONS + num_vdev_regions)
+		return vfio_pci_write(device_data, buf, count, ppos);
+	else if (index >= VFIO_PCI_NUM_REGIONS + num_vdev_regions +
+			num_vendor_region)
+		return -EINVAL;
+
+	index -= VFIO_PCI_NUM_REGIONS + num_vdev_regions;
+
+	region = &i40e_vf_dev->regions[index];
+
+	if (!region->ops->rw)
+		return -EINVAL;
+
+	return region->ops->rw(i40e_vf_dev, (char __user *)buf,
+			       count, ppos, true);
 }
 
 static int i40e_vf_mmap(void *device_data, struct vm_area_struct *vma)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
index 696d40601ec3..918ba275d5b5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
@@ -17,6 +17,8 @@
 #define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
 #define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
 
+#define MIGRATION_REGION_SZ (sizeof(struct vfio_device_migration_info))
+
 /* Single Root I/O Virtualization */
 struct pci_sriov {
 	int		pos;		/* Capability position */
@@ -53,6 +55,38 @@ struct i40e_vf_migration {
 	int				vf_id;
 	int				refcnt;
 	struct				mutex reflock; /* mutex protecting refcnt */
+
+	struct vfio_device_migration_info *mig_ctl;
+	bool				in_dirty_track;
+
+	struct i40e_vf_region		*regions;
+	int				num_regions;
+	struct notifier_block		iommu_notifier;
+	struct vfio_group		*vfio_group;
+
+};
+
+struct i40e_vf_region_ops {
+	ssize_t	(*rw)(struct i40e_vf_migration *i40e_vf_dev,
+		      char __user *buf, size_t count,
+		      loff_t *ppos, bool iswrite);
+	void	(*release)(struct i40e_vf_migration *i40e_vf_dev,
+			   struct i40e_vf_region *region);
+	int	(*mmap)(struct i40e_vf_migration *i40e_vf_dev,
+			struct i40e_vf_region *region,
+			struct vm_area_struct *vma);
+	int	(*add_cap)(struct i40e_vf_migration *i40e_vf_dev,
+			   struct i40e_vf_region *region,
+			   struct vfio_info_cap *caps);
+};
+
+struct i40e_vf_region {
+	u32				type;
+	u32				subtype;
+	size_t				size;
+	u32				flags;
+	const struct i40e_vf_region_ops	*ops;
+	void				*data;
 };
 
 #endif /* I40E_MIG_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v4 10/10] i40e/vf_migration: vendor defined irq_type to support dynamic bar map
  2020-05-18  2:42 [RFC PATCH v4 00/10] Introduce vendor ops in vfio-pci Yan Zhao
                   ` (8 preceding siblings ...)
  2020-05-18  2:54 ` [RFC PATCH v4 09/10] i40e/vf_migration: register a migration vendor region Yan Zhao
@ 2020-05-18  2:54 ` Yan Zhao
  9 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:54 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

This patch gives an example implementation of support for a vendor
defined irq_type.

- on open of this vendor driver, it registers an irq of type
  VFIO_IRQ_TYPE_REMAP_BAR_REGION and reports to driver vfio-pci that
  there is 1 vendor irq.

- after userspace detects and enables the irq of type
  VFIO_IRQ_TYPE_REMAP_BAR_REGION, this vendor driver sets up a virqfd
  to monitor writes to the fd of this irq.

  (1) when migration starts
  (the device state is set to _SAVING & _RUNNING),
  a. this vendor driver signals the irq VFIO_IRQ_TYPE_REMAP_BAR_REGION
  to ask userspace to remap pci bars. It packs the target bar number into
  the ctx count, i.e. 1 << bar_number; if there are multiple bars to
  remap, the numbers are or'ed together.

  b. on receiving this eventfd signal, userspace reads the bar number(s),
  re-queries the bar flags (like READ/WRITE/MMAP/SPARSE ranges), and
  remaps the bar's subregions.

  c. the vendor driver reports bar 0 to be trapped (not MMAP'd).

  d. after remapping completes, userspace writes 0 to the eventfd so that
  the vendor driver waiting on it can complete too.

  (2) as bar 0 is now remapped to be trapped, the vendor driver is able
  to start tracking dirty pages in software.

  (3) when migration stops, similarly to what is done at migration start,
  the vendor driver signals to remap the bar back to un-trapped (MMAP'd),
  but it does not wait for userspace to write back for remapping
  completion.

- on release of this vendor driver, it frees the resources of the
vendor defined irqs.
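The eventfd handshake above (signal an or'ed bar mask, then wait for
userspace to ack completion by writing 0) can be sketched from user space
as a stand-alone round trip. This is only an illustration of the protocol,
not code from the patch; the helper name is made up:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Illustrative round trip of the remap protocol: the "vendor driver"
 * side signals an or'ed bar mask (1 << bar_number per bar), the
 * "userspace" side reads it, remaps the bars, and writes 0 back as
 * the completion ack. Hypothetical helper, for demonstration only.
 */
uint64_t remap_bar_roundtrip(uint64_t bar_mask)
{
    uint64_t seen = 0, done = 0;
    int fd = eventfd(0, 0);

    if (fd < 0)
        return 0;

    /* vendor driver side: eventfd_signal(trigger, 1 << bar_number) */
    (void)!write(fd, &bar_mask, sizeof(bar_mask));

    /* userspace side: read the mask, then re-query/remap each bar set */
    (void)!read(fd, &seen, sizeof(seen));

    /* userspace side: ack completion by writing 0 back */
    (void)!write(fd, &done, sizeof(done));

    close(fd);
    return seen;
}
```

Reading a non-semaphore eventfd returns the accumulated counter and resets
it to 0, which is why multiple or'ed bar bits arrive in one read.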

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Shaopeng He <shaopeng.he@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/net/ethernet/intel/Kconfig            |   2 +-
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 322 +++++++++++++++++-
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  26 ++
 3 files changed, 346 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 31780d9a59f1..6a52a197c4d8 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -266,7 +266,7 @@ config I40E_DCB
 
 config I40E_VF_MIGRATION
 	tristate "XL710 Family VF live migration support -- loadable modules only"
-	depends on I40E && VFIO_PCI && m
+	depends on I40E && VFIO_PCI && VFIO_VIRQFD && m
 	help
 	  Say m if you want to enable live migration of
 	  Virtual Functions of Intel(R) Ethernet Controller XL710
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index 107a291909b3..188829efaa19 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -19,6 +19,266 @@
 #define DRIVER_AUTHOR   "Intel Corporation"
 #define TEST_DIRTY_IOVA_PFN 0
 
+static int i40e_vf_remap_bars(struct i40e_vf_migration *i40e_vf_dev, bool wait)
+{
+	int bar_num = 0;
+
+	if (!i40e_vf_dev->remap_irq_ctx.init)
+		return -ENODEV;
+
+	/* set cnt to 2 as the wait handler will be entered two times:
+	 * once from this eventfd_signal,
+	 * once from the userspace ack back
+	 */
+	atomic_set(&i40e_vf_dev->remap_irq_ctx.cnt, 2);
+	eventfd_signal(i40e_vf_dev->remap_irq_ctx.trigger, 1 << bar_num);
+
+	if (!wait)
+		return 0;
+
+	/* the wait cannot be executed in vcpu threads, as the eventfd write
+	 * from userspace that we are waiting for would itself wait on a lock
+	 * the vcpu threads hold
+	 */
+	wait_event_killable(i40e_vf_dev->remap_irq_ctx.waitq,
+			    !atomic_read(&i40e_vf_dev->remap_irq_ctx.cnt));
+
+	return 0;
+}
+
+static int i40e_vf_remap_bar_wait_handler(void *opaque, void *unused)
+{
+	struct i40e_vf_migration *i40e_vf_dev = opaque;
+
+	atomic_dec_if_positive(&i40e_vf_dev->remap_irq_ctx.cnt);
+	wake_up(&i40e_vf_dev->remap_irq_ctx.waitq);
+	return 0;
+}
+
+static void i40e_vf_disable_remap_bars_irq(struct i40e_vf_migration *vf_dev)
+{
+	if (!vf_dev->remap_irq_ctx.init)
+		return;
+
+	if (vf_dev->remap_irq_ctx.sync)
+		vfio_virqfd_disable(&vf_dev->remap_irq_ctx.sync);
+
+	atomic_set(&vf_dev->remap_irq_ctx.cnt, 0);
+	wake_up(&vf_dev->remap_irq_ctx.waitq);
+
+	eventfd_ctx_put(vf_dev->remap_irq_ctx.trigger);
+	vf_dev->remap_irq_ctx.trigger = NULL;
+	vf_dev->remap_irq_ctx.init = false;
+}
+
+static int i40e_vf_enable_remap_bars_irq(struct i40e_vf_migration *vf_dev,
+					 struct eventfd_ctx *ctx, int32_t fd)
+{
+	int ret;
+
+	if (vf_dev->remap_irq_ctx.init)
+		return -EEXIST;
+
+	ret = vfio_virqfd_enable((void *)vf_dev,
+				 i40e_vf_remap_bar_wait_handler, NULL, ctx,
+				 &vf_dev->remap_irq_ctx.sync, fd);
+	if (ret) {
+		eventfd_ctx_put(ctx);
+		return ret;
+	}
+
+	init_waitqueue_head(&vf_dev->remap_irq_ctx.waitq);
+	atomic_set(&vf_dev->remap_irq_ctx.cnt, 0);
+	vf_dev->remap_irq_ctx.init = true;
+	vf_dev->remap_irq_ctx.trigger = ctx;
+	return 0;
+}
+
+static int i40e_vf_set_irq_remap_bars(struct i40e_vf_migration *i40e_vf_dev,
+				      u32 flags, unsigned int index,
+				      unsigned int start, unsigned int count,
+				      void *data)
+{
+	switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+	case VFIO_IRQ_SET_ACTION_MASK:
+	case VFIO_IRQ_SET_ACTION_UNMASK:
+		/* XXX Need masking support exported */
+		return 0;
+	case VFIO_IRQ_SET_ACTION_TRIGGER:
+		break;
+	default:
+		return 0;
+	}
+
+	if (start != 0 || count > 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		if (!count) {
+			i40e_vf_disable_remap_bars_irq(i40e_vf_dev);
+			return 0;
+		}
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		return -EINVAL;
+	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int fd;
+
+		if (!count || !data)
+			return -EINVAL;
+
+		fd = *(int32_t *)data;
+		if (fd == -1) {
+			i40e_vf_disable_remap_bars_irq(i40e_vf_dev);
+		} else if (fd >= 0) {
+			struct eventfd_ctx *efdctx;
+
+			efdctx = eventfd_ctx_fdget(fd);
+			if (IS_ERR(efdctx))
+				return PTR_ERR(efdctx);
+
+			i40e_vf_disable_remap_bars_irq(i40e_vf_dev);
+
+			return i40e_vf_enable_remap_bars_irq(i40e_vf_dev,
+							     efdctx, fd);
+		}
+		return 0;
+	}
+	return -EINVAL;
+}
+
+static const struct i40e_vf_irqops i40e_vf_irqops_remap_bars = {
+	.set_irqs = i40e_vf_set_irq_remap_bars,
+};
+
+static long i40e_vf_set_irqs(void *device_data,
+			     unsigned int cmd, unsigned long arg)
+{
+	struct vfio_irq_set hdr;
+	int index, ret;
+	u8 *data = NULL;
+	size_t data_size = 0;
+	unsigned long minsz;
+	struct i40e_vf_migration *i40e_vf_dev =
+		vfio_pci_vendor_data(device_data);
+
+	minsz = offsetofend(struct vfio_irq_set, count);
+	if (copy_from_user(&hdr, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz ||
+	    hdr.index >= VFIO_PCI_NUM_IRQS + i40e_vf_dev->num_irqs)
+		return -EINVAL;
+	if (hdr.index < VFIO_PCI_NUM_IRQS)
+		goto default_handle;
+
+	index = hdr.index - VFIO_PCI_NUM_IRQS;
+
+	ret = vfio_set_irqs_validate_and_prepare(&hdr,
+						 i40e_vf_dev->irqs[index].count,
+						 VFIO_PCI_NUM_IRQS +
+						 i40e_vf_dev->num_irqs,
+						 &data_size);
+	if (ret)
+		return ret;
+
+	if (data_size) {
+		data = memdup_user((void __user *)(arg + minsz), data_size);
+		if (IS_ERR(data))
+			return PTR_ERR(data);
+	}
+
+	ret = i40e_vf_dev->irqs[index].ops->set_irqs(i40e_vf_dev,
+						     hdr.flags, hdr.index,
+						     hdr.start, hdr.count,
+						     data);
+	kfree(data);
+	return ret;
+
+default_handle:
+	return vfio_pci_ioctl(device_data, cmd, arg);
+}
+
+static long i40e_vf_get_irq_info(void *device_data,
+				 unsigned int cmd, unsigned long arg)
+{
+	struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+	struct vfio_irq_info info;
+	int index, ret;
+	unsigned long minsz;
+	struct vfio_irq_info_cap_type cap_type = {
+		.header.id = VFIO_IRQ_INFO_CAP_TYPE,
+		.header.version = 1
+	};
+	struct i40e_vf_migration *i40e_vf_dev =
+		vfio_pci_vendor_data(device_data);
+
+	minsz = offsetofend(struct vfio_irq_info, count);
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz ||
+	    info.index >= VFIO_PCI_NUM_IRQS + i40e_vf_dev->num_irqs)
+		return -EINVAL;
+	if (info.index < VFIO_PCI_NUM_IRQS)
+		goto default_handle;
+
+	index = info.index - VFIO_PCI_NUM_IRQS;
+	info.flags = i40e_vf_dev->irqs[index].flags;
+	cap_type.type = i40e_vf_dev->irqs[index].type;
+	cap_type.subtype = i40e_vf_dev->irqs[index].subtype;
+
+	ret = vfio_info_add_capability(&caps, &cap_type.header,
+				       sizeof(cap_type));
+	if (ret)
+		return ret;
+
+	if (caps.size) {
+		info.flags |= VFIO_IRQ_INFO_FLAG_CAPS;
+		if (info.argsz < sizeof(info) + caps.size) {
+			info.argsz = sizeof(info) + caps.size;
+			info.cap_offset = 0;
+		} else {
+			vfio_info_cap_shift(&caps, sizeof(info));
+			if (copy_to_user((void __user *)arg + sizeof(info),
+					 caps.buf, caps.size)) {
+				kfree(caps.buf);
+				return -EFAULT;
+			}
+			info.cap_offset = sizeof(info);
+			if (offsetofend(struct vfio_irq_info, cap_offset) >
+					minsz)
+				minsz = offsetofend(struct vfio_irq_info,
+						    cap_offset);
+		}
+		kfree(caps.buf);
+	}
+	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+default_handle:
+	return vfio_pci_ioctl(device_data, cmd, arg);
+}
+
+static int i40e_vf_register_irq(struct i40e_vf_migration *i40e_vf_dev,
+				unsigned int type, unsigned int subtype,
+				u32 flags, const struct i40e_vf_irqops *ops)
+{
+	struct i40e_vf_irq *irqs;
+
+	irqs = krealloc(i40e_vf_dev->irqs,
+			(i40e_vf_dev->num_irqs + 1) * sizeof(*irqs),
+			GFP_KERNEL);
+	if (!irqs)
+		return -ENOMEM;
+
+	i40e_vf_dev->irqs = irqs;
+	i40e_vf_dev->irqs[i40e_vf_dev->num_irqs].type = type;
+	i40e_vf_dev->irqs[i40e_vf_dev->num_irqs].subtype = subtype;
+	i40e_vf_dev->irqs[i40e_vf_dev->num_irqs].count = 1;
+	i40e_vf_dev->irqs[i40e_vf_dev->num_irqs].flags = flags;
+	i40e_vf_dev->irqs[i40e_vf_dev->num_irqs].ops = ops;
+	i40e_vf_dev->num_irqs++;
+	return 0;
+}
 static int i40e_vf_iommu_notifier(struct notifier_block *nb,
 				  unsigned long action, void *data)
 {
@@ -100,6 +360,12 @@ static int i40e_vf_prepare_dirty_track(struct i40e_vf_migration *i40e_vf_dev)
 		goto out_group;
 	}
 
+	/* wait until bar 0 is remapped to be trapped (read/write) */
+	ret = i40e_vf_remap_bars(i40e_vf_dev, true);
+	if (ret) {
+		pr_err("failed to remap BAR 0\n");
+		goto out_group;
+	}
 	i40e_vf_dev->in_dirty_track = true;
 	return 0;
 
@@ -121,6 +387,8 @@ static void i40e_vf_stop_dirty_track(struct i40e_vf_migration *i40e_vf_dev)
 				 &i40e_vf_dev->iommu_notifier);
 	vfio_group_put_external_user(i40e_vf_dev->vfio_group);
 	i40e_vf_dev->in_dirty_track = false;
+	/* just notify userspace to remap bar 0, without waiting */
+	i40e_vf_remap_bars(i40e_vf_dev, false);
 }
 
 static size_t i40e_vf_set_device_state(struct i40e_vf_migration *i40e_vf_dev,
@@ -134,6 +402,8 @@ static size_t i40e_vf_set_device_state(struct i40e_vf_migration *i40e_vf_dev,
 
 	switch (state) {
 	case VFIO_DEVICE_STATE_RUNNING:
+		if (mig_ctl->device_state & VFIO_DEVICE_STATE_SAVING)
+			i40e_vf_stop_dirty_track(i40e_vf_dev);
 		break;
 	case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
 		ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
@@ -360,7 +630,25 @@ static long i40e_vf_get_region_info(void *device_data,
 		-EFAULT : 0;
 
 default_handle:
-	return vfio_pci_ioctl(device_data, cmd, arg);
+	ret = vfio_pci_ioctl(device_data, cmd, arg);
+	if (ret)
+		return ret;
+
+	if (info.index == VFIO_PCI_BAR0_REGION_INDEX) {
+		if (!i40e_vf_dev->in_dirty_track)
+			return ret;
+
+		/* read default handler's data back */
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		info.flags = VFIO_REGION_INFO_FLAG_READ |
+					VFIO_REGION_INFO_FLAG_WRITE;
+		/* update customized region info */
+		if (copy_to_user((void __user *)arg, &info, minsz))
+			return -EFAULT;
+	}
+	return ret;
 }
 
 static int i40e_vf_open(void *device_data)
@@ -392,10 +680,20 @@ static int i40e_vf_open(void *device_data)
 		if (ret)
 			goto error;
 
+		ret = i40e_vf_register_irq(i40e_vf_dev,
+					   VFIO_IRQ_TYPE_REMAP_BAR_REGION,
+					   VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION,
+					   VFIO_IRQ_INFO_MASKABLE |
+					   VFIO_IRQ_INFO_EVENTFD,
+					   &i40e_vf_irqops_remap_bars);
+		if (ret)
+			goto error;
+
 		i40e_vf_dev->mig_ctl = mig_ctl;
 		vfio_pci_set_vendor_regions(device_data,
 					    i40e_vf_dev->num_regions);
-		vfio_pci_set_vendor_irqs(device_data, 0);
+		vfio_pci_set_vendor_irqs(device_data,
+					 i40e_vf_dev->num_irqs);
 	}
 
 	ret = vfio_pci_open(device_data);
@@ -413,6 +711,9 @@ static int i40e_vf_open(void *device_data)
 		i40e_vf_dev->regions = NULL;
 		vfio_pci_set_vendor_regions(device_data, 0);
 		vfio_pci_set_vendor_irqs(device_data, 0);
+		kfree(i40e_vf_dev->irqs);
+		i40e_vf_dev->irqs = NULL;
+		i40e_vf_dev->num_irqs = 0;
 	}
 	module_put(THIS_MODULE);
 	mutex_unlock(&i40e_vf_dev->reflock);
@@ -436,7 +737,16 @@ void i40e_vf_release(void *device_data)
 		kfree(i40e_vf_dev->regions);
 		i40e_vf_dev->regions = NULL;
 		vfio_pci_set_vendor_regions(device_data, 0);
+
 		vfio_pci_set_vendor_irqs(device_data, 0);
+		for (i = 0; i < i40e_vf_dev->num_irqs; i++)
+			i40e_vf_dev->irqs[i].ops->set_irqs(i40e_vf_dev,
+					VFIO_IRQ_SET_DATA_NONE |
+					VFIO_IRQ_SET_ACTION_TRIGGER,
+					i, 0, 0, NULL);
+		kfree(i40e_vf_dev->irqs);
+		i40e_vf_dev->irqs = NULL;
+		i40e_vf_dev->num_irqs = 0;
 	}
 	vfio_pci_release(device_data);
 	mutex_unlock(&i40e_vf_dev->reflock);
@@ -448,6 +758,10 @@ static long i40e_vf_ioctl(void *device_data,
 {
 	if (cmd == VFIO_DEVICE_GET_REGION_INFO)
 		return i40e_vf_get_region_info(device_data, cmd, arg);
+	else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+		return i40e_vf_get_irq_info(device_data, cmd, arg);
+	else if (cmd == VFIO_DEVICE_SET_IRQS)
+		return i40e_vf_set_irqs(device_data, cmd, arg);
 
 	return vfio_pci_ioctl(device_data, cmd, arg);
 }
@@ -487,8 +801,10 @@ static ssize_t i40e_vf_write(void *device_data, const char __user *buf,
 	int num_vdev_regions = vfio_pci_num_regions(device_data);
 	int num_vendor_region = i40e_vf_dev->num_regions;
 
-	if (index == VFIO_PCI_BAR0_REGION_INDEX)
+	if (index == VFIO_PCI_BAR0_REGION_INDEX) {
+		pr_debug("vfio bar 0 write\n");
 		;// scan dirty pages
+	}
 
 	if (index < VFIO_PCI_NUM_REGIONS + num_vdev_regions)
 		return vfio_pci_write(device_data, buf, count, ppos);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
index 918ba275d5b5..2c4d9ebee4ac 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
@@ -46,6 +46,14 @@ struct pci_sriov {
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
 };
 
+struct i40e_vf_remap_irq_ctx {
+	struct eventfd_ctx	*trigger;
+	struct virqfd		*sync;
+	atomic_t		cnt;
+	wait_queue_head_t	waitq;
+	bool			init;
+};
+
 struct i40e_vf_migration {
 	__u32				vf_vendor;
 	__u32				vf_device;
@@ -58,11 +66,14 @@ struct i40e_vf_migration {
 
 	struct vfio_device_migration_info *mig_ctl;
 	bool				in_dirty_track;
+	struct i40e_vf_remap_irq_ctx  remap_irq_ctx;
 
 	struct i40e_vf_region		*regions;
 	int				num_regions;
 	struct notifier_block		iommu_notifier;
 	struct vfio_group		*vfio_group;
+	struct i40e_vf_irq		*irqs;
+	int				num_irqs;
 
 };
 
@@ -89,5 +100,20 @@ struct i40e_vf_region {
 	void				*data;
 };
 
+struct i40e_vf_irqops {
+	int (*set_irqs)(struct i40e_vf_migration *i40e_vf_dev,
+			u32 flags, unsigned int index,
+			unsigned int start, unsigned int count,
+			void *data);
+};
+
+struct i40e_vf_irq {
+	u32	type;
+	u32	subtype;
+	u32	flags;
+	u32	count;
+	const struct i40e_vf_irqops *ops;
+};
+
 #endif /* I40E_MIG_H */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [QEMU RFC PATCH v4] hw/vfio/pci: remap bar region irq
  2020-05-18  2:52 ` [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION Yan Zhao
@ 2020-05-18  2:56   ` Yan Zhao
  2020-05-29 21:45   ` [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION Alex Williamson
  1 sibling, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-05-18  2:56 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Yan Zhao

Add an irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION to
dynamically query and remap BAR regions.

QEMU decodes the indexes of the BARs by reading the count of the eventfd.
If bit n is set, the corresponding BAR is re-queried and
its subregions are remapped according to its new flags.

This relies on [1], "vfio: Add a funtion to return a specific irq capabilities".
[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg621645.html
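The decoding step can be sketched as a plain C helper (the function name is
hypothetical; PCI_ROM_SLOT is 6, the number of BARs before the ROM BAR, as
in QEMU's pci.h):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_ROM_SLOT 6  /* BARs 0..5 precede the expansion ROM BAR */

/*
 * Decode the or'ed bar mask read from the eventfd: bit n set means
 * BAR n must be re-queried and its subregions remapped. Collects the
 * bar indexes into bars[] and returns how many were set. Hypothetical
 * helper mirroring the loop in the notifier handler.
 */
int decode_bar_mask(uint64_t mask, int bars[PCI_ROM_SLOT])
{
    int i, n = 0;

    for (i = 0; i < PCI_ROM_SLOT; i++) {
        if (mask & (1ULL << i))
            bars[n++] = i;
    }
    return n;
}
```

For example, a count value of 5 ((1 << 0) | (1 << 2)) means BAR 0 and
BAR 2 both need to be re-queried.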

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 hw/vfio/common.c              | 50 ++++++++++++++++++++++++
 hw/vfio/pci.c                 | 90 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h                 |  2 +
 include/hw/vfio/vfio-common.h |  2 +
 linux-headers/linux/vfio.h    | 11 ++++++
 5 files changed, 155 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a041c3b..cf24293 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1284,6 +1284,56 @@ void vfio_region_unmap(VFIORegion *region)
     }
 }
 
+/*
+ * Re-query a region's flags and update its mmap'd subregions.
+ * Changing a region's size is not supported.
+ */
+void vfio_region_reset_mmap(VFIODevice *vbasedev, VFIORegion *region, int index)
+{
+    struct vfio_region_info *new;
+
+    if (!region->mem) {
+        return;
+    }
+
+    if (vfio_get_region_info(vbasedev, index, &new)) {
+        goto out;
+    }
+
+    if (region->size != new->size) {
+        error_report("vfio: resetting of region size is not supported");
+        goto out;
+    }
+
+    if (region->flags == new->flags) {
+        goto out;
+    }
+
+    /* unmap old mmap'd subregions, if any */
+    vfio_region_unmap(region);
+    region->nr_mmaps = 0;
+    g_free(region->mmaps);
+    region->mmaps = NULL;
+
+    /* set up new mmap'd subregions */
+    region->flags = new->flags;
+    if (vbasedev->no_mmap ||
+            !(region->flags & VFIO_REGION_INFO_FLAG_MMAP)) {
+        goto out;
+    }
+
+    if (vfio_setup_region_sparse_mmaps(region, new)) {
+        region->nr_mmaps = 1;
+        region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+        region->mmaps[0].offset = 0;
+        region->mmaps[0].size = region->size;
+    }
+    vfio_region_mmap(region);
+out:
+    g_free(new);
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c70f153..12998c5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2883,6 +2883,94 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }
 
+static void vfio_remap_bar_notifier_handler(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    uint64_t bars;
+    ssize_t ret;
+    int i;
+
+    ret = read(vdev->remap_bar_notifier.rfd, &bars, sizeof(bars));
+    if (ret != sizeof(bars)) {
+        return;
+    }
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        VFIORegion *region = &vdev->bars[i].region;
+
+        if (!test_bit(i, &bars)) {
+            continue;
+        }
+
+        vfio_region_reset_mmap(&vdev->vbasedev, region, i);
+    }
+
+    /* write 0 to notify kernel that we're done */
+    bars = 0;
+    write(vdev->remap_bar_notifier.wfd, &bars, sizeof(bars));
+}
+
+static void vfio_register_remap_bar_notifier(VFIOPCIDevice *vdev)
+{
+    int ret;
+    struct vfio_irq_info *irq;
+    Error *err = NULL;
+    int32_t fd;
+
+    ret = vfio_get_dev_irq_info(&vdev->vbasedev,
+                                VFIO_IRQ_TYPE_REMAP_BAR_REGION,
+                                VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION,
+                                &irq);
+    if (ret) {
+        return;
+    }
+    ret = event_notifier_init(&vdev->remap_bar_notifier, 0);
+    if (ret) {
+        error_report("vfio: Failed to init event notifier for remap bar irq");
+        return;
+    }
+
+    fd = event_notifier_get_fd(&vdev->remap_bar_notifier);
+    qemu_set_fd_handler(fd, vfio_remap_bar_notifier_handler, NULL, vdev);
+
+    if (vfio_set_irq_signaling(&vdev->vbasedev, irq->index, 0,
+                               VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
+        error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+        qemu_set_fd_handler(fd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->remap_bar_notifier);
+    } else {
+        vdev->remap_bar_enabled = true;
+    }
+}
+
+static void vfio_unregister_remap_bar_notifier(VFIOPCIDevice *vdev)
+{
+    struct vfio_irq_info *irq;
+    Error *err = NULL;
+    int ret;
+
+    if (!vdev->remap_bar_enabled) {
+        return;
+    }
+
+    ret = vfio_get_dev_irq_info(&vdev->vbasedev,
+                                VFIO_IRQ_TYPE_REMAP_BAR_REGION,
+                                VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION,
+                                &irq);
+    if (ret) {
+        return;
+    }
+
+    if (vfio_set_irq_signaling(&vdev->vbasedev, irq->index, 0,
+                               VFIO_IRQ_SET_ACTION_TRIGGER, -1, &err)) {
+        error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+    }
+    qemu_set_fd_handler(event_notifier_get_fd(&vdev->remap_bar_notifier),
+                        NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->remap_bar_notifier);
+
+    vdev->remap_bar_enabled = false;
+}
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
@@ -3194,6 +3282,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
+    vfio_register_remap_bar_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
 
     return;
@@ -3235,6 +3324,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
+    vfio_unregister_remap_bar_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     if (vdev->irqchip_change_notifier.notify) {
         kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index b148c93..5a1e564 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    EventNotifier remap_bar_notifier;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
@@ -157,6 +158,7 @@ typedef struct VFIOPCIDevice {
     uint8_t nv_gpudirect_clique;
     bool pci_aer;
     bool req_enabled;
+    bool remap_bar_enabled;
     bool has_flr;
     bool has_pm_reset;
     bool rom_read_failed;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a6283b7..1c16790 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -188,6 +188,8 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
 void vfio_region_unmap(VFIORegion *region);
+void vfio_region_reset_mmap(VFIODevice *vbasedev,
+                            VFIORegion *region, int index);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 2598a84..2344ca6 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -703,6 +703,17 @@ struct vfio_irq_info_cap_type {
 	__u32 subtype;  /* type specific */
 };
 
+/* Bar Region Query IRQ TYPE */
+#define VFIO_IRQ_TYPE_REMAP_BAR_REGION                 (1)
+
+/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
+/*
+ * This irq notifies userspace to re-query the BAR region and remap its
+ * subregions.
+ */
+#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION      (0)
+
+
 /**
  * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
  *
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 05/10] vfio/pci: export vfio_pci_get_barmap
  2020-05-18  2:50 ` [RFC PATCH v4 05/10] vfio/pci: export vfio_pci_get_barmap Yan Zhao
@ 2020-05-18  6:37   ` kbuild test robot
  0 siblings, 0 replies; 42+ messages in thread
From: kbuild test robot @ 2020-05-18  6:37 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

Hi Yan,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on vfio/next]
[also build test WARNING on jkirsher-next-queue/dev-queue linus/master v5.7-rc6 next-20200515]
[cannot apply to linux/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yan-Zhao/Introduce-vendor-ops-in-vfio-pci/20200518-110542
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-allyesconfig (attached as .config)
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-193-gb8fad4bc-dirty
        # save the attached .config to linux build tree
        make C=1 ARCH=x86_64 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> drivers/vfio/pci/vfio_pci_rdwr.c:162:20: sparse: sparse: incompatible types in conditional expression (different address spaces):
>> drivers/vfio/pci/vfio_pci_rdwr.c:162:20: sparse:    void *
>> drivers/vfio/pci/vfio_pci_rdwr.c:162:20: sparse:    void [noderef] <asn:2> *

vim +162 drivers/vfio/pci/vfio_pci_rdwr.c

   155	
   156	void __iomem *vfio_pci_get_barmap(void *device_data, int bar)
   157	{
   158		int ret;
   159		struct vfio_pci_device *vdev = device_data;
   160	
   161		ret = vfio_pci_setup_barmap(vdev, bar);
 > 162		return ret ? ERR_PTR(ret) : vdev->barmap[bar];
   163	}
   164	EXPORT_SYMBOL_GPL(vfio_pci_get_barmap);
   165	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 72437 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 09/10] i40e/vf_migration: register a migration vendor region
  2020-05-18  2:54 ` [RFC PATCH v4 09/10] i40e/vf_migration: register a migration vendor region Yan Zhao
@ 2020-05-18  6:47   ` kbuild test robot
  0 siblings, 0 replies; 42+ messages in thread
From: kbuild test robot @ 2020-05-18  6:47 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 10815 bytes --]

Hi Yan,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on vfio/next]
[also build test ERROR on jkirsher-next-queue/dev-queue linus/master v5.7-rc6 next-20200515]
[cannot apply to linux/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yan-Zhao/Introduce-vendor-ops-in-vfio-pci/20200518-110542
base:   https://github.com/awilliam/linux-vfio.git next
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>, old ones prefixed by <<):

drivers/net/ethernet/intel/i40e/i40e_vf_migration.c: In function 'i40e_vf_set_device_state':
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:132:22: error: dereferencing pointer to incomplete type 'struct vfio_device_migration_info'
if (state == mig_ctl->device_state)
^~
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:136:7: error: 'VFIO_DEVICE_STATE_RUNNING' undeclared (first use in this function); did you mean 'VFIO_EEH_PE_STATE_UNAVAIL'?
case VFIO_DEVICE_STATE_RUNNING:
^~~~~~~~~~~~~~~~~~~~~~~~~
VFIO_EEH_PE_STATE_UNAVAIL
drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:136:7: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:138:7: error: 'VFIO_DEVICE_STATE_SAVING' undeclared (first use in this function); did you mean 'VFIO_DEVICE_STATE_RUNNING'?
case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
^~~~~~~~~~~~~~~~~~~~~~~~
VFIO_DEVICE_STATE_RUNNING
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:144:7: error: 'VFIO_DEVICE_STATE_STOP' undeclared (first use in this function); did you mean 'VFIO_DEVICE_STATE_SAVING'?
case VFIO_DEVICE_STATE_STOP:
^~~~~~~~~~~~~~~~~~~~~~
VFIO_DEVICE_STATE_SAVING
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:149:7: error: 'VFIO_DEVICE_STATE_RESUMING' undeclared (first use in this function); did you mean 'VFIO_DEVICE_STATE_SAVING'?
case VFIO_DEVICE_STATE_RESUMING:
^~~~~~~~~~~~~~~~~~~~~~~~~~
VFIO_DEVICE_STATE_SAVING
In file included from <command-line>:0:0:
drivers/net/ethernet/intel/i40e/i40e_vf_migration.c: In function 'i40e_vf_region_migration_rw':
>> include/linux/compiler_types.h:129:35: error: invalid use of undefined type 'struct vfio_device_migration_info'
#define __compiler_offsetof(a, b) __builtin_offsetof(a, b)
^
include/linux/stddef.h:17:32: note: in expansion of macro '__compiler_offsetof'
#define offsetof(TYPE, MEMBER) __compiler_offsetof(TYPE, MEMBER)
^~~~~~~~~~~~~~~~~~~
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:166:23: note: in expansion of macro 'offsetof'
#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
^~~~~~~~
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:172:7: note: in expansion of macro 'VDM_OFFSET'
case VDM_OFFSET(device_state):
^~~~~~~~~~
>> include/linux/compiler_types.h:129:35: error: invalid use of undefined type 'struct vfio_device_migration_info'
#define __compiler_offsetof(a, b) __builtin_offsetof(a, b)
^
include/linux/stddef.h:17:32: note: in expansion of macro '__compiler_offsetof'
#define offsetof(TYPE, MEMBER) __compiler_offsetof(TYPE, MEMBER)
^~~~~~~~~~~~~~~~~~~
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:166:23: note: in expansion of macro 'offsetof'
#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
^~~~~~~~
drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:195:7: note: in expansion of macro 'VDM_OFFSET'
case VDM_OFFSET(reserved):
^~~~~~~~~~
>> include/linux/compiler_types.h:129:35: error: invalid use of undefined type 'struct vfio_device_migration_info'
#define __compiler_offsetof(a, b) __builtin_offsetof(a, b)
^
include/linux/stddef.h:17:32: note: in expansion of macro '__compiler_offsetof'
#define offsetof(TYPE, MEMBER) __compiler_offsetof(TYPE, MEMBER)
^~~~~~~~~~~~~~~~~~~
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:166:23: note: in expansion of macro 'offsetof'
#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
^~~~~~~~
drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:199:7: note: in expansion of macro 'VDM_OFFSET'
case VDM_OFFSET(pending_bytes):
^~~~~~~~~~
>> include/linux/compiler_types.h:129:35: error: invalid use of undefined type 'struct vfio_device_migration_info'
#define __compiler_offsetof(a, b) __builtin_offsetof(a, b)
^
include/linux/stddef.h:17:32: note: in expansion of macro '__compiler_offsetof'
#define offsetof(TYPE, MEMBER) __compiler_offsetof(TYPE, MEMBER)
^~~~~~~~~~~~~~~~~~~
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:166:23: note: in expansion of macro 'offsetof'
#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
^~~~~~~~
drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:216:7: note: in expansion of macro 'VDM_OFFSET'
case VDM_OFFSET(data_offset):
^~~~~~~~~~
>> include/linux/compiler_types.h:129:35: error: invalid use of undefined type 'struct vfio_device_migration_info'
#define __compiler_offsetof(a, b) __builtin_offsetof(a, b)
^
include/linux/stddef.h:17:32: note: in expansion of macro '__compiler_offsetof'
#define offsetof(TYPE, MEMBER) __compiler_offsetof(TYPE, MEMBER)
^~~~~~~~~~~~~~~~~~~
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:166:23: note: in expansion of macro 'offsetof'
#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
^~~~~~~~
drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:225:7: note: in expansion of macro 'VDM_OFFSET'
case VDM_OFFSET(data_size):
^~~~~~~~~~
drivers/net/ethernet/intel/i40e/i40e_vf_migration.c: In function 'i40e_vf_open':
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:385:12: error: 'VFIO_REGION_TYPE_MIGRATION' undeclared (first use in this function); did you mean 'VFIO_REGION_TYPE_GFX'?
VFIO_REGION_TYPE_MIGRATION,
^~~~~~~~~~~~~~~~~~~~~~~~~~
VFIO_REGION_TYPE_GFX
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:386:12: error: 'VFIO_REGION_SUBTYPE_MIGRATION' undeclared (first use in this function); did you mean 'VFIO_REGION_TYPE_MIGRATION'?
VFIO_REGION_SUBTYPE_MIGRATION,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
VFIO_REGION_TYPE_MIGRATION
In file included from drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:16:0:
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.h:20:37: error: invalid application of 'sizeof' to incomplete type 'struct vfio_device_migration_info'
#define MIGRATION_REGION_SZ (sizeof(struct vfio_device_migration_info))
^
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:388:12: note: in expansion of macro 'MIGRATION_REGION_SZ'
MIGRATION_REGION_SZ,
^~~~~~~~~~~~~~~~~~~

vim +132 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c

   125	
   126	static size_t i40e_vf_set_device_state(struct i40e_vf_migration *i40e_vf_dev,
   127					       u32 state)
   128	{
   129		int ret = 0;
   130		struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
   131	
 > 132		if (state == mig_ctl->device_state)
   133			return 0;
   134	
   135		switch (state) {
 > 136		case VFIO_DEVICE_STATE_RUNNING:
   137			break;
 > 138		case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
   139			ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
   140			break;
   141		case VFIO_DEVICE_STATE_SAVING:
   142			// do the last round of dirty page scanning
   143			break;
 > 144		case VFIO_DEVICE_STATE_STOP:
   145			// release dirty page tracking resources
   146			if (mig_ctl->device_state == VFIO_DEVICE_STATE_SAVING)
   147				i40e_vf_stop_dirty_track(i40e_vf_dev);
   148			break;
 > 149		case VFIO_DEVICE_STATE_RESUMING:
   150			break;
   151		default:
   152			ret = -EFAULT;
   153		}
   154	
   155		if (!ret)
   156			mig_ctl->device_state = state;
   157	
   158		return ret;
   159	}
   160	
   161	static
   162	ssize_t i40e_vf_region_migration_rw(struct i40e_vf_migration *i40e_vf_dev,
   163					    char __user *buf, size_t count,
   164					    loff_t *ppos, bool iswrite)
   165	{
 > 166	#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
   167		struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
   168		u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
   169		int ret = 0;
   170	
   171		switch (pos) {
 > 172		case VDM_OFFSET(device_state):
   173			if (count != sizeof(mig_ctl->device_state)) {
   174				ret = -EINVAL;
   175				break;
   176			}
   177	
   178			if (iswrite) {
   179				u32 device_state;
   180	
   181				if (copy_from_user(&device_state, buf, count)) {
   182					ret = -EFAULT;
   183					break;
   184				}
   185	
   186				ret = i40e_vf_set_device_state(i40e_vf_dev,
   187							       device_state) ?
   188							       ret : count;
   189			} else {
   190				ret = copy_to_user(buf, &mig_ctl->device_state,
   191						   count) ? -EFAULT : count;
   192			}
   193			break;
   194	
   195		case VDM_OFFSET(reserved):
   196			ret = -EFAULT;
   197			break;
   198	
   199		case VDM_OFFSET(pending_bytes):
   200			{
   201				if (count != sizeof(mig_ctl->pending_bytes)) {
   202					ret = -EINVAL;
   203					break;
   204				}
   205	
   206				if (iswrite)
   207					ret = -EFAULT;
   208				else
   209					ret = copy_to_user(buf,
   210							   &mig_ctl->pending_bytes,
   211							   count) ? -EFAULT : count;
   212	
   213				break;
   214			}
   215	
   216		case VDM_OFFSET(data_offset):
   217			{
   218				/* as we don't support device internal dirty data
   219				 * and our pending_bytes is always 0,
   220				 * return error here.
   221				 */
   222				ret = -EFAULT;
   223				break;
   224			}
   225		case VDM_OFFSET(data_size):
   226			if (count != sizeof(mig_ctl->data_size)) {
   227				ret = -EINVAL;
   228				break;
   229			}
   230	
   231			if (iswrite)
   232				ret = copy_from_user(&mig_ctl->data_size, buf, count) ?
   233						     -EFAULT : count;
   234			else
   235				ret = copy_to_user(buf, &mig_ctl->data_size, count) ?
   236						   -EFAULT : count;
   237			break;
   238	
   239		default:
   240			ret = -EFAULT;
   241			break;
   242		}
   243		return ret;
   244	}
   245	
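[Editorial note: the diagnostics above all trace back to `struct vfio_device_migration_info` being an incomplete type in this build configuration; both `offsetof()` and `sizeof` need the full definition visible at compile time. The dispatch pattern itself is sound once the struct is defined, as in this standalone sketch. The struct layout below is a stand-in for illustration only, not the actual vfio uapi header.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in definition for illustration only; the real layout lives in
 * the vfio uapi header.  If this definition is missing, every use of
 * offsetof() below fails with "invalid use of undefined type", which
 * is exactly the error reported above. */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
};

#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)

/* Map a register offset to a field name, mirroring the switch in
 * i40e_vf_region_migration_rw().  offsetof() yields an integer
 * constant expression, so it is valid as a case label. */
static const char *vdm_field_name(size_t pos)
{
	switch (pos) {
	case VDM_OFFSET(device_state):	return "device_state";
	case VDM_OFFSET(reserved):	return "reserved";
	case VDM_OFFSET(pending_bytes):	return "pending_bytes";
	case VDM_OFFSET(data_offset):	return "data_offset";
	case VDM_OFFSET(data_size):	return "data_size";
	default:			return "unknown";
	}
}
```

Compiling this with the struct definition removed reproduces the "invalid use of undefined type" diagnostics quoted in the report.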

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 71443 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first
  2020-05-18  2:53 ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Yan Zhao
@ 2020-05-18  8:49   ` kbuild test robot
  2020-05-18  8:49   ` [RFC PATCH] i40e/vf_migration: i40e_vf_release() can be static kbuild test robot
  2020-06-10  8:59   ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Xiang Zheng
  2 siblings, 0 replies; 42+ messages in thread
From: kbuild test robot @ 2020-05-18  8:49 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1545 bytes --]

Hi Yan,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on vfio/next]
[also build test WARNING on jkirsher-next-queue/dev-queue linus/master v5.7-rc6 next-20200515]
[cannot apply to linux/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yan-Zhao/Introduce-vendor-ops-in-vfio-pci/20200518-110542
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-allyesconfig (attached as .config)
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-193-gb8fad4bc-dirty
        # save the attached .config to linux build tree
        make C=1 ARCH=x86_64 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:54:6: sparse: sparse: symbol 'i40e_vf_release' was not declared. Should it be static?
>> drivers/net/ethernet/intel/i40e/i40e_vf_migration.c:108:6: sparse: sparse: symbol 'i40e_vf_probe' was not declared. Should it be static?

Please review and possibly fold the followup patch.


[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 72450 bytes --]


* [RFC PATCH] i40e/vf_migration: i40e_vf_release() can be static
  2020-05-18  2:53 ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Yan Zhao
  2020-05-18  8:49   ` kbuild test robot
@ 2020-05-18  8:49   ` kbuild test robot
  2020-06-10  8:59   ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Xiang Zheng
  2 siblings, 0 replies; 42+ messages in thread
From: kbuild test robot @ 2020-05-18  8:49 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 985 bytes --]


Signed-off-by: kbuild test robot <lkp@intel.com>
---
 i40e_vf_migration.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index 96026dcf5c9df..d222c7531fa6e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -51,7 +51,7 @@ static int i40e_vf_open(void *device_data)
 	return ret;
 }
 
-void i40e_vf_release(void *device_data)
+static void i40e_vf_release(void *device_data)
 {
 	struct i40e_vf_migration *i40e_vf_dev =
 		vfio_pci_vendor_data(device_data);
@@ -105,7 +105,7 @@ static struct vfio_device_ops i40e_vf_device_ops_node = {
 	.request	= i40e_vf_request,
 };
 
-void *i40e_vf_probe(struct pci_dev *pdev)
+static void *i40e_vf_probe(struct pci_dev *pdev)
 {
 	struct i40e_vf_migration *i40e_vf_dev = NULL;
 	struct pci_dev *pf_dev, *vf_dev;


* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-05-18  2:52 ` [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION Yan Zhao
  2020-05-18  2:56   ` [QEMU RFC PATCH v4] hw/vfio/pci: remap bar region irq Yan Zhao
@ 2020-05-29 21:45   ` Alex Williamson
  2020-06-01  6:57     ` Yan Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2020-05-29 21:45 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Sun, 17 May 2020 22:52:45 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> This is a virtual irq type.
> vendor driver triggers this irq when it wants to notify userspace to
> remap PCI BARs.
> 
> 1. vendor driver triggers this irq and packs the target bar number in
>    the ctx count. i.e. "1 << bar_number".
>    if a bit is set, the corresponding bar is to be remapped.
> 
> 2. userspace requery the specified PCI BAR from kernel and if flags of
> the bar regions are changed, it removes the old subregions and attaches
> subregions according to the new flags.
> 
> 3. userspace notifies back to kernel by writing one to the eventfd of
> this irq.
> 
> Please check the corresponding qemu implementation from the reply of this
> patch, and a sample usage in vendor driver in patch [10/10].
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  include/uapi/linux/vfio.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2d0d85c7c4d4..55895f75d720 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
>  	__u32 subtype;  /* type specific */
>  };
>  
> +/* Bar Region Query IRQ TYPE */
> +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION			(1)
> +
> +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> +/*
> + * This irq notifies userspace to re-query BAR region and remaps the
> + * subregions.
> + */
> +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION	(0)

Hi Yan,

How do we do this in a way that's backwards compatible?  Or maybe, how
do we perform a handshake between the vendor driver and userspace to
indicate this support?  Would the vendor driver refuse to change
device_state in the migration region if the user has not enabled this
IRQ?

Everything you've described in the commit log needs to be in this
header, we can't have the usage protocol buried in a commit log.  It
also seems like this is unnecessarily PCI specific.  Can't the count
bitmap simply indicate the region index to re-evaluate?  Maybe you were
worried about running out of bits in the ctx count?  An IRQ per region
could resolve that, but maybe we could also just add another IRQ for
the next bitmap of regions.  I assume that the bitmap can indicate
multiple regions to re-evaluate, but that should be documented.

Also, what sort of service requirements does this imply?  Would the
vendor driver send this IRQ when the user tries to set the device_state
to _SAVING and therefore we'd require the user to accept, implement the
mapping change, and acknowledge the IRQ all while waiting for the write
to device_state to return?  That implies quite a lot of asynchronous
support in the userspace driver.  Thanks,

Alex

> +
> +
>  /**
>   * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
>   *



* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-05-29 21:45   ` [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION Alex Williamson
@ 2020-06-01  6:57     ` Yan Zhao
  2020-06-01 16:43       ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-01  6:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Fri, May 29, 2020 at 03:45:47PM -0600, Alex Williamson wrote:
> On Sun, 17 May 2020 22:52:45 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > This is a virtual irq type.
> > vendor driver triggers this irq when it wants to notify userspace to
> > remap PCI BARs.
> > 
> > 1. vendor driver triggers this irq and packs the target bar number in
> >    the ctx count. i.e. "1 << bar_number".
> >    if a bit is set, the corresponding bar is to be remapped.
> > 
> > 2. userspace requery the specified PCI BAR from kernel and if flags of
> > the bar regions are changed, it removes the old subregions and attaches
> > subregions according to the new flags.
> > 
> > 3. userspace notifies back to kernel by writing one to the eventfd of
> > this irq.
> > 
> > Please check the corresponding qemu implementation from the reply of this
> > patch, and a sample usage in vendor driver in patch [10/10].
> > 
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  include/uapi/linux/vfio.h | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 2d0d85c7c4d4..55895f75d720 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
> >  	__u32 subtype;  /* type specific */
> >  };
> >  
> > +/* Bar Region Query IRQ TYPE */
> > +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION			(1)
> > +
> > +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> > +/*
> > + * This irq notifies userspace to re-query BAR region and remaps the
> > + * subregions.
> > + */
> > +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION	(0)
> 
> Hi Yan,
> 
> How do we do this in a way that's backwards compatible?  Or maybe, how
> do we perform a handshake between the vendor driver and userspace to
> indicate this support?
hi Alex
thank you for your thoughtful review!

do you think below sequence can provide enough backwards compatibility?

- on vendor driver opening, it registers an irq of type
  VFIO_IRQ_TYPE_REMAP_BAR_REGION, and reports to driver vfio-pci there's
  1 vendor irq.

- after userspace detects the irq of type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  it enables it by signaling ACTION_TRIGGER.
  
- on receiving this ACTION_TRIGGER, vendor driver will try to setup a
  virqfd to monitor file write to the fd of this irq, enable this irq
  and return its enabling status to userspace.


> Would the vendor driver refuse to change
> device_state in the migration region if the user has not enabled this
> IRQ?
yes, vendor driver can refuse to change device_state if the irq
VFIO_IRQ_TYPE_REMAP_BAR_REGION is not enabled.
in my sample i40e_vf driver (patch 10/10), it implemented this logic
like below:

i40e_vf_set_device_state
    |-> case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
    |          ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
                              |->ret = i40e_vf_remap_bars(i40e_vf_dev, true);
			                     |->if (!i40e_vf_dev->remap_irq_ctx.init)
                                                    return -ENODEV;


(i40e_vf_dev->remap_irq_ctx.init is set in below path)
i40e_vf_ioctl(cmd==VFIO_DEVICE_SET_IRQS)
    |->i40e_vf_set_irq_remap_bars
       |->i40e_vf_enable_remap_bars_irq
           |-> vf_dev->remap_irq_ctx.init = true;

> 
> Everything you've described in the commit log needs to be in this
> header, we can't have the usage protocol buried in a commit log.  It
got it! I'll move all descriptions in commit logs to this header so that
readers can understand the whole picture here.

> also seems like this is unnecessarily PCI specific.  Can't the count
> bitmap simply indicate the region index to re-evaluate?  Maybe you were
yes, it is possible. but what prevented me from doing it is that it's not
easy to write an irq handler in qemu to remap other regions dynamically.

for BAR regions, there're 3 layers as below.
1. bar->mr  -->bottom layer
2. bar->region.mem --> slow path
3. bar->region->mmaps[i].mem  --> fast path
so, bar remap irq handler can simply re-revaluate the region and
remove/re-generate the layer 3 (fast path) without losing track of any
guest accesses to the bar regions.

actually so far, the bar remap irq handler in qemu only supports remap
mmap'd subregions (layout of mmap'd subregions are re-queried) and
not supports updating the whole bar region size.
(do you think updating bar region size is a must?)

however, there are no such fast path and slow path in other regions, so
remap handlers for them are region specific.

> worried about running out of bits in the ctx count?  An IRQ per region
yes. that's also possible :) 
but current ctx count is 64bit, so it can support regions of index up to 63.
if we don't need to remap dev regions, seems it's enough?

> could resolve that, but maybe we could also just add another IRQ for
> the next bitmap of regions.  I assume that the bitmap can indicate
> multiple regions to re-evaluate, but that should be documented.
hmm. would you mind elaborating more about it?

> 
> Also, what sort of service requirements does this imply?  Would the
> vendor driver send this IRQ when the user tries to set the device_state
> to _SAVING and therefore we'd require the user to accept, implement the
> mapping change, and acknowledge the IRQ all while waiting for the write
> to device_state to return?  That implies quite a lot of asynchronous
> support in the userspace driver.  Thanks,
yes.
(1) when user sets device_state to _SAVING, the vendor driver notifies this
IRQ, waits until user IRQ ack is received.
(2) in IRQ handler, user decodes and sends IRQ ack to vendor driver.

if a wait is required in (1) returns, it demands the qemu_mutex_iothread is
not locked in migration thread when device_state is set in (1), as before
entering (2), acquiring of this mutex is required.

Currently, this lock is not hold in vfio_migration_set_state() at
save_setup stage but is hold in stop and copy stage. so we wait in
kernel in save_setup stage and not wait in stop stage.
it can be fixed by calling qemu_mutex_unlock_iothread() on entering
vfio_migration_set_state() and qemu_mutex_lock_iothread() on leaving
vfio_migration_set_state() in qemu.

do you think it's acceptable?

Thanks
Yan
> 
> 
> > +
> > +
> >  /**
> >   * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
> >   *
> 

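[Editorial note: the bit-per-region convention discussed in the exchange above packs "1 << bar_number" (or, more generally, "1 << region_index") into the 64-bit eventfd ctx count, covering region indexes 0 through 63. A minimal decoding helper might look like the sketch below; it is illustrative only and not part of any proposed uapi.]

```c
#include <assert.h>
#include <stdint.h>

/* Pop the lowest region index out of a remap-IRQ ctx count bitmap:
 * bit n set means region n must have its sparse mmap re-queried.
 * Returns -1 once all indicated regions have been consumed. */
static int next_remap_region(uint64_t *count)
{
	int idx;

	if (!*count)
		return -1;		/* nothing left to re-evaluate */
	idx = __builtin_ctzll(*count);	/* index of lowest set bit */
	*count &= *count - 1;		/* clear that bit */
	return idx;
}
```

A userspace handler would loop on this until it returns -1, re-querying REGION_INFO for each index, then ack the IRQ; note `__builtin_ctzll` assumes a GCC/Clang toolchain.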

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-01  6:57     ` Yan Zhao
@ 2020-06-01 16:43       ` Alex Williamson
  2020-06-02  8:28         ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2020-06-01 16:43 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Mon, 1 Jun 2020 02:57:26 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, May 29, 2020 at 03:45:47PM -0600, Alex Williamson wrote:
> > On Sun, 17 May 2020 22:52:45 -0400
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > This is a virtual irq type.
> > > vendor driver triggers this irq when it wants to notify userspace to
> > > remap PCI BARs.
> > > 
> > > 1. vendor driver triggers this irq and packs the target bar number in
> > >    the ctx count. i.e. "1 << bar_number".
> > >    if a bit is set, the corresponding bar is to be remapped.
> > > 
> > > 2. userspace requery the specified PCI BAR from kernel and if flags of
> > > the bar regions are changed, it removes the old subregions and attaches
> > > subregions according to the new flags.
> > > 
> > > 3. userspace notifies back to kernel by writing one to the eventfd of
> > > this irq.
> > > 
> > > Please check the corresponding qemu implementation from the reply of this
> > > patch, and a sample usage in vendor driver in patch [10/10].
> > > 
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  include/uapi/linux/vfio.h | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 2d0d85c7c4d4..55895f75d720 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
> > >  	__u32 subtype;  /* type specific */
> > >  };
> > >  
> > > +/* Bar Region Query IRQ TYPE */
> > > +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION			(1)
> > > +
> > > +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> > > +/*
> > > + * This irq notifies userspace to re-query BAR region and remaps the
> > > + * subregions.
> > > + */
> > > +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION	(0)  
> > 
> > Hi Yan,
> > 
> > How do we do this in a way that's backwards compatible?  Or maybe, how
> > do we perform a handshake between the vendor driver and userspace to
> > indicate this support?  
> hi Alex
> thank you for your thoughtful review!
> 
> do you think below sequence can provide enough backwards compatibility?
> 
> - on vendor driver opening, it registers an irq of type
>   VFIO_IRQ_TYPE_REMAP_BAR_REGION, and reports to driver vfio-pci there's
>   1 vendor irq.
> 
> - after userspace detects the irq of type VFIO_IRQ_TYPE_REMAP_BAR_REGION
>   it enables it by signaling ACTION_TRIGGER.
>   
> - on receiving this ACTION_TRIGGER, vendor driver will try to setup a
>   virqfd to monitor file write to the fd of this irq, enable this irq
>   and return its enabling status to userspace.

I'm not sure I follow here, what's the purpose of the irqfd?  When and
what does the user signal by writing to the irqfd?  Is this an ACK
mechanism?  Is this a different fd from the signaling eventfd?

> > Would the vendor driver refuse to change
> > device_state in the migration region if the user has not enabled this
> > IRQ?  
> yes, vendor driver can refuse to change device_state if the irq
> VFIO_IRQ_TYPE_REMAP_BAR_REGION is not enabled.
> in my sample i40e_vf driver (patch 10/10), it implemented this logic
> like below:
> 
> i40e_vf_set_device_state
>     |-> case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
>     |          ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
>                               |->ret = i40e_vf_remap_bars(i40e_vf_dev, true);
> 			                     |->if (!i40e_vf_dev->remap_irq_ctx.init)
>                                                     return -ENODEV;
> 
> 
> (i40e_vf_dev->remap_irq_ctx.init is set in below path)
> i40e_vf_ioctl(cmd==VFIO_DEVICE_SET_IRQS)
>     |->i40e_vf_set_irq_remap_bars
>        |->i40e_vf_enable_remap_bars_irq
>            |-> vf_dev->remap_irq_ctx.init = true;

This should be a documented aspect of the uapi, not left to vendor
discretion to implement.
 
> > 
> > Everything you've described in the commit log needs to be in this
> > header, we can't have the usage protocol buried in a commit log.  It  
> got it! I'll move all descriptions in commit logs to this header so that
> readers can understand the whole picture here.
> 
> > also seems like this is unnecessarily PCI specific.  Can't the count
> > bitmap simply indicate the region index to re-evaluate?  Maybe you were  
> yes, it is possible. but what prevented me from doing it is that it's not
> easy to write an irq handler in qemu to remap other regions dynamically.
> 
> for BAR regions, there're 3 layers as below.
> 1. bar->mr  -->bottom layer
> 2. bar->region.mem --> slow path
> 3. bar->region->mmaps[i].mem  --> fast path
> so, bar remap irq handler can simply re-revaluate the region and
> remove/re-generate the layer 3 (fast path) without losing track of any
> guest accesses to the bar regions.
> 
> actually so far, the bar remap irq handler in qemu only supports remap
> mmap'd subregions (layout of mmap'd subregions are re-queried) and
> not supports updating the whole bar region size.
> (do you think updating bar region size is a must?)

It depends on whether our interrupt is defined that the user should
re-evaluate the entire region_info or just the spare mmap capability.
A device spontaneously changing region size seems like a much more
abstract problem.  We do need to figure out how to support resizeable
BARs, but it seems that would be at the direction of the user, for
example emulating the resizeable BAR capability and requiring userspace
to re-evaluate the region_info after interacting with that emulation.
So long as we specify that this IRQ is limited to re-evaluating the
sparse mmap capability for the indicated regions, I don't think we need
to handle the remainder of region_info spontaneously changing.

> however, there are no such fast path and slow path in other regions, so
> remap handlers for them are region specific.

QEMU support for re-evaluating arbitrary regions for sparse mmap
changes should not limit our kernel implementation.  Maybe it does
suggest though that userspace should be informed of the region indexes
subject to re-evaluation such that it can choose to ignore this
interrupt (and lose the features enabled by the IRQ), if it doesn't
support re-evaluating all of the indicated regions.  For example the
capability could include a bitmap indicating regions that might be
signaled and the QEMU driver might skip registering an eventfd via
SET_IRQS if support for non-BAR region indexes is indicated as a
requirement.  I'd really prefer if we can design this to not be limited
to PCI BARs.

> > worried about running out of bits in the ctx count?  An IRQ per region  
> yes. that's also possible :) 
> but current ctx count is 64bit, so it can support regions of index up to 63.
> if we don't need to remap dev regions, seems it's enough?

This is the kind of decision we might look back on 10yrs later and
wonder how we were so short sighted, but yes it does seem like enough
and we can define additional IRQs for each of the next 64 region
indexes if we need too.
 
> > could resolve that, but maybe we could also just add another IRQ for
> > the next bitmap of regions.  I assume that the bitmap can indicate
> > multiple regions to re-evaluate, but that should be documented.  
> hmm. would you mind elaborating more about it?

I'm just confirming that the usage expectation would allow the user to
be signaled with multiple bits in the bitmap set and the user is
expected to re-evaluate each region index sparse bitmap.

> > Also, what sort of service requirements does this imply?  Would the
> > vendor driver send this IRQ when the user tries to set the device_state
> > to _SAVING and therefore we'd require the user to accept, implement the
> > mapping change, and acknowledge the IRQ all while waiting for the write
> > to device_state to return?  That implies quite a lot of asynchronous
> > support in the userspace driver.  Thanks,  
> yes.
> (1) when user sets device_state to _SAVING, the vendor driver notifies this
> IRQ, waits until user IRQ ack is received.
> (2) in IRQ handler, user decodes and sends IRQ ack to vendor driver.
> 
> if a wait is required in (1) returns, it demands the qemu_mutex_iothread is
> not locked in migration thread when device_state is set in (1), as before
> entering (2), acquiring of this mutex is required.
> 
> Currently, this lock is not hold in vfio_migration_set_state() at
> save_setup stage but is hold in stop and copy stage. so we wait in
> kernel in save_setup stage and not wait in stop stage.
> it can be fixed by calling qemu_mutex_unlock_iothread() on entering
> vfio_migration_set_state() and qemu_mutex_lock_iothread() on leaving
> vfio_migration_set_state() in qemu.
> 
> do you think it's acceptable?

I'm not thrilled by it, it seems a bit tricky for both userspace and
the vendor driver to get right.  Userspace needs to handle this eventfd
while blocked on write(2) into a region, which for QEMU means
additional ioctls to retrieve new REGION_INFO, closing some mmaps,
maybe opening other mmaps, which implies new layering of MemoryRegion
sub-regions and all of the calls through KVM to implement those address
space changes.  The vendor driver must also be able to support
concurrency of handling the REGION_INFO ioctl, new calls to mmap
regions, and maybe vm_ops.close and vm_ops.fault.  These regions might
also be IOMMU mapped, so re-evaluating the sparse mmap could result in
DMA maps and unmaps, which the vendor driver might see via the notify
forcing it to unpin pages.  How would the vendor driver know when to
unblock the write to device_state, would it look for vm_ops.close on
infringing vmas or are you thinking of an ACK via irqfd?  I wouldn't
want to debug lockups as a result of this design :-\

What happens if the mmap re-evaluation occurs asynchronous to the
device_state write?  The vendor driver can track outstanding mmap vmas
to areas it's trying to revoke, so the vendor driver can know when
userspace has reached an acceptable state (assuming we require
userspace to munmap areas that are no longer valid).  We should also
consider what we can accomplish by invalidating user mmaps, ex. can we
fault them back in on a per-page basis and continue to mark them dirty
in the migration state, re-invalidating on each iteration until they've
finally been closed.   It seems the vendor driver needs to handle
incrementally closing each mmap anyway, there's no requirement to the
user to stop the device (ie. block all access), make these changes,
then restart the device.  So perhaps the vendor driver can "limp" along
until userspace completes the changes.  I think we can assume we are in
a cooperative environment here, userspace wants to perform a migration,
disabling direct access to some regions is for mediating those accesses
during migration, not for preventing the user from accessing something
they shouldn't have access to, userspace is only delaying the migration
or affecting the state of their device by not promptly participating in
the protocol.

Another problem I see though is what about p2p DMA?  If the vendor
driver invalidates an mmap we're removing it from both direct CPU as
well as DMA access via the IOMMU.  We can't signal to the guest OS that
a DMA channel they've been using is suddenly no longer valid.  Is QEMU
going to need to avoid ever IOMMU mapping device_ram for regions
subject to mmap invalidation?  That would introduce an undesirable need
to choose whether we want to support p2p or migration unless we had an
IOMMU that could provide dirty tracking via p2p, right?  Thanks,

Alex



* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-01 16:43       ` Alex Williamson
@ 2020-06-02  8:28         ` Yan Zhao
  2020-06-02 19:34           ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-02  8:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Mon, Jun 01, 2020 at 10:43:07AM -0600, Alex Williamson wrote:
> On Mon, 1 Jun 2020 02:57:26 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, May 29, 2020 at 03:45:47PM -0600, Alex Williamson wrote:
> > > On Sun, 17 May 2020 22:52:45 -0400
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > This is a virtual irq type.
> > > > vendor driver triggers this irq when it wants to notify userspace to
> > > > remap PCI BARs.
> > > > 
> > > > 1. vendor driver triggers this irq and packs the target bar number in
> > > >    the ctx count. i.e. "1 << bar_number".
> > > >    if a bit is set, the corresponding bar is to be remapped.
> > > > 
> > > > 2. userspace requery the specified PCI BAR from kernel and if flags of
> > > > the bar regions are changed, it removes the old subregions and attaches
> > > > subregions according to the new flags.
> > > > 
> > > > 3. userspace notifies back to kernel by writing one to the eventfd of
> > > > this irq.
> > > > 
> > > > Please check the corresponding qemu implementation from the reply of this
> > > > patch, and a sample usage in vendor driver in patch [10/10].
> > > > 
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > >  include/uapi/linux/vfio.h | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > > 
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 2d0d85c7c4d4..55895f75d720 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
> > > >  	__u32 subtype;  /* type specific */
> > > >  };
> > > >  
> > > > +/* Bar Region Query IRQ TYPE */
> > > > +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION			(1)
> > > > +
> > > > +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> > > > +/*
> > > > + * This irq notifies userspace to re-query BAR region and remaps the
> > > > + * subregions.
> > > > + */
> > > > +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION	(0)  
> > > 
> > > Hi Yan,
> > > 
> > > How do we do this in a way that's backwards compatible?  Or maybe, how
> > > do we perform a handshake between the vendor driver and userspace to
> > > indicate this support?  
> > hi Alex
> > thank you for your thoughtful review!
> > 
> > do you think below sequence can provide enough backwards compatibility?
> > 
> > - on vendor driver opening, it registers an irq of type
> >   VFIO_IRQ_TYPE_REMAP_BAR_REGION, and reports to driver vfio-pci there's
> >   1 vendor irq.
> > 
> > - after userspace detects the irq of type VFIO_IRQ_TYPE_REMAP_BAR_REGION
> >   it enables it by signaling ACTION_TRIGGER.
> >   
> > - on receiving this ACTION_TRIGGER, vendor driver will try to setup a
> >   virqfd to monitor file write to the fd of this irq, enable this irq
> >   and return its enabling status to userspace.
> 
> I'm not sure I follow here, what's the purpose of the irqfd?  When and
> what does the user signal by writing to the irqfd?  Is this an ACK
> mechanism?  Is this a different fd from the signaling eventfd?
it's not the kvm irqfd.
On the vendor driver side, once ACTION_TRIGGER is received for the remap
irq, the vfio_virqfd_enable() interface is called to monitor writes to
the eventfd of this irq.

When the vendor driver signals the eventfd, the remap handler in QEMU is
called, and it writes to the eventfd after remapping is done.
Then the virqfd->handler registered in the vendor driver is called to
receive the QEMU ack.

> 
> > > Would the vendor driver refuse to change
> > > device_state in the migration region if the user has not enabled this
> > > IRQ?  
> > yes, vendor driver can refuse to change device_state if the irq
> > VFIO_IRQ_TYPE_REMAP_BAR_REGION is not enabled.
> > in my sample i40e_vf driver (patch 10/10), it implemented this logic
> > like below:
> > 
> > i40e_vf_set_device_state
> >     |-> case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
> >     |          ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
> >                               |->ret = i40e_vf_remap_bars(i40e_vf_dev, true);
> > 			                     |->if (!i40e_vf_dev->remap_irq_ctx.init)
> >                                                     return -ENODEV;
> > 
> > 
> > (i40e_vf_dev->remap_irq_ctx.init is set in below path)
> > i40e_vf_ioctl(cmd==VFIO_DEVICE_SET_IRQS)
> >     |->i40e_vf_set_irq_remap_bars
> >        |->i40e_vf_enable_remap_bars_irq
> >            |-> vf_dev->remap_irq_ctx.init = true;
> 
> This should be a documented aspect of the uapi, not left to vendor
> discretion to implement.
>
ok. got it.

> > > 
> > > Everything you've described in the commit log needs to be in this
> > > header, we can't have the usage protocol buried in a commit log.  It  
> > got it! I'll move all descriptions in commit logs to this header so that
> > readers can understand the whole picture here.
> > 
> > > also seems like this is unnecessarily PCI specific.  Can't the count
> > > bitmap simply indicate the region index to re-evaluate?  Maybe you were  
> > yes, it is possible. but what prevented me from doing it is that it's not
> > easy to write an irq handler in qemu to remap other regions dynamically.
> > 
> > for BAR regions, there're 3 layers as below.
> > 1. bar->mr  -->bottom layer
> > 2. bar->region.mem --> slow path
> > 3. bar->region->mmaps[i].mem  --> fast path
> > so, the bar remap irq handler can simply re-evaluate the region and
> > remove/re-generate the layer 3 (fast path) without losing track of any
> > guest accesses to the bar regions.
> > 
> > actually, so far the bar remap irq handler in qemu only supports
> > remapping mmap'd subregions (the layout of mmap'd subregions is
> > re-queried) and does not support updating the whole bar region size.
> > (do you think updating bar region size is a must?)
> 
> It depends on whether our interrupt is defined that the user should
> re-evaluate the entire region_info or just the spare mmap capability.
> A device spontaneously changing region size seems like a much more
> abstract problem.  We do need to figure out how to support resizeable
> BARs, but it seems that would be at the direction of the user, for
> example emulating the resizeable BAR capability and requiring userspace
> to re-evaluate the region_info after interacting with that emulation.
> So long as we specify that this IRQ is limited to re-evaluating the
> sparse mmap capability for the indicated regions, I don't think we need
> to handle the remainder of region_info spontaneously changing.
got it.

> > however, there are no such fast path and slow path in other regions, so
> > remap handlers for them are region specific.
> 
> QEMU support for re-evaluating arbitrary regions for sparse mmap
> changes should not limit our kernel implementation.  Maybe it does
> suggest though that userspace should be informed of the region indexes
> subject to re-evaluation such that it can choose to ignore this
> interrupt (and lose the features enabled by the IRQ), if it doesn't
> support re-evaluating all of the indicated regions.  For example the
> capability could include a bitmap indicating regions that might be
> signaled and the QEMU driver might skip registering an eventfd via
> SET_IRQS if support for non-BAR region indexes is indicated as a
> requirement.  I'd really prefer if we can design this to not be limited
> to PCI BARs.
> 
what about using irq_set->start and irq_set->count in the SET_IRQS ioctl
to notify the vendor driver of the supported invalidation range of region
indexes? e.g. currently it's irq_set->start = 0 and irq_set->count = 6.
This SET_IRQS ioctl can be called multiple times to notify the vendor
driver of all supported ranges.
If the vendor driver signals indexes outside of this range, QEMU just
ignores the request.

> > > worried about running out of bits in the ctx count?  An IRQ per region  
> > yes. that's also possible :) 
> > but current ctx count is 64bit, so it can support regions of index up to 63.
> > if we don't need to remap dev regions, seems it's enough?
> 
> This is the kind of decision we might look back on 10yrs later and
> wonder how we were so short sighted, but yes it does seem like enough
> and we can define additional IRQs for each of the next 64 region
> indexes if we need too.
ok.

>  
> > > could resolve that, but maybe we could also just add another IRQ for
> > > the next bitmap of regions.  I assume that the bitmap can indicate
> > > multiple regions to re-evaluate, but that should be documented.  
> > hmm. would you mind elaborating more about it?
> 
> I'm just confirming that the usage expectation would allow the user to
> be signaled with multiple bits in the bitmap set and the user is
> expected to re-evaluate each region index sparse bitmap.
yes, currently the vendor driver is able to specify multiple bits in
the bitmap.

> 
> > > Also, what sort of service requirements does this imply?  Would the
> > > vendor driver send this IRQ when the user tries to set the device_state
> > > to _SAVING and therefore we'd require the user to accept, implement the
> > > mapping change, and acknowledge the IRQ all while waiting for the write
> > > to device_state to return?  That implies quite a lot of asynchronous
> > > support in the userspace driver.  Thanks,  
> > yes.
> > (1) when user sets device_state to _SAVING, the vendor driver notifies this
> > IRQ, waits until user IRQ ack is received.
> > (2) in IRQ handler, user decodes and sends IRQ ack to vendor driver.
> > 
> > if a wait is required before (1) returns, it demands that
> > qemu_mutex_iothread is not locked in the migration thread when
> > device_state is set in (1), as acquiring this mutex is required
> > before entering (2).
> > 
> > Currently, this lock is not held in vfio_migration_set_state() at the
> > save_setup stage but is held in the stop-and-copy stage, so we wait in
> > the kernel in the save_setup stage and do not wait in the stop stage.
> > it can be fixed by calling qemu_mutex_unlock_iothread() on entering
> > vfio_migration_set_state() and qemu_mutex_lock_iothread() on leaving
> > vfio_migration_set_state() in qemu.
> > 
> > do you think it's acceptable?
> 
> I'm not thrilled by it, it seems a bit tricky for both userspace and
> the vendor driver to get right.  Userspace needs to handle this eventfd
> while blocked on write(2) into a region, which for QEMU means
> additional ioctls to retrieve new REGION_INFO, closing some mmaps,
> maybe opening other mmaps, which implies new layering of MemoryRegion
> sub-regions and all of the calls through KVM to implement those address
> space changes.  The vendor driver must also be able to support
> concurrency of handling the REGION_INFO ioctl, new calls to mmap
> regions, and maybe vm_ops.close and vm_ops.fault.  These regions might
> also be IOMMU mapped, so re-evaluating the sparse mmap could result in
> DMA maps and unmaps, which the vendor driver might see via the notify
> forcing it to unpin pages.  How would the vendor driver know when to
> unblock the write to device_state, would it look for vm_ops.close on
> infringing vmas or are you thinking of an ACK via irqfd?  I wouldn't
> want to debug lockups as a result of this design :-\
hmm, do you think below sequence is acceptable?
1. QEMU sets device_state to PRE_SAVING.
   the vendor driver signals the remap irq and then returns from the
   device_state write.

2. QEMU remap irq handler is invoked to do the region remapping. after
   that user ack is sent to vendor driver by the handler writing to eventfd.

3. QEMU sets device state to SAVING.
   vendor driver returns success if user ack is received or failure
   after a timeout wait.

> 
> What happens if the mmap re-evaluation occurs asynchronous to the
> device_state write?  The vendor driver can track outstanding mmap vmas
> to areas it's trying to revoke, so the vendor driver can know when
> userspace has reached an acceptable state (assuming we require
> userspace to munmap areas that are no longer valid).  We should also
> consider what we can accomplish by invalidating user mmaps, ex. can we
> fault them back in on a per-page basis and continue to mark them dirty
> in the migration state, re-invalidating on each iteration until they've
> finally been closed.  It seems the vendor driver needs to handle
> incrementally closing each mmap anyway; there's no requirement for the
> user to stop the device (i.e. block all access), make these changes,
> then restart the device.  So perhaps the vendor driver can "limp" along
> until userspace completes the changes.  I think we can assume we are in
> a cooperative environment here, userspace wants to perform a migration,
> disabling direct access to some regions is for mediating those accesses
> during migration, not for preventing the user from accessing something
> they shouldn't have access to, userspace is only delaying the migration
> or affecting the state of their device by not promptly participating in
> the protocol.
> 
the problem is that the mmap re-evaluation has to be done before
device_state is successfully set to SAVING; otherwise QEMU may have
left the save_setup stage and it's too late to start dirty tracking.
And the reason for us to trap the BAR regions is not that there is
dirty data in these regions; it is that we want to know when the device
registers mapped in the BARs are written, so that we can do dirty page
tracking of system memory in software.

> Another problem I see though is what about p2p DMA?  If the vendor
> driver invalidates an mmap we're removing it from both direct CPU as
> well as DMA access via the IOMMU.  We can't signal to the guest OS that
> a DMA channel they've been using is suddenly no longer valid.  Is QEMU
> going to need to avoid ever IOMMU mapping device_ram for regions
> subject to mmap invalidation?  That would introduce an undesirable need
> to choose whether we want to support p2p or migration unless we had an
> IOMMU that could provide dirty tracking via p2p, right?  Thanks,

yes, if there is device memory mapped in the BARs to be remapped, p2p
DMA would be affected. Perhaps the vendor driver should be aware of
this and know what it is doing before sending out the remap irq?
In the i40e VF's case, the BAR 0 to be remapped only contains device
registers, so is it still acceptable?


Thanks
Yan


* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-02  8:28         ` Yan Zhao
@ 2020-06-02 19:34           ` Alex Williamson
  2020-06-03  1:40             ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2020-06-02 19:34 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Tue, 2 Jun 2020 04:28:58 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Mon, Jun 01, 2020 at 10:43:07AM -0600, Alex Williamson wrote:
> > On Mon, 1 Jun 2020 02:57:26 -0400
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, May 29, 2020 at 03:45:47PM -0600, Alex Williamson wrote:  
> > > > On Sun, 17 May 2020 22:52:45 -0400
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > This is a virtual irq type.
> > > > > vendor driver triggers this irq when it wants to notify userspace to
> > > > > remap PCI BARs.
> > > > > 
> > > > > 1. vendor driver triggers this irq and packs the target bar number in
> > > > >    the ctx count. i.e. "1 << bar_number".
> > > > >    if a bit is set, the corresponding bar is to be remapped.
> > > > > 
> > > > > 2. userspace requery the specified PCI BAR from kernel and if flags of
> > > > > the bar regions are changed, it removes the old subregions and attaches
> > > > > subregions according to the new flags.
> > > > > 
> > > > > 3. userspace notifies back to kernel by writing one to the eventfd of
> > > > > this irq.
> > > > > 
> > > > > Please check the corresponding qemu implementation from the reply of this
> > > > > patch, and a sample usage in vendor driver in patch [10/10].
> > > > > 
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > > ---
> > > > >  include/uapi/linux/vfio.h | 11 +++++++++++
> > > > >  1 file changed, 11 insertions(+)
> > > > > 
> > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > > index 2d0d85c7c4d4..55895f75d720 100644
> > > > > --- a/include/uapi/linux/vfio.h
> > > > > +++ b/include/uapi/linux/vfio.h
> > > > > @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
> > > > >  	__u32 subtype;  /* type specific */
> > > > >  };
> > > > >  
> > > > > +/* Bar Region Query IRQ TYPE */
> > > > > +#define VFIO_IRQ_TYPE_REMAP_BAR_REGION			(1)
> > > > > +
> > > > > +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> > > > > +/*
> > > > > + * This irq notifies userspace to re-query BAR region and remaps the
> > > > > + * subregions.
> > > > > + */
> > > > > +#define VFIO_IRQ_SUBTYPE_REMAP_BAR_REGION	(0)    
> > > > 
> > > > Hi Yan,
> > > > 
> > > > How do we do this in a way that's backwards compatible?  Or maybe, how
> > > > do we perform a handshake between the vendor driver and userspace to
> > > > indicate this support?    
> > > hi Alex
> > > thank you for your thoughtful review!
> > > 
> > > do you think below sequence can provide enough backwards compatibility?
> > > 
> > > - on vendor driver opening, it registers an irq of type
> > >   VFIO_IRQ_TYPE_REMAP_BAR_REGION, and reports to driver vfio-pci there's
> > >   1 vendor irq.
> > > 
> > > - after userspace detects the irq of type VFIO_IRQ_TYPE_REMAP_BAR_REGION
> > >   it enables it by signaling ACTION_TRIGGER.
> > >   
> > > - on receiving this ACTION_TRIGGER, vendor driver will try to setup a
> > >   virqfd to monitor file write to the fd of this irq, enable this irq
> > >   and return its enabling status to userspace.  
> > 
> > I'm not sure I follow here, what's the purpose of the irqfd?  When and
> > what does the user signal by writing to the irqfd?  Is this an ACK
> > mechanism?  Is this a different fd from the signaling eventfd?  
> it's not the kvm irqfd.
> in the vendor driver side, once ACTION_TRIGGER is received for the remap irq,
> interface vfio_virqfd_enable() is called to monitor writes to the eventfd of
> this irq.
> 
> when vendor driver signals the eventfd, remap handler in QEMU is
> called and it writes to the eventfd after remapping is done.
> Then the virqfd->handler registered in vendor driver is called to receive
> the QEMU ack.

This seems racy to use the same fd as both an eventfd and irqfd, does
the host need to wait for the user to service the previous IRQ before
sending a new one?  Don't we have gaps where the user is either reading
or writing where we can lose an interrupt?  Does the user also write a
bitmap?  How do we avoid getting out of sync?  Why do we even need
this, can't the vendor driver look for vm_ops.close callbacks for the
offending vma mmaps?

> > > > Would the vendor driver refuse to change
> > > > device_state in the migration region if the user has not enabled this
> > > > IRQ?    
> > > yes, vendor driver can refuse to change device_state if the irq
> > > VFIO_IRQ_TYPE_REMAP_BAR_REGION is not enabled.
> > > in my sample i40e_vf driver (patch 10/10), it implemented this logic
> > > like below:
> > > 
> > > i40e_vf_set_device_state
> > >     |-> case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
> > >     |          ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
> > >                               |->ret = i40e_vf_remap_bars(i40e_vf_dev, true);
> > > 			                     |->if (!i40e_vf_dev->remap_irq_ctx.init)
> > >                                                     return -ENODEV;
> > > 
> > > 
> > > (i40e_vf_dev->remap_irq_ctx.init is set in below path)
> > > i40e_vf_ioctl(cmd==VFIO_DEVICE_SET_IRQS)
> > >     |->i40e_vf_set_irq_remap_bars
> > >        |->i40e_vf_enable_remap_bars_irq
> > >            |-> vf_dev->remap_irq_ctx.init = true;  
> > 
> > This should be a documented aspect of the uapi, not left to vendor
> > discretion to implement.
> >  
> ok. got it.
> 
> > > > 
> > > > Everything you've described in the commit log needs to be in this
> > > > header, we can't have the usage protocol buried in a commit log.  It    
> > > got it! I'll move all descriptions in commit logs to this header so that
> > > readers can understand the whole picture here.
> > >   
> > > > also seems like this is unnecessarily PCI specific.  Can't the count
> > > > bitmap simply indicate the region index to re-evaluate?  Maybe you were    
> > > yes, it is possible. but what prevented me from doing it is that it's not
> > > easy to write an irq handler in qemu to remap other regions dynamically.
> > > 
> > > for BAR regions, there're 3 layers as below.
> > > 1. bar->mr  -->bottom layer
> > > 2. bar->region.mem --> slow path
> > > 3. bar->region->mmaps[i].mem  --> fast path
> > > so, the bar remap irq handler can simply re-evaluate the region and
> > > remove/re-generate the layer 3 (fast path) without losing track of any
> > > guest accesses to the bar regions.
> > > 
> > > actually, so far the bar remap irq handler in qemu only supports
> > > remapping mmap'd subregions (the layout of mmap'd subregions is
> > > re-queried) and does not support updating the whole bar region size.
> > > (do you think updating bar region size is a must?)  
> > 
> > It depends on whether our interrupt is defined that the user should
> > re-evaluate the entire region_info or just the spare mmap capability.
> > A device spontaneously changing region size seems like a much more
> > abstract problem.  We do need to figure out how to support resizeable
> > BARs, but it seems that would be at the direction of the user, for
> > example emulating the resizeable BAR capability and requiring userspace
> > to re-evaluate the region_info after interacting with that emulation.
> > So long as we specify that this IRQ is limited to re-evaluating the
> > sparse mmap capability for the indicated regions, I don't think we need
> > to handle the remainder of region_info spontaneously changing.  
> got it.
> 
> > > however, there are no such fast path and slow path in other regions, so
> > > remap handlers for them are region specific.  
> > 
> > QEMU support for re-evaluating arbitrary regions for sparse mmap
> > changes should not limit our kernel implementation.  Maybe it does
> > suggest though that userspace should be informed of the region indexes
> > subject to re-evaluation such that it can choose to ignore this
> > interrupt (and lose the features enabled by the IRQ), if it doesn't
> > support re-evaluating all of the indicated regions.  For example the
> > capability could include a bitmap indicating regions that might be
> > signaled and the QEMU driver might skip registering an eventfd via
> > SET_IRQS if support for non-BAR region indexes is indicated as a
> > requirement.  I'd really prefer if we can design this to not be limited
> > to PCI BARs.
> >   
> what about using irq_set->start and irq_set->count in the SET_IRQS ioctl
> to notify the vendor driver of the supported invalidation range of region
> indexes? e.g. currently it's irq_set->start = 0 and irq_set->count = 6.
> This SET_IRQS ioctl can be called multiple times to notify the vendor
> driver of all supported ranges.
> If the vendor driver signals indexes outside of this range, QEMU just
> ignores the request.


You want an IRQ per region?  I don't think I understand this proposal.
Overloading sub-indexes within a SET_IRQS index is not acceptable.
Also, "if vendor driver signals indexes outside of this range, QEMU
just ignores the request", if that means that QEMU doesn't handle a
request to re-evaluate the sparse mmap capability I think that
nullifies the entire proposal.

> > > > worried about running out of bits in the ctx count?  An IRQ per region    
> > > yes. that's also possible :) 
> > > but current ctx count is 64bit, so it can support regions of index up to 63.
> > > if we don't need to remap dev regions, seems it's enough?  
> > 
> > This is the kind of decision we might look back on 10yrs later and
> > wonder how we were so short sighted, but yes it does seem like enough
> > and we can define additional IRQs for each of the next 64 region
> > indexes if we need too.  
> ok.
> 
> >    
> > > > could resolve that, but maybe we could also just add another IRQ for
> > > > the next bitmap of regions.  I assume that the bitmap can indicate
> > > > multiple regions to re-evaluate, but that should be documented.    
> > > hmm. would you mind elaborating more about it?  
> > 
> > I'm just confirming that the usage expectation would allow the user to
> > be signaled with multiple bits in the bitmap set and the user is
> > expected to re-evaluate each region index sparse bitmap.  
> yes, currently, vendor driver is able to specify multiple bits in the
> bitmap set.
> 
> >   
> > > > Also, what sort of service requirements does this imply?  Would the
> > > > vendor driver send this IRQ when the user tries to set the device_state
> > > > to _SAVING and therefore we'd require the user to accept, implement the
> > > > mapping change, and acknowledge the IRQ all while waiting for the write
> > > > to device_state to return?  That implies quite a lot of asynchronous
> > > > support in the userspace driver.  Thanks,    
> > > yes.
> > > (1) when user sets device_state to _SAVING, the vendor driver notifies this
> > > IRQ, waits until user IRQ ack is received.
> > > (2) in IRQ handler, user decodes and sends IRQ ack to vendor driver.
> > > 
> > > if a wait is required before (1) returns, it demands that
> > > qemu_mutex_iothread is not locked in the migration thread when
> > > device_state is set in (1), as acquiring this mutex is required
> > > before entering (2).
> > > 
> > > Currently, this lock is not held in vfio_migration_set_state() at the
> > > save_setup stage but is held in the stop-and-copy stage, so we wait in
> > > the kernel in the save_setup stage and do not wait in the stop stage.
> > > it can be fixed by calling qemu_mutex_unlock_iothread() on entering
> > > vfio_migration_set_state() and qemu_mutex_lock_iothread() on leaving
> > > vfio_migration_set_state() in qemu.
> > > 
> > > do you think it's acceptable?  
> > 
> > I'm not thrilled by it, it seems a bit tricky for both userspace and
> > the vendor driver to get right.  Userspace needs to handle this eventfd
> > while blocked on write(2) into a region, which for QEMU means
> > additional ioctls to retrieve new REGION_INFO, closing some mmaps,
> > maybe opening other mmaps, which implies new layering of MemoryRegion
> > sub-regions and all of the calls through KVM to implement those address
> > space changes.  The vendor driver must also be able to support
> > concurrency of handling the REGION_INFO ioctl, new calls to mmap
> > regions, and maybe vm_ops.close and vm_ops.fault.  These regions might
> > also be IOMMU mapped, so re-evaluating the sparse mmap could result in
> > DMA maps and unmaps, which the vendor driver might see via the notify
> > forcing it to unpin pages.  How would the vendor driver know when to
> > unblock the write to device_state, would it look for vm_ops.close on
> > infringing vmas or are you thinking of an ACK via irqfd?  I wouldn't
> > want to debug lockups as a result of this design :-\  
> hmm, do you think below sequence is acceptable?
> 1. QEMU sets device_state to PRE_SAVING.
>    vendor driver signals the remap irq and returns the device_state
>    write.

We have no PRE_SAVING device state.  We haven't even gotten an
implementation of the agreed migration protocol into mainline and we're
already abandoning it?

> 2. QEMU remap irq handler is invoked to do the region remapping. after
>    that user ack is sent to vendor driver by the handler writing to eventfd.

Why would we even need an remap IRQ, if we're going to throw away the
migration protocol we could just define that the user should
re-evaluate the sparse mmaps when switching to _SAVING.

> 3. QEMU sets device state to SAVING.
>    vendor driver returns success if user ack is received or failure
>    after a timeout wait.

I'm not at all happy with this.  Why do we need to hide the migration
sparse mmap from the user until migration time?  What if instead we
introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
where the existing capability is the normal runtime sparse setup and
the user is required to use this new one prior to enabling device_state
with _SAVING.  The vendor driver could then simply track mmap vmas to
the region and refuse to change device_state if there are outstanding
mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
required, no new irqfds, an incremental change to the protocol,
backwards compatible to the extent that a vendor driver requiring this
will automatically fail migration.

> > What happens if the mmap re-evaluation occurs asynchronous to the
> > device_state write?  The vendor driver can track outstanding mmap vmas
> > to areas it's trying to revoke, so the vendor driver can know when
> > userspace has reached an acceptable state (assuming we require
> > userspace to munmap areas that are no longer valid).  We should also
> > consider what we can accomplish by invalidating user mmaps, ex. can we
> > fault them back in on a per-page basis and continue to mark them dirty
> > in the migration state, re-invalidating on each iteration until they've
> > finally been closed.  It seems the vendor driver needs to handle
> > incrementally closing each mmap anyway; there's no requirement for the
> > user to stop the device (i.e. block all access), make these changes,
> > then restart the device.  So perhaps the vendor driver can "limp" along
> > until userspace completes the changes.  I think we can assume we are in
> > a cooperative environment here, userspace wants to perform a migration,
> > disabling direct access to some regions is for mediating those accesses
> > during migration, not for preventing the user from accessing something
> > they shouldn't have access to, userspace is only delaying the migration
> > or affecting the state of their device by not promptly participating in
> > the protocol.
> >   
> the problem is that the mmap re-evaluation has to be done before
> device_state is successfully set to SAVING; otherwise QEMU may have
> left the save_setup stage and it's too late to start dirty tracking.
> And the reason for us to trap the BAR regions is not that there is
> dirty data in these regions; it is that we want to know when the device
> registers mapped in the BARs are written, so that we can do dirty page
> tracking of system memory in software.

I think my proposal above resolves this.

> > Another problem I see though is what about p2p DMA?  If the vendor
> > driver invalidates an mmap we're removing it from both direct CPU as
> > well as DMA access via the IOMMU.  We can't signal to the guest OS that
> > a DMA channel they've been using is suddenly no longer valid.  Is QEMU
> > going to need to avoid ever IOMMU mapping device_ram for regions
> > subject to mmap invalidation?  That would introduce an undesirable need
> > to choose whether we want to support p2p or migration unless we had an
> > IOMMU that could provide dirty tracking via p2p, right?  Thanks,  
> 
> yes, if there is device memory mapped in the BARs to be remapped, p2p
> DMA would be affected. Perhaps the vendor driver should be aware of
> this and know what it is doing before sending out the remap irq?
> In the i40e VF's case, the BAR 0 to be remapped only contains device
> registers, so is it still acceptable?

No, we can't design the interface based on one vendor driver's
implementation of the interface or the requirements of a single device.
If we took the approach above where the user is provided both the
normal sparse mmap and the _SAVING sparse mmap, perhaps QEMU could
avoid DMA mapping portions that don't exist in the _SAVING version, at
least then the p2p DMA mappings would be consistent across the
transition.  QEMU might be able to combine the sparse mmap maps such
that it can easily drop ranges not present during _SAVING.  QEMU would
need to munmap() the dropped ranges rather than simply mark the
MemoryRegion disabled though for the vendor driver to have visibility
of the vm_ops.close callback.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-02 19:34           ` Alex Williamson
@ 2020-06-03  1:40             ` Yan Zhao
  2020-06-03 23:04               ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-03  1:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:
> I'm not at all happy with this.  Why do we need to hide the migration
> sparse mmap from the user until migration time?  What if instead we
> introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> where the existing capability is the normal runtime sparse setup and
> the user is required to use this new one prior to enabling device_state
> with _SAVING.  The vendor driver could then simply track mmap vmas to
> the region and refuse to change device_state if there are outstanding
> mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> required, no new irqfds, an incremental change to the protocol,
> backwards compatible to the extent that a vendor driver requiring this
> will automatically fail migration.
> 
Right, it looks like we need to use this approach to solve the problem.
Thanks for your guidance.
I'll abandon the current remap IRQ approach for dirty tracking during live
migration, but it does at least demonstrate how to customize irq_types in
vendor drivers.
So, what do you think about patches 1-5?

> > > What happens if the mmap re-evaluation occurs asynchronous to the
> > > device_state write?  The vendor driver can track outstanding mmap vmas
> > > to areas it's trying to revoke, so the vendor driver can know when
> > > userspace has reached an acceptable state (assuming we require
> > > userspace to munmap areas that are no longer valid).  We should also
> > > consider what we can accomplish by invalidating user mmaps, ex. can we
> > > fault them back in on a per-page basis and continue to mark them dirty
> > > in the migration state, re-invalidating on each iteration until they've
> > > finally been closed.   It seems the vendor driver needs to handle
> > > incrementally closing each mmap anyway, there's no requirement to the
> > > user to stop the device (ie. block all access), make these changes,
> > > then restart the device.  So perhaps the vendor driver can "limp" along
> > > until userspace completes the changes.  I think we can assume we are in
> > > a cooperative environment here, userspace wants to perform a migration,
> > > disabling direct access to some regions is for mediating those accesses
> > > during migration, not for preventing the user from accessing something
> > > they shouldn't have access to, userspace is only delaying the migration
> > > or affecting the state of their device by not promptly participating in
> > > the protocol.
> > >   
> > the problem is that the mmap re-evaluation has to be done before
> > device_state is successfully set to _SAVING; otherwise QEMU may have
> > left the save_setup stage and it's too late to start dirty tracking.
> > And the reason for us to trap the BAR regions is not that there is
> > dirty data in those regions; it is that we want to know when the device
> > registers mapped in the BARs are written, so that we can do dirty page
> > tracking of system memory in software.
> 
> I think my proposal above resolves this.
>
yes.

> > > Another problem I see though is what about p2p DMA?  If the vendor
> > > driver invalidates an mmap we're removing it from both direct CPU as
> > > well as DMA access via the IOMMU.  We can't signal to the guest OS that
> > > a DMA channel they've been using is suddenly no longer valid.  Is QEMU
> > > going to need to avoid ever IOMMU mapping device_ram for regions
> > > subject to mmap invalidation?  That would introduce an undesirable need
> > > to choose whether we want to support p2p or migration unless we had an
> > > IOMMU that could provide dirty tracking via p2p, right?  Thanks,  
> > 
> > yes, if there is device memory mapped in the BARs to be remapped, p2p
> > DMA would be affected. Perhaps that is something the vendor driver
> > should be aware of, and it should know what it is doing before sending
> > out the remap IRQ? In the i40e VF's case, the BAR 0 to be remapped is
> > only for device registers, so is it still acceptable?
> 
> No, we can't design the interface based on one vendor driver's
> implementation of the interface or the requirements of a single device.
> If we took the approach above where the user is provided both the
> normal sparse mmap and the _SAVING sparse mmap, perhaps QEMU could
> avoid DMA mapping portions that don't exist in the _SAVING version, at
> least then the p2p DMA mappings would be consistent across the
> transition.  QEMU might be able to combine the sparse mmap maps such
> that it can easily drop ranges not present during _SAVING.  QEMU would
> need to munmap() the dropped ranges rather than simply mark the
> MemoryRegion disabled though for the vendor driver to have visibility
> of the vm_ops.close callback.  Thanks,
>
OK, got it! Thank you!

Yan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-03  1:40             ` Yan Zhao
@ 2020-06-03 23:04               ` Alex Williamson
  2020-06-04  2:42                 ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2020-06-03 23:04 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Tue, 2 Jun 2020 21:40:58 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:
> > I'm not at all happy with this.  Why do we need to hide the migration
> > sparse mmap from the user until migration time?  What if instead we
> > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > where the existing capability is the normal runtime sparse setup and
> > the user is required to use this new one prior to enabling device_state
> > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > the region and refuse to change device_state if there are outstanding
> > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > required, no new irqfds, an incremental change to the protocol,
> > backwards compatible to the extent that a vendor driver requiring this
> > will automatically fail migration.
> >   
> Right, it looks like we need to use this approach to solve the problem.
> Thanks for your guidance.
> I'll abandon the current remap IRQ approach for dirty tracking during
> live migration, but it does at least demonstrate how to customize
> irq_types in vendor drivers.
> So, what do you think about patches 1-5?

In broad strokes, I don't think we've found the right solution yet.  I
really question whether it's supportable to parcel out vfio-pci like
this and I don't know how I'd support unraveling whether we have a bug
in vfio-pci, the vendor driver, or how the vendor driver is making use
of vfio-pci.

Let me also ask, why does any of this need to be in the kernel?  We
spend 5 patches slicing up vfio-pci so that we can register a vendor
driver and have that vendor driver call into vfio-pci as it sees fit.
We have two patches creating device specific interrupts and a BAR
remapping scheme that we've decided we don't need.  That brings us to
the actual i40e vendor driver, where the first patch is simply making
the vendor driver work like vfio-pci already does, the second patch is
handling the migration region, and the third patch is implementing the
BAR remapping IRQ that we decided we don't need.  It's difficult to
actually find the small bit of code that's required to support
migration outside of just dealing with the protocol we've defined to
expose this from the kernel.  So why are we trying to do this in the
kernel?  We have quirk support in QEMU, we can easily flip
MemoryRegions on and off, etc.  What access to the device outside of
what vfio-pci provides to the user, and therefore QEMU, is necessary to
implement this migration support for i40e VFs?  Is this just an
exercise in making use of the migration interface?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-03 23:04               ` Alex Williamson
@ 2020-06-04  2:42                 ` Yan Zhao
  2020-06-04  4:10                   ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-04  2:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:
> On Tue, 2 Jun 2020 21:40:58 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:
> > > I'm not at all happy with this.  Why do we need to hide the migration
> > > sparse mmap from the user until migration time?  What if instead we
> > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > where the existing capability is the normal runtime sparse setup and
> > > the user is required to use this new one prior to enabling device_state
> > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > the region and refuse to change device_state if there are outstanding
> > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > required, no new irqfds, an incremental change to the protocol,
> > > backwards compatible to the extent that a vendor driver requiring this
> > > will automatically fail migration.
> > >   
> > Right, it looks like we need to use this approach to solve the problem.
> > Thanks for your guidance.
> > I'll abandon the current remap IRQ approach for dirty tracking during
> > live migration, but it does at least demonstrate how to customize
> > irq_types in vendor drivers.
> > So, what do you think about patches 1-5?
> 
> In broad strokes, I don't think we've found the right solution yet.  I
> really question whether it's supportable to parcel out vfio-pci like
> this and I don't know how I'd support unraveling whether we have a bug
> in vfio-pci, the vendor driver, or how the vendor driver is making use
> of vfio-pci.
>
> Let me also ask, why does any of this need to be in the kernel?  We
> spend 5 patches slicing up vfio-pci so that we can register a vendor
> driver and have that vendor driver call into vfio-pci as it sees fit.
> We have two patches creating device specific interrupts and a BAR
> remapping scheme that we've decided we don't need.  That brings us to
> the actual i40e vendor driver, where the first patch is simply making
> the vendor driver work like vfio-pci already does, the second patch is
> handling the migration region, and the third patch is implementing the
> BAR remapping IRQ that we decided we don't need.  It's difficult to
> actually find the small bit of code that's required to support
> migration outside of just dealing with the protocol we've defined to
> expose this from the kernel.  So why are we trying to do this in the
> kernel?  We have quirk support in QEMU, we can easily flip
> MemoryRegions on and off, etc.  What access to the device outside of
> what vfio-pci provides to the user, and therefore QEMU, is necessary to
> implement this migration support for i40e VFs?  Is this just an
> exercise in making use of the migration interface?  Thanks,
> 
hi Alex

There was a description of the intention of this series in RFC v1
(https://www.spinics.net/lists/kernel/msg3337337.html).
sorry, I didn't include it starting from RFC v2.

"
The reason why we don't choose the way of writing an mdev parent driver
is that
(1) VFs are almost all the time directly passed through. Directly binding
to vfio-pci can make most of the code shared/reused. If we write a
vendor specific mdev parent driver, most of the code (like the passthrough
style of rw/mmap) still needs to be copied from the vfio-pci driver, which
is actually duplicated and tedious work.
(2) For features like dynamically trapping/untrapping pci bars, if they
are in vfio-pci, they can be available to most people without repeated
code copying and re-testing.
(3) with a 1:1 mdev driver which passes through VFs most of the time,
people have to decide whether to bind VFs to vfio-pci or the mdev parent
driver before they run into a real migration need. However, if vfio-pci
is bound initially, they have no chance to do live migration when there's
a need later.
"
In particular, there are some devices (like NVMe) that rely purely on
vfio-pci for device pass-through and have no standalone parent driver
to take the mdev route.

I think live migration is a general requirement for most devices, and
interacting with the migration interface requires vendor drivers to do
device-specific tasks like getting/setting device state, starting/stopping
the device, tracking dirty data, reporting migration capabilities, and so
on. All of that work needs to be in the kernel.
do you think it's better to create numerous vendor quirks in vfio-pci?

As to this series, though patch 9/10 currently only demonstrates reporting
a migration region, it actually shows the capability of a vendor driver to
customize device regions. e.g. patch 10/10 customizes BAR0 to be
read/write. And though we abandoned the REMAP BAR irq_type in patch 10/10
for migration purposes, this irq_type still has uses in other cases, where
synchronization is not a hard requirement and all that is needed is a
notification channel from the kernel. This series just provides a way for
vendors to customize device regions and IRQs.

The interfaces exported in patches 3/10-5/10 need to be exported anyway
for writing mdev parent drivers that pass devices through at normal times,
to avoid duplication. And yes, your worry about identifying the source of
bugs is reasonable, but if a device is bound to vfio-pci with a vendor
module loaded and there's a bug, there are at least two ways to identify
whether it's a bug in vfio-pci itself:
(1) prevent vendor modules from loading and see if the problem exists
with pure vfio-pci.
(2) do what's demonstrated in patch 8/10, i.e. do nothing but simply pass
all operations through to vfio-pci.

So, do you think this series has merit and that we can continue improving
it?

Thanks
Yan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-04  2:42                 ` Yan Zhao
@ 2020-06-04  4:10                   ` Alex Williamson
  2020-06-05  0:26                     ` He, Shaopeng
  2020-06-05  2:02                     ` Yan Zhao
  0 siblings, 2 replies; 42+ messages in thread
From: Alex Williamson @ 2020-06-04  4:10 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Wed, 3 Jun 2020 22:42:28 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:
> > On Tue, 2 Jun 2020 21:40:58 -0400
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:  
> > > > I'm not at all happy with this.  Why do we need to hide the migration
> > > > sparse mmap from the user until migration time?  What if instead we
> > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > where the existing capability is the normal runtime sparse setup and
> > > > the user is required to use this new one prior to enabling device_state
> > > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > > the region and refuse to change device_state if there are outstanding
> > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > required, no new irqfds, an incremental change to the protocol,
> > > > backwards compatible to the extent that a vendor driver requiring this
> > > > will automatically fail migration.
> > > >     
> > > Right, it looks like we need to use this approach to solve the problem.
> > > Thanks for your guidance.
> > > I'll abandon the current remap IRQ approach for dirty tracking during
> > > live migration, but it does at least demonstrate how to customize
> > > irq_types in vendor drivers.
> > > So, what do you think about patches 1-5?
> > 
> > In broad strokes, I don't think we've found the right solution yet.  I
> > really question whether it's supportable to parcel out vfio-pci like
> > this and I don't know how I'd support unraveling whether we have a bug
> > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > of vfio-pci.
> >
> > Let me also ask, why does any of this need to be in the kernel?  We
> > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > driver and have that vendor driver call into vfio-pci as it sees fit.
> > We have two patches creating device specific interrupts and a BAR
> > remapping scheme that we've decided we don't need.  That brings us to
> > the actual i40e vendor driver, where the first patch is simply making
> > the vendor driver work like vfio-pci already does, the second patch is
> > handling the migration region, and the third patch is implementing the
> > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > actually find the small bit of code that's required to support
> > migration outside of just dealing with the protocol we've defined to
> > expose this from the kernel.  So why are we trying to do this in the
> > kernel?  We have quirk support in QEMU, we can easily flip
> > MemoryRegions on and off, etc.  What access to the device outside of
> > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > implement this migration support for i40e VFs?  Is this just an
> > exercise in making use of the migration interface?  Thanks,
> >   
> hi Alex
> 
> There was a description of the intention of this series in RFC v1
> (https://www.spinics.net/lists/kernel/msg3337337.html).
> sorry, I didn't include it starting from RFC v2.
> 
> "
> The reason why we don't choose the way of writing mdev parent driver is
> that

I didn't mention an mdev approach, I'm asking what are we accomplishing
by doing this in the kernel at all versus exposing the device as normal
through vfio-pci and providing the migration support in QEMU.  Are you
actually leveraging having some sort of access to the PF in supporting
migration of the VF?  Is vfio-pci masking the device in a way that
prevents migrating the state from QEMU?

> (1) VFs are almost all the time directly passed through. Directly binding
> to vfio-pci can make most of the code shared/reused. If we write a
> vendor specific mdev parent driver, most of the code (like passthrough
> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> actually a duplicated and tedious work.
> (2) For features like dynamically trap/untrap pci bars, if they are in
> vfio-pci, they can be available to most people without repeated code
> copying and re-testing.
> (3) with a 1:1 mdev driver which passes through VFs most of the time, people
> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> it runs into a real migration need. However, if vfio-pci is bound
> initially, they have no chance to do live migration when there's a need
> later.
> "
> In particular, there are some devices (like NVMe) that rely purely on
> vfio-pci for device pass-through and have no standalone parent driver
> to take the mdev route.
> 
> I think live migration is a general requirement for most devices, and
> interacting with the migration interface requires vendor drivers to do
> device-specific tasks like getting/setting device state, starting/stopping
> the device, tracking dirty data, reporting migration capabilities, and
> so on. All of that work needs to be in the kernel.

I think Alex Graf proved they don't necessarily need to be done in
kernel back in 2015: https://www.youtube.com/watch?v=4RFsSgzuFso
He was able to achieve i40e VF live migration by only hacking QEMU.  In
this series you're allowing a vendor driver to interpose itself between
the user (QEMU) and vfio-pci such that we switch to the vendor code
during migration.  Why can't that interpose layer be in QEMU rather
than the kernel?  It seems that it only must be in the kernel if we
need to provide migration state via backdoor, perhaps like going
through the PF.  So what access to the i40e VF device is not provided to
the user through vfio-pci that is necessary to implement migration of
this device?  The tasks listed above are mostly standard device driver
activities and clearly vfio-pci allows userspace device drivers.

> do you think it's better to create numerous vendor quirks in vfio-pci?

In QEMU, perhaps.  Alternatively, let's look at exactly what access is
not provided through vfio-pci that's necessary for this and decide if
we want to enable that access or if cracking vfio-pci wide open for
vendor drivers to pick and choose when and how to use it is really the
right answer.

> As to this series, though patch 9/10 currently only demonstrates
> reporting a migration region, it actually shows the capability of a
> vendor driver to customize device regions. e.g. patch 10/10 customizes
> BAR0 to be read/write. And though we abandoned the REMAP BAR irq_type
> in patch 10/10 for migration purposes, this irq_type still has uses in
> other cases, where synchronization is not a hard requirement and all
> that is needed is a notification channel from the kernel. This series
> just provides a way for vendors to customize device regions and IRQs.

I don't disagree that a device specific interrupt might be useful, but
I would object to implementing this one only as an artificial use case.
We can wait for a legitimate use case to implement that.

> The interfaces exported in patches 3/10-5/10 need to be exported anyway
> for writing mdev parent drivers that pass devices through at normal
> times, to avoid duplication. And yes, your worry about

Where are those parent drivers?  What are their actual requirements?

> identifying the source of bugs is reasonable, but if a device is bound
> to vfio-pci with a vendor module loaded and there's a bug, there are at
> least two ways to identify whether it's a bug in vfio-pci itself:
> (1) prevent vendor modules from loading and see if the problem exists
> with pure vfio-pci.
> (2) do what's demonstrated in patch 8/10, i.e. do nothing but simply
> pass all operations through to vfio-pci.

The code split is still extremely ad-hoc, there's no API.  An mdev
driver isn't even a sub-driver of vfio-pci like you're trying to
accomplish here, there would need to be a much more defined API when
the base device isn't even a vfio_pci_device.  I don't see how this
series would directly enable an mdev use case.

> So, do you think this series has merit and that we can continue
> improving it?

I think this series is trying to push an artificial use case that is
perhaps better done in userspace.  What is the actual interaction with
the VF device that can only be done in the host kernel for this
example?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 02/10] vfio/pci: macros to generate module_init and module_exit for vendor modules
  2020-05-18  2:45 ` [RFC PATCH v4 02/10] vfio/pci: macros to generate module_init and module_exit for vendor modules Yan Zhao
@ 2020-06-04 15:01   ` Cornelia Huck
  2020-06-05  2:05     ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Cornelia Huck @ 2020-06-04 15:01 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, alex.williamson, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Sun, 17 May 2020 22:45:10 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> vendor modules call macro module_vfio_pci_register_vendor_handler to
> generate module_init and module_exit.
> It is necessary to ensure that vendor modules always call
> vfio_pci_register_vendor_driver() on driver loading and
> vfio_pci_unregister_vendor_driver on driver unloading,
> because
> (1) at compiling time, there's only a dependency of vendor modules on
> vfio_pci.
> (2) at runtime,
> - vendor modules add refs of vfio_pci on a successful calling of
>   vfio_pci_register_vendor_driver() and deref of vfio_pci on a
>   successful calling of vfio_pci_unregister_vendor_driver().
> - vfio_pci only adds refs of vendor module on a successful probe of vendor
>   driver.
>   vfio_pci derefs vendor module when unbinding from a device.
> 
> So, after vfio_pci is unbound from a device, the vendor module to that
> device is free to get unloaded. However, if that vendor module does not
> call vfio_pci_unregister_vendor_driver() in its module_exit, vfio_pci may
> hold a stale pointer to vendor module.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  include/linux/vfio.h | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 3e53deb012b6..f3746608c2d9 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -223,4 +223,31 @@ struct vfio_pci_vendor_driver_ops {
>  };
>  int __vfio_pci_register_vendor_driver(struct vfio_pci_vendor_driver_ops *ops);
>  void vfio_pci_unregister_vendor_driver(struct vfio_device_ops *device_ops);
> +
> +#define vfio_pci_register_vendor_driver(__name, __probe, __remove,	\
> +					__device_ops)			\
> +static struct vfio_pci_vendor_driver_ops  __ops ## _node = {		\
> +	.owner		= THIS_MODULE,					\
> +	.name		= __name,					\
> +	.probe		= __probe,					\
> +	.remove		= __remove,					\
> +	.device_ops	= __device_ops,					\
> +};									\
> +__vfio_pci_register_vendor_driver(&__ops ## _node)
> +
> +#define module_vfio_pci_register_vendor_handler(name, probe, remove,	\
> +						device_ops)		\
> +static int __init device_ops ## _module_init(void)			\
> +{									\
> +	vfio_pci_register_vendor_driver(name, probe, remove,		\
> +					device_ops);			\

What if this function fails (e.g. with -ENOMEM)?

> +	return 0;							\
> +};									\
> +static void __exit device_ops ## _module_exit(void)			\
> +{									\
> +	vfio_pci_unregister_vendor_driver(device_ops);			\
> +};									\
> +module_init(device_ops ## _module_init);				\
> +module_exit(device_ops ## _module_exit)
> +
>  #endif /* VFIO_H */


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs
  2020-05-18  2:49 ` [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs Yan Zhao
@ 2020-06-04 15:25   ` Cornelia Huck
  2020-06-05  2:15     ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Cornelia Huck @ 2020-06-04 15:25 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, alex.williamson, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Sun, 17 May 2020 22:49:44 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> This allows a simpler VFIO_DEVICE_GET_INFO ioctl in vendor driver
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         | 23 +++++++++++++++++++++--
>  drivers/vfio/pci/vfio_pci_private.h |  2 ++
>  include/linux/vfio.h                |  3 +++
>  3 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 290b7ab55ecf..30137c1c5308 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -105,6 +105,24 @@ void *vfio_pci_vendor_data(void *device_data)
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_vendor_data);
>  
> +int vfio_pci_set_vendor_regions(void *device_data, int num_vendor_regions)
> +{
> +	struct vfio_pci_device *vdev = device_data;
> +
> +	vdev->num_vendor_regions = num_vendor_regions;

Do we need any kind of sanity check here, in case this is called with a
bogus value?

> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_pci_set_vendor_regions);
> +
> +
> +int vfio_pci_set_vendor_irqs(void *device_data, int num_vendor_irqs)
> +{
> +	struct vfio_pci_device *vdev = device_data;
> +
> +	vdev->num_vendor_irqs = num_vendor_irqs;

Here as well.

> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_pci_set_vendor_irqs);
>  /*
>   * Our VGA arbiter participation is limited since we don't know anything
>   * about the device itself.  However, if the device is the only VGA device

(...)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-04  4:10                   ` Alex Williamson
@ 2020-06-05  0:26                     ` He, Shaopeng
  2020-06-05 17:54                       ` Alex Williamson
  2020-06-05  2:02                     ` Yan Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: He, Shaopeng @ 2020-06-05  0:26 UTC (permalink / raw)
  To: Alex Williamson, Zhao, Yan Y
  Cc: kvm, linux-kernel, cohuck, zhenyuw, Wang, Zhi A, Tian, Kevin,
	Liu, Yi L, Zeng, Xin, Yuan, Hang

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, June 4, 2020 12:11 PM
> 
> On Wed, 3 Jun 2020 22:42:28 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:
> > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:
> > > > > I'm not at all happy with this.  Why do we need to hide the
> > > > > migration sparse mmap from the user until migration time?  What
> > > > > if instead we introduced a new
> > > > > VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability where
> the
> > > > > existing capability is the normal runtime sparse setup and the
> > > > > user is required to use this new one prior to enabling
> > > > > device_state with _SAVING.  The vendor driver could then simply
> > > > > track mmap vmas to the region and refuse to change device_state
> > > > > if there are outstanding mmaps conflicting with the _SAVING
> > > > > sparse mmap layout.  No new IRQs required, no new irqfds, an
> > > > > incremental change to the protocol, backwards compatible to the
> extent that a vendor driver requiring this will automatically fail migration.
> > > > >
> > > > Right, it looks like we need to use this approach to solve the
> > > > problem. Thanks for your guidance. I'll abandon the current remap
> > > > IRQ approach for dirty tracking during live migration, but it does
> > > > at least demonstrate how to customize irq_types in vendor drivers.
> > > > So, what do you think about patches 1-5?
> > >
> > > In broad strokes, I don't think we've found the right solution yet.
> > > I really question whether it's supportable to parcel out vfio-pci
> > > like this and I don't know how I'd support unraveling whether we
> > > have a bug in vfio-pci, the vendor driver, or how the vendor driver
> > > is making use of vfio-pci.
> > >
> > > Let me also ask, why does any of this need to be in the kernel?  We
> > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > We have two patches creating device specific interrupts and a BAR
> > > remapping scheme that we've decided we don't need.  That brings us
> > > to the actual i40e vendor driver, where the first patch is simply
> > > making the vendor driver work like vfio-pci already does, the second
> > > patch is handling the migration region, and the third patch is
> > > implementing the BAR remapping IRQ that we decided we don't need.
> > > It's difficult to actually find the small bit of code that's
> > > required to support migration outside of just dealing with the
> > > protocol we've defined to expose this from the kernel.  So why are
> > > we trying to do this in the kernel?  We have quirk support in QEMU,
> > > we can easily flip MemoryRegions on and off, etc.  What access to
> > > the device outside of what vfio-pci provides to the user, and
> > > therefore QEMU, is necessary to implement this migration support for
> > > i40e VFs?  Is this just an exercise in making use of the migration
> > > interface?  Thanks,
> > >
> > hi Alex
> >
> > There was a description of intention of this series in RFC v1
> > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > sorry, I didn't include it starting from RFC v2.
> >
> > "
> > The reason why we don't choose the way of writing mdev parent driver
> > is that
> 
> I didn't mention an mdev approach, I'm asking what are we accomplishing by
> doing this in the kernel at all versus exposing the device as normal through
> vfio-pci and providing the migration support in QEMU.  Are you actually
> leveraging having some sort of access to the PF in supporting migration of the
> VF?  Is vfio-pci masking the device in a way that prevents migrating the state
> from QEMU?
> 
> > (1) VFs are almost all the time directly passthroughed. Directly
> > binding to vfio-pci can make most of the code shared/reused. If we
> > write a vendor specific mdev parent driver, most of the code (like
> > passthrough style of rw/mmap) still needs to be copied from vfio-pci
> > driver, which is actually a duplicated and tedious work.
> > (2) For features like dynamically trap/untrap pci bars, if they are in
> > vfio-pci, they can be available to most people without repeated code
> > copying and re-testing.
> > (3) with a 1:1 mdev driver which passes through VFs most of the time,
> > people have to decide whether to bind VFs to vfio-pci or mdev parent
> > driver before it runs into a real migration need. However, if vfio-pci
> > is bound initially, they have no chance to do live migration when
> > there's a need later.
> > "
> > particularly, there're some devices (like NVMe) they purely rely on
> > vfio-pci to do device pass-through and they have no standalone parent
> > driver to do mdev way.
> >
> > I think live migration is a general requirement for most devices and
> > to interact with the migration interface requires vendor drivers to do
> > device specific tasks like getting/setting device state,
> > starting/stopping devices, tracking dirty data, report migration
> > capabilities... all those works need be in kernel.
> 
> I think Alex Graf proved they don't necessarily need to be done in kernel back
> in 2015: https://www.youtube.com/watch?v=4RFsSgzuFso
> He was able to achieve i40e VF live migration by only hacking QEMU.  In this
> series you're allowing a vendor driver to interpose itself between the user
> (QEMU) and vfio-pci such that we switch to the vendor code during migration.
> Why can't that interpose layer be in QEMU rather than the kernel?  It seems
> that it only must be in the kernel if we need to provide migration state via
> backdoor, perhaps like going through the PF.  So what access to the i40e VF
> device is not provided to the user through vfio-pci that is necessary to
> implement migration of this device?  The tasks listed above are mostly
> standard device driver activities and clearly vfio-pci allows userspace device
> drivers.
> 
> > do you think it's better to create numerous vendor quirks in vfio-pci?
> 
> In QEMU, perhaps.  Alternatively, let's look at exactly what access is not
> provided through vfio-pci that's necessary for this and decide if we want to
> enable that access or if cracking vfio-pci wide open for vendor drivers to pick
> and choose when and how to use it is really the right answer.
> 
> > as to this series, though patch 9/10 currently only demos reporting a
> > migration region, it actually shows the capability of vendor driver
> > to customize device regions. e.g. in patch 10/10, it customizes the
> > BAR0 to be read/write. and though we abandoned the REMAP BAR irq_type
> > in patch
> > 10/10 for migration purpose, I have to say this irq_type has its usage
> > in other use cases, where synchronization is not a hard requirement
> > and all it needs is a notification channel from kernel to use. this
> > series just provides a possibility for vendors to customize device
> > regions and irqs.
> 
> I don't disagree that a device specific interrupt might be useful, but I would
> object to implementing this one only as an artificial use case.
> We can wait for a legitimate use case to implement that.
> 
> > for interfaces exported in patch 3/10-5/10, they anyway need to be
> > exported for writing mdev parent drivers that pass through devices at
> > normal time to avoid duplication. and yes, your worry about
> 
> Where are those parent drivers?  What are their actual requirements?
> 
> > identification of bug sources is reasonable. but if a device is
> > binding to vfio-pci with a vendor module loaded, and there's a bug,
> > they can do at least two ways to identify if it's a bug in vfio-pci itself.
> > (1) prevent vendor modules from loading and see if the problem exists
> > with pure vfio-pci.
> > (2) do what's demoed in patch 8/10, i.e. do nothing but simply pass
> > all operations to vfio-pci.
> 
> The code split is still extremely ad-hoc, there's no API.  An mdev driver isn't
> even a sub-driver of vfio-pci like you're trying to accomplish here, there
> would need to be a much more defined API when the base device isn't even a
> vfio_pci_device.  I don't see how this series would directly enable an mdev
> use case.
> 
> > so, do you think this series has its merit and we can continue
> > improving it?
> 
> I think this series is trying to push an artificial use case that is perhaps better
> done in userspace.  What is the actual interaction with the VF device that can
> only be done in the host kernel for this example?  Thanks,

Hi Alex,

As shared at KVM Forum last November (https://www.youtube.com/watch?v=aiCCUFXxVEA),
we already have one PoC working internally. This series is part of that; if it goes
well, we plan to support it in our future network, storage, security, etc. device
drivers.

This series has two enhancements to support passthrough device live migration:
general support for SR-IOV live migration, and software-assisted dirty page tracking.
We tried PoCs for other solutions too, but this series seems to strike the best
balance among feasibility, code duplication, performance, etc.

We are focusing more on enabling our latest E810 NIC product now, but we
will check again how we could make the code public earlier, either as a low-quality
i40e PoC or as a formal E810 driver, so that you may see "the actual interaction"
more clearly.

Thanks,
--Shaopeng
> 
> Alex


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-04  4:10                   ` Alex Williamson
  2020-06-05  0:26                     ` He, Shaopeng
@ 2020-06-05  2:02                     ` Yan Zhao
  2020-06-05 16:13                       ` Alex Williamson
  1 sibling, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-05  2:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:
> On Wed, 3 Jun 2020 22:42:28 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:
> > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:  
> > > > > I'm not at all happy with this.  Why do we need to hide the migration
> > > > > sparse mmap from the user until migration time?  What if instead we
> > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > > where the existing capability is the normal runtime sparse setup and
> > > > > the user is required to use this new one prior to enabling device_state
> > > > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > > > the region and refuse to change device_state if there are outstanding
> > > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > backwards compatible to the extent that a vendor driver requiring this
> > > > > will automatically fail migration.
> > > > >     
> > > > right. looks we need to use this approach to solve the problem.
> > > > thanks for your guide.
> > > > so I'll abandon the current remap irq way for dirty tracking during live
> > > > migration.
> > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > then, what do you think about patches 1-5?  
> > > 
> > > In broad strokes, I don't think we've found the right solution yet.  I
> > > really question whether it's supportable to parcel out vfio-pci like
> > > this and I don't know how I'd support unraveling whether we have a bug
> > > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > > of vfio-pci.
> > >
> > > Let me also ask, why does any of this need to be in the kernel?  We
> > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > We have two patches creating device specific interrupts and a BAR
> > > remapping scheme that we've decided we don't need.  That brings us to
> > > the actual i40e vendor driver, where the first patch is simply making
> > > the vendor driver work like vfio-pci already does, the second patch is
> > > handling the migration region, and the third patch is implementing the
> > > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > > actually find the small bit of code that's required to support
> > > migration outside of just dealing with the protocol we've defined to
> > > expose this from the kernel.  So why are we trying to do this in the
> > > kernel?  We have quirk support in QEMU, we can easily flip
> > > MemoryRegions on and off, etc.  What access to the device outside of
> > > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > > implement this migration support for i40e VFs?  Is this just an
> > > exercise in making use of the migration interface?  Thanks,
> > >   
> > hi Alex
> > 
> > There was a description of intention of this series in RFC v1
> > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > sorry, I didn't include it starting from RFC v2.
> > 
> > "
> > The reason why we don't choose the way of writing mdev parent driver is
> > that
> 
> I didn't mention an mdev approach, I'm asking what are we accomplishing
> by doing this in the kernel at all versus exposing the device as normal
> through vfio-pci and providing the migration support in QEMU.  Are you
> actually leveraging having some sort of access to the PF in supporting
> migration of the VF?  Is vfio-pci masking the device in a way that
> prevents migrating the state from QEMU?
>
Yes, communication with the PF is required. VF state is managed by the PF and
is queried from the PF when the VF is stopped.

Migration support in QEMU seems only suitable for devices whose dirty
pages and device state are available by reading/writing device MMIOs, which
is not the case for most devices.

> > (1) VFs are almost all the time directly passthroughed. Directly binding
> > to vfio-pci can make most of the code shared/reused. If we write a
> > vendor specific mdev parent driver, most of the code (like passthrough
> > style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> > actually a duplicated and tedious work.
> > (2) For features like dynamically trap/untrap pci bars, if they are in
> > vfio-pci, they can be available to most people without repeated code
> > copying and re-testing.
> > (3) with a 1:1 mdev driver which passes through VFs most of the time, people
> > have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> > it runs into a real migration need. However, if vfio-pci is bound
> > initially, they have no chance to do live migration when there's a need
> > later.
> > "
> > particularly, there're some devices (like NVMe) they purely rely on
> > vfio-pci to do device pass-through and they have no standalone parent driver
> > to do mdev way.
> > 
> > I think live migration is a general requirement for most devices and to
> > interact with the migration interface requires vendor drivers to do
> > device specific tasks like getting/setting device state, starting/stopping
> > devices, tracking dirty data, report migration capabilities... all those
> > works need be in kernel.
> 
> I think Alex Graf proved they don't necessarily need to be done in
> kernel back in 2015: https://www.youtube.com/watch?v=4RFsSgzuFso
> He was able to achieve i40e VF live migration by only hacking QEMU.  In
I checked the QEMU code (https://github.com/agraf/qemu/tree/vfio-i40vf).
A new vfio-i40e device type is registered as a child type of vfio-pci, along
with its own exclusive savevm handlers, which are not compatible with Kirti's
general VFIO live migration framework.

> this series you're allowing a vendor driver to interpose itself between
> the user (QEMU) and vfio-pci such that we switch to the vendor code
> during migration.  Why can't that interpose layer be in QEMU rather
> than the kernel?  It seems that it only must be in the kernel if we
> need to provide migration state via backdoor, perhaps like going
> through the PF.  So what access to the i40e VF device is not provided to
> the user through vfio-pci that is necessary to implement migration of
> this device?  The tasks listed above are mostly standard device driver
> activities and clearly vfio-pci allows userspace device drivers.
> 
Tasks like interacting with the PF driver, preparing resources, tracking dirty
pages in device-internal memory, detecting whether dirty pages can be
tracked by hardware, reporting migration capabilities, exposing a
hardware dirty bitmap buffer... all of those are hard to do in QEMU.

Maintaining migration code in the kernel also allows vendors to reuse common
code for devices across generations. E.g. for i40e, software dirty page
tracking is used in some generations, hardware dirty tracking is enabled in
some, and leveraging the IOMMU A/D bit is feasible in others. Would QEMU
quirks allow such flexibility as in-kernel code does?

Besides, the migration version string exposed as a sysfs attribute requires a
vendor driver to generate it.

> > do you think it's better to create numerous vendor quirks in vfio-pci?
> 
> In QEMU, perhaps.  Alternatively, let's look at exactly what access is
> not provided through vfio-pci that's necessary for this and decide if
> we want to enable that access or if cracking vfio-pci wide open for
> vendor drivers to pick and choose when and how to use it is really the
> right answer.
> 
I think the position of the vendor modules is just like that of vfio_pci_igd.c
under vfio-pci. The difference is that the vendor modules can be
dynamically loaded from outside of vfio-pci.

> > as to this series, though patch 9/10 currently only demos reporting a
> > migration region, it actually shows the capability of vendor driver to
> > customize device regions. e.g. in patch 10/10, it customizes the BAR0 to
> > be read/write. and though we abandoned the REMAP BAR irq_type in patch
> > 10/10 for migration purpose, I have to say this irq_type has its usage
> > in other use cases, where synchronization is not a hard requirement and
> > all it needs is a notification channel from kernel to use. this series
> > just provides a possibility for vendors to customize device regions and
> > irqs.
> 
> I don't disagree that a device specific interrupt might be useful, but
> I would object to implementing this one only as an artificial use case.
> We can wait for a legitimate use case to implement that.
>
ok. sure.

> > for interfaces exported in patch 3/10-5/10, they anyway need to be
> > exported for writing mdev parent drivers that pass through devices at
> > normal time to avoid duplication. and yes, your worry about
> 
> Where are those parent drivers?  What are their actual requirements?
>
If this way of registering vendor ops with vfio-pci is not permitted,
vendors have to resort to writing their own mdev parent drivers for VFs. Those
parent drivers need to pass through the VFs in normal operation, doing exactly
what vfio-pci does, and only do what the vendor ops do during migration.

If vfio-pci could export common code to those parent drivers, lots of
duplicated code could be avoided.

> > identification of bug sources is reasonable. but if a device is binding
> > to vfio-pci with a vendor module loaded, and there's a bug, they can do at
> > least two ways to identify if it's a bug in vfio-pci itself.
> > (1) prevent vendor modules from loading and see if the problem exists
> > with pure vfio-pci.
> > (2) do what's demoed in patch 8/10, i.e. do nothing but simply pass all
> > operations to vfio-pci.
> 
> The code split is still extremely ad-hoc, there's no API.  An mdev
> driver isn't even a sub-driver of vfio-pci like you're trying to
> accomplish here, there would need to be a much more defined API when
> the base device isn't even a vfio_pci_device.  I don't see how this
> series would directly enable an mdev use case.
> 
Similar to Yi's series (https://patchwork.kernel.org/patch/11320841/),
we can factor the vdev creation code out of vfio_pci_probe() to allow calling
it from an mdev parent's probe routine (and, of course, also factor out the
code to free the vdev). E.g.:

void *vfio_pci_alloc_vdev(struct pci_dev *pdev, const struct pci_device_id *id)
{
        struct vfio_pci_device *vdev;

        vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
        if (!vdev)
                return ERR_PTR(-ENOMEM);

        vdev->pdev = pdev;
        vdev->irq_type = VFIO_PCI_NUM_IRQS;
        mutex_init(&vdev->igate);
        spin_lock_init(&vdev->irqlock);
        mutex_init(&vdev->ioeventfds_lock);
        INIT_LIST_HEAD(&vdev->ioeventfds_list);
        ...
        vfio_pci_probe_power_state(vdev);

        if (!disable_idle_d3) {
                vfio_pci_set_power_state(vdev, PCI_D0);
                vfio_pci_set_power_state(vdev, PCI_D3hot);
        }
        return vdev;
}

static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
                                      const struct pci_device_id *id)
{
        void *vdev = vfio_pci_alloc_vdev(pdev, id);

        if (IS_ERR(vdev))
                return PTR_ERR(vdev);

        /* save the vdev pointer */

        return 0;
}
Then all the interfaces exported by this series can also benefit the
mdev use case.

> > so, do you think this series has its merit and we can continue improving
> > it?
> 
> I think this series is trying to push an artificial use case that is
> perhaps better done in userspace.  What is the actual interaction with
> the VF device that can only be done in the host kernel for this
> example?  Thanks,
>
Yes, as in Shaopeng's reply, we're looking forward to an early release of the
code.
Besides, there are other real use cases as well, like NVMe and QAT;
we can show the interactions for those devices too.

Thanks
Yan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 02/10] vfio/pci: macros to generate module_init and module_exit for vendor modules
  2020-06-04 15:01   ` Cornelia Huck
@ 2020-06-05  2:05     ` Yan Zhao
  0 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-06-05  2:05 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, linux-kernel, alex.williamson, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Thu, Jun 04, 2020 at 05:01:06PM +0200, Cornelia Huck wrote:
> On Sun, 17 May 2020 22:45:10 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > vendor modules call macro module_vfio_pci_register_vendor_handler to
> > generate module_init and module_exit.
> > It is necessary to ensure that vendor modules always call
> > vfio_pci_register_vendor_driver() on driver loading and
> > vfio_pci_unregister_vendor_driver on driver unloading,
> > because
> > (1) at compiling time, there's only a dependency of vendor modules on
> > vfio_pci.
> > (2) at runtime,
> > - vendor modules add refs of vfio_pci on a successful calling of
> >   vfio_pci_register_vendor_driver() and deref of vfio_pci on a
> >   successful calling of vfio_pci_unregister_vendor_driver().
> > - vfio_pci only adds refs of vendor module on a successful probe of vendor
> >   driver.
> >   vfio_pci derefs vendor module when unbinding from a device.
> > 
> > So, after vfio_pci is unbound from a device, the vendor module to that
> > device is free to get unloaded. However, if that vendor module does not
> > call vfio_pci_unregister_vendor_driver() in its module_exit, vfio_pci may
> > hold a stale pointer to vendor module.
> > 
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  include/linux/vfio.h | 27 +++++++++++++++++++++++++++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index 3e53deb012b6..f3746608c2d9 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -223,4 +223,31 @@ struct vfio_pci_vendor_driver_ops {
> >  };
> >  int __vfio_pci_register_vendor_driver(struct vfio_pci_vendor_driver_ops *ops);
> >  void vfio_pci_unregister_vendor_driver(struct vfio_device_ops *device_ops);
> > +
> > +#define vfio_pci_register_vendor_driver(__name, __probe, __remove,	\
> > +					__device_ops)			\
> > +static struct vfio_pci_vendor_driver_ops  __ops ## _node = {		\
> > +	.owner		= THIS_MODULE,					\
> > +	.name		= __name,					\
> > +	.probe		= __probe,					\
> > +	.remove		= __remove,					\
> > +	.device_ops	= __device_ops,					\
> > +};									\
> > +__vfio_pci_register_vendor_driver(&__ops ## _node)
> > +
> > +#define module_vfio_pci_register_vendor_handler(name, probe, remove,	\
> > +						device_ops)		\
> > +static int __init device_ops ## _module_init(void)			\
> > +{									\
> > +	vfio_pci_register_vendor_driver(name, probe, remove,		\
> > +					device_ops);			\
> 
> What if this function fails (e.g. with -ENOMEM)?
>
Right, I need to return the error in that case.

Thanks for pointing it out!

Yan

> > +	return 0;							\
> > +};									\
> > +static void __exit device_ops ## _module_exit(void)			\
> > +{									\
> > +	vfio_pci_unregister_vendor_driver(device_ops);			\
> > +};									\
> > +module_init(device_ops ## _module_init);				\
> > +module_exit(device_ops ## _module_exit)
> > +
> >  #endif /* VFIO_H */
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs
  2020-06-04 15:25   ` Cornelia Huck
@ 2020-06-05  2:15     ` Yan Zhao
  2020-06-11 12:31       ` David Edmondson
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-05  2:15 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, linux-kernel, alex.williamson, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Thu, Jun 04, 2020 at 05:25:15PM +0200, Cornelia Huck wrote:
> On Sun, 17 May 2020 22:49:44 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > This allows a simpler VFIO_DEVICE_GET_INFO ioctl in vendor driver
> > 
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c         | 23 +++++++++++++++++++++--
> >  drivers/vfio/pci/vfio_pci_private.h |  2 ++
> >  include/linux/vfio.h                |  3 +++
> >  3 files changed, 26 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 290b7ab55ecf..30137c1c5308 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -105,6 +105,24 @@ void *vfio_pci_vendor_data(void *device_data)
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_pci_vendor_data);
> >  
> > +int vfio_pci_set_vendor_regions(void *device_data, int num_vendor_regions)
> > +{
> > +	struct vfio_pci_device *vdev = device_data;
> > +
> > +	vdev->num_vendor_regions = num_vendor_regions;
> 
> Do we need any kind of sanity check here, in case this is called with a
> bogus value?
>
You are right, it at least needs to be >= 0.
Maybe the type "unsigned int" is more appropriate for num_vendor_regions.
We don't need to check its max value, as QEMU would check it.

> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_pci_set_vendor_regions);
> > +
> > +
> > +int vfio_pci_set_vendor_irqs(void *device_data, int num_vendor_irqs)
> > +{
> > +	struct vfio_pci_device *vdev = device_data;
> > +
> > +	vdev->num_vendor_irqs = num_vendor_irqs;
> 
> Here as well.
Yes, I will change the type to "unsigned int".
Thank you for kindly reviewing :)

Yan

> 
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_pci_set_vendor_irqs);
> >  /*
> >   * Our VGA arbiter participation is limited since we don't know anything
> >   * about the device itself.  However, if the device is the only VGA device
> 
> (...)
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-05  2:02                     ` Yan Zhao
@ 2020-06-05 16:13                       ` Alex Williamson
  2020-06-10  5:23                         ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2020-06-05 16:13 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Thu, 4 Jun 2020 22:02:31 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:
> > On Wed, 3 Jun 2020 22:42:28 -0400
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:  
> > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:    
> > > > > > I'm not at all happy with this.  Why do we need to hide the migration
> > > > > > sparse mmap from the user until migration time?  What if instead we
> > > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > > > where the existing capability is the normal runtime sparse setup and
> > > > > > the user is required to use this new one prior to enabling device_state
> > > > > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > > > > the region and refuse to change device_state if there are outstanding
> > > > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > > backwards compatible to the extent that a vendor driver requiring this
> > > > > > will automatically fail migration.
> > > > > >       
> > > > > right. looks we need to use this approach to solve the problem.
> > > > > thanks for your guide.
> > > > > so I'll abandon the current remap irq way for dirty tracking during live
> > > > > migration.
> > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > then, what do you think about patches 1-5?    
> > > > 
> > > > In broad strokes, I don't think we've found the right solution yet.  I
> > > > really question whether it's supportable to parcel out vfio-pci like
> > > > this and I don't know how I'd support unraveling whether we have a bug
> > > > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > > > of vfio-pci.
> > > >
> > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > We have two patches creating device specific interrupts and a BAR
> > > > remapping scheme that we've decided we don't need.  That brings us to
> > > > the actual i40e vendor driver, where the first patch is simply making
> > > > the vendor driver work like vfio-pci already does, the second patch is
> > > > handling the migration region, and the third patch is implementing the
> > > > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > > > actually find the small bit of code that's required to support
> > > > migration outside of just dealing with the protocol we've defined to
> > > > expose this from the kernel.  So why are we trying to do this in the
> > > > kernel?  We have quirk support in QEMU, we can easily flip
> > > > MemoryRegions on and off, etc.  What access to the device outside of
> > > > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > > > implement this migration support for i40e VFs?  Is this just an
> > > > exercise in making use of the migration interface?  Thanks,
> > > >     
> > > hi Alex
> > > 
> > > There was a description of intention of this series in RFC v1
> > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > sorry, I didn't include it starting from RFC v2.
> > > 
> > > "
> > > The reason why we don't choose the way of writing mdev parent driver is
> > > that  
> > 
> > I didn't mention an mdev approach, I'm asking what are we accomplishing
> > by doing this in the kernel at all versus exposing the device as normal
> > through vfio-pci and providing the migration support in QEMU.  Are you
> > actually leveraging having some sort of access to the PF in supporting
> > migration of the VF?  Is vfio-pci masking the device in a way that
> > prevents migrating the state from QEMU?
> >  
> Yes, communication with the PF is required. VF state is managed by the PF and
> is queried from the PF when the VF is stopped.
> 
> Migration support in QEMU seems only suitable for devices whose dirty
> pages and device state are available by reading/writing device MMIOs, which
> is not the case for most devices.

Post code for such a device.
 
> > > (1) VFs are almost all the time directly passthroughed. Directly binding
> > > to vfio-pci can make most of the code shared/reused. If we write a
> > > vendor specific mdev parent driver, most of the code (like passthrough
> > > style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> > > actually a duplicated and tedious work.
> > > (2) For features like dynamically trap/untrap pci bars, if they are in
> > > vfio-pci, they can be available to most people without repeated code
> > > copying and re-testing.
> > > (3) with a 1:1 mdev driver which passes through VFs most of the time, people
> > > have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> > > it runs into a real migration need. However, if vfio-pci is bound
> > > initially, they have no chance to do live migration when there's a need
> > > later.
> > > "
> > > particularly, there're some devices (like NVMe) that purely rely on
> > > vfio-pci for device pass-through and have no standalone parent driver
> > > to take the mdev route.
> > > 
> > > I think live migration is a general requirement for most devices, and
> > > interacting with the migration interface requires vendor drivers to do
> > > device specific tasks like getting/setting device state, starting/stopping
> > > devices, tracking dirty data, reporting migration capabilities... all that
> > > work needs to be in the kernel.  
> > 
> > I think Alex Graf proved they don't necessarily need to be done in
> > kernel back in 2015: https://www.youtube.com/watch?v=4RFsSgzuFso
> > He was able to achieve i40e VF live migration by only hacking QEMU.  In  
> I checked the qemu code: https://github.com/agraf/qemu/tree/vfio-i40vf.
> A new vfio-i40e device type is registered as a child type of vfio-pci, along
> with its own exclusive savevm handlers, which are not compatible with Kirti's
> general VFIO live migration framework.

Obviously, saved state is managed within QEMU.  We've already seen
pushback to using mdev as a means to implement emulation in the kernel.
A vfio migration interface is not an excuse to move everything to
in-kernel drivers just to make use of it.  IF migration support can be
achieved for a device within QEMU, then that's the correct place to put
it.

> > this series you're allowing a vendor driver to interpose itself between
> > the user (QEMU) and vfio-pci such that we switch to the vendor code
> > during migration.  Why can't that interpose layer be in QEMU rather
> > than the kernel?  It seems that it only must be in the kernel if we
> > need to provide migration state via backdoor, perhaps like going
> > through the PF.  So what access to the i40e VF device is not provided to
> > the user through vfio-pci that is necessary to implement migration of
> > this device?  The tasks listed above are mostly standard device driver
> > activities and clearly vfio-pci allows userspace device drivers.
> >   
> tasks like interacting with the PF driver, preparing resources, tracking
> dirty pages in device-internal memory, detecting whether dirty pages can
> be tracked by hardware, reporting migration capabilities, exposing a
> hardware dirty bitmap buffer... all of those are hard to do in QEMU.

Something being easier to do in the kernel does not automatically make
the kernel the right place to do it.  The kernel manages resources, so
if access through a PF, where the PF is a shared resource, is
necessary then those aspects might justify a kernel interface.  We
should also consider that the kernel presents a much richer attack
vector.  QEMU is already confined and a single set of ioctls through
vfio-pci is much easier to audit for security than allowing every
vendor driver to re-implement their own version.  Attempting to re-use
vfio-pci code is an effort to contain that risk, but I think it ends up
turning into a Frankenstein's monster of intermingled dependencies
without a defined API.

> maintaining migration code in the kernel also allows vendors to re-use
> common code for devices across generations. e.g. for i40e, some generations
> use software dirty page tracking, some generations enable hardware dirty
> tracking, and in some other generations leveraging the IOMMU A/D bit is
> feasible. Do QEMU quirks allow the same flexibility as the kernel?
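
[Editor's note: the per-generation reuse described above can be modeled, in kernel
or userspace alike, as a table of dirty-tracking backends selected by a capability
id. A minimal userspace sketch — every name and the capability id below are
invented for illustration, not real i40e or vfio interfaces:]

```c
#include <stddef.h>

/* One dirty-tracking backend per device generation. */
struct dirty_track_ops {
	const char *name;
	int (*start)(void);
};

static int sw_track_start(void) { return 0; } /* trap writes in software */
static int hw_track_start(void) { return 0; } /* device-side dirty log   */
static int iommu_ad_start(void) { return 0; } /* IOMMU A/D bit scanning  */

static const struct dirty_track_ops backends[] = {
	{ "software", sw_track_start },
	{ "hardware", hw_track_start },
	{ "iommu-ad", iommu_ad_start },
};

/* Common migration code picks a backend from a per-generation capability
 * id and never needs to know which mechanism is underneath. */
const struct dirty_track_ops *select_dirty_backend(unsigned int gen_cap)
{
	return gen_cap < 3 ? &backends[gen_cap] : NULL;
}
```

[The same ops-table pattern works equally well inside QEMU, which is part of the
disagreement that follows.]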

These arguments all sound like excuses, ie. hiding migration code in
the kernel for convenience.  Obviously we can re-use code between
devices in QEMU.  What I think I see happening here is using the vfio
migration interface as an excuse to push more code into the kernel, and
the vfio-pci vendor extensions are a mechanism to masquerade behind a
known driver and avoid defining interfaces for specific features.

> besides, the migration version string exposed as a sysfs attribute
> requires a vendor driver to generate it.

We don't have that yet anyway, and it's also a false dependency.  The
external version string is required _because_ the migration backend is
not provided _within_ QEMU.  If QEMU manages generating the migration
data for a device, we don't need an external version string.

> > > do you think it's better to create numerous vendor quirks in vfio-pci?  
> > 
> > In QEMU, perhaps.  Alternatively, let's look at exactly what access is
> > not provided through vfio-pci that's necessary for this and decide if
> > we want to enable that access or if cracking vfio-pci wide open for
> > vendor drivers to pick and choose when and how to use it is really the
> > right answer.
> >   
> I think the position of the vendor modules is just like vfio_pci_igd.c
> under vfio-pci. The difference is that the vendor modules can be
> dynamically loaded from outside vfio-pci.

No, this is entirely false.  vfio_pci_igd provides two supplemental,
read-only regions necessary to satisfy some of the dependencies of the
guest driver.  It does not attempt to take over the device.
 
> > > as to this series, though patch 9/10 currently only demos reporting a
> > > migration region, it actually shows the capability of the vendor driver
> > > to customize device regions. e.g. in patch 10/10, it customizes BAR0 to
> > > be read/write. and though we abandoned the REMAP BAR irq_type in patch
> > > 10/10 for migration purposes, I have to say this irq_type has its usage
> > > in other use cases, where synchronization is not a hard requirement and
> > > all that is needed is a notification channel from kernel to user. this
> > > series just provides a possibility for vendors to customize device
> > > regions and irqs.  
> > 
> > I don't disagree that a device specific interrupt might be useful, but
> > I would object to implementing this one only as an artificial use case.
> > We can wait for a legitimate use case to implement that.
> >  
> ok. sure.
> 
> > > for interfaces exported in patch 3/10-5/10, they anyway need to be
> > > exported for writing mdev parent drivers that pass through devices at
> > > normal time to avoid duplication. and yes, your worry about  
> > 
> > Where are those parent drivers?  What are their actual requirements?
> >  
> if this way of registering vendor ops with vfio-pci is not permitted,
> vendors have to resort to writing their own mdev parent drivers for VFs.
> Those parent drivers need to pass through VFs in normal operation, doing
> exactly what vfio-pci does, and only do what the vendor ops do during
> migration.
> 
> if vfio-pci could export common code to those parent drivers, lots of
> duplicated code can be avoided.

There are two sides to this argument though.  We could also argue that
mdev has already made it too easy to implement device emulation in the
kernel, the barrier is that such emulation is more transparent because
it does require a fair bit of code duplication from vfio-pci.  If we
make it easier to simply re-use vfio-pci for much of this, and even
take it a step further by allowing vendor drivers to masquerade behind
vfio-pci, then we're creating an environment where vendors don't need
to work with QEMU to get their device emulation accepted.  They can
write their own vendor drivers, which are now simplified and sanctioned
by exported functions in vfio-pci.  They can do this easily and open up
massive attack vectors, hiding behind vfio-pci.

I know that I was advocating avoiding user driver confusion, ie. does
the user bind a device to vfio-pci, i40e_vf_vfio, etc, but maybe that's
the barrier we need such that a user can make an informed decision
about what they're actually using.  If a vendor then wants to implement
a feature in vfio-pci, we'll need to architect an interface for it
rather than letting them pick and choose which pieces of vfio-pci to
override.

> > > identification of bug sources is reasonable. but if a device is bound
> > > to vfio-pci with a vendor module loaded and there's a bug, there are at
> > > least two ways to identify whether it's a bug in vfio-pci itself.
> > > (1) prevent vendor modules from loading and see if the problem exists
> > > with pure vfio-pci.
> > > (2) do what's demoed in patch 8/10, i.e. do nothing but simply pass all
> > > operations to vfio-pci.  
> > 
> > The code split is still extremely ad-hoc, there's no API.  An mdev
> > driver isn't even a sub-driver of vfio-pci like you're trying to
> > accomplish here, there would need to be a much more defined API when
> > the base device isn't even a vfio_pci_device.  I don't see how this
> > series would directly enable an mdev use case.
> >   
> similar to Yi's series https://patchwork.kernel.org/patch/11320841/,
> we can factor out the vdev creation code in vfio_pci_probe() so it can be
> called from an mdev parent probe routine (and, of course, also factor out
> the code to free the vdev).
> e.g.
> 
> void *vfio_pci_alloc_vdev(struct pci_dev *pdev, const struct pci_device_id *id)
> {
> 	struct vfio_pci_device *vdev;
> 
> 	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> 	if (!vdev)
> 		return ERR_PTR(-ENOMEM);
> 
> 	vdev->pdev = pdev;
> 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> 	mutex_init(&vdev->igate);
> 	spin_lock_init(&vdev->irqlock);
> 	mutex_init(&vdev->ioeventfds_lock);
> 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> 	...
> 	vfio_pci_probe_power_state(vdev);
> 
> 	if (!disable_idle_d3) {
> 		vfio_pci_set_power_state(vdev, PCI_D0);
> 		vfio_pci_set_power_state(vdev, PCI_D3hot);
> 	}
> 	return vdev;
> }
> 
> static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> {
> 	void *vdev = vfio_pci_alloc_vdev(pdev, id);
> 
> 	if (IS_ERR(vdev))
> 		return PTR_ERR(vdev);
> 
> 	/* save the vdev pointer */
> 	return 0;
> }
> then all the exported interfaces from this series can also benefit the
> mdev use case.

You need to convince me that we're not just doing this for the sake of
re-using a migration interface.  We do need vendor specific drivers to
support migration, but implementing those vendor specific drivers in
the kernel just because we have that interface is the wrong answer.  If
we can implement that device specific migration support in QEMU and
limit the attack surface from the hypervisor or guest into the host
kernel, that's a better answer.  As I've noted above, I'm afraid all of
these attempts to parcel out vfio-pci are only going to serve to
proliferate vendor modules that have limited community review, expand
the attack surface, and potentially harm the vfio ecosystem overall
through bad actors and reduced autonomy.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-05  0:26                     ` He, Shaopeng
@ 2020-06-05 17:54                       ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2020-06-05 17:54 UTC (permalink / raw)
  To: He, Shaopeng
  Cc: Zhao, Yan Y, kvm, linux-kernel, cohuck, zhenyuw, Wang, Zhi A,
	Tian, Kevin, Liu, Yi L, Zeng, Xin, Yuan, Hang

On Fri, 5 Jun 2020 00:26:10 +0000
"He, Shaopeng" <shaopeng.he@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, June 4, 2020 12:11 PM
> > 
> > On Wed, 3 Jun 2020 22:42:28 -0400
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:  
> > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >  
> > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:  
> > > > > > I'm not at all happy with this.  Why do we need to hide the
> > > > > > migration sparse mmap from the user until migration time?  What
> > > > > > if instead we introduced a new
> > > > > > VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability where the
> > > > > > existing capability is the normal runtime sparse setup and the
> > > > > > user is required to use this new one prior to enabling
> > > > > > device_state with _SAVING.  The vendor driver could then simply
> > > > > > track mmap vmas to the region and refuse to change device_state
> > > > > > if there are outstanding mmaps conflicting with the _SAVING
> > > > > > sparse mmap layout.  No new IRQs required, no new irqfds, an
> > > > > > incremental change to the protocol, backwards compatible to the
> > > > > > extent that a vendor driver requiring this will automatically fail
> > > > > > migration.
> > > > > >  
> > > > > right. looks we need to use this approach to solve the problem.
> > > > > thanks for your guide.
> > > > > so I'll abandon the current remap irq way for dirty tracking
> > > > > during live migration.
> > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > then, what do you think about patches 1-5?  
> > > >
> > > > In broad strokes, I don't think we've found the right solution yet.
> > > > I really question whether it's supportable to parcel out vfio-pci
> > > > like this and I don't know how I'd support unraveling whether we
> > > > have a bug in vfio-pci, the vendor driver, or how the vendor driver
> > > > is making use of vfio-pci.
> > > >
> > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > We have two patches creating device specific interrupts and a BAR
> > > > remapping scheme that we've decided we don't need.  That brings us
> > > > to the actual i40e vendor driver, where the first patch is simply
> > > > making the vendor driver work like vfio-pci already does, the second
> > > > patch is handling the migration region, and the third patch is
> > > > implementing the BAR remapping IRQ that we decided we don't need.
> > > > It's difficult to actually find the small bit of code that's
> > > > required to support migration outside of just dealing with the
> > > > protocol we've defined to expose this from the kernel.  So why are
> > > > we trying to do this in the kernel?  We have quirk support in QEMU,
> > > > we can easily flip MemoryRegions on and off, etc.  What access to
> > > > the device outside of what vfio-pci provides to the user, and
> > > > therefore QEMU, is necessary to implement this migration support for
> > > > i40e VFs?  Is this just an exercise in making use of the migration
> > > > interface?  Thanks,
> > > >  
> > > hi Alex
> > >
> > > There was a description of intention of this series in RFC v1
> > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > sorry, I didn't include it starting from RFC v2.
> > >
> > > "
> > > The reason why we don't choose the way of writing mdev parent driver
> > > is that  
> > 
> > I didn't mention an mdev approach, I'm asking what are we accomplishing by
> > doing this in the kernel at all versus exposing the device as normal through
> > vfio-pci and providing the migration support in QEMU.  Are you actually
> > leveraging having some sort of access to the PF in supporting migration of the
> > VF?  Is vfio-pci masking the device in a way that prevents migrating the state
> > from QEMU?
> >   
> > > (1) VFs are almost always directly passed through. Directly
> > > binding to vfio-pci can make most of the code shared/reused. If we
> > > write a vendor specific mdev parent driver, most of the code (like
> > > passthrough style of rw/mmap) still needs to be copied from vfio-pci
> > > driver, which is actually a duplicated and tedious work.
> > > (2) For features like dynamically trap/untrap pci bars, if they are in
> > > vfio-pci, they can be available to most people without repeated code
> > > copying and re-testing.
> > > (3) with a 1:1 mdev driver which passes through VFs most of the time,
> > > people have to decide whether to bind VFs to vfio-pci or mdev parent
> > > driver before it runs into a real migration need. However, if vfio-pci
> > > is bound initially, they have no chance to do live migration when
> > > there's a need later.
> > > "
> > > particularly, there're some devices (like NVMe) that purely rely on
> > > vfio-pci for device pass-through and have no standalone parent
> > > driver to take the mdev route.
> > >
> > > I think live migration is a general requirement for most devices, and
> > > interacting with the migration interface requires vendor drivers to do
> > > device specific tasks like getting/setting device state,
> > > starting/stopping devices, tracking dirty data, reporting migration
> > > capabilities... all that work needs to be in the kernel.  
> > 
> > I think Alex Graf proved they don't necessarily need to be done in kernel back
> > in 2015: https://www.youtube.com/watch?v=4RFsSgzuFso
> > He was able to achieve i40e VF live migration by only hacking QEMU.  In this
> > series you're allowing a vendor driver to interpose itself between the user
> > (QEMU) and vfio-pci such that we switch to the vendor code during migration.
> > Why can't that interpose layer be in QEMU rather than the kernel?  It seems
> > that it only must be in the kernel if we need to provide migration state via
> > backdoor, perhaps like going through the PF.  So what access to the i40e VF
> > device is not provided to the user through vfio-pci that is necessary to
> > implement migration of this device?  The tasks listed above are mostly
> > standard device driver activities and clearly vfio-pci allows userspace device
> > drivers.
> >   
> > > do you think it's better to create numerous vendor quirks in vfio-pci?  
> > 
> > In QEMU, perhaps.  Alternatively, let's look at exactly what access is not
> > provided through vfio-pci that's necessary for this and decide if we want to
> > enable that access or if cracking vfio-pci wide open for vendor drivers to pick
> > and choose when and how to use it is really the right answer.
> >   
> > > as to this series, though patch 9/10 currently only demos reporting a
> > > migration region, it actually shows the capability of the vendor driver
> > > to customize device regions. e.g. in patch 10/10, it customizes
> > > BAR0 to be read/write. and though we abandoned the REMAP BAR irq_type
> > > in patch
> > > 10/10 for migration purposes, I have to say this irq_type has its usage
> > > in other use cases, where synchronization is not a hard requirement
> > > and all that is needed is a notification channel from kernel to user.
> > > this series just provides a possibility for vendors to customize device
> > > regions and irqs.  
> > 
> > I don't disagree that a device specific interrupt might be useful, but I would
> > object to implementing this one only as an artificial use case.
> > We can wait for a legitimate use case to implement that.
> >   
> > > for interfaces exported in patch 3/10-5/10, they anyway need to be
> > > exported for writing mdev parent drivers that pass through devices at
> > > normal time to avoid duplication. and yes, your worry about  
> > 
> > Where are those parent drivers?  What are their actual requirements?
> >   
> > > identification of bug sources is reasonable. but if a device is
> > > bound to vfio-pci with a vendor module loaded and there's a bug,
> > > there are at least two ways to identify whether it's a bug in vfio-pci itself.
> > > (1) prevent vendor modules from loading and see if the problem exists
> > > with pure vfio-pci.
> > > (2) do what's demoed in patch 8/10, i.e. do nothing but simply pass
> > > all operations to vfio-pci.  
> > 
> > The code split is still extremely ad-hoc, there's no API.  An mdev driver isn't
> > even a sub-driver of vfio-pci like you're trying to accomplish here, there
> > would need to be a much more defined API when the base device isn't even a
> > vfio_pci_device.  I don't see how this series would directly enable an mdev
> > use case.
> >   
> > > so, do you think this series has its merit and we can continue
> > > improving it?  
> > 
> > I think this series is trying to push an artificial use case that is perhaps better
> > done in userspace.  What is the actual interaction with the VF device that can
> > only be done in the host kernel for this example?  Thanks,  
> 
> Hi Alex,
> 
> As shared at KVM Forum last November (https://www.youtube.com/watch?v=aiCCUFXxVEA),
> we already have one PoC working internally. This series is part of that; if it goes
> well, we plan to support it in our future network, storage, security, etc. device drivers.
> 
> This series has two enhancements to support passthrough device live migration:
> general support for SR-IOV live migration and software-assisted dirty page tracking.
> We tried PoCs of other solutions too, but this series seems to strike the best
> balance among feasibility, code duplication, performance, etc.
> 
> We are focusing more on enabling our latest E810 NIC product now, but we
> will check again how we could make it public earlier, as a low-quality i40e PoC
> or a formal E810 driver, so you may see "the actual interaction" more clearly.

"General support for SR-IOV live migration" is not a thing, there's
always device specific code.  "Software assisted dirty page tracking"
implies to me trapping and emulating device accesses in order to learn
about DMA targets, which is something that we can also do in QEMU.

In your list of "balancing on feasibility, code duplication,
performance, etc", an explicit mention of security is strikingly
lacking.  The first two are rather obvious, it's more feasible to
implement features like migration for any device once the code
necessary for such a feature is buried in a vendor driver surrounded by
a small development community.  The actual protocol of the device state
is hidden behind an interface that's opaque to userspace and requires
trust that the vendor has implemented a robust scheme and ultimately
relying on the vendor for ongoing support.  Reducing code duplication by
exporting the guts of vfio-pci makes that even easier.  Vendor drivers
get to ad-hoc pick and choose how and when they interact with vfio-pci,
leaving vfio-pci with objects and device state it can't really trust.
The performance claim is harder to justify as the path to trapping a
region in the kernel vendor driver necessarily passes through trapping
that same region in QEMU.  Maintaining a dirty bitmap in the vfio IOMMU
backend and later retrieving it via the dirty page tracking actually
sounds like more overhead than setting dirty bits within QEMU.
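
[Editor's note: "setting dirty bits within QEMU" is essentially a per-page bitmap
updated whenever a trapped access reveals a DMA write. A minimal, self-contained
userspace sketch — the page count and helper names here are invented for
illustration, not QEMU's actual dirty-bitmap code:]

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, as on x86 */
#define NPAGES     1024 /* arbitrary guest RAM span for the sketch */

static uint8_t dirty_bitmap[NPAGES / 8];

/* Mark every page covered by a DMA write of 'len' bytes at guest physical
 * address 'gpa'; called from the trap handler once a DMA target is known. */
void mark_dirty(uint64_t gpa, uint64_t len)
{
	uint64_t p;

	for (p = gpa >> PAGE_SHIFT; p <= (gpa + len - 1) >> PAGE_SHIFT; p++)
		if (p < NPAGES)
			dirty_bitmap[p / 8] |= (uint8_t)(1u << (p % 8));
}

/* Query helper used when the bitmap is drained into the migration stream. */
int page_is_dirty(uint64_t gpa)
{
	uint64_t p = gpa >> PAGE_SHIFT;

	return (dirty_bitmap[p / 8] >> (p % 8)) & 1;
}
```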

But then there's the entire security aspect where the same thing that I
think makes this more feasible for the vendor driver opens the door for
a much larger attack surface from the user (ie. we're intentionally
choosing to open the door to vendor drivers running privileged inside
the host kernel, reviewed and supported by smaller communities).
Additionally, by masquerading these vendor drivers behind vfio-pci, we
simplify usage for the user, but also prevent them from making an
informed decision of the security risk of binding a device to
"vfio-pci".  Is it really vfio-pci, or is it i40e_vf_migration, which
may or may not share common code with vfio-pci?  VFIO already supports
a plugin bus driver architecture, vendors can already supply their own,
but users/admins must choose to use it.  I originally thought that's
something we should abstract for the user, but I'm beginning to see
that might actually be the one remaining leverage point we have to
architect secure interfaces rather than turning vfio into a playground
of insecure and abandoned vendor drivers.

VFIO migration support is trying to provide a solution for device
specific state and dirty page tracking, but I don't think the desire to
make use of this migration interface in and of itself as justification
for an in-kernel vendor driver.  We assume with mdevs that we're
exposing a portion of a device to the user and the mediation of that
portion of the device contains state and awareness of page dirtying
that we can extract for migration.  Implementing that state awareness
and dirty page tracking in the host kernel for the sole purpose of
being able to extract it via the migration interface seems like a
non-goal and unwise security choice.  Thanks,

Alex



* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-05 16:13                       ` Alex Williamson
@ 2020-06-10  5:23                         ` Yan Zhao
  2020-06-19 22:55                           ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-10  5:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Fri, Jun 05, 2020 at 10:13:01AM -0600, Alex Williamson wrote:
> On Thu, 4 Jun 2020 22:02:31 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:
> > > On Wed, 3 Jun 2020 22:42:28 -0400
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:  
> > > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:    
> > > > > > > I'm not at all happy with this.  Why do we need to hide the migration
> > > > > > > sparse mmap from the user until migration time?  What if instead we
> > > > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > > > > where the existing capability is the normal runtime sparse setup and
> > > > > > > the user is required to use this new one prior to enabling device_state
> > > > > > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > > > > > the region and refuse to change device_state if there are outstanding
> > > > > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > > > backwards compatible to the extent that a vendor driver requiring this
> > > > > > > will automatically fail migration.
> > > > > > >       
> > > > > > right. looks we need to use this approach to solve the problem.
> > > > > > thanks for your guide.
> > > > > > so I'll abandon the current remap irq way for dirty tracking during live
> > > > > > migration.
> > > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > > then, what do you think about patches 1-5?    
> > > > > 
> > > > > In broad strokes, I don't think we've found the right solution yet.  I
> > > > > really question whether it's supportable to parcel out vfio-pci like
> > > > > this and I don't know how I'd support unraveling whether we have a bug
> > > > > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > > > > of vfio-pci.
> > > > >
> > > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > > We have two patches creating device specific interrupts and a BAR
> > > > > remapping scheme that we've decided we don't need.  That brings us to
> > > > > the actual i40e vendor driver, where the first patch is simply making
> > > > > the vendor driver work like vfio-pci already does, the second patch is
> > > > > handling the migration region, and the third patch is implementing the
> > > > > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > > > > actually find the small bit of code that's required to support
> > > > > migration outside of just dealing with the protocol we've defined to
> > > > > expose this from the kernel.  So why are we trying to do this in the
> > > > > kernel?  We have quirk support in QEMU, we can easily flip
> > > > > MemoryRegions on and off, etc.  What access to the device outside of
> > > > > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > > > > implement this migration support for i40e VFs?  Is this just an
> > > > > exercise in making use of the migration interface?  Thanks,
> > > > >     
> > > > hi Alex
> > > > 
> > > > There was a description of intention of this series in RFC v1
> > > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > > sorry, I didn't include it starting from RFC v2.
> > > > 
> > > > "
> > > > The reason why we don't choose the way of writing mdev parent driver is
> > > > that  
> > > 
> > > I didn't mention an mdev approach, I'm asking what are we accomplishing
> > > by doing this in the kernel at all versus exposing the device as normal
> > > through vfio-pci and providing the migration support in QEMU.  Are you
> > > actually leveraging having some sort of access to the PF in supporting
> > > migration of the VF?  Is vfio-pci masking the device in a way that
> > > prevents migrating the state from QEMU?
> > >  
> > yes, communication with the PF is required. VF state is managed by the PF
> > and is queried from the PF when the VF is stopped.
> > 
> > migration support in QEMU seems suitable only for devices whose dirty
> > pages and device state can be obtained by reading/writing device MMIOs,
> > which is not the case for most devices.
> 
> Post code for such a device.
>
hi Alex,
There's an example in the i40e VF: virtual-channel-related resources live in
guest memory, and dirty page tracking requires the info stored in that
guest memory.

there are two ways to get the resource addresses:
(1) always trap the related VF registers, as in Alex Graf's QEMU code.

Starting from the beginning, it tracks reads/writes of the Admin Queue
configuration registers. Then, in the write handler
vfio_i40evf_aq_mmio_mem_region_write(), guest commands are processed to
record the guest DMA addresses of the virtual channel related resources.
e.g. vdev->vsi_config is read from the guest DMA address contained in
the I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES command.


vfio_i40evf_initfn()
{
 ...
 memory_region_init_io(&vdev->aq_mmio_mem, OBJECT(dev),
                          &vfio_i40evf_aq_mmio_mem_region_ops,
                          vdev, "i40evf AQ config",
                          I40E_VFGEN_RSTAT - I40E_VF_ARQBAH1);
 ...
}

vfio_i40evf_aq_mmio_mem_region_write()
{
   ...
    switch (addr) {
    case I40E_VF_ARQBAH1:
    case I40E_VF_ARQBAL1:
    case I40E_VF_ARQH1:
    case I40E_VF_ARQLEN1:
    case I40E_VF_ARQT1:
    case I40E_VF_ATQBAH1:
    case I40E_VF_ATQBAL1:
    case I40E_VF_ATQH1:
    case I40E_VF_ATQT1:
    case I40E_VF_ATQLEN1:
        vfio_i40evf_vw32(vdev, addr, data);
        vfio_i40e_aq_update(vdev); ==> update & process atq commands
        break;
    default:
        vfio_i40evf_w32(vdev, addr, data);
        break;
    }
}
vfio_i40e_aq_update(vdev)
	|-> vfio_i40e_atq_process_one(vdev, vfio_i40evf_vr32(vdev, I40E_VF_ATQH1))
		|-> hwaddr addr = vfio_i40e_get_atqba(vdev) + (index * sizeof(desc));
		|   pci_dma_read(pdev, addr, &desc, sizeof(desc)); // read guest's command
		|   vfio_i40e_record_atq_cmd(vdev, pdev, &desc);

vfio_i40e_record_atq_cmd(...I40eAdminQueueDescriptor *desc) {
	data_addr = desc->params.external.addr_high;
	...

	switch (desc->cookie_high) {
	...
	case I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES:
	pci_dma_read(pdev, data_addr, &vdev->vsi_config,
		         MIN(desc->datalen, sizeof(vdev->vsi_config)));
	...
	}
	...
}


(2) pass through all guest MMIO accesses and only trap MMIO when migration
is about to start.
This is the way we're using in the host vfio-pci vendor driver (or mdev parent
driver) of the i40e VF device (sorry, no public code is available yet).

when migration is about to start, it's already too late to get the guest dma
address for those virtual channel related resources merely by MMIO
trapping, so we have to ask for them from PF.



<...>

> > > > for interfaces exported in patch 3/10-5/10, they anyway need to be
> > > > exported for writing mdev parent drivers that pass through devices at
> > > > normal time to avoid duplication. and yes, your worry about  
> > > 
> > > Where are those parent drivers?  What are their actual requirements?
> > >  
> > if this way of registering vendor ops to vfio-pci is not permitted,
> > vendors have to resort to writing its mdev parent drivers for VFs. Those
> > parent drivers need to pass through VFs at normal time, doing exactly what
> > vfio-pci does and only doing what vendor ops does during migration.
> > 
> > if vfio-pci could export common code to those parent drivers, lots of
> > duplicated code can be avoided.
> 
> There are two sides to this argument though.  We could also argue that
> mdev has already made it too easy to implement device emulation in the
> kernel, the barrier is that such emulation is more transparent because
> it does require a fair bit of code duplication from vfio-pci.  If we
> make it easier to simply re-use vfio-pci for much of this, and even
> take it a step further by allowing vendor drivers to masquerade behind
> vfio-pci, then we're creating an environment where vendors don't need
> to work with QEMU to get their device emulation accepted.  They can
> write their own vendor drivers, which are now simplified and sanctioned
> by exported functions in vfio-pci.  They can do this easily and open up
> massive attack vectors, hiding behind vfio-pci.
> 
your concern is reasonable.

> I know that I was advocating avoiding user driver confusion, ie. does
> the user bind a device to vfio-pci, i40e_vf_vfio, etc, but maybe that's
> the barrier we need such that a user can make an informed decision
> about what they're actually using.  If a vendor then wants to implement
> a feature in vfio-pci, we'll need to architect an interface for it
> rather than letting them pick and choose which pieces of vfio-pci to
> override.
> 
> > > > identification of bug sources is reasonable. but if a device is binding
> > > > to vfio-pci with a vendor module loaded, and there's a bug, they can do at
> > > > least two ways to identify if it's a bug in vfio-pci itself.
> > > > (1) prevent vendor modules from loading and see if the problem exists
> > > > with pure vfio-pci.
> > > > (2) do what's demoed in patch 8/10, i.e. do nothing but simply pass all
> > > > operations to vfio-pci.  
> > > 
> > > The code split is still extremely ad-hoc, there's no API.  An mdev
> > > driver isn't even a sub-driver of vfio-pci like you're trying to
> > > accomplish here, there would need to be a much more defined API when
> > > the base device isn't even a vfio_pci_device.  I don't see how this
> > > series would directly enable an mdev use case.
> > >   
> > similar to Yi's series https://patchwork.kernel.org/patch/11320841/.
> > we can factor out the vdev creation code in vfio_pci_probe() to allow calling it from an
> > mdev parent's probe routine (and of course also factor out the code to free the vdev),
> > e.g.
> > 
> > void *vfio_pci_alloc_vdev(struct pci_dev *pdev, const struct pci_device_id *id)
> > {
> > 	struct vfio_pci_device *vdev;
> >         vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> >         if (!vdev)
> >                 return ERR_PTR(-ENOMEM);
> > 
> >         vdev->pdev = pdev;
> >         vdev->irq_type = VFIO_PCI_NUM_IRQS;
> >         mutex_init(&vdev->igate);
> >         spin_lock_init(&vdev->irqlock);
> >         mutex_init(&vdev->ioeventfds_lock);
> >         INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > 	...
> > 	vfio_pci_probe_power_state(vdev);
> > 
> >         if (!disable_idle_d3) {
> >                 vfio_pci_set_power_state(vdev, PCI_D0);
> >                 vfio_pci_set_power_state(vdev, PCI_D3hot);
> >         }
> > 	return vdev;
> > }
> > 
> > static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > {
> > 
> >        void *vdev = vfio_pci_alloc_vdev(pdev, id);
> > 
> >        /* save the vdev pointer */
> > 
> > }
> > then all the exported interfaces from this series can also benefit the
> > mdev use case.
> 
> You need to convince me that we're not just doing this for the sake of
> re-using a migration interface.  We do need vendor specific drivers to
> support migration, but implementing those vendor specific drivers in
> the kernel just because we have that interface is the wrong answer.  If
> we can implement that device specific migration support in QEMU and
> limit the attack surface from the hypervisor or guest into the host
> kernel, that's a better answer.  As I've noted above, I'm afraid all of
> these attempts to parcel out vfio-pci are only going to serve to
> proliferate vendor modules that have limited community review, expand
> the attack surface, and potentially harm the vfio ecosystem overall
> through bad actors and reduced autonomy.  Thanks,
>
The requirement to access the PF mentioned above is one of the reasons for
us to implement the emulation in the kernel.
Another reason is that we don't want to duplicate a lot of kernel logic in
QEMU, as was done in Alex Graf's "vfio-i40e"; QEMU would then have to be
updated along with every kernel driver change. The effort of maintenance
and version matching is a big burden for vendors.
But you are right, there is less review on the virtualization side of code
under vendor specific directories. That is also the impulse for us to
propose common helper APIs for them to call, not only for convenience and
less duplication, but also to get the code fully reviewed.

Would you mind giving us some suggestions on where to go from here?

Thanks
Yan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first
  2020-05-18  2:53 ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Yan Zhao
  2020-05-18  8:49   ` kbuild test robot
  2020-05-18  8:49   ` [RFC PATCH] i40e/vf_migration: i40e_vf_release() can be static kbuild test robot
@ 2020-06-10  8:59   ` Xiang Zheng
  2020-06-11  0:23     ` Yan Zhao
  2 siblings, 1 reply; 42+ messages in thread
From: Xiang Zheng @ 2020-06-10  8:59 UTC (permalink / raw)
  To: Yan Zhao, kvm, linux-kernel
  Cc: alex.williamson, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan, Wang Haibin

Hi Yan,

few nits below...

On 2020/5/18 10:53, Yan Zhao wrote:
> This driver intercepts all device operations as long as it's probed
> successfully by vfio-pci driver.
> 
> It will process regions and irqs of its interest and then forward
> operations to default handlers exported from vfio pci if it wishes to.
> 
> In this patch, this driver does nothing but pass through VFs to guest
> by calling to exported handlers from driver vfio-pci.
> 
> Cc: Shaopeng He <shaopeng.he@intel.com>
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/net/ethernet/intel/Kconfig            |  10 ++
>  drivers/net/ethernet/intel/i40e/Makefile      |   2 +
>  .../ethernet/intel/i40e/i40e_vf_migration.c   | 165 ++++++++++++++++++
>  .../ethernet/intel/i40e/i40e_vf_migration.h   |  59 +++++++
>  4 files changed, 236 insertions(+)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> 
> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> index ad34e4335df2..31780d9a59f1 100644
> --- a/drivers/net/ethernet/intel/Kconfig
> +++ b/drivers/net/ethernet/intel/Kconfig
> @@ -264,6 +264,16 @@ config I40E_DCB
>  
>  	  If unsure, say N.
>  
> +config I40E_VF_MIGRATION
> +	tristate "XL710 Family VF live migration support -- loadable modules only"
> +	depends on I40E && VFIO_PCI && m
> +	help
> +	  Say m if you want to enable live migration of
> +	  Virtual Functions of Intel(R) Ethernet Controller XL710
> +	  Family of devices. It must be a module.
> +	  This module serves as vendor module of module vfio_pci.
> +	  VFs bind to module vfio_pci directly.
> +
>  # this is here to allow seamless migration from I40EVF --> IAVF name
>  # so that CONFIG_IAVF symbol will always mirror the state of CONFIG_I40EVF
>  config IAVF
> diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
> index 2f21b3e89fd0..b80c224c2602 100644
> --- a/drivers/net/ethernet/intel/i40e/Makefile
> +++ b/drivers/net/ethernet/intel/i40e/Makefile
> @@ -27,3 +27,5 @@ i40e-objs := i40e_main.o \
>  	i40e_xsk.o
>  
>  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> +
> +obj-$(CONFIG_I40E_VF_MIGRATION) += i40e_vf_migration.o
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> new file mode 100644
> index 000000000000..96026dcf5c9d
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> @@ -0,0 +1,165 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2013 - 2019 Intel Corporation. */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/vfio.h>
> +#include <linux/pci.h>
> +#include <linux/eventfd.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/sysfs.h>
> +#include <linux/file.h>
> +#include <linux/pci.h>
> +
> +#include "i40e.h"
> +#include "i40e_vf_migration.h"
> +
> +#define VERSION_STRING  "0.1"
> +#define DRIVER_AUTHOR   "Intel Corporation"
> +
> +static int i40e_vf_open(void *device_data)
> +{
> +	struct i40e_vf_migration *i40e_vf_dev =
> +		vfio_pci_vendor_data(device_data);
> +	int ret;
> +	struct vfio_device_migration_info *mig_ctl = NULL;
> +

"mig_ctl" is not used in this function. Shouldn't this declaration be
put into the next patch?

> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&i40e_vf_dev->reflock);
> +	if (!i40e_vf_dev->refcnt) {
> +		vfio_pci_set_vendor_regions(device_data, 0);
> +		vfio_pci_set_vendor_irqs(device_data, 0);
> +	}
> +
> +	ret = vfio_pci_open(device_data);
> +	if (ret)
> +		goto error;
> +
> +	i40e_vf_dev->refcnt++;
> +	mutex_unlock(&i40e_vf_dev->reflock);
> +	return 0;
> +error:
> +	if (!i40e_vf_dev->refcnt) {
> +		vfio_pci_set_vendor_regions(device_data, 0);
> +		vfio_pci_set_vendor_irqs(device_data, 0);
> +	}
> +	module_put(THIS_MODULE);
> +	mutex_unlock(&i40e_vf_dev->reflock);
> +	return ret;
> +}
> +
> +void i40e_vf_release(void *device_data)
> +{
> +	struct i40e_vf_migration *i40e_vf_dev =
> +		vfio_pci_vendor_data(device_data);
> +
> +	mutex_lock(&i40e_vf_dev->reflock);
> +	if (!--i40e_vf_dev->refcnt) {
> +		vfio_pci_set_vendor_regions(device_data, 0);
> +		vfio_pci_set_vendor_irqs(device_data, 0);
> +	}
> +	vfio_pci_release(device_data);
> +	mutex_unlock(&i40e_vf_dev->reflock);
> +	module_put(THIS_MODULE);
> +}
> +
> +static long i40e_vf_ioctl(void *device_data,
> +			  unsigned int cmd, unsigned long arg)
> +{
> +	return vfio_pci_ioctl(device_data, cmd, arg);
> +}
> +
> +static ssize_t i40e_vf_read(void *device_data, char __user *buf,
> +			    size_t count, loff_t *ppos)
> +{
> +	return vfio_pci_read(device_data, buf, count, ppos);
> +}
> +
> +static ssize_t i40e_vf_write(void *device_data, const char __user *buf,
> +			     size_t count, loff_t *ppos)
> +{
> +	return vfio_pci_write(device_data, buf, count, ppos);
> +}
> +
> +static int i40e_vf_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	return vfio_pci_mmap(device_data, vma);
> +}
> +
> +static void i40e_vf_request(void *device_data, unsigned int count)
> +{
> +	vfio_pci_request(device_data, count);
> +}
> +
> +static struct vfio_device_ops i40e_vf_device_ops_node = {
> +	.name		= "i40e_vf",
> +	.open		= i40e_vf_open,
> +	.release	= i40e_vf_release,
> +	.ioctl		= i40e_vf_ioctl,
> +	.read		= i40e_vf_read,
> +	.write		= i40e_vf_write,
> +	.mmap		= i40e_vf_mmap,
> +	.request	= i40e_vf_request,
> +};
> +
> +void *i40e_vf_probe(struct pci_dev *pdev)
> +{
> +	struct i40e_vf_migration *i40e_vf_dev = NULL;
> +	struct pci_dev *pf_dev, *vf_dev;
> +	struct i40e_pf *pf;
> +	struct i40e_vf *vf;
> +	unsigned int vf_devfn, devfn;
> +	int vf_id = -1;
> +	int i;
> +
> +	pf_dev = pdev->physfn;
> +	pf = pci_get_drvdata(pf_dev);
> +	vf_dev = pdev;
> +	vf_devfn = vf_dev->devfn;
> +
> +	for (i = 0; i < pci_num_vf(pf_dev); i++) {
> +		devfn = (pf_dev->devfn + pf_dev->sriov->offset +
> +			 pf_dev->sriov->stride * i) & 0xff;
> +		if (devfn == vf_devfn) {
> +			vf_id = i;
> +			break;
> +		}
> +	}
> +
> +	if (vf_id == -1)
> +		return ERR_PTR(-EINVAL);
> +
> +	i40e_vf_dev = kzalloc(sizeof(*i40e_vf_dev), GFP_KERNEL);
> +
> +	if (!i40e_vf_dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	i40e_vf_dev->vf_id = vf_id;
> +	i40e_vf_dev->vf_vendor = pdev->vendor;
> +	i40e_vf_dev->vf_device = pdev->device;
> +	i40e_vf_dev->pf_dev = pf_dev;
> +	i40e_vf_dev->vf_dev = vf_dev;
> +	mutex_init(&i40e_vf_dev->reflock);
> +
> +	vf = &pf->vf[vf_id];
> +

"vf" is also not used in this function...

> +	return i40e_vf_dev;
> +}
> +
> +static void i40e_vf_remove(void *vendor_data)
> +{
> +	kfree(vendor_data);
> +}
> +
> +#define i40e_vf_device_ops (&i40e_vf_device_ops_node)
> +module_vfio_pci_register_vendor_handler("I40E VF", i40e_vf_probe,
> +					i40e_vf_remove, i40e_vf_device_ops);
> +
> +MODULE_ALIAS("vfio-pci:8086-154c");
> +MODULE_LICENSE("GPL v2");
> +MODULE_INFO(supported, "Vendor driver of vfio pci to support VF live migration");
> +MODULE_VERSION(VERSION_STRING);
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> new file mode 100644
> index 000000000000..696d40601ec3
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> @@ -0,0 +1,59 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2013 - 2019 Intel Corporation. */
> +
> +#ifndef I40E_MIG_H
> +#define I40E_MIG_H
> +
> +#include <linux/pci.h>
> +#include <linux/vfio.h>
> +#include <linux/mdev.h>
> +
> +#include "i40e.h"
> +#include "i40e_txrx.h"
> +
> +/* helper macros copied from vfio-pci */
> +#define VFIO_PCI_OFFSET_SHIFT   40
> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> +
> +/* Single Root I/O Virtualization */
> +struct pci_sriov {
> +	int		pos;		/* Capability position */
> +	int		nres;		/* Number of resources */
> +	u32		cap;		/* SR-IOV Capabilities */
> +	u16		ctrl;		/* SR-IOV Control */
> +	u16		total_VFs;	/* Total VFs associated with the PF */
> +	u16		initial_VFs;	/* Initial VFs associated with the PF */
> +	u16		num_VFs;	/* Number of VFs available */
> +	u16		offset;		/* First VF Routing ID offset */
> +	u16		stride;		/* Following VF stride */
> +	u16		vf_device;	/* VF device ID */
> +	u32		pgsz;		/* Page size for BAR alignment */
> +	u8		link;		/* Function Dependency Link */
> +	u8		max_VF_buses;	/* Max buses consumed by VFs */
> +	u16		driver_max_VFs;	/* Max num VFs driver supports */
> +	struct pci_dev	*dev;		/* Lowest numbered PF */
> +	struct pci_dev	*self;		/* This PF */
> +	u32		cfg_size;	/* VF config space size */
> +	u32		class;		/* VF device */
> +	u8		hdr_type;	/* VF header type */
> +	u16		subsystem_vendor; /* VF subsystem vendor */
> +	u16		subsystem_device; /* VF subsystem device */
> +	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> +	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> +};
> +

Can "struct pci_sriov" be extracted for common use? This should not be exclusive
for "i40e_vf migration support".

> +struct i40e_vf_migration {
> +	__u32				vf_vendor;
> +	__u32				vf_device;
> +	__u32				handle;
> +	struct pci_dev			*pf_dev;
> +	struct pci_dev			*vf_dev;
> +	int				vf_id;
> +	int				refcnt;
> +	struct				mutex reflock; /*mutex protect refcnt */
                                                        ^                    ^

stray ' '

> +};
> +
> +#endif /* I40E_MIG_H */
> +
> 

-- 
Thanks,
Xiang


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first
  2020-06-10  8:59   ` [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first Xiang Zheng
@ 2020-06-11  0:23     ` Yan Zhao
  2020-06-11  2:27       ` Xiang Zheng
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2020-06-11  0:23 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: kvm, linux-kernel, alex.williamson, cohuck, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan,
	Wang Haibin

On Wed, Jun 10, 2020 at 04:59:43PM +0800, Xiang Zheng wrote:
> Hi Yan,
> 
> few nits below...
> 
> On 2020/5/18 10:53, Yan Zhao wrote:
> > This driver intercepts all device operations as long as it's probed
> > successfully by vfio-pci driver.
> > 
> > It will process regions and irqs of its interest and then forward
> > operations to default handlers exported from vfio pci if it wishes to.
> > 
> > In this patch, this driver does nothing but pass through VFs to guest
> > by calling to exported handlers from driver vfio-pci.
> > 
> > Cc: Shaopeng He <shaopeng.he@intel.com>
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  drivers/net/ethernet/intel/Kconfig            |  10 ++
> >  drivers/net/ethernet/intel/i40e/Makefile      |   2 +
> >  .../ethernet/intel/i40e/i40e_vf_migration.c   | 165 ++++++++++++++++++
> >  .../ethernet/intel/i40e/i40e_vf_migration.h   |  59 +++++++
> >  4 files changed, 236 insertions(+)
> >  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> >  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> > 
> > diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> > index ad34e4335df2..31780d9a59f1 100644
> > --- a/drivers/net/ethernet/intel/Kconfig
> > +++ b/drivers/net/ethernet/intel/Kconfig
> > @@ -264,6 +264,16 @@ config I40E_DCB
> >  
> >  	  If unsure, say N.
> >  
> > +config I40E_VF_MIGRATION
> > +	tristate "XL710 Family VF live migration support -- loadable modules only"
> > +	depends on I40E && VFIO_PCI && m
> > +	help
> > +	  Say m if you want to enable live migration of
> > +	  Virtual Functions of Intel(R) Ethernet Controller XL710
> > +	  Family of devices. It must be a module.
> > +	  This module serves as vendor module of module vfio_pci.
> > +	  VFs bind to module vfio_pci directly.
> > +
> >  # this is here to allow seamless migration from I40EVF --> IAVF name
> >  # so that CONFIG_IAVF symbol will always mirror the state of CONFIG_I40EVF
> >  config IAVF
> > diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
> > index 2f21b3e89fd0..b80c224c2602 100644
> > --- a/drivers/net/ethernet/intel/i40e/Makefile
> > +++ b/drivers/net/ethernet/intel/i40e/Makefile
> > @@ -27,3 +27,5 @@ i40e-objs := i40e_main.o \
> >  	i40e_xsk.o
> >  
> >  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> > +
> > +obj-$(CONFIG_I40E_VF_MIGRATION) += i40e_vf_migration.o
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> > new file mode 100644
> > index 000000000000..96026dcf5c9d
> > --- /dev/null
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> > @@ -0,0 +1,165 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2013 - 2019 Intel Corporation. */
> > +
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/vfio.h>
> > +#include <linux/pci.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/init.h>
> > +#include <linux/kernel.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/file.h>
> > +#include <linux/pci.h>
> > +
> > +#include "i40e.h"
> > +#include "i40e_vf_migration.h"
> > +
> > +#define VERSION_STRING  "0.1"
> > +#define DRIVER_AUTHOR   "Intel Corporation"
> > +
> > +static int i40e_vf_open(void *device_data)
> > +{
> > +	struct i40e_vf_migration *i40e_vf_dev =
> > +		vfio_pci_vendor_data(device_data);
> > +	int ret;
> > +	struct vfio_device_migration_info *mig_ctl = NULL;
> > +
> 
> "mig_ctl" is not used in this function. Shouldn't this declaration be
> put into the next patch?
>
right. thanks!

> > +	if (!try_module_get(THIS_MODULE))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&i40e_vf_dev->reflock);
> > +	if (!i40e_vf_dev->refcnt) {
> > +		vfio_pci_set_vendor_regions(device_data, 0);
> > +		vfio_pci_set_vendor_irqs(device_data, 0);
> > +	}
> > +
> > +	ret = vfio_pci_open(device_data);
> > +	if (ret)
> > +		goto error;
> > +
> > +	i40e_vf_dev->refcnt++;
> > +	mutex_unlock(&i40e_vf_dev->reflock);
> > +	return 0;
> > +error:
> > +	if (!i40e_vf_dev->refcnt) {
> > +		vfio_pci_set_vendor_regions(device_data, 0);
> > +		vfio_pci_set_vendor_irqs(device_data, 0);
> > +	}
> > +	module_put(THIS_MODULE);
> > +	mutex_unlock(&i40e_vf_dev->reflock);
> > +	return ret;
> > +}
> > +
> > +void i40e_vf_release(void *device_data)
> > +{
> > +	struct i40e_vf_migration *i40e_vf_dev =
> > +		vfio_pci_vendor_data(device_data);
> > +
> > +	mutex_lock(&i40e_vf_dev->reflock);
> > +	if (!--i40e_vf_dev->refcnt) {
> > +		vfio_pci_set_vendor_regions(device_data, 0);
> > +		vfio_pci_set_vendor_irqs(device_data, 0);
> > +	}
> > +	vfio_pci_release(device_data);
> > +	mutex_unlock(&i40e_vf_dev->reflock);
> > +	module_put(THIS_MODULE);
> > +}
> > +
> > +static long i40e_vf_ioctl(void *device_data,
> > +			  unsigned int cmd, unsigned long arg)
> > +{
> > +	return vfio_pci_ioctl(device_data, cmd, arg);
> > +}
> > +
> > +static ssize_t i40e_vf_read(void *device_data, char __user *buf,
> > +			    size_t count, loff_t *ppos)
> > +{
> > +	return vfio_pci_read(device_data, buf, count, ppos);
> > +}
> > +
> > +static ssize_t i40e_vf_write(void *device_data, const char __user *buf,
> > +			     size_t count, loff_t *ppos)
> > +{
> > +	return vfio_pci_write(device_data, buf, count, ppos);
> > +}
> > +
> > +static int i40e_vf_mmap(void *device_data, struct vm_area_struct *vma)
> > +{
> > +	return vfio_pci_mmap(device_data, vma);
> > +}
> > +
> > +static void i40e_vf_request(void *device_data, unsigned int count)
> > +{
> > +	vfio_pci_request(device_data, count);
> > +}
> > +
> > +static struct vfio_device_ops i40e_vf_device_ops_node = {
> > +	.name		= "i40e_vf",
> > +	.open		= i40e_vf_open,
> > +	.release	= i40e_vf_release,
> > +	.ioctl		= i40e_vf_ioctl,
> > +	.read		= i40e_vf_read,
> > +	.write		= i40e_vf_write,
> > +	.mmap		= i40e_vf_mmap,
> > +	.request	= i40e_vf_request,
> > +};
> > +
> > +void *i40e_vf_probe(struct pci_dev *pdev)
> > +{
> > +	struct i40e_vf_migration *i40e_vf_dev = NULL;
> > +	struct pci_dev *pf_dev, *vf_dev;
> > +	struct i40e_pf *pf;
> > +	struct i40e_vf *vf;
> > +	unsigned int vf_devfn, devfn;
> > +	int vf_id = -1;
> > +	int i;
> > +
> > +	pf_dev = pdev->physfn;
> > +	pf = pci_get_drvdata(pf_dev);
> > +	vf_dev = pdev;
> > +	vf_devfn = vf_dev->devfn;
> > +
> > +	for (i = 0; i < pci_num_vf(pf_dev); i++) {
> > +		devfn = (pf_dev->devfn + pf_dev->sriov->offset +
> > +			 pf_dev->sriov->stride * i) & 0xff;
> > +		if (devfn == vf_devfn) {
> > +			vf_id = i;
> > +			break;
> > +		}
> > +	}
> > +
> > +	if (vf_id == -1)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	i40e_vf_dev = kzalloc(sizeof(*i40e_vf_dev), GFP_KERNEL);
> > +
> > +	if (!i40e_vf_dev)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	i40e_vf_dev->vf_id = vf_id;
> > +	i40e_vf_dev->vf_vendor = pdev->vendor;
> > +	i40e_vf_dev->vf_device = pdev->device;
> > +	i40e_vf_dev->pf_dev = pf_dev;
> > +	i40e_vf_dev->vf_dev = vf_dev;
> > +	mutex_init(&i40e_vf_dev->reflock);
> > +
> > +	vf = &pf->vf[vf_id];
> > +
> 
> "vf" is also not used in this function...
>
yes, thanks.

> > +	return i40e_vf_dev;
> > +}
> > +
> > +static void i40e_vf_remove(void *vendor_data)
> > +{
> > +	kfree(vendor_data);
> > +}
> > +
> > +#define i40e_vf_device_ops (&i40e_vf_device_ops_node)
> > +module_vfio_pci_register_vendor_handler("I40E VF", i40e_vf_probe,
> > +					i40e_vf_remove, i40e_vf_device_ops);
> > +
> > +MODULE_ALIAS("vfio-pci:8086-154c");
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_INFO(supported, "Vendor driver of vfio pci to support VF live migration");
> > +MODULE_VERSION(VERSION_STRING);
> > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> > new file mode 100644
> > index 000000000000..696d40601ec3
> > --- /dev/null
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> > @@ -0,0 +1,59 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright(c) 2013 - 2019 Intel Corporation. */
> > +
> > +#ifndef I40E_MIG_H
> > +#define I40E_MIG_H
> > +
> > +#include <linux/pci.h>
> > +#include <linux/vfio.h>
> > +#include <linux/mdev.h>
> > +
> > +#include "i40e.h"
> > +#include "i40e_txrx.h"
> > +
> > +/* helper macros copied from vfio-pci */
> > +#define VFIO_PCI_OFFSET_SHIFT   40
> > +#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
> > +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> > +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> > +
> > +/* Single Root I/O Virtualization */
> > +struct pci_sriov {
> > +	int		pos;		/* Capability position */
> > +	int		nres;		/* Number of resources */
> > +	u32		cap;		/* SR-IOV Capabilities */
> > +	u16		ctrl;		/* SR-IOV Control */
> > +	u16		total_VFs;	/* Total VFs associated with the PF */
> > +	u16		initial_VFs;	/* Initial VFs associated with the PF */
> > +	u16		num_VFs;	/* Number of VFs available */
> > +	u16		offset;		/* First VF Routing ID offset */
> > +	u16		stride;		/* Following VF stride */
> > +	u16		vf_device;	/* VF device ID */
> > +	u32		pgsz;		/* Page size for BAR alignment */
> > +	u8		link;		/* Function Dependency Link */
> > +	u8		max_VF_buses;	/* Max buses consumed by VFs */
> > +	u16		driver_max_VFs;	/* Max num VFs driver supports */
> > +	struct pci_dev	*dev;		/* Lowest numbered PF */
> > +	struct pci_dev	*self;		/* This PF */
> > +	u32		cfg_size;	/* VF config space size */
> > +	u32		class;		/* VF device */
> > +	u8		hdr_type;	/* VF header type */
> > +	u16		subsystem_vendor; /* VF subsystem vendor */
> > +	u16		subsystem_device; /* VF subsystem device */
> > +	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> > +	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> > +};
> > +
> 
> Can "struct pci_sriov" be extracted for common use? This should not be exclusive
> for "i40e_vf migration support".
>
The definition of this structure is actually in drivers/pci/pci.h.
Maybe removing the copy here and using the include below is better?
#include "../../../../pci/pci.h"

> > +struct i40e_vf_migration {
> > +	__u32				vf_vendor;
> > +	__u32				vf_device;
> > +	__u32				handle;
> > +	struct pci_dev			*pf_dev;
> > +	struct pci_dev			*vf_dev;
> > +	int				vf_id;
> > +	int				refcnt;
> > +	struct				mutex reflock; /*mutex protect refcnt */
>                                                         ^                    ^
> 
> stray ' '
> 
got it!

thanks for review.

Yan
> > +};
> > +
> > +#endif /* I40E_MIG_H */
> > +
> > 
> 
> -- 
> Thanks,
> Xiang
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first
  2020-06-11  0:23     ` Yan Zhao
@ 2020-06-11  2:27       ` Xiang Zheng
  2020-06-11 23:10         ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Xiang Zheng @ 2020-06-11  2:27 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, alex.williamson, cohuck, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan,
	Wang Haibin



On 2020/6/11 8:23, Yan Zhao wrote:
> On Wed, Jun 10, 2020 at 04:59:43PM +0800, Xiang Zheng wrote:
>> Hi Yan,
>>
>> few nits below...
>>
>> On 2020/5/18 10:53, Yan Zhao wrote:
>>> This driver intercepts all device operations as long as it's probed
>>> successfully by vfio-pci driver.
>>>
>>> It will process regions and irqs of its interest and then forward
>>> operations to default handlers exported from vfio pci if it wishes to.
>>>
>>> In this patch, this driver does nothing but pass through VFs to guest
>>> by calling to exported handlers from driver vfio-pci.
>>>
>>> Cc: Shaopeng He <shaopeng.he@intel.com>
>>>
>>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>>> ---
>>>  drivers/net/ethernet/intel/Kconfig            |  10 ++
>>>  drivers/net/ethernet/intel/i40e/Makefile      |   2 +
>>>  .../ethernet/intel/i40e/i40e_vf_migration.c   | 165 ++++++++++++++++++
>>>  .../ethernet/intel/i40e/i40e_vf_migration.h   |  59 +++++++
>>>  4 files changed, 236 insertions(+)
>>>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>
>>> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
>>> index ad34e4335df2..31780d9a59f1 100644
>>> --- a/drivers/net/ethernet/intel/Kconfig
>>> +++ b/drivers/net/ethernet/intel/Kconfig
>>> @@ -264,6 +264,16 @@ config I40E_DCB
>>>  
>>>  	  If unsure, say N.
>>>  

[...]

>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>> new file mode 100644
>>> index 000000000000..696d40601ec3
>>> --- /dev/null
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>> @@ -0,0 +1,59 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/* Copyright(c) 2013 - 2019 Intel Corporation. */
>>> +
>>> +#ifndef I40E_MIG_H
>>> +#define I40E_MIG_H
>>> +
>>> +#include <linux/pci.h>
>>> +#include <linux/vfio.h>
>>> +#include <linux/mdev.h>
>>> +
>>> +#include "i40e.h"
>>> +#include "i40e_txrx.h"
>>> +
>>> +/* helper macros copied from vfio-pci */
>>> +#define VFIO_PCI_OFFSET_SHIFT   40
>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
>>> +
>>> +/* Single Root I/O Virtualization */
>>> +struct pci_sriov {
>>> +	int		pos;		/* Capability position */
>>> +	int		nres;		/* Number of resources */
>>> +	u32		cap;		/* SR-IOV Capabilities */
>>> +	u16		ctrl;		/* SR-IOV Control */
>>> +	u16		total_VFs;	/* Total VFs associated with the PF */
>>> +	u16		initial_VFs;	/* Initial VFs associated with the PF */
>>> +	u16		num_VFs;	/* Number of VFs available */
>>> +	u16		offset;		/* First VF Routing ID offset */
>>> +	u16		stride;		/* Following VF stride */
>>> +	u16		vf_device;	/* VF device ID */
>>> +	u32		pgsz;		/* Page size for BAR alignment */
>>> +	u8		link;		/* Function Dependency Link */
>>> +	u8		max_VF_buses;	/* Max buses consumed by VFs */
>>> +	u16		driver_max_VFs;	/* Max num VFs driver supports */
>>> +	struct pci_dev	*dev;		/* Lowest numbered PF */
>>> +	struct pci_dev	*self;		/* This PF */
>>> +	u32		cfg_size;	/* VF config space size */
>>> +	u32		class;		/* VF device */
>>> +	u8		hdr_type;	/* VF header type */
>>> +	u16		subsystem_vendor; /* VF subsystem vendor */
>>> +	u16		subsystem_device; /* VF subsystem device */                                                                                   
>>> +	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
>>> +	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
>>> +};
>>> +
>>
>> Can "struct pci_sriov" be extracted for common use? This should not be exclusive
>> for "i40e_vf migration support".
>>
> the definition of this structure is actually in driver/pci/pci.h.
> maybe removing the copy here and use below include is better?
> #include "../../../../pci/pci.h"
> 

How about moving the definition from drivers/pci/pci.h into include/linux/pci.h? Then
we can just include <linux/pci.h> and remove the copy here.

-- 
Thanks,
Xiang


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs
  2020-06-05  2:15     ` Yan Zhao
@ 2020-06-11 12:31       ` David Edmondson
  2020-06-11 23:09         ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: David Edmondson @ 2020-06-11 12:31 UTC (permalink / raw)
  To: Yan Zhao, Cornelia Huck
  Cc: kvm, linux-kernel, alex.williamson, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Thursday, 2020-06-04 at 22:15:42 -04, Yan Zhao wrote:

> On Thu, Jun 04, 2020 at 05:25:15PM +0200, Cornelia Huck wrote:
>> On Sun, 17 May 2020 22:49:44 -0400
>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>> 
>> > This allows a simpler VFIO_DEVICE_GET_INFO ioctl in vendor driver
>> > 
>> > Cc: Kevin Tian <kevin.tian@intel.com>
>> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>> > ---
>> >  drivers/vfio/pci/vfio_pci.c         | 23 +++++++++++++++++++++--
>> >  drivers/vfio/pci/vfio_pci_private.h |  2 ++
>> >  include/linux/vfio.h                |  3 +++
>> >  3 files changed, 26 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> > index 290b7ab55ecf..30137c1c5308 100644
>> > --- a/drivers/vfio/pci/vfio_pci.c
>> > +++ b/drivers/vfio/pci/vfio_pci.c
>> > @@ -105,6 +105,24 @@ void *vfio_pci_vendor_data(void *device_data)
>> >  }
>> >  EXPORT_SYMBOL_GPL(vfio_pci_vendor_data);
>> >  
>> > +int vfio_pci_set_vendor_regions(void *device_data, int num_vendor_regions)
>> > +{
>> > +	struct vfio_pci_device *vdev = device_data;
>> > +
>> > +	vdev->num_vendor_regions = num_vendor_regions;
>> 
>> Do we need any kind of sanity check here, in case this is called with a
>> bogus value?
>>
> you are right. it at least needs to be >=0.
> maybe type of "unsigned int" is more appropriate for num_vendor_regions.
> we don't need to check its max value as QEMU would check it.

That seems like a bad precedent - the caller may not be QEMU.

dme.
-- 
I'm not the reason you're looking for redemption.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 04/10] vfio/pci: let vfio_pci know number of vendor regions and vendor irqs
  2020-06-11 12:31       ` David Edmondson
@ 2020-06-11 23:09         ` Yan Zhao
  0 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-06-11 23:09 UTC (permalink / raw)
  To: David Edmondson
  Cc: Cornelia Huck, kvm, linux-kernel, alex.williamson, zhenyuw,
	zhi.a.wang, kevin.tian, shaopeng.he, yi.l.liu, xin.zeng,
	hang.yuan

On Thu, Jun 11, 2020 at 01:31:05PM +0100, David Edmondson wrote:
> On Thursday, 2020-06-04 at 22:15:42 -04, Yan Zhao wrote:
> 
> > On Thu, Jun 04, 2020 at 05:25:15PM +0200, Cornelia Huck wrote:
> >> On Sun, 17 May 2020 22:49:44 -0400
> >> Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> 
> >> > This allows a simpler VFIO_DEVICE_GET_INFO ioctl in vendor driver
> >> > 
> >> > Cc: Kevin Tian <kevin.tian@intel.com>
> >> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >> > ---
> >> >  drivers/vfio/pci/vfio_pci.c         | 23 +++++++++++++++++++++--
> >> >  drivers/vfio/pci/vfio_pci_private.h |  2 ++
> >> >  include/linux/vfio.h                |  3 +++
> >> >  3 files changed, 26 insertions(+), 2 deletions(-)
> >> > 
> >> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> > index 290b7ab55ecf..30137c1c5308 100644
> >> > --- a/drivers/vfio/pci/vfio_pci.c
> >> > +++ b/drivers/vfio/pci/vfio_pci.c
> >> > @@ -105,6 +105,24 @@ void *vfio_pci_vendor_data(void *device_data)
> >> >  }
> >> >  EXPORT_SYMBOL_GPL(vfio_pci_vendor_data);
> >> >  
> >> > +int vfio_pci_set_vendor_regions(void *device_data, int num_vendor_regions)
> >> > +{
> >> > +	struct vfio_pci_device *vdev = device_data;
> >> > +
> >> > +	vdev->num_vendor_regions = num_vendor_regions;
> >> 
> >> Do we need any kind of sanity check here, in case this is called with a
> >> bogus value?
> >>
> > you are right. it at least needs to be >=0.
> > maybe type of "unsigned int" is more appropriate for num_vendor_regions.
> > we don't need to check its max value as QEMU would check it.
> 
> That seems like a bad precedent - the caller may not be QEMU.
>
but the caller has to query that through vfio_pci_ioctl(), and there

info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions + vdev->num_vendor_regions;

where info.num_regions is of type __u32.

Thanks
Yan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 08/10] i40e/vf_migration: VF live migration - pass-through VF first
  2020-06-11  2:27       ` Xiang Zheng
@ 2020-06-11 23:10         ` Yan Zhao
  0 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-06-11 23:10 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: kvm, linux-kernel, alex.williamson, cohuck, zhenyuw, zhi.a.wang,
	kevin.tian, shaopeng.he, yi.l.liu, xin.zeng, hang.yuan,
	Wang Haibin

On Thu, Jun 11, 2020 at 10:27:34AM +0800, Xiang Zheng wrote:
> 
> 
> On 2020/6/11 8:23, Yan Zhao wrote:
> > On Wed, Jun 10, 2020 at 04:59:43PM +0800, Xiang Zheng wrote:
> >> Hi Yan,
> >>
> >> few nits below...
> >>
> >> On 2020/5/18 10:53, Yan Zhao wrote:
> >>> This driver intercepts all device operations as long as it's probed
> >>> successfully by vfio-pci driver.
> >>>
> >>> It will process regions and irqs of its interest and then forward
> >>> operations to default handlers exported from vfio pci if it wishes to.
> >>>
> >>> In this patch, this driver does nothing but pass through VFs to guest
> >>> by calling to exported handlers from driver vfio-pci.
> >>>
> >>> Cc: Shaopeng He <shaopeng.he@intel.com>
> >>>
> >>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >>> ---
> >>>  drivers/net/ethernet/intel/Kconfig            |  10 ++
> >>>  drivers/net/ethernet/intel/i40e/Makefile      |   2 +
> >>>  .../ethernet/intel/i40e/i40e_vf_migration.c   | 165 ++++++++++++++++++
> >>>  .../ethernet/intel/i40e/i40e_vf_migration.h   |  59 +++++++
> >>>  4 files changed, 236 insertions(+)
> >>>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> >>>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>>
> >>> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> >>> index ad34e4335df2..31780d9a59f1 100644
> >>> --- a/drivers/net/ethernet/intel/Kconfig
> >>> +++ b/drivers/net/ethernet/intel/Kconfig
> >>> @@ -264,6 +264,16 @@ config I40E_DCB
> >>>  
> >>>  	  If unsure, say N.
> >>>  
> 
> [...]
> 
> >>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>> new file mode 100644
> >>> index 000000000000..696d40601ec3
> >>> --- /dev/null
> >>> +++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>> @@ -0,0 +1,59 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>> +/* Copyright(c) 2013 - 2019 Intel Corporation. */
> >>> +
> >>> +#ifndef I40E_MIG_H
> >>> +#define I40E_MIG_H
> >>> +
> >>> +#include <linux/pci.h>
> >>> +#include <linux/vfio.h>
> >>> +#include <linux/mdev.h>
> >>> +
> >>> +#include "i40e.h"
> >>> +#include "i40e_txrx.h"
> >>> +
> >>> +/* helper macros copied from vfio-pci */
> >>> +#define VFIO_PCI_OFFSET_SHIFT   40
> >>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
> >>> +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >>> +
> >>> +/* Single Root I/O Virtualization */
> >>> +struct pci_sriov {
> >>> +	int		pos;		/* Capability position */
> >>> +	int		nres;		/* Number of resources */
> >>> +	u32		cap;		/* SR-IOV Capabilities */
> >>> +	u16		ctrl;		/* SR-IOV Control */
> >>> +	u16		total_VFs;	/* Total VFs associated with the PF */
> >>> +	u16		initial_VFs;	/* Initial VFs associated with the PF */
> >>> +	u16		num_VFs;	/* Number of VFs available */
> >>> +	u16		offset;		/* First VF Routing ID offset */
> >>> +	u16		stride;		/* Following VF stride */
> >>> +	u16		vf_device;	/* VF device ID */
> >>> +	u32		pgsz;		/* Page size for BAR alignment */
> >>> +	u8		link;		/* Function Dependency Link */
> >>> +	u8		max_VF_buses;	/* Max buses consumed by VFs */
> >>> +	u16		driver_max_VFs;	/* Max num VFs driver supports */
> >>> +	struct pci_dev	*dev;		/* Lowest numbered PF */
> >>> +	struct pci_dev	*self;		/* This PF */
> >>> +	u32		cfg_size;	/* VF config space size */
> >>> +	u32		class;		/* VF device */
> >>> +	u8		hdr_type;	/* VF header type */
> >>> +	u16		subsystem_vendor; /* VF subsystem vendor */
> >>> +	u16		subsystem_device; /* VF subsystem device */                                                                                   
> >>> +	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> >>> +	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> >>> +};
> >>> +
> >>
> >> Can "struct pci_sriov" be extracted for common use? This should not be exclusive
> >> for "i40e_vf migration support".
> >>
> > the definition of this structure is actually in driver/pci/pci.h.
> > maybe removing the copy here and use below include is better?
> > #include "../../../../pci/pci.h"
> > 
> 
> How about moving the definition from driver/pci/pci.h into include/linux/pci.h? So
> we can just include "linux/pci.h" and removing the copy here.
>
I prefer leaving it in drivers/pci/pci.h for now.

Thanks
Yan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-10  5:23                         ` Yan Zhao
@ 2020-06-19 22:55                           ` Alex Williamson
  2020-06-22  3:34                             ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2020-06-19 22:55 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Wed, 10 Jun 2020 01:23:14 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, Jun 05, 2020 at 10:13:01AM -0600, Alex Williamson wrote:
> > On Thu, 4 Jun 2020 22:02:31 -0400
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:  
> > > > On Wed, 3 Jun 2020 22:42:28 -0400
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:    
> > > > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >       
> > > > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:      
> > > > > > > > I'm not at all happy with this.  Why do we need to hide the migration
> > > > > > > > sparse mmap from the user until migration time?  What if instead we
> > > > > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > > > > > where the existing capability is the normal runtime sparse setup and
> > > > > > > > the user is required to use this new one prior to enabled device_state
> > > > > > > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > > > > > > the region and refuse to change device_state if there are outstanding
> > > > > > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > > > > backwards compatible to the extent that a vendor driver requiring this
> > > > > > > > will automatically fail migration.
> > > > > > > >         
> > > > > > > right. looks we need to use this approach to solve the problem.
> > > > > > > thanks for your guide.
> > > > > > > so I'll abandon the current remap irq way for dirty tracking during live
> > > > > > > migration.
> > > > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > > > then, what do you think about patches 1-5?      
> > > > > > 
> > > > > > In broad strokes, I don't think we've found the right solution yet.  I
> > > > > > really question whether it's supportable to parcel out vfio-pci like
> > > > > > this and I don't know how I'd support unraveling whether we have a bug
> > > > > > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > > > > > of vfio-pci.
> > > > > >
> > > > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > > > We have two patches creating device specific interrupts and a BAR
> > > > > > remapping scheme that we've decided we don't need.  That brings us to
> > > > > > the actual i40e vendor driver, where the first patch is simply making
> > > > > > the vendor driver work like vfio-pci already does, the second patch is
> > > > > > handling the migration region, and the third patch is implementing the
> > > > > > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > > > > > actually find the small bit of code that's required to support
> > > > > > migration outside of just dealing with the protocol we've defined to
> > > > > > expose this from the kernel.  So why are we trying to do this in the
> > > > > > kernel?  We have quirk support in QEMU, we can easily flip
> > > > > > MemoryRegions on and off, etc.  What access to the device outside of
> > > > > > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > > > > > implement this migration support for i40e VFs?  Is this just an
> > > > > > exercise in making use of the migration interface?  Thanks,
> > > > > >       
> > > > > hi Alex
> > > > > 
> > > > > There was a description of intention of this series in RFC v1
> > > > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > > > sorry, I didn't include it in starting from RFC v2.
> > > > > 
> > > > > "
> > > > > The reason why we don't choose the way of writing mdev parent driver is
> > > > > that    
> > > > 
> > > > I didn't mention an mdev approach, I'm asking what are we accomplishing
> > > > by doing this in the kernel at all versus exposing the device as normal
> > > > through vfio-pci and providing the migration support in QEMU.  Are you
> > > > actually leveraging having some sort of access to the PF in supporting
> > > > migration of the VF?  Is vfio-pci masking the device in a way that
> > > > prevents migrating the state from QEMU?
> > > >    
> > > yes, communication to PF is required. VF state is managed by PF and is
> > > queried from PF when VF is stopped.
> > > 
> > > migration support in QEMU seems only suitable to devices with dirty
> > > pages and device state available by reading/writing device MMIOs, which
> > > is not the case for most devices.  
> > 
> > Post code for such a device.
> >  
> hi Alex,
> There's an example in i40e vf. virtual channel related resources are in
> guest memory. dirty page tracking requires the info stored in those
> guest memory.
> 
> there're two ways to get the resources addresses:
> (1) always trap VF registers related. as in Alex Graf's qemu code.
> 
> starting from beginning, it tracks rw of Admin Queue Configuration registers.
> Then in the write handler vfio_i40evf_aq_mmio_mem_region_write(), guest
> commands are processed to record the guest dma addresses of the virtual
> channel related resources.
> e.g. vdev->vsi_config is read from the guest dma addr contained in
> command I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES.
> 
> 
> vfio_i40evf_initfn()
> {
>  ...
>  memory_region_init_io(&vdev->aq_mmio_mem, OBJECT(dev),
>                           &vfio_i40evf_aq_mmio_mem_region_ops,
>                           vdev, "i40evf AQ config",
>                           I40E_VFGEN_RSTAT - I40E_VF_ARQBAH1);
>  ...
> }
> 
> vfio_i40evf_aq_mmio_mem_region_write()
> {
>    ...
>     switch (addr) {
>     case I40E_VF_ARQBAH1:
>     case I40E_VF_ARQBAL1:
>     case I40E_VF_ARQH1:
>     case I40E_VF_ARQLEN1:
>     case I40E_VF_ARQT1:
>     case I40E_VF_ATQBAH1:
>     case I40E_VF_ATQBAL1:
>     case I40E_VF_ATQH1:
>     case I40E_VF_ATQT1:
>     case I40E_VF_ATQLEN1:
>         vfio_i40evf_vw32(vdev, addr, data);
>         vfio_i40e_aq_update(vdev); ==> update & process atq commands
>         break;
>     default:
>         vfio_i40evf_w32(vdev, addr, data);
>         break;
>     }
> }
> vfio_i40e_aq_update(vdev)
> 	|->vfio_i40e_atq_process_one(vdev, vfio_i40evf_vr32(vdev, I40E_VF_ATQH1)
> 		|-> hwaddr addr = vfio_i40e_get_atqba(vdev) + (index * sizeof(desc));
> 		|   pci_dma_read(pdev, addr, &desc, sizeof(desc));//read guest's command
> 		|   vfio_i40e_record_atq_cmd(vdev, pdev, &desc)
> 			
> 		
> 
> vfio_i40e_record_atq_cmd(...I40eAdminQueueDescriptor *desc) {
> 	data_addr = desc->params.external.addr_high;
> 	...
> 
> 	switch (desc->cookie_high) {
> 	...
> 	case I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES:
> 	pci_dma_read(pdev, data_addr, &vdev->vsi_config,
> 		         MIN(desc->datalen, sizeof(vdev->vsi_config)));
> 	...
> 	}
> 	...
> }
> 
> 
> (2) pass through all guest MMIO accesses and only do MMIO trap when migration
> is about to start.
> This is the way we're using in the host vfio-pci vendor driver (or mdev parent driver)
> of i40e vf device (sorry for no public code available still).
> 
> when migration is about to start, it's already too late to get the guest dma
> address for those virtual channel related resources merely by MMIO
> trapping, so we have to ask for them from PF.
> 
> 
> 
> <...>
> 
> > > > > for interfaces exported in patch 3/10-5/10, they anyway need to be
> > > > > exported for writing mdev parent drivers that pass through devices at
> > > > > normal time to avoid duplication. and yes, your worry about    
> > > > 
> > > > Where are those parent drivers?  What are their actual requirements?
> > > >    
> > > if this way of registering vendor ops to vfio-pci is not permitted,
> > > vendors have to resort to writing its mdev parent drivers for VFs. Those
> > > parent drivers need to pass through VFs at normal time, doing exactly what
> > > vfio-pci does and only doing what vendor ops does during migration.
> > > 
> > > if vfio-pci could export common code to those parent drivers, lots of
> > > duplicated code can be avoided.  
> > 
> > There are two sides to this argument though.  We could also argue that
> > mdev has already made it too easy to implement device emulation in the
> > kernel, the barrier is that such emulation is more transparent because
> > it does require a fair bit of code duplication from vfio-pci.  If we
> > make it easier to simply re-use vfio-pci for much of this, and even
> > take it a step further by allowing vendor drivers to masquerade behind
> > vfio-pci, then we're creating an environment where vendors don't need
> > to work with QEMU to get their device emulation accepted.  They can
> > write their own vendor drivers, which are now simplified and sanctioned
> > by exported functions in vfio-pci.  They can do this easily and open up
> > massive attack vectors, hiding behind vfio-pci.
> >   
> your concern is reasonable.
> 
> > I know that I was advocating avoiding user driver confusion, ie. does
> > the user bind a device to vfio-pci, i40e_vf_vfio, etc, but maybe that's
> > the barrier we need such that a user can make an informed decision
> > about what they're actually using.  If a vendor then wants to implement
> > a feature in vfio-pci, we'll need to architect an interface for it
> > rather than letting them pick and choose which pieces of vfio-pci to
> > override.
> >   
> > > > > identification of bug sources is reasonable. but if a device is binding
> > > > > to vfio-pci with a vendor module loaded, and there's a bug, they can do at
> > > > > least two ways to identify if it's a bug in vfio-pci itself.
> > > > > (1) prevent vendor modules from loading and see if the problem exists
> > > > > with pure vfio-pci.
> > > > > (2) do what's demoed in patch 8/10, i.e. do nothing but simply pass all
> > > > > operations to vfio-pci.    
> > > > 
> > > > The code split is still extremely ad-hoc, there's no API.  An mdev
> > > > driver isn't even a sub-driver of vfio-pci like you're trying to
> > > > accomplish here, there would need to be a much more defined API when
> > > > the base device isn't even a vfio_pci_device.  I don't see how this
> > > > series would directly enable an mdev use case.
> > > >     
> > > similar to Yi's series https://patchwork.kernel.org/patch/11320841/.
> > > we can parcel the vdev creation code in vfio_pci_probe() to allow calling from
> > > mdev parent probe routine. (of course, also need to parcel code to free vdev)
> > > e.g.
> > > 
> > > void *vfio_pci_alloc_vdev(struct pci_dev *pdev, const struct pci_device_id *id)
> > > {
> > > 	struct vfio_pci_device *vdev;
> > >         vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > >         if (!vdev) {
> > >                 ret = -ENOMEM;
> > >                 goto out_group_put;
> > >         }
> > > 
> > >         vdev->pdev = pdev;
> > >         vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > >         mutex_init(&vdev->igate);
> > >         spin_lock_init(&vdev->irqlock);
> > >         mutex_init(&vdev->ioeventfds_lock);
> > >         INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > 	...
> > > 	vfio_pci_probe_power_state(vdev);
> > > 
> > >         if (!disable_idle_d3) {
> > >                 vfio_pci_set_power_state(vdev, PCI_D0);
> > >                 vfio_pci_set_power_state(vdev, PCI_D3hot);
> > >         }
> > > 	return vdev;
> > > }
> > > 
> > > static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev, const struct pci_device_id *id))
> > > {
> > > 
> > >        void *vdev = vfio_pci_alloc_vdev(pdev, id);
> > > 
> > >        //save the vdev pointer 
> > > 
> > > }
> > > then all the exported interfaces from this series can also benefit the
> > > mdev use case.  
> > 
> > You need to convince me that we're not just doing this for the sake of
> > re-using a migration interface.  We do need vendor specific drivers to
> > support migration, but implementing those vendor specific drivers in
> > the kernel just because we have that interface is the wrong answer.  If
> > we can implement that device specific migration support in QEMU and
> > limit the attack surface from the hypervisor or guest into the host
> > kernel, that's a better answer.  As I've noted above, I'm afraid all of
> > these attempts to parcel out vfio-pci are only going to serve to
> > proliferate vendor modules that have limited community review, expand
> > the attack surface, and potentially harm the vfio ecosystem overall
> > through bad actors and reduced autonomy.  Thanks,
> >  
> The requirement to access PF as mentioned above is one of the reason for
> us to implement the emulation in kernel.
> Another reason is that we don't want to duplicate a lot of kernel logic in
> QEMU as what'd done in Alex Graf's "vfio-i40e". then QEMU has to be
> updated along kernel driver changing. The effort for maintenance and
> version matching is a big burden to vendors.
> But you are right, there're less review in virtualization side to code under
> vendor specific directory. That's also the pulse for us to propose
> common helper APIs for them to call, not only for convenience and
> duplication-less, but also for code with full review.
> 
> would you mind giving us some suggestions for where to go?

Not duplicating kernel code into userspace isn't a great excuse.  What
we need to do to emulate a device is not an exact mapping to what a
driver for that device needs to do.  If we need to keep the device
driver and the emulation in sync then we haven't done a good job with
the emulation.  What would it look like if we only had an additional
device specific region on the vfio device fd we could use to get the
descriptor information we need from the PF?  This would be more inline
with the quirks we provide for IGD assignment.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION
  2020-06-19 22:55                           ` Alex Williamson
@ 2020-06-22  3:34                             ` Yan Zhao
  0 siblings, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2020-06-22  3:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, cohuck, zhenyuw, zhi.a.wang, kevin.tian,
	shaopeng.he, yi.l.liu, xin.zeng, hang.yuan

On Fri, Jun 19, 2020 at 04:55:34PM -0600, Alex Williamson wrote:
> On Wed, 10 Jun 2020 01:23:14 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Jun 05, 2020 at 10:13:01AM -0600, Alex Williamson wrote:
> > > On Thu, 4 Jun 2020 22:02:31 -0400
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:  
> > > > > On Wed, 3 Jun 2020 22:42:28 -0400
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:    
> > > > > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >       
> > > > > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:      
> > > > > > > > > I'm not at all happy with this.  Why do we need to hide the migration
> > > > > > > > > sparse mmap from the user until migration time?  What if instead we
> > > > > > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > > > > > > where the existing capability is the normal runtime sparse setup and
> > > > > > > > > the user is required to use this new one prior to enabled device_state
> > > > > > > > > with _SAVING.  The vendor driver could then simply track mmap vmas to
> > > > > > > > > the region and refuse to change device_state if there are outstanding
> > > > > > > > > mmaps conflicting with the _SAVING sparse mmap layout.  No new IRQs
> > > > > > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > > > > > backwards compatible to the extent that a vendor driver requiring this
> > > > > > > > > will automatically fail migration.
> > > > > > > > >         
> > > > > > > > right. looks we need to use this approach to solve the problem.
> > > > > > > > thanks for your guide.
> > > > > > > > so I'll abandon the current remap irq way for dirty tracking during live
> > > > > > > > migration.
> > > > > > > > but anyway, it demos how to customize irq_types in vendor drivers.
> > > > > > > > then, what do you think about patches 1-5?      
> > > > > > > 
> > > > > > > In broad strokes, I don't think we've found the right solution yet.  I
> > > > > > > really question whether it's supportable to parcel out vfio-pci like
> > > > > > > this and I don't know how I'd support unraveling whether we have a bug
> > > > > > > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > > > > > > of vfio-pci.
> > > > > > >
> > > > > > > Let me also ask, why does any of this need to be in the kernel?  We
> > > > > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > > > > We have two patches creating device specific interrupts and a BAR
> > > > > > > remapping scheme that we've decided we don't need.  That brings us to
> > > > > > > the actual i40e vendor driver, where the first patch is simply making
> > > > > > > the vendor driver work like vfio-pci already does, the second patch is
> > > > > > > handling the migration region, and the third patch is implementing the
> > > > > > > BAR remapping IRQ that we decided we don't need.  It's difficult to
> > > > > > > actually find the small bit of code that's required to support
> > > > > > > migration outside of just dealing with the protocol we've defined to
> > > > > > > expose this from the kernel.  So why are we trying to do this in the
> > > > > > > kernel?  We have quirk support in QEMU, we can easily flip
> > > > > > > MemoryRegions on and off, etc.  What access to the device outside of
> > > > > > > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > > > > > > implement this migration support for i40e VFs?  Is this just an
> > > > > > > exercise in making use of the migration interface?  Thanks,
> > > > > > >       
> > > > > > hi Alex
> > > > > > 
> > > > > > There was a description of intention of this series in RFC v1
> > > > > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > > > > sorry, I didn't include it in starting from RFC v2.
> > > > > > 
> > > > > > "
> > > > > > The reason why we don't choose the way of writing mdev parent driver is
> > > > > > that    
> > > > > 
> > > > > I didn't mention an mdev approach, I'm asking what are we accomplishing
> > > > > by doing this in the kernel at all versus exposing the device as normal
> > > > > through vfio-pci and providing the migration support in QEMU.  Are you
> > > > > actually leveraging having some sort of access to the PF in supporting
> > > > > migration of the VF?  Is vfio-pci masking the device in a way that
> > > > > prevents migrating the state from QEMU?
> > > > >    
> > > > yes, communication to PF is required. VF state is managed by PF and is
> > > > queried from PF when VF is stopped.
> > > > 
> > > > migration support in QEMU seems only suitable to devices with dirty
> > > > pages and device state available by reading/writing device MMIOs, which
> > > > is not the case for most devices.  
> > > 
> > > Post code for such a device.
> > >  
> > hi Alex,
> > There's an example in i40e vf. virtual channel related resources are in
> > guest memory. dirty page tracking requires the info stored in those
> > guest memory.
> > 
> > there're two ways to get the resources addresses:
> > (1) always trap VF registers related. as in Alex Graf's qemu code.
> > 
> > starting from beginning, it tracks rw of Admin Queue Configuration registers.
> > Then in the write handler vfio_i40evf_aq_mmio_mem_region_write(), guest
> > commands are processed to record the guest dma addresses of the virtual
> > channel related resources.
> > e.g. vdev->vsi_config is read from the guest dma addr contained in
> > command I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES.
> > 
> > 
> > vfio_i40evf_initfn()
> > {
> >  ...
> >  memory_region_init_io(&vdev->aq_mmio_mem, OBJECT(dev),
> >                           &vfio_i40evf_aq_mmio_mem_region_ops,
> >                           vdev, "i40evf AQ config",
> >                           I40E_VFGEN_RSTAT - I40E_VF_ARQBAH1);
> >  ...
> > }
> > 
> > vfio_i40evf_aq_mmio_mem_region_write()
> > {
> >    ...
> >     switch (addr) {
> >     case I40E_VF_ARQBAH1:
> >     case I40E_VF_ARQBAL1:
> >     case I40E_VF_ARQH1:
> >     case I40E_VF_ARQLEN1:
> >     case I40E_VF_ARQT1:
> >     case I40E_VF_ATQBAH1:
> >     case I40E_VF_ATQBAL1:
> >     case I40E_VF_ATQH1:
> >     case I40E_VF_ATQT1:
> >     case I40E_VF_ATQLEN1:
> >         vfio_i40evf_vw32(vdev, addr, data);
> >         vfio_i40e_aq_update(vdev); ==> update & process atq commands
> >         break;
> >     default:
> >         vfio_i40evf_w32(vdev, addr, data);
> >         break;
> >     }
> > }
> > vfio_i40e_aq_update(vdev)
> > 	|->vfio_i40e_atq_process_one(vdev, vfio_i40evf_vr32(vdev, I40E_VF_ATQH1)
> > 		|-> hwaddr addr = vfio_i40e_get_atqba(vdev) + (index * sizeof(desc));
> > 		|   pci_dma_read(pdev, addr, &desc, sizeof(desc));//read guest's command
> > 		|   vfio_i40e_record_atq_cmd(vdev, pdev, &desc)
> > 			
> > 		
> > 
> > vfio_i40e_record_atq_cmd(...I40eAdminQueueDescriptor *desc) {
> > 	data_addr = desc->params.external.addr_high;
> > 	...
> > 
> > 	switch (desc->cookie_high) {
> > 	...
> > 	case I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES:
> > 	pci_dma_read(pdev, data_addr, &vdev->vsi_config,
> > 		         MIN(desc->datalen, sizeof(vdev->vsi_config)));
> > 	...
> > 	}
> > 	...
> > }
> > 
> > 
> > (2) pass through all guest MMIO accesses and only trap MMIO when
> > migration is about to start.
> > This is the way we use in the host vfio-pci vendor driver (or mdev parent
> > driver) of the i40e VF device (sorry, no public code available yet).
> > 
> > When migration is about to start, it is already too late to obtain the
> > guest DMA addresses of those virtual channel related resources merely by
> > trapping MMIO, so we have to ask the PF for them.
> > 
> > 
> > 
> > <...>
> > 
> > > > > > the interfaces exported in patches 3/10-5/10 need to be exported
> > > > > > anyway for writing mdev parent drivers that pass through devices in
> > > > > > normal operation, to avoid duplication. And yes, your worry about
> > > > > 
> > > > > Where are those parent drivers?  What are their actual requirements?
> > > > >    
> > > > if this way of registering vendor ops to vfio-pci is not permitted,
> > > > vendors have to resort to writing their own mdev parent drivers for
> > > > VFs. Those parent drivers need to pass the VFs through in normal
> > > > operation, doing exactly what vfio-pci does, and only do what the
> > > > vendor ops do during migration.
> > > > 
> > > > If vfio-pci could export common code to those parent drivers, lots of
> > > > duplicated code could be avoided.
> > > 
> > > There are two sides to this argument though.  We could also argue that
> > > mdev has already made it too easy to implement device emulation in the
> > > kernel, the barrier is that such emulation is more transparent because
> > > it does require a fair bit of code duplication from vfio-pci.  If we
> > > make it easier to simply re-use vfio-pci for much of this, and even
> > > take it a step further by allowing vendor drivers to masquerade behind
> > > vfio-pci, then we're creating an environment where vendors don't need
> > > to work with QEMU to get their device emulation accepted.  They can
> > > write their own vendor drivers, which are now simplified and sanctioned
> > > by exported functions in vfio-pci.  They can do this easily and open up
> > > massive attack vectors, hiding behind vfio-pci.
> > >   
> > Your concern is reasonable.
> > 
> > > I know that I was advocating avoiding user driver confusion, ie. does
> > > the user bind a device to vfio-pci, i40e_vf_vfio, etc, but maybe that's
> > > the barrier we need such that a user can make an informed decision
> > > about what they're actually using.  If a vendor then wants to implement
> > > a feature in vfio-pci, we'll need to architect an interface for it
> > > rather than letting them pick and choose which pieces of vfio-pci to
> > > override.
> > >   
> > > > > > identification of bug sources is reasonable. But if a device is
> > > > > > bound to vfio-pci with a vendor module loaded and there's a bug,
> > > > > > there are at least two ways to identify whether it's a bug in
> > > > > > vfio-pci itself:
> > > > > > (1) prevent vendor modules from loading and see if the problem
> > > > > > still exists with pure vfio-pci;
> > > > > > (2) do what's demoed in patch 8/10, i.e. do nothing but simply
> > > > > > pass all operations through to vfio-pci.
> > > > > 
> > > > > The code split is still extremely ad-hoc, there's no API.  An mdev
> > > > > driver isn't even a sub-driver of vfio-pci like you're trying to
> > > > > accomplish here, there would need to be a much more defined API when
> > > > > the base device isn't even a vfio_pci_device.  I don't see how this
> > > > > series would directly enable an mdev use case.
> > > > >     
> > > > similar to Yi's series https://patchwork.kernel.org/patch/11320841/,
> > > > we can factor the vdev creation code out of vfio_pci_probe() so that
> > > > it can be called from an mdev parent's probe routine (and, of course,
> > > > also factor out the code that frees the vdev), e.g.
> > > > e.g.
> > > > 
> > > > void *vfio_pci_alloc_vdev(struct pci_dev *pdev, const struct pci_device_id *id)
> > > > {
> > > > 	struct vfio_pci_device *vdev;
> > > > 
> > > > 	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > > 	if (!vdev)
> > > > 		return NULL;
> > > > 
> > > > 	vdev->pdev = pdev;
> > > > 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > > 	mutex_init(&vdev->igate);
> > > > 	spin_lock_init(&vdev->irqlock);
> > > > 	mutex_init(&vdev->ioeventfds_lock);
> > > > 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > > 	...
> > > > 	vfio_pci_probe_power_state(vdev);
> > > > 
> > > > 	if (!disable_idle_d3) {
> > > > 		vfio_pci_set_power_state(vdev, PCI_D0);
> > > > 		vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > > 	}
> > > > 	return vdev;
> > > > }
> > > > 
> > > > static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
> > > > 				      const struct pci_device_id *id)
> > > > {
> > > > 	void *vdev = vfio_pci_alloc_vdev(pdev, id);
> > > > 
> > > > 	/* save the vdev pointer */
> > > > 	...
> > > > 	return 0;
> > > > }
> > > > Then all the interfaces exported by this series can also benefit the
> > > > mdev use case.
> > > 
> > > You need to convince me that we're not just doing this for the sake of
> > > re-using a migration interface.  We do need vendor specific drivers to
> > > support migration, but implementing those vendor specific drivers in
> > > the kernel just because we have that interface is the wrong answer.  If
> > > we can implement that device specific migration support in QEMU and
> > > limit the attack surface from the hypervisor or guest into the host
> > > kernel, that's a better answer.  As I've noted above, I'm afraid all of
> > > these attempts to parcel out vfio-pci are only going to serve to
> > > proliferate vendor modules that have limited community review, expand
> > > the attack surface, and potentially harm the vfio ecosystem overall
> > > through bad actors and reduced autonomy.  Thanks,
> > >  
> > The requirement to access the PF, as mentioned above, is one of the
> > reasons for us to implement the emulation in the kernel.
> > Another reason is that we don't want to duplicate a lot of kernel logic
> > in QEMU, as was done in Alex Graf's "vfio-i40e"; QEMU would then have to
> > be updated whenever the kernel driver changes, and the effort of
> > maintenance and version matching is a big burden for vendors.
> > But you are right, code under vendor specific directories gets less
> > review from the virtualization side. That's also the impulse for us to
> > propose common helper APIs for them to call, not only for convenience
> > and to avoid duplication, but also so that the code is fully reviewed.
> > 
> > Would you mind giving us some suggestions on where to go?
> 
> Not duplicating kernel code into userspace isn't a great excuse.  What
> we need to do to emulate a device is not an exact mapping to what a
> driver for that device needs to do.  If we need to keep the device
> driver and the emulation in sync then we haven't done a good job with
> the emulation.  What would it look like if we only had an additional
> device specific region on the vfio device fd we could use to get the
> descriptor information we need from the PF?  This would be more in line
> with the quirks we provide for IGD assignment.  Thanks,
> 
Hi Alex,
Thanks for this suggestion.
As the migration region is a generic vendor region, do you think the way
below to specify device specific regions is acceptable?

(1) provide/export an interface to let a vendor driver register its device
specific regions, or substitute the get_region_info/rw/mmap handlers of
existing regions.
(2) export vfio_pci_default_rw() and vfio_pci_default_mmap() to be called
from both the vendor driver handlers and vfio-pci.

Or do you still prefer adding quirks per device, so that you can have a
better review of all the code?

We can add a disable flag to disable the regions registered/modified by
vendor drivers in bullet (1), for debugging purposes.

Thanks
Yan



