* [PATCH v7 0/4] Add Mediated device support
@ 2016-08-25  3:53 ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-25  3:53 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

This series adds Mediated device support to the Linux host kernel. The
purpose of this series is to provide a common interface for mediated device
management that can be used by different devices. The series introduces an
mdev core module that creates and manages mediated devices, a VFIO-based
driver for the mediated devices created by the mdev core module, and updates
to the VFIO type1 IOMMU module to support pinning and unpinning of pages for
mediated devices.

This series uses uuid_le_to_bin() to parse a UUID string and convert it to
its binary form, which requires the following commits from the Linux master
branch:
* commit bc9dc9d5eec908806f1b15c9ec2253d44dcf7835 :
        lib/uuid.c: use correct offset in uuid parser
* commit 2b1b0d66704a8cafe83be7114ec4c15ab3a314ad :
        lib/uuid.c: introduce a few more generic helpers
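
As an illustrative sketch only (not part of the series), the helper is used
roughly as below, where uuid_str is a hypothetical NUL-terminated string such
as "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001":

	uuid_le uuid;
	int ret;

	/* uuid_le_to_bin() returns 0 on success, a negative errno on parse failure */
	ret = uuid_le_to_bin(uuid_str, &uuid);
	if (ret)
		return ret;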

The mmap region fault handler, which uses remap_pfn_range() to set up the
EPT properly, requires the following commits from the Linux master branch:
* commit add6a0cd1c5ba51b201e1361b05a5df817083618 :
        KVM: MMU: try to fix up page faults before giving up
* commit 92176a8ede577d0ff78ab3298e06701f67ad5f51 :
        KVM: MMU: prepare to support mapping of VM_IO and VM_PFNMAP frames

What's new in v7?
- Removed the 'instance' field from the mdev_device structure.
- Replaced the 'start' and 'stop' interfaces with a per-mdev-device 'online'
  interface that takes 1 or 0 as its argument.
- Removed the validate_mmap_request() callback and added an mmap() callback
  to parent_ops.
- With the above change, removed the mapping tracking logic and invalidation
  function from the mdev core module; vendor drivers should implement this in
  their own modules.
- Added a get_device_info() callback so that the vendor driver can define the
  device type and the number of regions and IRQs supported.
- Added a get_irq_info() callback for the vendor driver to define the flags
  for its IRQs.
- Updated the get_region_info() callback so that the vendor driver can
  specify region capabilities.
- With all of the above changes, the VFIO driver is no longer a PCI driver;
  it can be used for any type of device. Hence, the vfio_mpci module is
  renamed to vfio_mdev and match() is removed from the driver interface
  structure.

Still TODO:
  Handle the case in the vfio_iommu_type1 module that Alex pointed out in the
v6 review: if the devices attached to the normal IOMMU API domain go away,
accounting needs to be re-established for the local domain.


Kirti Wankhede (4):
  vfio: Mediated device Core driver
  vfio: VFIO driver for mediated devices
  vfio iommu: Add support for mediated devices
  docs: Add Documentation for Mediated devices

 Documentation/vfio-mediated-device.txt | 203 +++++++++++++
 drivers/vfio/Kconfig                   |   1 +
 drivers/vfio/Makefile                  |   1 +
 drivers/vfio/mdev/Kconfig              |  18 ++
 drivers/vfio/mdev/Makefile             |   6 +
 drivers/vfio/mdev/mdev_core.c          | 509 +++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c        | 131 +++++++++
 drivers/vfio/mdev/mdev_private.h       |  36 +++
 drivers/vfio/mdev/mdev_sysfs.c         | 240 ++++++++++++++++
 drivers/vfio/mdev/vfio_mdev.c          | 467 ++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h    |   6 +-
 drivers/vfio/vfio.c                    | 117 ++++++++
 drivers/vfio/vfio_iommu_type1.c        | 499 +++++++++++++++++++++++++++++---
 include/linux/mdev.h                   | 212 ++++++++++++++
 include/linux/vfio.h                   |  13 +-
 15 files changed, 2408 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/vfio-mediated-device.txt
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0

* [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-08-25  3:53 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-25  3:53   ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-25  3:53 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

Design for the Mediated Device Driver:
The main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different devices.

This module provides a generic interface to create a device, add it to the
mediated bus, add the device to an IOMMU group and then add it to a VFIO
group.

Below is the high-level block diagram, with NVIDIA, Intel and IBM devices as
examples, since these are the devices that are going to actively use this
module for now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |  mdev     | |                         |              |
 | |  bus      | +------------------------>+              |<-> VFIO user
 | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
 | |           | |                         |              |
 | +-----------+ |                         +--------------+
 |               |
 |  MDEV CORE    |
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

The core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

The mediated device bus driver for mdev, vfio_mdev, uses this interface to
register with the core driver. The vfio_mdev module adds the mediated device
to a VFIO group.
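
As an illustrative sketch only (not part of this patch), a mediated bus
driver would typically register itself as below, assuming the usual module
boilerplate; the sample_* names are hypothetical:

static int sample_mdev_probe(struct device *dev)
{
	/* bind bus-driver state to the new mediated device */
	return 0;
}

static void sample_mdev_remove(struct device *dev)
{
	/* undo whatever probe() set up */
}

static struct mdev_driver sample_mdev_driver = {
	.name   = "sample_mdev",
	.probe  = sample_mdev_probe,
	.remove = sample_mdev_remove,
};

static int __init sample_init(void)
{
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}

static void __exit sample_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}

module_init(sample_init);
module_exit(sample_exit);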

2. Physical device driver interface
This interface provides the vendor driver a set of APIs to manage physical
device related work in its own driver. The APIs are:
- supported_config: provide the list of configurations supported by the
		    vendor driver.
- create: to allocate basic resources in vendor driver for a mediated
	  device.
- destroy: to free resources in vendor driver when mediated device is
	   destroyed.
- reset: to free and reallocate resources in vendor driver during device
	 reset.
- set_online_status: to change online status of mediated device.
- get_online_status: to get current (online/offline) status of mediated
		     device.
- read: read emulation callback.
- write: write emulation callback.
- mmap: mmap emulation callback.
- get_irq_info: to retrieve information about the mediated device's IRQs.
- set_irqs: to send the interrupt configuration information that the VMM sets.
- get_device_info: to retrieve VFIO device related flags, number of regions
		   and number of IRQs supported.
- get_region_info: to provide region size and its flags for the mediated
		   device.

Vendor drivers should use this registration interface to register each
physical device with the mdev core driver.
Locks to serialize the above callbacks have been removed; if required, the
vendor driver can serialize these callbacks in its own driver.
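
As an illustrative sketch only (not part of this patch), a vendor driver
would fill in the mandatory create/destroy callbacks and register its
physical device roughly as below; the sample_* names are hypothetical:

static int sample_create(struct mdev_device *mdev, char *mdev_params)
{
	/* allocate vendor-driver resources for this mediated device */
	return 0;
}

static int sample_destroy(struct mdev_device *mdev)
{
	/* free resources; return an error to refuse hot-unplug */
	return 0;
}

static const struct parent_ops sample_parent_ops = {
	.owner   = THIS_MODULE,
	.create  = sample_create,
	.destroy = sample_destroy,
	/* read/write/mmap, get_*_info and set_irqs hooked up as needed */
};

/* called from the vendor driver's probe path for the physical device */
static int sample_probe_physical(struct device *dev)
{
	return mdev_register_device(dev, &sample_parent_ops);
}

/* called when the physical device goes away */
static void sample_remove_physical(struct device *dev)
{
	mdev_unregister_device(dev);
}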

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
Reviewed-on: http://git-master/r/1175705
Reviewed-by: Automatic_Commit_Validation_User
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  12 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 509 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++
 drivers/vfio/mdev/mdev_private.h |  36 +++
 drivers/vfio/mdev/mdev_sysfs.c   | 240 ++++++++++++++++++
 include/linux/mdev.h             | 212 ++++++++++++++++
 9 files changed, 1147 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..a34fbc66f92f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,12 @@
+
+config VFIO_MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        Provides a framework to virtualize devices.
+        See Documentation/vfio-mediated-device.txt for more details.
+
+        If you don't know what to do here, say N.
+
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..56a75e689582
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..9f278c7507f7
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,509 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+/* Should be called holding parent->mdev_list_lock */
+static struct mdev_device *__find_mdev_device(struct parent_device *parent,
+					      uuid_le uuid)
+{
+	struct mdev_device *mdev;
+
+	list_for_each_entry(mdev, &parent->mdev_list, next) {
+		if (uuid_le_cmp(mdev->uuid, uuid) == 0)
+			return mdev;
+	}
+	return NULL;
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *__find_parent_device(struct device *dev)
+{
+	struct parent_device *parent;
+
+	list_for_each_entry(parent, &parent_list, next) {
+		if (parent->dev == dev)
+			return parent;
+	}
+	return NULL;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	kfree(parent);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
+{
+	struct parent_device *parent;
+
+	mutex_lock(&parent_list_lock);
+	parent = mdev_get_parent(__find_parent_device(dev));
+	mutex_unlock(&parent_list_lock);
+
+	return parent;
+}
+
+static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	ret = parent->ops->create(mdev, mdev_params);
+	if (ret)
+		return ret;
+
+	ret = mdev_add_attribute_group(&mdev->dev,
+					parent->ops->mdev_attr_groups);
+	if (ret)
+		parent->ops->destroy(mdev);
+
+	return ret;
+}
+
+static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	/*
+	 * If vendor driver doesn't return success that means vendor
+	 * driver doesn't support hot-unplug
+	 */
+	ret = parent->ops->destroy(mdev);
+	if (ret && !force)
+		return -EBUSY;
+
+	mdev_remove_attribute_group(&mdev->dev,
+				    parent->ops->mdev_attr_groups);
+
+	return ret;
+}
+
+static void mdev_release_device(struct kref *kref)
+{
+	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
+	struct parent_device *parent = mdev->parent;
+
+	list_del(&mdev->next);
+
+	/*
+	 * This unlock pairs with mutex held by mdev_put_device() through
+	 * kref_put_mutex()
+	 */
+	mutex_unlock(&parent->mdev_list_lock);
+
+	device_unregister(&mdev->dev);
+	wake_up(&parent->release_done);
+	mdev_put_parent(parent);
+}
+
+struct mdev_device *mdev_get_device(struct mdev_device *mdev)
+{
+	if (mdev)
+		kref_get(&mdev->ref);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device);
+
+void mdev_put_device(struct mdev_device *mdev)
+{
+	struct parent_device *parent;
+
+	if (!mdev)
+		return;
+
+	parent = mdev->parent;
+	kref_put_mutex(&mdev->ref, mdev_release_device,
+		       &parent->mdev_list_lock);
+}
+EXPORT_SYMBOL(mdev_put_device);
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	/* check for mandatory ops */
+	if (!ops->create || !ops->destroy)
+		return -EINVAL;
+
+	mutex_lock(&parent_list_lock);
+
+	/* Check for duplicate */
+	parent = __find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+	list_add(&parent->next, &parent_list);
+
+	parent->dev = dev;
+	parent->ops = ops;
+	mutex_init(&parent->mdev_list_lock);
+	INIT_LIST_HEAD(&parent->mdev_list);
+	init_waitqueue_head(&parent->release_done);
+	mutex_unlock(&parent_list_lock);
+
+	ret = parent_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_lock(&parent_list_lock);
+	list_del(&parent->next);
+	mutex_unlock(&parent_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+
+add_dev_err:
+	mutex_unlock(&parent_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	struct mdev_device *mdev = NULL;
+	int ret;
+
+	mutex_lock(&parent_list_lock);
+	parent = __find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove "mdev_create" and
+	 * "mdev_destroy" sysfs files so that no new mediated device could be
+	 * created for this parent
+	 */
+	list_del(&parent->next);
+	parent_remove_sysfs_files(dev);
+	mutex_unlock(&parent_list_lock);
+
+	mdev_remove_attribute_group(dev,
+				    parent->ops->dev_attr_groups);
+
+	while (!list_empty(&parent->mdev_list)) {
+		mutex_lock(&parent->mdev_list_lock);
+		if (!list_empty(&parent->mdev_list)) {
+			mdev = list_first_entry(&parent->mdev_list,
+						struct mdev_device, next);
+			mdev_device_destroy_ops(mdev, true);
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			mdev_put_device(mdev);
+	}
+
+	do {
+		ret = wait_event_interruptible_timeout(parent->release_done,
+				list_empty(&parent->mdev_list), HZ * 10);
+		if (ret == -ERESTARTSYS) {
+			dev_warn(dev, "Mediated devices are in use, task"
+				      " \"%s\" (%d) "
+				      "blocked until all are released",
+				      current->comm, task_pid_nr(current));
+		}
+	} while (ret <= 0);
+
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev_sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_from_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	mutex_lock(&parent->mdev_list_lock);
+	/* Check for duplicate */
+	mdev = __find_mdev_device(parent, uuid);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->parent = parent;
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUl", uuid.b);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(mdev, mdev_params);
+	if (ret)
+		goto create_failed;
+
+	ret = mdev_create_sysfs_files(&mdev->dev);
+	if (ret)
+		goto create_sysfs_error;
+
+	list_add(&mdev->next, &parent->mdev_list);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_sysfs_error:
+	mdev_device_destroy_ops(mdev, true);
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_destroy(struct device *dev, uuid_le uuid)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	int ret;
+
+	parent = mdev_get_parent_from_dev(dev);
+	if (!parent)
+		return -ENODEV;
+
+	mutex_lock(&parent->mdev_list_lock);
+	mdev = __find_mdev_device(parent, uuid);
+	if (!mdev) {
+		ret = -EINVAL;
+		goto destroy_err;
+	}
+
+	mdev_remove_sysfs_files(&mdev->dev);
+	ret = mdev_device_destroy_ops(mdev, false);
+	if (ret)
+		goto destroy_err;
+
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_device(mdev);
+
+	mdev_put_parent(parent);
+	return ret;
+
+destroy_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+void mdev_device_supported_config(struct device *dev, char *str)
+{
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_from_dev(dev);
+
+	if (parent) {
+		if (parent->ops->supported_config)
+			parent->ops->supported_config(parent->dev, str);
+		mdev_put_parent(parent);
+	}
+}
+
+int mdev_device_set_online_status(struct device *dev, bool online)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_device(to_mdev_device(dev));
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->set_online_status)
+		ret = parent->ops->set_online_status(mdev, online);
+
+	if (ret)
+		pr_err("mdev online failed  %d\n", ret);
+	else {
+		if (online)
+			kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
+		else
+			kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
+	}
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+int mdev_device_get_online_status(struct device *dev, bool *online)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_device(to_mdev_device(dev));
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->get_online_status)
+		ret = parent->ops->get_online_status(mdev, online);
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		return ret;
+	}
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..8afc2d8e5c04
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,131 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	mdev->group = NULL;
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..07ad1b381370
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,36 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  parent_create_sysfs_files(struct device *dev);
+void parent_remove_sysfs_files(struct device *dev);
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params);
+int  mdev_device_destroy(struct device *dev, uuid_le uuid);
+void mdev_device_supported_config(struct device *dev, char *str);
+
+int mdev_device_set_online_status(struct device *dev, bool online);
+int mdev_device_get_online_status(struct device *dev, bool *online);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..ed55cd5d6595
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,240 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+static ssize_t online_store(struct device *dev, struct device_attribute *attr,
+			    const char *buf, size_t count);
+static ssize_t online_show(struct device *dev, struct device_attribute *attr,
+			   char *buf);
+static DEVICE_ATTR_RW(online);
+
+/* Static functions */
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
+
+/* mdev sysfs Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str, *ptr;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ptr = str;
+	mdev_device_supported_config(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(ptr);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *mdev_params = NULL, *params = NULL;
+	uuid_le uuid;
+	int ret;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (str)
+		params = mdev_params = kstrdup(str, GFP_KERNEL);
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		goto create_error;
+	}
+
+	ret = mdev_device_create(dev, uuid, mdev_params);
+	if (ret)
+		pr_err("mdev_create: Failed to create mdev device\n");
+	else
+		ret = count;
+
+create_error:
+	kfree(params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = mdev_device_destroy(dev, uuid);
+	if (ret == 0)
+		ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t online_store(struct device *dev, struct device_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *str;
+	int ret;
+	uint32_t online_status;
+	bool online;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ret = kstrtouint(str, 0, &online_status);
+	kfree(str);
+
+	if (ret) {
+		pr_err("online: parsing error %s\n", buf);
+		return ret;
+	}
+
+	online = online_status > 0 ? true : false;
+
+	ret = mdev_device_set_online_status(dev, online);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static ssize_t online_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	int ret;
+	bool online = false;
+
+	ret = mdev_device_get_online_status(dev, &online);
+	if (ret)
+		return ret;
+
+	ret = sprintf(buf, "%d\n", online);
+	return ret;
+}
+
+int parent_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj,
+				&dev_attr_mdev_supported_types.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		goto create_sysfs_failed;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_destroy sysfs entry\n");
+		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	} else
+		return ret;
+
+create_sysfs_failed:
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	return ret;
+}
+
+void parent_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_online.attr);
+	if (ret)
+		pr_err("Failed to create 'online' entry\n");
+
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_online.attr);
+}
+
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..babcb7293199
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,212 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+#include <uapi/linux/vfio.h>
+
+struct parent_device;
+
+/*
+ * Mediated device
+ */
+
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	uuid_le			uuid;
+	void			*driver_data;
+
+	/* internal only */
+	struct kref		ref;
+	struct list_head	next;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the parent device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of parent device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device. It is
+ *			mandatory to provide create ops.
+ *			@mdev: mdev_device structure of the mediated device
+ *			      that is being created
+ *			@mdev_params: extra parameters required by parent
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in parent device's driver for
+ *			a mediated device. It is mandatory to provide destroy
+ *			ops.
+ *			@mdev: mdev_device device structure which is being
+ *			       destroyed
+ *			Returns integer: success (0) or error (< 0)
+ *			If the VMM is running and destroy() is called, the mdev
+ *			is being hot-unplugged. Return an error if the VMM is
+ *			running and the driver doesn't support mediated device
+ *			hotplug.
+ * @reset:		Called to reset mediated device.
+ *			@mdev: mdev_device device structure.
+ *			Returns integer: success (0) or error (< 0)
+ * @set_online_status:	Called to change a mediated device's online status.
+ *			@mdev: mediated device.
+ *			@online: set true or false to make mdev device online or
+ *			offline.
+ *			Returns integer: success (0) or error (< 0)
+ * @get_online_status:	Called to get a mediated device's online/offline status.
+ *			@mdev: mediated device.
+ *			@online: Returns status of mediated device.
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@pos: address.
+ *			Returns number of bytes read on success or error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@pos: address.
+ *			Returns number of bytes written on success or error.
+ * @get_irq_info:	Called to retrieve information about mediated device IRQ
+ *			@mdev: mediated device structure
+ *			@irq_info: VFIO IRQ flags and count.
+ *			Returns integer: success (0) or error (< 0)
+ * @set_irqs:		Called to send the interrupt configuration
+ *			information that the VMM sets.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_device_info:	Called to get VFIO device information for a mediated
+ *			device.
+ *			@vfio_device_info: VFIO device info.
+ *			Returns integer: success (0) or error (< 0)
+ * @get_region_info:	Called to get VFIO region size and flags of mediated
+ *			device.
+ *			@mdev: mediated device structure
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			@cap_type_id: returns id of capability.
+ *			@cap_type: returns pointer to capability structure
+ *			corresponding to capability id.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * Parent devices that support mediated devices should be registered with the
+ * mdev module with a parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct mdev_device *mdev, char *mdev_params);
+	int     (*destroy)(struct mdev_device *mdev);
+	int     (*reset)(struct mdev_device *mdev);
+	int     (*set_online_status)(struct mdev_device *mdev, bool online);
+	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
+	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
+			loff_t pos);
+	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
+			 loff_t pos);
+	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+	int	(*get_irq_info)(struct mdev_device *mdev,
+				struct vfio_irq_info *irq_info);
+	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_device_info)(struct mdev_device *mdev,
+				   struct vfio_device_info *dev_info);
+	int	(*get_region_info)(struct mdev_device *mdev,
+				   struct vfio_region_info *region_info,
+				   u16 *cap_type_id, void **cap_type);
+};
+
+/*
+ * Parent Device
+ */
+
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct list_head	next;
+	struct list_head	mdev_list;
+	struct mutex		mdev_list_lock;
+	wait_queue_head_t	release_done;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return mdev->driver_data;
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	mdev->driver_data = data;
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
+extern void mdev_put_device(struct mdev_device *mdev);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+#endif /* MDEV_H */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 162+ messages in thread

* [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver
@ 2016-08-25  3:53   ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-25  3:53 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |  mdev     | |                         |              |
 | |  bus      | +------------------------>+              |<-> VFIO user
 | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
 | |           | |                         |              |
 | +-----------+ |                         +--------------+
 |               |
 |  MDEV CORE    |
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

Mediated device's driver for mdev, vfio_mdev, uses this interface to
register with Core driver. vfio_mdev module adds mediated device to VFIO
group.

2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in their own driver. APIs are :
- supported_config: provide supported configuration list by the vendor
		    driver
- create: to allocate basic resources in vendor driver for a mediated
	  device.
- destroy: to free resources in vendor driver when mediated device is
	   destroyed.
- reset: to free and reallocate resources in vendor driver during device
	 reset.
- set_online_status: to change online status of mediated device.
- get_online_status: to get current (online/offline) status of mediated
		     device.
- read : read emulation callback.
- write: write emulation callback.
- mmap: mmap emulation callback.
- get_irq_info: to retrieve information about mediated device's IRQ.
- set_irqs: send interrupt configuration information that VMM sets.
- get_device_info: to retrieve VFIO device related flags, number of regions
		   and number of IRQs supported.
- get_region_info: to provide region size and its flags for the mediated
		   device.

This registration interface should be used by vendor drivers to register
each physical device to mdev core driver.
Locks to serialize above callbacks are removed. If required, vendor driver
can have locks to serialize above APIs in their driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
Reviewed-on: http://git-master/r/1175705
Reviewed-by: Automatic_Commit_Validation_User
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  12 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 509 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++
 drivers/vfio/mdev/mdev_private.h |  36 +++
 drivers/vfio/mdev/mdev_sysfs.c   | 240 ++++++++++++++++++
 include/linux/mdev.h             | 212 ++++++++++++++++
 9 files changed, 1147 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..a34fbc66f92f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,12 @@
+
+config VFIO_MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        Provides a framework to virtualize device.
+	See Documentation/vfio-mediated-device.txt for more details.
+
+        If you don't know what do here, say N.
+
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..56a75e689582
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..9f278c7507f7
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,509 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+/* Should be called holding parent->mdev_list_lock */
+static struct mdev_device *__find_mdev_device(struct parent_device *parent,
+					      uuid_le uuid)
+{
+	struct mdev_device *mdev;
+
+	list_for_each_entry(mdev, &parent->mdev_list, next) {
+		if (uuid_le_cmp(mdev->uuid, uuid) == 0)
+			return mdev;
+	}
+	return NULL;
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *__find_parent_device(struct device *dev)
+{
+	struct parent_device *parent;
+
+	list_for_each_entry(parent, &parent_list, next) {
+		if (parent->dev == dev)
+			return parent;
+	}
+	return NULL;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	kfree(parent);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
+{
+	struct parent_device *parent;
+
+	mutex_lock(&parent_list_lock);
+	parent = mdev_get_parent(__find_parent_device(dev));
+	mutex_unlock(&parent_list_lock);
+
+	return parent;
+}
+
+static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	ret = parent->ops->create(mdev, mdev_params);
+	if (ret)
+		return ret;
+
+	ret = mdev_add_attribute_group(&mdev->dev,
+					parent->ops->mdev_attr_groups);
+	if (ret)
+		parent->ops->destroy(mdev);
+
+	return ret;
+}
+
+static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	/*
+	 * If vendor driver doesn't return success that means vendor
+	 * driver doesn't support hot-unplug
+	 */
+	ret = parent->ops->destroy(mdev);
+	if (ret && !force)
+		return -EBUSY;
+
+	mdev_remove_attribute_group(&mdev->dev,
+				    parent->ops->mdev_attr_groups);
+
+	return ret;
+}
+
+static void mdev_release_device(struct kref *kref)
+{
+	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
+	struct parent_device *parent = mdev->parent;
+
+	list_del(&mdev->next);
+
+	/*
+	 * This unlock pairs with mutex held by mdev_put_device() through
+	 * kref_put_mutex()
+	 */
+	mutex_unlock(&parent->mdev_list_lock);
+
+	device_unregister(&mdev->dev);
+	wake_up(&parent->release_done);
+	mdev_put_parent(parent);
+}
+
+struct mdev_device *mdev_get_device(struct mdev_device *mdev)
+{
+	if (mdev)
+		kref_get(&mdev->ref);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device);
+
+void mdev_put_device(struct mdev_device *mdev)
+{
+	struct parent_device *parent;
+
+	if (!mdev)
+		return;
+
+	parent = mdev->parent;
+	kref_put_mutex(&mdev->ref, mdev_release_device,
+		       &parent->mdev_list_lock);
+}
+EXPORT_SYMBOL(mdev_put_device);
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	/* check for mandatory ops */
+	if (!ops->create || !ops->destroy)
+		return -EINVAL;
+
+	mutex_lock(&parent_list_lock);
+
+	/* Check for duplicate */
+	parent = __find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+	list_add(&parent->next, &parent_list);
+
+	parent->dev = dev;
+	parent->ops = ops;
+	mutex_init(&parent->mdev_list_lock);
+	INIT_LIST_HEAD(&parent->mdev_list);
+	init_waitqueue_head(&parent->release_done);
+	mutex_unlock(&parent_list_lock);
+
+	ret = parent_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_lock(&parent_list_lock);
+	list_del(&parent->next);
+	mutex_unlock(&parent_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+
+add_dev_err:
+	mutex_unlock(&parent_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	struct mdev_device *mdev = NULL;
+	int ret;
+
+	mutex_lock(&parent_list_lock);
+	parent = __find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove "mdev_create" and
+	 * "mdev_destroy" sysfs files so that no new mediated device could be
+	 * created for this parent
+	 */
+	list_del(&parent->next);
+	parent_remove_sysfs_files(dev);
+	mutex_unlock(&parent_list_lock);
+
+	mdev_remove_attribute_group(dev,
+				    parent->ops->dev_attr_groups);
+
+	while (!list_empty(&parent->mdev_list)) {
+		mutex_lock(&parent->mdev_list_lock);
+		if (!list_empty(&parent->mdev_list)) {
+			mdev = list_first_entry(&parent->mdev_list,
+						struct mdev_device, next);
+			mdev_device_destroy_ops(mdev, true);
+		} else {
+			/* list emptied by someone else; nothing left to put */
+			mdev = NULL;
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			mdev_put_device(mdev);
+	}
+
+	do {
+		ret = wait_event_interruptible_timeout(parent->release_done,
+				list_empty(&parent->mdev_list), HZ * 10);
+		if (ret == -ERESTARTSYS) {
+			dev_warn(dev,
+				 "Mediated devices are in use, task \"%s\" (%d) blocked until all are released\n",
+				 current->comm, task_pid_nr(current));
+		}
+	} while (ret <= 0);
+
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev_sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_from_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	mutex_lock(&parent->mdev_list_lock);
+	/* Check for duplicate */
+	mdev = __find_mdev_device(parent, uuid);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->parent = parent;
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUl", uuid.b);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(mdev, mdev_params);
+	if (ret)
+		goto create_failed;
+
+	ret = mdev_create_sysfs_files(&mdev->dev);
+	if (ret)
+		goto create_sysfs_error;
+
+	list_add(&mdev->next, &parent->mdev_list);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_sysfs_error:
+	mdev_device_destroy_ops(mdev, true);
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_destroy(struct device *dev, uuid_le uuid)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	int ret;
+
+	parent = mdev_get_parent_from_dev(dev);
+	if (!parent)
+		return -ENODEV;
+
+	mutex_lock(&parent->mdev_list_lock);
+	mdev = __find_mdev_device(parent, uuid);
+	if (!mdev) {
+		ret = -EINVAL;
+		goto destroy_err;
+	}
+
+	mdev_remove_sysfs_files(&mdev->dev);
+	ret = mdev_device_destroy_ops(mdev, false);
+	if (ret)
+		goto destroy_err;
+
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_device(mdev);
+
+	mdev_put_parent(parent);
+	return ret;
+
+destroy_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+void mdev_device_supported_config(struct device *dev, char *str)
+{
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_from_dev(dev);
+
+	if (parent) {
+		if (parent->ops->supported_config)
+			parent->ops->supported_config(parent->dev, str);
+		mdev_put_parent(parent);
+	}
+}
+
+int mdev_device_set_online_status(struct device *dev, bool online)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_device(to_mdev_device(dev));
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->set_online_status)
+		ret = parent->ops->set_online_status(mdev, online);
+
+	if (ret) {
+		pr_err("mdev set online status failed: %d\n", ret);
+	} else {
+		if (online)
+			kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
+		else
+			kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
+	}
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+int mdev_device_get_online_status(struct device *dev, bool *online)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_device(to_mdev_device(dev));
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->get_online_status)
+		ret = parent->ops->get_online_status(mdev, online);
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		return ret;
+	}
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..8afc2d8e5c04
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,131 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	mdev->group = NULL;
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..07ad1b381370
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,36 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  parent_create_sysfs_files(struct device *dev);
+void parent_remove_sysfs_files(struct device *dev);
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params);
+int  mdev_device_destroy(struct device *dev, uuid_le uuid);
+void mdev_device_supported_config(struct device *dev, char *str);
+
+int mdev_device_set_online_status(struct device *dev, bool online);
+int mdev_device_get_online_status(struct device *dev, bool *online);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..ed55cd5d6595
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,240 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+static ssize_t online_store(struct device *dev, struct device_attribute *attr,
+			    const char *buf, size_t count);
+static ssize_t online_show(struct device *dev, struct device_attribute *attr,
+			   char *buf);
+static DEVICE_ATTR_RW(online);
+
+/* Static functions */
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
+
+/* mdev sysfs Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str, *ptr;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ptr = str;
+	mdev_device_supported_config(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(ptr);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *mdev_params = NULL, *params = NULL;
+	uuid_le uuid;
+	int ret;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (str)
+		params = mdev_params = kstrdup(str, GFP_KERNEL);
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		goto create_error;
+	}
+
+	ret = mdev_device_create(dev, uuid, mdev_params);
+	if (ret)
+		pr_err("mdev_create: Failed to create mdev device\n");
+	else
+		ret = count;
+
+create_error:
+	kfree(params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_destroy: UUID parse error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = mdev_device_destroy(dev, uuid);
+	if (ret == 0)
+		ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t online_store(struct device *dev, struct device_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *str;
+	int ret;
+	uint32_t online_status;
+	bool online;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ret = kstrtouint(str, 0, &online_status);
+	kfree(str);
+
+	if (ret) {
+		pr_err("online: parsing error %s\n", buf);
+		return ret;
+	}
+
+	online = online_status > 0;
+
+	ret = mdev_device_set_online_status(dev, online);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static ssize_t online_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	int ret;
+	bool online = false;
+
+	ret = mdev_device_get_online_status(dev, &online);
+	if (ret)
+		return ret;
+
+	ret = sprintf(buf, "%d\n", online);
+	return ret;
+}
+
+int parent_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj,
+				&dev_attr_mdev_supported_types.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		goto create_sysfs_failed;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (!ret)
+		return 0;
+
+	pr_err("Failed to create mdev_destroy sysfs entry\n");
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+
+create_sysfs_failed:
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	return ret;
+}
+
+void parent_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_online.attr);
+	if (ret)
+		pr_err("Failed to create 'online' entry\n");
+
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_online.attr);
+}
+
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..babcb7293199
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,212 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+#include <uapi/linux/vfio.h>
+
+struct parent_device;
+
+/*
+ * Mediated device
+ */
+
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	uuid_le			uuid;
+	void			*driver_data;
+
+	/* internal only */
+	struct kref		ref;
+	struct list_head	next;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the parent device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of parent device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device. It is
+ *			mandatory to provide create ops.
+ *			@mdev: mdev_device structure of the mediated device
+ *			       that is being created
+ *			@mdev_params: extra parameters required by parent
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in parent device's driver for
+ *			a mediated device. It is mandatory to provide destroy
+ *			ops.
+ *			@mdev: mdev_device device structure which is being
+ *			       destroyed
+ *			Returns integer: success (0) or error (< 0)
+ *			If the VMM is running and destroy() is called, the
+ *			mdev is being hot-unplugged. Return an error if the
+ *			VMM is running and the driver doesn't support mediated
+ *			device hotplug.
+ * @reset:		Called to reset mediated device.
+ *			@mdev: mdev_device device structure.
+ *			Returns integer: success (0) or error (< 0)
+ * @set_online_status:	Called to change the status of a mediated device.
+ *			@mdev: mediated device.
+ *			@online: set true or false to make mdev device online or
+ *			offline.
+ *			Returns integer: success (0) or error (< 0)
+ * @get_online_status:	Called to get the online/offline status of a
+ *			mediated device.
+ *			@mdev: mediated device.
+ *			@online: Returns status of mediated device.
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@pos: address.
+ *			Returns number of bytes read on success or error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@pos: address.
+ *			Returns number of bytes written on success or error.
+ * @get_irq_info:	Called to retrieve information about mediated device IRQ
+ *			@mdev: mediated device structure
+ *			@irq_info: VFIO IRQ flags and count.
+ *			Returns integer: success (0) or error (< 0)
+ * @set_irqs:		Called to set the interrupt configuration that the
+ *			VMM provides.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_device_info:	Called to get VFIO device information for a mediated
+ *			device.
+ *			@vfio_device_info: VFIO device info.
+ *			Returns integer: success (0) or error (< 0)
+ * @get_region_info:	Called to get VFIO region size and flags of mediated
+ *			device.
+ *			@mdev: mediated device structure
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			@cap_type_id: returns id of capability.
+ *			@cap_type: returns pointer to capability structure
+ *			corresponding to capability id.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * A parent device that supports mediated devices should be registered with
+ * the mdev module along with its parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct mdev_device *mdev, char *mdev_params);
+	int     (*destroy)(struct mdev_device *mdev);
+	int     (*reset)(struct mdev_device *mdev);
+	int     (*set_online_status)(struct mdev_device *mdev, bool online);
+	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
+	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
+			loff_t pos);
+	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
+			 loff_t pos);
+	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+	int	(*get_irq_info)(struct mdev_device *mdev,
+				struct vfio_irq_info *irq_info);
+	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_device_info)(struct mdev_device *mdev,
+				   struct vfio_device_info *dev_info);
+	int	(*get_region_info)(struct mdev_device *mdev,
+				   struct vfio_region_info *region_info,
+				   u16 *cap_type_id, void **cap_type);
+};
+
+/*
+ * Parent Device
+ */
+
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct list_head	next;
+	struct list_head	mdev_list;
+	struct mutex		mdev_list_lock;
+	wait_queue_head_t	release_done;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return mdev->driver_data;
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	mdev->driver_data = data;
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
+extern void mdev_put_device(struct mdev_device *mdev);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+#endif /* MDEV_H */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 162+ messages in thread

* [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-08-25  3:53 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-25  3:53   ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-25  3:53 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

The VFIO mdev driver registers with the mdev core driver. The mdev core driver
creates a mediated device and calls the probe routine of the vfio_mdev driver,
which adds the mediated device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each mediated
device. Those are:
- get VFIO device information about type of device, maximum number of
  regions and maximum number of interrupts supported.
- get region information from vendor driver.
- Get interrupt information and send interrupt configuration information to
  vendor driver.
- Device reset
- Trap and forward read/write for emulated regions.
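
A minimal sketch of the vendor-driver side (illustrative only, not part of
this series; all my_* names are invented). A vendor driver registers a
parent_ops structure with the mdev core, and vfio_mdev forwards the VFIO
calls listed above to those callbacks:

#include <linux/module.h>
#include <linux/mdev.h>
#include <linux/vfio.h>

static int my_mdev_create(struct mdev_device *mdev, char *mdev_params)
{
	/* allocate vendor-private state; mdev_params carries vendor options */
	mdev_set_drvdata(mdev, NULL);
	return 0;
}

static int my_mdev_destroy(struct mdev_device *mdev)
{
	/* free vendor-private state; return an error to refuse hot-unplug */
	return 0;
}

static int my_get_device_info(struct mdev_device *mdev,
			      struct vfio_device_info *info)
{
	/* answers VFIO_DEVICE_GET_INFO forwarded by vfio_mdev */
	info->flags = VFIO_DEVICE_FLAGS_PCI;
	info->num_regions = VFIO_PCI_NUM_REGIONS;
	info->num_irqs = VFIO_PCI_NUM_IRQS;
	return 0;
}

static const struct parent_ops my_parent_ops = {
	.owner		 = THIS_MODULE,
	.create		 = my_mdev_create,	/* mandatory */
	.destroy	 = my_mdev_destroy,	/* mandatory */
	.get_device_info = my_get_device_info,
};

/* from the physical device driver's probe path */
static int my_parent_probe(struct device *dev)
{
	return mdev_register_device(dev, &my_parent_ops);
}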

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
Reviewed-on: http://git-master/r/1175706
Reviewed-by: Automatic_Commit_Validation_User
---
 drivers/vfio/mdev/Kconfig           |   6 +
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mdev.c       | 467 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 +-
 4 files changed, 477 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index a34fbc66f92f..703abd0a9bff 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,4 +9,10 @@ config VFIO_MDEV
 
         If you don't know what do here, say N.
 
+config VFIO_MDEV_DEVICE
+    tristate "VFIO support for Mediated devices"
+    depends on VFIO && VFIO_MDEV
+    default n
+    help
+        VFIO based driver for mediated devices.
 
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 56a75e689582..e5087ed83a34 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
 
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
new file mode 100644
index 000000000000..28f13aeaa46b
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -0,0 +1,467 @@
+/*
+ * VFIO based Mediated device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based Mediated device driver"
+
+struct vfio_mdev {
+	struct iommu_group *group;
+	struct mdev_device *mdev;
+	struct vfio_device_info dev_info;
+};
+
+static int vfio_mdev_open(void *device_data)
+{
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	return ret;
+}
+
+static void vfio_mdev_close(void *device_data)
+{
+	module_put(THIS_MODULE);
+}
+
+static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+	struct vfio_info_cap_header *header;
+	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
+	size_t size;
+
+	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
+	header = vfio_info_cap_add(caps, size,
+				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+	if (IS_ERR(header))
+		return PTR_ERR(header);
+
+	sparse_cap = container_of(header,
+			struct vfio_region_info_cap_sparse_mmap, header);
+	sparse_cap->nr_areas = sparse->nr_areas;
+	memcpy(sparse_cap->areas, sparse->areas,
+	       sparse->nr_areas * sizeof(*sparse->areas));
+	return 0;
+}
+
+static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+	struct vfio_info_cap_header *header;
+	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
+
+	header = vfio_info_cap_add(caps, sizeof(*cap),
+				   VFIO_REGION_INFO_CAP_TYPE, 1);
+	if (IS_ERR(header))
+		return PTR_ERR(header);
+
+	type_cap = container_of(header, struct vfio_region_info_cap_type,
+				header);
+	type_cap->type = cap->type;
+	type_cap->subtype = cap->subtype;
+	return 0;
+}
+
+static long vfio_mdev_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+	unsigned long minsz;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (parent->ops->get_device_info)
+			ret = parent->ops->get_device_info(vmdev->mdev, &info);
+		else
+			return -EINVAL;
+
+		if (ret)
+			return ret;
+
+		if (parent->ops->reset)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		memcpy(&vmdev->dev_info, &info, sizeof(info));
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		u16 cap_type_id = 0;
+		void *cap_type = NULL;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (parent->ops->get_region_info)
+			ret = parent->ops->get_region_info(vmdev->mdev, &info,
+						       &cap_type_id, &cap_type);
+		else
+			return -EINVAL;
+
+		if (ret)
+			return ret;
+
+		if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && cap_type) {
+			switch (cap_type_id) {
+			case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
+				ret = sparse_mmap_cap(&caps, cap_type);
+				if (ret)
+					return ret;
+				break;
+
+			case VFIO_REGION_INFO_CAP_TYPE:
+				ret = region_type_cap(&caps, cap_type);
+				if (ret)
+					return ret;
+				break;
+			default:
+				return -EINVAL;
+			}
+		}
+
+		if (caps.size) {
+			if (info.argsz < sizeof(info) + caps.size) {
+				info.argsz = sizeof(info) + caps.size;
+				info.cap_offset = 0;
+			} else {
+				vfio_info_cap_shift(&caps, sizeof(info));
+				if (copy_to_user((void __user *)arg +
+							sizeof(info), caps.buf,
+							caps.size)) {
+					kfree(caps.buf);
+					return -EFAULT;
+				}
+				info.cap_offset = sizeof(info);
+			}
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if ((info.argsz < minsz) ||
+		    (info.index >= vmdev->dev_info.num_irqs))
+			return -EINVAL;
+
+		if (parent->ops->get_irq_info)
+			ret = parent->ops->get_irq_info(vmdev->mdev, &info);
+		else
+			return -EINVAL;
+
+		if (ret)
+			return ret;
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		u8 *data = NULL, *ptr = NULL;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if ((hdr.argsz < minsz) ||
+		    (hdr.index >= vmdev->dev_info.num_irqs) ||
+		    (hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				  VFIO_IRQ_SET_ACTION_TYPE_MASK)))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size)
+				return -EINVAL;
+
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+						 hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		if (parent->ops->set_irqs)
+			ret = parent->ops->set_irqs(vmdev->mdev, hdr.flags,
+						    hdr.index, hdr.start,
+						    hdr.count, data);
+		else
+			ret = -EINVAL;
+
+		kfree(ptr);
+		return ret;
+	}
+	case VFIO_DEVICE_RESET:
+	{
+		if (parent->ops->reset)
+			return parent->ops->reset(vmdev->mdev);
+
+		return -EINVAL;
+	}
+	}
+	return -ENOTTY;
+}
+
+static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	unsigned int done = 0;
+	int ret;
+
+	if (!parent->ops->read)
+		return -EINVAL;
+
+	while (count) {
+		size_t filled;
+
+		if (count >= 4 && !(*ppos % 4)) {
+			u32 val;
+
+			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
+						*ppos);
+			if (ret <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 4;
+		} else if (count >= 2 && !(*ppos % 2)) {
+			u16 val;
+
+			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
+						*ppos);
+			if (ret <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
+			if (ret <= 0)
+				goto read_err;
+
+			if (copy_to_user(buf, &val, sizeof(val)))
+				goto read_err;
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		*ppos += filled;
+		buf += filled;
+	}
+
+	return done;
+
+read_err:
+	return -EFAULT;
+}
+
+static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	unsigned int done = 0;
+	int ret;
+
+	if (!parent->ops->write)
+		return -EINVAL;
+
+	while (count) {
+		size_t filled;
+
+		if (count >= 4 && !(*ppos % 4)) {
+			u32 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			ret = parent->ops->write(mdev, (char *)&val,
+						 sizeof(val), *ppos);
+			if (ret <= 0)
+				goto write_err;
+
+			filled = 4;
+		} else if (count >= 2 && !(*ppos % 2)) {
+			u16 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			ret = parent->ops->write(mdev, (char *)&val,
+						 sizeof(val), *ppos);
+			if (ret <= 0)
+				goto write_err;
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			if (copy_from_user(&val, buf, sizeof(val)))
+				goto write_err;
+
+			ret = parent->ops->write(mdev, &val, sizeof(val),
+						 *ppos);
+			if (ret <= 0)
+				goto write_err;
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		*ppos += filled;
+		buf += filled;
+	}
+
+	return done;
+write_err:
+	return -EFAULT;
+}
+
+static int vfio_mdev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+
+	if (parent->ops->mmap)
+		return parent->ops->mmap(mdev, vma);
+
+	return -EINVAL;
+}
+
+static const struct vfio_device_ops vfio_mdev_dev_ops = {
+	.name		= "vfio-mdev",
+	.open		= vfio_mdev_open,
+	.release	= vfio_mdev_close,
+	.ioctl		= vfio_mdev_unlocked_ioctl,
+	.read		= vfio_mdev_read,
+	.write		= vfio_mdev_write,
+	.mmap		= vfio_mdev_mmap,
+};
+
+int vfio_mdev_probe(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
+		return -ENOMEM;
+
+	vmdev->mdev = mdev_get_device(mdev);
+	vmdev->group = mdev->group;
+
+	ret = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, vmdev);
+	if (ret)
+		kfree(vmdev);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+void vfio_mdev_remove(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+
+	vmdev = vfio_del_group_dev(dev);
+	kfree(vmdev);
+}
+
+struct mdev_driver vfio_mdev_driver = {
+	.name	= "vfio_mdev",
+	.probe	= vfio_mdev_probe,
+	.remove	= vfio_mdev_remove,
+};
+
+static int __init vfio_mdev_init(void)
+{
+	return mdev_register_driver(&vfio_mdev_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mdev_exit(void)
+{
+	mdev_unregister_driver(&vfio_mdev_driver);
+}
+
+module_init(vfio_mdev_init)
+module_exit(vfio_mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 016c14a1b454..776cc2b063d4 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -21,9 +21,9 @@
 
 #define VFIO_PCI_OFFSET_SHIFT   40
 
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
 
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 162+ messages in thread

* [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-08-25  3:53 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-25  3:53   ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-25  3:53 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

VFIO IOMMU drivers are designed for devices which are IOMMU capable. A
mediated device only uses IOMMU APIs; the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

Added two new callback functions to struct vfio_iommu_driver_ops. A backend
IOMMU module that supports pinning and unpinning pages for mdev devices
should provide these functions.
Added APIs for pinning and unpinning pages to the VFIO module. These call
back into the backend IOMMU module to actually pin and unpin pages.

This change adds pin and unpin support for mediated device to TYPE1 IOMMU
backend module. More details:
- When the iommu_group of a mediated device is attached, the task structure
  is cached and used later for pinning pages and page accounting.
- It keeps track of pinned pages for the mediated domain. This data is used
  to verify unpinning requests and to unpin any remaining pages while
  detaching.
- Used existing mechanism for page accounting. If an IOMMU capable domain
  exists in the container then all pages are already pinned and accounted.
  Accounting for an mdev device is only done if there is no IOMMU capable
  domain in the container.
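
As a rough usage sketch (not part of this patch; the my_* helpers and the
single-page transfer are hypothetical), a vendor driver holding an
mdev_device could use the new exported API to translate and pin a guest
page before programming DMA on the parent device:

#include <linux/iommu.h>
#include <linux/mdev.h>
#include <linux/vfio.h>

static int my_setup_dma(struct mdev_device *mdev, unsigned long guest_pfn,
			unsigned long *host_pfn)
{
	unsigned long user_pfn = guest_pfn;
	long ret;

	/*
	 * Translate and pin one guest page; pinning and accounting are
	 * done by the backend IOMMU module (type1).
	 */
	ret = vfio_pin_pages(&mdev->dev, &user_pfn, 1,
			     IOMMU_READ | IOMMU_WRITE, host_pfn);
	if (ret <= 0)
		return ret ? (int)ret : -EFAULT;

	/* program the parent device with *host_pfn << PAGE_SHIFT here */
	return 0;
}

static void my_teardown_dma(struct mdev_device *mdev, unsigned long host_pfn)
{
	/* drop the pin taken above */
	vfio_unpin_pages(&mdev->dev, &host_pfn, 1);
}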

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
Reviewed-on: http://git-master/r/1175707
Reviewed-by: Automatic_Commit_Validation_User
---
 drivers/vfio/vfio.c             | 117 ++++++++++
 drivers/vfio/vfio_iommu_type1.c | 498 ++++++++++++++++++++++++++++++++++++----
 include/linux/vfio.h            |  13 +-
 3 files changed, 580 insertions(+), 48 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6fd6fa5469de..e3e342861e04 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1782,6 +1782,123 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+static struct vfio_group *vfio_group_from_dev(struct device *dev)
+{
+	struct vfio_device *device;
+	struct vfio_group *group;
+	int ret;
+
+	device = vfio_device_get_from_dev(dev);
+	if (!device)
+		return ERR_PTR(-EINVAL);
+
+	group = device->group;
+	if (!atomic_inc_not_zero(&group->container_users)) {
+		ret = -EINVAL;
+		goto err_ret;
+	}
+
+	if (group->noiommu) {
+		atomic_dec(&group->container_users);
+		ret = -EPERM;
+		goto err_ret;
+	}
+
+	if (!group->container->iommu_driver ||
+	    !vfio_group_viable(group)) {
+		atomic_dec(&group->container_users);
+		ret = -EINVAL;
+		goto err_ret;
+	}
+
+	vfio_device_put(device);
+	return group;
+
+err_ret:
+	vfio_device_put(device);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for local
+ * domain only.
+ * @dev [in] : device
+ * @user_pfn [in]: array of user/guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ */
+long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+		    long npage, int prot, unsigned long *phys_pfn)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	group = vfio_group_from_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	container = group->container;
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->pin_pages))
+		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+					     npage, prot, phys_pfn);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+
+	return ret;
+
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin set of host PFNs for local domain only.
+ * @dev [in] : device
+ * @pfn [in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ */
+long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !pfn)
+		return -EINVAL;
+
+	group = vfio_group_from_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	container = group->container;
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->unpin_pages))
+		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
+					       npage);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_unpin_pages);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ba19424e4a1..d52d75fd0f04 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*local_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
 };
 
+struct local_addr_space {
+	struct task_struct	*task;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
+};
+
 struct vfio_domain {
 	struct iommu_domain	*domain;
 	struct list_head	next;
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+	struct local_addr_space	*local_addr_space;
 };
 
 struct vfio_dma {
@@ -83,6 +91,22 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+
+
+#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
+			 (list_empty(&iommu->domain_list) ? false : true)
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +154,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	node = domain->local_addr_space->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	link = &domain->local_addr_space->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
+}
+
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
+}
+
+static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
+				dma_addr_t iova, unsigned long pfn, size_t prot)
+{
+	struct vfio_pfn *vpfn;
+
+	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
+	if (!vpfn)
+		return -ENOMEM;
+
+	vpfn->vaddr = vaddr;
+	vpfn->iova = iova;
+	vpfn->pfn = pfn;
+	vpfn->prot = prot;
+	atomic_set(&vpfn->ref_count, 1);
+	vfio_link_pfn(domain, vpfn);
+	return 0;
+}
+
+static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
+				      struct vfio_pfn *vpfn)
+{
+	vfio_unlink_pfn(domain, vpfn);
+	kfree(vpfn);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -150,17 +252,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
 	kfree(vwork);
 }
 
-static void vfio_lock_acct(long npage)
+static void vfio_lock_acct(struct task_struct *task, long npage)
 {
 	struct vwork *vwork;
 	struct mm_struct *mm;
 
-	if (!current->mm || !npage)
+	if (!task->mm || !npage)
 		return; /* process exited or nothing to do */
 
-	if (down_write_trylock(&current->mm->mmap_sem)) {
-		current->mm->locked_vm += npage;
-		up_write(&current->mm->mmap_sem);
+	if (down_write_trylock(&task->mm->mmap_sem)) {
+		task->mm->locked_vm += npage;
+		up_write(&task->mm->mmap_sem);
 		return;
 	}
 
@@ -172,7 +274,7 @@ static void vfio_lock_acct(long npage)
 	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
 	if (!vwork)
 		return;
-	mm = get_task_mm(current);
+	mm = get_task_mm(task);
 	if (!mm) {
 		kfree(vwork);
 		return;
@@ -228,20 +330,31 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = (mm ? mm : current->mm);
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (mm) {
+		down_read(&local_mm->mmap_sem);
+		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
+					!!(prot & IOMMU_WRITE), 0, page, NULL);
+		up_read(&local_mm->mmap_sem);
+	} else
+		ret = get_user_pages_fast(vaddr, 1,
+					  !!(prot & IOMMU_WRITE), page);
+
+	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	down_read(&local_mm->mmap_sem);
 
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +362,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,8 +372,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long __vfio_pin_pages_remote(unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
@@ -270,7 +383,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -285,7 +398,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 
 	if (unlikely(disable_hugepages)) {
 		if (!rsvd)
-			vfio_lock_acct(1);
+			vfio_lock_acct(current, 1);
 		return 1;
 	}
 
@@ -293,7 +406,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -313,13 +426,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	}
 
 	if (!rsvd)
-		vfio_lock_acct(i);
+		vfio_lock_acct(current, i);
 
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long __vfio_unpin_pages_remote(unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
@@ -328,7 +441,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
-		vfio_lock_acct(-unlocked);
+		vfio_lock_acct(current, -unlocked);
+	return unlocked;
+}
+
+static long __vfio_pin_pages_local(struct vfio_domain *domain,
+				   unsigned long vaddr, int prot,
+				   unsigned long *pfn_base,
+				   bool do_accounting)
+{
+	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	bool lock_cap = capable(CAP_IPC_LOCK);
+	long ret;
+	bool rsvd;
+	struct task_struct *task = domain->local_addr_space->task;
+
+	if (!task->mm)
+		return -ENODEV;
+
+	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
+	if (ret)
+		return ret;
+
+	rsvd = is_invalid_reserved_pfn(*pfn_base);
+
+	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
+		put_pfn(*pfn_base, prot);
+		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+			limit << PAGE_SHIFT);
+		return -ENOMEM;
+	}
+
+	if (!rsvd && do_accounting)
+		vfio_lock_acct(task, 1);
+
+	return 1;
+}
+
+static void __vfio_unpin_pages_local(struct vfio_domain *domain,
+				     unsigned long pfn, int prot,
+				     bool do_accounting)
+{
+	put_pfn(pfn, prot);
+
+	if (do_accounting)
+		vfio_lock_acct(domain->local_addr_space->task, -1);
+}
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
+				 do_accounting);
+
+	if (atomic_dec_and_test(&vpfn->ref_count))
+		vfio_remove_from_pfn_list(domain, vpfn);
+
+	return 1;
+}
+
+static long vfio_iommu_type1_pin_pages(void *iommu_data,
+				       unsigned long *user_pfn,
+				       long npage, int prot,
+				       unsigned long *phys_pfn)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
+	int i, j, ret;
+	long retpage;
+	unsigned long remote_vaddr;
+	unsigned long *pfn = phys_pfn;
+	struct vfio_dma *dma;
+	bool do_accounting = false;
+
+	if (!iommu || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->local_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->local_domain;
+
+	/*
+	 * If an iommu capable domain exists in the container then all pages are
+	 * already pinned and accounted. Accounting should be done if there is no
+	 * iommu capable domain in the container.
+	 */
+	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+		dma_addr_t iova;
+
+		iova = user_pfn[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_unwind;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
+						 &pfn[i], do_accounting);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_unwind;
+		}
+
+		mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+		/* search if pfn exist */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+			continue;
+		}
+
+		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
+					   pfn[i], prot);
+		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+		if (ret) {
+			__vfio_unpin_pages_local(domain, pfn[i], prot,
+						 do_accounting);
+			goto pin_unwind;
+		}
+	}
+
+	ret = i;
+	goto pin_done;
+
+pin_unwind:
+	pfn[i] = 0;
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	for (j = 0; j < i; j++) {
+		struct vfio_pfn *p;
+
+		p = vfio_find_pfn(domain, pfn[j]);
+		if (p)
+			vfio_unpin_pfn(domain, p, do_accounting);
+
+		pfn[j] = 0;
+	}
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
+					 long npage)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	domain = iommu->local_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+		/* verify if pfn exist in pfn_list */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p)
+			unlocked += vfio_unpin_pfn(domain, p, true);
+
+		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+	}
 
 	return unlocked;
 }
@@ -341,6 +635,9 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		return;
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,15 +679,15 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += __vfio_unpin_pages_remote(phys >> PAGE_SHIFT,
+						      unmapped >> PAGE_SHIFT,
+						      dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
 	}
 
-	vfio_lock_acct(-unlocked);
+	vfio_lock_acct(current, -unlocked);
 }
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
@@ -611,10 +908,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	/* Don't pin and map if container doesn't contain IOMMU capable domain */
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)) {
+		dma->size = size;
+		goto map_done;
+	}
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
+		npage = __vfio_pin_pages_remote(vaddr + dma->size,
+						size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
@@ -624,7 +927,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
 		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+			__vfio_unpin_pages_remote(pfn, npage, prot, true);
 			break;
 		}
 
@@ -635,6 +938,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -734,11 +1038,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group == iommu_group)
+			return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1063,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (find_iommu_group(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->local_domain) {
+		if (find_iommu_group(iommu->local_domain, iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1090,33 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
+	    (bus == &mdev_bus_type)) {
+		if (iommu->local_domain) {
+			list_add(&group->next,
+				 &iommu->local_domain->group_list);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		}
+
+		domain->local_addr_space = kzalloc(sizeof(*domain->local_addr_space),
+						   GFP_KERNEL);
+		if (!domain->local_addr_space) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+
+		domain->local_addr_space->task = current;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->local_addr_space->pfn_list = RB_ROOT;
+		mutex_init(&domain->local_addr_space->pfn_list_lock);
+		iommu->local_domain = domain;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1207,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_local_unpin_all(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	while ((node = rb_first(&domain->local_addr_space->pfn_list))) {
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+	}
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1228,52 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
+	if (iommu->local_domain) {
+		domain = iommu->local_domain;
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
+			list_del(&group->next);
+			kfree(group);
 
+			if (list_empty(&domain->group_list)) {
+				vfio_local_unpin_all(domain);
+				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+					vfio_iommu_unmap_unpin_all(iommu);
+				kfree(domain);
+				iommu->local_domain = NULL;
+			}
+			goto detach_group_done;
+		}
+	}
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto detach_group_done;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
 			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last domain with an iommu and no local domain
+			 * exists, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular(&iommu->domain_list) &&
+				   (!iommu->local_domain))
 					vfio_iommu_unmap_unpin_all(iommu);
 				iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
-			goto done;
+			break;
 		}
 	}
 
-done:
+detach_group_done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -924,27 +1305,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	return iommu;
 }
 
+static void vfio_release_domain(struct vfio_domain *domain)
+{
+	struct vfio_group *group, *group_tmp;
+
+	list_for_each_entry_safe(group, group_tmp,
+				 &domain->group_list, next) {
+		if (!domain->local_addr_space)
+			iommu_detach_group(domain->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	if (domain->local_addr_space)
+		vfio_local_unpin_all(domain);
+	else
+		iommu_domain_free(domain->domain);
+}
+
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
-	struct vfio_group *group, *group_tmp;
+
+	if (iommu->local_domain) {
+		vfio_release_domain(iommu->local_domain);
+		kfree(iommu->local_domain);
+		iommu->local_domain = NULL;
+	}
 
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
-		list_for_each_entry_safe(group, group_tmp,
-					 &domain->group_list, next) {
-			iommu_detach_group(domain->domain, group->iommu_group);
-			list_del(&group->next);
-			kfree(group);
-		}
-		iommu_domain_free(domain->domain);
+		vfio_release_domain(domain);
 		list_del(&domain->next);
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
@@ -1048,6 +1450,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.ioctl		= vfio_iommu_type1_ioctl,
 	.attach_group	= vfio_iommu_type1_attach_group,
 	.detach_group	= vfio_iommu_type1_detach_group,
+	.pin_pages	= vfio_iommu_type1_pin_pages,
+	.unpin_pages	= vfio_iommu_type1_unpin_pages,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..0bd25ba6223d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/mdev.h>
 
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
@@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-
+	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
+				     long npage, int prot,
+				     unsigned long *phys_pfn);
+	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
+				       long npage);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+			   long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 162+ messages in thread

* [PATCH v7 4/4] docs: Add Documentation for Mediated devices
  2016-08-25  3:53 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-25  3:53   ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-25  3:53 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

Add the file Documentation/vfio-mediated-device.txt, which includes details
of the mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
Reviewed-on: http://git-master/r/1182512
Reviewed-by: Automatic_Commit_Validation_User
---
 Documentation/vfio-mediated-device.txt | 203 +++++++++++++++++++++++++++++++++
 1 file changed, 203 insertions(+)
 create mode 100644 Documentation/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
new file mode 100644
index 000000000000..237d8eb630b7
--- /dev/null
+++ b/Documentation/vfio-mediated-device.txt
@@ -0,0 +1,203 @@
+VFIO Mediated devices [1]
+-------------------------------------------------------------------------------
+
+There are more and more use cases/demands to virtualize DMA devices that do
+not have built-in SR-IOV capability. To do this, drivers of different devices
+had to develop their own management interfaces and sets of APIs, and then
+integrate them into user space software. We've identified common requirements
+and a unified management interface for such devices to make user space
+software integration easier.
+
+The VFIO driver framework provides unified APIs for direct device access. It
+is an IOMMU/device-agnostic framework for exposing direct device access to
+user space in a secure, IOMMU-protected environment. This framework is used
+for multiple devices, such as GPUs, network adapters and compute accelerators.
+With direct device access, virtual machines or user space applications have
+direct access to the physical device. This framework is reused for mediated
+devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. It provides a
+generic interface to create and destroy mediated devices, to add and remove
+them to and from the mediated bus driver, and to add and remove them to and
+from an IOMMU group. It also provides an interface to register a bus driver;
+for example, the mediated VFIO mdev driver is designed for mediated devices
+and supports the VFIO APIs. The mediated bus driver adds and deletes mediated
+devices to and from a VFIO group.
+
+Below is a high-level block diagram, with NVIDIA, Intel and IBM devices as
+examples, since these are the devices that are going to actively use this
+module initially.
+
+     +---------------+
+     |               |
+     | +-----------+ |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |  mdev     | |                         |              |
+     | |  bus      | +------------------------>+ vfio_mdev.ko |<-> VFIO user
+     | |  driver   | |     probe()/remove()    |              |    APIs
+     | |           | |                         +--------------+
+     | +-----------+ |
+     |               |
+     |  MDEV CORE    |
+     |   MODULE      |
+     |   mdev.ko     |
+     | +-----------+ |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |  nvidia.ko   |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | | Physical  | |
+     | |  device   | |  mdev_register_device() +--------------+
+     | | interface | |<------------------------+              |
+     | |           | |                         |  i915.ko     |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | |           | |
+     | |           | |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         | ccw_device.ko|<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | +-----------+ |
+     +---------------+
+
+
+Registration Interfaces
+-------------------------------------------------------------------------------
+
+Mediated core driver provides two types of registration interfaces:
+
+1. Registration interface for mediated bus driver:
+-------------------------------------------------
+     /*
+      * struct mdev_driver [2] - Mediated device's driver
+      * @name: driver name
+      * @probe: called when new device created
+      * @remove: called when device removed
+      * @driver: device driver structure
+      */
+     struct mdev_driver {
+	     const char *name;
+	     int  (*probe)  (struct device *dev);
+	     void (*remove) (struct device *dev);
+	     struct device_driver    driver;
+     };
+
+A mediated bus driver for mdev should use the following interface to register
+with and unregister from the core driver:
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+The mediated bus driver is responsible for adding mediated devices to, and
+deleting them from, a VFIO group when devices are bound to and unbound from
+the driver.
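+
+A minimal registration sketch (not taken from any in-tree driver; the probe
+and remove bodies are hypothetical placeholders):
+
+     static int my_mdev_probe(struct device *dev)
+     {
+	     /* set up the mediated device, e.g. add it to a VFIO group */
+	     return 0;
+     }
+
+     static void my_mdev_remove(struct device *dev)
+     {
+	     /* undo whatever probe() set up */
+     }
+
+     static struct mdev_driver my_mdev_driver = {
+	     .name   = "my_mdev_driver",
+	     .probe  = my_mdev_probe,
+	     .remove = my_mdev_remove,
+     };
+
+     static int __init my_mdev_driver_init(void)
+     {
+	     return mdev_register_driver(&my_mdev_driver, THIS_MODULE);
+     }
+     module_init(my_mdev_driver_init);
+
+     static void __exit my_mdev_driver_exit(void)
+     {
+	     mdev_unregister_driver(&my_mdev_driver);
+     }
+     module_exit(my_mdev_driver_exit);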
+
+2. Physical device driver interface:
+-----------------------------------
+This interface [3] provides a set of APIs to manage physical device related
+work in the vendor driver. The APIs are:
+
+* dev_attr_groups: attributes of the parent device.
+* mdev_attr_groups: attributes of the mediated device.
+* supported_config: to provide the list of configurations supported by the driver.
+* create: to allocate basic resources in driver for a mediated device.
+* destroy: to free resources in driver when mediated device is destroyed.
+* reset: to free and reallocate resources in driver on mediated device reset.
+* set_online_status: to change online status of mediated device.
+* get_online_status: to get current (online/offline) status of mediated device.
+* read : read emulation callback.
+* write: write emulation callback.
+* mmap: mmap emulation callback.
+* get_irq_info: to retrieve information about mediated device's IRQ.
+* set_irqs: gives interrupt configuration information that VMM sets.
+* get_region_info: to provide region size and its flags for the mediated device.
+    The vendor driver can provide a capability id and the corresponding
+    capability structure if it wants to support a capability.
+* get_device_info: to retrieve VFIO device related flags, number of regions and
+  number of IRQs supported.
+
+Drivers should use the following interface to register a device with, and
+unregister it from, the mdev core driver:
+
+extern int  mdev_register_device(struct device *dev,
+                                 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
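+
+A hedged sketch of parent device registration (the function names are
+hypothetical, and my_parent_ops is assumed to be a struct parent_ops instance
+filled with the callbacks listed above; see include/linux/mdev.h [3]):
+
+     static int my_vendor_probe(struct pci_dev *pdev)
+     {
+	     return mdev_register_device(&pdev->dev, &my_parent_ops);
+     }
+
+     static void my_vendor_remove(struct pci_dev *pdev)
+     {
+	     mdev_unregister_device(&pdev->dev);
+     }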
+
+Mediated device management interface via sysfs
+-------------------------------------------------------------------------------
+This is the interface that allows user space software, such as libvirt, to
+query and configure mediated devices in a hardware-agnostic fashion. This
+management interface provides flexibility to the underlying physical device's
+driver to support mediated device hotplug, multiple mediated devices per
+virtual machine, multiple mediated devices from different physical devices,
+and so on.
+
+Under per-physical device sysfs:
+--------------------------------
+
+* mdev_supported_types: (read only)
+    Lists the currently supported mediated device types and their details.
+
+* mdev_create: (write only)
+	Create a mediated device on the target physical device.
+	Input syntax: <UUID:params>
+	where,
+		UUID: mediated device's UUID
+		params: extra parameters required by driver
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc:0" >
+				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
+
+* mdev_destroy: (write only)
+	Destroy a mediated device on a target physical device.
+	Input syntax: <UUID>
+	where,
+		UUID: mediated device's UUID
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc" >
+			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy
+
+Under per mdev device:
+----------------------------------------
+
+* online: (read write)
+	Reading this file returns the current status of the mediated device
+	(0 or 1). Writing 0 or 1 to this file changes the state of the
+	mediated device and triggers the registration callback to notify the
+	driver to commit or free mediated device resources. The callback is a
+	blocking call; a successful return indicates that the requested mdev
+	resources have been fully committed and the VMM should continue.
+	Example:
+	# echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
+
+
+Mediated device Hotplug:
+-----------------------
+
+To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
+accessed during VM runtime, and the corresponding registration callback is
+invoked to allow the driver to support hotplug.
+
+Translation APIs for Mediated device
+------------------------------------------------------------------------------
+
+The following APIs are provided for user pfn to host pfn translation in the
+VFIO driver:
+
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+                           long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
+These functions call back into the backend IOMMU module using two callbacks of
+struct vfio_iommu_driver_ops, pin_pages and unpin_pages [4]. Currently these are
+supported in the TYPE1 IOMMU module. To enable the same for other IOMMU
+backend modules, such as the PPC64 sPAPR module, those modules need to
+provide these two callback functions.
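+
+For illustration, below is a minimal sketch of how a vendor driver might pin a
+guest page around a DMA operation. The sample_* name, the prot flags and the
+pfn array passed to vfio_unpin_pages() are assumptions of this sketch:
+
+     static long sample_dma_map_one(struct device *dev, unsigned long user_pfn)
+     {
+             unsigned long phys_pfn;
+             long ret;
+
+             /* translate and pin one user page for read/write DMA */
+             ret = vfio_pin_pages(dev, &user_pfn, 1,
+                                  IOMMU_READ | IOMMU_WRITE, &phys_pfn);
+             if (ret <= 0)
+                     return ret;
+
+             /* program the device with phys_pfn here */
+
+             /* release the page once the DMA has completed */
+             return vfio_unpin_pages(dev, &phys_pfn, 1);
+     }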
+
+References
+-------------------------------------------------------------------------------
+
+[1] See Documentation/vfio.txt for more information on VFIO.
+[2] struct mdev_driver in include/linux/mdev.h
+[3] struct parent_ops in include/linux/mdev.h
+[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 162+ messages in thread

* [Qemu-devel] [PATCH v7 4/4] docs: Add Documentation for Mediated devices
@ 2016-08-25  3:53   ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-25  3:53 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Add file Documentation/vfio-mediated-device.txt that includes details of
the mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
Reviewed-on: http://git-master/r/1182512
Reviewed-by: Automatic_Commit_Validation_User
---
 Documentation/vfio-mediated-device.txt | 203 +++++++++++++++++++++++++++++++++
 1 file changed, 203 insertions(+)
 create mode 100644 Documentation/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
new file mode 100644
index 000000000000..237d8eb630b7
--- /dev/null
+++ b/Documentation/vfio-mediated-device.txt
@@ -0,0 +1,203 @@
+VFIO Mediated devices [1]
+-------------------------------------------------------------------------------
+
+There are more and more use cases/demands to virtualize DMA devices which
+don't have SR-IOV capability built in. To do this, drivers of different
+devices had to develop their own management interfaces and sets of APIs and
+then integrate them with user space software. We've identified common
+requirements and a unified management interface for such devices to make
+user space software integration easier.
+
+The VFIO driver framework provides unified APIs for direct device access. It is
+an IOMMU/device-agnostic framework for exposing direct device access to
+user space in a secure, IOMMU-protected environment. This framework is
+used for multiple devices like GPUs, network adapters and compute accelerators.
+With direct device access, virtual machines or user space applications have
+direct access to the physical device. This framework is reused for mediated
+devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. This module
+provides a generic interface to create/destroy a mediated device, add/remove
+it to/from the mediated bus driver, and add/remove the device to/from an IOMMU
+group. It also provides an interface to register a bus driver; for example,
+the mediated VFIO mdev driver is designed for mediated devices and supports
+VFIO APIs. The mediated bus driver adds/deletes mediated devices to/from the
+VFIO group.
+
+Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
+as examples, since these are the devices that are going to actively use
+this module as of now.
+
+     +---------------+
+     |               |
+     | +-----------+ |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |  mdev     | |                         |              |
+     | |  bus      | +------------------------>+ vfio_mdev.ko |<-> VFIO user
+     | |  driver   | |     probe()/remove()    |              |    APIs
+     | |           | |                         +--------------+
+     | +-----------+ |
+     |               |
+     |  MDEV CORE    |
+     |   MODULE      |
+     |   mdev.ko     |
+     | +-----------+ |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |  nvidia.ko   |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | | Physical  | |
+     | |  device   | |  mdev_register_device() +--------------+
+     | | interface | |<------------------------+              |
+     | |           | |                         |  i915.ko     |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | |           | |
+     | |           | |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         | ccw_device.ko|<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | +-----------+ |
+     +---------------+
+
+
+Registration Interfaces
+-------------------------------------------------------------------------------
+
+The mediated core driver provides two types of registration interfaces:
+
+1. Registration interface for mediated bus driver:
+-------------------------------------------------
+     /*
+      * struct mdev_driver [2] - Mediated device's driver
+      * @name: driver name
+      * @probe: called when new device created
+      * @remove: called when device removed
+      * @driver: device driver structure
+      */
+     struct mdev_driver {
+	     const char *name;
+	     int  (*probe)  (struct device *dev);
+	     void (*remove) (struct device *dev);
+	     struct device_driver    driver;
+     };
+
+The mediated bus driver for mdev devices should use this interface to register
+with and unregister from the core driver:
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+The mediated bus driver is responsible for adding mediated devices to and
+deleting them from the VFIO group when devices are bound to and unbound from
+the driver.
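+
+For illustration, below is a minimal sketch of how a mediated bus driver might
+register itself; the sample_* names are made up and the probe/remove bodies are
+placeholders for adding/deleting the mediated device to/from a VFIO group:
+
+     static int sample_mdev_probe(struct device *dev)
+     {
+             /* e.g. add the mediated device to a VFIO group */
+             return 0;
+     }
+
+     static void sample_mdev_remove(struct device *dev)
+     {
+             /* e.g. delete the mediated device from its VFIO group */
+     }
+
+     static struct mdev_driver sample_mdev_driver = {
+             .name   = "sample_mdev",
+             .probe  = sample_mdev_probe,
+             .remove = sample_mdev_remove,
+     };
+
+     static int __init sample_mdev_init(void)
+     {
+             return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
+     }
+
+     static void __exit sample_mdev_exit(void)
+     {
+             mdev_unregister_driver(&sample_mdev_driver);
+     }
+
+     module_init(sample_mdev_init);
+     module_exit(sample_mdev_exit);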
+
+2. Physical device driver interface:
+-----------------------------------
+This interface [3] provides a set of APIs to manage physical-device-related
+work in its driver. The APIs are:
+
+* dev_attr_groups: attributes of the parent device.
+* mdev_attr_groups: attributes of the mediated device.
+* supported_config: to provide the list of configurations supported by the driver.
+* create: to allocate basic resources in driver for a mediated device.
+* destroy: to free resources in driver when mediated device is destroyed.
+* reset: to free and reallocate resources in driver on mediated device reset.
+* set_online_status: to change online status of mediated device.
+* get_online_status: to get current (online/offline) status of mediated device.
+* read: read emulation callback.
+* write: write emulation callback.
+* mmap: mmap emulation callback.
+* get_irq_info: to retrieve information about the mediated device's IRQs.
+* set_irqs: to pass the interrupt configuration information set by the VMM.
+* get_region_info: to provide the region size and its flags for the mediated
+    device. The vendor driver can provide the capability id and the
+    corresponding capability structure if it wants to support a capability.
+* get_device_info: to retrieve VFIO device related flags, number of regions and
+  number of IRQs supported.
+
+Drivers should use this interface to register a device with and unregister it
+from the mdev core driver:
+
+extern int  mdev_register_device(struct device *dev,
+                                 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
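+
+For illustration, below is a minimal sketch of how a physical (parent) device
+driver might register with the mdev core module. The sample_* names are made
+up; the callback signatures are defined by struct parent_ops in
+include/linux/mdev.h, so only the registration itself is spelled out here:
+
+     static const struct parent_ops sample_parent_ops = {
+             .dev_attr_groups  = NULL,  /* optional parent device attributes */
+             .mdev_attr_groups = NULL,  /* optional mediated device attributes */
+             /*
+              * .create, .destroy, .read, .write, .mmap, .get_device_info,
+              * .get_region_info, .get_irq_info and .set_irqs are filled in
+              * with the vendor driver's implementations.
+              */
+     };
+
+     /* called from the parent driver's own probe routine */
+     static int sample_parent_register(struct device *dev)
+     {
+             return mdev_register_device(dev, &sample_parent_ops);
+     }
+
+     /* called from the parent driver's own remove routine */
+     static void sample_parent_unregister(struct device *dev)
+     {
+             mdev_unregister_device(dev);
+     }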
+
+Mediated device management interface via sysfs
+-------------------------------------------------------------------------------
+This is the interface that allows user space software, such as libvirt, to
+query and configure mediated devices in a hardware-agnostic fashion. This
+management interface gives the underlying physical device's driver the
+flexibility to support mediated device hotplug, multiple mediated devices per
+virtual machine, multiple mediated devices from different physical devices, etc.
+
+Under per-physical device sysfs:
+--------------------------------
+
+* mdev_supported_types: (read only)
+    Lists the currently supported mediated device types and their details.
+
+* mdev_create: (write only)
+	Create a mediated device on the target physical device.
+	Input syntax: <UUID:params>
+	where,
+		UUID: mediated device's UUID
+		params: extra parameters required by driver
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc:0" >
+				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
+
+* mdev_destroy: (write only)
+	Destroy a mediated device on a target physical device.
+	Input syntax: <UUID>
+	where,
+		UUID: mediated device's UUID
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc" >
+			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy
+
+Under per mdev device:
+----------------------------------------
+
+* online: (read write)
+	Reading this file returns the current status of the mediated device
+	(0 or 1). Writing 0 or 1 to this file changes the state of the
+	mediated device and triggers the registration callback to notify the
+	driver to commit or free mediated device resources. The callback is a
+	blocking call; a successful return indicates that the requested mdev
+	resources have been fully committed and the VMM should continue.
+	Example:
+	# echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
+
+
+Mediated device Hotplug:
+-----------------------
+
+To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
+accessed during VM runtime, and the corresponding registration callback is
+invoked to allow the driver to support hotplug.
+
+Translation APIs for Mediated device
+------------------------------------------------------------------------------
+
+The following APIs are provided for user pfn to host pfn translation in the
+VFIO driver:
+
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+                           long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
+These functions call back into the backend IOMMU module using two callbacks of
+struct vfio_iommu_driver_ops, pin_pages and unpin_pages [4]. Currently these are
+supported in the TYPE1 IOMMU module. To enable the same for other IOMMU
+backend modules, such as the PPC64 sPAPR module, those modules need to
+provide these two callback functions.
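+
+For illustration, below is a minimal sketch of how a vendor driver might pin a
+guest page around a DMA operation. The sample_* name, the prot flags and the
+pfn array passed to vfio_unpin_pages() are assumptions of this sketch:
+
+     static long sample_dma_map_one(struct device *dev, unsigned long user_pfn)
+     {
+             unsigned long phys_pfn;
+             long ret;
+
+             /* translate and pin one user page for read/write DMA */
+             ret = vfio_pin_pages(dev, &user_pfn, 1,
+                                  IOMMU_READ | IOMMU_WRITE, &phys_pfn);
+             if (ret <= 0)
+                     return ret;
+
+             /* program the device with phys_pfn here */
+
+             /* release the page once the DMA has completed */
+             return vfio_unpin_pages(dev, &phys_pfn, 1);
+     }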
+
+References
+-------------------------------------------------------------------------------
+
+[1] See Documentation/vfio.txt for more information on VFIO.
+[2] struct mdev_driver in include/linux/mdev.h
+[3] struct parent_ops in include/linux/mdev.h
+[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-25  7:29     ` Dong Jia
  -1 siblings, 0 replies; 162+ messages in thread
From: Dong Jia @ 2016-08-25  7:29 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, alex.williamson,
	kraxel, pbonzini, Dong Jia

On Thu, 25 Aug 2016 09:23:54 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> @@ -769,6 +1090,33 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
> 
> +	if (IS_ENABLED(CONFIF_VFIO_MDEV) && !iommu_present(bus) &&
s/CONFIF_VFIO_MDEV/CONFIG_VFIO_MDEV/

> +	    (bus == &mdev_bus_type)) {
> +		if (iommu->local_domain) {
> +			list_add(&group->next,
> +				 &iommu->local_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +


--------
Dong Jia

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 3/4] vfio iommu: Add support for mediated devices
@ 2016-08-25  7:29     ` Dong Jia
  0 siblings, 0 replies; 162+ messages in thread
From: Dong Jia @ 2016-08-25  7:29 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, jike.song, Dong Jia

On Thu, 25 Aug 2016 09:23:54 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> @@ -769,6 +1090,33 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
> 
> +	if (IS_ENABLED(CONFIF_VFIO_MDEV) && !iommu_present(bus) &&
s/CONFIF_VFIO_MDEV/CONFIG_VFIO_MDEV/

> +	    (bus == &mdev_bus_type)) {
> +		if (iommu->local_domain) {
> +			list_add(&group->next,
> +				 &iommu->local_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +


--------
Dong Jia

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-25  9:22     ` Dong Jia
  -1 siblings, 0 replies; 162+ messages in thread
From: Dong Jia @ 2016-08-25  9:22 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, jike.song, Dong Jia

On Thu, 25 Aug 2016 09:23:53 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

[...]

Dear Kirti,

I just rebased my vfio-ccw patches to this series.
With a little fix, which was pointed it out in my reply to the #3
patch, it works fine.

> +static long vfio_mdev_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +	unsigned long minsz;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (parent->ops->get_device_info)
> +			ret = parent->ops->get_device_info(vmdev->mdev, &info);
> +		else
> +			return -EINVAL;
> +
> +		if (ret)
> +			return ret;
> +
> +		if (parent->ops->reset)
> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
Shouldn't this be done inside the get_device_info callback?

> +
> +		memcpy(&vmdev->dev_info, &info, sizeof(info));
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
[...]

> +
> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	unsigned int done = 0;
> +	int ret;
> +
> +	if (!parent->ops->read)
> +		return -EINVAL;
> +
> +	while (count) {
Here, I have to say sorry to you guys for that I didn't notice the
bad impact of this change to my patches during the v6 discussion.

For vfio-ccw, I introduced an I/O region to input/output I/O
instruction parameters and results for Qemu. The @count of these data
currently is 140. So supporting arbitrary lengths in one shot here, and
also in vfio_mdev_write, seems the better option for this case.

I believe that if the pci drivers want to iterate in a 4 bytes step, you
can do that in the parent read/write callbacks instead.

What do you think?

> +		size_t filled;
> +
> +		if (count >= 4 && !(*ppos % 4)) {
> +			u32 val;
> +
> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
> +						*ppos);
> +			if (ret <= 0)
> +				goto read_err;
> +
> +			if (copy_to_user(buf, &val, sizeof(val)))
> +				goto read_err;
> +
> +			filled = 4;
> +		} else if (count >= 2 && !(*ppos % 2)) {
> +			u16 val;
> +
> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
> +						*ppos);
> +			if (ret <= 0)
> +				goto read_err;
> +
> +			if (copy_to_user(buf, &val, sizeof(val)))
> +				goto read_err;
> +
> +			filled = 2;
> +		} else {
> +			u8 val;
> +
> +			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
> +			if (ret <= 0)
> +				goto read_err;
> +
> +			if (copy_to_user(buf, &val, sizeof(val)))
> +				goto read_err;
> +
> +			filled = 1;
> +		}
> +
> +		count -= filled;
> +		done += filled;
> +		*ppos += filled;
> +		buf += filled;
> +	}
> +
> +	return done;
> +
> +read_err:
> +	return -EFAULT;
> +}
[...]

--------
Dong Jia


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
@ 2016-08-25  9:22     ` Dong Jia
  0 siblings, 0 replies; 162+ messages in thread
From: Dong Jia @ 2016-08-25  9:22 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, jike.song, Dong Jia

On Thu, 25 Aug 2016 09:23:53 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

[...]

Dear Kirti,

I just rebased my vfio-ccw patches to this series.
With a little fix, which was pointed it out in my reply to the #3
patch, it works fine.

> +static long vfio_mdev_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +	unsigned long minsz;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (parent->ops->get_device_info)
> +			ret = parent->ops->get_device_info(vmdev->mdev, &info);
> +		else
> +			return -EINVAL;
> +
> +		if (ret)
> +			return ret;
> +
> +		if (parent->ops->reset)
> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
Shouldn't this be done inside the get_device_info callback?

> +
> +		memcpy(&vmdev->dev_info, &info, sizeof(info));
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
[...]

> +
> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	unsigned int done = 0;
> +	int ret;
> +
> +	if (!parent->ops->read)
> +		return -EINVAL;
> +
> +	while (count) {
Here, I have to say sorry to you guys for that I didn't notice the
bad impact of this change to my patches during the v6 discussion.

For vfio-ccw, I introduced an I/O region to input/output I/O
instruction parameters and results for Qemu. The @count of these data
currently is 140. So supporting arbitrary lengths in one shot here, and
also in vfio_mdev_write, seems the better option for this case.

I believe that if the pci drivers want to iterate in a 4 bytes step, you
can do that in the parent read/write callbacks instead.

What do you think?

> +		size_t filled;
> +
> +		if (count >= 4 && !(*ppos % 4)) {
> +			u32 val;
> +
> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
> +						*ppos);
> +			if (ret <= 0)
> +				goto read_err;
> +
> +			if (copy_to_user(buf, &val, sizeof(val)))
> +				goto read_err;
> +
> +			filled = 4;
> +		} else if (count >= 2 && !(*ppos % 2)) {
> +			u16 val;
> +
> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
> +						*ppos);
> +			if (ret <= 0)
> +				goto read_err;
> +
> +			if (copy_to_user(buf, &val, sizeof(val)))
> +				goto read_err;
> +
> +			filled = 2;
> +		} else {
> +			u8 val;
> +
> +			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
> +			if (ret <= 0)
> +				goto read_err;
> +
> +			if (copy_to_user(buf, &val, sizeof(val)))
> +				goto read_err;
> +
> +			filled = 1;
> +		}
> +
> +		count -= filled;
> +		done += filled;
> +		*ppos += filled;
> +		buf += filled;
> +	}
> +
> +	return done;
> +
> +read_err:
> +	return -EFAULT;
> +}
[...]

--------
Dong Jia

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-08-25  7:29     ` [Qemu-devel] " Dong Jia
@ 2016-08-26 13:50       ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-26 13:50 UTC (permalink / raw)
  To: Dong Jia
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, jike.song


Oh, that's the last minute change after running checkpatch.pl :(
Thanks for catching that. I'll correct that.

Thanks,
Kirti

On 8/25/2016 12:59 PM, Dong Jia wrote:
> On Thu, 25 Aug 2016 09:23:54 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> @@ -769,6 +1090,33 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	if (ret)
>>  		goto out_free;
>>
>> +	if (IS_ENABLED(CONFIF_VFIO_MDEV) && !iommu_present(bus) &&
> s/CONFIF_VFIO_MDEV/CONFIG_VFIO_MDEV/
> 
>> +	    (bus == &mdev_bus_type)) {
>> +		if (iommu->local_domain) {
>> +			list_add(&group->next,
>> +				 &iommu->local_domain->group_list);
>> +			kfree(domain);
>> +			mutex_unlock(&iommu->lock);
>> +			return 0;
>> +		}
>> +
> 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 3/4] vfio iommu: Add support for mediated devices
@ 2016-08-26 13:50       ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-26 13:50 UTC (permalink / raw)
  To: Dong Jia
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, jike.song


Oh, that's the last minute change after running checkpatch.pl :(
Thanks for catching that. I'll correct that.

Thanks,
Kirti

On 8/25/2016 12:59 PM, Dong Jia wrote:
> On Thu, 25 Aug 2016 09:23:54 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> @@ -769,6 +1090,33 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	if (ret)
>>  		goto out_free;
>>
>> +	if (IS_ENABLED(CONFIF_VFIO_MDEV) && !iommu_present(bus) &&
> s/CONFIF_VFIO_MDEV/CONFIG_VFIO_MDEV/
> 
>> +	    (bus == &mdev_bus_type)) {
>> +		if (iommu->local_domain) {
>> +			list_add(&group->next,
>> +				 &iommu->local_domain->group_list);
>> +			kfree(domain);
>> +			mutex_unlock(&iommu->lock);
>> +			return 0;
>> +		}
>> +
> 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-08-25  9:22     ` [Qemu-devel] " Dong Jia
@ 2016-08-26 14:13       ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-26 14:13 UTC (permalink / raw)
  To: Dong Jia
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, alex.williamson,
	kraxel, pbonzini



On 8/25/2016 2:52 PM, Dong Jia wrote:
> On Thu, 25 Aug 2016 09:23:53 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> [...]
> 
> Dear Kirti,
> 
> I just rebased my vfio-ccw patches to this series.
> With a little fix, which was pointed it out in my reply to the #3
> patch, it works fine.
> 

Thanks for the update. Glad to know this works for you.


>> +static long vfio_mdev_unlocked_ioctl(void *device_data,
>> +				     unsigned int cmd, unsigned long arg)
>> +{
>> +	int ret = 0;
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct parent_device *parent = vmdev->mdev->parent;
>> +	unsigned long minsz;
>> +
>> +	switch (cmd) {
>> +	case VFIO_DEVICE_GET_INFO:
>> +	{
>> +		struct vfio_device_info info;
>> +
>> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
>> +
>> +		if (copy_from_user(&info, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (info.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		if (parent->ops->get_device_info)
>> +			ret = parent->ops->get_device_info(vmdev->mdev, &info);
>> +		else
>> +			return -EINVAL;
>> +
>> +		if (ret)
>> +			return ret;
>> +
>> +		if (parent->ops->reset)
>> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
> Shouldn't this be done inside the get_device_info callback?
> 

I would like the vendor driver to set the device type only. The reset flag
should be set based on whether the reset() callback is provided.

>> +
>> +		memcpy(&vmdev->dev_info, &info, sizeof(info));
>> +
>> +		return copy_to_user((void __user *)arg, &info, minsz);
>> +	}
> [...]
> 
>> +
>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	unsigned int done = 0;
>> +	int ret;
>> +
>> +	if (!parent->ops->read)
>> +		return -EINVAL;
>> +
>> +	while (count) {
> Here, I have to say sorry to you guys for that I didn't notice the
> bad impact of this change to my patches during the v6 discussion.
> 
> For vfio-ccw, I introduced an I/O region to input/output I/O
> instruction parameters and results for Qemu. The @count of these data
> currently is 140. So supporting arbitrary lengths in one shot here, and
> also in vfio_mdev_write, seems the better option for this case.
> 
> I believe that if the pci drivers want to iterate in a 4 bytes step, you
> can do that in the parent read/write callbacks instead.
> 
> What do you think?
> 

I would like to know Alex's thoughts on this. He raised a concern with this
approach in the v6 review:
"But I think this is exploitable, it lets the user make the kernel
allocate an arbitrarily sized buffer."

Thanks,
Kirti

>> +		size_t filled;
>> +
>> +		if (count >= 4 && !(*ppos % 4)) {
>> +			u32 val;
>> +
>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +						*ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 4;
>> +		} else if (count >= 2 && !(*ppos % 2)) {
>> +			u16 val;
>> +
>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +						*ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 2;
>> +		} else {
>> +			u8 val;
>> +
>> +			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 1;
>> +		}
>> +
>> +		count -= filled;
>> +		done += filled;
>> +		*ppos += filled;
>> +		buf += filled;
>> +	}
>> +
>> +	return done;
>> +
>> +read_err:
>> +	return -EFAULT;
>> +}
> [...]
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
@ 2016-08-26 14:13       ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-08-26 14:13 UTC (permalink / raw)
  To: Dong Jia
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, jike.song



On 8/25/2016 2:52 PM, Dong Jia wrote:
> On Thu, 25 Aug 2016 09:23:53 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> [...]
> 
> Dear Kirti,
> 
> I just rebased my vfio-ccw patches to this series.
> With a little fix, which was pointed it out in my reply to the #3
> patch, it works fine.
> 

Thanks for the update. Glad to know this works for you.


>> +static long vfio_mdev_unlocked_ioctl(void *device_data,
>> +				     unsigned int cmd, unsigned long arg)
>> +{
>> +	int ret = 0;
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct parent_device *parent = vmdev->mdev->parent;
>> +	unsigned long minsz;
>> +
>> +	switch (cmd) {
>> +	case VFIO_DEVICE_GET_INFO:
>> +	{
>> +		struct vfio_device_info info;
>> +
>> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
>> +
>> +		if (copy_from_user(&info, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (info.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		if (parent->ops->get_device_info)
>> +			ret = parent->ops->get_device_info(vmdev->mdev, &info);
>> +		else
>> +			return -EINVAL;
>> +
>> +		if (ret)
>> +			return ret;
>> +
>> +		if (parent->ops->reset)
>> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
> Shouldn't this be done inside the get_device_info callback?
> 

I would like the vendor driver to set the device type only. The reset flag
should be set based on whether the reset() callback is provided.

>> +
>> +		memcpy(&vmdev->dev_info, &info, sizeof(info));
>> +
>> +		return copy_to_user((void __user *)arg, &info, minsz);
>> +	}
> [...]
> 
>> +
>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	unsigned int done = 0;
>> +	int ret;
>> +
>> +	if (!parent->ops->read)
>> +		return -EINVAL;
>> +
>> +	while (count) {
> Here, I have to say sorry to you guys for that I didn't notice the
> bad impact of this change to my patches during the v6 discussion.
> 
> For vfio-ccw, I introduced an I/O region to input/output I/O
> instruction parameters and results for Qemu. The @count of these data
> currently is 140. So supporting arbitrary lengths in one shot here, and
> also in vfio_mdev_write, seems the better option for this case.
> 
> I believe that if the pci drivers want to iterate in a 4 bytes step, you
> can do that in the parent read/write callbacks instead.
> 
> What do you think?
> 

I would like to know Alex's thoughts on this. He raised a concern with this
approach in the v6 review:
"But I think this is exploitable, it lets the user make the kernel
allocate an arbitrarily sized buffer."

Thanks,
Kirti

>> +		size_t filled;
>> +
>> +		if (count >= 4 && !(*ppos % 4)) {
>> +			u32 val;
>> +
>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +						*ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 4;
>> +		} else if (count >= 2 && !(*ppos % 2)) {
>> +			u16 val;
>> +
>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +						*ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 2;
>> +		} else {
>> +			u8 val;
>> +
>> +			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 1;
>> +		}
>> +
>> +		count -= filled;
>> +		done += filled;
>> +		*ppos += filled;
>> +		buf += filled;
>> +	}
>> +
>> +	return done;
>> +
>> +read_err:
>> +	return -EFAULT;
>> +}
> [...]
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 0/4] Add Mediated device support
  2016-08-25  3:53 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-30 16:16   ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-08-30 16:16 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, libvir-list, Daniel P. Berrange, Laine Stump

Hi folks,

At KVM Forum we had a BoF session primarily around the mediated device
sysfs interface.  I'd like to share what I think we agreed on and the
"problem areas" that still need some work so we can get the thoughts
and ideas from those who weren't able to attend.

DanPB expressed some concern about the mdev_supported_types sysfs
interface, which exposes a flat csv file with fields like "type",
"number of instance", "vendor string", and then a bunch of type
specific fields like "framebuffer size", "resolution", "frame rate
limit", etc.  This is not entirely machine parsing friendly and sort of
abuses the sysfs concept of one value per file.  Example output taken
from Neo's libvirt RFC:

cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
max_resolution
11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160

The create/destroy then looks like this:

echo "$mdev_UUID:vendor_specific_argument_list" >
	/sys/bus/pci/devices/.../mdev_create

echo "$mdev_UUID:vendor_specific_argument_list" >
	/sys/bus/pci/devices/.../mdev_destroy

"vendor_specific_argument_list" is nebulous.

So the idea to fix this is to explode this into a directory structure,
something like:

├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances

Note that I'm only exposing the minimal attributes here for simplicity,
the other attributes would be included in separate files and we would
require vendors to create standard attributes for common device classes.

For vGPUs like NVIDIA where we don't support multiple types
concurrently, this directory structure would update as mdev devices are
created, removing no longer available types.  I carried forward
max_instances here, but perhaps we really want to copy SR-IOV and
report a max and current allocation.  Creation and deletion is
simplified as we can simply "echo $UUID > create" per type.  I don't
understand why destroy had a parameter list, so here I imagine we can
simply do the same... in fact, I'd actually rather see a "remove" sysfs
entry under each mdev device, so we remove it at the device rather than
in some central location (any objections?).
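
For concreteness, creation and removal under this layout would look something
like the below (the PCI address comes from the example above; the type id,
UUID and the value written to "remove" are made up):

  UUID=$(uuidgen)
  echo $UUID > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
  echo 1 > /sys/bus/mdev/devices/$UUID/remove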

We discussed how this might look with Intel devices which do allow
mixed vGPU types concurrently.  We believe, but need confirmation, that
the vendor driver could still make a finite set of supported types,
perhaps with additional module options to the vendor driver to enable
more "exotic" types.  So for instance if IGD vGPUs are based on
power-of-2 portions of the framebuffer size, then the vendor driver
could list types with 32MB, 64MB, 128MB, etc in useful and popular
sizes.  As vGPUs are allocated, the larger sizes may become unavailable.

We still don't have any way for the admin to learn in advance how the
available supported types will change once mdev devices start to be
created.  I'm not sure how we can create a specification for this, so
probing by creating devices may be the most flexible model.

The other issue is the start/stop requirement, which was revealed to
setup peer-to-peer resources between vGPUs which is a limited hardware
resource.  We'd really like to have these happen automatically on the
first open of a vfio mdev device file and final release.  So we
brainstormed how the open/release callbacks could know the other mdev
devices for a given user.  This is where the instance number came into
play previously.  This is an area that needs work.

There was a thought that perhaps on open() the vendor driver could look
at the user pid and use that to associate with other devices, but the
problem here is that we open and begin access to each device, so
devices do this discovery serially rather than in parallel as desired.
(we might not fault in mmio space yet though, so I wonder if open()
could set the association of mdev to pid, then the first mmio fault
would trigger the resource allocation?  Then all the "magic" would live
in the vendor driver.  open() could fail if the pid already has running
mdev devices and the vendor driver chooses not to support hotplug)

One comment was that for a GPU that only supports homogeneous vGPUs,
libvirt may choose to create all the vGPUs in advance and handle them
as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
case.

We also considered whether iommu groups could be (ab)used for this use
case, peer-to-peer would in fact be an iommu grouping constraint
afterall.  This would have the same UUID+instance constraint as above
though and would require some sort of sysfs interface for the user to
be able to create multiple mdevs within a group.

Everyone was given homework to think about this on their flights home,
so I expect plenty of ideas by now ;)

Overall I think mediated devices were well received by the community,
so let's keep up the development and discussion to bring it to
fruition.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
@ 2016-08-30 16:16   ` Alex Williamson
  0 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-08-30 16:16 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, libvir-list, Daniel P. Berrange, Laine Stump

Hi folks,

At KVM Forum we had a BoF session primarily around the mediated device
sysfs interface.  I'd like to share what I think we agreed on and the
"problem areas" that still need some work so we can get the thoughts
and ideas from those who weren't able to attend.

DanPB expressed some concern about the mdev_supported_types sysfs
interface, which exposes a flat csv file with fields like "type",
"number of instance", "vendor string", and then a bunch of type
specific fields like "framebuffer size", "resolution", "frame rate
limit", etc.  This is not entirely machine parsing friendly and sort of
abuses the sysfs concept of one value per file.  Example output taken
from Neo's libvirt RFC:

cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
max_resolution
11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160

The create/destroy then looks like this:

echo "$mdev_UUID:vendor_specific_argument_list" >
	/sys/bus/pci/devices/.../mdev_create

echo "$mdev_UUID:vendor_specific_argument_list" >
	/sys/bus/pci/devices/.../mdev_destroy

"vendor_specific_argument_list" is nebulous.

So the idea to fix this is to explode this into a directory structure,
something like:

├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances

Note that I'm only exposing the minimal attributes here for simplicity,
the other attributes would be included in separate files and we would
require vendors to create standard attributes for common device classes.

For vGPUs like NVIDIA where we don't support multiple types
concurrently, this directory structure would update as mdev devices are
created, removing no longer available types.  I carried forward
max_instances here, but perhaps we really want to copy SR-IOV and
report a max and current allocation.  Creation and deletion is
simplified as we can simply "echo $UUID > create" per type.  I don't
understand why destroy had a parameter list, so here I imagine we can
simply do the same... in fact, I'd actually rather see a "remove" sysfs
entry under each mdev device, so we remove it at the device rather than
in some central location (any objections?).
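
For concreteness, creation and removal under this layout would look something
like the below (the PCI address comes from the example above; the type id,
UUID and the value written to "remove" are made up):

  UUID=$(uuidgen)
  echo $UUID > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
  echo 1 > /sys/bus/mdev/devices/$UUID/remove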

We discussed how this might look with Intel devices which do allow
mixed vGPU types concurrently.  We believe, but need confirmation, that
the vendor driver could still make a finite set of supported types,
perhaps with additional module options to the vendor driver to enable
more "exotic" types.  So for instance if IGD vGPUs are based on
power-of-2 portions of the framebuffer size, then the vendor driver
could list types with 32MB, 64MB, 128MB, etc in useful and popular
sizes.  As vGPUs are allocated, the larger sizes may become unavailable.

We still don't have any way for the admin to learn in advance how the
available supported types will change once mdev devices start to be
created.  I'm not sure how we can create a specification for this, so
probing by creating devices may be the most flexible model.

The other issue is the start/stop requirement, which was revealed to
setup peer-to-peer resources between vGPUs which is a limited hardware
resource.  We'd really like to have these happen automatically on the
first open of a vfio mdev device file and final release.  So we
brainstormed how the open/release callbacks could know the other mdev
devices for a given user.  This is where the instance number came into
play previously.  This is an area that needs work.

There was a thought that perhaps on open() the vendor driver could look
at the user pid and use that to associate with other devices, but the
problem here is that we open and begin access to each device, so
devices do this discovery serially rather than in parallel as desired.
(we might not fault in mmio space yet though, so I wonder if open()
could set the association of mdev to pid, then the first mmio fault
would trigger the resource allocation?  Then all the "magic" would live
in the vendor driver.  open() could fail if the pid already has running
mdev devices and the vendor driver chooses not to support hotplug)

One comment was that for a GPU that only supports homogeneous vGPUs,
libvirt may choose to create all the vGPUs in advance and handle them
as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
case.

We also considered whether iommu groups could be (ab)used for this use
case, peer-to-peer would in fact be an iommu grouping constraint
afterall.  This would have the same UUID+instance constraint as above
though and would require some sort of sysfs interface for the user to
be able to create multiple mdevs within a group.

Everyone was given homework to think about this on their flights home,
so I expect plenty of ideas by now ;)

Overall I think mediated devices were well received by the community,
so let's keep up the development and discussion to bring it to
fruition.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* RE: [PATCH v7 0/4] Add Mediated device support
  2016-08-30 16:16   ` [Qemu-devel] " Alex Williamson
@ 2016-08-31  6:12     ` Tian, Kevin
  -1 siblings, 0 replies; 162+ messages in thread
From: Tian, Kevin @ 2016-08-31  6:12 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Song, Jike, bjsdjshi,
	libvir-list, Daniel P. Berrange, Laine Stump

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, August 31, 2016 12:17 AM
> 
> Hi folks,
> 
> At KVM Forum we had a BoF session primarily around the mediated device
> sysfs interface.  I'd like to share what I think we agreed on and the
> "problem areas" that still need some work so we can get the thoughts
> and ideas from those who weren't able to attend.
> 
> DanPB expressed some concern about the mdev_supported_types sysfs
> interface, which exposes a flat csv file with fields like "type",
> "number of instance", "vendor string", and then a bunch of type
> specific fields like "framebuffer size", "resolution", "frame rate
> limit", etc.  This is not entirely machine parsing friendly and sort of
> abuses the sysfs concept of one value per file.  Example output taken
> from Neo's libvirt RFC:
> 
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> max_resolution
> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> 
> The create/destroy then looks like this:
> 
> echo "$mdev_UUID:vendor_specific_argument_list" >
> 	/sys/bus/pci/devices/.../mdev_create
> 
> echo "$mdev_UUID:vendor_specific_argument_list" >
> 	/sys/bus/pci/devices/.../mdev_destroy
> 
> "vendor_specific_argument_list" is nebulous.
> 
> So the idea to fix this is to explode this into a directory structure,
> something like:
> 
> ├── mdev_destroy
> └── mdev_supported_types
>     ├── 11
>     │   ├── create
>     │   ├── description
>     │   └── max_instances
>     ├── 12
>     │   ├── create
>     │   ├── description
>     │   └── max_instances
>     └── 13
>         ├── create
>         ├── description
>         └── max_instances
> 
> Note that I'm only exposing the minimal attributes here for simplicity,
> the other attributes would be included in separate files and we would
> require vendors to create standard attributes for common device classes.

I like this idea. All standard attributes are reflected into this hierarchy.
In the meantime, can we still allow an optional vendor string in the create
interface? libvirt doesn't need to know the meaning, but it allows the upper
layer to do some vendor-specific tweaks if necessary.

> 
> For vGPUs like NVIDIA where we don't support multiple types
> concurrently, this directory structure would update as mdev devices are
> created, removing no longer available types.  I carried forward

or keep the type with max_instances cleared to ZERO.

> max_instances here, but perhaps we really want to copy SR-IOV and
> report a max and current allocation.  Creation and deletion is

right, cur/max_instances look reasonable.

> simplified as we can simply "echo $UUID > create" per type.  I don't
> understand why destroy had a parameter list, so here I imagine we can
> simply do the same... in fact, I'd actually rather see a "remove" sysfs
> entry under each mdev device, so we remove it at the device rather than
> in some central location (any objections?).

OK to me. 

> 
> We discussed how this might look with Intel devices which do allow
> mixed vGPU types concurrently.  We believe, but need confirmation, that
> the vendor driver could still make a finite set of supported types,
> perhaps with additional module options to the vendor driver to enable
> more "exotic" types.  So for instance if IGD vGPUs are based on
> power-of-2 portions of the framebuffer size, then the vendor driver
> could list types with 32MB, 64MB, 128MB, etc in useful and popular
> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.

Yes, Intel can do such a type of definition. One thing I'm not sure about is
the impact across listed types, i.e. when creating a new instance
under a given type, max_instances under other types would be
dynamically decremented based on available resources. Would it be
a problem for libvirt or the upper level stack, since a natural interpretation
of max_instances should be a static number?

An alternative is to make max_instances configurable, so libvirt has
chance to define a pool of available instances with different types
before creating any instance. For example, initially IGD driver may 
report max_instances only for a minimal sharing granularity:
	128MB:
		max_instances (8)
	256MB:
		max_instances (0)
	512MB:
		max_instances (0)

Then libvirt can configure more types as:
	128MB:
		max_instances (2)
	256MB:
		max_instances (1)
	512MB:
		max_instances (1)

Starting from this point, max_instances would be static and then
mdev instances can be created under each type. But I'm not
sure whether such an additional configuration role is reasonable for libvirt...

> 
> We still don't have any way for the admin to learn in advance how the
> available supported types will change once mdev devices start to be
> created.  I'm not sure how we can create a specification for this, so
> probing by creating devices may be the most flexible model.
> 
> The other issue is the start/stop requirement, which was revealed to
> setup peer-to-peer resources between vGPUs which is a limited hardware
> resource.  We'd really like to have these happen automatically on the
> first open of a vfio mdev device file and final release.  So we
> brainstormed how the open/release callbacks could know the other mdev
> devices for a given user.  This is where the instance number came into
> play previously.  This is an area that needs work.

IGD doesn't have such peer-to-peer resource setup requirement. So
it's sufficient to create/destroy a mdev instance in a single action on
IGD. However I'd expect we still keep the "start/stop" interface (
maybe not exposed as sysfs node, instead being a VFIO API), as 
required to support future live migration usage. We've made prototype
working for KVMGT today.

> 
> There was a thought that perhaps on open() the vendor driver could look
> at the user pid and use that to associate with other devices, but the
> problem here is that we open and begin access to each device, so
> devices do this discovery serially rather than in parallel as desired.
> (we might not fault in mmio space yet though, so I wonder if open()
> could set the association of mdev to pid, then the first mmio fault
> would trigger the resource allocation?  Then all the "magic" would live
> in the vendor driver.  open() could fail if the pid already has running
> mdev devices and the vendor driver chooses not to support hotplug)
> 
> One comment was that for a GPU that only supports homogeneous vGPUs,
> libvirt may choose to create all the vGPUs in advance and handle them
> as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
> case.
> 
> We also considered whether iommu groups could be (ab)used for this use
> case, peer-to-peer would in fact be an iommu grouping constraint
> afterall.  This would have the same UUID+instance constraint as above
> though and would require some sort of sysfs interface for the user to
> be able to create multiple mdevs within a group.
> 
> Everyone was given homework to think about this on their flights home,
> so I expect plenty of ideas by now ;)
> 
> Overall I think mediated devices were well received by the community,
> so let's keep up the development and discussion to bring it to
> fruition.  Thanks,

Thanks a lot, Alex, for your help in driving this discussion. The mediated
device technique has the potential to be used for other types of I/O
virtualization in the future, not limited to GPU virtualization. So getting
the core framework ready earlier would be highly welcomed. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 0/4] Add Mediated device support
  2016-08-31  6:12     ` [Qemu-devel] " Tian, Kevin
@ 2016-08-31  7:04       ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-08-31  7:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Kirti Wankhede, pbonzini, kraxel, cjia,
	qemu-devel, kvm, bjsdjshi, libvir-list, Daniel P. Berrange,
	Laine Stump

On 08/31/2016 02:12 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Wednesday, August 31, 2016 12:17 AM
>>
>> Hi folks,
>>
>> At KVM Forum we had a BoF session primarily around the mediated device
>> sysfs interface.  I'd like to share what I think we agreed on and the
>> "problem areas" that still need some work so we can get the thoughts
>> and ideas from those who weren't able to attend.
>>
>> DanPB expressed some concern about the mdev_supported_types sysfs
>> interface, which exposes a flat csv file with fields like "type",
>> "number of instance", "vendor string", and then a bunch of type
>> specific fields like "framebuffer size", "resolution", "frame rate
>> limit", etc.  This is not entirely machine parsing friendly and sort of
>> abuses the sysfs concept of one value per file.  Example output taken
>> from Neo's libvirt RFC:
>>
>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>> max_resolution
>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>
>> The create/destroy then looks like this:
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_create
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_destroy
>>
>> "vendor_specific_argument_list" is nebulous.
>>
>> So the idea to fix this is to explode this into a directory structure,
>> something like:
>>
>> ├── mdev_destroy
>> └── mdev_supported_types
>>     ├── 11
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     ├── 12
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     └── 13
>>         ├── create
>>         ├── description
>>         └── max_instances
>>
>> Note that I'm only exposing the minimal attributes here for simplicity,
>> the other attributes would be included in separate files and we would
>> require vendors to create standard attributes for common device classes.
> 
> I like this idea. All standard attributes are reflected into this hierarchy.
> In the meantime, can we still allow optional vendor string in create 
> interface? libvirt doesn't need to know the meaning, but allows upper
> layer to do some vendor specific tweak if necessary.
> 

Not sure whether this can be done within the MDEV framework (attrs provided
by the vendor driver of course), or whether it must be within the vendor driver.

>>
>> For vGPUs like NVIDIA where we don't support multiple types
>> concurrently, this directory structure would update as mdev devices are
>> created, removing no longer available types.  I carried forward
> 
> or keep the type with max_instances cleared to ZERO.
>

+1 :)

>> max_instances here, but perhaps we really want to copy SR-IOV and
>> report a max and current allocation.  Creation and deletion is
> 
> right, cur/max_instances look reasonable.
> 
>> simplified as we can simply "echo $UUID > create" per type.  I don't
>> understand why destroy had a parameter list, so here I imagine we can
>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>> entry under each mdev device, so we remove it at the device rather than
>> in some central location (any objections?).
> 
> OK to me. 

IIUC, "destroy" has a parameter list only because of the previous
$VM_UUID + instance implementation. It should be safe to move the "destroy"
file under mdev now.

>> We discussed how this might look with Intel devices which do allow
>> mixed vGPU types concurrently.  We believe, but need confirmation, that
>> the vendor driver could still make a finite set of supported types,
>> perhaps with additional module options to the vendor driver to enable
>> more "exotic" types.  So for instance if IGD vGPUs are based on
>> power-of-2 portions of the framebuffer size, then the vendor driver
>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.
>
> Yes, Intel can do such type of definition. One thing I'm not sure is 
> about impact cross listed types, i.e. when creating a new instance
> under a given type, max_instances under other types would be 
> dynamically decremented based on available resource. Would it be
> a problem for libvirt or upper level stack, since a natural interpretation
> of max_instances should be a static number?
>
> An alternative is to make max_instances configurable, so libvirt has
> chance to define a pool of available instances with different types
> before creating any instance. For example, initially IGD driver may 
> report max_instances only for a minimal sharing granularity:
> 	128MB:
> 		max_instances (8)
> 	256MB:
> 		max_instances (0)
> 	512MB:
> 		max_instances (0)
> 
> Then libvirt can configure more types as:
> 	128MB:
> 		max_instances (2)
> 	256MB:
> 		max_instances (1)
> 	512MB:
> 		max_instances (1)
> 
> Starting from this point, max_instances would be static and then
> mdev instance can be created under each type. But I'm not
> sure whether such additional configuration role is reasonable to libvirt...
>>
>> We still don't have any way for the admin to learn in advance how the
>> available supported types will change once mdev devices start to be
>> created.  I'm not sure how we can create a specification for this, so
>> probing by creating devices may be the most flexible model.
>>
>> The other issue is the start/stop requirement, which was revealed to
>> setup peer-to-peer resources between vGPUs which is a limited hardware
>> resource.  We'd really like to have these happen automatically on the
>> first open of a vfio mdev device file and final release.  So we
>> brainstormed how the open/release callbacks could know the other mdev
>> devices for a given user.  This is where the instance number came into
>> play previously.  This is an area that needs work.
> 
> IGD doesn't have such peer-to-peer resource setup requirement. So
> it's sufficient to create/destroy a mdev instance in a single action on
> IGD. However I'd expect we still keep the "start/stop" interface (
> maybe not exposed as sysfs node, instead being a VFIO API), as 
> required to support future live migration usage. We've made prototype
> working for KVMGT today.

It's good for the framework to define start/stop interfaces, but as Alex
said below, it should be MDEV oriented, not VM oriented.

I don't know a lot about the peer-to-peer resource, but to me, although
VM_UUID + instance is not applicable, userspace can always achieve the
same purpose by providing the VM UUID under every mdev (assuming an
mdev hierarchy):

	/sys/bus/pci/devices/<sbdf>/mdev/
	|-- mdev01/
	|   `-- vm_uuid
	`-- mdev02/
	    `-- vm_uuid

Did I miss something?
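
For illustration, a sketch of how userspace might tag the devices
(the "vm_uuid" node is hypothetical, following the layout above):

	echo $VM_UUID > /sys/bus/pci/devices/<sbdf>/mdev/mdev01/vm_uuid
	echo $VM_UUID > /sys/bus/pci/devices/<sbdf>/mdev/mdev02/vm_uuid
	# the vendor driver could then group mdev01 and mdev02 by the
	# common UUID when setting up shared resources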

>>
>> There was a thought that perhaps on open() the vendor driver could look
>> at the user pid and use that to associate with other devices, but the
>> problem here is that we open and begin access to each device, so
>> devices do this discovery serially rather than in parallel as desired.
>> (we might not fault in mmio space yet though, so I wonder if open()
>> could set the association of mdev to pid, then the first mmio fault
>> would trigger the resource allocation?  Then all the "magic" would live
>> in the vendor driver.  open() could fail if the pid already has running
>> mdev devices and the vendor driver chooses not to support hotplug)
>>
>> One comment was that for a GPU that only supports homogeneous vGPUs,
>> libvirt may choose to create all the vGPUs in advance and handle them
>> as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
>> case.
>>
>> We also considered whether iommu groups could be (ab)used for this use
>> case, peer-to-peer would in fact be an iommu grouping constraint
>> afterall.  This would have the same UUID+instance constraint as above
>> though and would require some sort of sysfs interface for the user to
>> be able to create multiple mdevs within a group.
>>
>> Everyone was given homework to think about this on their flights home,
>> so I expect plenty of ideas by now ;)
>>
>> Overall I think mediated devices were well received by the community,
>> so let's keep up the development and discussion to bring it to
>> fruition.  Thanks,
> 
> Thanks a lot Alex for your help on driving this discussion. Mediated device
> technique has the potential to be used for other type of I/O virtualizations
> in the future, not limited to GPU virtualization. So getting the core framework
> ready earlier would be highly welcomed. :-)
> 
--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 0/4] Add Mediated device support
  2016-08-31  7:04       ` [Qemu-devel] " Jike Song
@ 2016-08-31 15:48         ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-08-31 15:48 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel,
	kvm, bjsdjshi, libvir-list, Daniel P. Berrange, Laine Stump

On Wed, 31 Aug 2016 15:04:13 +0800
Jike Song <jike.song@intel.com> wrote:

> On 08/31/2016 02:12 PM, Tian, Kevin wrote:
> >> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> >> Sent: Wednesday, August 31, 2016 12:17 AM
> >>
> >> Hi folks,
> >>
> >> At KVM Forum we had a BoF session primarily around the mediated device
> >> sysfs interface.  I'd like to share what I think we agreed on and the
> >> "problem areas" that still need some work so we can get the thoughts
> >> and ideas from those who weren't able to attend.
> >>
> >> DanPB expressed some concern about the mdev_supported_types sysfs
> >> interface, which exposes a flat csv file with fields like "type",
> >> "number of instance", "vendor string", and then a bunch of type
> >> specific fields like "framebuffer size", "resolution", "frame rate
> >> limit", etc.  This is not entirely machine parsing friendly and sort of
> >> abuses the sysfs concept of one value per file.  Example output taken
> >> from Neo's libvirt RFC:
> >>
> >> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> >> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> >> max_resolution
> >> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> >> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> >> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> >> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> >> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> >> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> >> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> >> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> >>
> >> The create/destroy then looks like this:
> >>
> >> echo "$mdev_UUID:vendor_specific_argument_list" >
> >> 	/sys/bus/pci/devices/.../mdev_create
> >>
> >> echo "$mdev_UUID:vendor_specific_argument_list" >
> >> 	/sys/bus/pci/devices/.../mdev_destroy
> >>
> >> "vendor_specific_argument_list" is nebulous.
> >>
> >> So the idea to fix this is to explode this into a directory structure,
> >> something like:
> >>
> >> ├── mdev_destroy
> >> └── mdev_supported_types
> >>     ├── 11
> >>     │   ├── create
> >>     │   ├── description
> >>     │   └── max_instances
> >>     ├── 12
> >>     │   ├── create
> >>     │   ├── description
> >>     │   └── max_instances
> >>     └── 13
> >>         ├── create
> >>         ├── description
> >>         └── max_instances
> >>
> >> Note that I'm only exposing the minimal attributes here for simplicity,
> >> the other attributes would be included in separate files and we would
> >> require vendors to create standard attributes for common device classes.  
> > 
> > I like this idea. All standard attributes are reflected into this hierarchy.
> > In the meantime, can we still allow optional vendor string in create 
> > interface? libvirt doesn't need to know the meaning, but allows upper
> > layer to do some vendor specific tweak if necessary.
> >   
> 
> Not sure whether this can done within MDEV framework (attrs provided by
> vendor driver of course), or must be within the vendor driver.

The purpose of the sub-directories is that libvirt doesn't need to pass
arbitrary vendor strings to the create function; the attributes of the
mdev device created are defined by the attributes in the sysfs
directory where the create is done.  The user only provides a uuid for
the device.  Arbitrary vendor parameters are a barrier: libvirt may not
need to know their meaning, but would need to know when to apply them,
which is just as bad.  Ultimately we want libvirt to be able to
interact with sysfs without any vendor-specific knowledge.
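
For illustration, a rough sketch of that flow (the type id, paths, and
the per-device "remove" node are assumptions taken from the proposal
above, not an existing interface):

	# create an mdev of a given type; the user only supplies a uuid
	uuid=$(uuidgen)
	echo $uuid > /sys/bus/pci/devices/<sbdf>/mdev_supported_types/11/create

	# later, remove it at the mdev device itself rather than at a
	# central node (exact location of the device node is still open)
	echo 1 > /sys/bus/pci/devices/<sbdf>/<mdev device>/remove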

> >>
> >> For vGPUs like NVIDIA where we don't support multiple types
> >> concurrently, this directory structure would update as mdev devices are
> >> created, removing no longer available types.  I carried forward  
> > 
> > or keep the type with max_instances cleared to ZERO.
> >  
> 
> +1 :)

Possible yes, but why would the vendor driver report types that the
user cannot create?  It just seems like superfluous information (well,
except for the use I discover below).

> >> max_instances here, but perhaps we really want to copy SR-IOV and
> >> report a max and current allocation.  Creation and deletion is  
> > 
> > right, cur/max_instances look reasonable.
> >   
> >> simplified as we can simply "echo $UUID > create" per type.  I don't
> >> understand why destroy had a parameter list, so here I imagine we can
> >> simply do the same... in fact, I'd actually rather see a "remove" sysfs
> >> entry under each mdev device, so we remove it at the device rather than
> >> in some central location (any objections?).  
> > 
> > OK to me.   
> 
> IIUC, "destroy" has a parameter list is only because the previous
> $VM_UUID + instnace implementation. It should be safe to move the "destroy"
> file under mdev now.
> 
> >> We discussed how this might look with Intel devices which do allow
> >> mixed vGPU types concurrently.  We believe, but need confirmation, that
> >> the vendor driver could still make a finite set of supported types,
> >> perhaps with additional module options to the vendor driver to enable
> >> more "exotic" types.  So for instance if IGD vGPUs are based on
> >> power-of-2 portions of the framebuffer size, then the vendor driver
> >> could list types with 32MB, 64MB, 128MB, etc in useful and popular
> >> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.  
> >
> > Yes, Intel can do such type of definition. One thing I'm not sure is 
> > about impact cross listed types, i.e. when creating a new instance
> > under a given type, max_instances under other types would be 
> > dynamically decremented based on available resource. Would it be
> > a problem for libvirt or upper level stack, since a natural interpretation
> > of max_instances should be a static number?
> >
> > An alternative is to make max_instances configurable, so libvirt has
> > chance to define a pool of available instances with different types
> > before creating any instance. For example, initially IGD driver may 
> > report max_instances only for a minimal sharing granularity:
> > 	128MB:
> > 		max_instances (8)
> > 	256MB:
> > 		max_instances (0)
> > 	512MB:
> > 		max_instances (0)
> > 
> > Then libvirt can configure more types as:
> > 	128MB:
> > 		max_instances (2)
> > 	256MB:
> > 		max_instances (1)
> > 	512MB:
> > 		max_instances (1)
> > 
> > Starting from this point, max_instances would be static and then
> > mdev instance can be created under each type. But I'm not
> > sure whether such additional configuration role is reasonable to libvirt...  

My expectation of your example, where I'm assuming you have 1G of total
memory that can be divided between the mdev devices, would be:

 128M: 8
 256M: 4
 512M: 2

If a 512M mdev device is created, this becomes:

 128M: 4
 256M: 2
 512M: 1

Creating a 128M mdev device from that becomes:

 128M: 3
 256M: 1
 512M: 0

It's not great, but I don't know how to do it better without the user
having a clear understanding of the algorithm and resources required
for each mdev device.  For instance, the size here, presumably the
framebuffer size, is just one attribute in the device directory; the
user won't know that this attribute is the key to the available
instances.

I don't particularly like the idea of a writeable max_instances, the
user can simply create instances of the type and see the results.

Just thought of another thing; do we need some way to determine the
type of an mdev device from sysfs or is this implicit knowledge for the
user that created the device?  For instance, we create a 512M device
and it becomes a child device to the parent, so we can associate to the
parent, but if we come back later, how do we know it's a 512M device?
Perhaps this is a reason to keep the type directories around and we can
cross link the device to the type and create a devices subdirectory
under each type.  Perhaps then "max_instances" becomes
"available_instances" (i.e. how many we can still create) and we don't
need a "current_instances" because we can simply look in the devices
directory.
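
To make the accounting concrete, a sketch of what this could look like
from userspace, assuming the 1G example above and the
"available_instances"/"devices" naming just suggested (none of these
nodes exist yet):

	cd /sys/bus/pci/devices/<sbdf>/mdev_supported_types
	cat <512M-type>/available_instances    # 2
	echo $(uuidgen) > <512M-type>/create
	cat <512M-type>/available_instances    # now 1
	cat <128M-type>/available_instances    # now 4 (was 8)
	ls <512M-type>/devices/                # shows the new instance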

> >>
> >> We still don't have any way for the admin to learn in advance how the
> >> available supported types will change once mdev devices start to be
> >> created.  I'm not sure how we can create a specification for this, so
> >> probing by creating devices may be the most flexible model.
> >>
> >> The other issue is the start/stop requirement, which was revealed to
> >> setup peer-to-peer resources between vGPUs which is a limited hardware
> >> resource.  We'd really like to have these happen automatically on the
> >> first open of a vfio mdev device file and final release.  So we
> >> brainstormed how the open/release callbacks could know the other mdev
> >> devices for a given user.  This is where the instance number came into
> >> play previously.  This is an area that needs work.  
> > 
> > IGD doesn't have such peer-to-peer resource setup requirement. So
> > it's sufficient to create/destroy a mdev instance in a single action on
> > IGD. However I'd expect we still keep the "start/stop" interface (
> > maybe not exposed as sysfs node, instead being a VFIO API), as 
> > required to support future live migration usage. We've made prototype
> > working for KVMGT today.  

Great!

> It's good for the framework to define start/stop interfaces, but as Alex
> said below, it should be MDEV oriented, not VM oriented.
> 
> I don't know a lot about the peer-to-peer resource, but to me, although
> VM_UUID + instance is not applicable, userspace can always achieve the
> same purpose by, let us assume a mdev hierarchy, providing the VM UUID
> under every mdev:
> 
> 	/sys/bus/pci/devices/<sbdf>/mdev/
> 	|-- mdev01/
> 	|   `-- vm_uuid
> 	`-- mdev02/
> 	    `-- vm_uuid
> 
> Did I miss something?

Sure, this is just another way of doing UUID+instance.  Nit, it might
look more like:

 	/sys/bus/pci/devices/<sbdf>/mdev/
 	|-- uuid1/
 	|   `-- group_uuid
 	`-- uuid2/
 	    `-- group_uuid

Where each mdev device is actually referenced by its UUID name, and
we'd have some writable attribute under the device so that mdev devices
sharing the same group UUID are handled together.  There's a problem
here though that vfio doesn't know about this level of grouping, so
uuid1 and uuid2 could actually be given to different users despite the
grouping here, which results in one or both devices not working or
creating security issues.  That sort of implies that this would
necessarily need to be exposed as iommu grouping.  This factors into why
it seems like a good idea to make the start/stop implicit within the
interface.  In that way each mdev device is fungible as far as a user
like libvirt is concerned; internal details like peer-to-peer resources
are handled automatically as the devices are accessed.
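
As a minimal sketch of how such a writable attribute might be used
(the "group_uuid" node is hypothetical, per the layout above):

	group=$(uuidgen)
	echo $group > /sys/bus/pci/devices/<sbdf>/mdev/uuid1/group_uuid
	echo $group > /sys/bus/pci/devices/<sbdf>/mdev/uuid2/group_uuid
	# the vendor driver could then set up the shared peer-to-peer
	# resources for both devices on the first open() of either one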

> >> There was a thought that perhaps on open() the vendor driver could look
> >> at the user pid and use that to associate with other devices, but the
> >> problem here is that we open and begin access to each device, so
> >> devices do this discovery serially rather than in parallel as desired.
> >> (we might not fault in mmio space yet though, so I wonder if open()
> >> could set the association of mdev to pid, then the first mmio fault
> >> would trigger the resource allocation?  Then all the "magic" would live
> >> in the vendor driver.  open() could fail if the pid already has running
> >> mdev devices and the vendor driver chooses not to support hotplug)
> >>
> >> One comment was that for a GPU that only supports homogeneous vGPUs,
> >> libvirt may choose to create all the vGPUs in advance and handle them
> >> as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
> >> case.
> >>
> >> We also considered whether iommu groups could be (ab)used for this use
> >> case, peer-to-peer would in fact be an iommu grouping constraint
> >> afterall.  This would have the same UUID+instance constraint as above
> >> though and would require some sort of sysfs interface for the user to
> >> be able to create multiple mdevs within a group.
> >>
> >> Everyone was given homework to think about this on their flights home,
> >> so I expect plenty of ideas by now ;)
> >>
> >> Overall I think mediated devices were well received by the community,
> >> so let's keep up the development and discussion to bring it to
> >> fruition.  Thanks,  
> > 
> > Thanks a lot Alex for your help on driving this discussion. Mediated device
> > technique has the potential to be used for other type of I/O virtualizations
> > in the future, not limited to GPU virtualization. So getting the core framework
> > ready earlier would be highly welcomed. :-)

I agree, there's lots of potential and it's extra incentive to create
an interface that's going to make sense long term.  Ideally we only
need to create the kernel and libvirt infrastructure once and we
can handle any type of mediated driver.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* RE: [PATCH v7 0/4] Add Mediated device support
  2016-08-31 15:48         ` [Qemu-devel] " Alex Williamson
@ 2016-09-01  4:09           ` Tian, Kevin
  -1 siblings, 0 replies; 162+ messages in thread
From: Tian, Kevin @ 2016-09-01  4:09 UTC (permalink / raw)
  To: Alex Williamson, Song, Jike
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	bjsdjshi, libvir-list, Daniel P. Berrange, Laine Stump

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, August 31, 2016 11:49 PM
> 
> On Wed, 31 Aug 2016 15:04:13 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
> > On 08/31/2016 02:12 PM, Tian, Kevin wrote:
> > >> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > >> Sent: Wednesday, August 31, 2016 12:17 AM
> > >>
> > >> Hi folks,
> > >>
> > >> At KVM Forum we had a BoF session primarily around the mediated device
> > >> sysfs interface.  I'd like to share what I think we agreed on and the
> > >> "problem areas" that still need some work so we can get the thoughts
> > >> and ideas from those who weren't able to attend.
> > >>
> > >> DanPB expressed some concern about the mdev_supported_types sysfs
> > >> interface, which exposes a flat csv file with fields like "type",
> > >> "number of instance", "vendor string", and then a bunch of type
> > >> specific fields like "framebuffer size", "resolution", "frame rate
> > >> limit", etc.  This is not entirely machine parsing friendly and sort of
> > >> abuses the sysfs concept of one value per file.  Example output taken
> > >> from Neo's libvirt RFC:
> > >>
> > >> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> > >> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> > >> max_resolution
> > >> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> > >> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> > >> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> > >> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> > >> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> > >> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> > >> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> > >> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> > >>
> > >> The create/destroy then looks like this:
> > >>
> > >> echo "$mdev_UUID:vendor_specific_argument_list" >
> > >> 	/sys/bus/pci/devices/.../mdev_create
> > >>
> > >> echo "$mdev_UUID:vendor_specific_argument_list" >
> > >> 	/sys/bus/pci/devices/.../mdev_destroy
> > >>
> > >> "vendor_specific_argument_list" is nebulous.
> > >>
> > >> So the idea to fix this is to explode this into a directory structure,
> > >> something like:
> > >>
> > >> ├── mdev_destroy
> > >> └── mdev_supported_types
> > >>     ├── 11
> > >>     │   ├── create
> > >>     │   ├── description
> > >>     │   └── max_instances
> > >>     ├── 12
> > >>     │   ├── create
> > >>     │   ├── description
> > >>     │   └── max_instances
> > >>     └── 13
> > >>         ├── create
> > >>         ├── description
> > >>         └── max_instances
> > >>
> > >> Note that I'm only exposing the minimal attributes here for simplicity,
> > >> the other attributes would be included in separate files and we would
> > >> require vendors to create standard attributes for common device classes.
> > >
> > > I like this idea. All standard attributes are reflected into this hierarchy.
> > > In the meantime, can we still allow optional vendor string in create
> > > interface? libvirt doesn't need to know the meaning, but allows upper
> > > layer to do some vendor specific tweak if necessary.
> > >
> >
> > Not sure whether this can done within MDEV framework (attrs provided by
> > vendor driver of course), or must be within the vendor driver.
> 
> The purpose of the sub-directories is that libvirt doesn't need to pass
> arbitrary, vendor strings to the create function, the attributes of the
> mdev device created are defined by the attributes in the sysfs
> directory where the create is done.  The user only provides a uuid for
> the device.  Arbitrary vendor parameters are a barrier, libvirt may not
> need to know the meaning, but would need to know when to apply them,
> which is just as bad.  Ultimately we want libvirt to be able to
> interact with sysfs without having an vendor specific knowledge.

Understood. Today Intel doesn't have such a vendor-specific parameter
requirement when creating an mdev instance (assuming the type definition
is enough to cover our existing parameters).

Just thinking about future extensibility: say a new parameter (e.g.
a QoS parameter like weight or cap) must be statically set, due to a
device limitation, before the created mdev instance starts to work.
Such a parameter needs to be exposed as a new attribute under the
specific mdev instance, e.g.:
	/sys/bus/pci/devices/<sbdf>/mdev/weight

Then libvirt needs to make sure it's set before open()ing the instance.

If such a flow is acceptable, it removes the need for vendor-specific
parameters at create time, because any such requirement would be
converted into a sysfs node (if applicable to all vendors), and libvirt
can then do this additional configuration before starting the instance.
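
To make the ordering concrete, here is a minimal sketch (the paths and the
'weight' attribute below are hypothetical, only meant to illustrate the flow):

	uuid=$(uuidgen)
	# create the instance under the chosen type
	echo "$uuid" > /sys/bus/pci/devices/<sbdf>/mdev_supported_types/11/create
	# statically configure the QoS parameter while the device is still idle
	echo 50 > /sys/bus/pci/devices/<sbdf>/mdev/$uuid/weight
	# only after that does the user (libvirt/QEMU) open() the vfio device file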

> 
> > >>
> > >> For vGPUs like NVIDIA where we don't support multiple types
> > >> concurrently, this directory structure would update as mdev devices are
> > >> created, removing no longer available types.  I carried forward
> > >
> > > or keep the type with max_instances cleared to ZERO.
> > >
> >
> > +1 :)
> 
> Possible yes, but why would the vendor driver report types that the
> user cannot create?  It just seems like superfluous information (well,
> except for the use I discover below).

If we consider using available_instances as you suggested later, this way
is simpler since libvirt only needs to scan the available types once, without
needing to differentiate whether a specific vendor allows only one type or
multiple types. :-)

> 
> > >> max_instances here, but perhaps we really want to copy SR-IOV and
> > >> report a max and current allocation.  Creation and deletion is
> > >
> > > right, cur/max_instances look reasonable.
> > >
> > >> simplified as we can simply "echo $UUID > create" per type.  I don't
> > >> understand why destroy had a parameter list, so here I imagine we can
> > >> simply do the same... in fact, I'd actually rather see a "remove" sysfs
> > >> entry under each mdev device, so we remove it at the device rather than
> > >> in some central location (any objections?).
> > >
> > > OK to me.
> >
> > IIUC, "destroy" has a parameter list is only because the previous
> > $VM_UUID + instnace implementation. It should be safe to move the "destroy"
> > file under mdev now.
> >
> > >> We discussed how this might look with Intel devices which do allow
> > >> mixed vGPU types concurrently.  We believe, but need confirmation, that
> > >> the vendor driver could still make a finite set of supported types,
> > >> perhaps with additional module options to the vendor driver to enable
> > >> more "exotic" types.  So for instance if IGD vGPUs are based on
> > >> power-of-2 portions of the framebuffer size, then the vendor driver
> > >> could list types with 32MB, 64MB, 128MB, etc in useful and popular
> > >> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.
> > >
> > > Yes, Intel can do such type of definition. One thing I'm not sure is
> > > about impact cross listed types, i.e. when creating a new instance
> > > under a given type, max_instances under other types would be
> > > dynamically decremented based on available resource. Would it be
> > > a problem for libvirt or upper level stack, since a natural interpretation
> > > of max_instances should be a static number?
> > >
> > > An alternative is to make max_instances configurable, so libvirt has
> > > chance to define a pool of available instances with different types
> > > before creating any instance. For example, initially IGD driver may
> > > report max_instances only for a minimal sharing granularity:
> > > 	128MB:
> > > 		max_instances (8)
> > > 	256MB:
> > > 		max_instances (0)
> > > 	512MB:
> > > 		max_instances (0)
> > >
> > > Then libvirt can configure more types as:
> > > 	128MB:
> > > 		max_instances (2)
> > > 	256MB:
> > > 		max_instances (1)
> > > 	512MB:
> > > 		max_instances (1)
> > >
> > > Starting from this point, max_instances would be static and then
> > > mdev instance can be created under each type. But I'm not
> > > sure whether such additional configuration role is reasonable to libvirt...
> 
> My expectation of your example, where I'm assuming you have 1G of total
> memory that can be divided between the mdev devices would be:
> 
>  128M: 8
>  256M: 4
>  512M: 2
> 
> If a 512M mdev device is created, this becomes:
> 
>  128M: 4
>  256M: 2
>  512M: 1
> 
> Creating a 128M mdev device from that becomes:
> 
>  128M: 3
>  256M: 1
>  512M: 0
> 
> It's not great, but I don't know how to do it better without the user
> having a clear understanding of the algorithm and resources required
> for each mdev device.  For instance, the size here, presumably the
> framebuffer size, is just one attribute in the device directory, the
> user won't know that this attribute is the key to the available
> instances.

The above is just one example. We may provide types described as
"small", "medium" and "large", each with a description of the available
resources, like framebuffer size, default weight, etc. But the
rationale is the same: creating an instance under one type may impact
the available instances under other types.

> 
> I don't particularly like the idea of a writeable max_instances, the
> user can simply create instances of the type and see the results.
> 
> Just thought of another thing; do we need some way to determine the
> type of an mdev device from sysfs or is this implicit knowledge for the
> user that created the device?  For instance, we create a 512M device
> and it becomes a child device to the parent, so we can associate to the
> parent, but if we come back later, how do we know it's a 512M device?
> Perhaps this is a reason to keep the type directories around and we can
> cross link the device to the type and create a devices subdirectory
> under each type.  

yes, we can have a hierarchy like below:

  	/sys/bus/pci/devices/<sbdf>/mdev/
  	|-- uuid1/
  	|   `-- type (->/sys/bus/pci/devices/<sbdf>/types/12)
  	`-- uuid2/
  	    `-- type (->/sys/bus/pci/devices/<sbdf>/types/13)

	/sys/bus/pci/devices/<sbdf>/types/12/
  	|-- create
  	|-- description
  	|-- available_instances
  	|-- devices
  	    `-- uuid1 (->/sys/bus/pci/devices/<sbdf>/mdev/uuid1)

	/sys/bus/pci/devices/<sbdf>/types/13/
  	|-- create
  	|-- description
  	|-- available_instances
  	|-- devices
  	    `-- uuid2 (->/sys/bus/pci/devices/<sbdf>/mdev/uuid2)

> Perhaps then "max_instances" becomes
> "available_instances" (ie. how many left we can create) and we don't
> need a "current_instances" because we can simply look in the devices
> directory.

It's a nice idea.
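
For example, with the layout sketched above, management software could probe
and cross-reference devices roughly like this (the paths follow the proposed
hierarchy and are not implemented anywhere yet):

	# how many more instances of type 12 can still be created?
	cat /sys/bus/pci/devices/<sbdf>/types/12/available_instances
	# create one, then find it again later through the cross links
	echo "$uuid" > /sys/bus/pci/devices/<sbdf>/types/12/create
	ls /sys/bus/pci/devices/<sbdf>/types/12/devices/
	readlink /sys/bus/pci/devices/<sbdf>/mdev/$uuid/type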

> 
> > >>
> > >> We still don't have any way for the admin to learn in advance how the
> > >> available supported types will change once mdev devices start to be
> > >> created.  I'm not sure how we can create a specification for this, so
> > >> probing by creating devices may be the most flexible model.
> > >>
> > >> The other issue is the start/stop requirement, which was revealed to
> > >> setup peer-to-peer resources between vGPUs which is a limited hardware
> > >> resource.  We'd really like to have these happen automatically on the
> > >> first open of a vfio mdev device file and final release.  So we
> > >> brainstormed how the open/release callbacks could know the other mdev
> > >> devices for a given user.  This is where the instance number came into
> > >> play previously.  This is an area that needs work.
> > >
> > > IGD doesn't have such peer-to-peer resource setup requirement. So
> > > it's sufficient to create/destroy a mdev instance in a single action on
> > > IGD. However I'd expect we still keep the "start/stop" interface (
> > > maybe not exposed as sysfs node, instead being a VFIO API), as
> > > required to support future live migration usage. We've made prototype
> > > working for KVMGT today.
> 
> Great!
> 
> > It's good for the framework to define start/stop interfaces, but as Alex
> > said below, it should be MDEV oriented, not VM oriented.
> >
> > I don't know a lot about the peer-to-peer resource, but to me, although
> > VM_UUID + instance is not applicable, userspace can always achieve the
> > same purpose by, let us assume a mdev hierarchy, providing the VM UUID
> > under every mdev:
> >
> > 	/sys/bus/pci/devices/<sbdf>/mdev/
> > 	|-- mdev01/
> > 	|   `-- vm_uuid
> > 	`-- mdev02/
> > 	    `-- vm_uuid
> >
> > Did I miss something?
> 
> Sure, this is just another way of doing UUID+instance.  Nit, it might
> look more like:
> 
>  	/sys/bus/pci/devices/<sbdf>/mdev/
>  	|-- uuid1/
>  	|   `-- group_uuid
>  	`-- uuid2/
>  	    `-- group_uuid
> 
> Where each mdev device is actually referenced by its UUID name then
> we'd have some writable attribute under the device where mdev devices
> sharing the same group UUID are handled together.  There's a problem
> here though that vfio doesn't know about this level of grouping, so
> uuid1 and uuid2 could actually be given to different users despite the
> grouping here, which results in one or both devices not working or
> creating security issues.  That sort of implies that this would
> necessarily need to be exposed as iommu grouping.  This factors into why
> it seems like a good idea to make the start/stop implicit within the
> interface.  In that way each mdev device is fungible as far as a user
> like libvirt is concerned, internal details like peer-to-peer resources
> are handled automatically as the devices are accessed.

Such group knowledge comes from the user. I'm not sure whether the IOMMU
group logic allows the user to create/define groups today. Would it be better
to just create an mdev group concept within the VFIO scope?

  	/sys/bus/pci/devices/<sbdf>/mdev/
  	|-- uuid1/
  	|   `-- group_uuid0
  	`-- uuid2/
  	    `-- group_uuid0

	/sys/bus/pci/devices/<sbdf>/mdev/groups/
  	|-- 0/
  	|   `-- uuid1
  	    `-- uuid2

The user is expected to set up the group before opening any mdev instance
within it. This way it should be easy for VFIO to start all instances within
the same group upon the first open() in that group.
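
As a purely illustrative sketch (none of these attribute names exist yet),
the usage could look like:

	# put both instances into the same group before either is opened
	echo 0 > /sys/bus/pci/devices/<sbdf>/mdev/uuid1/group
	echo 0 > /sys/bus/pci/devices/<sbdf>/mdev/uuid2/group
	# the first open() of any mdev in group 0 then starts the whole group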

Thanks
Kevin

^ permalink raw reply	[flat|nested] 162+ messages in thread

* RE: [PATCH v7 0/4] Add Mediated device support
  2016-08-31 15:48         ` [Qemu-devel] " Alex Williamson
@ 2016-09-01  4:10           ` Tian, Kevin
  -1 siblings, 0 replies; 162+ messages in thread
From: Tian, Kevin @ 2016-09-01  4:10 UTC (permalink / raw)
  To: Alex Williamson, Song, Jike
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	bjsdjshi, libvir-list, Daniel P. Berrange, Laine Stump

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, August 31, 2016 11:49 PM
> 
> > >
> > > IGD doesn't have such peer-to-peer resource setup requirement. So
> > > it's sufficient to create/destroy a mdev instance in a single action on
> > > IGD. However I'd expect we still keep the "start/stop" interface (
> > > maybe not exposed as sysfs node, instead being a VFIO API), as
> > > required to support future live migration usage. We've made prototype
> > > working for KVMGT today.
> 
> Great!
> 

btw here is a link to KVMGT live migration demo:

https://www.youtube.com/watch?v=y2SkU5JODIY

Thanks
Kevin

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-08-31  6:12     ` [Qemu-devel] " Tian, Kevin
@ 2016-09-01 16:47     ` Michal Privoznik
  2016-09-01 16:59         ` [Qemu-devel] " Alex Williamson
  -1 siblings, 1 reply; 162+ messages in thread
From: Michal Privoznik @ 2016-09-01 16:47 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Kirti Wankhede
  Cc: Song, Jike, cjia, kvm, libvir-list, qemu-devel, kraxel,
	Laine Stump, pbonzini, bjsdjshi

On 31.08.2016 08:12, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Wednesday, August 31, 2016 12:17 AM
>>
>> Hi folks,
>>
>> At KVM Forum we had a BoF session primarily around the mediated device
>> sysfs interface.  I'd like to share what I think we agreed on and the
>> "problem areas" that still need some work so we can get the thoughts
>> and ideas from those who weren't able to attend.
>>
>> DanPB expressed some concern about the mdev_supported_types sysfs
>> interface, which exposes a flat csv file with fields like "type",
>> "number of instance", "vendor string", and then a bunch of type
>> specific fields like "framebuffer size", "resolution", "frame rate
>> limit", etc.  This is not entirely machine parsing friendly and sort of
>> abuses the sysfs concept of one value per file.  Example output taken
>> from Neo's libvirt RFC:
>>
>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>> max_resolution
>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>
>> The create/destroy then looks like this:
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_create
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_destroy
>>
>> "vendor_specific_argument_list" is nebulous.
>>
>> So the idea to fix this is to explode this into a directory structure,
>> something like:
>>
>> ├── mdev_destroy
>> └── mdev_supported_types
>>     ├── 11
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     ├── 12
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     └── 13
>>         ├── create
>>         ├── description
>>         └── max_instances
>>
>> Note that I'm only exposing the minimal attributes here for simplicity,
>> the other attributes would be included in separate files and we would
>> require vendors to create standard attributes for common device classes.
> 
> I like this idea. All standard attributes are reflected into this hierarchy.
> In the meantime, can we still allow optional vendor string in create 
> interface? libvirt doesn't need to know the meaning, but allows upper
> layer to do some vendor specific tweak if necessary.

This is not the best idea IMO. Libvirt is there to shadow differences
between hypervisors. While doing that, we often hide differences between
various types of HW too. Therefore, in order to provide a good abstraction,
we should make the vendor-specific string as small as possible (ideally an
empty string). I mean, I see it as a bad idea to expose "vgpu_type_id" from
the example above in the domain XML. What I think is a better idea is to let
users choose the resolution and framebuffer size, e.g.: <video
resolution="1024x768" framebuffer="16"/> (just the first idea that came
to my mind while writing this e-mail). The point is, the XML part is
completely free of any vendor-specific knobs.

Michal

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 0/4] Add Mediated device support
  2016-09-01 16:47     ` Michal Privoznik
@ 2016-09-01 16:59         ` Alex Williamson
  0 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-01 16:59 UTC (permalink / raw)
  To: Michal Privoznik
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	Kirti Wankhede, kraxel, Laine Stump, pbonzini, bjsdjshi

On Thu, 1 Sep 2016 18:47:06 +0200
Michal Privoznik <mprivozn@redhat.com> wrote:

> On 31.08.2016 08:12, Tian, Kevin wrote:
> >> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> >> Sent: Wednesday, August 31, 2016 12:17 AM
> >>
> >> Hi folks,
> >>
> >> At KVM Forum we had a BoF session primarily around the mediated device
> >> sysfs interface.  I'd like to share what I think we agreed on and the
> >> "problem areas" that still need some work so we can get the thoughts
> >> and ideas from those who weren't able to attend.
> >>
> >> DanPB expressed some concern about the mdev_supported_types sysfs
> >> interface, which exposes a flat csv file with fields like "type",
> >> "number of instance", "vendor string", and then a bunch of type
> >> specific fields like "framebuffer size", "resolution", "frame rate
> >> limit", etc.  This is not entirely machine parsing friendly and sort of
> >> abuses the sysfs concept of one value per file.  Example output taken
> >> from Neo's libvirt RFC:
> >>
> >> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> >> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> >> max_resolution
> >> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> >> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> >> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> >> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> >> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> >> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> >> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> >> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> >>
> >> The create/destroy then looks like this:
> >>
> >> echo "$mdev_UUID:vendor_specific_argument_list" >
> >> 	/sys/bus/pci/devices/.../mdev_create
> >>
> >> echo "$mdev_UUID:vendor_specific_argument_list" >
> >> 	/sys/bus/pci/devices/.../mdev_destroy
> >>
> >> "vendor_specific_argument_list" is nebulous.
> >>
> >> So the idea to fix this is to explode this into a directory structure,
> >> something like:
> >>
> >> ├── mdev_destroy
> >> └── mdev_supported_types
> >>     ├── 11
> >>     │   ├── create
> >>     │   ├── description
> >>     │   └── max_instances
> >>     ├── 12
> >>     │   ├── create
> >>     │   ├── description
> >>     │   └── max_instances
> >>     └── 13
> >>         ├── create
> >>         ├── description
> >>         └── max_instances
> >>
> >> Note that I'm only exposing the minimal attributes here for simplicity,
> >> the other attributes would be included in separate files and we would
> >> require vendors to create standard attributes for common device classes.  
> > 
> > I like this idea. All standard attributes are reflected into this hierarchy.
> > In the meantime, can we still allow optional vendor string in create 
> > interface? libvirt doesn't need to know the meaning, but allows upper
> > layer to do some vendor specific tweak if necessary.  
> 
> This is not the best idea IMO. Libvirt is there to shadow differences
> between hypervisors. While doing that, we often hide differences between
> various types of HW too. Therefore in order to provide good abstraction
> we should make vendor specific string as small as possible (ideally an
> empty string). I mean I see it as bad idea to expose "vgpu_type_id" from
> example above in domain XML. What I think the better idea is if we let
> users chose resolution and frame buffer size, e.g.: <video
> resolution="1024x768" framebuffer="16"/> (just the first idea that came
> to my mind while writing this e-mail). The point is, XML part is
> completely free of any vendor-specific knobs.

That's not really what you want though, a user actually cares whether
they get an Intel or NVIDIA vGPU, we can't specify it as just a
resolution and framebuffer size.  The user also doesn't want the model
changing each time the VM is started, so not only do you *need* to know
the vendor, you need to know the vendor model.  This is the only way to
provide a consistent VM.  So as we discussed at the BoF, the libvirt
xml will likely reference the vendor string, which will be a unique
identifier that encompasses all the additional attributes we expose.
Really the goal of the attributes is simply so you don't need a per
vendor magic decoder ring to figure out the basic features of a given
vendor string.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 0/4] Add Mediated device support
  2016-08-31 15:48         ` [Qemu-devel] " Alex Williamson
@ 2016-09-01 18:22           ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-01 18:22 UTC (permalink / raw)
  To: Alex Williamson, Jike Song
  Cc: Tian, Kevin, cjia, kvm, libvir-list, qemu-devel, kraxel,
	Laine Stump, pbonzini, bjsdjshi


Alex,
Thanks for summarizing the discussion.

On 8/31/2016 9:18 PM, Alex Williamson wrote:
> On Wed, 31 Aug 2016 15:04:13 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 08/31/2016 02:12 PM, Tian, Kevin wrote:
>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>
>>>> Hi folks,
>>>>
>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>> sysfs interface.  I'd like to share what I think we agreed on and the
>>>> "problem areas" that still need some work so we can get the thoughts
>>>> and ideas from those who weren't able to attend.
>>>>
>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>> interface, which exposes a flat csv file with fields like "type",
>>>> "number of instance", "vendor string", and then a bunch of type
>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>> limit", etc.  This is not entirely machine parsing friendly and sort of
>>>> abuses the sysfs concept of one value per file.  Example output taken
>>>> from Neo's libvirt RFC:
>>>>
>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>>>> max_resolution
>>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>>>
>>>> The create/destroy then looks like this:
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_create
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_destroy
>>>>
>>>> "vendor_specific_argument_list" is nebulous.
>>>>
>>>> So the idea to fix this is to explode this into a directory structure,
>>>> something like:
>>>>
>>>> ├── mdev_destroy
>>>> └── mdev_supported_types
>>>>     ├── 11
>>>>     │   ├── create
>>>>     │   ├── description
>>>>     │   └── max_instances
>>>>     ├── 12
>>>>     │   ├── create
>>>>     │   ├── description
>>>>     │   └── max_instances
>>>>     └── 13
>>>>         ├── create
>>>>         ├── description
>>>>         └── max_instances
>>>>
>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>> the other attributes would be included in separate files and we would
>>>> require vendors to create standard attributes for common device classes.  
>>>
>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>> In the meantime, can we still allow optional vendor string in create 
>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>> layer to do some vendor specific tweak if necessary.
>>>   
>>
>> Not sure whether this can be done within the MDEV framework (attrs provided by
>> vendor driver of course), or must be within the vendor driver.
> 
> The purpose of the sub-directories is that libvirt doesn't need to pass
> arbitrary, vendor strings to the create function, the attributes of the
> mdev device created are defined by the attributes in the sysfs
> directory where the create is done.  The user only provides a uuid for
> the device.  Arbitrary vendor parameters are a barrier, libvirt may not
> need to know the meaning, but would need to know when to apply them,
> which is just as bad.  Ultimately we want libvirt to be able to
> interact with sysfs without having any vendor-specific knowledge.
> 

The above directory hierarchy looks fine to me. Along with the fixed set of
parameters, an optional field for extra parameters is also required. Such
parameters are needed for some specific testing or for running benchmarks,
for example to disable FRL (frame rate limiter) or to disable console VNC
when it is not required. Libvirt doesn't need to know the details; it's just
a string that the user can provide, and libvirt needs to pass that string
as-is to the vendor driver, which would act accordingly.
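
For illustration only (the option names and the syntax below are made up),
one possible shape is an opaque string appended to the create write, which
libvirt forwards verbatim and only the vendor driver parses:

	echo "$uuid,frl_enable=0,console_vnc=0" > \
		/sys/bus/pci/devices/<sbdf>/mdev_supported_types/11/create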

>>>>
>>>> For vGPUs like NVIDIA where we don't support multiple types
>>>> concurrently, this directory structure would update as mdev devices are
>>>> created, removing no longer available types.  I carried forward  
>>>
>>> or keep the type with max_instances cleared to ZERO.
>>>  
>>
>> +1 :)
> 
> Possible yes, but why would the vendor driver report types that the
> user cannot create?  It just seems like superfluous information (well,
> except for the use I discover below).
> 

The directory structure for a physical GPU will be defined when the device
is registered with the mdev module. It would be simpler to change the
creatable instance count, i.e. for the types which can't be created the
creatable instance count would be set to 0.


>>>> max_instances here, but perhaps we really want to copy SR-IOV and
>>>> report a max and current allocation.  Creation and deletion is  
>>>
>>> right, cur/max_instances look reasonable.
>>>   
>>>> simplified as we can simply "echo $UUID > create" per type.  I don't
>>>> understand why destroy had a parameter list, so here I imagine we can
>>>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>>>> entry under each mdev device, so we remove it at the device rather than
>>>> in some central location (any objections?).  
>>>
>>> OK to me.   
>>
>> IIUC, "destroy" has a parameter list is only because the previous
>> $VM_UUID + instnace implementation. It should be safe to move the "destroy"
>> file under mdev now.
>>

Sorry if that was there in the libvirt discussion, but "destroy" doesn't need
extra parameters. Yes, it could be moved to the mdev device directory.

>>>> We discussed how this might look with Intel devices which do allow
>>>> mixed vGPU types concurrently.  We believe, but need confirmation, that
>>>> the vendor driver could still make a finite set of supported types,
>>>> perhaps with additional module options to the vendor driver to enable
>>>> more "exotic" types.  So for instance if IGD vGPUs are based on
>>>> power-of-2 portions of the framebuffer size, then the vendor driver
>>>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>>>> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.  
>>>
>>> Yes, Intel can do such type of definition. One thing I'm not sure is 
>>> about impact cross listed types, i.e. when creating a new instance
>>> under a given type, max_instances under other types would be 
>>> dynamically decremented based on available resource. Would it be
>>> a problem for libvirt or upper level stack, since a natural interpretation
>>> of max_instances should be a static number?
>>>
>>> An alternative is to make max_instances configurable, so libvirt has
>>> chance to define a pool of available instances with different types
>>> before creating any instance. For example, initially IGD driver may 
>>> report max_instances only for a minimal sharing granularity:
>>> 	128MB:
>>> 		max_instances (8)
>>> 	256MB:
>>> 		max_instances (0)
>>> 	512MB:
>>> 		max_instances (0)
>>>
>>> Then libvirt can configure more types as:
>>> 	128MB:
>>> 		max_instances (2)
>>> 	256MB:
>>> 		max_instances (1)
>>> 	512MB:
>>> 		max_instances (1)
>>>
>>> Starting from this point, max_instances would be static and then
>>> mdev instance can be created under each type. But I'm not
>>> sure whether such additional configuration role is reasonable to libvirt...  
> 
> My expectation of your example, where I'm assuming you have 1G of total
> memory that can be divided between the mdev devices would be:
> 
>  128M: 8
>  256M: 4
>  512M: 2
> 
> If a 512M mdev device is created, this becomes:
> 
>  128M: 4
>  256M: 2
>  512M: 1
> 
> Creating a 128M mdev device from that becomes:
> 
>  128M: 3
>  256M: 1
>  512M: 0
> 
> It's not great, but I don't know how to do it better without the user
> having a clear understanding of the algorithm and resources required
> for each mdev device.  For instance, the size here, presumably the
> framebuffer size, is just one attribute in the device directory, the
> user won't know that this attribute is the key to the available
> instances.
> 
> I don't particularly like the idea of a writeable max_instances, the
> user can simply create instances of the type and see the results.
> 
> Just thought of another thing; do we need some way to determine the
> type of an mdev device from sysfs or is this implicit knowledge for the
> user that created the device?  For instance, we create a 512M device
> and it becomes a child device to the parent, so we can associate to the
> parent, but if we come back later, how do we know it's a 512M device?
> Perhaps this is a reason to keep the type directories around and we can
> cross link the device to the type and create a devices subdirectory
> under each type.  Perhaps then "max_instances" becomes
> "available_instances" (ie. how many left we can create) and we don't
> need a "current_instances" because we can simply look in the devices
> directory.
> 

When the mdev module creates an mdev device in mdev_device_create() in this
patch, 'mdev->dev.parent' is assigned to its parent physical device, so
device_register() creates the child's directory inside the parent's
directory; the directory for the mdev device is not explicitly created. So I
don't think we can move this directory under the type directory, but we can
think of adding a link to the type directory from the mdev device's
directory.

>>>>
>>>> We still don't have any way for the admin to learn in advance how the
>>>> available supported types will change once mdev devices start to be
>>>> created.  I'm not sure how we can create a specification for this, so
>>>> probing by creating devices may be the most flexible model.
>>>>

Removing a type directory dynamically seems difficult. So the other way, as
suggested here, is that when a type is not supported the vendor driver can
set max_instances to 0.

>>>> The other issue is the start/stop requirement, which was revealed to
>>>> setup peer-to-peer resources between vGPUs which is a limited hardware
>>>> resource.  We'd really like to have these happen automatically on the
>>>> first open of a vfio mdev device file and final release.  So we
>>>> brainstormed how the open/release callbacks could know the other mdev
>>>> devices for a given user.  This is where the instance number came into
>>>> play previously.  This is an area that needs work.  
>>>
>>> IGD doesn't have such peer-to-peer resource setup requirement. So
>>> it's sufficient to create/destroy a mdev instance in a single action on
>>> IGD. However I'd expect we still keep the "start/stop" interface (
>>> maybe not exposed as sysfs node, instead being a VFIO API), as 
>>> required to support future live migration usage. We've made prototype
>>> working for KVMGT today.  
> 
> Great!
> 

In this v7 version of the patch, I made changes that introduce 'online'
in the mdev device directory, as discussed in the v6 review. We need this to
commit resources for the device(s).
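
Usage is then simply (assuming the attribute takes 1/0 and sits in the mdev
device's sysfs directory under the parent, as in this series):

	# commit resources and start the mdev device
	echo 1 > /sys/bus/pci/devices/<sbdf>/<mdev_uuid>/online
	# stop it and release the resources again
	echo 0 > /sys/bus/pci/devices/<sbdf>/<mdev_uuid>/online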


>> It's good for the framework to define start/stop interfaces, but as Alex
>> said below, it should be MDEV oriented, not VM oriented.
>>
>> I don't know a lot about the peer-to-peer resource, but to me, although
>> VM_UUID + instance is not applicable, userspace can always achieve the
>> same purpose by, let us assume a mdev hierarchy, providing the VM UUID
>> under every mdev:
>>
>> 	/sys/bus/pci/devices/<sbdf>/mdev/
>> 	|-- mdev01/
>> 	|   `-- vm_uuid
>> 	`-- mdev02/
>> 	    `-- vm_uuid
>>
>> Did I miss something?
> 
> Sure, this is just another way of doing UUID+instance.  Nit, it might
> look more like:
> 
>  	/sys/bus/pci/devices/<sbdf>/mdev/
>  	|-- uuid1/
>  	|   `-- group_uuid
>  	`-- uuid2/
>  	    `-- group_uuid
> 
> Where each mdev device is actually referenced by its UUID name then
> we'd have some writable attribute under the device where mdev devices
> sharing the same group UUID are handled together.  

A group UUID would also work; as long as it's unique and set for all
devices in a group, it should work.

> There's a problem
> here though that vfio doesn't know about this level of grouping, so
> uuid1 and uuid2 could actually be given to different users despite the
> grouping here, which results in one or both devices not working or
> creating security issues.  That sort of implies that this would
> necessarily need to be exposed as iommu grouping.  This factors into why
> it seems like a good idea to make the start/stop implicit within the
> interface.  In that way each mdev device is fungible as far as a user
> like libvirt is concerned, internal details like peer-to-peer resources
> are handled automatically as the devices are accessed.
> 

I understand your concerns here. But making it implicit doesn't guarantee
that a device will not be accessed before all mdev devices are started.

>>>> There was a thought that perhaps on open() the vendor driver could look
>>>> at the user pid and use that to associate with other devices, but the
>>>> problem here is that we open and begin access to each device, so
>>>> devices do this discovery serially rather than in parallel as desired.
>>>> (we might not fault in mmio space yet though, so I wonder if open()
>>>> could set the association of mdev to pid, then the first mmio fault
>>>> would trigger the resource allocation?  Then all the "magic" would live
>>>> in the vendor driver.  open() could fail if the pid already has running
>>>> mdev devices and the vendor driver chooses not to support hotplug)
>>>>

The problem is that resources should be committed before any device is
accessed, not at the time of a fault in MMIO space.

>>>> One comment was that for a GPU that only supports homogeneous vGPUs,
>>>> libvirt may choose to create all the vGPUs in advance and handle them
>>>> as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
>>>> case.
>>>>
>>>> We also considered whether iommu groups could be (ab)used for this use
>>>> case, peer-to-peer would in fact be an iommu grouping constraint
>>>> afterall.  This would have the same UUID+instance constraint as above
>>>> though and would require some sort of sysfs interface for the user to
>>>> be able to create multiple mdevs within a group.
>>>>
>>>> Everyone was given homework to think about this on their flights home,
>>>> so I expect plenty of ideas by now ;)
>>>>
>>>> Overall I think mediated devices were well received by the community,
>>>> so let's keep up the development and discussion to bring it to
>>>> fruition.  Thanks,  
>>>
>>> Thanks a lot Alex for your help on driving this discussion. Mediated device
>>> technique has the potential to be used for other type of I/O virtualizations
>>> in the future, not limited to GPU virtualization. So getting the core framework
>>> ready earlier would be highly welcomed. :-)
> 
> I agree, there's lots of potential and it's extra incentive to create
> an interface that's going to make sense long term.  Ideally we only
> need to create the kernel and libvirt infrastructure once and we
> can handle any type of mediated driver.  Thanks,
>

Yes, I agree too. This framework has evolved a lot and is taking good
shape now. I hope we settle on the kernel and libvirt interfaces soon
and get this working :). Thanks for your support and guidance.

Thanks,
Kirti.


> Alex
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
@ 2016-09-01 18:22           ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-01 18:22 UTC (permalink / raw)
  To: Alex Williamson, Jike Song
  Cc: Tian, Kevin, pbonzini, kraxel, cjia, qemu-devel, kvm, bjsdjshi,
	libvir-list, Daniel P. Berrange, Laine Stump


Alex,
Thanks for summarizing the discussion.

On 8/31/2016 9:18 PM, Alex Williamson wrote:
> On Wed, 31 Aug 2016 15:04:13 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 08/31/2016 02:12 PM, Tian, Kevin wrote:
>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>
>>>> Hi folks,
>>>>
>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>> sysfs interface.  I'd like to share what I think we agreed on and the
>>>> "problem areas" that still need some work so we can get the thoughts
>>>> and ideas from those who weren't able to attend.
>>>>
>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>> interface, which exposes a flat csv file with fields like "type",
>>>> "number of instance", "vendor string", and then a bunch of type
>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>> limit", etc.  This is not entirely machine parsing friendly and sort of
>>>> abuses the sysfs concept of one value per file.  Example output taken
>>>> from Neo's libvirt RFC:
>>>>
>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>>>> max_resolution
>>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>>>
>>>> The create/destroy then looks like this:
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_create
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_destroy
>>>>
>>>> "vendor_specific_argument_list" is nebulous.
>>>>
>>>> So the idea to fix this is to explode this into a directory structure,
>>>> something like:
>>>>
>>>> ├── mdev_destroy
>>>> └── mdev_supported_types
>>>>     ├── 11
>>>>     │   ├── create
>>>>     │   ├── description
>>>>     │   └── max_instances
>>>>     ├── 12
>>>>     │   ├── create
>>>>     │   ├── description
>>>>     │   └── max_instances
>>>>     └── 13
>>>>         ├── create
>>>>         ├── description
>>>>         └── max_instances
>>>>
>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>> the other attributes would be included in separate files and we would
>>>> require vendors to create standard attributes for common device classes.  
>>>
>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>> In the meantime, can we still allow optional vendor string in create 
>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>> layer to do some vendor specific tweak if necessary.
>>>   
>>
>> Not sure whether this can done within MDEV framework (attrs provided by
>> vendor driver of course), or must be within the vendor driver.
> 
> The purpose of the sub-directories is that libvirt doesn't need to pass
> arbitrary, vendor strings to the create function, the attributes of the
> mdev device created are defined by the attributes in the sysfs
> directory where the create is done.  The user only provides a uuid for
> the device.  Arbitrary vendor parameters are a barrier, libvirt may not
> need to know the meaning, but would need to know when to apply them,
> which is just as bad.  Ultimately we want libvirt to be able to
> interact with sysfs without having an vendor specific knowledge.
> 

The above directory hierarchy looks fine to me. Along with the fixed set
of parameters, an optional extra-parameter field is also required. Such
parameters are needed for specific testing or for running benchmarks,
for example to disable FRL (the frame rate limiter) or to disable the
console VNC when it isn't required. Libvirt doesn't need to know the
details; it's just a string that the user provides and that libvirt
passes through as-is to the vendor driver, which then acts on it
accordingly.
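For illustration only, a minimal sketch of what such a pass-through could
look like, reusing the per-type "create" file from the directory hierarchy
quoted above; the ':' separator and the key names (frl, console_vnc) are
purely hypothetical and only meant to show that libvirt would forward the
string untouched:

echo "$UUID:frl=off,console_vnc=off" > \
	/sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create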

>>>>
>>>> For vGPUs like NVIDIA where we don't support multiple types
>>>> concurrently, this directory structure would update as mdev devices are
>>>> created, removing no longer available types.  I carried forward  
>>>
>>> or keep the type with max_instances cleared to ZERO.
>>>  
>>
>> +1 :)
> 
> Possible yes, but why would the vendor driver report types that the
> user cannot create?  It just seems like superfluous information (well,
> except for the use I discover below).
> 

The directory structure for a physical GPU is defined when the device is
registered with the mdev module. It would be simpler to change the
creatable instance count, i.e. for the types that cannot currently be
created, the creatable instance count would be set to 0.
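As a hedged illustration (the path and the max_instances name follow the
directory sketch quoted above; whether it ends up being called
available_instances is discussed later in this thread), a type that cannot
currently be created would simply read back zero:

cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/18/max_instances
0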


>>>> max_instances here, but perhaps we really want to copy SR-IOV and
>>>> report a max and current allocation.  Creation and deletion is  
>>>
>>> right, cur/max_instances look reasonable.
>>>   
>>>> simplified as we can simply "echo $UUID > create" per type.  I don't
>>>> understand why destroy had a parameter list, so here I imagine we can
>>>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>>>> entry under each mdev device, so we remove it at the device rather than
>>>> in some central location (any objections?).  
>>>
>>> OK to me.   
>>
>> IIUC, "destroy" has a parameter list is only because the previous
>> $VM_UUID + instnace implementation. It should be safe to move the "destroy"
>> file under mdev now.
>>

Sorry if that came from the libvirt discussion, but "destroy" doesn't need
extra parameters. Yes, it could be moved to the mdev device directory.
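A minimal sketch of what per-device removal could look like once it moves
there, assuming the mdev child directory sits directly under the parent
device (as described later in this mail) and a write-1-to-remove convention
like other sysfs remove attributes; none of this is settled yet:

echo 1 > /sys/bus/pci/devices/0000:86:00.0/$UUID/remove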

>>>> We discussed how this might look with Intel devices which do allow
>>>> mixed vGPU types concurrently.  We believe, but need confirmation, that
>>>> the vendor driver could still make a finite set of supported types,
>>>> perhaps with additional module options to the vendor driver to enable
>>>> more "exotic" types.  So for instance if IGD vGPUs are based on
>>>> power-of-2 portions of the framebuffer size, then the vendor driver
>>>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>>>> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.  
>>>
>>> Yes, Intel can do such type of definition. One thing I'm not sure is 
>>> about impact cross listed types, i.e. when creating a new instance
>>> under a given type, max_instances under other types would be 
>>> dynamically decremented based on available resource. Would it be
>>> a problem for libvirt or upper level stack, since a natural interpretation
>>> of max_instances should be a static number?
>>>
>>> An alternative is to make max_instances configurable, so libvirt has
>>> chance to define a pool of available instances with different types
>>> before creating any instance. For example, initially IGD driver may 
>>> report max_instances only for a minimal sharing granularity:
>>> 	128MB:
>>> 		max_instances (8)
>>> 	256MB:
>>> 		max_instances (0)
>>> 	512MB:
>>> 		max_instances (0)
>>>
>>> Then libvirt can configure more types as:
>>> 	128MB:
>>> 		max_instances (2)
>>> 	256MB:
>>> 		max_instances (1)
>>> 	512MB:
>>> 		max_instances (1)
>>>
>>> Starting from this point, max_instances would be static and then
>>> mdev instance can be created under each type. But I'm not
>>> sure whether such additional configuration role is reasonable to libvirt...  
> 
> My expectation of your example, where I'm assuming you have 1G of total
> memory that can be divided between the mdev devices would be:
> 
>  128M: 8
>  256M: 4
>  512M: 2
> 
> If a 512M mdev device is created, this becomes:
> 
>  128M: 4
>  256M: 2
>  512M: 1
> 
> Creating a 128M mdev device from that becomes:
> 
>  128M: 3
>  256M: 1
>  512M: 0
> 
> It's not great, but I don't know how to do it better without the user
> having a clear understanding of the algorithm and resources required
> for each mdev device.  For instance, the size here, presumably the
> framebuffer size, is just one attribute in the device directory, the
> user won't know that this attribute is the key to the available
> instances.
> 
> I don't particularly like the idea of a writeable max_instances, the
> user can simply create instances of the type and see the results.
> 
> Just thought of another thing; do we need some way to determine the
> type of an mdev device from sysfs or is this implicit knowledge for the
> user that created the device?  For instance, we create a 512M device
> and it becomes a child device to the parent, so we can associate to the
> parent, but if we come back later, how do we know it's a 512M device?
> Perhaps this is a reason to keep the type directories around and we can
> cross link the device to the type and create a devices subdirectory
> under each type.  Perhaps then "max_instances" becomes
> "available_instances" (ie. how many left we can create) and we don't
> need a "current_instances" because we can simply look in the devices
> directory.
> 

When the mdev module creates an mdev device (mdev_device_create() in this
patch), 'mdev->dev.parent' is assigned the parent physical device, so
device_register() creates the child's directory inside the parent's
directory. The directory for the mdev device is not explicitly created,
so I don't think we can move it under the type directory. But we could
add a link from the mdev device's directory to the type directory.
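Purely as a sketch of that idea (names and exact layout are not settled),
the cross-links being discussed could end up looking something like:

/sys/bus/pci/devices/0000:86:00.0/
|-- mdev_supported_types/
|   `-- 11/
|       `-- devices/
|           `-- $UUID -> ../../../$UUID            (link added when the mdev is created)
`-- $UUID/
    `-- mdev_type -> ../mdev_supported_types/11    (back-link from the device to its type)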

>>>>
>>>> We still don't have any way for the admin to learn in advance how the
>>>> available supported types will change once mdev devices start to be
>>>> created.  I'm not sure how we can create a specification for this, so
>>>> probing by creating devices may be the most flexible model.
>>>>

Removing a type directory dynamically seems difficult. So, the other way
as suggested here: when a type is not currently supported, the vendor
driver can report max_instances as 0.

>>>> The other issue is the start/stop requirement, which was revealed to
>>>> setup peer-to-peer resources between vGPUs which is a limited hardware
>>>> resource.  We'd really like to have these happen automatically on the
>>>> first open of a vfio mdev device file and final release.  So we
>>>> brainstormed how the open/release callbacks could know the other mdev
>>>> devices for a given user.  This is where the instance number came into
>>>> play previously.  This is an area that needs work.  
>>>
>>> IGD doesn't have such peer-to-peer resource setup requirement. So
>>> it's sufficient to create/destroy a mdev instance in a single action on
>>> IGD. However I'd expect we still keep the "start/stop" interface (
>>> maybe not exposed as sysfs node, instead being a VFIO API), as 
>>> required to support future live migration usage. We've made prototype
>>> working for KVMGT today.  
> 
> Great!
> 

In this v7 version of the patch, I made changes that introduce 'online'
in the mdev device directory, as discussed in the v6 reviews. We need
this to commit resources for the device(s).
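For reference, a minimal sketch of driving that attribute, assuming it
behaves like an ordinary sysfs boolean (writing 1 commits the resources,
writing 0 releases them) and that the device directory lives under the
parent as described above:

echo 1 > /sys/bus/pci/devices/0000:86:00.0/$UUID/online     # commit resources
echo 0 > /sys/bus/pci/devices/0000:86:00.0/$UUID/online     # release resources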


>> It's good for the framework to define start/stop interfaces, but as Alex
>> said below, it should be MDEV oriented, not VM oriented.
>>
>> I don't know a lot about the peer-to-peer resource, but to me, although
>> VM_UUID + instance is not applicable, userspace can always achieve the
>> same purpose by, let us assume a mdev hierarchy, providing the VM UUID
>> under every mdev:
>>
>> 	/sys/bus/pci/devices/<sbdf>/mdev/
>> 	|-- mdev01/
>> 	|   `-- vm_uuid
>> 	`-- mdev02/
>> 	    `-- vm_uuid
>>
>> Did I miss something?
> 
> Sure, this is just another way of doing UUID+instance.  Nit, it might
> look more like:
> 
>  	/sys/bus/pci/devices/<sbdf>/mdev/
>  	|-- uuid1/
>  	|   `-- group_uuid
>  	`-- uuid2/
>  	    `-- group_uuid
> 
> Where each mdev device is actually referenced by its UUID name then
> we'd have some writable attribute under the device where mdev devices
> sharing the same group UUID are handled together.  

A group UUID would also work; as long as it's unique and set for all
devices in a group, it should work.

> There's a problem
> here though that vfio doesn't know about this level of grouping, so
> uuid1 and uuid2 could actually be given to different users despite the
> grouping here, which results in one or both devices not working or
> creating security issues.  That sort of implies that this would
> necessarily need to be exposed as iommu grouping.  This factors into why
> it seems like a good idea to make the start/stop implicit within the
> interface.  In that way each mdev device is fungible as far as a user
> like libvirt is concerned, internal details like peer-to-peer resources
> are handled automatically as the devices are accessed.
> 

I understand your concerns here. But making it implicit doesn't guarantee
that a device will not be accessed before all mdev devices are started.

>>>> There was a thought that perhaps on open() the vendor driver could look
>>>> at the user pid and use that to associate with other devices, but the
>>>> problem here is that we open and begin access to each device, so
>>>> devices do this discovery serially rather than in parallel as desired.
>>>> (we might not fault in mmio space yet though, so I wonder if open()
>>>> could set the association of mdev to pid, then the first mmio fault
>>>> would trigger the resource allocation?  Then all the "magic" would live
>>>> in the vendor driver.  open() could fail if the pid already has running
>>>> mdev devices and the vendor driver chooses not to support hotplug)
>>>>

The problem is that resources should be committed before any device is
accessed, not at the first fault on MMIO space.

>>>> One comment was that for a GPU that only supports homogeneous vGPUs,
>>>> libvirt may choose to create all the vGPUs in advance and handle them
>>>> as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
>>>> case.
>>>>
>>>> We also considered whether iommu groups could be (ab)used for this use
>>>> case, peer-to-peer would in fact be an iommu grouping constraint
>>>> afterall.  This would have the same UUID+instance constraint as above
>>>> though and would require some sort of sysfs interface for the user to
>>>> be able to create multiple mdevs within a group.
>>>>
>>>> Everyone was given homework to think about this on their flights home,
>>>> so I expect plenty of ideas by now ;)
>>>>
>>>> Overall I think mediated devices were well received by the community,
>>>> so let's keep up the development and discussion to bring it to
>>>> fruition.  Thanks,  
>>>
>>> Thanks a lot Alex for your help on driving this discussion. Mediated device
>>> technique has the potential to be used for other type of I/O virtualizations
>>> in the future, not limited to GPU virtualization. So getting the core framework
>>> ready earlier would be highly welcomed. :-)
> 
> I agree, there's lots of potential and it's extra incentive to create
> an interface that's going to make sense long term.  Ideally we only
> need to create the kernel and libvirt infrastructure once and we
> can handle any type of mediated driver.  Thanks,
>

Yes, I agree too. This framework has evolved a lot and is taking good
shape now. I hope we settle on the kernel and libvirt interfaces soon
and get this working :). Thanks for your support and guidance.

Thanks,
Kirti.


> Alex
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 0/4] Add Mediated device support
  2016-09-01 18:22           ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-01 20:01             ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-01 20:01 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Jike Song, cjia, kvm, libvir-list, qemu-devel, Tian, Kevin,
	kraxel, Laine Stump, pbonzini, bjsdjshi

On Thu, 1 Sep 2016 23:52:02 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Alex,
> Thanks for summarizing the discussion.
> 
> On 8/31/2016 9:18 PM, Alex Williamson wrote:
> > On Wed, 31 Aug 2016 15:04:13 +0800
> > Jike Song <jike.song@intel.com> wrote:
> >   
> >> On 08/31/2016 02:12 PM, Tian, Kevin wrote:  
> >>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> >>>> Sent: Wednesday, August 31, 2016 12:17 AM
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> At KVM Forum we had a BoF session primarily around the mediated device
> >>>> sysfs interface.  I'd like to share what I think we agreed on and the
> >>>> "problem areas" that still need some work so we can get the thoughts
> >>>> and ideas from those who weren't able to attend.
> >>>>
> >>>> DanPB expressed some concern about the mdev_supported_types sysfs
> >>>> interface, which exposes a flat csv file with fields like "type",
> >>>> "number of instance", "vendor string", and then a bunch of type
> >>>> specific fields like "framebuffer size", "resolution", "frame rate
> >>>> limit", etc.  This is not entirely machine parsing friendly and sort of
> >>>> abuses the sysfs concept of one value per file.  Example output taken
> >>>> from Neo's libvirt RFC:
> >>>>
> >>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> >>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> >>>> max_resolution
> >>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> >>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> >>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> >>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> >>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> >>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> >>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> >>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> >>>>
> >>>> The create/destroy then looks like this:
> >>>>
> >>>> echo "$mdev_UUID:vendor_specific_argument_list" >
> >>>> 	/sys/bus/pci/devices/.../mdev_create
> >>>>
> >>>> echo "$mdev_UUID:vendor_specific_argument_list" >
> >>>> 	/sys/bus/pci/devices/.../mdev_destroy
> >>>>
> >>>> "vendor_specific_argument_list" is nebulous.
> >>>>
> >>>> So the idea to fix this is to explode this into a directory structure,
> >>>> something like:
> >>>>
> >>>> ├── mdev_destroy
> >>>> └── mdev_supported_types
> >>>>     ├── 11
> >>>>     │   ├── create
> >>>>     │   ├── description
> >>>>     │   └── max_instances
> >>>>     ├── 12
> >>>>     │   ├── create
> >>>>     │   ├── description
> >>>>     │   └── max_instances
> >>>>     └── 13
> >>>>         ├── create
> >>>>         ├── description
> >>>>         └── max_instances
> >>>>
> >>>> Note that I'm only exposing the minimal attributes here for simplicity,
> >>>> the other attributes would be included in separate files and we would
> >>>> require vendors to create standard attributes for common device classes.    
> >>>
> >>> I like this idea. All standard attributes are reflected into this hierarchy.
> >>> In the meantime, can we still allow optional vendor string in create 
> >>> interface? libvirt doesn't need to know the meaning, but allows upper
> >>> layer to do some vendor specific tweak if necessary.
> >>>     
> >>
> >> Not sure whether this can done within MDEV framework (attrs provided by
> >> vendor driver of course), or must be within the vendor driver.  
> > 
> > The purpose of the sub-directories is that libvirt doesn't need to pass
> > arbitrary, vendor strings to the create function, the attributes of the
> > mdev device created are defined by the attributes in the sysfs
> > directory where the create is done.  The user only provides a uuid for
> > the device.  Arbitrary vendor parameters are a barrier, libvirt may not
> > need to know the meaning, but would need to know when to apply them,
> > which is just as bad.  Ultimately we want libvirt to be able to
> > interact with sysfs without having an vendor specific knowledge.
> >   
> 
> Above directory hierarchy looks fine to me. Along with the fixed set of
> parameter, a optional field of extra parameter is also required. Such
> parameters are required for some specific testing or running benchmarks,
> for example to disable FRL (framerate limiter) or to disable console vnc
> when not required. Libvirt don't need to know its details, its just a
> string that user can provide and libvirt need to pass the string as it
> is to vendor driver, vendor driver would act accordingly.

Wouldn't it make more sense to enable these through the vendor driver
which would then provide additional types through the sysfs interface
that could be selected by libvirt?  Or simply transparently change
these parameters within the existing types?  I think we really want to
get away from adding any sort of magic vendor strings.
 
> >>>>
> >>>> For vGPUs like NVIDIA where we don't support multiple types
> >>>> concurrently, this directory structure would update as mdev devices are
> >>>> created, removing no longer available types.  I carried forward    
> >>>
> >>> or keep the type with max_instances cleared to ZERO.
> >>>    
> >>
> >> +1 :)  
> > 
> > Possible yes, but why would the vendor driver report types that the
> > user cannot create?  It just seems like superfluous information (well,
> > except for the use I discover below).
> >   
> 
> The directory structure for a physical GPU will be defined when device
> is register to mdev module. It would be simpler to change creatable
> instance count i.e for the types which can't be created creatable
> instance count would be set to 0.
> 
> 
> >>>> max_instances here, but perhaps we really want to copy SR-IOV and
> >>>> report a max and current allocation.  Creation and deletion is    
> >>>
> >>> right, cur/max_instances look reasonable.
> >>>     
> >>>> simplified as we can simply "echo $UUID > create" per type.  I don't
> >>>> understand why destroy had a parameter list, so here I imagine we can
> >>>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
> >>>> entry under each mdev device, so we remove it at the device rather than
> >>>> in some central location (any objections?).    
> >>>
> >>> OK to me.     
> >>
> >> IIUC, "destroy" has a parameter list is only because the previous
> >> $VM_UUID + instnace implementation. It should be safe to move the "destroy"
> >> file under mdev now.
> >>  
> 
> Sorry if that was there in libvirt discussion, but "destroy" don't need
> extra parameters. Yes it could be moved to mdev device directory.
> 
> >>>> We discussed how this might look with Intel devices which do allow
> >>>> mixed vGPU types concurrently.  We believe, but need confirmation, that
> >>>> the vendor driver could still make a finite set of supported types,
> >>>> perhaps with additional module options to the vendor driver to enable
> >>>> more "exotic" types.  So for instance if IGD vGPUs are based on
> >>>> power-of-2 portions of the framebuffer size, then the vendor driver
> >>>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
> >>>> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.    
> >>>
> >>> Yes, Intel can do such type of definition. One thing I'm not sure is 
> >>> about impact cross listed types, i.e. when creating a new instance
> >>> under a given type, max_instances under other types would be 
> >>> dynamically decremented based on available resource. Would it be
> >>> a problem for libvirt or upper level stack, since a natural interpretation
> >>> of max_instances should be a static number?
> >>>
> >>> An alternative is to make max_instances configurable, so libvirt has
> >>> chance to define a pool of available instances with different types
> >>> before creating any instance. For example, initially IGD driver may 
> >>> report max_instances only for a minimal sharing granularity:
> >>> 	128MB:
> >>> 		max_instances (8)
> >>> 	256MB:
> >>> 		max_instances (0)
> >>> 	512MB:
> >>> 		max_instances (0)
> >>>
> >>> Then libvirt can configure more types as:
> >>> 	128MB:
> >>> 		max_instances (2)
> >>> 	256MB:
> >>> 		max_instances (1)
> >>> 	512MB:
> >>> 		max_instances (1)
> >>>
> >>> Starting from this point, max_instances would be static and then
> >>> mdev instance can be created under each type. But I'm not
> >>> sure whether such additional configuration role is reasonable to libvirt...    
> > 
> > My expectation of your example, where I'm assuming you have 1G of total
> > memory that can be divided between the mdev devices would be:
> > 
> >  128M: 8
> >  256M: 4
> >  512M: 2
> > 
> > If a 512M mdev device is created, this becomes:
> > 
> >  128M: 4
> >  256M: 2
> >  512M: 1
> > 
> > Creating a 128M mdev device from that becomes:
> > 
> >  128M: 3
> >  256M: 1
> >  512M: 0
> > 
> > It's not great, but I don't know how to do it better without the user
> > having a clear understanding of the algorithm and resources required
> > for each mdev device.  For instance, the size here, presumably the
> > framebuffer size, is just one attribute in the device directory, the
> > user won't know that this attribute is the key to the available
> > instances.
> > 
> > I don't particularly like the idea of a writeable max_instances, the
> > user can simply create instances of the type and see the results.
> > 
> > Just thought of another thing; do we need some way to determine the
> > type of an mdev device from sysfs or is this implicit knowledge for the
> > user that created the device?  For instance, we create a 512M device
> > and it becomes a child device to the parent, so we can associate to the
> > parent, but if we come back later, how do we know it's a 512M device?
> > Perhaps this is a reason to keep the type directories around and we can
> > cross link the device to the type and create a devices subdirectory
> > under each type.  Perhaps then "max_instances" becomes
> > "available_instances" (ie. how many left we can create) and we don't
> > need a "current_instances" because we can simply look in the devices
> > directory.
> >   
> 
> When mdev module creates mdev device, mdev_device_create() in patch,
> here 'mdev->dev.parent' is assigned as its parent physical device. So
> device_register() create child's directory inside parent's directory.
> Directory for mdev device is not explicitly created. So I don't think we
> can move this directory to type directory. But we can think of adding
> link to type directory from mdev device's directory.

Yes, the idea was only to add links, not to change anything about the
parent/child hierarchy in sysfs.  The result would be similar to how we
have /sys/kernel/iommu_groups/$GROUP/devices/ with links to the devices
contained within that group.
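For comparison, the existing iommu_groups layout that this would mirror is
a real interface today; the group number and the link target below are just
an illustrative example:

readlink /sys/kernel/iommu_groups/26/devices/0000:86:00.0
../../../../devices/pci0000:85/0000:85:08.0/0000:86:00.0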
 
> >>>>
> >>>> We still don't have any way for the admin to learn in advance how the
> >>>> available supported types will change once mdev devices start to be
> >>>> created.  I'm not sure how we can create a specification for this, so
> >>>> probing by creating devices may be the most flexible model.
> >>>>  
> 
> Removing type directory dynamically seems difficult. So the other way as
> suggested here, when that type is not supported, vendor driver can
> return max_instance to 0.

I'm ok with this; it seems like there are enough uses for it, and it's
necessary to keep the directory for the device links.
 
> >>>> The other issue is the start/stop requirement, which was revealed to
> >>>> setup peer-to-peer resources between vGPUs which is a limited hardware
> >>>> resource.  We'd really like to have these happen automatically on the
> >>>> first open of a vfio mdev device file and final release.  So we
> >>>> brainstormed how the open/release callbacks could know the other mdev
> >>>> devices for a given user.  This is where the instance number came into
> >>>> play previously.  This is an area that needs work.    
> >>>
> >>> IGD doesn't have such peer-to-peer resource setup requirement. So
> >>> it's sufficient to create/destroy a mdev instance in a single action on
> >>> IGD. However I'd expect we still keep the "start/stop" interface (
> >>> maybe not exposed as sysfs node, instead being a VFIO API), as 
> >>> required to support future live migration usage. We've made prototype
> >>> working for KVMGT today.    
> > 
> > Great!
> >   
> 
> In this v7 version of patch, I had made changes that introduce 'online'
> in mdev device directory as discussed in v6 reviews. We need this to
> commit resources for that device(s).

But if we have some number of mdev devices, each with just a UUID
identifier, how are separate online callbacks for each device
associated to a single peer-to-peer context?

> >> It's good for the framework to define start/stop interfaces, but as Alex
> >> said below, it should be MDEV oriented, not VM oriented.
> >>
> >> I don't know a lot about the peer-to-peer resource, but to me, although
> >> VM_UUID + instance is not applicable, userspace can always achieve the
> >> same purpose by, let us assume a mdev hierarchy, providing the VM UUID
> >> under every mdev:
> >>
> >> 	/sys/bus/pci/devices/<sbdf>/mdev/
> >> 	|-- mdev01/
> >> 	|   `-- vm_uuid
> >> 	`-- mdev02/
> >> 	    `-- vm_uuid
> >>
> >> Did I miss something?  
> > 
> > Sure, this is just another way of doing UUID+instance.  Nit, it might
> > look more like:
> > 
> >  	/sys/bus/pci/devices/<sbdf>/mdev/
> >  	|-- uuid1/
> >  	|   `-- group_uuid
> >  	`-- uuid2/
> >  	    `-- group_uuid
> > 
> > Where each mdev device is actually referenced by its UUID name then
> > we'd have some writable attribute under the device where mdev devices
> > sharing the same group UUID are handled together.    
> 
> Group UUID would also work, as long as its unique and set for all
> devices in a group, it should work.

Well, except for the problem I mention in the quoted paragraph below.

> > There's a problem
> > here though that vfio doesn't know about this level of grouping, so
> > uuid1 and uuid2 could actually be given to different users despite the
> > grouping here, which results in one or both devices not working or
> > creating security issues.  That sort of implies that this would
> > necessarily need to be exposed as iommu grouping.  This factors into why
> > it seems like a good idea to make the start/stop implicit within the
> > interface.  In that way each mdev device is fungible as far as a user
> > like libvirt is concerned, internal details like peer-to-peer resources
> > are handled automatically as the devices are accessed.
> >   
> 
> I understand your concerns here. But making implicit doesn't guarantee
> that device will not be accessed unless all mdev devices are started.

This is true; starting on the first mmio fault relies on the devices being
set up without the mmio space being accessed first.  That should be how
QEMU works today though.
 
> >>>> There was a thought that perhaps on open() the vendor driver could look
> >>>> at the user pid and use that to associate with other devices, but the
> >>>> problem here is that we open and begin access to each device, so
> >>>> devices do this discovery serially rather than in parallel as desired.
> >>>> (we might not fault in mmio space yet though, so I wonder if open()
> >>>> could set the association of mdev to pid, then the first mmio fault
> >>>> would trigger the resource allocation?  Then all the "magic" would live
> >>>> in the vendor driver.  open() could fail if the pid already has running
> >>>> mdev devices and the vendor driver chooses not to support hotplug)
> >>>>  
> 
> Problem is resources should be committed before any device being
> accessed and not at fault at mmio space.

It seems then that the grouping needs to affect the iommu group so that
you know that there's only a single owner for all the mdev devices
within the group.  IIRC, the bus drivers don't have any visibility
to opening and releasing of the group itself to trigger the
online/offline, but they can track opening of the device file
descriptors within the group.  Within the VFIO API the user cannot
access the device without the device file descriptor, so a "first
device opened" and "last device closed" trigger would provide the
trigger points you need.  Some sort of new sysfs interface would need to
be invented to allow this sort of manipulation.

Also we should probably keep sight of whether we feel this is
sufficiently necessary for the complexity.  If we can get by with only
doing this grouping at creation time then we could define the "create"
interface in various ways.  For example:

echo $UUID0 > create

would create a single mdev named $UUID0 in its own group.

echo {$UUID0,$UUID1} > create

could create mdev devices $UUID0 and $UUID1 grouped together.

We could even do:

echo $UUID1:$GROUPA > create

where $GROUPA is the group ID of a previously created mdev device into
which $UUID1 is to be created and added to the same group.
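(If grouping were done this way, libvirt would presumably also need a way
to read the group ID back from an existing mdev.  A hypothetical read-only
attribute, not proposed anywhere yet, might look like:

cat /sys/bus/pci/devices/0000:86:00.0/$UUID0/mdev_group
42

so that the value could be fed back into a later create.)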

Currently iommu groups are determined at device discovery time and are not
changeable, so this sort of matches that model, but it makes life
difficult for libvirt if they want to have a pool of mdev devices that
they arbitrarily assign to VMs.  There's also the question of whether
libvirt applies this to all mdev devices or only to NVIDIA's.  Does it
try to use the same group across different parent devices?  Does it
only group devices with matching vendor strings?  Much still to be
specified...

Thanks,
Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-01 16:59         ` [Qemu-devel] " Alex Williamson
  (?)
@ 2016-09-02  4:48         ` Michal Privoznik
  2016-09-02  5:21           ` Kirti Wankhede
  -1 siblings, 1 reply; 162+ messages in thread
From: Michal Privoznik @ 2016-09-02  4:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	Kirti Wankhede, kraxel, Laine Stump, pbonzini, bjsdjshi

On 01.09.2016 18:59, Alex Williamson wrote:
> On Thu, 1 Sep 2016 18:47:06 +0200
> Michal Privoznik <mprivozn@redhat.com> wrote:
> 
>> On 31.08.2016 08:12, Tian, Kevin wrote:
>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>
>>>> Hi folks,
>>>>
>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>> sysfs interface.  I'd like to share what I think we agreed on and the
>>>> "problem areas" that still need some work so we can get the thoughts
>>>> and ideas from those who weren't able to attend.
>>>>
>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>> interface, which exposes a flat csv file with fields like "type",
>>>> "number of instance", "vendor string", and then a bunch of type
>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>> limit", etc.  This is not entirely machine parsing friendly and sort of
>>>> abuses the sysfs concept of one value per file.  Example output taken
>>>> from Neo's libvirt RFC:
>>>>
>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>>>> max_resolution
>>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>>>
>>>> The create/destroy then looks like this:
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_create
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_destroy
>>>>
>>>> "vendor_specific_argument_list" is nebulous.
>>>>
>>>> So the idea to fix this is to explode this into a directory structure,
>>>> something like:
>>>>
>>>> ├── mdev_destroy
>>>> └── mdev_supported_types
>>>>     ├── 11
>>>>     │   ├── create
>>>>     │   ├── description
>>>>     │   └── max_instances
>>>>     ├── 12
>>>>     │   ├── create
>>>>     │   ├── description
>>>>     │   └── max_instances
>>>>     └── 13
>>>>         ├── create
>>>>         ├── description
>>>>         └── max_instances
>>>>
>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>> the other attributes would be included in separate files and we would
>>>> require vendors to create standard attributes for common device classes.  
>>>
>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>> In the meantime, can we still allow optional vendor string in create 
>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>> layer to do some vendor specific tweak if necessary.  
>>
>> This is not the best idea IMO. Libvirt is there to shadow differences
>> between hypervisors. While doing that, we often hide differences between
>> various types of HW too. Therefore in order to provide good abstraction
>> we should make vendor specific string as small as possible (ideally an
>> empty string). I mean I see it as bad idea to expose "vgpu_type_id" from
>> example above in domain XML. What I think the better idea is if we let
>> users chose resolution and frame buffer size, e.g.: <video
>> resolution="1024x768" framebuffer="16"/> (just the first idea that came
>> to my mind while writing this e-mail). The point is, XML part is
>> completely free of any vendor-specific knobs.
> 
> That's not really what you want though, a user actually cares whether
> they get an Intel of NVIDIA vGPU, we can't specify it as just a
> resolution and framebuffer size.  The user also doesn't want the model
> changing each time the VM is started, so not only do you *need* to know
> the vendor, you need to know the vendor model.  This is the only way to
> provide a consistent VM.  So as we discussed at the BoF, the libvirt
> xml will likely reference the vendor string, which will be a unique
> identifier that encompasses all the additional attributes we expose.
> Really the goal of the attributes is simply so you don't need a per
> vendor magic decoder ring to figure out the basic features of a given
> vendor string.  Thanks,

Okay, maybe I'm misunderstanding something. I just thought that users
will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
to construct domain XML.
Also, I guess libvirt will need some sort of understanding of vGPUs in
the sense that if there are two vGPUs in the system (say both INTEL and
NVIDIA) libvirt must create mdev on the right one. I guess we can't rely
solely on vgpu_type_id uniqueness here, can we.
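
For illustration, the consultation flow I have in mind is something like
this (the device name below is just an example):

  virsh nodedev-list --cap pci
  virsh nodedev-dumpxml pci_0000_86_00_0   # read the vGPU capabilities

and then use what is reported there when constructing the domain XML.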

Michal

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02  4:48         ` Michal Privoznik
@ 2016-09-02  5:21           ` Kirti Wankhede
  2016-09-02 10:05             ` Paolo Bonzini
  0 siblings, 1 reply; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-02  5:21 UTC (permalink / raw)
  To: Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, pbonzini, bjsdjshi



On 9/2/2016 10:18 AM, Michal Privoznik wrote:
> On 01.09.2016 18:59, Alex Williamson wrote:
>> On Thu, 1 Sep 2016 18:47:06 +0200
>> Michal Privoznik <mprivozn@redhat.com> wrote:
>>
>>> On 31.08.2016 08:12, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>>> sysfs interface.  I'd like to share what I think we agreed on and the
>>>>> "problem areas" that still need some work so we can get the thoughts
>>>>> and ideas from those who weren't able to attend.
>>>>>
>>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>>> interface, which exposes a flat csv file with fields like "type",
>>>>> "number of instance", "vendor string", and then a bunch of type
>>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>>> limit", etc.  This is not entirely machine parsing friendly and sort of
>>>>> abuses the sysfs concept of one value per file.  Example output taken
>>>>> from Neo's libvirt RFC:
>>>>>
>>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>>>>> max_resolution
>>>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>>>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>>>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>>>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>>>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>>>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>>>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>>>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>>>>
>>>>> The create/destroy then looks like this:
>>>>>
>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>> 	/sys/bus/pci/devices/.../mdev_create
>>>>>
>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>> 	/sys/bus/pci/devices/.../mdev_destroy
>>>>>
>>>>> "vendor_specific_argument_list" is nebulous.
>>>>>
>>>>> So the idea to fix this is to explode this into a directory structure,
>>>>> something like:
>>>>>
>>>>> ├── mdev_destroy
>>>>> └── mdev_supported_types
>>>>>     ├── 11
>>>>>     │   ├── create
>>>>>     │   ├── description
>>>>>     │   └── max_instances
>>>>>     ├── 12
>>>>>     │   ├── create
>>>>>     │   ├── description
>>>>>     │   └── max_instances
>>>>>     └── 13
>>>>>         ├── create
>>>>>         ├── description
>>>>>         └── max_instances
>>>>>
>>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>>> the other attributes would be included in separate files and we would
>>>>> require vendors to create standard attributes for common device classes.  
>>>>
>>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>>> In the meantime, can we still allow optional vendor string in create 
>>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>>> layer to do some vendor specific tweak if necessary.  
>>>
>>> This is not the best idea IMO. Libvirt is there to shadow differences
>>> between hypervisors. While doing that, we often hide differences between
>>> various types of HW too. Therefore in order to provide good abstraction
>>> we should make vendor specific string as small as possible (ideally an
>>> empty string). I mean I see it as bad idea to expose "vgpu_type_id" from
>>> example above in domain XML. What I think the better idea is if we let
>>> users chose resolution and frame buffer size, e.g.: <video
>>> resolution="1024x768" framebuffer="16"/> (just the first idea that came
>>> to my mind while writing this e-mail). The point is, XML part is
>>> completely free of any vendor-specific knobs.
>>
>> That's not really what you want though, a user actually cares whether
>> they get an Intel of NVIDIA vGPU, we can't specify it as just a
>> resolution and framebuffer size.  The user also doesn't want the model
>> changing each time the VM is started, so not only do you *need* to know
>> the vendor, you need to know the vendor model.  This is the only way to
>> provide a consistent VM.  So as we discussed at the BoF, the libvirt
>> xml will likely reference the vendor string, which will be a unique
>> identifier that encompasses all the additional attributes we expose.
>> Really the goal of the attributes is simply so you don't need a per
>> vendor magic decoder ring to figure out the basic features of a given
>> vendor string.  Thanks,
> 
> Okay, maybe I'm misunderstanding something. I just thought that users
> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
> to construct domain XML.

I'm not familiar with libvirt code, curious how libvirt's nodedev driver
enumerates devices in the system?

> Also, I guess libvirt will need some sort of understanding of vGPUs in
> the sense that if there are two vGPUs in the system 

I think you meant two physical GPUs in the system, right?

> (say both INTEL and
> NVIDIA) libvirt must create mdev on the right one. I guess we can't rely
> solely on vgpu_type_id uniqueness here, can we.
> 

When two GPUs are present in the system, say both INTEL and NVIDIA, these
devices have unique domain:bus:device:function addresses. The 'mdev_create'
sysfs file would be present for each device in its own device directory
(as per the v7 patch, the path of 'mdev_create' is):
    /sys/bus/pci/devices/<domain:bus:device:function>/mdev_create

So libvirt needs to know on which physical device the mdev device needs
to be created.
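
As a rough sketch of what that means with the v7 interface (the UUID and
the PCI address are only example values, and the string written to
'mdev_create' is whatever the create interface accepts):

  UUID=$(uuidgen)
  # create the mdev on this particular physical GPU, not the other one
  echo "$UUID" > /sys/bus/pci/devices/0000:86:00.0/mdev_create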

Thanks,
Kirti

> Michal
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 0/4] Add Mediated device support
  2016-09-01 20:01             ` [Qemu-devel] " Alex Williamson
@ 2016-09-02  6:17               ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-02  6:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jike Song, Tian, Kevin, pbonzini, kraxel, cjia, qemu-devel, kvm,
	bjsdjshi, libvir-list, Daniel P. Berrange, Laine Stump



On 9/2/2016 1:31 AM, Alex Williamson wrote:
> On Thu, 1 Sep 2016 23:52:02 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Alex,
>> Thanks for summarizing the discussion.
>>
>> On 8/31/2016 9:18 PM, Alex Williamson wrote:
>>> On Wed, 31 Aug 2016 15:04:13 +0800
>>> Jike Song <jike.song@intel.com> wrote:
>>>   
>>>> On 08/31/2016 02:12 PM, Tian, Kevin wrote:  
>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>>>> sysfs interface.  I'd like to share what I think we agreed on and the
>>>>>> "problem areas" that still need some work so we can get the thoughts
>>>>>> and ideas from those who weren't able to attend.
>>>>>>
>>>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>>>> interface, which exposes a flat csv file with fields like "type",
>>>>>> "number of instance", "vendor string", and then a bunch of type
>>>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>>>> limit", etc.  This is not entirely machine parsing friendly and sort of
>>>>>> abuses the sysfs concept of one value per file.  Example output taken
>>>>>> from Neo's libvirt RFC:
>>>>>>
>>>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>>>>>> max_resolution
>>>>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>>>>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>>>>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>>>>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>>>>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>>>>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>>>>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>>>>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>>>>>
>>>>>> The create/destroy then looks like this:
>>>>>>
>>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>>> 	/sys/bus/pci/devices/.../mdev_create
>>>>>>
>>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>>> 	/sys/bus/pci/devices/.../mdev_destroy
>>>>>>
>>>>>> "vendor_specific_argument_list" is nebulous.
>>>>>>
>>>>>> So the idea to fix this is to explode this into a directory structure,
>>>>>> something like:
>>>>>>
>>>>>> ├── mdev_destroy
>>>>>> └── mdev_supported_types
>>>>>>     ├── 11
>>>>>>     │   ├── create
>>>>>>     │   ├── description
>>>>>>     │   └── max_instances
>>>>>>     ├── 12
>>>>>>     │   ├── create
>>>>>>     │   ├── description
>>>>>>     │   └── max_instances
>>>>>>     └── 13
>>>>>>         ├── create
>>>>>>         ├── description
>>>>>>         └── max_instances
>>>>>>
>>>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>>>> the other attributes would be included in separate files and we would
>>>>>> require vendors to create standard attributes for common device classes.    
>>>>>
>>>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>>>> In the meantime, can we still allow optional vendor string in create 
>>>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>>>> layer to do some vendor specific tweak if necessary.
>>>>>     
>>>>
>>>> Not sure whether this can done within MDEV framework (attrs provided by
>>>> vendor driver of course), or must be within the vendor driver.  
>>>
>>> The purpose of the sub-directories is that libvirt doesn't need to pass
>>> arbitrary, vendor strings to the create function, the attributes of the
>>> mdev device created are defined by the attributes in the sysfs
>>> directory where the create is done.  The user only provides a uuid for
>>> the device.  Arbitrary vendor parameters are a barrier, libvirt may not
>>> need to know the meaning, but would need to know when to apply them,
>>> which is just as bad.  Ultimately we want libvirt to be able to
>>> interact with sysfs without having an vendor specific knowledge.
>>>   
>>
>> Above directory hierarchy looks fine to me. Along with the fixed set of
>> parameter, a optional field of extra parameter is also required. Such
>> parameters are required for some specific testing or running benchmarks,
>> for example to disable FRL (framerate limiter) or to disable console vnc
>> when not required. Libvirt don't need to know its details, its just a
>> string that user can provide and libvirt need to pass the string as it
>> is to vendor driver, vendor driver would act accordingly.
> 
> Wouldn't it make more sense to enable these through the vendor driver
> which would then provide additional types through the sysfs interface
> that could be selected by libvirt?  Or simply transparently change
> these parameters within the existing types?  I think we really want to
> get away from adding any sort of magic vendor strings.
>  

In the directory structure, a 'params' attribute can take optional
parameters. Libvirt can then set 'params' and create the mdev device. For
example, if a param such as 'disable_console_vnc=1' is set for type 11,
then devices created of type 11 will have that param set unless it is
cleared.

 └── mdev_supported_types
     ├── 11
     │   ├── create
     │   ├── description
     │   ├── max_instances
     │   └── params
     ├── 12
     │   ├── create
     │   ├── description
     │   ├── max_instances
     │   └── params
     └── 13
         ├── create
         ├── description
         ├── max_instances
         └── params

This has to come from libvirt since such params could be different for
each mdev device.
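
A rough usage sketch of what I have in mind (the 'params' attribute and the
value written to it are only a proposal here, not an existing interface):

  cd /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11
  echo "disable_console_vnc=1" > params   # set the optional parameters first
  echo "$UUID" > create                   # devices created now inherit them
  echo "" > params                        # clear them again afterwards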

>>>>>>
>>>>>> For vGPUs like NVIDIA where we don't support multiple types
>>>>>> concurrently, this directory structure would update as mdev devices are
>>>>>> created, removing no longer available types.  I carried forward    
>>>>>
>>>>> or keep the type with max_instances cleared to ZERO.
>>>>>    
>>>>
>>>> +1 :)  
>>>
>>> Possible yes, but why would the vendor driver report types that the
>>> user cannot create?  It just seems like superfluous information (well,
>>> except for the use I discover below).
>>>   
>>
>> The directory structure for a physical GPU will be defined when device
>> is register to mdev module. It would be simpler to change creatable
>> instance count i.e for the types which can't be created creatable
>> instance count would be set to 0.
>>
>>
>>>>>> max_instances here, but perhaps we really want to copy SR-IOV and
>>>>>> report a max and current allocation.  Creation and deletion is    
>>>>>
>>>>> right, cur/max_instances look reasonable.
>>>>>     
>>>>>> simplified as we can simply "echo $UUID > create" per type.  I don't
>>>>>> understand why destroy had a parameter list, so here I imagine we can
>>>>>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>>>>>> entry under each mdev device, so we remove it at the device rather than
>>>>>> in some central location (any objections?).    
>>>>>
>>>>> OK to me.     
>>>>
>>>> IIUC, "destroy" has a parameter list is only because the previous
>>>> $VM_UUID + instnace implementation. It should be safe to move the "destroy"
>>>> file under mdev now.
>>>>  
>>
>> Sorry if that was there in libvirt discussion, but "destroy" don't need
>> extra parameters. Yes it could be moved to mdev device directory.
>>
>>>>>> We discussed how this might look with Intel devices which do allow
>>>>>> mixed vGPU types concurrently.  We believe, but need confirmation, that
>>>>>> the vendor driver could still make a finite set of supported types,
>>>>>> perhaps with additional module options to the vendor driver to enable
>>>>>> more "exotic" types.  So for instance if IGD vGPUs are based on
>>>>>> power-of-2 portions of the framebuffer size, then the vendor driver
>>>>>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>>>>>> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.    
>>>>>
>>>>> Yes, Intel can do such type of definition. One thing I'm not sure is 
>>>>> about impact cross listed types, i.e. when creating a new instance
>>>>> under a given type, max_instances under other types would be 
>>>>> dynamically decremented based on available resource. Would it be
>>>>> a problem for libvirt or upper level stack, since a natural interpretation
>>>>> of max_instances should be a static number?
>>>>>
>>>>> An alternative is to make max_instances configurable, so libvirt has
>>>>> chance to define a pool of available instances with different types
>>>>> before creating any instance. For example, initially IGD driver may 
>>>>> report max_instances only for a minimal sharing granularity:
>>>>> 	128MB:
>>>>> 		max_instances (8)
>>>>> 	256MB:
>>>>> 		max_instances (0)
>>>>> 	512MB:
>>>>> 		max_instances (0)
>>>>>
>>>>> Then libvirt can configure more types as:
>>>>> 	128MB:
>>>>> 		max_instances (2)
>>>>> 	256MB:
>>>>> 		max_instances (1)
>>>>> 	512MB:
>>>>> 		max_instances (1)
>>>>>
>>>>> Starting from this point, max_instances would be static and then
>>>>> mdev instance can be created under each type. But I'm not
>>>>> sure whether such additional configuration role is reasonable to libvirt...    
>>>
>>> My expectation of your example, where I'm assuming you have 1G of total
>>> memory that can be divided between the mdev devices would be:
>>>
>>>  128M: 8
>>>  256M: 4
>>>  512M: 2
>>>
>>> If a 512M mdev device is created, this becomes:
>>>
>>>  128M: 4
>>>  256M: 2
>>>  512M: 1
>>>
>>> Creating a 128M mdev device from that becomes:
>>>
>>>  128M: 3
>>>  256M: 1
>>>  512M: 0
>>>
>>> It's not great, but I don't know how to do it better without the user
>>> having a clear understanding of the algorithm and resources required
>>> for each mdev device.  For instance, the size here, presumably the
>>> framebuffer size, is just one attribute in the device directory, the
>>> user won't know that this attribute is the key to the available
>>> instances.
>>>
>>> I don't particularly like the idea of a writeable max_instances, the
>>> user can simply create instances of the type and see the results.
>>>
>>> Just thought of another thing; do we need some way to determine the
>>> type of an mdev device from sysfs or is this implicit knowledge for the
>>> user that created the device?  For instance, we create a 512M device
>>> and it becomes a child device to the parent, so we can associate to the
>>> parent, but if we come back later, how do we know it's a 512M device?
>>> Perhaps this is a reason to keep the type directories around and we can
>>> cross link the device to the type and create a devices subdirectory
>>> under each type.  Perhaps then "max_instances" becomes
>>> "available_instances" (ie. how many left we can create) and we don't
>>> need a "current_instances" because we can simply look in the devices
>>> directory.
>>>   
>>
>> When mdev module creates mdev device, mdev_device_create() in patch,
>> here 'mdev->dev.parent' is assigned as its parent physical device. So
>> device_register() create child's directory inside parent's directory.
>> Directory for mdev device is not explicitly created. So I don't think we
>> can move this directory to type directory. But we can think of adding
>> link to type directory from mdev device's directory.
> 
> Yes, the idea was only to add links, not to change anything about the
> parent/child hierarchy in sysfs.  The result would be similar to how we
> have /sys/kernel/iommu_groups/$GROUP/devices/ with links to the devices
> contained within that group.
>  
>>>>>>
>>>>>> We still don't have any way for the admin to learn in advance how the
>>>>>> available supported types will change once mdev devices start to be
>>>>>> created.  I'm not sure how we can create a specification for this, so
>>>>>> probing by creating devices may be the most flexible model.
>>>>>>  
>>
>> Removing type directory dynamically seems difficult. So the other way as
>> suggested here, when that type is not supported, vendor driver can
>> return max_instance to 0.
> 
> I'm ok with this, seems like there are enough uses for it and it's
> necessary to keep the directory for the device links.
>  
>>>>>> The other issue is the start/stop requirement, which was revealed to
>>>>>> setup peer-to-peer resources between vGPUs which is a limited hardware
>>>>>> resource.  We'd really like to have these happen automatically on the
>>>>>> first open of a vfio mdev device file and final release.  So we
>>>>>> brainstormed how the open/release callbacks could know the other mdev
>>>>>> devices for a given user.  This is where the instance number came into
>>>>>> play previously.  This is an area that needs work.    
>>>>>
>>>>> IGD doesn't have such peer-to-peer resource setup requirement. So
>>>>> it's sufficient to create/destroy a mdev instance in a single action on
>>>>> IGD. However I'd expect we still keep the "start/stop" interface (
>>>>> maybe not exposed as sysfs node, instead being a VFIO API), as 
>>>>> required to support future live migration usage. We've made prototype
>>>>> working for KVMGT today.    
>>>
>>> Great!
>>>   
>>
>> In this v7 version of patch, I had made changes that introduce 'online'
>> in mdev device directory as discussed in v6 reviews. We need this to
>> commit resources for that device(s).
> 
> But if we have some number of mdev devices, each with just a UUID
> identifier, how are separate online callbacks for each device
> associated to a single peer-to-peer context?
> 
>>>> It's good for the framework to define start/stop interfaces, but as Alex
>>>> said below, it should be MDEV oriented, not VM oriented.
>>>>
>>>> I don't know a lot about the peer-to-peer resource, but to me, although
>>>> VM_UUID + instance is not applicable, userspace can always achieve the
>>>> same purpose by, let us assume a mdev hierarchy, providing the VM UUID
>>>> under every mdev:
>>>>
>>>> 	/sys/bus/pci/devices/<sbdf>/mdev/
>>>> 	|-- mdev01/
>>>> 	|   `-- vm_uuid
>>>> 	`-- mdev02/
>>>> 	    `-- vm_uuid
>>>>
>>>> Did I miss something?  
>>>
>>> Sure, this is just another way of doing UUID+instance.  Nit, it might
>>> look more like:
>>>
>>>  	/sys/bus/pci/devices/<sbdf>/mdev/
>>>  	|-- uuid1/
>>>  	|   `-- group_uuid
>>>  	`-- uuid2/
>>>  	    `-- group_uuid
>>>
>>> Where each mdev device is actually referenced by its UUID name then
>>> we'd have some writable attribute under the device where mdev devices
>>> sharing the same group UUID are handled together.    
>>
>> Group UUID would also work, as long as its unique and set for all
>> devices in a group, it should work.
> 
> Well, except for the problem I mention in the quoted paragraph below.
> 
>>> There's a problem
>>> here though that vfio doesn't know about this level of grouping, so
>>> uuid1 and uuid2 could actually be given to different users despite the
>>> grouping here, which results in one or both devices not working or
>>> creating security issues.  That sort of implies that this would
>>> necessarily need to be exposed as iommu grouping.  This factors into why
>>> it seems like a good idea to make the start/stop implicit within the
>>> interface.  In that way each mdev device is fungible as far as a user
>>> like libvirt is concerned, internal details like peer-to-peer resources
>>> are handled automatically as the devices are accessed.
>>>   
>>
>> I understand your concerns here. But making implicit doesn't guarantee
>> that device will not be accessed unless all mdev devices are started.
> 
> This is true, start on mmio fault relies on devices being setup w/o
> accessing the mmio space.  It should be how QEMU works today though.
>  
>>>>>> There was a thought that perhaps on open() the vendor driver could look
>>>>>> at the user pid and use that to associate with other devices, but the
>>>>>> problem here is that we open and begin access to each device, so
>>>>>> devices do this discovery serially rather than in parallel as desired.
>>>>>> (we might not fault in mmio space yet though, so I wonder if open()
>>>>>> could set the association of mdev to pid, then the first mmio fault
>>>>>> would trigger the resource allocation?  Then all the "magic" would live
>>>>>> in the vendor driver.  open() could fail if the pid already has running
>>>>>> mdev devices and the vendor driver chooses not to support hotplug)
>>>>>>  
>>
>> Problem is resources should be committed before any device being
>> accessed and not at fault at mmio space.
> 
> It seems then that the grouping needs to affect the iommu group so that
> you know that there's only a single owner for all the mdev devices
> within the group.  IIRC, the bus drivers don't have any visibility
> to opening and releasing of the group itself to trigger the
> online/offline, but they can track opening of the device file
> descriptors within the group.  Within the VFIO API the user cannot
> access the device without the device file descriptor, so a "first
> device opened" and "last device closed" trigger would provide the
> trigger points you need.  Some sort of new sysfs interface would need to
> be invented to allow this sort of manipulation.
> 

I like this suggestion and the thinking around it.

> Also we should probably keep sight of whether we feel this is
> sufficiently necessary for the complexity.  If we can get by with only
> doing this grouping at creation time then we could define the "create"
> interface in various ways.  For example:
> 
> echo $UUID0 > create
> 
> would create a single mdev named $UUID0 in it's own group.
> 
> echo {$UUID0,$UUID1} > create
> 
> could create mdev devices $UUID0 and $UUID1 grouped together.
> 

I think this would create mdev devices of the same type on the same parent
device. We need to consider the case where multiple mdev devices of
different types and with different parents are grouped together.

> We could even do:
> 
> echo $UUID1:$GROUPA > create
> 
> where $GROUPA is the group ID of a previously created mdev device into
> which $UUID1 is to be created and added to the same group.
>

I was thinking about:

  echo $UUID0 > create

would create an mdev device.

  echo $UUID0 > /sys/class/mdev/create_group

would add the created device to a group.

For the multiple-device case:
  echo $UUID0 > create
  echo $UUID1 > create

would create mdev devices which could be of different types and have
different parents.
  echo $UUID0, $UUID1 > /sys/class/mdev/create_group

would add the devices to a group.
The mdev core module would create a new group with a unique number.  On
mdev device 'destroy' that mdev device would be removed from the group.
When there are no devices left in the group, the group would be deleted.
With this, the "first device opened" and "last device closed" triggers
can be used to commit resources.
Then libvirt would use the mdev device path to pass as an argument to
QEMU, the same as it does for VFIO. Libvirt doesn't have to care about
the group number.
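
Something along these lines on the QEMU command line, for example (the
exact option name and the sysfs path of the mdev device are assumptions
on my side):

  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID0 \
  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1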

> Currently iommu groups are determined at device discovery time and not
> changeable, so it seems like this sort of matches that model, but it
> makes life difficult for libvirt if they want to have a pool of mdev
> devices that they arbitrarily assigned to VMs.  Also the question of
> whether libvirt applies this all mdev devices or only NVIDIA.  Does it
> try to use the same group across different parent devices? 

Yes, a group could consist of mdev devices with different parent devices.

> Does it
> only group devices with matching vendor strings?  Much to be
> specified...

I don't think it should be vendor specific.

Thanks,
Kirti

> 
> Thanks,
> Alex
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02  5:21           ` Kirti Wankhede
@ 2016-09-02 10:05             ` Paolo Bonzini
  2016-09-02 17:15               ` Kirti Wankhede
  2016-09-02 20:19                 ` [Qemu-devel] [libvirt] " John Ferlan
  0 siblings, 2 replies; 162+ messages in thread
From: Paolo Bonzini @ 2016-09-02 10:05 UTC (permalink / raw)
  To: Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 02/09/2016 07:21, Kirti Wankhede wrote:
> On 9/2/2016 10:18 AM, Michal Privoznik wrote:
> > Okay, maybe I'm misunderstanding something. I just thought that users
> > will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
> > nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
> > to construct domain XML.
> 
> I'm not familiar with libvirt code, curious how libvirt's nodedev driver
> enumerates devices in the system?

It looks at sysfs and/or the udev database and transforms what it finds
there to XML.

I think people would consult the nodedev driver to fetch vGPU
capabilities, use "virsh nodedev-create" to create the vGPU device on
the host, and then somehow refer to the nodedev in the domain XML.
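
Roughly (the XML file name is just an example; the <device> XML it would
contain is the one sketched further below):

  virsh nodedev-dumpxml pci_0000_86_00_0   # see which mdev types the parent offers
  virsh nodedev-create my-vgpu.xml         # create the vGPU nodedev from that XML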

There isn't very much documentation on nodedev-create, but it's used
mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:

   <device>
     <name>scsi_host6</name>
     <parent>scsi_host5</parent>
     <capability type='scsi_host'>
       <capability type='fc_host'>
         <wwnn>2001001b32a9da5e</wwnn>
         <wwpn>2101001b32a9da5e</wwpn>
       </capability>
     </capability>
   </device>

so I suppose for vGPU it would look like this:

   <device>
     <name>my-vgpu</name>
     <parent>pci_0000_86_00_0</parent>
     <capability type='mdev'>
       <type id='11'/>
       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
     </capability>
   </device>

while the parent would have:

   <device>
     <name>pci_0000_86_00_0</name>
     <capability type='pci'>
       <domain>0</domain>
       <bus>134</bus>
       <slot>0</slot>
       <function>0</function>
       <capability type='mdev'>
         <!-- one type element per sysfs directory -->
         <type id='11'>
           <!-- one element per sysfs file roughly -->
           <name>GRID M60-0B</name>
           <attribute name='num_heads'>2</attribute>
           <attribute name='frl_config'>45</attribute>
           <attribute name='framebuffer'>524288</attribute>
           <attribute name='hres'>2560</attribute>
           <attribute name='vres'>1600</attribute>
         </type>
       </capability>
       <product id='...'>GRID M60</product>
       <vendor id='0x10de'>NVIDIA</vendor>
     </capability>
   </device>

After creating the vGPU, if required by the host driver, all the other
type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.

When dumping the mdev with nodedev-dumpxml, it could show more complete
info, again taken from sysfs:

   <device>
     <name>my-vgpu</name>
     <parent>pci_0000_86_00_0</parent>
     <capability type='mdev'>
       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
       <!-- only the chosen type -->
       <type id='11'>
         <name>GRID M60-0B</name>
         <attribute name='num_heads'>2</attribute>
         <attribute name='frl_config'>45</attribute>
         <attribute name='framebuffer'>524288</attribute>
         <attribute name='hres'>2560</attribute>
         <attribute name='vres'>1600</attribute>
       </type>
       <capability type='pci'>
         <!-- no domain/bus/slot/function of course -->
         <!-- could show whatever PCI IDs are seen by the guest: -->
         <product id='...'>...</product>
         <vendor id='0x10de'>NVIDIA</vendor>
       </capability>
     </capability>
   </device>

Notice how the parent has mdev inside pci; the vGPU, if it has to have
pci at all, would have it inside mdev.  This represents the difference
between the mdev provider and the mdev device.

Random proposal for the domain XML too:

  <hostdev mode='subsystem' type='pci'>
    <source type='mdev'>
      <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
      <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    </source>
    <address type='pci' bus='0' slot='2' function='0'/>
  </hostdev>

Paolo

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 10:05             ` Paolo Bonzini
@ 2016-09-02 17:15               ` Kirti Wankhede
  2016-09-02 17:25                 ` Paolo Bonzini
  2016-09-02 20:19                 ` [Qemu-devel] [libvirt] " John Ferlan
  1 sibling, 1 reply; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-02 17:15 UTC (permalink / raw)
  To: Paolo Bonzini, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi

On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 07:21, Kirti Wankhede wrote:
>> On 9/2/2016 10:18 AM, Michal Privoznik wrote:
>>> Okay, maybe I'm misunderstanding something. I just thought that users
>>> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
>>> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
>>> to construct domain XML.
>>
>> I'm not familiar with libvirt code, curious how libvirt's nodedev driver
>> enumerates devices in the system?
> 
> It looks at sysfs and/or the udev database and transforms what it finds
> there to XML.
> 
> I think people would consult the nodedev driver to fetch vGPU
> capabilities, use "virsh nodedev-create" to create the vGPU device on
> the host, and then somehow refer to the nodedev in the domain XML.
> 
> There isn't very much documentation on nodedev-create, but it's used
> mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:
> 
>    <device>
>      <name>scsi_host6</name>
>      <parent>scsi_host5</parent>
>      <capability type='scsi_host'>
>        <capability type='fc_host'>
>          <wwnn>2001001b32a9da5e</wwnn>
>          <wwpn>2101001b32a9da5e</wwpn>
>        </capability>
>      </capability>
>    </device>
> 
> so I suppose for vGPU it would look like this:
> 
>    <device>
>      <name>my-vgpu</name>
>      <parent>pci_0000_86_00_0</parent>
>      <capability type='mdev'>
>        <type id='11'/>
>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>      </capability>
>    </device>
> 
> while the parent would have:
> 
>    <device>
>      <name>pci_0000_86_00_0</name>
>      <capability type='pci'>
>        <domain>0</domain>
>        <bus>134</bus>
>        <slot>0</slot>
>        <function>0</function>
>        <capability type='mdev'>
>          <!-- one type element per sysfs directory -->
>          <type id='11'>
>            <!-- one element per sysfs file roughly -->
>            <name>GRID M60-0B</name>
>            <attribute name='num_heads'>2</attribute>
>            <attribute name='frl_config'>45</attribute>
>            <attribute name='framebuffer'>524288</attribute>
>            <attribute name='hres'>2560</attribute>
>            <attribute name='vres'>1600</attribute>
>          </type>
>        </capability>
>        <product id='...'>GRID M60</product>
>        <vendor id='0x10de'>NVIDIA</vendor>
>      </capability>
>    </device>
> 
> After creating the vGPU, if required by the host driver, all the other
> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>

Thanks Paolo for the details.
'nodedev-create' parses the XML file and accordingly writes to the 'create'
file in sysfs to create the mdev device. Right?
At this moment, does libvirt know which VM this device would be
associated with?

> When dumping the mdev with nodedev-dumpxml, it could show more complete
> info, again taken from sysfs:
> 
>    <device>
>      <name>my-vgpu</name>
>      <parent>pci_0000_86_00_0</parent>
>      <capability type='mdev'>
>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>        <!-- only the chosen type -->
>        <type id='11'>
>          <name>GRID M60-0B</name>
>          <attribute name='num_heads'>2</attribute>
>          <attribute name='frl_config'>45</attribute>
>          <attribute name='framebuffer'>524288</attribute>
>          <attribute name='hres'>2560</attribute>
>          <attribute name='vres'>1600</attribute>
>        </type>
>        <capability type='pci'>
>          <!-- no domain/bus/slot/function of course -->
>          <!-- could show whatever PCI IDs are seen by the guest: -->
>          <product id='...'>...</product>
>          <vendor id='0x10de'>NVIDIA</vendor>
>        </capability>
>      </capability>
>    </device>
> 
> Notice how the parent has mdev inside pci; the vGPU, if it has to have
> pci at all, would have it inside mdev.  This represents the difference
> between the mdev provider and the mdev device.
>

The parent of an mdev device might not always be a PCI device. I think we
shouldn't consider it a PCI capability.

> Random proposal for the domain XML too:
> 
>   <hostdev mode='subsystem' type='pci'>
>     <source type='mdev'>
>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>     </source>
>     <address type='pci' bus='0' slot='2' function='0'/>
>   </hostdev>
>

When a user wants to assign two mdev devices to one VM, does the user
have to add two such entries, or group the two devices in one entry?
On another mail thread with the same subject we are thinking of creating
a group of mdev devices to assign multiple mdev devices to one VM.
Libvirt doesn't have to know about the group number, but libvirt should
add all mdev devices in a group. Is that possible to do before starting
the QEMU process?

Thanks,
Kirti


> Paolo
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 17:15               ` Kirti Wankhede
@ 2016-09-02 17:25                 ` Paolo Bonzini
  2016-09-02 18:33                   ` Kirti Wankhede
  0 siblings, 1 reply; 162+ messages in thread
From: Paolo Bonzini @ 2016-09-02 17:25 UTC (permalink / raw)
  To: Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 02/09/2016 19:15, Kirti Wankhede wrote:
> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>    <device>
>>      <name>my-vgpu</name>
>>      <parent>pci_0000_86_00_0</parent>
>>      <capability type='mdev'>
>>        <type id='11'/>
>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>      </capability>
>>    </device>
>>
>> After creating the vGPU, if required by the host driver, all the other
>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
> 
> Thanks Paolo for details.
> 'nodedev-create' parse the xml file and accordingly write to 'create'
> file in sysfs to create mdev device. Right?
> At this moment, does libvirt know which VM this device would be
> associated with?

No, the VM will associate to the nodedev through the UUID.  The nodedev
is created separately from the VM.
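
For instance, a rough sketch of that flow (my-vgpu.xml here would simply
hold the <capability type='mdev'> XML proposed above; all names are
placeholders):

   # create the mdev on the host, with no VM involved at all
   virsh nodedev-create my-vgpu.xml
   virsh nodedev-dumpxml my-vgpu      # shows the UUID that was assigned

   # a guest later picks the device up purely by putting that same UUID
   # into a <hostdev> entry in its domain XML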

>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>> info, again taken from sysfs:
>>
>>    <device>
>>      <name>my-vgpu</name>
>>      <parent>pci_0000_86_00_0</parent>
>>      <capability type='mdev'>
>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>        <!-- only the chosen type -->
>>        <type id='11'>
>>          <!-- ... snip ... -->
>>        </type>
>>        <capability type='pci'>
>>          <!-- no domain/bus/slot/function of course -->
>>          <!-- could show whatever PCI IDs are seen by the guest: -->
>>          <product id='...'>...</product>
>>          <vendor id='0x10de'>NVIDIA</vendor>
>>        </capability>
>>      </capability>
>>    </device>
>>
>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>> pci at all, would have it inside mdev.  This represents the difference
>> between the mdev provider and the mdev device.
> 
> Parent of mdev device might not always be a PCI device. I think we
> shouldn't consider it as PCI capability.

The <capability type='pci'> in the vGPU means that it _will_ be exposed
as a PCI device by VFIO.

The <capability type='pci'> in the physical GPU means that the GPU is a
PCI device.

>> Random proposal for the domain XML too:
>>
>>   <hostdev mode='subsystem' type='pci'>
>>     <source type='mdev'>
>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>     </source>
>>     <address type='pci' bus='0' slot='2' function='0'/>
>>   </hostdev>
>>
> 
> When user wants to assign two mdev devices to one VM, user have to add
> such two entries or group the two devices in one entry?

Two entries, one per UUID, each with its own PCI address in the guest.

> On other mail thread with same subject we are thinking of creating group
> of mdev devices to assign multiple mdev devices to one VM.

What is the advantage in managing mdev groups?  (Sorry didn't follow the
other thread).

Paolo

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-01 16:59         ` [Qemu-devel] " Alex Williamson
@ 2016-09-02 17:55           ` Laine Stump
  -1 siblings, 0 replies; 162+ messages in thread
From: Laine Stump @ 2016-09-02 17:55 UTC (permalink / raw)
  To: libvir-list
  Cc: Alex Williamson, Michal Privoznik, Song, Jike, cjia, kvm, Tian,
	Kevin, qemu-devel, Kirti Wankhede, kraxel, Laine Stump, pbonzini,
	bjsdjshi

On 09/01/2016 12:59 PM, Alex Williamson wrote:
> On Thu, 1 Sep 2016 18:47:06 +0200
> Michal Privoznik <mprivozn@redhat.com> wrote:
>
>> On 31.08.2016 08:12, Tian, Kevin wrote:
>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>
>>>> Hi folks,
>>>>
>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>> sysfs interface.  I'd like to share what I think we agreed on and the
>>>> "problem areas" that still need some work so we can get the thoughts
>>>> and ideas from those who weren't able to attend.
>>>>
>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>> interface, which exposes a flat csv file with fields like "type",
>>>> "number of instance", "vendor string", and then a bunch of type
>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>> limit", etc.  This is not entirely machine parsing friendly and sort of
>>>> abuses the sysfs concept of one value per file.  Example output taken
>>>> from Neo's libvirt RFC:
>>>>
>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>>>> max_resolution
>>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>>>
>>>> The create/destroy then looks like this:
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_create
>>>>
>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>> 	/sys/bus/pci/devices/.../mdev_destroy
>>>>
>>>> "vendor_specific_argument_list" is nebulous.
>>>>
>>>> So the idea to fix this is to explode this into a directory structure,
>>>> something like:
>>>>
>>>> ├── mdev_destroy
>>>> └── mdev_supported_types
>>>>      ├── 11
>>>>      │   ├── create
>>>>      │   ├── description
>>>>      │   └── max_instances
>>>>      ├── 12
>>>>      │   ├── create
>>>>      │   ├── description
>>>>      │   └── max_instances
>>>>      └── 13
>>>>          ├── create
>>>>          ├── description
>>>>          └── max_instances
>>>>
>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>> the other attributes would be included in separate files and we would
>>>> require vendors to create standard attributes for common device classes.
>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>> In the meantime, can we still allow optional vendor string in create
>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>> layer to do some vendor specific tweak if necessary.
>> This is not the best idea IMO. Libvirt is there to shadow differences
>> between hypervisors. While doing that, we often hide differences between
>> various types of HW too. Therefore in order to provide good abstraction
>> we should make vendor specific string as small as possible (ideally an
>> empty string). I mean I see it as bad idea to expose "vgpu_type_id" from
>> example above in domain XML. What I think the better idea is if we let
>> users chose resolution and frame buffer size, e.g.: <video
>> resolution="1024x768" framebuffer="16"/> (just the first idea that came
>> to my mind while writing this e-mail). The point is, XML part is
>> completely free of any vendor-specific knobs.
> That's not really what you want though, a user actually cares whether
> they get an Intel of NVIDIA vGPU, we can't specify it as just a
> resolution and framebuffer size.  The user also doesn't want the model
> changing each time the VM is started, so not only do you *need* to know
> the vendor, you need to know the vendor model

as well as any other configuration that might change over time. A 
similar issue - libvirt really doesn't know or care what a "chassis" is 
in an ioh3420 (a PCIe root-port), but it's a guest-visible property of 
the device that qemu can set (and could presumably decide to change the 
default setting at some time in the future), so libvirt has to set a 
value for it in the config, and specify it on the qemu commandline.
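
As one concrete illustration (the command-line fragment is only a sketch
of the sort of thing libvirt ends up generating; the values are made up):

   # libvirt picks a chassis value, stores it in the domain config, and
   # passes the same value on every start so the guest-visible property
   # never changes behind the guest's back
   qemu-system-x86_64 ... \
       -device ioh3420,id=pci.1,bus=pcie.0,chassis=1,slot=0 ...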

What I'm getting at is that if there is anything in the vendor-specific 
string that changes guest ABI, and that could change over time, then 
libvirt can't just rely on it remaining the same, it needs to have it 
saved in the config for later reproduction, even if it doesn't 
understand the contents.

(for that matter, you may want to consider some type of "versioned vGPU 
type" similar to qemu's versioned machinetypes (e.g. "pc-i440fx-2.6", 
which has some sort of incompatible ABI differences from 
"pc-i440fx-1.4"), where any guest-ABI-changing modifications to the vGPU 
would take effect only when the appropriate version of device was 
requested. That way a guest originally created to use today's version of 
vGPU X in resolution Y would continue to work even if incompatible guest 
ABI changes were made in the future.)



^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 17:25                 ` Paolo Bonzini
@ 2016-09-02 18:33                   ` Kirti Wankhede
  2016-09-02 20:29                       ` [Qemu-devel] [libvirt] " John Ferlan
  2016-09-02 21:48                     ` [Qemu-devel] " Paolo Bonzini
  0 siblings, 2 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-02 18:33 UTC (permalink / raw)
  To: Paolo Bonzini, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi


On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 19:15, Kirti Wankhede wrote:
>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>    <device>
>>>      <name>my-vgpu</name>
>>>      <parent>pci_0000_86_00_0</parent>
>>>      <capability type='mdev'>
>>>        <type id='11'/>
>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>      </capability>
>>>    </device>
>>>
>>> After creating the vGPU, if required by the host driver, all the other
>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>
>> Thanks Paolo for details.
>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>> file in sysfs to create mdev device. Right?
>> At this moment, does libvirt know which VM this device would be
>> associated with?
> 
> No, the VM will associate to the nodedev through the UUID.  The nodedev
> is created separately from the VM.
> 
>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>> info, again taken from sysfs:
>>>
>>>    <device>
>>>      <name>my-vgpu</name>
>>>      <parent>pci_0000_86_00_0</parent>
>>>      <capability type='mdev'>
>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>        <!-- only the chosen type -->
>>>        <type id='11'>
>>>          <!-- ... snip ... -->
>>>        </type>
>>>        <capability type='pci'>
>>>          <!-- no domain/bus/slot/function of course -->
>>>          <!-- could show whatever PCI IDs are seen by the guest: -->
>>>          <product id='...'>...</product>
>>>          <vendor id='0x10de'>NVIDIA</vendor>
>>>        </capability>
>>>      </capability>
>>>    </device>
>>>
>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>> pci at all, would have it inside mdev.  This represents the difference
>>> between the mdev provider and the mdev device.
>>
>> Parent of mdev device might not always be a PCI device. I think we
>> shouldn't consider it as PCI capability.
> 
> The <capability type='pci'> in the vGPU means that it _will_ be exposed
> as a PCI device by VFIO.
> 
> The <capability type='pci'> in the physical GPU means that the GPU is a
> PCI device.
> 

Ok. Got that.

>>> Random proposal for the domain XML too:
>>>
>>>   <hostdev mode='subsystem' type='pci'>
>>>     <source type='mdev'>
>>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>     </source>
>>>     <address type='pci' bus='0' slot='2' function='0'/>
>>>   </hostdev>
>>>
>>
>> When user wants to assign two mdev devices to one VM, user have to add
>> such two entries or group the two devices in one entry?
> 
> Two entries, one per UUID, each with its own PCI address in the guest.
> 
>> On other mail thread with same subject we are thinking of creating group
>> of mdev devices to assign multiple mdev devices to one VM.
> 
> What is the advantage in managing mdev groups?  (Sorry didn't follow the
> other thread).
> 

When an mdev device is created, resources from the physical device are
assigned to this device. But resources are committed only when the
device goes 'online' ('start' in the v6 patch).
In the case of multiple vGPUs in a VM for the NVIDIA vGPU solution,
resources for all vGPU devices in a VM are committed in one place. So we
need to know the vGPUs assigned to a VM before QEMU starts.

Grouping would help here, as Alex suggested in that mail. Pulling only
that part of the discussion here:

<Alex> It seems then that the grouping needs to affect the iommu group
so that
> you know that there's only a single owner for all the mdev devices
> within the group.  IIRC, the bus drivers don't have any visibility
> to opening and releasing of the group itself to trigger the
> online/offline, but they can track opening of the device file
> descriptors within the group.  Within the VFIO API the user cannot
> access the device without the device file descriptor, so a "first
> device opened" and "last device closed" trigger would provide the
> trigger points you need.  Some sort of new sysfs interface would need
> to be invented to allow this sort of manipulation.
> Also we should probably keep sight of whether we feel this is
> sufficiently necessary for the complexity.  If we can get by with only
> doing this grouping at creation time then we could define the "create"
> interface in various ways.  For example:
>
> echo $UUID0 > create
>
> would create a single mdev named $UUID0 in it's own group.
>
> echo {$UUID0,$UUID1} > create
>
> could create mdev devices $UUID0 and $UUID1 grouped together.
>
</Alex>

<Kirti>
I think this would create mdev devices of the same type on the same
parent device. We need to consider the case where multiple mdev devices
of different types and with different parents are grouped together.
</Kirti>

<Alex> We could even do:
>
> echo $UUID1:$GROUPA > create
>
> where $GROUPA is the group ID of a previously created mdev device into
> which $UUID1 is to be created and added to the same group.
</Alex>

<Kirti>
I was thinking about:

  echo $UUID0 > create

would create an mdev device

  echo $UUID0 > /sys/class/mdev/create_group

would add the created device to a group.

For the multiple-devices case:
  echo $UUID0 > create
  echo $UUID1 > create

would create mdev devices which could be of different types and have
different parents.
  echo $UUID0, $UUID1 > /sys/class/mdev/create_group

would add the devices to a group.
The mdev core module would create a new group with a unique number. On
mdev device 'destroy' that mdev device would be removed from the group.
When there are no devices left in the group, the group would be deleted.
With this, the "first device opened" and "last device closed" triggers
can be used to commit resources.
Then libvirt would use the mdev device path to pass as an argument to
QEMU, same as it does for VFIO. Libvirt doesn't have to care about the
group number.
</Kirti>
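
A rough end-to-end sketch of what that flow could look like (the
create/create_group paths are only the proposal above, and the QEMU
options shown are just one plausible way libvirt could consume the
resulting devices):

  echo $UUID0 > /sys/bus/pci/devices/.../create     # parent/type 1
  echo $UUID1 > /sys/bus/.../devices/.../create     # different parent/type
  echo $UUID0, $UUID1 > /sys/class/mdev/create_group

  qemu-system-x86_64 ... \
      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID0 \
      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1

  # 'destroy' removes each device from its group; once the group is
  # empty the mdev core deletes it, and "first open"/"last close" of the
  # device FDs give the resource commit/release trigger points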

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 17:55           ` [Qemu-devel] [libvirt] " Laine Stump
@ 2016-09-02 19:15             ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-02 19:15 UTC (permalink / raw)
  To: Laine Stump
  Cc: Song, Jike, cjia, kvm, libvir-list, Michal Privoznik, Tian,
	Kevin, qemu-devel, Kirti Wankhede, kraxel, Laine Stump, pbonzini,
	bjsdjshi

On Fri, 2 Sep 2016 13:55:19 -0400
Laine Stump <laine@laine.org> wrote:

> On 09/01/2016 12:59 PM, Alex Williamson wrote:
> > On Thu, 1 Sep 2016 18:47:06 +0200
> > Michal Privoznik <mprivozn@redhat.com> wrote:
> >  
> >> On 31.08.2016 08:12, Tian, Kevin wrote:  
> >>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> >>>> Sent: Wednesday, August 31, 2016 12:17 AM
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> At KVM Forum we had a BoF session primarily around the mediated device
> >>>> sysfs interface.  I'd like to share what I think we agreed on and the
> >>>> "problem areas" that still need some work so we can get the thoughts
> >>>> and ideas from those who weren't able to attend.
> >>>>
> >>>> DanPB expressed some concern about the mdev_supported_types sysfs
> >>>> interface, which exposes a flat csv file with fields like "type",
> >>>> "number of instance", "vendor string", and then a bunch of type
> >>>> specific fields like "framebuffer size", "resolution", "frame rate
> >>>> limit", etc.  This is not entirely machine parsing friendly and sort of
> >>>> abuses the sysfs concept of one value per file.  Example output taken
> >>>> from Neo's libvirt RFC:
> >>>>
> >>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> >>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> >>>> max_resolution
> >>>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> >>>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> >>>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> >>>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> >>>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> >>>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> >>>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> >>>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> >>>>
> >>>> The create/destroy then looks like this:
> >>>>
> >>>> echo "$mdev_UUID:vendor_specific_argument_list" >
> >>>> 	/sys/bus/pci/devices/.../mdev_create
> >>>>
> >>>> echo "$mdev_UUID:vendor_specific_argument_list" >
> >>>> 	/sys/bus/pci/devices/.../mdev_destroy
> >>>>
> >>>> "vendor_specific_argument_list" is nebulous.
> >>>>
> >>>> So the idea to fix this is to explode this into a directory structure,
> >>>> something like:
> >>>>
> >>>> ├── mdev_destroy
> >>>> └── mdev_supported_types
> >>>>      ├── 11
> >>>>      │   ├── create
> >>>>      │   ├── description
> >>>>      │   └── max_instances
> >>>>      ├── 12
> >>>>      │   ├── create
> >>>>      │   ├── description
> >>>>      │   └── max_instances
> >>>>      └── 13
> >>>>          ├── create
> >>>>          ├── description
> >>>>          └── max_instances
> >>>>
> >>>> Note that I'm only exposing the minimal attributes here for simplicity,
> >>>> the other attributes would be included in separate files and we would
> >>>> require vendors to create standard attributes for common device classes.  
> >>> I like this idea. All standard attributes are reflected into this hierarchy.
> >>> In the meantime, can we still allow optional vendor string in create
> >>> interface? libvirt doesn't need to know the meaning, but allows upper
> >>> layer to do some vendor specific tweak if necessary.  
> >> This is not the best idea IMO. Libvirt is there to shadow differences
> >> between hypervisors. While doing that, we often hide differences between
> >> various types of HW too. Therefore in order to provide good abstraction
> >> we should make vendor specific string as small as possible (ideally an
> >> empty string). I mean I see it as bad idea to expose "vgpu_type_id" from
> >> example above in domain XML. What I think the better idea is if we let
> >> users chose resolution and frame buffer size, e.g.: <video
> >> resolution="1024x768" framebuffer="16"/> (just the first idea that came
> >> to my mind while writing this e-mail). The point is, XML part is
> >> completely free of any vendor-specific knobs.  
> > That's not really what you want though, a user actually cares whether
> > they get an Intel of NVIDIA vGPU, we can't specify it as just a
> > resolution and framebuffer size.  The user also doesn't want the model
> > changing each time the VM is started, so not only do you *need* to know
> > the vendor, you need to know the vendor model  
> 
> as well as any other configuration that might change over time. A 
> similar issue - libvirt really doesn't know or care what a "chassis" is 
> in an ioh3420 (a PCIe root-port), but it's a guest-visible property of 
> the device that qemu can set (and could presumably decide to change the 
> default setting of some time in the future), so libvirt has to set a 
> value for it in the config, and specify it on the qemu commandline.
> 
> What I'm getting at is that if there is anything in the vendor-specific 
> string that changes guest ABI, and that could change over time, then 
> libvirt can't just rely on it remaining the same, it needs to have it 
> saved in the config for later reproduction, even if it doesn't 
> understand the contents.
> 
> (for that matter,  you may want to consider some type of "versioned vGPU 
> type" similar to qemu's versions machinetypes (e.g. "pc-i440fx-2.6", 
> which has some sort of incompatible ABI differences from 
> "pc-i440fx-1.4"), where any guest-ABI-changing modifications to the vGPU 
> would take effect only when the appropriate version of device was 
> requested. That way a guest originally created to use today's version of 
> vGPU X in resolution Y would continue to work even if incompatible guest 
> ABI changes were made in the future.)

I fully agree, but I don't know if it's anything we can actually
codify, only document that this is the way the vendor driver *should*
behave.  If the vendor driver modifies the guest visible device without
modifying the vendor string... well that's just something they
shouldn't have done.  Bad vendor.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 10:05             ` Paolo Bonzini
@ 2016-09-02 20:19                 ` John Ferlan
  2016-09-02 20:19                 ` [Qemu-devel] [libvirt] " John Ferlan
  1 sibling, 0 replies; 162+ messages in thread
From: John Ferlan @ 2016-09-02 20:19 UTC (permalink / raw)
  To: Paolo Bonzini, Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 09/02/2016 06:05 AM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 07:21, Kirti Wankhede wrote:
>> On 9/2/2016 10:18 AM, Michal Privoznik wrote:
>>> Okay, maybe I'm misunderstanding something. I just thought that users
>>> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
>>> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
>>> to construct domain XML.
>>
>> I'm not familiar with libvirt code, curious how libvirt's nodedev driver
>> enumerates devices in the system?
> 
> It looks at sysfs and/or the udev database and transforms what it finds
> there to XML.

Caveat: I started writing this in the morning... Of course the email
thread has evolved even more since then...

If you have libvirt installed, use 'virsh nodedev-list --tree' to get a
tree format of what libvirt "finds". But to answer the question, it's
mostly a brute force method of perusing the sysfs trees that libvirt
cares about and storing away the data in nodedev driver objects.
As/when new devices are found there's a udev create device event that
libvirtd follows in order to generate a new nodedev object for devices
that libvirt cares about. Similarly there's a udev delete device event
to remove devices.

FWIW: Some examples of nodedev output can be found at:

http://libvirt.org/formatnode.html

> 
> I think people would consult the nodedev driver to fetch vGPU
> capabilities, use "virsh nodedev-create" to create the vGPU device on
> the host, and then somehow refer to the nodedev in the domain XML.
> 
> There isn't very much documentation on nodedev-create, but it's used
> mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:
> 
>    <device>
>      <name>scsi_host6</name>
>      <parent>scsi_host5</parent>
>      <capability type='scsi_host'>
>        <capability type='fc_host'>
>          <wwnn>2001001b32a9da5e</wwnn>
>          <wwpn>2101001b32a9da5e</wwpn>
>        </capability>
>      </capability>
>    </device>
> 

The above is the nodedev-dumpxml of the created NPIV (a/k/a vHBA) node
device - although there's also a "<fabric_wwn>" now too.

One can also look at http://wiki.libvirt.org/page/NPIV_in_libvirt to get
a practical example of vHBA creation. The libvirt wiki data was more
elegantly transposed into RHEL7 docs at:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-NPIV_storage.html

The sole purpose of nodedev-create is vHBA creation - the API was
introduced in 0.6.5 (commit id '81d0ffbc'). Without going into a lot of
detail - the API is WWNN/WWPN centric and relies on udev create device
events (via udevEventHandleCallback) to add the scsi_hostM vHBA with the
WWNN/WWPN.

NB: There's a systemd/udev "lag" issue to make note of - the add event
is generated before all the files are populated with correct values
(https://bugzilla.redhat.com/show_bug.cgi?id=1210832). In order to work
around that the nodedev-create logic scans the scsi_host devices to find
the matching scsi_hostM.

> so I suppose for vGPU it would look like this:
> 
>    <device>
>      <name>my-vgpu</name>
>      <parent>pci_0000_86_00_0</parent>
>      <capability type='mdev'>
>        <type id='11'/>
>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>      </capability>
>    </device>

So one question would be "where" does one find the value for the <uuid>
field? From the initial libvirt RFC it seems as though a generated UUID
is fine, but figured I'd ask just to be sure I'm not making any assumptions.

Based on how the email thread is going - figuring out the input format
to mdev_create needs to be agreed upon... Once that's done figuring out
how to generate XML that can be used for the input should be simpler.

In the end, so far I've assumed there would be one vGPU referenced by a
$UUID and perhaps a name... I have no idea what udev creates when
mdev_create is called - is it only the /sys/bus/mdev/devices/$UUID?  Or
is there some new /sys/bus/pci/devices/$PCIADDR as well?


FWIW:
Hopefully it'll help to give the vHBA comparison. The minimal equivalent
*pre* vHBA XML looks like:

   <device>
     <parent>scsi_host5</parent>
     <capability type='scsi_host'>
       <capability type='fc_host'>
       </capability>
     </capability>
   </device>

This is fed into 'virsh nodedev-create $XMLFILE' and the result is the
vHBA XML (e.g. the scsi_host6 output above). Providing a wwnn/wwpn is
not necessary - if not provided they are generated. The wwnn/wwpn pair
is fed to the "vport_create" (via echo "wwpn:wwnn" > vport_create), then
udev takes over and creates a new scsi_hostM device (in the
/sys/class/scsi_host directory just like the HBA) with a parent using
the wwnn, wwpn. The nodedev-create code doesn't do the nodedev object
creation - that's done automagically via udev add event processing. Once
udev creates the device, it sends an event which the nodedev driver handles.
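
A compressed sketch of that sequence, reusing the example values from
earlier in this mail (host5 as the parent HBA, scsi_host6 as the result):

   virsh nodedev-create vhba.xml    # vhba.xml = the minimal XML above
   # behind the scenes libvirt effectively does:
   #   echo "2101001b32a9da5e:2001001b32a9da5e" > \
   #       /sys/class/fc_host/host5/vport_create
   # udev adds scsi_host6, the nodedev driver sees the add event, and:
   virsh nodedev-dumpxml scsi_host6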

Note that for nodedev-create, the <name> field is ignored. The reason
it's ignored is because the logic knows udev will create one for us,
e.g. scsi_host6 in the above XML based on running the vport_create from
the parent HBA.

In order to determine the <parent> field, one uses "virsh nodedev-list
--caps vports" and chooses from the output one of the scsi_hostN's
provided. That capability is determined during libvirtd node device db
initialization by finding "/sys/class/fc_host/hostN/vport_create" files
and setting a bit from which future searches can use the capability string.

The resulting vHBA can be fed into XML for a 'scsi' storage pool and the
LUN's for the vHBA will be listed once the pool is started via 'virsh
vol-list $POOLNAME'. Those LUN's can then be fed into guest XML as a
'disk' or passthru 'lun'.  The format is on the wiki page.
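
Roughly along these lines (the pool name is arbitrary and the
authoritative pool XML format is the one documented on the wiki page):

cat > vhbapool.xml <<'EOF'
<pool type='scsi'>
  <name>vhbapool_host6</name>
  <source>
    <adapter type='fc_host' wwnn='2001001b32a9da5e'
             wwpn='2101001b32a9da5e'/>
  </source>
  <target>
    <path>/dev/disk/by-path</path>
  </target>
</pool>
EOF
virsh pool-define vhbapool.xml
virsh pool-start vhbapool_host6
virsh vol-list vhbapool_host6    # the LUNs visible behind the vHBA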

> 
> while the parent would have:
> 
>    <device>
>      <name>pci_0000_86_00_0</name>
>      <capability type='pci'>
>        <domain>0</domain>
>        <bus>134</bus>
>        <slot>0</slot>
>        <function>0</function>
>        <capability type='mdev'>
>          <!-- one type element per sysfs directory -->
>          <type id='11'>
>            <!-- one element per sysfs file roughly -->
>            <name>GRID M60-0B</name>
>            <attribute name='num_heads'>2</attribute>
>            <attribute name='frl_config'>45</attribute>
>            <attribute name='framebuffer'>524288</attribute>
>            <attribute name='hres'>2560</attribute>
>            <attribute name='vres'>1600</attribute>
>          </type>
>        </capability>
>        <product id='...'>GRID M60</product>
>        <vendor id='0x10de'>NVIDIA</vendor>
>      </capability>
>    </device>
> 

I would consider this to be the starting point (GPU) that's needed to
create vGPU's for libvirt. In order to find this needle in the haystack
of PCI devices, code would need to be added to find the
"/sys/bus/pci/devices/$PCIADDR/mdev_create" files during initial sysfs
tree parsing, where $PCIADDR in this case is "0000:86:0.0". Someone
doing this should search on VPORTS and VPORT_OPS in the libvirt code.

Once a new capability flag is added, it'll be easy to use "virsh
nodedev-list mdevs" in order to get a list of pci_* devices which can
support vGPU.
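
Until that capability flag exists, a quick-and-dirty way to spot the
candidate parents is to look for the create file directly (a sketch;
which sysfs name wins is still being settled in this thread):

   ls -d /sys/bus/pci/devices/*/mdev_create 2>/dev/null
   # or, with the directory-per-type layout proposed by Alex:
   ls -d /sys/bus/pci/devices/*/mdev_supported_types 2>/dev/null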

From that list, the above XML would be generated via "virsh
nodedev-dumpxml pci_0000_86_00_0" (for example). Whatever one finds in
that output I would expect to be used to feed into the XML that would
need to be created to generate a vGPU via nodedev-create and thus become
parameters to "mdev_create".

Once the mdev_create is done, then watching /sys/bus/mdev/devices/ for
the UUID would mimic how vHBA does things.
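
Roughly (a sketch only; the create path and its argument format are still
under discussion in this thread):

   echo "$UUID" > /sys/bus/pci/devices/0000:86:00.0/mdev_create
   while [ ! -e /sys/bus/mdev/devices/$UUID ]; do
       sleep 0.1    # wait for the new mdev to show up in sysfs
   done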

So we got this far, but how do we ensure that subsequent reboots create
the same vGPU's for guests? The vHBA code achieves this by creating a
storage pool that creates the vHBA when the storage pool starts. That
way when the guest starts it can reference the storage pool and unit.

We don't have such a pool for GPU's (yet) - although I suppose they
could just become a class of storage pools.

The issue being nodedev device objects are not saved between reboots.
They are generated on the fly. Hence the 'create-nodedev' API - notice
there's no 'define-nodedev' API, although I suppose one could be
created. It's just more work to get this all to work properly.

> After creating the vGPU, if required by the host driver, all the other
> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
> 

Not wanting to make assumptions, but this reads as if I create one type
11 vGPU, then I can create no others on the host.  Maybe I'm reading it
wrong - it's been a long week.

> When dumping the mdev with nodedev-dumpxml, it could show more complete
> info, again taken from sysfs:
> 
>    <device>
>      <name>my-vgpu</name>
>      <parent>pci_0000_86_00_0</parent>
>      <capability type='mdev'>
>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>        <!-- only the chosen type -->
>        <type id='11'>
>          <name>GRID M60-0B</name>
>          <attribute name='num_heads'>2</attribute>
>          <attribute name='frl_config'>45</attribute>
>          <attribute name='framebuffer'>524288</attribute>
>          <attribute name='hres'>2560</attribute>
>          <attribute name='vres'>1600</attribute>
>        </type>
>        <capability type='pci'>
>          <!-- no domain/bus/slot/function of course -->
>          <!-- could show whatever PCI IDs are seen by the guest: -->
>          <product id='...'>...</product>
>          <vendor id='0x10de'>NVIDIA</vendor>
>        </capability>
>      </capability>
>    </device>
> 
> Notice how the parent has mdev inside pci; the vGPU, if it has to have
> pci at all, would have it inside mdev.  This represents the difference
> between the mdev provider and the mdev device.
> 
> Random proposal for the domain XML too:
> 
>   <hostdev mode='subsystem' type='pci'>
>     <source type='mdev'>
>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>     </source>
>     <address type='pci' bus='0' slot='2' function='0'/>
>   </hostdev>
> 

PCI devices have the "managed='yes|no'" attribute as well. That's what
determines whether the device is to be detached from the host or not.
That's been something very painful to manage for vfio and, well, libvirt!

John



^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [libvirt] [PATCH v7 0/4] Add Mediated device support
@ 2016-09-02 20:19                 ` John Ferlan
  0 siblings, 0 replies; 162+ messages in thread
From: John Ferlan @ 2016-09-02 20:19 UTC (permalink / raw)
  To: Paolo Bonzini, Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 09/02/2016 06:05 AM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 07:21, Kirti Wankhede wrote:
>> On 9/2/2016 10:18 AM, Michal Privoznik wrote:
>>> Okay, maybe I'm misunderstanding something. I just thought that users
>>> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
>>> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
>>> to construct domain XML.
>>
>> I'm not familiar with libvirt code, curious how libvirt's nodedev driver
>> enumerates devices in the system?
> 
> It looks at sysfs and/or the udev database and transforms what it finds
> there to XML.

Caveat: I started writing this in the morning... Of course the email
thread has evolved even more since then...

If you have libvirt installed, use 'virsh nodedev-list --tree' to get a
tree format of what libvirt "finds". But to answer the question, it's
mostly a brute force method of perusing the sysfs trees that libvirt
cares about and storing away the data in nodedev driver objects.
As/when new devices are found there's a udev create device event that
libvirtd follows in order to generate a new nodedev object for devices
that libvirt cares about. Similarly there's a udev delete device event
to remove devices.

FWIW: Some examples of nodedev output can be found at:

http://libvirt.org/formatnode.html

> 
> I think people would consult the nodedev driver to fetch vGPU
> capabilities, use "virsh nodedev-create" to create the vGPU device on
> the host, and then somehow refer to the nodedev in the domain XML.
> 
> There isn't very much documentation on nodedev-create, but it's used
> mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:
> 
>    <device>
>      <name>scsi_host6</name>
>      <parent>scsi_host5</parent>
>      <capability type='scsi_host'>
>        <capability type='fc_host'>
>          <wwnn>2001001b32a9da5e</wwnn>
>          <wwpn>2101001b32a9da5e</wwpn>
>        </capability>
>      </capability>
>    </device>
> 

The above is the nodedev-dumpxml of the created NPIV (a/k/a vHBA) node
device - although there's also a "<fabric_wwn>" now too.

One can also look at http://wiki.libvirt.org/page/NPIV_in_libvirt to get
a practical example of vHBA creation. The libvirt wiki data was more
elegantly transposed into RHEL7 docs at:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-NPIV_storage.html

The nodedev-create sole purpose is vHBA creation - the API was
introduced in 0.6.5 (commit id '81d0ffbc'). Without going into a lot of
detail - the API is WWNN/WWPN centric and relies on udev create device
events (via udevEventHandleCallback) to add the scsi_hostM vHBA with the
WWNN/WWPN.

NB: There's a systemd/udev "lag" issue to make note of - the add event
is generated before all the files are populated with correct values
(https://bugzilla.redhat.com/show_bug.cgi?id=1210832). In order to work
around that the nodedev-create logic scans the scsi_host devices to find
the matching scsi_hostM.

> so I suppose for vGPU it would look like this:
> 
>    <device>
>      <name>my-vgpu</name>
>      <parent>pci_0000_86_00_0</parent>
>      <capability type='mdev'>
>        <type id='11'/>
>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>      </capability>
>    </device>

So one question would be "where" does one find the value for the <uuid>
field? From the initial libvirt RFC it seems as though a generated UUID
is fine, but figured I'd ask just to be sure I'm not making any assumptions.

Based on how the email thread is going - figuring out the input format
to mdev_create needs to be agreed upon... Once that's done figuring out
how to generate XML that can be used for the input should be simpler.

In end, so far I've assumed there would be one vGPU referenced by a
$UUID and perhaps a name... I have no idea what udev creates when
mdev_create is called - is it only the /sys/bus/mdev/devices/$UUID?  Or
is there some new /sys/bus/pci/devices/$PCIADDR as well?


FWIW:
Hopefully it'll help to give the vHBA comparison. The minimal equivalent
*pre* vHBA XML looks like:

   <device>
     <parent>scsi_host5</parent>
     <capability type='scsi_host'>
       <capability type='fc_host'>
       </capability>
     </capability>
   </device>

This is fed into 'virsh nodedev-create $XMLFILE' and the result is the
vHBA XML (e.g. the scsi_host6 output above). Providing a wwnn/wwpn is
not necessary - if not provided they are generated. The wwnn/wwpn pair
is fed to the "vport_create" (via echo "wwpn:wwnn" > vport_create), then
udev takes over and creates a new scsi_hostM device (in the
/sys/class/scsi_host directory just like the HBA) with a parent using
the wwnn, wwpn. The nodedev-create code doesn't do the nodedev object
creation - that's done automagically via udev add event processing. Once
udev creates the device, it sends an event which the nodedev driver handles.

Note that for nodedev-create, the <name> field is ignored. The reason
it's ignored is because the logic knows udev will create one for us,
e.g. scsi_host6 in the above XML based on running the vport_create from
the parent HBA.

In order to determine the <parent> field, one uses "virsh nodedev-list
--caps vports" and chooses from the output one of the scsi_hostN's
provided. That capability is determined during libvirtd node device db
initialization by finding "/sys/class/fc_host/hostN/vport_create" files
and setting a bit from which future searches can use the capability string.

The resulting vHBA can be fed into XML for a 'scsi' storage pool and the
LUN's for the vHBA will be listed once the pool is started via 'virsh
vol-list $POOLNAME. Those LUN's can then be fed into guest XML as a
'disk' or passthru 'lun'.  The format is on the wiki page.

> 
> while the parent would have:
> 
>    <device>
>      <name>pci_0000_86_00_0</name>
>      <capability type='pci'>
>        <domain>0</domain>
>        <bus>134</bus>
>        <slot>0</slot>
>        <function>0</function>
>        <capability type='mdev'>
>          <!-- one type element per sysfs directory -->
>          <type id='11'>
>            <!-- one element per sysfs file roughly -->
>            <name>GRID M60-0B</name>
>            <attribute name='num_heads'>2</attribute>
>            <attribute name='frl_config'>45</attribute>
>            <attribute name='framebuffer'>524288</attribute>
>            <attribute name='hres'>2560</attribute>
>            <attribute name='vres'>1600</attribute>
>          </type>
>        </capability>
>        <product id='...'>GRID M60</product>
>        <vendor id='0x10de'>NVIDIA</vendor>
>      </capability>
>    </device>
> 

I would consider this to be the starting point (GPU) that's needed to
create vGPU's for libvirt. In order to find this needle in the haystack
of PCI devices, code would need to be added to find the
"/sys/bus/pci/devices/$PCIADDR/mdev_create" files during initial sysfs
tree parsing, where $PCIADDR in this case is "0000:86:0.0". Someone
doing this should search on VPORTS and VPORT_OPS in the libvirt code.

Once a a new capability flag is added, it'll be easy to use "virsh
nodedev-list mdevs" in order to get a list of pci_* devices which can
support vGPU.

>From that list, the above XML would be generated via "virsh
nodedev-dumpxml pci_0000_86_00_0" (for example). Whatever one finds in
that output I would expect to be used to feed into the XML that would
need to be created to generate a vGPU via nodedev-create and thus become
parameters to "mdev_create".

Once the mdev_create is done, then watching /sys/bus/mdev/devices/ for
the UUID would mimic how vHBA does things.

So we got this far, but how do we ensure that subsequent reboots create
the same vGPU's for guests? The vHBA code achieves this by creating a
storage pool that creates the vHBA when the storage pool starts. That
way when the guest starts it can reference the storage pool and unit.

We don't have such a pool for GPU's (yet) - although I suppose they
could just become a class of storage pools.

The issue being nodedev device objects are not saved between reboots.
They are generated on the fly. Hence the "create-nodedev' API - notice
there's no "define-nodedev' API, although I suppose one could be
created. It's just more work to get this all to work properly.

> After creating the vGPU, if required by the host driver, all the other
> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
> 

Not wanting to make assumptions, but this reads as if I create one type
11 vGPU, then I can create no others on the host.  Maybe I'm reading it
wrong - it's been a long week.

> When dumping the mdev with nodedev-dumpxml, it could show more complete
> info, again taken from sysfs:
> 
>    <device>
>      <name>my-vgpu</name>
>      <parent>pci_0000_86_00_0</parent>
>      <capability type='mdev'>
>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>        <!-- only the chosen type -->
>        <type id='11'>
>          <name>GRID M60-0B</name>
>          <attribute name='num_heads'>2</attribute>
>          <attribute name='frl_config'>45</attribute>
>          <attribute name='framebuffer'>524288</attribute>
>          <attribute name='hres'>2560</attribute>
>          <attribute name='vres'>1600</attribute>
>        </type>
>        <capability type='pci'>
>          <!-- no domain/bus/slot/function of course -->
>          <!-- could show whatever PCI IDs are seen by the guest: -->
>          <product id='...'>...</product>
>          <vendor id='0x10de'>NVIDIA</vendor>
>        </capability>
>      </capability>
>    </device>
> 
> Notice how the parent has mdev inside pci; the vGPU, if it has to have
> pci at all, would have it inside mdev.  This represents the difference
> between the mdev provider and the mdev device.
> 
> Random proposal for the domain XML too:
> 
>   <hostdev mode='subsystem' type='pci'>
>     <source type='mdev'>
>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>     </source>
>     <address type='pci' bus='0' slot='2' function='0'/>
>   </hostdev>
> 

PCI devices have the "managed='yes|no'" attribute as well. That's what
determines whether the device is to be detached from the host or not.
That's been something very painful to manage for vfio and, well, libvirt!

John

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 18:33                   ` Kirti Wankhede
@ 2016-09-02 20:29                       ` John Ferlan
  2016-09-02 21:48                     ` [Qemu-devel] " Paolo Bonzini
  1 sibling, 0 replies; 162+ messages in thread
From: John Ferlan @ 2016-09-02 20:29 UTC (permalink / raw)
  To: Kirti Wankhede, Paolo Bonzini, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
> 
> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
>>
>>
>> On 02/09/2016 19:15, Kirti Wankhede wrote:
>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>>    <device>
>>>>      <name>my-vgpu</name>
>>>>      <parent>pci_0000_86_00_0</parent>
>>>>      <capability type='mdev'>
>>>>        <type id='11'/>
>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>      </capability>
>>>>    </device>
>>>>
>>>> After creating the vGPU, if required by the host driver, all the other
>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>>
>>> Thanks Paolo for details.
>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>>> file in sysfs to create mdev device. Right?
>>> At this moment, does libvirt know which VM this device would be
>>> associated with?
>>
>> No, the VM will associate to the nodedev through the UUID.  The nodedev
>> is created separately from the VM.
>>
>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>>> info, again taken from sysfs:
>>>>
>>>>    <device>
>>>>      <name>my-vgpu</name>
>>>>      <parent>pci_0000_86_00_0</parent>
>>>>      <capability type='mdev'>
>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>        <!-- only the chosen type -->
>>>>        <type id='11'>
>>>>          <!-- ... snip ... -->
>>>>        </type>
>>>>        <capability type='pci'>
>>>>          <!-- no domain/bus/slot/function of course -->
>>>>          <!-- could show whatever PCI IDs are seen by the guest: -->
>>>>          <product id='...'>...</product>
>>>>          <vendor id='0x10de'>NVIDIA</vendor>
>>>>        </capability>
>>>>      </capability>
>>>>    </device>
>>>>
>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>>> pci at all, would have it inside mdev.  This represents the difference
>>>> between the mdev provider and the mdev device.
>>>
>>> Parent of mdev device might not always be a PCI device. I think we
>>> shouldn't consider it as PCI capability.
>>
>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
>> as a PCI device by VFIO.
>>
>> The <capability type='pci'> in the physical GPU means that the GPU is a
>> PCI device.
>>
> 
> Ok. Got that.
> 
>>>> Random proposal for the domain XML too:
>>>>
>>>>   <hostdev mode='subsystem' type='pci'>
>>>>     <source type='mdev'>
>>>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>     </source>
>>>>     <address type='pci' bus='0' slot='2' function='0'/>
>>>>   </hostdev>
>>>>
>>>
>>> When user wants to assign two mdev devices to one VM, user have to add
>>> such two entries or group the two devices in one entry?
>>
>> Two entries, one per UUID, each with its own PCI address in the guest.
>>
>>> On other mail thread with same subject we are thinking of creating group
>>> of mdev devices to assign multiple mdev devices to one VM.
>>
>> What is the advantage in managing mdev groups?  (Sorry didn't follow the
>> other thread).
>>
> 
> When mdev device is created, resources from physical device is assigned
> to this device. But resources are committed only when device goes
> 'online' ('start' in v6 patch)
> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
> for all vGPU devices in a VM are committed at one place. So we need to
> know the vGPUs assigned to a VM before QEMU starts.
> 
> Grouping would help here as Alex suggested in that mail. Pulling only
> that part of discussion here:
> 
> <Alex> It seems then that the grouping needs to affect the iommu group
> so that
>> you know that there's only a single owner for all the mdev devices
>> within the group.  IIRC, the bus drivers don't have any visibility
>> to opening and releasing of the group itself to trigger the
>> online/offline, but they can track opening of the device file
>> descriptors within the group.  Within the VFIO API the user cannot
>> access the device without the device file descriptor, so a "first
>> device opened" and "last device closed" trigger would provide the
>> trigger points you need.  Some sort of new sysfs interface would need
>> to be invented to allow this sort of manipulation.
>> Also we should probably keep sight of whether we feel this is
>> sufficiently necessary for the complexity.  If we can get by with only
>> doing this grouping at creation time then we could define the "create"
>> interface in various ways.  For example:
>>
>> echo $UUID0 > create
>>
>> would create a single mdev named $UUID0 in it's own group.
>>
>> echo {$UUID0,$UUID1} > create
>>
>> could create mdev devices $UUID0 and $UUID1 grouped together.
>>
> </Alex>
> 
> <Kirti>
> I think this would create mdev device of same type on same parent
> device. We need to consider the case of multiple mdev devices of
> different types and with different parents to be grouped together.
> </Kirti>
> 
> <Alex> We could even do:
>>
>> echo $UUID1:$GROUPA > create
>>
>> where $GROUPA is the group ID of a previously created mdev device into
>> which $UUID1 is to be created and added to the same group.
> </Alex>
> 
> <Kirti>
> I was thinking about:
> 
>   echo $UUID0 > create
> 
> would create mdev device
> 
>   echo $UUID0 > /sys/class/mdev/create_group
> 
> would add created device to group.
> 
> For multiple devices case:
>   echo $UUID0 > create
>   echo $UUID1 > create
> 
> would create mdev devices which could be of different types and
> different parents.
>   echo $UUID0, $UUID1 > /sys/class/mdev/create_group
> 
> would add devices in a group.
> Mdev core module would create a new group with unique number.  On mdev
> device 'destroy' that mdev device would be removed from the group. When
> there are no devices left in the group, group would be deleted. With
> this "first device opened" and "last device closed" trigger can be used
> to commit resources.
> Then libvirt use mdev device path to pass as argument to QEMU, same as
> it does for VFIO. Libvirt don't have to care about group number.
> </Kirti>
> 

The more complicated one makes this, the more difficult it is for the
customer to configure, and the longer it takes to get something out. I
didn't follow the details of groups...

What gets created from a pass through some *mdev/create_group?  Does
some new udev device get created that then is fed to the guest? Seems
painful to make two distinct/async passes through systemd/udev. I
foresee testing nightmares with creating 3 vGPUs, processing a group
request, while some other process/thread is deleting a vGPU... How do
the vGPUs get marked so that the delete cannot happen?

If a vendor wants to create their own utility to group vHBAs together
and manage that grouping, then have at it...  Doesn't seem to be
something libvirt needs to be or should be managing...  As I go running
for cover...

If multiple types are to be generated for a single vGPU create, then
consider the following XML:

   <capability type='mdev'>
     <type id='11' [other attributes]/>
     <type id='11' [other attributes]/>
     <type id='12' [other attributes]/>
     [<uuid>...</uuid>]
   </capability>

then perhaps building the mdev_create input would be a comma-separated
list of types to be added... "$UUID:11,11,12". Just a thought...
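
Purely as a sketch of that thought - none of this syntax exists today:

    echo "$UUID:11,11,12" > /sys/bus/pci/devices/0000:86:00.0/mdev_create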


John

> Thanks,
> Kirti
> 
> --
> libvir-list mailing list
> libvir-list@redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 20:19                 ` [Qemu-devel] [libvirt] " John Ferlan
@ 2016-09-02 21:44                   ` Paolo Bonzini
  -1 siblings, 0 replies; 162+ messages in thread
From: Paolo Bonzini @ 2016-09-02 21:44 UTC (permalink / raw)
  To: John Ferlan, Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 02/09/2016 22:19, John Ferlan wrote:
> We don't have such a pool for GPU's (yet) - although I suppose they
> could just become a class of storage pools.
> 
> The issue being nodedev device objects are not saved between reboots.
> They are generated on the fly. Hence the "create-nodedev' API - notice
> there's no "define-nodedev' API, although I suppose one could be
> created. It's just more work to get this all to work properly.

It can all be made transient to begin with.  The VM can be defined but
won't start unless the mdev(s) exist with the right UUIDs.

>> After creating the vGPU, if required by the host driver, all the other
>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
> 
> Not wanting to make assumptions, but this reads as if I create one type
> 11 vGPU, then I can create no others on the host.  Maybe I'm reading it
> wrong - it's been a long week.

Correct, at least for NVIDIA.

> PCI devices have the "managed='yes|no'" attribute as well. That's what
> determines whether the device is to be detached from the host or not.
> That's been something very painful to manage for vfio and well libvirt!

mdevs do not exist on the host (they do not have a driver on the host
because they are not PCI devices) so they do not need any management.  At
least I hope that's good news. :)

Paolo

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 18:33                   ` Kirti Wankhede
  2016-09-02 20:29                       ` [Qemu-devel] [libvirt] " John Ferlan
@ 2016-09-02 21:48                     ` Paolo Bonzini
  2016-09-03 11:56                         ` [Qemu-devel] [libvirt] " John Ferlan
  2016-09-03 16:34                       ` [Qemu-devel] " Kirti Wankhede
  1 sibling, 2 replies; 162+ messages in thread
From: Paolo Bonzini @ 2016-09-02 21:48 UTC (permalink / raw)
  To: Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 02/09/2016 20:33, Kirti Wankhede wrote:
> <Alex> We could even do:
>> >
>> > echo $UUID1:$GROUPA > create
>> >
>> > where $GROUPA is the group ID of a previously created mdev device into
>> > which $UUID1 is to be created and added to the same group.
> </Alex>

From the point of view of libvirt, I think I prefer Alex's idea.
<group> could be an additional element in the nodedev-create XML:

    <device>
      <name>my-vgpu</name>
      <parent>pci_0000_86_00_0</parent>
      <capability type='mdev'>
        <type id='11'/>
        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
        <group>group1</group>
      </capability>
    </device>

(should group also be a UUID?)

Since John brought up the topic of minimal XML, in this case it will be
like this:

    <device>
      <name>my-vgpu</name>
      <parent>pci_0000_86_00_0</parent>
      <capability type='mdev'>
        <type id='11'/>
      </capability>
    </device>

The uuid will be autogenerated by libvirt and if there's no <group> (as
is common for VMs with only 1 vGPU) it will be a single-device group.
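
As an illustration (nodedev-create is the existing virsh command; the
mdev backend is what's being designed here):

    # assuming the minimal XML above is saved as my-vgpu.xml
    virsh nodedev-create my-vgpu.xml
    virsh nodedev-dumpxml <generated-name>   # read back the autogenerated uuid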

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 21:44                   ` [Qemu-devel] [libvirt] " Paolo Bonzini
@ 2016-09-02 23:57                     ` Laine Stump
  -1 siblings, 0 replies; 162+ messages in thread
From: Laine Stump @ 2016-09-02 23:57 UTC (permalink / raw)
  To: Michal Privoznik, Alex Williamson, libvir-list
  Cc: Paolo Bonzini, John Ferlan, Kirti Wankhede, Song, Jike, cjia,
	kvm, Tian, Kevin, qemu-devel, kraxel, bjsdjshi

On 09/02/2016 05:44 PM, Paolo Bonzini wrote:
>
>
> On 02/09/2016 22:19, John Ferlan wrote:
>> We don't have such a pool for GPU's (yet) - although I suppose they
>> could just become a class of storage pools.
>>
>> The issue being nodedev device objects are not saved between reboots.
>> They are generated on the fly. Hence the "create-nodedev' API - notice
>> there's no "define-nodedev' API, although I suppose one could be
>> created. It's just more work to get this all to work properly.
>
> It can all be made transient to begin with.  The VM can be defined but
> won't start unless the mdev(s) exist with the right UUIDs.
>
>>> After creating the vGPU, if required by the host driver, all the other
>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>
>> Not wanting to make assumptions, but this reads as if I create one type
>> 11 vGPU, then I can create no others on the host.  Maybe I'm reading it
>> wrong - it's been a long week.
>
> Correct, at least for NVIDIA.
>
>> PCI devices have the "managed='yes|no'" attribute as well. That's what
>> determines whether the device is to be detached from the host or not.
>> That's been something very painful to manage for vfio and well libvirt!
>
> mdevs do not exist on the host (they do not have a driver on the host
> because they are not PCI devices) so they do need any management.  At
> least I hope that's good news. :)

What's your definition of "management"? They don't need the same type of 
management as a traditional hostdev, but they certainly don't just 
appear by magic! :-)

For standard PCI devices, the managed attribute says whether or not the 
device needs to be detached from the host driver and attached to 
vfio-pci. For other kinds of hostdev devices, we could decide that it 
meant something different. In this case, perhaps managed='yes' could 
mean that the vGPU will be created as needed, and destroyed when the 
guest is finished with it, and managed='no' could mean that we expect a 
vGPU to already exist, and just need starting.

Or not. Maybe that's a pointless distinction in this case. Just pointing 
out the option...


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 21:48                     ` [Qemu-devel] " Paolo Bonzini
@ 2016-09-03 11:56                         ` John Ferlan
  2016-09-03 16:34                       ` [Qemu-devel] " Kirti Wankhede
  1 sibling, 0 replies; 162+ messages in thread
From: John Ferlan @ 2016-09-03 11:56 UTC (permalink / raw)
  To: Paolo Bonzini, Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 09/02/2016 05:48 PM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 20:33, Kirti Wankhede wrote:
>> <Alex> We could even do:
>>>>
>>>> echo $UUID1:$GROUPA > create
>>>>
>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>> which $UUID1 is to be created and added to the same group.
>> </Alex>
> 
> From the point of view of libvirt, I think I prefer Alex's idea.
> <group> could be an additional element in the nodedev-create XML:
> 
>     <device>
>       <name>my-vgpu</name>
>       <parent>pci_0000_86_00_0</parent>
>       <capability type='mdev'>
>         <type id='11'/>
>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>         <group>group1</group>
>       </capability>
>     </device>
> 
> (should group also be a UUID?)

As long as create_group handles all the work and all libvirt does is
call it, get the return status/error, and handle deleting the vGPU on
error, then I guess it's doable.

Alternatively having multiple <type id='#'> in the XML and performing a
single *mdev/create_group is an option. I suppose it all depends on the
arguments to create_group and the expected output and how that's
expected to be used.

That is, what is the "output" from create_group that gets added to the
domain XML?  How is that found? Also, once the domain is running can a
vGPU be added to the group?  Removed?  What allows/prevents?

> 
> Since John brought up the topic of minimal XML, in this case it will be
> like this:
> 
>     <device>
>       <name>my-vgpu</name>
>       <parent>pci_0000_86_00_0</parent>
>       <capability type='mdev'>
>         <type id='11'/>
>       </capability>
>     </device>
> 
> The uuid will be autogenerated by libvirt and if there's no <group> (as
> is common for VMs with only 1 vGPU) it will be a single-device group.
> 

The <name> could be ignored as it seems existing libvirt code wants to
generate a name via udevGenerateDeviceName for other devices. I haven't
studied it long enough, but I believe that's how those pci_####* names
are created.

John

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 21:44                   ` [Qemu-devel] [libvirt] " Paolo Bonzini
@ 2016-09-03 11:57                     ` John Ferlan
  -1 siblings, 0 replies; 162+ messages in thread
From: John Ferlan @ 2016-09-03 11:57 UTC (permalink / raw)
  To: Paolo Bonzini, Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



>>> After creating the vGPU, if required by the host driver, all the other
>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>
>> Not wanting to make assumptions, but this reads as if I create one type
>> 11 vGPU, then I can create no others on the host.  Maybe I'm reading it
>> wrong - it's been a long week.
> 
> Correct, at least for NVIDIA.
> 

OK, but so what am I missing vis-a-vis the groups conversation?  Sounds
like multiple vGPUs are being combined, but if only one can be created.
I think this is where I got confused while reading...

>> PCI devices have the "managed='yes|no'" attribute as well. That's what
>> determines whether the device is to be detached from the host or not.
>> That's been something very painful to manage for vfio and well libvirt!
> 
> mdevs do not exist on the host (they do not have a driver on the host
> because they are not PCI devices) so they do need any management.  At
> least I hope that's good news. :)
> 

Laine was more eloquent than I on this...

John

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-03 11:56                         ` [Qemu-devel] [libvirt] " John Ferlan
@ 2016-09-03 13:07                           ` Paolo Bonzini
  -1 siblings, 0 replies; 162+ messages in thread
From: Paolo Bonzini @ 2016-09-03 13:07 UTC (permalink / raw)
  To: John Ferlan, Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 03/09/2016 13:56, John Ferlan wrote:
> On 09/02/2016 05:48 PM, Paolo Bonzini wrote:
>> On 02/09/2016 20:33, Kirti Wankhede wrote:
>>> <Alex> We could even do:
>>>>>
>>>>> echo $UUID1:$GROUPA > create
>>>>>
>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>> which $UUID1 is to be created and added to the same group.
>>> </Alex>
>>
>> From the point of view of libvirt, I think I prefer Alex's idea.
>> <group> could be an additional element in the nodedev-create XML:
>>
>>     <device>
>>       <name>my-vgpu</name>
>>       <parent>pci_0000_86_00_0</parent>
>>       <capability type='mdev'>
>>         <type id='11'/>
>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>         <group>group1</group>
>>       </capability>
>>     </device>
>>
>> (should group also be a UUID?)
> 
> As long as create_group handles all the work and all libvirt does is
> call it, get the return status/error, and handle deleting the vGPU on
> error, then I guess it's doable.
> 
> Alternatively having multiple <type id='#'> in the XML and performing a
> single *mdev/create_group is an option.

I don't really like the idea of a single nodedev-create creating
multiple devices, but that would work too.

> That is, what is the "output" from create_group that gets added to the
> domain XML?  How is that found?

A new sysfs path is created, whose name depends on the UUID.  The UUID
is used in a <hostdev> element in the domain XML and the sysfs path
appears in the QEMU command line.  Kirti and Neo had examples in their
presentation at KVM Forum.
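
Roughly, the sysfs path surfaces on the QEMU command line like this (a
sketch based on vfio-pci's existing sysfsdev option):

    -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/0695d332-7831-493f-9e71-1c85c8911a08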

If you create multiple devices in the same group, they are added to the
same IOMMU group so they must be used by the same VM.  However they
don't have to be available from the beginning; they could be
hotplugged/hot-unplugged later, since from the point of view of the VM
those are just another PCI device.

> Also, once the domain is running can a
> vGPU be added to the group?  Removed?  What allows/prevents?

Kirti?... :)

In principle I don't think anything should block vGPUs from different
groups being added to the same VM, but I have to defer to Alex and Kirti
again on this.

>> Since John brought up the topic of minimal XML, in this case it will be
>> like this:
>>
>>     <device>
>>       <name>my-vgpu</name>
>>       <parent>pci_0000_86_00_0</parent>
>>       <capability type='mdev'>
>>         <type id='11'/>
>>       </capability>
>>     </device>
>>
>> The uuid will be autogenerated by libvirt and if there's no <group> (as
>> is common for VMs with only 1 vGPU) it will be a single-device group.
> 
> The <name> could be ignored as it seems existing libvirt code wants to
> generate a name via udevGenerateDeviceName for other devices. I haven't
> studied it long enough, but I believe that's how those pci_####* names
> created.

Yeah that makes sense.  So we get down to a minimal XML that has just
parent, and capability with type in it; additional elements could be
name (ignored anyway), and within capability uuid and group.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 20:29                       ` [Qemu-devel] [libvirt] " John Ferlan
@ 2016-09-03 16:31                         ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-03 16:31 UTC (permalink / raw)
  To: John Ferlan, Paolo Bonzini, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 9/3/2016 1:59 AM, John Ferlan wrote:
> 
> 
> On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
>>
>> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 19:15, Kirti Wankhede wrote:
>>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <type id='11'/>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> After creating the vGPU, if required by the host driver, all the other
>>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>>>
>>>> Thanks Paolo for details.
>>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>>>> file in sysfs to create mdev device. Right?
>>>> At this moment, does libvirt know which VM this device would be
>>>> associated with?
>>>
>>> No, the VM will associate to the nodedev through the UUID.  The nodedev
>>> is created separately from the VM.
>>>
>>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>>>> info, again taken from sysfs:
>>>>>
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>        <!-- only the chosen type -->
>>>>>        <type id='11'>
>>>>>          <!-- ... snip ... -->
>>>>>        </type>
>>>>>        <capability type='pci'>
>>>>>          <!-- no domain/bus/slot/function of course -->
>>>>>          <!-- could show whatever PCI IDs are seen by the guest: -->
>>>>>          <product id='...'>...</product>
>>>>>          <vendor id='0x10de'>NVIDIA</vendor>
>>>>>        </capability>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>>>> pci at all, would have it inside mdev.  This represents the difference
>>>>> between the mdev provider and the mdev device.
>>>>
>>>> Parent of mdev device might not always be a PCI device. I think we
>>>> shouldn't consider it as PCI capability.
>>>
>>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
>>> as a PCI device by VFIO.
>>>
>>> The <capability type='pci'> in the physical GPU means that the GPU is a
>>> PCI device.
>>>
>>
>> Ok. Got that.
>>
>>>>> Random proposal for the domain XML too:
>>>>>
>>>>>   <hostdev mode='subsystem' type='pci'>
>>>>>     <source type='mdev'>
>>>>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>>>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>     </source>
>>>>>     <address type='pci' bus='0' slot='2' function='0'/>
>>>>>   </hostdev>
>>>>>
>>>>
>>>> When user wants to assign two mdev devices to one VM, user have to add
>>>> such two entries or group the two devices in one entry?
>>>
>>> Two entries, one per UUID, each with its own PCI address in the guest.
>>>
>>>> On other mail thread with same subject we are thinking of creating group
>>>> of mdev devices to assign multiple mdev devices to one VM.
>>>
>>> What is the advantage in managing mdev groups?  (Sorry didn't follow the
>>> other thread).
>>>
>>
>> When mdev device is created, resources from physical device is assigned
>> to this device. But resources are committed only when device goes
>> 'online' ('start' in v6 patch)
>> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
>> for all vGPU devices in a VM are committed at one place. So we need to
>> know the vGPUs assigned to a VM before QEMU starts.
>>
>> Grouping would help here as Alex suggested in that mail. Pulling only
>> that part of discussion here:
>>
>> <Alex> It seems then that the grouping needs to affect the iommu group
>> so that
>>> you know that there's only a single owner for all the mdev devices
>>> within the group.  IIRC, the bus drivers don't have any visibility
>>> to opening and releasing of the group itself to trigger the
>>> online/offline, but they can track opening of the device file
>>> descriptors within the group.  Within the VFIO API the user cannot
>>> access the device without the device file descriptor, so a "first
>>> device opened" and "last device closed" trigger would provide the
>>> trigger points you need.  Some sort of new sysfs interface would need
>>> to be invented to allow this sort of manipulation.
>>> Also we should probably keep sight of whether we feel this is
>>> sufficiently necessary for the complexity.  If we can get by with only
>>> doing this grouping at creation time then we could define the "create"
>>> interface in various ways.  For example:
>>>
>>> echo $UUID0 > create
>>>
>>> would create a single mdev named $UUID0 in it's own group.
>>>
>>> echo {$UUID0,$UUID1} > create
>>>
>>> could create mdev devices $UUID0 and $UUID1 grouped together.
>>>
>> </Alex>
>>
>> <Kirti>
>> I think this would create mdev device of same type on same parent
>> device. We need to consider the case of multiple mdev devices of
>> different types and with different parents to be grouped together.
>> </Kirti>
>>
>> <Alex> We could even do:
>>>
>>> echo $UUID1:$GROUPA > create
>>>
>>> where $GROUPA is the group ID of a previously created mdev device into
>>> which $UUID1 is to be created and added to the same group.
>> </Alex>
>>
>> <Kirti>
>> I was thinking about:
>>
>>   echo $UUID0 > create
>>
>> would create mdev device
>>
>>   echo $UUID0 > /sys/class/mdev/create_group
>>
>> would add created device to group.
>>
>> For multiple devices case:
>>   echo $UUID0 > create
>>   echo $UUID1 > create
>>
>> would create mdev devices which could be of different types and
>> different parents.
>>   echo $UUID0, $UUID1 > /sys/class/mdev/create_group
>>
>> would add devices in a group.
>> Mdev core module would create a new group with unique number.  On mdev
>> device 'destroy' that mdev device would be removed from the group. When
>> there are no devices left in the group, group would be deleted. With
>> this "first device opened" and "last device closed" trigger can be used
>> to commit resources.
>> Then libvirt use mdev device path to pass as argument to QEMU, same as
>> it does for VFIO. Libvirt don't have to care about group number.
>> </Kirti>
>>
> 
> The more complicated one makes this, the more difficult it is for the
> customer to configure and the more difficult it is and the longer it
> takes to get something out. I didn't follow the details of groups...
> 
> What gets created from a pass through some *mdev/create_group?  

My proposal here is, on
  echo $UUID1, $UUID2 > /sys/class/mdev/create_group
would create a group in the mdev core driver, which should be internal to
the mdev core module. In the mdev core module, a unique group number would
be saved in the mdev_device structure for each device belonging to that group.

> Does
> some new udev device get create that then is fed to the guest?

No, a group is not a device. It would just be an identifier that the
vendor driver uses to identify devices in a group.

> Seems
> painful to make two distinct/async passes through systemd/udev. I
> foresee testing nightmares with creating 3 vGPU's, processing a group
> request, while some other process/thread is deleting a vGPU... How do
> the vGPU's get marked so that the delete cannot happen.
> 

How is the same case handled for a directly assigned device? I mean, a
device is unbound from its vendor's driver and bound to vfio_pci. How is
it guaranteed to stay assigned to the vfio_pci module? Some other
process/thread might unbind it from the vfio_pci module.
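
For comparison, the existing flow for a directly assigned device is
roughly the following (the vendor/device IDs are placeholders):

    echo 0000:86:00.0 > /sys/bus/pci/devices/0000:86:00.0/driver/unbind
    echo "$VENDOR_ID $DEVICE_ID" > /sys/bus/pci/drivers/vfio-pci/new_id
    # nothing in sysfs itself stops another process from unbinding it later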

> If a vendor wants to create their own utility to group vHBA's together
> and manage that grouping, then have at it...  Doesn't seem to be
> something libvirt needs to be or should be managing...  As I go running
> for cover...
> 
> If having multiple types generated for a single vGPU, then consider the
> following XML:
> 
>    <capability type='mdev'>
>      <type id='11' [other attributes]/>
>      <type id='11' [other attributes]/>
>      <type id='12' [other attributes]/>
>      [<uuid>...</uuid>]
>     </capability>
> 
> then perhaps building the mdev_create input would be a comma separated
> list of type's to be added... "$UUID:11,11,12". Just a thought...
> 

In that case the vGPUs are created on the same physical GPU. Consider
the case where two vGPUs on different physical devices need to be
assigned to a VM. Then those should be two different create commands:

   echo $UUID0 > /sys/../<bdf1>/mdev_create
   echo $UUID1 > /sys/../<bdf2>/mdev_create

Kirti.
> 
> John
> 
>> Thanks,
>> Kirti
>>
>> --
>> libvir-list mailing list
>> libvir-list@redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list
>>

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [libvirt] [PATCH v7 0/4] Add Mediated device support
@ 2016-09-03 16:31                         ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-03 16:31 UTC (permalink / raw)
  To: John Ferlan, Paolo Bonzini, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 9/3/2016 1:59 AM, John Ferlan wrote:
> 
> 
> On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
>>
>> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 19:15, Kirti Wankhede wrote:
>>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <type id='11'/>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> After creating the vGPU, if required by the host driver, all the other
>>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>>>
>>>> Thanks Paolo for details.
>>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>>>> file in sysfs to create mdev device. Right?
>>>> At this moment, does libvirt know which VM this device would be
>>>> associated with?
>>>
>>> No, the VM will associate to the nodedev through the UUID.  The nodedev
>>> is created separately from the VM.
>>>
>>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>>>> info, again taken from sysfs:
>>>>>
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>        <!-- only the chosen type -->
>>>>>        <type id='11'>
>>>>>          <!-- ... snip ... -->
>>>>>        </type>
>>>>>        <capability type='pci'>
>>>>>          <!-- no domain/bus/slot/function of course -->
>>>>>          <!-- could show whatever PCI IDs are seen by the guest: -->
>>>>>          <product id='...'>...</product>
>>>>>          <vendor id='0x10de'>NVIDIA</vendor>
>>>>>        </capability>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>>>> pci at all, would have it inside mdev.  This represents the difference
>>>>> between the mdev provider and the mdev device.
>>>>
>>>> Parent of mdev device might not always be a PCI device. I think we
>>>> shouldn't consider it as PCI capability.
>>>
>>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
>>> as a PCI device by VFIO.
>>>
>>> The <capability type='pci'> in the physical GPU means that the GPU is a
>>> PCI device.
>>>
>>
>> Ok. Got that.
>>
>>>>> Random proposal for the domain XML too:
>>>>>
>>>>>   <hostdev mode='subsystem' type='pci'>
>>>>>     <source type='mdev'>
>>>>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>>>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>     </source>
>>>>>     <address type='pci' bus='0' slot='2' function='0'/>
>>>>>   </hostdev>
>>>>>
>>>>
>>>> When user wants to assign two mdev devices to one VM, user have to add
>>>> such two entries or group the two devices in one entry?
>>>
>>> Two entries, one per UUID, each with its own PCI address in the guest.
>>>
>>>> On other mail thread with same subject we are thinking of creating group
>>>> of mdev devices to assign multiple mdev devices to one VM.
>>>
>>> What is the advantage in managing mdev groups?  (Sorry didn't follow the
>>> other thread).
>>>
>>
>> When mdev device is created, resources from physical device is assigned
>> to this device. But resources are committed only when device goes
>> 'online' ('start' in v6 patch)
>> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
>> for all vGPU devices in a VM are committed at one place. So we need to
>> know the vGPUs assigned to a VM before QEMU starts.
>>
>> Grouping would help here as Alex suggested in that mail. Pulling only
>> that part of discussion here:
>>
>> <Alex> It seems then that the grouping needs to affect the iommu group
>> so that
>>> you know that there's only a single owner for all the mdev devices
>>> within the group.  IIRC, the bus drivers don't have any visibility
>>> to opening and releasing of the group itself to trigger the
>>> online/offline, but they can track opening of the device file
>>> descriptors within the group.  Within the VFIO API the user cannot
>>> access the device without the device file descriptor, so a "first
>>> device opened" and "last device closed" trigger would provide the
>>> trigger points you need.  Some sort of new sysfs interface would need
>>> to be invented to allow this sort of manipulation.
>>> Also we should probably keep sight of whether we feel this is
>>> sufficiently necessary for the complexity.  If we can get by with only
>>> doing this grouping at creation time then we could define the "create"
>>> interface in various ways.  For example:
>>>
>>> echo $UUID0 > create
>>>
>>> would create a single mdev named $UUID0 in it's own group.
>>>
>>> echo {$UUID0,$UUID1} > create
>>>
>>> could create mdev devices $UUID0 and $UUID1 grouped together.
>>>
>> </Alex>
>>
>> <Kirti>
>> I think this would create mdev devices of the same type on the same parent
>> device. We need to consider the case where multiple mdev devices of
>> different types and with different parents are grouped together.
>> </Kirti>
>>
>> <Alex> We could even do:
>>>
>>> echo $UUID1:$GROUPA > create
>>>
>>> where $GROUPA is the group ID of a previously created mdev device into
>>> which $UUID1 is to be created and added to the same group.
>> </Alex>
>>
>> <Kirti>
>> I was thinking about:
>>
>>   echo $UUID0 > create
>>
>> would create mdev device
>>
>>   echo $UUID0 > /sys/class/mdev/create_group
>>
>> would add created device to group.
>>
>> For multiple devices case:
>>   echo $UUID0 > create
>>   echo $UUID1 > create
>>
>> would create mdev devices which could be of different types and
>> different parents.
>>   echo $UUID0, $UUID1 > /sys/class/mdev/create_group
>>
>> would add devices in a group.
>> The mdev core module would create a new group with a unique number.  On mdev
>> device 'destroy' that mdev device would be removed from the group. When
>> there are no devices left in the group, the group would be deleted. With
>> this, the "first device opened" and "last device closed" triggers can be
>> used to commit resources.
>> Then libvirt passes the mdev device path as an argument to QEMU, the same
>> as it does for VFIO. Libvirt doesn't have to care about the group number.
>> </Kirti>
>>
> 
> The more complicated one makes this, the more difficult it is for the
> customer to configure, and the more difficult and the longer it
> takes to get something out. I didn't follow the details of groups...
> 
> What gets created from a pass through some *mdev/create_group?  

My proposal here is that
  echo $UUID1, $UUID2 > /sys/class/mdev/create_group
would create a group in the mdev core driver, which should be internal to
the mdev core module. In the mdev core module, a unique group number would be
saved in the mdev_device structure for each device belonging to that group.
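
A minimal sketch of the bookkeeping this would imply inside the mdev core
module (the structure members and names below are hypothetical, meant only
to illustrate the proposal; they are not part of the posted patches):

    /* illustrative only: one instance per group created via create_group */
    struct mdev_group {
            int              id;       /* unique number allocated by mdev core */
            struct kref      ref;      /* group goes away with its last member */
            struct list_head devices;  /* member mdev_device list */
            struct mutex     lock;
    };

    struct mdev_device {
            struct device       dev;
            /* ... fields from the posted series ... */
            struct mdev_group  *group;       /* NULL until added to a group */
            struct list_head    group_next;  /* entry on group->devices */
    };

On 'destroy', the device would drop its reference on its group, and when the
last reference is gone the group itself would be deleted, matching the
behaviour described above.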

> Does
> some new udev device get created that is then fed to the guest?

No, a group is not a device. It would just be an identifier for the vendor
driver to identify the devices in a group.

> Seems
> painful to make two distinct/async passes through systemd/udev. I
> foresee testing nightmares with creating 3 vGPU's, processing a group
> request, while some other process/thread is deleting a vGPU... How do
> the vGPU's get marked so that the delete cannot happen.
> 

How is the same case handled for a directly assigned device? I mean, a device
is unbound from its vendor driver and bound to the vfio_pci driver. How is it
guaranteed to stay assigned to the vfio_pci module? Some other process/thread
might unbind it from the vfio_pci module.

> If a vendor wants to create their own utility to group vHBA's together
> and manage that grouping, then have at it...  Doesn't seem to be
> something libvirt needs to be or should be managing...  As I go running
> for cover...
> 
> If having multiple types generated for a single vGPU, then consider the
> following XML:
> 
>    <capability type='mdev'>
>      <type id='11' [other attributes]/>
>      <type id='11' [other attributes]/>
>      <type id='12' [other attributes]/>
>      [<uuid>...</uuid>]
>     </capability>
> 
> then perhaps building the mdev_create input would be a comma separated
> list of type's to be added... "$UUID:11,11,12". Just a thought...
> 

In that case the vGPUs are created on the same physical GPU. Consider the
case where two vGPUs on different physical devices need to be assigned to a
VM. Then those should be two different create commands:

   echo $UUID0 > /sys/../<bdf1>/mdev_create
   echo $UUID1 > /sys/../<bdf2>/mdev_create

Kirti.
> 
> John
> 
>> Thanks,
>> Kirti
>>
>> --
>> libvir-list mailing list
>> libvir-list@redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list
>>

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 21:48                     ` [Qemu-devel] " Paolo Bonzini
  2016-09-03 11:56                         ` [Qemu-devel] [libvirt] " John Ferlan
@ 2016-09-03 16:34                       ` Kirti Wankhede
  2016-09-06 17:40                         ` Alex Williamson
  1 sibling, 1 reply; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-03 16:34 UTC (permalink / raw)
  To: Paolo Bonzini, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 20:33, Kirti Wankhede wrote:
>> <Alex> We could even do:
>>>>
>>>> echo $UUID1:$GROUPA > create
>>>>
>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>> which $UUID1 is to be created and added to the same group.
>> </Alex>
> 
> From the point of view of libvirt, I think I prefer Alex's idea.
> <group> could be an additional element in the nodedev-create XML:
> 
>     <device>
>       <name>my-vgpu</name>
>       <parent>pci_0000_86_00_0</parent>
>       <capability type='mdev'>
>         <type id='11'/>
>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>         <group>group1</group>
>       </capability>
>     </device>
> 
> (should group also be a UUID?)
> 

No, this should be a unique number in a system, similar to iommu_group.

> Since John brought up the topic of minimal XML, in this case it will be
> like this:
> 
>     <device>
>       <name>my-vgpu</name>
>       <parent>pci_0000_86_00_0</parent>
>       <capability type='mdev'>
>         <type id='11'/>
>       </capability>
>     </device>
> 
> The uuid will be autogenerated by libvirt and if there's no <group> (as
> is common for VMs with only 1 vGPU) it will be a single-device group.
> 

Right.

Kirti.

> Thanks,
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 4/4] docs: Add Documentation for Mediated devices
  2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-03 16:40     ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-03 16:40 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, eblake

Adding Eric.

Eric,
This is the v7 version of the patch. I'll incorporate the changes that you
suggested here.

Kirti.

On 8/25/2016 9:23 AM, Kirti Wankhede wrote:
> Add file Documentation/vfio-mediated-device.txt that include details of
> mediated device framework.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
> Reviewed-on: http://git-master/r/1182512
> Reviewed-by: Automatic_Commit_Validation_User
> ---
>  Documentation/vfio-mediated-device.txt | 203 +++++++++++++++++++++++++++++++++
>  1 file changed, 203 insertions(+)
>  create mode 100644 Documentation/vfio-mediated-device.txt
> 
> diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
> new file mode 100644
> index 000000000000..237d8eb630b7
> --- /dev/null
> +++ b/Documentation/vfio-mediated-device.txt
> @@ -0,0 +1,203 @@
> +VFIO Mediated devices [1]
> +-------------------------------------------------------------------------------
> +
> +There are more and more use cases/demands to virtualize DMA devices which
> +don't have SR-IOV capability built in. To do this, drivers of different
> +devices had to develop their own management interface and set of APIs, and
> +then integrate them into user space software. We've identified common
> +requirements and a unified management interface for such devices to make
> +user space software integration easier.
> +
> +The VFIO driver framework provides unified APIs for direct device access. It is
> +an IOMMU/device agnostic framework for exposing direct device access to
> +user space, in a secure, IOMMU protected environment. This framework is
> +used for multiple devices like GPUs, network adapters and compute accelerators.
> +With direct device access, virtual machines or user space applications have
> +direct access to the physical device. This framework is reused for mediated devices.
> +
> +The mediated core driver provides a common interface for mediated device
> +management that can be used by drivers of different devices. This module
> +provides a generic interface to create/destroy a mediated device, add/remove
> +it to/from the mediated bus driver and add/remove it to/from an IOMMU group.
> +It also provides an interface to register a bus driver; for example, the
> +mediated VFIO mdev driver is designed for mediated devices and supports VFIO
> +APIs. The mediated bus driver adds/deletes mediated devices to/from the VFIO
> +group.
> +
> +Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
> +as examples, since these are the devices which are going to actively use
> +this module as of now.
> +
> +     +---------------+
> +     |               |
> +     | +-----------+ |  mdev_register_driver() +--------------+
> +     | |           | +<------------------------+              |
> +     | |  mdev     | |                         |              |
> +     | |  bus      | +------------------------>+ vfio_mdev.ko |<-> VFIO user
> +     | |  driver   | |     probe()/remove()    |              |    APIs
> +     | |           | |                         +--------------+
> +     | +-----------+ |
> +     |               |
> +     |  MDEV CORE    |
> +     |   MODULE      |
> +     |   mdev.ko     |
> +     | +-----------+ |  mdev_register_device() +--------------+
> +     | |           | +<------------------------+              |
> +     | |           | |                         |  nvidia.ko   |<-> physical
> +     | |           | +------------------------>+              |    device
> +     | |           | |        callbacks        +--------------+
> +     | | Physical  | |
> +     | |  device   | |  mdev_register_device() +--------------+
> +     | | interface | |<------------------------+              |
> +     | |           | |                         |  i915.ko     |<-> physical
> +     | |           | +------------------------>+              |    device
> +     | |           | |        callbacks        +--------------+
> +     | |           | |
> +     | |           | |  mdev_register_device() +--------------+
> +     | |           | +<------------------------+              |
> +     | |           | |                         | ccw_device.ko|<-> physical
> +     | |           | +------------------------>+              |    device
> +     | |           | |        callbacks        +--------------+
> +     | +-----------+ |
> +     +---------------+
> +
> +
> +Registration Interfaces
> +-------------------------------------------------------------------------------
> +
> +Mediated core driver provides two types of registration interfaces:
> +
> +1. Registration interface for mediated bus driver:
> +-------------------------------------------------
> +     /*
> +      * struct mdev_driver [2] - Mediated device's driver
> +      * @name: driver name
> +      * @probe: called when new device created
> +      * @remove: called when device removed
> +      * @driver: device driver structure
> +      */
> +     struct mdev_driver {
> +	     const char *name;
> +	     int  (*probe)  (struct device *dev);
> +	     void (*remove) (struct device *dev);
> +	     struct device_driver    driver;
> +     };
> +
> +The mediated bus driver for mdev should use this interface to register with
> +and unregister from the core driver:
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +The mediated bus driver is responsible for adding/deleting mediated devices
> +to/from the VFIO group when devices are bound to and unbound from the driver.
> +
> +2. Physical device driver interface:
> +-----------------------------------
> +This interface [3] provides a set of APIs to manage physical device related
> +work in its driver. The APIs are:
> +
> +* dev_attr_groups: attributes of the parent device.
> +* mdev_attr_groups: attributes of the mediated device.
> +* supported_config: to provide the list of configurations supported by the driver.
> +* create: to allocate basic resources in driver for a mediated device.
> +* destroy: to free resources in driver when mediated device is destroyed.
> +* reset: to free and reallocate resources in driver on mediated device reset.
> +* set_online_status: to change online status of mediated device.
> +* get_online_status: to get current (online/offline) status of mediated device.
> +* read : read emulation callback.
> +* write: write emulation callback.
> +* mmap: mmap emulation callback.
> +* get_irq_info: to retrieve information about mediated device's IRQ.
> +* set_irqs: gives interrupt configuration information that VMM sets.
> +* get_region_info: to provide region size and its flags for the mediated device.
> +    Vendor driver can provide the capability id and corresponding capability
> +    structure if it wants to support a capability.
> +* get_device_info: to retrieve VFIO device related flags, number of regions and
> +  number of IRQs supported.
> +
> +Drivers should use this interface to register/unregister the device with the
> +mdev core driver:
> +
> +extern int  mdev_register_device(struct device *dev,
> +                                 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +Mediated device management interface via sysfs
> +-------------------------------------------------------------------------------
> +This is the interface that allows user space software, like libvirt, to query
> +and configure mediated devices in a HW-agnostic fashion. This management
> +interface gives the underlying physical device's driver the flexibility to
> +support mediated device hotplug, multiple mediated devices per virtual machine,
> +multiple mediated devices from different physical devices, etc.
> +
> +Under per-physical device sysfs:
> +--------------------------------
> +
> +* mdev_supported_types: (read only)
> +    Lists the currently supported mediated device types and their details.
> +
> +* mdev_create: (write only)
> +	Create a mediated device on target physical device.
> +	Input syntax: <UUID:params>
> +	where,
> +		UUID: mediated device's UUID
> +		params: extra parameters required by driver
> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc:0" >
> +				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
> +
> +* mdev_destroy: (write only)
> +	Destroy a mediated device on a target physical device.
> +	Input syntax: <UUID>
> +	where,
> +		UUID: mediated device's UUID
> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc" >
> +			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy
> +
> +Under per mdev device:
> +----------------------------------------
> +
> +* online: (read write)
> +	Reading this file gives the current status of the mediated device (0 or 1).
> +	Writing this file (0 or 1) changes the state of the mediated device.
> +	This triggers the registration callback to notify the driver to commit
> +	or free mediated device resources. This callback is a blocking call; a
> +	successful return indicates that the requested mdev resources have been
> +	fully committed and the VMM should continue.
> +	Example:
> +	# echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
> +
> +
> +Mediated device Hotplug:
> +-----------------------
> +
> +To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
> +accessed during VM runtime, and the corresponding registration callback is
> +invoked to allow the driver to support hotplug.
> +
> +Translation APIs for Mediated device
> +------------------------------------------------------------------------------
> +
> +Below APIs are provided for user pfn to host pfn translation in VFIO driver:
> +
> +extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +                           long npage, int prot, unsigned long *phys_pfn);
> +
> +extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
> +			     long npage);
> +
> +These functions call back into the backend IOMMU module using two callbacks of
> +struct vfio_iommu_driver_ops, pin_pages and unpin_pages [4]. Currently these are
> +supported in TYPE1 IOMMU module. To enable the same for other IOMMU backend
> +modules, such as PPC64 sPAPR module, they need to provide these two callback
> +functions.
> +
> +References
> +-------------------------------------------------------------------------------
> +
> +[1] See Documentation/vfio.txt for more information on VFIO.
> +[2] struct mdev_driver in include/linux/mdev.h
> +[3] struct parent_ops in include/linux/mdev.h
> +[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
> +
> 
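
A skeleton of how a vendor module would plug into the registration interface
described in the documentation above; the parent device pointer and module
names are placeholders, and the parent_ops callbacks (create, destroy, read,
write, mmap, get_device_info, get_region_info, get_irq_info, set_irqs, ...)
are vendor specific and therefore left out here:

    #include <linux/module.h>
    #include <linux/device.h>
    #include <linux/mdev.h>

    /* placeholder: struct device of the parent physical device */
    static struct device *my_parent_dev;

    static const struct parent_ops my_parent_ops = {
            /* fill in the callbacks listed in the documentation above */
    };

    static int __init my_vendor_init(void)
    {
            return mdev_register_device(my_parent_dev, &my_parent_ops);
    }
    module_init(my_vendor_init);

    static void __exit my_vendor_exit(void)
    {
            mdev_unregister_device(my_parent_dev);
    }
    module_exit(my_vendor_exit);

    MODULE_LICENSE("GPL");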

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 23:57                     ` [Qemu-devel] [libvirt] " Laine Stump
@ 2016-09-03 16:49                       ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-03 16:49 UTC (permalink / raw)
  To: Laine Stump, Michal Privoznik, Alex Williamson, libvir-list
  Cc: Paolo Bonzini, John Ferlan, Song, Jike, cjia, kvm, Tian, Kevin,
	qemu-devel, kraxel, bjsdjshi



On 9/3/2016 5:27 AM, Laine Stump wrote:
> On 09/02/2016 05:44 PM, Paolo Bonzini wrote:
>>
>>
>> On 02/09/2016 22:19, John Ferlan wrote:
>>> We don't have such a pool for GPU's (yet) - although I suppose they
>>> could just become a class of storage pools.
>>>
>>> The issue being nodedev device objects are not saved between reboots.
>>> They are generated on the fly. Hence the "create-nodedev' API - notice
>>> there's no "define-nodedev' API, although I suppose one could be
>>> created. It's just more work to get this all to work properly.
>>
>> It can all be made transient to begin with.  The VM can be defined but
>> won't start unless the mdev(s) exist with the right UUIDs.
>>
>>>> After creating the vGPU, if required by the host driver, all the other
>>>> type ids would disappear from "virsh nodedev-dumpxml
>>>> pci_0000_86_00_0" too.
>>>
>>> Not wanting to make assumptions, but this reads as if I create one type
>>> 11 vGPU, then I can create no others on the host.  Maybe I'm reading it
>>> wrong - it's been a long week.
>>
>> Correct, at least for NVIDIA.
>>
>>> PCI devices have the "managed='yes|no'" attribute as well. That's what
>>> determines whether the device is to be detached from the host or not.
>>> That's been something very painful to manage for vfio and well libvirt!
>>
>> mdevs do not exist on the host (they do not have a driver on the host
>> because they are not PCI devices) so they do not need any management.  At
>> least I hope that's good news. :)
> 
> What's your definition of "management"? They don't need the same type of
> management as a traditional hostdev, but they certainly don't just
> appear by magic! :-)
> 
> For standard PCI devices, the managed attribute says whether or not the
> device needs to be detached from the host driver and attached to
> vfio-pci. For other kinds of hostdev devices, we could decide that it
> meant something different. In this case, perhaps managed='yes' could
> mean that the vGPU will be created as needed, and destroyed when the
> guest is finished with it, and managed='no' could mean that we expect a
> vGPU to already exist, and just need starting.
> 
> Or not. Maybe that's a pointless distinction in this case. Just pointing
> out the option...
> 

Mediated devices are like virtual devices; there may be no direct
physical device associated with them. All mdev devices are owned by the
vfio_mdev module, which is similar to the vfio_pci module. I don't think we
need to interpret the 'managed' attribute for mdev devices the same as for
standard PCI devices.
If an mdev device is created, you would find its device directory under
/sys/bus/mdev/devices/.

Kirti.

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [PATCH v7 0/4] Add Mediated device support
  2016-09-03 13:07                           ` [Qemu-devel] [libvirt] " Paolo Bonzini
@ 2016-09-03 17:47                             ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-03 17:47 UTC (permalink / raw)
  To: Paolo Bonzini, John Ferlan, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 9/3/2016 6:37 PM, Paolo Bonzini wrote:
> 
> 
> On 03/09/2016 13:56, John Ferlan wrote:
>> On 09/02/2016 05:48 PM, Paolo Bonzini wrote:
>>> On 02/09/2016 20:33, Kirti Wankhede wrote:
>>>> <Alex> We could even do:
>>>>>>
>>>>>> echo $UUID1:$GROUPA > create
>>>>>>
>>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>>> which $UUID1 is to be created and added to the same group.
>>>> </Alex>
>>>
>>> >From the point of view of libvirt, I think I prefer Alex's idea.
>>> <group> could be an additional element in the nodedev-create XML:
>>>
>>>     <device>
>>>       <name>my-vgpu</name>
>>>       <parent>pci_0000_86_00_0</parent>
>>>       <capability type='mdev'>
>>>         <type id='11'/>
>>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>         <group>group1</group>
>>>       </capability>
>>>     </device>
>>>
>>> (should group also be a UUID?)
>>

I replied to the earlier mail too: the group number doesn't need to be a
UUID. It should be a unique number. I think in the discussion at the BoF
someone mentioned using the domain's unique number that libvirt generates.
That should also work.

>> As long as create_group handles all the work and all libvirt does is
>> call it, get the return status/error, and handle deleting the vGPU on
>> error, then I guess it's doable.
>>

Yes, that is the idea. Libvirt doesn't have to care about the groups.
With Alex's proposal, as you mentioned above, libvirt has to provide the
group number to mdev_create, check the return status and handle the error case.

  echo $UUID1:$GROUP1 > mdev_create
  echo $UUID2:$GROUP1 > mdev_create
would create two mdev devices assigned to the same domain.
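
A rough sketch of how the mdev core's mdev_create store function could
accept such a "<UUID>:<group>" string; the mdev_device_create() helper and
the exact semantics are hypothetical, this only illustrates parsing the
proposed input format:

    static ssize_t mdev_create_store(struct device *dev,
                                     struct device_attribute *attr,
                                     const char *buf, size_t count)
    {
            char *str, *sep;
            unsigned int group = 0;
            uuid_le uuid;
            int ret;

            str = kstrndup(buf, count, GFP_KERNEL);
            if (!str)
                    return -ENOMEM;

            /* optional ":<group>" suffix selects an existing group */
            sep = strchr(str, ':');
            if (sep) {
                    *sep++ = '\0';
                    ret = kstrtouint(sep, 10, &group);
                    if (ret)
                            goto out;
            }

            ret = uuid_le_to_bin(str, &uuid);
            if (ret)
                    goto out;

            ret = mdev_device_create(dev, uuid, group); /* hypothetical helper */
    out:
            kfree(str);
            return ret ? ret : count;
    }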


>> Alternatively having multiple <type id='#'> in the XML and performing a
>> single *mdev/create_group is an option.
> 
> I don't really like the idea of a single nodedev-create creating
> multiple devices, but that would work too.
> 
>> That is, what is the "output" from create_group that gets added to the
>> domain XML?  How is that found?
> 
> A new sysfs path is created, whose name depends on the UUID.  The UUID
> is used in a <hostdev> element in the domain XML and the sysfs path
> appears in the QEMU command line.  Kirti and Neo had examples in their
> presentation at KVM Forum.
> 
> If you create multiple devices in the same group, they are added to the
> same IOMMU group so they must be used by the same VM.  However they
> don't have to be available from the beginning; they could be
> hotplugged/hot-unplugged later, since from the point of view of the VM
> those are just another PCI device.
> 
>> Also, once the domain is running can a
>> vGPU be added to the group?  Removed?  What allows/prevents?
> 
> Kirti?... :)

Yes, a vGPU could be hot-plugged or hot-unplugged. This also depends on
whether the vendor driver wants to support that. For example, if a domain is
running with two vGPUs $UUID1 and $UUID2 and the user tries to hot-unplug
vGPU $UUID2, the vendor driver knows that the domain is running and the vGPU
is being used in the guest, so the vendor driver can fail the offline/close()
call if it doesn't support hot-unplug. Similarly, for hot-plug the vendor
driver can fail the create call if it doesn't support hot-plug.
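
For illustration, a vendor driver's online handler could refuse hot-unplug
along these lines; the callback signature, the drvdata layout and the vgpu
helpers are hypothetical and only sketch the behaviour described above:

    static int my_vgpu_set_online_status(struct mdev_device *mdev, bool online)
    {
            struct my_vgpu *vgpu = dev_get_drvdata(&mdev->dev); /* hypothetical */

            if (!online && vgpu->guest_running)
                    return -EBUSY;  /* hot-unplug refused while guest uses it */

            return online ? my_vgpu_commit_resources(vgpu)   /* hypothetical */
                          : my_vgpu_free_resources(vgpu);    /* hypothetical */
    }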

> 
> In principle I don't think anything should block vGPUs from different
> groups being added to the same VM, but I have to defer to Alex and Kirti
> again on this.
> 

No, there should be one group per VM.

>>> Since John brought up the topic of minimal XML, in this case it will be
>>> like this:
>>>
>>>     <device>
>>>       <name>my-vgpu</name>
>>>       <parent>pci_0000_86_00_0</parent>
>>>       <capability type='mdev'>
>>>         <type id='11'/>
>>>       </capability>
>>>     </device>
>>>
>>> The uuid will be autogenerated by libvirt and if there's no <group> (as
>>> is common for VMs with only 1 vGPU) it will be a single-device group.
>>
>> The <name> could be ignored as it seems existing libvirt code wants to
>> generate a name via udevGenerateDeviceName for other devices. I haven't
>> studied it long enough, but I believe that's how those pci_####* names
>> created.
> 
> Yeah that makes sense.  So we get down to a minimal XML that has just
> parent, and capability with type in it; additional elements could be
> name (ignored anyway), and within capability uuid and group.
>

Yes, this seems good.
I would like to have one more capability here, pulling in a suggestion
from my previous mail:
In the directory structure, a 'params' file can take optional parameters.
Libvirt can then set 'params' and then create the mdev device. For example,
if a param, say 'disable_console_vnc=1', is set for type 11, then devices
created of type 11 will have that param set until it is cleared.

 └── mdev_supported_types
     ├── 11
     │   ├── create
     │   ├── description
     │   └── max_instances
     │   └── params
     ├── 12
     │   ├── create
     │   ├── description
     │   └── max_instances
     │   └── params
     └── 13
         ├── create
         ├── description
         └── max_instances
         └── params

So with that XML format would be:
    <device>
      <name>my-vgpu</name>
      <parent>pci_0000_86_00_0</parent>
      <capability type='mdev'>
        <type id='11'/>
        <group>group1</group>
        <params>disable_console_vnc=1</params>
      </capability>
    </device>

The 'params' field should be just an opaque string to libvirt, and it is
also optional. If the user wants to provide extra parameters while creating a
vGPU device, they should provide them in the XML file given to nodedev-create,
as above.
The very initial proposal was to pass this extra parameter list as a string
to mdev_create itself:

  echo $UUID1:$PARAMS > mdev_create

I would like to know others' opinions on whether it should be part of the
mdev_create input or a separate write to the 'params' file in sysfs as in the
above directory structure.
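
A minimal sketch of what a per-type 'params' attribute could look like on
the kernel side, assuming the string is stored as-is and consumed by the
vendor driver at create time; the mdev_type structure, its lock and the
to_mdev_type() helper are hypothetical:

    static ssize_t params_show(struct kobject *kobj, struct kobj_attribute *attr,
                               char *buf)
    {
            struct mdev_type *type = to_mdev_type(kobj);  /* hypothetical */
            ssize_t len;

            mutex_lock(&type->lock);
            len = scnprintf(buf, PAGE_SIZE, "%s\n", type->params);
            mutex_unlock(&type->lock);
            return len;
    }

    static ssize_t params_store(struct kobject *kobj, struct kobj_attribute *attr,
                                const char *buf, size_t count)
    {
            struct mdev_type *type = to_mdev_type(kobj);  /* hypothetical */

            if (count >= sizeof(type->params))
                    return -EINVAL;

            mutex_lock(&type->lock);
            strlcpy(type->params, buf, sizeof(type->params));
            mutex_unlock(&type->lock);
            return count;
    }

    static struct kobj_attribute params_attr = __ATTR_RW(params);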

Kirti.

> Thanks,
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-02 23:57                     ` [Qemu-devel] [libvirt] " Laine Stump
@ 2016-09-05  7:52                       ` Paolo Bonzini
  -1 siblings, 0 replies; 162+ messages in thread
From: Paolo Bonzini @ 2016-09-05  7:52 UTC (permalink / raw)
  To: Laine Stump, Michal Privoznik, Alex Williamson, libvir-list
  Cc: John Ferlan, Kirti Wankhede, Song, Jike, cjia, kvm, Tian, Kevin,
	qemu-devel, kraxel, bjsdjshi



On 03/09/2016 01:57, Laine Stump wrote:
>>
>> mdevs do not exist on the host (they do not have a driver on the host
>> because they are not PCI devices) so they do not need any management.  At
>> least I hope that's good news. :)
> 
> What's your definition of "management"? They don't need the same type of
> management as a traditional hostdev, but they certainly don't just
> appear by magic! :-)
>
> For standard PCI devices, the managed attribute says whether or not the
> device needs to be detached from the host driver and attached to
> vfio-pci. For other kinds of hostdev devices, we could decide that it
> meant something different. In this case, perhaps managed='yes' could
> mean that the vGPU will be created as needed, and destroyed when the
> guest is finished with it, and managed='no' could mean that we expect a
> vGPU to already exist, and just need starting.

Yes, you're 100% right.  vGPUs have to be created through sysfs, and
that is indeed a kind of management.  My point is that for now, given
there is no support in libvirt for persistent nodedevs, it is safe to
let the user do that and reject managed='yes' for mdev-based <hostdev>.

If later you want to add nodedev-define, then managed='yes' might mean
"create and destroy the nodedev automatically" based on a persistent
definition.  But for now, you can enforce managed='no' (it's the default
anyway) and have the user create a transient nodedev manually before the
domain.  More features can be added incrementally on top.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-03 11:57                     ` [Qemu-devel] [libvirt] " John Ferlan
@ 2016-09-05  7:54                       ` Paolo Bonzini
  -1 siblings, 0 replies; 162+ messages in thread
From: Paolo Bonzini @ 2016-09-05  7:54 UTC (permalink / raw)
  To: John Ferlan, Kirti Wankhede, Michal Privoznik, Alex Williamson
  Cc: Song, Jike, cjia, kvm, libvir-list, Tian, Kevin, qemu-devel,
	kraxel, Laine Stump, bjsdjshi



On 03/09/2016 13:57, John Ferlan wrote:
>>>> After creating the vGPU, if required by the host driver, all the other
>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>>
>>> Not wanting to make assumptions, but this reads as if I create one type
>>> 11 vGPU, then I can create no others on the host.  Maybe I'm reading it
>>> wrong - it's been a long week.
>>
>> Correct, at least for NVIDIA.
>>
> 
> OK, but so what am I missing vis-a-vis the groups conversation?  Sounds
> like multiple vGPU's are being combined, but if only one can be created.
> I think this is where I got confused while reading...

Oh, I read that as "then I can create no other _types_ on the host".
For NVIDIA you can create other vGPUs but they all have to be of the
same type (type 11 in your example).

Paolo

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-03 16:34                       ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-06 17:40                         ` Alex Williamson
  2016-09-06 19:35                           ` Kirti Wankhede
  2016-09-07  6:48                             ` Tian, Kevin
  0 siblings, 2 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-06 17:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Sat, 3 Sep 2016 22:04:56 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> > 
> > 
> > On 02/09/2016 20:33, Kirti Wankhede wrote:  
> >> <Alex> We could even do:  
> >>>>
> >>>> echo $UUID1:$GROUPA > create
> >>>>
> >>>> where $GROUPA is the group ID of a previously created mdev device into
> >>>> which $UUID1 is to be created and added to the same group.  
> >> </Alex>  
> > 
> > From the point of view of libvirt, I think I prefer Alex's idea.
> > <group> could be an additional element in the nodedev-create XML:
> > 
> >     <device>
> >       <name>my-vgpu</name>
> >       <parent>pci_0000_86_00_0</parent>
> >       <capability type='mdev'>
> >         <type id='11'/>
> >         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >         <group>group1</group>
> >       </capability>
> >     </device>
> > 
> > (should group also be a UUID?)
> >   
> 
> No, this should be a unique number in a system, similar to iommu_group.

Sorry, just trying to catch up on this thread after a long weekend.

We're talking about iommu groups here, we're not creating any sort of
parallel grouping specific to mdev devices.  This is why my example
created a device and then required the user to go find the group number
given to that device in order to create another device within the same
group.  iommu group numbering is not within the user's control and is
not a uuid.  libvirt can refer to the group as anything it wants in the
xml, but the host group number is allocated by the host, not under user
control, and is not persistent.  libvirt would just be giving it a name to
know which devices are part of the same group.  Perhaps the runtime xml
would fill in the group number once created.

There were also a lot of unanswered questions in my proposal, it's not
clear that there's a standard algorithm for when mdev devices need to
be grouped together.  Should we even allow groups to span multiple host
devices?  Should they be allowed to span devices from different
vendors?

If we imagine a scenario of a group composed of a mix of Intel and
NVIDIA vGPUs, what happens when an Intel device is opened first?  The
NVIDIA driver wouldn't know about this, but it would know when the
first NVIDIA device is opened and be able to establish p2p for the
NVIDIA devices at that point.  Can we do what we need with that model?
What if libvirt is asked to hot-add an NVIDIA vGPU?  It would need to
do a create on the NVIDIA parent device with the existing group id, at
which point the NVIDIA vendor driver could fail the device create if
the p2p setup has already been done.  The Intel vendor driver might
allow it.  Similar to open, the last close of the mdev device for a
given vendor (which might not be the last close of mdev devices within
the group) would need to trigger the offline process for that vendor.

That all sounds well and good... here's the kicker: all devices within an
iommu group necessarily need to be part of the same iommu context, i.e. the
vfio container.  How do we deal with vIOMMUs within the guest when we
are intentionally forcing a set of devices within the same context?
This is why it's _very_ beneficial on the host to create iommu groups
with the smallest number of devices we can reasonably trust to be
isolated.  We're backing ourselves into a corner if we tell libvirt
that the standard process is to put all mdev devices into a single
group.  The grouping/startup issue is still unresolved in my head.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-03 16:31                         ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-06 17:54                           ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-06 17:54 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: John Ferlan, Paolo Bonzini, Michal Privoznik, Song, Jike, cjia,
	kvm, libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Sat, 3 Sep 2016 22:01:13 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/3/2016 1:59 AM, John Ferlan wrote:
> > 
> > 
> > On 09/02/2016 02:33 PM, Kirti Wankhede wrote:  
> >>
> >> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:  
> >>>
> >>>
> >>> On 02/09/2016 19:15, Kirti Wankhede wrote:  
> >>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:  
> >>>>>    <device>
> >>>>>      <name>my-vgpu</name>
> >>>>>      <parent>pci_0000_86_00_0</parent>
> >>>>>      <capability type='mdev'>
> >>>>>        <type id='11'/>
> >>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>      </capability>
> >>>>>    </device>
> >>>>>
> >>>>> After creating the vGPU, if required by the host driver, all the other
> >>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.  
> >>>>
> >>>> Thanks Paolo for details.
> >>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
> >>>> file in sysfs to create mdev device. Right?
> >>>> At this moment, does libvirt know which VM this device would be
> >>>> associated with?  
> >>>
> >>> No, the VM will associate to the nodedev through the UUID.  The nodedev
> >>> is created separately from the VM.
> >>>  
> >>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
> >>>>> info, again taken from sysfs:
> >>>>>
> >>>>>    <device>
> >>>>>      <name>my-vgpu</name>
> >>>>>      <parent>pci_0000_86_00_0</parent>
> >>>>>      <capability type='mdev'>
> >>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>        <!-- only the chosen type -->
> >>>>>        <type id='11'>
> >>>>>          <!-- ... snip ... -->
> >>>>>        </type>
> >>>>>        <capability type='pci'>
> >>>>>          <!-- no domain/bus/slot/function of course -->
> >>>>>          <!-- could show whatever PCI IDs are seen by the guest: -->
> >>>>>          <product id='...'>...</product>
> >>>>>          <vendor id='0x10de'>NVIDIA</vendor>
> >>>>>        </capability>
> >>>>>      </capability>
> >>>>>    </device>
> >>>>>
> >>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
> >>>>> pci at all, would have it inside mdev.  This represents the difference
> >>>>> between the mdev provider and the mdev device.  
> >>>>
> >>>> Parent of mdev device might not always be a PCI device. I think we
> >>>> shouldn't consider it as PCI capability.  
> >>>
> >>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
> >>> as a PCI device by VFIO.
> >>>
> >>> The <capability type='pci'> in the physical GPU means that the GPU is a
> >>> PCI device.
> >>>  
> >>
> >> Ok. Got that.
> >>  
> >>>>> Random proposal for the domain XML too:
> >>>>>
> >>>>>   <hostdev mode='subsystem' type='pci'>
> >>>>>     <source type='mdev'>
> >>>>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
> >>>>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>     </source>
> >>>>>     <address type='pci' bus='0' slot='2' function='0'/>
> >>>>>   </hostdev>
> >>>>>  
> >>>>
> >>>> When user wants to assign two mdev devices to one VM, user have to add
> >>>> such two entries or group the two devices in one entry?  
> >>>
> >>> Two entries, one per UUID, each with its own PCI address in the guest.
> >>>  
> >>>> On other mail thread with same subject we are thinking of creating group
> >>>> of mdev devices to assign multiple mdev devices to one VM.  
> >>>
> >>> What is the advantage in managing mdev groups?  (Sorry didn't follow the
> >>> other thread).
> >>>  
> >>
> >> When mdev device is created, resources from physical device is assigned
> >> to this device. But resources are committed only when device goes
> >> 'online' ('start' in v6 patch)
> >> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
> >> for all vGPU devices in a VM are committed at one place. So we need to
> >> know the vGPUs assigned to a VM before QEMU starts.
> >>
> >> Grouping would help here as Alex suggested in that mail. Pulling only
> >> that part of discussion here:
> >>
> >> <Alex> It seems then that the grouping needs to affect the iommu group
> >> so that  
> >>> you know that there's only a single owner for all the mdev devices
> >>> within the group.  IIRC, the bus drivers don't have any visibility
> >>> to opening and releasing of the group itself to trigger the
> >>> online/offline, but they can track opening of the device file
> >>> descriptors within the group.  Within the VFIO API the user cannot
> >>> access the device without the device file descriptor, so a "first
> >>> device opened" and "last device closed" trigger would provide the
> >>> trigger points you need.  Some sort of new sysfs interface would need
> >>> to be invented to allow this sort of manipulation.
> >>> Also we should probably keep sight of whether we feel this is
> >>> sufficiently necessary for the complexity.  If we can get by with only
> >>> doing this grouping at creation time then we could define the "create"
> >>> interface in various ways.  For example:
> >>>
> >>> echo $UUID0 > create
> >>>
> >>> would create a single mdev named $UUID0 in it's own group.
> >>>
> >>> echo {$UUID0,$UUID1} > create
> >>>
> >>> could create mdev devices $UUID0 and $UUID1 grouped together.
> >>>  
> >> </Alex>
> >>
> >> <Kirti>
> >> I think this would create mdev device of same type on same parent
> >> device. We need to consider the case of multiple mdev devices of
> >> different types and with different parents to be grouped together.
> >> </Kirti>
> >>
> >> <Alex> We could even do:  
> >>>
> >>> echo $UUID1:$GROUPA > create
> >>>
> >>> where $GROUPA is the group ID of a previously created mdev device into
> >>> which $UUID1 is to be created and added to the same group.  
> >> </Alex>
> >>
> >> <Kirti>
> >> I was thinking about:
> >>
> >>   echo $UUID0 > create
> >>
> >> would create mdev device
> >>
> >>   echo $UUID0 > /sys/class/mdev/create_group
> >>
> >> would add created device to group.
> >>
> >> For multiple devices case:
> >>   echo $UUID0 > create
> >>   echo $UUID1 > create
> >>
> >> would create mdev devices which could be of different types and
> >> different parents.
> >>   echo $UUID0, $UUID1 > /sys/class/mdev/create_group
> >>
> >> would add devices in a group.
> >> Mdev core module would create a new group with unique number.  On mdev
> >> device 'destroy' that mdev device would be removed from the group. When
> >> there are no devices left in the group, group would be deleted. With
> >> this "first device opened" and "last device closed" trigger can be used
> >> to commit resources.
> >> Then libvirt use mdev device path to pass as argument to QEMU, same as
> >> it does for VFIO. Libvirt don't have to care about group number.
> >> </Kirti>
> >>  
> > 
> > The more complicated one makes this, the more difficult it is for the
> > customer to configure and the more difficult it is and the longer it
> > takes to get something out. I didn't follow the details of groups...
> > 
> > What gets created from a pass through some *mdev/create_group?    
> 
> My proposal here is, on
>   echo $UUID1, $UUID2 > /sys/class/mdev/create_group
> would create a group in mdev core driver, which should be internal to
> mdev core module. In mdev core module, a unique group number would be
> saved in mdev_device structure for each device belonging to a that group.

See my reply to the other thread; the group is an iommu group because
that's the unit of ownership vfio uses.  We're not going to impose an
mdev specific layer of grouping on vfio.  iommu group IDs are allocated
by the iommu-core, we don't get to specify them.  Also note the
complication I've discovered with all devices within a group requiring
the same iommu context, which maps poorly to the multiple device iommu
contexts required to support a guest iommu.  That's certainly not
something we'd want to impose on mdev devices in the general case.
 
> > Does
> > some new udev device get create that then is fed to the guest?  
> 
> No, group is not a device. It will be like a identifier for the use of
> vendor driver to identify devices in a group.
> 
> > Seems
> > painful to make two distinct/async passes through systemd/udev. I
> > foresee testing nightmares with creating 3 vGPU's, processing a group
> > request, while some other process/thread is deleting a vGPU... How do
> > the vGPU's get marked so that the delete cannot happen.
> >   
> 
> How is the same case handled for direct assigned device? I mean a device
> is unbound from its vendors driver, bound to vfio_pci device. How is it
> guaranteed to be assigned to vfio_pci module? some other process/thread
> might unbound it from vfio_pci module?

Yeah, I don't really see the problem here.  Once an mdev device is
bound to the mdev driver and opened by the user, the mdev driver
release callback would be required in order to do the unbind.  If we're
concerned about multiple entities playing in sysfs at the same time
creating and deleting devices and stepping on each other, well, that's
why we're using uuids for the device names, why we'd get group
numbers from the iommu-core so that we have unique devices/groups, and
why we establish the parent-child relationship between mdev device and
parent so we can't have orphan devices.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-06 17:40                         ` Alex Williamson
@ 2016-09-06 19:35                           ` Kirti Wankhede
  2016-09-06 21:28                             ` Alex Williamson
  2016-09-07  6:48                             ` Tian, Kevin
  1 sibling, 1 reply; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-06 19:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi



On 9/6/2016 11:10 PM, Alex Williamson wrote:
> On Sat, 3 Sep 2016 22:04:56 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 20:33, Kirti Wankhede wrote:  
>>>> <Alex> We could even do:  
>>>>>>
>>>>>> echo $UUID1:$GROUPA > create
>>>>>>
>>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>>> which $UUID1 is to be created and added to the same group.  
>>>> </Alex>  
>>>
>>> From the point of view of libvirt, I think I prefer Alex's idea.
>>> <group> could be an additional element in the nodedev-create XML:
>>>
>>>     <device>
>>>       <name>my-vgpu</name>
>>>       <parent>pci_0000_86_00_0</parent>
>>>       <capability type='mdev'>
>>>         <type id='11'/>
>>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>         <group>group1</group>
>>>       </capability>
>>>     </device>
>>>
>>> (should group also be a UUID?)
>>>   
>>
>> No, this should be a unique number in a system, similar to iommu_group.
> 
> Sorry, just trying to catch up on this thread after a long weekend.
> 
> We're talking about iommu groups here, we're not creating any sort of
> parallel grouping specific to mdev devices.

I thought we were talking about group of mdev devices and not iommu
group. IIRC, there were concerns about it (this would be similar to
UUID+instance) and that would (ab)use iommu groups.

I'm thinking about your suggestion, but would also like to know your
thoughts on how the sysfs interface would look.  It's still not clear to me.
Or would it be better to have grouping at the mdev layer?

Kirti.

>  This is why my example
> created a device and then required the user to go find the group number
> given to that device in order to create another device within the same
> group.  iommu group numbering is not within the user's control and is
> not a uuid.  libvirt can refer to the group as anything it wants in the
> xml, but the host group number is allocated by the host, not under user
> control, is not persistent.  libvirt would just be giving it a name to
> know which devices are part of the same group.  Perhaps the runtime xml
> would fill in the group number once created.
> 
> There were also a lot of unanswered questions in my proposal, it's not
> clear that there's a standard algorithm for when mdev devices need to
> be grouped together.  Should we even allow groups to span multiple host
> devices?  Should they be allowed to span devices from different
> vendors?
>
> If we imagine a scenario of a group composed of a mix of Intel and
> NVIDIA vGPUs, what happens when an Intel device is opened first?  The
> NVIDIA driver wouldn't know about this, but it would know when the
> first NVIDIA device is opened and be able to establish p2p for the
> NVIDIA devices at that point.  Can we do what we need with that model?
> What if libvirt is asked to hot-add an NVIDIA vGPU?  It would need to
> do a create on the NVIDIA parent device with the existing group id, at
> which point the NVIDIA vendor driver could fail the device create if
> the p2p setup has already been done.  The Intel vendor driver might
> allow it.  Similar to open, the last close of the mdev device for a
> given vendor (which might not be the last close of mdev devices within
> the group) would need to trigger the offline process for that vendor.
> 
> That all sounds well and good... here's the kicker: iommu groups
> necessarily need to be part of the same iommu context, ie.
> vfio container.  How do we deal with vIOMMUs within the guest when we
> are intentionally forcing a set of devices within the same context?
> This is why it's _very_ beneficial on the host to create iommu groups
> with the smallest number of devices we can reasonably trust to be
> isolated.  We're backing ourselves into a corner if we tell libvirt
> that the standard process is to put all mdev devices into a single
> group.  The grouping/startup issue is still unresolved in my head.
> Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-06 19:35                           ` Kirti Wankhede
@ 2016-09-06 21:28                             ` Alex Williamson
  2016-09-07  8:22                                 ` Tian, Kevin
  2016-09-07 16:15                               ` Kirti Wankhede
  0 siblings, 2 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-06 21:28 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Wed, 7 Sep 2016 01:05:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/6/2016 11:10 PM, Alex Williamson wrote:
> > On Sat, 3 Sep 2016 22:04:56 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:  
> >>>
> >>>
> >>> On 02/09/2016 20:33, Kirti Wankhede wrote:    
> >>>> <Alex> We could even do:    
> >>>>>>
> >>>>>> echo $UUID1:$GROUPA > create
> >>>>>>
> >>>>>> where $GROUPA is the group ID of a previously created mdev device into
> >>>>>> which $UUID1 is to be created and added to the same group.    
> >>>> </Alex>    
> >>>
> >>> From the point of view of libvirt, I think I prefer Alex's idea.
> >>> <group> could be an additional element in the nodedev-create XML:
> >>>
> >>>     <device>
> >>>       <name>my-vgpu</name>
> >>>       <parent>pci_0000_86_00_0</parent>
> >>>       <capability type='mdev'>
> >>>         <type id='11'/>
> >>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>         <group>group1</group>
> >>>       </capability>
> >>>     </device>
> >>>
> >>> (should group also be a UUID?)
> >>>     
> >>
> >> No, this should be a unique number in a system, similar to iommu_group.  
> > 
> > Sorry, just trying to catch up on this thread after a long weekend.
> > 
> > We're talking about iommu groups here, we're not creating any sort of
> > parallel grouping specific to mdev devices.  
> 
> I thought we were talking about group of mdev devices and not iommu
> group. IIRC, there were concerns about it (this would be similar to
> UUID+instance) and that would (ab)use iommu groups.

What constraints does a group, which is not an iommu group, place on the
usage of the mdev devices?  What happens if we put two mdev devices in
the same "mdev group" and then assign them to separate VMs/users?  I
believe that the answer is that this theoretical "mdev group" doesn't
actually impose any constraints on the devices within the group or how
they're used.

vfio knows about iommu groups and we consider an iommu group to be the
unit of ownership for userspace.  Therefore by placing multiple mdev
devices within the same iommu group we can be assured that there's only
one user for that group.  Furthermore, the specific case for this
association on NVIDIA is to couple the hardware peer-to-peer resources
for the individual mdev devices.  Therefore this particular grouping
does imply a lack of isolation between those mdev devices involved in
the group.

For mdev devices which are actually isolated from one another, where
they don't poke these p2p holes, placing them in the same iommu group
is definitely an abuse of the interface and is going to lead to
problems with a single iommu context.  But how does libvirt know that
one type of mdev device needs to be grouped while another type doesn't?

There's really not much that I like about using iommu groups in this
way; it's just that they seem to solve this particular problem of
enforcing how such a group can be used, and imposing a second form of
grouping onto the vfio infrastructure seems much too complex.
 
> I'm thinking about your suggestion, but would also like to know your
> thought how sysfs interface would look like? Its still no clear to me.
> Or will it be better to have grouping at mdev layer?

In previous replies I had proposed that a group could be an additional
argument when we write the mdev UUID to the create entry in sysfs.
This is specifically why I listed only the UUID when creating the first
mdev device and UUID:group when creating the second.  The user would
need to go determine the group ID allocated for the first entry to
specify creating the second within that same group.

I have no love for this proposal; it's functional but not elegant, and it
again leaves libvirt lost in trying to determine which devices need to
be grouped together and which have no business being grouped together.

Let's think through this further and let me make a couple assumptions
to get started:

1) iommu groups are the way that we want to group NVIDIA vGPUs because:
  a) The peer-to-peer resources represent an isolation gap between
     mdev devices, iommu groups represent sets of isolated devices.
  b) The 1:1 mapping of an iommu group to a user matches the NVIDIA
     device model.
  c) iommu_group_for_each_dev() gives the vendor driver the
     functionality it needs to perform a first-open/last-close
     device walk for configuring these p2p resources.

2) iommu groups as used by mdev devices should contain the minimum
number of devices in order to provide the maximum iommu context
flexibility.

Do we agree on these?  The corollary is that NVIDIA is going to suffer
reduced iommu granularity exactly because of the requirement to set up
p2p resources between mdev devices within the same VM.  This has
implications when guest iommus are in play (viommu).

So by default we want an iommu group per mdev.  This works for all mdev
devices as far as we know, including NVIDIA with the constraint that we
only have a single NVIDIA device per VM.

What if we want multiple NVIDIA devices?  We either need to create the
additional devices with a property which will place them into the same
iommu group or allow the iommu groups to be manipulated dynamically.

The trouble I see with the former (creating a device into a group) is
that it becomes part of the "create" syntax, which is global for all
mdev devices.  It's the same functional but inelegant solution I
proposed previously.

What if we allow groups to be manipulated dynamically?  In this case I
envision an attribute under the mdev device with read/write access.
The existence of the attribute indicates to libvirt that this device
requires such handling and allows reading and setting the association.
To be clear, the attribute would only exist on mdev devices requiring
this handling.  I'm always a fan of naming things after what they do, so
rather than making this attribute reference an iommu group, I might
actually call it "peer_to_peer_resource_uuid".  So the process might
look something like this:

# create 2 mdev devices
echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create

# move $UUID1 to the same group as $UUID0
P2P_UUID=$(cat /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
echo $P2P_UUID > \
    /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid

Alternatively we could have used uuidgen to create a UUID then moved
both to the new UUID.

Within the mdev vendor driver this would walk through the mdev devices,
find the matching peer_to_peer_resource_uuid (generated randomly at
create time by default) and add the device to the iommu group for
devices sharing that p2p uuid.  When removed from the VM, libvirt could
simply echo the output of uuidgen to each to split them again.
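
Roughly, with the same caveat that the exact paths here are illustrative:

# put each device back into its own p2p domain (and thus its own iommu
# group) by giving each a freshly generated uuid
echo $(uuidgen) > /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid
echo $(uuidgen) > /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid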

So from a libvirt perspective, special handling would need to be invoked:
when this p2p attribute is found, all devices for a given VM would
need to share the same p2p uuid.  libvirt would be free to use an
existing p2p uuid or generate a new one.  The vendor driver should
enforce a write failure if the device cannot be added to the p2p uuid
(for example devices within the p2p uuid are already opened).

Maybe this is similar to your proposal and even goes back to vm_uuid,
but under the covers the vendor driver needs to be manipulating iommu
grouping based on this parameter, and there's no concept of an "mdev
group" in the base API (nor vm_uuid); this is an extension keyed by the
additional sysfs attribute.

Are we getting closer?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* RE: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-06 17:40                         ` Alex Williamson
@ 2016-09-07  6:48                             ` Tian, Kevin
  2016-09-07  6:48                             ` Tian, Kevin
  1 sibling, 0 replies; 162+ messages in thread
From: Tian, Kevin @ 2016-09-07  6:48 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, qemu-devel, kraxel, Laine Stump, bjsdjshi

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, September 07, 2016 1:41 AM
> 
> On Sat, 3 Sep 2016 22:04:56 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> > >
> > >
> > > On 02/09/2016 20:33, Kirti Wankhede wrote:
> > >> <Alex> We could even do:
> > >>>>
> > >>>> echo $UUID1:$GROUPA > create
> > >>>>
> > >>>> where $GROUPA is the group ID of a previously created mdev device into
> > >>>> which $UUID1 is to be created and added to the same group.
> > >> </Alex>
> > >
> > > From the point of view of libvirt, I think I prefer Alex's idea.
> > > <group> could be an additional element in the nodedev-create XML:
> > >
> > >     <device>
> > >       <name>my-vgpu</name>
> > >       <parent>pci_0000_86_00_0</parent>
> > >       <capability type='mdev'>
> > >         <type id='11'/>
> > >         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> > >         <group>group1</group>
> > >       </capability>
> > >     </device>
> > >
> > > (should group also be a UUID?)
> > >
> >
> > No, this should be a unique number in a system, similar to iommu_group.
> 
> Sorry, just trying to catch up on this thread after a long weekend.
> 
> We're talking about iommu groups here, we're not creating any sort of
> parallel grouping specific to mdev devices.  This is why my example
> created a device and then required the user to go find the group number
> given to that device in order to create another device within the same
> group.  iommu group numbering is not within the user's control and is
> not a uuid.  libvirt can refer to the group as anything it wants in the
> xml, but the host group number is allocated by the host, not under user
> control, is not persistent.  libvirt would just be giving it a name to
> know which devices are part of the same group.  Perhaps the runtime xml
> would fill in the group number once created.
> 
> There were also a lot of unanswered questions in my proposal, it's not
> clear that there's a standard algorithm for when mdev devices need to
> be grouped together.  Should we even allow groups to span multiple host
> devices?  Should they be allowed to span devices from different
> vendors?

I think we should limit the scope of an iommu group for mdev here, so that
it only contains mdevs belonging to the same parent device. Grouping that
spans multiple host devices (regardless of whether they are from different
vendors) is based on physical isolation granularity; better not to mix the
two levels together. I'm not sure whether NVIDIA has a requirement to
start all vGPUs together even when they come from different parent
devices. Hope not...

> 
> If we imagine a scenario of a group composed of a mix of Intel and
> NVIDIA vGPUs, what happens when an Intel device is opened first?  The
> NVIDIA driver wouldn't know about this, but it would know when the
> first NVIDIA device is opened and be able to establish p2p for the
> NVIDIA devices at that point.  Can we do what we need with that model?
> What if libvirt is asked to hot-add an NVIDIA vGPU?  It would need to
> do a create on the NVIDIA parent device with the existing group id, at
> which point the NVIDIA vendor driver could fail the device create if
> the p2p setup has already been done.  The Intel vendor driver might
> allow it.  Similar to open, the last close of the mdev device for a
> given vendor (which might not be the last close of mdev devices within
> the group) would need to trigger the offline process for that vendor.

I assume the iommu group represents the minimal isolation granularity. At
a higher level we have the VFIO container, which could deliver both Intel
vGPUs and NVIDIA vGPUs to the same VM. Intel vGPUs each have their own
iommu group, while NVIDIA vGPUs of the same parent device may
be in one group.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 162+ messages in thread

* RE: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-06 21:28                             ` Alex Williamson
@ 2016-09-07  8:22                                 ` Tian, Kevin
  2016-09-07 16:15                               ` Kirti Wankhede
  1 sibling, 0 replies; 162+ messages in thread
From: Tian, Kevin @ 2016-09-07  8:22 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, qemu-devel, kraxel, Laine Stump, bjsdjshi

> From: Alex Williamson
> Sent: Wednesday, September 07, 2016 5:29 AM
> 
> On Wed, 7 Sep 2016 01:05:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 9/6/2016 11:10 PM, Alex Williamson wrote:
> > > On Sat, 3 Sep 2016 22:04:56 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > >> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> > >>>
> > >>>
> > >>> On 02/09/2016 20:33, Kirti Wankhede wrote:
> > >>>> <Alex> We could even do:
> > >>>>>>
> > >>>>>> echo $UUID1:$GROUPA > create
> > >>>>>>
> > >>>>>> where $GROUPA is the group ID of a previously created mdev device into
> > >>>>>> which $UUID1 is to be created and added to the same group.
> > >>>> </Alex>
> > >>>
> > >>> From the point of view of libvirt, I think I prefer Alex's idea.
> > >>> <group> could be an additional element in the nodedev-create XML:
> > >>>
> > >>>     <device>
> > >>>       <name>my-vgpu</name>
> > >>>       <parent>pci_0000_86_00_0</parent>
> > >>>       <capability type='mdev'>
> > >>>         <type id='11'/>
> > >>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> > >>>         <group>group1</group>
> > >>>       </capability>
> > >>>     </device>
> > >>>
> > >>> (should group also be a UUID?)
> > >>>
> > >>
> > >> No, this should be a unique number in a system, similar to iommu_group.
> > >
> > > Sorry, just trying to catch up on this thread after a long weekend.
> > >
> > > We're talking about iommu groups here, we're not creating any sort of
> > > parallel grouping specific to mdev devices.
> >
> > I thought we were talking about group of mdev devices and not iommu
> > group. IIRC, there were concerns about it (this would be similar to
> > UUID+instance) and that would (ab)use iommu groups.
> 
> What constraints does a group, which is not an iommu group, place on the
> usage of the mdev devices?  What happens if we put two mdev devices in
> the same "mdev group" and then assign them to separate VMs/users?  I
> believe that the answer is that this theoretical "mdev group" doesn't
> actually impose any constraints on the devices within the group or how
> they're used.
> 
> vfio knows about iommu groups and we consider an iommu group to be the
> unit of ownership for userspace.  Therefore by placing multiple mdev
> devices within the same iommu group we can be assured that there's only
> one user for that group.  Furthermore, the specific case for this
> association on NVIDIA is to couple the hardware peer-to-peer resources
> for the individual mdev devices.  Therefore this particular grouping
> does imply a lack of isolation between those mdev devices involved in
> the group.
> 
> For mdev devices which are actually isolated from one another, where
> they don't poke these p2p holes, placing them in the same iommu group
> is definitely an abuse of the interface and is going to lead to
> problems with a single iommu context.  But how does libvirt know that
> one type of mdev device needs to be grouped while another type doesn't?

Can we introduce an attribute under the specific type to indicate such a p2p
requirement, so libvirt knows that additional group action is needed?
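
Hypothetically, something like the below (the attribute name and location
are made up here, just to show what libvirt could probe):

# a value of 1 would mean mdevs of this type need the extra grouping step
cat /sys/devices/mdev/<s:b:d.f>/types/1/requires_p2p_grouping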

> 
> There's really not much that I like about using iommu groups in this
> way, it's just that they seem to solve this particular problem of
> enforcing how such a group can be used and imposing a second form of
> grouping onto the vfio infrastructure seems much too complex.
> 
> > I'm thinking about your suggestion, but would also like to know your
> > thought how sysfs interface would look like? Its still no clear to me.
> > Or will it be better to have grouping at mdev layer?
> 
> In previous replies I had proposed that a group could be an additional
> argument when we write the mdev UUID to the create entry in sysfs.
> This is specifically why I listed only the UUID when creating the first
> mdev device and UUID:group when creating the second.  The user would
> need to go determine the group ID allocated for the first entry to
> specify creating the second within that same group.
> 
> I have no love for this proposal, it's functional but not elegant and
> again leaves libvirt lost in trying to determine which devices need to
> be grouped together and which have no business being grouped together.
> 
> Let's think through this further and let me make a couple assumptions
> to get started:
> 
> 1) iommu groups are the way that we want to group NVIDIA vGPUs because:
>   a) The peer-to-peer resources represent an isolation gap between
>      mdev devices, iommu groups represent sets of isolated devices.
>   b) The 1:1 mapping of an iommu group to a user matches the NVIDIA
>      device model.
>   c) iommu_group_for_each_dev() gives the vendor driver the
>      functionality it needs to perform a first-open/last-close
>      device walk for configuring these p2p resources.
> 
> 2) iommu groups as used by mdev devices should contain the minimum
> number of devices in order to provide the maximum iommu context
> flexibility.
> 
> Do we agree on these?  The corollary is that NVIDIA is going to suffer
> reduced iommu granularity exactly because of the requirement to setup
> p2p resources between mdev devices within the same VM.  This has
> implications when guest iommus are in play (viommu).
> 
> So by default we want an iommu group per mdev.  This works for all mdev
> devices as far as we know, including NVIDIA with the constraint that we
> only have a single NVIDIA device per VM.
> 
> What if we want multiple NVIDIA devices?  We either need to create the
> additional devices with a property which will place them into the same
> iommu group or allow the iommu groups to be manipulated dynamically.
> 
> The trouble I see with the former (creating a device into a group) is
> that it becomes part of the "create" syntax, which is global for all
> mdev devices.  It's the same functional, but non-elegant solution I
> proposed previously.
> 
> What if we allow groups to be manipulated dynamically?  In this case I
> envision an attribute under the mdev device with read/write access.
> The existence of the attribute indicates to libvirt that this device
> requires such handling and allows reading and setting the association.
> To be clear, the attribute would only exist on mdev devices requiring
> this handling.  I'm always a fan of naming things after what they do, so
> rather than making this attribute reference an iommu group, I might
> actually call it "peer_to_peer_resource_uuid".  So the process might
> look something like this:
> 
> # create 2 mdev devices
> echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
> echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create
> 
> # move $UUID1 to the same group as $UUID0
> P2P_UUID=$(cat
> /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
> echo $P2P_UUID > \
> 
> /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid
> 
> Alternatively we could have used uuidgen to create a UUID then moved
> both to the new UUID.
> 
> Within the mdev vendor driver this would walk through the mdev devices,
> find the matching peer_to_peer_resource_uuid (generated randomly at
> create time by default) and add the device to the iommu group for
> devices sharing that p2p uuid.  When removed from the VM, libvirt could
> simply echo the output of uuidgen to each to split them again.

I think it could work. Then the binding of p2p uuid with devices is
asynchronous from mdev_create, which is more flexible to manage.

> 
> So from a libvirt perspective, special handling would need to invoked
> that when this p2p attribute is found, all devices for a given VM would
> need to share the same p2p uuid.  libvirt would be free to use an

if those devices come from two parent devices, do we expect
libvirt to use two p2p uuids here?

> existing p2p uuid or generate a new one.  The vendor driver should
> enforce a write failure if the device cannot be added to the p2p uuid
> (for example devices within the p2p uuid are already opened).
> 
> Maybe this is similar to your proposal and even goes back to vm_uuid,
> but under the covers the vendor driver needs to be manipulating iommu
> grouping based on this parameter and there's no concept of an "mdev
> group" in the base API (nor vm_uuid), this is an extension keyed by the
> additional sysfs attribute.
> 
> Are we getting closer?  Thanks,
> 

Looks so. :-)

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
@ 2016-09-07  8:22                                 ` Tian, Kevin
  0 siblings, 0 replies; 162+ messages in thread
From: Tian, Kevin @ 2016-09-07  8:22 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, qemu-devel, kraxel, Laine Stump, bjsdjshi

> From: Alex Williamson
> Sent: Wednesday, September 07, 2016 5:29 AM
> 
> On Wed, 7 Sep 2016 01:05:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 9/6/2016 11:10 PM, Alex Williamson wrote:
> > > On Sat, 3 Sep 2016 22:04:56 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > >> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> > >>>
> > >>>
> > >>> On 02/09/2016 20:33, Kirti Wankhede wrote:
> > >>>> <Alex> We could even do:
> > >>>>>>
> > >>>>>> echo $UUID1:$GROUPA > create
> > >>>>>>
> > >>>>>> where $GROUPA is the group ID of a previously created mdev device into
> > >>>>>> which $UUID1 is to be created and added to the same group.
> > >>>> </Alex>
> > >>>
> > >>> From the point of view of libvirt, I think I prefer Alex's idea.
> > >>> <group> could be an additional element in the nodedev-create XML:
> > >>>
> > >>>     <device>
> > >>>       <name>my-vgpu</name>
> > >>>       <parent>pci_0000_86_00_0</parent>
> > >>>       <capability type='mdev'>
> > >>>         <type id='11'/>
> > >>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> > >>>         <group>group1</group>
> > >>>       </capability>
> > >>>     </device>
> > >>>
> > >>> (should group also be a UUID?)
> > >>>
> > >>
> > >> No, this should be a unique number in a system, similar to iommu_group.
> > >
> > > Sorry, just trying to catch up on this thread after a long weekend.
> > >
> > > We're talking about iommu groups here, we're not creating any sort of
> > > parallel grouping specific to mdev devices.
> >
> > I thought we were talking about group of mdev devices and not iommu
> > group. IIRC, there were concerns about it (this would be similar to
> > UUID+instance) and that would (ab)use iommu groups.
> 
> What constraints does a group, which is not an iommu group, place on the
> usage of the mdev devices?  What happens if we put two mdev devices in
> the same "mdev group" and then assign them to separate VMs/users?  I
> believe that the answer is that this theoretical "mdev group" doesn't
> actually impose any constraints on the devices within the group or how
> they're used.
> 
> vfio knows about iommu groups and we consider an iommu group to be the
> unit of ownership for userspace.  Therefore by placing multiple mdev
> devices within the same iommu group we can be assured that there's only
> one user for that group.  Furthermore, the specific case for this
> association on NVIDIA is to couple the hardware peer-to-peer resources
> for the individual mdev devices.  Therefore this particular grouping
> does imply a lack of isolation between those mdev devices involved in
> the group.
> 
> For mdev devices which are actually isolated from one another, where
> they don't poke these p2p holes, placing them in the same iommu group
> is definitely an abuse of the interface and is going to lead to
> problems with a single iommu context.  But how does libvirt know that
> one type of mdev device needs to be grouped while another type doesn't?

can we introduce an attribute under specific type to indicate such p2p
requirement so libvirt knows the need of additional group action?

> 
> There's really not much that I like about using iommu groups in this
> way, it's just that they seem to solve this particular problem of
> enforcing how such a group can be used and imposing a second form of
> grouping onto the vfio infrastructure seems much too complex.
> 
> > I'm thinking about your suggestion, but would also like to know your
> > thought how sysfs interface would look like? Its still no clear to me.
> > Or will it be better to have grouping at mdev layer?
> 
> In previous replies I had proposed that a group could be an additional
> argument when we write the mdev UUID to the create entry in sysfs.
> This is specifically why I listed only the UUID when creating the first
> mdev device and UUID:group when creating the second.  The user would
> need to go determine the group ID allocated for the first entry to
> specify creating the second within that same group.
> 
> I have no love for this proposal, it's functional but not elegant and
> again leaves libvirt lost in trying to determine which devices need to
> be grouped together and which have no business being grouped together.
> 
> Let's think through this further and let me make a couple assumptions
> to get started:
> 
> 1) iommu groups are the way that we want to group NVIDIA vGPUs because:
>   a) The peer-to-peer resources represent an isolation gap between
>      mdev devices, iommu groups represent sets of isolated devices.
>   b) The 1:1 mapping of an iommu group to a user matches the NVIDIA
>      device model.
>   c) iommu_group_for_each_dev() gives the vendor driver the
>      functionality it needs to perform a first-open/last-close
>      device walk for configuring these p2p resources.
> 
> 2) iommu groups as used by mdev devices should contain the minimum
> number of devices in order to provide the maximum iommu context
> flexibility.
> 
> Do we agree on these?  The corollary is that NVIDIA is going to suffer
> reduced iommu granularity exactly because of the requirement to setup
> p2p resources between mdev devices within the same VM.  This has
> implications when guest iommus are in play (viommu).
> 
> So by default we want an iommu group per mdev.  This works for all mdev
> devices as far as we know, including NVIDIA with the constraint that we
> only have a single NVIDIA device per VM.
> 
> What if we want multiple NVIDIA devices?  We either need to create the
> additional devices with a property which will place them into the same
> iommu group or allow the iommu groups to be manipulated dynamically.
> 
> The trouble I see with the former (creating a device into a group) is
> that it becomes part of the "create" syntax, which is global for all
> mdev devices.  It's the same functional, but non-elegant solution I
> proposed previously.
> 
> What if we allow groups to be manipulated dynamically?  In this case I
> envision an attribute under the mdev device with read/write access.
> The existence of the attribute indicates to libvirt that this device
> requires such handling and allows reading and setting the association.
> To be clear, the attribute would only exist on mdev devices requiring
> this handling.  I'm always a fan of naming things after what they do, so
> rather than making this attribute reference an iommu group, I might
> actually call it "peer_to_peer_resource_uuid".  So the process might
> look something like this:
> 
> # create 2 mdev devices
> echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
> echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create
> 
> # move $UUID1 to the same group as $UUID0
> P2P_UUID=$(cat
> /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
> echo $P2P_UUID > \
> 
> /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid
> 
> Alternatively we could have used uuidgen to create a UUID then moved
> both to the new UUID.
> 
> Within the mdev vendor driver this would walk through the mdev devices,
> find the matching peer_to_peer_resource_uuid (generated randomly at
> create time by default) and add the device to the iommu group for
> devices sharing that p2p uuid.  When removed from the VM, libvirt could
> simply echo the output of uuidgen to each to split them again.

I think it could work. Then the binding of a p2p uuid to devices is
decoupled from mdev_create, which is more flexible to manage.

> 
> So from a libvirt perspective, special handling would need to invoked
> that when this p2p attribute is found, all devices for a given VM would
> need to share the same p2p uuid.  libvirt would be free to use an

if those devices come from two parent devices, do we expect
libvirt to use two p2p uuids here?

> existing p2p uuid or generate a new one.  The vendor driver should
> enforce a write failure if the device cannot be added to the p2p uuid
> (for example devices within the p2p uuid are already opened).
> 
> Maybe this is similar to your proposal and even goes back to vm_uuid,
> but under the covers the vendor driver needs to be manipulating iommu
> grouping based on this parameter and there's no concept of an "mdev
> group" in the base API (nor vm_uuid), this is an extension keyed by the
> additional sysfs attribute.
> 
> Are we getting closer?  Thanks,
> 

Looks so. :-)

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07  8:22                                 ` Tian, Kevin
  (?)
@ 2016-09-07 16:00                                 ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-07 16:00 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Paolo Bonzini, Michal Privoznik, Song, Jike,
	cjia, kvm, libvir-list, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Wed, 7 Sep 2016 08:22:05 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Wednesday, September 07, 2016 5:29 AM
> > 
> > On Wed, 7 Sep 2016 01:05:11 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > On 9/6/2016 11:10 PM, Alex Williamson wrote:  
> > > > On Sat, 3 Sep 2016 22:04:56 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >  
> > > >> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:  
> > > >>>
> > > >>>
> > > >>> On 02/09/2016 20:33, Kirti Wankhede wrote:  
> > > >>>> <Alex> We could even do:  
> > > >>>>>>
> > > >>>>>> echo $UUID1:$GROUPA > create
> > > >>>>>>
> > > >>>>>> where $GROUPA is the group ID of a previously created mdev device into
> > > >>>>>> which $UUID1 is to be created and added to the same group.  
> > > >>>> </Alex>  
> > > >>>
> > > >>> From the point of view of libvirt, I think I prefer Alex's idea.
> > > >>> <group> could be an additional element in the nodedev-create XML:
> > > >>>
> > > >>>     <device>
> > > >>>       <name>my-vgpu</name>
> > > >>>       <parent>pci_0000_86_00_0</parent>
> > > >>>       <capability type='mdev'>
> > > >>>         <type id='11'/>
> > > >>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> > > >>>         <group>group1</group>
> > > >>>       </capability>
> > > >>>     </device>
> > > >>>
> > > >>> (should group also be a UUID?)
> > > >>>  
> > > >>
> > > >> No, this should be a unique number in a system, similar to iommu_group.  
> > > >
> > > > Sorry, just trying to catch up on this thread after a long weekend.
> > > >
> > > > We're talking about iommu groups here, we're not creating any sort of
> > > > parallel grouping specific to mdev devices.  
> > >
> > > I thought we were talking about group of mdev devices and not iommu
> > > group. IIRC, there were concerns about it (this would be similar to
> > > UUID+instance) and that would (ab)use iommu groups.  
> > 
> > What constraints does a group, which is not an iommu group, place on the
> > usage of the mdev devices?  What happens if we put two mdev devices in
> > the same "mdev group" and then assign them to separate VMs/users?  I
> > believe that the answer is that this theoretical "mdev group" doesn't
> > actually impose any constraints on the devices within the group or how
> > they're used.
> > 
> > vfio knows about iommu groups and we consider an iommu group to be the
> > unit of ownership for userspace.  Therefore by placing multiple mdev
> > devices within the same iommu group we can be assured that there's only
> > one user for that group.  Furthermore, the specific case for this
> > association on NVIDIA is to couple the hardware peer-to-peer resources
> > for the individual mdev devices.  Therefore this particular grouping
> > does imply a lack of isolation between those mdev devices involved in
> > the group.
> > 
> > For mdev devices which are actually isolated from one another, where
> > they don't poke these p2p holes, placing them in the same iommu group
> > is definitely an abuse of the interface and is going to lead to
> > problems with a single iommu context.  But how does libvirt know that
> > one type of mdev device needs to be grouped while another type doesn't?  
> 
> can we introduce an attribute under specific type to indicate such p2p
> requirement so libvirt knows the need of additional group action?

I don't have any objection to that.

> > 
> > There's really not much that I like about using iommu groups in this
> > way, it's just that they seem to solve this particular problem of
> > enforcing how such a group can be used and imposing a second form of
> > grouping onto the vfio infrastructure seems much too complex.
> >   
> > > I'm thinking about your suggestion, but would also like to know your
> > > thought how sysfs interface would look like? Its still no clear to me.
> > > Or will it be better to have grouping at mdev layer?  
> > 
> > In previous replies I had proposed that a group could be an additional
> > argument when we write the mdev UUID to the create entry in sysfs.
> > This is specifically why I listed only the UUID when creating the first
> > mdev device and UUID:group when creating the second.  The user would
> > need to go determine the group ID allocated for the first entry to
> > specify creating the second within that same group.
> > 
> > I have no love for this proposal, it's functional but not elegant and
> > again leaves libvirt lost in trying to determine which devices need to
> > be grouped together and which have no business being grouped together.
> > 
> > Let's think through this further and let me make a couple assumptions
> > to get started:
> > 
> > 1) iommu groups are the way that we want to group NVIDIA vGPUs because:
> >   a) The peer-to-peer resources represent an isolation gap between
> >      mdev devices, iommu groups represent sets of isolated devices.
> >   b) The 1:1 mapping of an iommu group to a user matches the NVIDIA
> >      device model.
> >   c) iommu_group_for_each_dev() gives the vendor driver the
> >      functionality it needs to perform a first-open/last-close
> >      device walk for configuring these p2p resources.
> > 
> > 2) iommu groups as used by mdev devices should contain the minimum
> > number of devices in order to provide the maximum iommu context
> > flexibility.
> > 
> > Do we agree on these?  The corollary is that NVIDIA is going to suffer
> > reduced iommu granularity exactly because of the requirement to setup
> > p2p resources between mdev devices within the same VM.  This has
> > implications when guest iommus are in play (viommu).
> > 
> > So by default we want an iommu group per mdev.  This works for all mdev
> > devices as far as we know, including NVIDIA with the constraint that we
> > only have a single NVIDIA device per VM.
> > 
> > What if we want multiple NVIDIA devices?  We either need to create the
> > additional devices with a property which will place them into the same
> > iommu group or allow the iommu groups to be manipulated dynamically.
> > 
> > The trouble I see with the former (creating a device into a group) is
> > that it becomes part of the "create" syntax, which is global for all
> > mdev devices.  It's the same functional, but non-elegant solution I
> > proposed previously.
> > 
> > What if we allow groups to be manipulated dynamically?  In this case I
> > envision an attribute under the mdev device with read/write access.
> > The existence of the attribute indicates to libvirt that this device
> > requires such handling and allows reading and setting the association.
> > To be clear, the attribute would only exist on mdev devices requiring
> > this handling.  I'm always a fan of naming things after what they do, so
> > rather than making this attribute reference an iommu group, I might
> > actually call it "peer_to_peer_resource_uuid".  So the process might
> > look something like this:
> > 
> > # create 2 mdev devices
> > echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
> > echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create
> > 
> > # move $UUID1 to the same group as $UUID0
> > P2P_UUID=$(cat
> > /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
> > echo $P2P_UUID > \
> > 
> > /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid
> > 
> > Alternatively we could have used uuidgen to create a UUID then moved
> > both to the new UUID.
> > 
> > Within the mdev vendor driver this would walk through the mdev devices,
> > find the matching peer_to_peer_resource_uuid (generated randomly at
> > create time by default) and add the device to the iommu group for
> > devices sharing that p2p uuid.  When removed from the VM, libvirt could
> > simply echo the output of uuidgen to each to split them again.  
> 
> I think it could work. Then the binding of p2p uuid with devices is
> asynchronous from mdev_create, which is more flexible to manage.
> 
> > 
> > So from a libvirt perspective, special handling would need to invoked
> > that when this p2p attribute is found, all devices for a given VM would
> > need to share the same p2p uuid.  libvirt would be free to use an  
> 
> if those devices come from two parent devices, do we expect
> libvirt to use two p2p uuids here?

I expect so.  AIUI, NVIDIA wants to start all the devices together,
which implies that a p2p uuid group would span parent devices.  If
there are not actually any p2p resources shared between parent devices
it would be more optimal to create a p2p uuid group for each parent,
thus limiting the size of the iommu group, but that might interfere with
internals of the NVIDIA userspace manager.  It's a bit more 'abuse'
rather than 'use' of iommu groups if there aren't actually any p2p
resources.  Whether or not there's some optimization in having mdev
devices on the same parent is going to be something that libvirt, or at
least an advanced user if we can't do it programmatically, is going to
want to know. Thanks,

Alex

> > existing p2p uuid or generate a new one.  The vendor driver should
> > enforce a write failure if the device cannot be added to the p2p uuid
> > (for example devices within the p2p uuid are already opened).
> > 
> > Maybe this is similar to your proposal and even goes back to vm_uuid,
> > but under the covers the vendor driver needs to be manipulating iommu
> > grouping based on this parameter and there's no concept of an "mdev
> > group" in the base API (nor vm_uuid), this is an extension keyed by the
> > additional sysfs attribute.
> > 
> > Are we getting closer?  Thanks,
> >   
> 
> Looks so. :-)
> 
> Thanks,
> Kevin


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-06 21:28                             ` Alex Williamson
  2016-09-07  8:22                                 ` Tian, Kevin
@ 2016-09-07 16:15                               ` Kirti Wankhede
  2016-09-07 16:44                                 ` Alex Williamson
  1 sibling, 1 reply; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-07 16:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi



On 9/7/2016 2:58 AM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 01:05:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/6/2016 11:10 PM, Alex Williamson wrote:
>>> On Sat, 3 Sep 2016 22:04:56 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:  
>>>>>
>>>>>
>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:    
>>>>>> <Alex> We could even do:    
>>>>>>>>
>>>>>>>> echo $UUID1:$GROUPA > create
>>>>>>>>
>>>>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>>>>> which $UUID1 is to be created and added to the same group.    
>>>>>> </Alex>    
>>>>>
>>>>> From the point of view of libvirt, I think I prefer Alex's idea.
>>>>> <group> could be an additional element in the nodedev-create XML:
>>>>>
>>>>>     <device>
>>>>>       <name>my-vgpu</name>
>>>>>       <parent>pci_0000_86_00_0</parent>
>>>>>       <capability type='mdev'>
>>>>>         <type id='11'/>
>>>>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>         <group>group1</group>
>>>>>       </capability>
>>>>>     </device>
>>>>>
>>>>> (should group also be a UUID?)
>>>>>     
>>>>
>>>> No, this should be a unique number in a system, similar to iommu_group.  
>>>
>>> Sorry, just trying to catch up on this thread after a long weekend.
>>>
>>> We're talking about iommu groups here, we're not creating any sort of
>>> parallel grouping specific to mdev devices.  
>>
>> I thought we were talking about group of mdev devices and not iommu
>> group. IIRC, there were concerns about it (this would be similar to
>> UUID+instance) and that would (ab)use iommu groups.
> 
> What constraints does a group, which is not an iommu group, place on the
> usage of the mdev devices?  What happens if we put two mdev devices in
> the same "mdev group" and then assign them to separate VMs/users?  I
> believe that the answer is that this theoretical "mdev group" doesn't
> actually impose any constraints on the devices within the group or how
> they're used.
> 

We feel it's not a good idea to try to associate devices' iommu groups
with mdev device groups. That adds more complications.

As in the nodedev-create XML above, 'group1' could be a unique number that
can be generated by libvirt. Then, to create an mdev device:

  echo $UUID1:group1 > create

If the user wants to add more mdev devices to the same group, he/she should
use the same group number in the next nodedev-create devices. So the create
commands would be:
  echo $UUID2:group1 > create
  echo $UUID3:group1 > create

Each mdev device would store this group number in its mdev_device
structure.

With this, we would add open() and close() callbacks from the vfio_mdev
module for the vendor driver to commit resources. Then we don't need the
'start'/'stop' or online/offline interface.

To commit resources for all devices associated with a domain/user-space
application, the vendor driver can commit them on the 'first open()' and
free them on the 'last close()'. Or, if the vendor driver wants to commit
resources for each device separately, it can do so in each device's open()
call. How to implement this is up to the vendor driver.

Libvirt doesn't have to do anything with the assigned group numbers while
managing mdev devices.

The QEMU command-line parameters would be the same as earlier (the group
number doesn't have to be mentioned here):

  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2

If two mdev devices from the same group are assigned to different domains,
we can fail the open() call of the second device. How would the driver know
that those are being used by different domains? By checking the <group1, pid>
recorded for the first device of 'group1'. The two devices in the same group
should have the same pid in their open() calls.
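
For illustration only, here is a minimal sketch of what the vendor driver's
open() could do with this group number. All names below (my_mdev_group,
my_group_lookup(), my_commit_group_resources()) are hypothetical and not
part of this series:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/sched.h>

/* Hypothetical per-group bookkeeping inside a vendor driver. */
struct my_mdev_group {
	struct list_head next;
	int group_nr;			/* group number given at create time */
	pid_t owner;			/* pid seen on first open(), 0 if unused */
	unsigned int open_count;	/* open devices of this group */
	struct mutex lock;
};

/* Hypothetical helpers provided elsewhere in the vendor driver. */
static struct my_mdev_group *my_group_lookup(struct mdev_device *mdev);
static int my_commit_group_resources(struct my_mdev_group *grp);

static int my_vendor_open(struct mdev_device *mdev)
{
	struct my_mdev_group *grp = my_group_lookup(mdev);
	int ret = 0;

	mutex_lock(&grp->lock);
	if (!grp->open_count) {
		/* "first open()": commit resources for the whole group */
		ret = my_commit_group_resources(grp);
		if (!ret)
			grp->owner = task_pid_nr(current);
	} else if (grp->owner != task_pid_nr(current)) {
		/* same group already opened from a different domain */
		ret = -EBUSY;
	}
	if (!ret)
		grp->open_count++;
	mutex_unlock(&grp->lock);
	return ret;
}

The matching release() would decrement open_count and, on the 'last
close()', free the group's resources and clear the recorded pid.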

To hot-plug an mdev device into a domain that already has an mdev device
assigned, the new mdev device should be created with the same group number
as the existing devices and then hot-plugged. If there is no mdev device in
that domain, the group number should be a new, unique number.

This simplifies the mdev grouping and also provides flexibility for the
vendor driver implementation.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07 16:15                               ` Kirti Wankhede
@ 2016-09-07 16:44                                 ` Alex Williamson
  2016-09-07 18:06                                   ` Kirti Wankhede
  2016-09-07 18:17                                   ` Neo Jia
  0 siblings, 2 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-07 16:44 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Wed, 7 Sep 2016 21:45:31 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/7/2016 2:58 AM, Alex Williamson wrote:
> > On Wed, 7 Sep 2016 01:05:11 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/6/2016 11:10 PM, Alex Williamson wrote:  
> >>> On Sat, 3 Sep 2016 22:04:56 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:    
> >>>>>
> >>>>>
> >>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:      
> >>>>>> <Alex> We could even do:      
> >>>>>>>>
> >>>>>>>> echo $UUID1:$GROUPA > create
> >>>>>>>>
> >>>>>>>> where $GROUPA is the group ID of a previously created mdev device into
> >>>>>>>> which $UUID1 is to be created and added to the same group.      
> >>>>>> </Alex>      
> >>>>>
> >>>>> From the point of view of libvirt, I think I prefer Alex's idea.
> >>>>> <group> could be an additional element in the nodedev-create XML:
> >>>>>
> >>>>>     <device>
> >>>>>       <name>my-vgpu</name>
> >>>>>       <parent>pci_0000_86_00_0</parent>
> >>>>>       <capability type='mdev'>
> >>>>>         <type id='11'/>
> >>>>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>         <group>group1</group>
> >>>>>       </capability>
> >>>>>     </device>
> >>>>>
> >>>>> (should group also be a UUID?)
> >>>>>       
> >>>>
> >>>> No, this should be a unique number in a system, similar to iommu_group.    
> >>>
> >>> Sorry, just trying to catch up on this thread after a long weekend.
> >>>
> >>> We're talking about iommu groups here, we're not creating any sort of
> >>> parallel grouping specific to mdev devices.    
> >>
> >> I thought we were talking about group of mdev devices and not iommu
> >> group. IIRC, there were concerns about it (this would be similar to
> >> UUID+instance) and that would (ab)use iommu groups.  
> > 
> > What constraints does a group, which is not an iommu group, place on the
> > usage of the mdev devices?  What happens if we put two mdev devices in
> > the same "mdev group" and then assign them to separate VMs/users?  I
> > believe that the answer is that this theoretical "mdev group" doesn't
> > actually impose any constraints on the devices within the group or how
> > they're used.
> >   
> 
> We feel its not a good idea to try to associate device's iommu groups
> with mdev device groups. That adds more complications.
> 
> As in above nodedev-create xml, 'group1' could be a unique number that
> can be generated by libvirt. Then to create mdev device:
> 
>   echo $UUID1:group1 > create
> 
> If user want to add more mdev devices to same group, he/she should use
> same group number in next nodedev-create devices. So create commands
> would be:
>   echo $UUID2:group1 > create
>   echo $UUID3:group1 > create

So groups return to being static, libvirt would need to destroy and
create mdev devices specifically for use within the predefined group?
This imposes limitations on how mdev devices can be used (ie. the mdev
pool option is once again removed).  We're also back to imposing
grouping semantics on mdev devices that may not need them.  Do all mdev
devices for a given user need to be put into the same group?  Do groups
span parent devices?  Do they span different vendor drivers?

> Each mdev device would store this group number in its mdev_device
> structure.
> 
> With this, we would add open() and close() callbacks from vfio_mdev
> module for vendor driver to commit resources. Then we don't need
> 'start'/'stop' or online/offline interface.
> 
> To commit resources for all devices associated to that domain/user space
> application, vendor driver can use 'first open()' and 'last close()' to
> free those. Or if vendor driver want to commit resources for each device
> separately, they can do in each device's open() call. It will depend on
> vendor driver how they want to implement.
> 
> Libvirt don't have to do anything about assigned group numbers while
> managing mdev devices.
> 
> QEMU commandline parameter would be same as earlier (don't have to
> mention group number here):
> 
>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
> 
> In case if two mdev devices from same groups are assigned to different
> domains, we can fail open() call of second device. How would driver know
> that those are being used by different domain? By checking <group1, pid>
> of first device of 'group1'. The two devices in same group should have
> same pid in their open() call.

Are you assuming that the two devices are owned by the same vendor
driver?  What if I put NVIDIA and Intel vGPUs both into the same group
and give each of them to a separate VM?  How would the NVIDIA host
driver know which <group, pid> the Intel device got?  This is what the
iommu groups do that a different layer of grouping cannot do.  Maybe
you're suggesting a group per vendor driver, but how does libvirt know
the vendor driver?  Do they need to go research the parent device in
sysfs and compare driver links?
 
> To hot-plug mdev device to a domain in which there is already a mdev
> device assigned, mdev device should be created with same group number as
> the existing devices are and then hot-plug it. If there is no mdev
> device in that domain, then group number should be a unique number.
> 
> This simplifies the mdev grouping and also provide flexibility for
> vendor driver implementation.

The 'start' operation for NVIDIA mdev devices allocates peer-to-peer
resources between mdev devices.  Does this not represent some degree of
an isolation hole between those devices?  Will peer-to-peer DMA between
devices honor the guest IOVA when mdev devices are placed into separate
address spaces, such as possible with vIOMMU?

I don't particularly like the iommu group solution either, which is why
in my latest proposal I've given the vendor driver a way to indicate
this grouping is required so more flexible mdev devices aren't
restricted by this.  But the limited knowledge I have of the hardware
configuration which imposes this restriction on NVIDIA devices seems to
suggest that iommu grouping of these sets is appropriate.  The vfio-core
infrastructure is almost entirely built for managing vfio groups, which
are just a direct mapping of iommu groups.  So the complexity of iommu
groups is already handled.  Adding a new layer of grouping into mdev
seems like it's increasing the complexity further, not decreasing it.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07 16:44                                 ` Alex Williamson
@ 2016-09-07 18:06                                   ` Kirti Wankhede
  2016-09-07 22:13                                     ` Alex Williamson
  2016-09-07 18:17                                   ` Neo Jia
  1 sibling, 1 reply; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-07 18:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi



On 9/7/2016 10:14 PM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 21:45:31 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/7/2016 2:58 AM, Alex Williamson wrote:
>>> On Wed, 7 Sep 2016 01:05:11 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 9/6/2016 11:10 PM, Alex Williamson wrote:  
>>>>> On Sat, 3 Sep 2016 22:04:56 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:    
>>>>>>>
>>>>>>>
>>>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:      
>>>>>>>> <Alex> We could even do:      
>>>>>>>>>>
>>>>>>>>>> echo $UUID1:$GROUPA > create
>>>>>>>>>>
>>>>>>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>>>>>>> which $UUID1 is to be created and added to the same group.      
>>>>>>>> </Alex>      
>>>>>>>
>>>>>>> From the point of view of libvirt, I think I prefer Alex's idea.
>>>>>>> <group> could be an additional element in the nodedev-create XML:
>>>>>>>
>>>>>>>     <device>
>>>>>>>       <name>my-vgpu</name>
>>>>>>>       <parent>pci_0000_86_00_0</parent>
>>>>>>>       <capability type='mdev'>
>>>>>>>         <type id='11'/>
>>>>>>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>>>         <group>group1</group>
>>>>>>>       </capability>
>>>>>>>     </device>
>>>>>>>
>>>>>>> (should group also be a UUID?)
>>>>>>>       
>>>>>>
>>>>>> No, this should be a unique number in a system, similar to iommu_group.    
>>>>>
>>>>> Sorry, just trying to catch up on this thread after a long weekend.
>>>>>
>>>>> We're talking about iommu groups here, we're not creating any sort of
>>>>> parallel grouping specific to mdev devices.    
>>>>
>>>> I thought we were talking about group of mdev devices and not iommu
>>>> group. IIRC, there were concerns about it (this would be similar to
>>>> UUID+instance) and that would (ab)use iommu groups.  
>>>
>>> What constraints does a group, which is not an iommu group, place on the
>>> usage of the mdev devices?  What happens if we put two mdev devices in
>>> the same "mdev group" and then assign them to separate VMs/users?  I
>>> believe that the answer is that this theoretical "mdev group" doesn't
>>> actually impose any constraints on the devices within the group or how
>>> they're used.
>>>   
>>
>> We feel its not a good idea to try to associate device's iommu groups
>> with mdev device groups. That adds more complications.
>>
>> As in above nodedev-create xml, 'group1' could be a unique number that
>> can be generated by libvirt. Then to create mdev device:
>>
>>   echo $UUID1:group1 > create
>>
>> If user want to add more mdev devices to same group, he/she should use
>> same group number in next nodedev-create devices. So create commands
>> would be:
>>   echo $UUID2:group1 > create
>>   echo $UUID3:group1 > create
> 
> So groups return to being static, libvirt would need to destroy and
> create mdev devices specifically for use within the predefined group?

Yes.

> This imposes limitations on how mdev devices can be used (ie. the mdev
> pool option is once again removed).  We're also back to imposing
> grouping semantics on mdev devices that may not need them.  Do all mdev
> devices for a given user need to be put into the same group?  
	
Yes.

> Do groups
> span parent devices?  Do they span different vendor drivers?
> 

Yes and yes. The group number would be associated with the mdev device
irrespective of its parent.


>> Each mdev device would store this group number in its mdev_device
>> structure.
>>
>> With this, we would add open() and close() callbacks from vfio_mdev
>> module for vendor driver to commit resources. Then we don't need
>> 'start'/'stop' or online/offline interface.
>>
>> To commit resources for all devices associated to that domain/user space
>> application, vendor driver can use 'first open()' and 'last close()' to
>> free those. Or if vendor driver want to commit resources for each device
>> separately, they can do in each device's open() call. It will depend on
>> vendor driver how they want to implement.
>>
>> Libvirt don't have to do anything about assigned group numbers while
>> managing mdev devices.
>>
>> QEMU commandline parameter would be same as earlier (don't have to
>> mention group number here):
>>
>>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
>>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
>>
>> In case if two mdev devices from same groups are assigned to different
>> domains, we can fail open() call of second device. How would driver know
>> that those are being used by different domain? By checking <group1, pid>
>> of first device of 'group1'. The two devices in same group should have
>> same pid in their open() call.
> 
> Are you assuming that the two devices are owned by the same vendor
> driver?

No. See my reply to next questions below.

>  What if I put NVIDIA and Intel vGPUs both into the same group
> and give each of them to a separate VM?

It depends on where we put the logic to verify the pid in the open() call of
each device in the group.
If we place the logic of checking <group, pid> for devices in a group in the
vendor driver, then in the above case both VMs would boot.
But if we impose this logic in the mdev core or vfio_mdev module, then
open() on the second device should fail.

>  How would the NVIDIA host
> driver know which <group, pid> the Intel device got?

How to use the group number to commit resources for devices owned by a
vendor would be the vendor driver's responsibility. The NVIDIA driver
doesn't need to know about Intel's vGPU, nor does the Intel driver need to
know about NVIDIA's vGPU.

>  This is what the
> iommu groups do that a different layer of grouping cannot do.  Maybe
> you're suggesting a group per vendor driver, but how does libvirt know
> the vendor driver?  Do they need to go research the parent device in
> sysfs and compare driver links?
>  

No, the group is not associated with the vendor driver. The group number is
associated with the mdev device.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07 16:44                                 ` Alex Williamson
  2016-09-07 18:06                                   ` Kirti Wankhede
@ 2016-09-07 18:17                                   ` Neo Jia
  2016-09-07 18:27                                     ` Daniel P. Berrange
  1 sibling, 1 reply; 162+ messages in thread
From: Neo Jia @ 2016-09-07 18:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, Paolo Bonzini, Michal Privoznik, Song, Jike, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
> On Wed, 7 Sep 2016 21:45:31 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > To hot-plug mdev device to a domain in which there is already a mdev
> > device assigned, mdev device should be created with same group number as
> > the existing devices are and then hot-plug it. If there is no mdev
> > device in that domain, then group number should be a unique number.
> > 
> > This simplifies the mdev grouping and also provide flexibility for
> > vendor driver implementation.
> 
> The 'start' operation for NVIDIA mdev devices allocate peer-to-peer
> resources between mdev devices.  Does this not represent some degree of
> an isolation hole between those devices?  Will peer-to-peer DMA between
> devices honor the guest IOVA when mdev devices are placed into separate
> address spaces, such as possible with vIOMMU?

Hi Alex,

In reality, the p2p operation will only work under the same translation domain.

As we are discussing the multiple-mdev-per-VM use cases, I think we probably
should not limit it just to the p2p operation.

So, in general, the NVIDIA vGPU device model's requirement is to know/register
all mdevs per VM before opening any of those mdev devices.

> 
> I don't particularly like the iommu group solution either, which is why
> in my latest proposal I've given the vendor driver a way to indicate
> this grouping is required so more flexible mdev devices aren't
> restricted by this.  But the limited knowledge I have of the hardware
> configuration which imposes this restriction on NVIDIA devices seems to
> suggest that iommu grouping of these sets is appropriate.  The vfio-core
> infrastructure is almost entirely built for managing vfio group, which
> are just a direct mapping of iommu groups.  So the complexity of iommu
> groups is already handled.  Adding a new layer of grouping into mdev
> seems like it's increasing the complexity further, not decreasing it.

I really appreciate your thoughts on this issue, and your consideration of how
the NVIDIA vGPU device model works, but so far I still feel we are borrowing a
very meaningful concept, the "iommu group", to solve a device model issue which
I actually hope can be worked around by a more independent piece of logic, and
that is why Kirti is proposing the "mdev group".

Let's see if we can address your concerns / questions in Kirti's reply.

Thanks,
Neo

> Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07 18:17                                   ` Neo Jia
@ 2016-09-07 18:27                                     ` Daniel P. Berrange
  2016-09-07 18:32                                       ` Neo Jia
  0 siblings, 1 reply; 162+ messages in thread
From: Daniel P. Berrange @ 2016-09-07 18:27 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Song, Jike, kvm, libvir-list, Michal Privoznik,
	Tian, Kevin, qemu-devel, Kirti Wankhede, kraxel, Laine Stump,
	Paolo Bonzini, bjsdjshi

On Wed, Sep 07, 2016 at 11:17:39AM -0700, Neo Jia wrote:
> On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
> > On Wed, 7 Sep 2016 21:45:31 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> > > To hot-plug mdev device to a domain in which there is already a mdev
> > > device assigned, mdev device should be created with same group number as
> > > the existing devices are and then hot-plug it. If there is no mdev
> > > device in that domain, then group number should be a unique number.
> > > 
> > > This simplifies the mdev grouping and also provide flexibility for
> > > vendor driver implementation.
> > 
> > The 'start' operation for NVIDIA mdev devices allocate peer-to-peer
> > resources between mdev devices.  Does this not represent some degree of
> > an isolation hole between those devices?  Will peer-to-peer DMA between
> > devices honor the guest IOVA when mdev devices are placed into separate
> > address spaces, such as possible with vIOMMU?
> 
> Hi Alex,
> 
> In reality, the p2p operation will only work under same translation domain.
> 
> As we are discussing the multiple mdev per VM use cases, I think we probably
> should not just limit it for p2p operation.
> 
> So, in general, the NVIDIA vGPU device model's requirement is to know/register 
> all mdevs per VM before opening any those mdev devices.

It concerns me that if we bake this rule into the sysfs interface,
then it feels like we're making life very hard for future support
for hotplug / unplug of mdevs to running VMs.

Conversely, if we can solve the hotplug/unplug problem, then we
potentially would not need this grouping concept.

I'd hate us to do all this complex work to group multiple mdevs per
VM only to throw it away later when hotplug support is made to work.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07 18:27                                     ` Daniel P. Berrange
@ 2016-09-07 18:32                                       ` Neo Jia
  0 siblings, 0 replies; 162+ messages in thread
From: Neo Jia @ 2016-09-07 18:32 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Alex Williamson, Song, Jike, kvm, libvir-list, Michal Privoznik,
	Tian, Kevin, qemu-devel, Kirti Wankhede, kraxel, Laine Stump,
	Paolo Bonzini, bjsdjshi

On Wed, Sep 07, 2016 at 07:27:19PM +0100, Daniel P. Berrange wrote:
> On Wed, Sep 07, 2016 at 11:17:39AM -0700, Neo Jia wrote:
> > On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
> > > On Wed, 7 Sep 2016 21:45:31 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > 
> > > > To hot-plug mdev device to a domain in which there is already a mdev
> > > > device assigned, mdev device should be created with same group number as
> > > > the existing devices are and then hot-plug it. If there is no mdev
> > > > device in that domain, then group number should be a unique number.
> > > > 
> > > > This simplifies the mdev grouping and also provide flexibility for
> > > > vendor driver implementation.
> > > 
> > > The 'start' operation for NVIDIA mdev devices allocate peer-to-peer
> > > resources between mdev devices.  Does this not represent some degree of
> > > an isolation hole between those devices?  Will peer-to-peer DMA between
> > > devices honor the guest IOVA when mdev devices are placed into separate
> > > address spaces, such as possible with vIOMMU?
> > 
> > Hi Alex,
> > 
> > In reality, the p2p operation will only work under same translation domain.
> > 
> > As we are discussing the multiple mdev per VM use cases, I think we probably
> > should not just limit it for p2p operation.
> > 
> > So, in general, the NVIDIA vGPU device model's requirement is to know/register 
> > all mdevs per VM before opening any those mdev devices.
> 
> It concerns me that if we bake this rule into the sysfs interface,
> then it feels like we're making life very hard for future support
> for hotplug / unplug of mdevs to running VMs.

Hi Daniel,

I don't think the grouping will stop anybody from supporting hotplug / unplug,
at least from a syntax point of view.

> 
> Conversely, if we can solve the hotplug/unplug problem, then we
> potentially would not need this grouping concept.

I think Kirti has also mentioned hotplug support in her proposal; do you
mind commenting on that thread so I can check whether I have missed anything?

Thanks,
Neo

> 
> I'd hate us to do all this complex work to group multiple mdevs per
> VM only to throw it away later when we hotplug support is made to
> work.
> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07 18:06                                   ` Kirti Wankhede
@ 2016-09-07 22:13                                     ` Alex Williamson
  2016-09-08 18:48                                       ` Kirti Wankhede
  0 siblings, 1 reply; 162+ messages in thread
From: Alex Williamson @ 2016-09-07 22:13 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Wed, 7 Sep 2016 23:36:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/7/2016 10:14 PM, Alex Williamson wrote:
> > On Wed, 7 Sep 2016 21:45:31 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/7/2016 2:58 AM, Alex Williamson wrote:  
> >>> On Wed, 7 Sep 2016 01:05:11 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 9/6/2016 11:10 PM, Alex Williamson wrote:    
> >>>>> On Sat, 3 Sep 2016 22:04:56 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:      
> >>>>>>>
> >>>>>>>
> >>>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:        
> >>>>>>>> <Alex> We could even do:        
> >>>>>>>>>>
> >>>>>>>>>> echo $UUID1:$GROUPA > create
> >>>>>>>>>>
> >>>>>>>>>> where $GROUPA is the group ID of a previously created mdev device into
> >>>>>>>>>> which $UUID1 is to be created and added to the same group.        
> >>>>>>>> </Alex>        
> >>>>>>>
> >>>>>>> From the point of view of libvirt, I think I prefer Alex's idea.
> >>>>>>> <group> could be an additional element in the nodedev-create XML:
> >>>>>>>
> >>>>>>>     <device>
> >>>>>>>       <name>my-vgpu</name>
> >>>>>>>       <parent>pci_0000_86_00_0</parent>
> >>>>>>>       <capability type='mdev'>
> >>>>>>>         <type id='11'/>
> >>>>>>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>>>         <group>group1</group>
> >>>>>>>       </capability>
> >>>>>>>     </device>
> >>>>>>>
> >>>>>>> (should group also be a UUID?)
> >>>>>>>         
> >>>>>>
> >>>>>> No, this should be a unique number in a system, similar to iommu_group.      
> >>>>>
> >>>>> Sorry, just trying to catch up on this thread after a long weekend.
> >>>>>
> >>>>> We're talking about iommu groups here, we're not creating any sort of
> >>>>> parallel grouping specific to mdev devices.      
> >>>>
> >>>> I thought we were talking about group of mdev devices and not iommu
> >>>> group. IIRC, there were concerns about it (this would be similar to
> >>>> UUID+instance) and that would (ab)use iommu groups.    
> >>>
> >>> What constraints does a group, which is not an iommu group, place on the
> >>> usage of the mdev devices?  What happens if we put two mdev devices in
> >>> the same "mdev group" and then assign them to separate VMs/users?  I
> >>> believe that the answer is that this theoretical "mdev group" doesn't
> >>> actually impose any constraints on the devices within the group or how
> >>> they're used.
> >>>     
> >>
> >> We feel its not a good idea to try to associate device's iommu groups
> >> with mdev device groups. That adds more complications.
> >>
> >> As in above nodedev-create xml, 'group1' could be a unique number that
> >> can be generated by libvirt. Then to create mdev device:
> >>
> >>   echo $UUID1:group1 > create
> >>
> >> If user want to add more mdev devices to same group, he/she should use
> >> same group number in next nodedev-create devices. So create commands
> >> would be:
> >>   echo $UUID2:group1 > create
> >>   echo $UUID3:group1 > create  
> > 
> > So groups return to being static, libvirt would need to destroy and
> > create mdev devices specifically for use within the predefined group?  
> 
> Yes.
> 
> > This imposes limitations on how mdev devices can be used (ie. the mdev
> > pool option is once again removed).  We're also back to imposing
> > grouping semantics on mdev devices that may not need them.  Do all mdev
> > devices for a given user need to be put into the same group?    
> 	
> Yes.
> 
> > Do groups
> > span parent devices?  Do they span different vendor drivers?
> >   
> 
> Yes and yes. Group number would be associated with mdev device
> irrespective of its parent.
> 
> 
> >> Each mdev device would store this group number in its mdev_device
> >> structure.
> >>
> >> With this, we would add open() and close() callbacks from vfio_mdev
> >> module for vendor driver to commit resources. Then we don't need
> >> 'start'/'stop' or online/offline interface.
> >>
> >> To commit resources for all devices associated to that domain/user space
> >> application, vendor driver can use 'first open()' and 'last close()' to
> >> free those. Or if vendor driver want to commit resources for each device
> >> separately, they can do in each device's open() call. It will depend on
> >> vendor driver how they want to implement.
> >>
> >> Libvirt don't have to do anything about assigned group numbers while
> >> managing mdev devices.
> >>
> >> QEMU commandline parameter would be same as earlier (don't have to
> >> mention group number here):
> >>
> >>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
> >>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
> >>
> >> In case if two mdev devices from same groups are assigned to different
> >> domains, we can fail open() call of second device. How would driver know
> >> that those are being used by different domain? By checking <group1, pid>
> >> of first device of 'group1'. The two devices in same group should have
> >> same pid in their open() call.  
> > 
> > Are you assuming that the two devices are owned by the same vendor
> > driver?  
> 
> No. See my reply to next questions below.
> 
> >  What if I put NVIDIA and Intel vGPUs both into the same group
> > and give each of them to a separate VM?  
> 
> It depends on where we put the logic to verify pid in open() call of
> each devices in group.
> If we place the logic of checking <group, pid> for devices in a group in
> vendor driver, then in above case both VMs would boot.
> But If we impose this logic in mdev core or vfio_mdev module, then
> open() on second device should fail.

So you're proposing that the mdev layer keeps a list of mdev-groups and
wraps the vfio_device_ops.{open,release} entry points to record or
verify the user on each open, keep tallies of the open devices, and
clear that association on the last close?  Is pid really the thing we
want to key on?  What about multiple threads running in the same address
space?  vfio-core does this by only allowing a single open on the vfio
group, thus the vfio device file descriptors can be farmed out to other
threads.  Using pid seems incompatible with that usage model, and we'll
have a vfio group per mdev device, so we can't restrict access there.
The model seems plausible, but it also significantly restricts the user's
freedom unless we can come up with a better context to use to identify
the user.
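
To make the layering concrete, here is a rough sketch of what such a
wrapper might look like in vfio_mdev.  mdev_group_claim() and
mdev_group_release() are made-up helpers and the parent open() callback is
only proposed; none of this exists in the posted patches:

static int vfio_mdev_open(void *device_data)
{
	struct vfio_mdev *vmdev = device_data;
	struct mdev_device *mdev = vmdev->mdev;
	int ret;

	/* record or verify the user for this mdev-group, bump its tally */
	ret = mdev_group_claim(mdev, task_pid_nr(current));
	if (ret)
		return ret;

	if (mdev->parent->ops->open) {		/* proposed callback */
		ret = mdev->parent->ops->open(mdev);
		if (ret)
			mdev_group_release(mdev);
	}
	return ret;
}

Even in this form, the open question remains what context to key on
instead of pid.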

Forcing groups to be static also seems arbitrary since nothing here
demands that the mdev group cannot be changed while not in use.  This
grouping is really only required for NVIDIA mdev devices, so it needs
to be as non-intrusive as possible for other vendors or it needs to
only be invoked for vendors that require it.
 
> >  How would the NVIDIA host
> > driver know which <group, pid> the Intel device got?  
> 
> How to make use of group number to commit resources for devices owned by
> a vendor would be vendor driver's responsibility. NVIDIA driver doesn't
> need to know about Intel's vGPU nor Intel driver need to know about
> NVIDIA's vGPU.

So the mdev layer would be responsible for making sure that a device
within an mdev group can only be opened by the <somehow> identified user,
and the vendor driver would have its own list of mdev groups and
devices and do yet more first-open/last-close processing.
 
> >  This is what the
> > iommu groups do that a different layer of grouping cannot do.  Maybe
> > you're suggesting a group per vendor driver, but how does libvirt know
> > the vendor driver?  Do they need to go research the parent device in
> > sysfs and compare driver links?
> >    
> 
> No, group is not associated with vendor driver. Group number is
> associated iwth mdev device.

Philosophically, mdev devices should be entirely independent of one
another.  A user can set the same iommu context for multiple mdevs
by placing them in the same container.  A user should be able to
stop using an mdev in one place and start using it somewhere else.
It should be a fungible $TYPE device.  It's an NVIDIA-only requirement
that imposes this association of mdev devices into groups and I don't
particularly see it as beneficial to the mdev architecture.  So why
make it a standard part of the interface?

We could do keying at the layer you suggest, assuming we can find
something that doesn't restrict the user, but we could make that
optional.  For instance, say we did key on pid, there could be an
attribute in the supported types hierarchy to indicate this type
supports(requires) pid-sets.  Each mdev device with this attribute
would create a pid-group file in sysfs where libvirt could associate
the device.  Only for those mdev devices requiring it.

The alternative is that we need to find some mechanism for this
association that doesn't impose arbitrary requirements, and potentially
usage restrictions on vendors that don't have this need.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-08-26 14:13       ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-08  2:38         ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-08  2:38 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Dong Jia, alex.williamson, pbonzini, kraxel, cjia, qemu-devel,
	kvm, kevin.tian

On 08/26/2016 10:13 PM, Kirti Wankhede wrote:
> 
> 
> On 8/25/2016 2:52 PM, Dong Jia wrote:
>> On Thu, 25 Aug 2016 09:23:53 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>> [...]
>>
>> Dear Kirti,
>>
>> I just rebased my vfio-ccw patches to this series.
>> With a little fix, which was pointed it out in my reply to the #3
>> patch, it works fine.
>>
> 
> Thanks for update. Glad to know this works for you.
> 
> 
>>> +static long vfio_mdev_unlocked_ioctl(void *device_data,
>>> +				     unsigned int cmd, unsigned long arg)
>>> +{
>>> +	int ret = 0;
>>> +	struct vfio_mdev *vmdev = device_data;
>>> +	struct parent_device *parent = vmdev->mdev->parent;
>>> +	unsigned long minsz;
>>> +
>>> +	switch (cmd) {
>>> +	case VFIO_DEVICE_GET_INFO:
>>> +	{
>>> +		struct vfio_device_info info;
>>> +
>>> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
>>> +
>>> +		if (copy_from_user(&info, (void __user *)arg, minsz))
>>> +			return -EFAULT;
>>> +
>>> +		if (info.argsz < minsz)
>>> +			return -EINVAL;
>>> +
>>> +		if (parent->ops->get_device_info)
>>> +			ret = parent->ops->get_device_info(vmdev->mdev, &info);
>>> +		else
>>> +			return -EINVAL;
>>> +
>>> +		if (ret)
>>> +			return ret;
>>> +
>>> +		if (parent->ops->reset)
>>> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
>> Shouldn't this be done inside the get_device_info callback?
>>
> 
> I would like Vendor driver to set device type only. Reset flag should be
> set on basis of reset() callback provided.
> 
>>> +
>>> +		memcpy(&vmdev->dev_info, &info, sizeof(info));
>>> +
>>> +		return copy_to_user((void __user *)arg, &info, minsz);
>>> +	}
>> [...]
>>
>>> +
>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>>> +			      size_t count, loff_t *ppos)
>>> +{
>>> +	struct vfio_mdev *vmdev = device_data;
>>> +	struct mdev_device *mdev = vmdev->mdev;
>>> +	struct parent_device *parent = mdev->parent;
>>> +	unsigned int done = 0;
>>> +	int ret;
>>> +
>>> +	if (!parent->ops->read)
>>> +		return -EINVAL;
>>> +
>>> +	while (count) {
>> Here, I have to say sorry to you guys for that I didn't notice the
>> bad impact of this change to my patches during the v6 discussion.
>>
>> For vfio-ccw, I introduced an I/O region to input/output I/O
>> instruction parameters and results for Qemu. The @count of these data
>> currently is 140. So supporting arbitrary lengths in one shot here, and
>> also in vfio_mdev_write, seems the better option for this case.
>>
>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
>> can do that in the parent read/write callbacks instead.
>>
>> What do you think?
>>
> 
> I would like to know Alex's thought on this. He raised concern with this
> approach in v6 reviews:
> "But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer."

It is impossible to check @count here, because this layer simply doesn't
have knowledge of the region.

VFIO_DEVICE_GET_REGION_INFO was implemented in vfio-mdev.ko, while decoding a
vfio_mdev_read offset to a particular MMIO region was expected to be
implemented in the vendor driver, which results in unbalanced interfaces.


To have balanced interfaces, you either:

	- call ioctl instead of GET_REGION_INFO
	- call read instead of decoding REGION

or:

	- call GET_REGION_INFO instead of ioctl
	- decode REGION in read, and check its validity, call region-specific
	  read function


V6 was the latter; v7 is kind of a mixture of the two, while I believe
the former would completely address this problem :)
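
For reference, here is a rough sketch of what the latter option could look
like in the mdev layer.  The per-device region cache (vmdev->region) and the
vfio-pci-style offset-to-index encoding are assumptions for illustration,
not something the v7 patches provide:

/* Assumed: region info cached when GET_REGION_INFO was handled, and a
 * vfio-pci-style offset layout where the high bits select the region. */
#define MDEV_OFFSET_SHIFT		40
#define MDEV_OFFSET_TO_INDEX(off)	((u64)(off) >> MDEV_OFFSET_SHIFT)

static int vfio_mdev_validate_io(struct vfio_mdev *vmdev, size_t count,
				 loff_t pos)
{
	u64 index = MDEV_OFFSET_TO_INDEX(pos);
	struct vfio_region_info *region;

	if (index >= vmdev->dev_info.num_regions)
		return -EINVAL;

	region = &vmdev->region[index];		/* assumed per-device cache */
	if ((u64)pos < region->offset ||
	    (u64)pos - region->offset + count > region->size)
		return -EINVAL;

	return 0;	/* @count is now bounded by the region size */
}

vfio_mdev_read()/write() would call something like this before the copy
loop, so @count can be checked against the region size without pushing the
GET_REGION_INFO handling out of vfio-mdev.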


--
Thanks,
Jike


>>> +		size_t filled;
>>> +
>>> +		if (count >= 4 && !(*ppos % 4)) {
>>> +			u32 val;
>>> +
>>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>>> +						*ppos);
>>> +			if (ret <= 0)
>>> +				goto read_err;
>>> +
>>> +			if (copy_to_user(buf, &val, sizeof(val)))
>>> +				goto read_err;
>>> +
>>> +			filled = 4;
>>> +		} else if (count >= 2 && !(*ppos % 2)) {
>>> +			u16 val;
>>> +
>>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>>> +						*ppos);
>>> +			if (ret <= 0)
>>> +				goto read_err;
>>> +
>>> +			if (copy_to_user(buf, &val, sizeof(val)))
>>> +				goto read_err;
>>> +
>>> +			filled = 2;
>>> +		} else {
>>> +			u8 val;
>>> +
>>> +			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
>>> +			if (ret <= 0)
>>> +				goto read_err;
>>> +
>>> +			if (copy_to_user(buf, &val, sizeof(val)))
>>> +				goto read_err;
>>> +
>>> +			filled = 1;
>>> +		}
>>> +
>>> +		count -= filled;
>>> +		done += filled;
>>> +		*ppos += filled;
>>> +		buf += filled;
>>> +	}
>>> +
>>> +	return done;
>>> +
>>> +read_err:
>>> +	return -EFAULT;
>>> +}
>> [...]
>>
>> --------
>> Dong Jia
>>

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-08-25  9:22     ` [Qemu-devel] " Dong Jia
@ 2016-09-08  2:45       ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-08  2:45 UTC (permalink / raw)
  To: Dong Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia,
	qemu-devel, kvm, kevin.tian

On 08/25/2016 05:22 PM, Dong Jia wrote:
> On Thu, 25 Aug 2016 09:23:53 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> [...]
> 
> Dear Kirti,
> 
> I just rebased my vfio-ccw patches onto this series.
> With a little fix, which I pointed out in my reply to the #3
> patch, it works fine.
>

Hi Jia,

Sorry, I didn't follow much of the previous discussion, but since
vfio-mdev in the v7 patchset is at least PCI-agnostic, would you share
with us why you still need a vfio-ccw?


--
Thanks,
Jike
 
>> +static long vfio_mdev_unlocked_ioctl(void *device_data,
>> +				     unsigned int cmd, unsigned long arg)
>> +{
>> +	int ret = 0;
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct parent_device *parent = vmdev->mdev->parent;
>> +	unsigned long minsz;
>> +
>> +	switch (cmd) {
>> +	case VFIO_DEVICE_GET_INFO:
>> +	{
>> +		struct vfio_device_info info;
>> +
>> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
>> +
>> +		if (copy_from_user(&info, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (info.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		if (parent->ops->get_device_info)
>> +			ret = parent->ops->get_device_info(vmdev->mdev, &info);
>> +		else
>> +			return -EINVAL;
>> +
>> +		if (ret)
>> +			return ret;
>> +
>> +		if (parent->ops->reset)
>> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
> Shouldn't this be done inside the get_device_info callback?
> 
>> +
>> +		memcpy(&vmdev->dev_info, &info, sizeof(info));
>> +
>> +		return copy_to_user((void __user *)arg, &info, minsz);
>> +	}
> [...]
> 
>> +
>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	unsigned int done = 0;
>> +	int ret;
>> +
>> +	if (!parent->ops->read)
>> +		return -EINVAL;
>> +
>> +	while (count) {
> Here, I have to say sorry to you guys that I didn't notice the
> bad impact of this change on my patches during the v6 discussion.
> 
> For vfio-ccw, I introduced an I/O region to pass I/O instruction
> parameters and results to and from Qemu. The @count for this data is
> currently 140. So supporting arbitrary lengths in one shot here, and
> also in vfio_mdev_write, seems the better option for this case.
> 
> I believe that if the pci drivers want to iterate in 4-byte steps, they
> can do that in the parent read/write callbacks instead.
> 
> What do you think?
> 
>> +		size_t filled;
>> +
>> +		if (count >= 4 && !(*ppos % 4)) {
>> +			u32 val;
>> +
>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +						*ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 4;
>> +		} else if (count >= 2 && !(*ppos % 2)) {
>> +			u16 val;
>> +
>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +						*ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 2;
>> +		} else {
>> +			u8 val;
>> +
>> +			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
>> +			if (ret <= 0)
>> +				goto read_err;
>> +
>> +			if (copy_to_user(buf, &val, sizeof(val)))
>> +				goto read_err;
>> +
>> +			filled = 1;
>> +		}
>> +
>> +		count -= filled;
>> +		done += filled;
>> +		*ppos += filled;
>> +		buf += filled;
>> +	}
>> +
>> +	return done;
>> +
>> +read_err:
>> +	return -EFAULT;
>> +}
> [...]
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-08  8:09     ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-08  8:09 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi

On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high-level block diagram, with Nvidia, Intel and IBM devices
> as examples, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |  mdev     | |                         |              |
>  | |  bus      | +------------------------>+              |<-> VFIO user
>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>  | |           | |                         |              |
>  | +-----------+ |                         +--------------+
>  |               |

This aims to have only a single vfio bus driver for all mediated devices,
right?
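
(For illustration only, the split would then look roughly like the sketch
below; the my_* and physical_dev names, and the exact probe/remove names,
are made up rather than taken from this patch.)

/* the single vfio bus driver, registered once from vfio_mdev.ko */
static struct mdev_driver vfio_mdev_driver = {
	.name   = "vfio_mdev",
	.probe  = vfio_mdev_probe,
	.remove = vfio_mdev_remove,
};

static int __init vfio_mdev_init(void)
{
	return mdev_register_driver(&vfio_mdev_driver, THIS_MODULE);
}

/* each vendor module only registers the physical device it manages */
static const struct parent_ops my_vendor_ops = {
	.owner   = THIS_MODULE,
	.create  = my_create,
	.destroy = my_destroy,
	.read    = my_read,
	.write   = my_write,
};

static int __init my_vendor_init(void)
{
	return mdev_register_device(&physical_dev->dev, &my_vendor_ops);
}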

>  |  MDEV CORE    |
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated device's driver for mdev, vfio_mdev, uses this interface to
> register with Core driver. vfio_mdev module adds mediated device to VFIO
> group.
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in their own driver. APIs are :
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - reset: to free and reallocate resources in vendor driver during device
> 	 reset.
> - set_online_status: to change online status of mediated device.
> - get_online_status: to get current (online/offline) status of mediated
> 		     device.
> - read : read emulation callback.
> - write: write emulation callback.
> - mmap: mmap emulation callback.
> - get_irq_info: to retrieve information about mediated device's IRQ.
> - set_irqs: send interrupt configuration information that VMM sets.
> - get_device_info: to retrieve VFIO device related flags, number of regions
> 		   and number of IRQs supported.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> Locks to serialize above callbacks are removed. If required, vendor driver
> can have locks to serialize above APIs in their driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> Reviewed-on: http://git-master/r/1175705
> Reviewed-by: Automatic_Commit_Validation_User
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  12 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev_core.c    | 509 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++
>  drivers/vfio/mdev/mdev_private.h |  36 +++
>  drivers/vfio/mdev/mdev_sysfs.c   | 240 ++++++++++++++++++
>  include/linux/mdev.h             | 212 ++++++++++++++++
>  9 files changed, 1147 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..4a23c13b6be4 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_VFIO_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..a34fbc66f92f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,12 @@
> +
> +config VFIO_MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        Provides a framework to virtualize devices.
> +	See Documentation/vfio-mediated-device.txt for more details.
> +
> +        If you don't know what to do here, say N.
> +
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..56a75e689582
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..9f278c7507f7
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,509 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +static LIST_HEAD(parent_list);
> +static DEFINE_MUTEX(parent_list_lock);
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}

These functions are not necessary. You can always assign the attribute groups
to dev->groups before registering a new device.
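
For instance, in mdev_device_create() further down, something like this
(a sketch only) would let the driver core handle the groups:

	mdev->dev.parent  = dev;
	mdev->dev.bus     = &mdev_bus_type;
	mdev->dev.release = mdev_device_release;
	mdev->dev.groups  = parent->ops->mdev_attr_groups;
	dev_set_name(&mdev->dev, "%pUl", uuid.b);

	/* device_register()/device_unregister() now add and remove the
	 * attribute groups automatically, no explicit sysfs calls needed */
	ret = device_register(&mdev->dev);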

> +
> +/* Should be called holding parent->mdev_list_lock */
> +static struct mdev_device *__find_mdev_device(struct parent_device *parent,
> +					      uuid_le uuid)
> +{
> +	struct mdev_device *mdev;
> +
> +	list_for_each_entry(mdev, &parent->mdev_list, next) {
> +		if (uuid_le_cmp(mdev->uuid, uuid) == 0)
> +			return mdev;
> +	}
> +	return NULL;
> +}
> +
> +/* Should be called holding parent_list_lock */
> +static struct parent_device *__find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	list_for_each_entry(parent, &parent_list, next) {
> +		if (parent->dev == dev)
> +			return parent;
> +	}
> +	return NULL;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = mdev_get_parent(__find_parent_device(dev));
> +	mutex_unlock(&parent_list_lock);
> +
> +	return parent;
> +}

As we have demonstrated, all these refs, locks and the release workqueue are not
necessary, as long as you have an independent device associated with the mdev host
device ("parent" device here).

PS, isn't "parent" a somewhat too generic name?

> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	ret = parent->ops->create(mdev, mdev_params);
> +	if (ret)
> +		return ret;
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);

Ditto: dev->groups.

> +	if (ret)
> +		parent->ops->destroy(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	ret = parent->ops->destroy(mdev);
> +	if (ret && !force)
> +		return -EBUSY;
> +
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	list_del(&mdev->next);
> +
> +	/*
> +	 * This unlock pairs with mutex held by mdev_put_device() through
> +	 * kref_put_mutex()
> +	 */
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	struct parent_device *parent;
> +
> +	if (!mdev)
> +		return;
> +
> +	parent = mdev->parent;
> +	kref_put_mutex(&mdev->ref, mdev_release_device,
> +		       &parent->mdev_list_lock);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* check for mandatory ops */
> +	if (!ops->create || !ops->destroy)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_list_lock);
> +
> +	/* Check for duplicate */
> +	parent = __find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_list);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +	mutex_unlock(&parent_list_lock);
> +
> +	ret = parent_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);

parent_create_sysfs_files and mdev_add_attribute_group are kind of doing
the same thing; do you mind merging them into one?

> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev = NULL;
> +	int ret;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = __find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove "mdev_create" and
> +	 * "mdev_destroy" sysfs files so that no new mediated device could be
> +	 * created for this parent
> +	 */
> +	list_del(&parent->next);
> +	parent_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +
> +	while (!list_empty(&parent->mdev_list)) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		if (!list_empty(&parent->mdev_list)) {
> +			mdev = list_first_entry(&parent->mdev_list,
> +						struct mdev_device, next);
> +			mdev_device_destroy_ops(mdev, true);
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			mdev_put_device(mdev);
> +	}
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev_sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_from_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	/* Check for duplicate */
> +	mdev = __find_mdev_device(parent, uuid);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->parent = parent;
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	ret = mdev_create_sysfs_files(&mdev->dev);
> +	if (ret)
> +		goto create_sysfs_error;
> +
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_sysfs_error:
> +	mdev_device_destroy_ops(mdev, true);
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_from_dev(dev);
> +	if (!parent)
> +		return -ENODEV;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	mdev = __find_mdev_device(parent, uuid);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev_remove_sysfs_files(&mdev->dev);
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_device(mdev);
> +
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +destroy_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_from_dev(dev);
> +
> +	if (parent) {
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_set_online_status(struct device *dev, bool online)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_device(to_mdev_device(dev));
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->set_online_status)
> +		ret = parent->ops->set_online_status(mdev, online);
> +
> +	if (ret)
> +		pr_err("mdev online failed  %d\n", ret);
> +	else {
> +		if (online)
> +			kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +		else
> +			kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +	}
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_get_online_status(struct device *dev, bool *online)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_device(to_mdev_device(dev));
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->get_online_status)
> +		ret = parent->ops->get_online_status(mdev, online);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}

The driver core already provides an 'online' file for a device, with both
'show' and 'store' support, so you don't need to write another one.

Please have a look at online_show and online_store in drivers/base/core.c.
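
Hooking into that only needs online/offline callbacks on the bus; a rough
sketch (the core 'online' attribute is created when both callbacks are set):

static int mdev_online(struct device *dev)
{
	struct mdev_device *mdev = to_mdev_device(dev);
	struct parent_device *parent = mdev->parent;

	return parent->ops->set_online_status ?
	       parent->ops->set_online_status(mdev, true) : 0;
}

static int mdev_offline(struct device *dev)
{
	struct mdev_device *mdev = to_mdev_device(dev);
	struct parent_device *parent = mdev->parent;

	return parent->ops->set_online_status ?
	       parent->ops->set_online_status(mdev, false) : 0;
}

struct bus_type mdev_bus_type = {
	.name		= "mdev",
	.probe		= mdev_probe,
	.remove		= mdev_remove,
	.online		= mdev_online,
	.offline	= mdev_offline,
};

device_online()/device_offline() also emit the KOBJ_ONLINE/KOBJ_OFFLINE
uevents, so the manual kobject_uevent() calls could go away as well.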

> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..8afc2d8e5c04
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,131 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	mdev->group = NULL;
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	if (ret)
> +		mdev_detach_iommu(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..07ad1b381370
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,36 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];

This is useless?

> +
> +int  parent_create_sysfs_files(struct device *dev);
> +void parent_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +
> +int mdev_device_set_online_status(struct device *dev, bool online);
> +int mdev_device_get_online_status(struct device *dev, bool *online);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..ed55cd5d6595
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,240 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +static ssize_t online_store(struct device *dev, struct device_attribute *attr,
> +			    const char *buf, size_t count);
> +static ssize_t online_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf);
> +static DEVICE_ATTR_RW(online);
> +
> +/* Static functions */
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);

pstr is not used.

> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);

Ditto.

> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t online_store(struct device *dev, struct device_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *str;
> +	int ret;
> +	uint32_t online_status;
> +	bool online;
> +
> +	str = kstrndup(buf, count, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ret = kstrtouint(str, 0, &online_status);
> +	kfree(str);
> +
> +	if (ret) {
> +		pr_err("online: parsing error %s\n", buf);
> +		return ret;
> +	}
> +
> +	online = online_status > 0 ? true : false;
> +
> +	ret = mdev_device_set_online_status(dev, online);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +
> +static ssize_t online_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf)
> +{
> +	int ret;
> +	bool online = false;
> +
> +	ret = mdev_device_get_online_status(dev, &online);
> +	if (ret)
> +		return ret;
> +
> +	ret = sprintf(buf, "%d\n", online);
> +	return ret;
> +}

online_show and online_store are unnecessary, see comment on mdev_device_get_online_status.

> +
> +int parent_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	} else
> +		return ret;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void parent_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}

The two functions above are also unnecessary: you can always group the attributes
and create them with a single call to sysfs_create_files().
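
For instance (a sketch only, reusing the attributes already defined above):

static const struct attribute *parent_dev_attrs[] = {
	&dev_attr_mdev_supported_types.attr,
	&dev_attr_mdev_create.attr,
	&dev_attr_mdev_destroy.attr,
	NULL,
};

int parent_create_sysfs_files(struct device *dev)
{
	return sysfs_create_files(&dev->kobj, parent_dev_attrs);
}

void parent_remove_sysfs_files(struct device *dev)
{
	sysfs_remove_files(&dev->kobj, parent_dev_attrs);
}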

> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_online.attr);
> +	if (ret)
> +		pr_err("Failed to create 'online' entry\n");
> +
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_online.attr);
> +}

As said above, "online" attr is unnecessary.

> +
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..babcb7293199
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,212 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +#include <uapi/linux/vfio.h>
> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	uuid_le			uuid;
> +	void			*driver_data;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of parent device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device. It is
> + *			mandatory to provide create ops.
> + *			@mdev: mdev_device structure of the mediated device
> + *			      that is being created
> + *			@mdev_params: extra parameters required by parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in parent device's driver for
> + *			a mediated device. It is mandatory to provide destroy
> + *			ops.
> + *			@mdev: mdev_device device structure which is being
> + *			       destroyed
> + *			Returns integer: success (0) or error (< 0)
> + *			If VMM is running and destroy() is called, that means the
> + *			mdev is being hot-unplugged. Return error if VMM is
> + *			running and driver doesn't support mediated device
> + *			hotplug.
> + * @reset:		Called to reset mediated device.
> + *			@mdev: mdev_device device structure.
> + *			Returns integer: success (0) or error (< 0)
> + * @set_online_status:	Called to change the status of a mediated device.
> + *			@mdev: mediated device.
> + *			@online: set true or false to make mdev device online or
> + *			offline.
> + *			Returns integer: success (0) or error (< 0)
> + * @get_online_status:	Called to get online/offline status of  mediated device
> + *			@mdev: mediated device.
> + *			@online: Returns status of mediated device.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@pos: address.
> + *			Returns number of bytes read on success, or an error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@pos: address.
> + *			Returns number of bytes written on success, or an error.
> + * @get_irq_info:	Called to retrieve information about mediated device IRQ
> + *			@mdev: mediated device structure
> + *			@irq_info: VFIO IRQ flags and count.
> + *			Returns integer: success (0) or error (< 0)
> + * @set_irqs:		Called to send the interrupt configuration
> + *			information that the VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_device_info:	Called to get VFIO device information for a mediated
> + *			device.
> + *			@vfio_device_info: VFIO device info.
> + *			Returns integer: success (0) or error (< 0)
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			@cap_type_id: returns id of capability.
> + *			@cap_type: returns pointer to capability structure
> + *			corresponding to capability id.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A parent device that supports mediated devices should be registered with the
> + * mdev module with a parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct mdev_device *mdev, char *mdev_params);
> +	int     (*destroy)(struct mdev_device *mdev);
> +	int     (*reset)(struct mdev_device *mdev);
> +	int     (*set_online_status)(struct mdev_device *mdev, bool online);
> +	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
> +	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
> +			loff_t pos);
> +	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
> +			 loff_t pos);
> +	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +	int	(*get_irq_info)(struct mdev_device *mdev,
> +				struct vfio_irq_info *irq_info);
> +	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_device_info)(struct mdev_device *mdev,
> +				   struct vfio_device_info *dev_info);
> +	int	(*get_region_info)(struct mdev_device *mdev,
> +				   struct vfio_region_info *region_info,
> +				   u16 *cap_type_id, void **cap_type);
> +};

I have a strong objection here to such low-level interfaces: the interfaces
between vfio-mdev and vendor drivers should be as thin as possible, not imposing
any limitations on vendor drivers.

I saw that validate_map_request was removed from the ops and mmap was added.
That is pretty nice. Furthermore, if you add an ioctl here, you can also remove
get_device_info, get_irq_info, set_irqs, and get_region_info (and even "reset");
a rough sketch of such a thinner callback set follows the list below.
There are several benefits to doing this:

	-	Balanced interfaces.
		As I replied in another mail, you won't have unbalanced interfaces.
		You already have read, write and mmap in the ops, why not ioctl?

	-	Scalability.
		You are intercepting optional vfio capabilities in the framework, but
		what if vfio.ko or vfio-pci.ko adds a few new capabilities in the future?

	-	Abstraction.
		Even if placing common code here avoids some duplication, you still
		duplicate code with vfio-pci.  Better to move the common logic out of
		vfio-pci and call it from mdev vendor drivers.

	-	Maintainability.
		This is pretty obvious :)
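
As promised above, roughly what that thinner callback set could look like
(illustrative only; the exact ioctl signature is a guess, not from this patch):

struct parent_ops {
	struct module   *owner;
	const struct attribute_group **dev_attr_groups;
	const struct attribute_group **mdev_attr_groups;

	int	(*supported_config)(struct device *dev, char *config);
	int	(*create)(struct mdev_device *mdev, char *mdev_params);
	int	(*destroy)(struct mdev_device *mdev);

	/* everything the user/VMM does at runtime goes straight through */
	ssize_t	(*read)(struct mdev_device *mdev, char __user *buf,
			size_t count, loff_t *ppos);
	ssize_t	(*write)(struct mdev_device *mdev, const char __user *buf,
			 size_t count, loff_t *ppos);
	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
			 unsigned long arg);
	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
};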

> +
> +/*
> + * Parent Device
> + */
> +
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}

These can be macros, like the pci ones.
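
For example (note that, unlike the inline helpers above, pci-style macros
do not check for NULL):

#define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)
#define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)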

> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return mdev->driver_data;
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	mdev->driver_data = data;
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver
@ 2016-09-08  8:09     ` Jike Song
  0 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-08  8:09 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi

On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high-level block diagram, with Nvidia, Intel and IBM devices
> as examples, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |  mdev     | |                         |              |
>  | |  bus      | +------------------------>+              |<-> VFIO user
>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>  | |           | |                         |              |
>  | +-----------+ |                         +--------------+
>  |               |

This aims to have only a single vfio bus driver for all mediated devices,
right?

>  |  MDEV CORE    |
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated device's driver for mdev, vfio_mdev, uses this interface to
> register with Core driver. vfio_mdev module adds mediated device to VFIO
> group.
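
[ Illustration only, not part of the patch: a minimal bus driver using this
  registration interface might look roughly like the sketch below. The
  sample_* names and the probe/remove bodies are placeholders. ]

#include <linux/module.h>
#include <linux/device.h>
#include <linux/mdev.h>

static int sample_mdev_probe(struct device *dev)
{
	/* set up per-device state for the newly created mediated device */
	return 0;
}

static void sample_mdev_remove(struct device *dev)
{
	/* tear down per-device state */
}

static struct mdev_driver sample_mdev_driver = {
	.name   = "sample_mdev",
	.probe  = sample_mdev_probe,
	.remove = sample_mdev_remove,
};

static int __init sample_init(void)
{
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}

static void __exit sample_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}

module_init(sample_init);
module_exit(sample_exit);
MODULE_LICENSE("GPL");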
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in their own driver. APIs are :
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - reset: to free and reallocate resources in vendor driver during device
> 	 reset.
> - set_online_status: to change online status of mediated device.
> - get_online_status: to get current (online/offline) status of mediated
> 		     device.
> - read : read emulation callback.
> - write: write emulation callback.
> - mmap: mmap emulation callback.
> - get_irq_info: to retrieve information about mediated device's IRQ.
> - set_irqs: send interrupt configuration information that VMM sets.
> - get_device_info: to retrieve VFIO device related flags, number of regions
> 		   and number of IRQs supported.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> Locks to serialize above callbacks are removed. If required, vendor driver
> can have locks to serialize above APIs in their driver.
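
[ Illustration only, not in the patch: the vendor-driver side of this
  interface, with just the mandatory create/destroy callbacks filled in.
  The sample_* names and the callback bodies are placeholders. ]

#include <linux/module.h>
#include <linux/device.h>
#include <linux/mdev.h>

static int sample_create(struct mdev_device *mdev, char *mdev_params)
{
	/* allocate per-mdev resources in the vendor driver */
	return 0;
}

static int sample_destroy(struct mdev_device *mdev)
{
	/* free per-mdev resources; return an error if the device is in use
	 * and hot-unplug is not supported */
	return 0;
}

static const struct parent_ops sample_parent_ops = {
	.owner   = THIS_MODULE,
	.create  = sample_create,
	.destroy = sample_destroy,
};

/* called from the vendor driver's probe path for the physical device */
static int sample_setup_mdev(struct device *physical_dev)
{
	return mdev_register_device(physical_dev, &sample_parent_ops);
}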
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> Reviewed-on: http://git-master/r/1175705
> Reviewed-by: Automatic_Commit_Validation_User
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  12 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev_core.c    | 509 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++
>  drivers/vfio/mdev/mdev_private.h |  36 +++
>  drivers/vfio/mdev/mdev_sysfs.c   | 240 ++++++++++++++++++
>  include/linux/mdev.h             | 212 ++++++++++++++++
>  9 files changed, 1147 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..4a23c13b6be4 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_VFIO_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..a34fbc66f92f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,12 @@
> +
> +config VFIO_MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        Provides a framework to virtualize devices.
> +	See Documentation/vfio-mediated-device.txt for more details.
> +
> +        If you don't know what to do here, say N.
> +
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..56a75e689582
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..9f278c7507f7
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,509 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +static LIST_HEAD(parent_list);
> +static DEFINE_MUTEX(parent_list_lock);
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}

These functions are not necessary. You can always specify the attribute groups
to dev->groups before registering a new device.
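
For instance, a sketch of that approach in mdev_device_create() (illustration
only, not compile-tested) would be to drop the explicit sysfs_create_groups()
call and let the core handle it:

	mdev->dev.parent  = dev;
	mdev->dev.bus     = &mdev_bus_type;
	mdev->dev.release = mdev_device_release;
	/* driver core creates/removes these groups in device_add()/del() */
	mdev->dev.groups  = parent->ops->mdev_attr_groups;
	dev_set_name(&mdev->dev, "%pUl", uuid.b);

	ret = device_register(&mdev->dev);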

> +
> +/* Should be called holding parent->mdev_list_lock */
> +static struct mdev_device *__find_mdev_device(struct parent_device *parent,
> +					      uuid_le uuid)
> +{
> +	struct mdev_device *mdev;
> +
> +	list_for_each_entry(mdev, &parent->mdev_list, next) {
> +		if (uuid_le_cmp(mdev->uuid, uuid) == 0)
> +			return mdev;
> +	}
> +	return NULL;
> +}
> +
> +/* Should be called holding parent_list_lock */
> +static struct parent_device *__find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	list_for_each_entry(parent, &parent_list, next) {
> +		if (parent->dev == dev)
> +			return parent;
> +	}
> +	return NULL;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = mdev_get_parent(__find_parent_device(dev));
> +	mutex_unlock(&parent_list_lock);
> +
> +	return parent;
> +}

As we have demonstrated, all these refs and locks and release workqueue are not necessary,
as long as you have an independent device associated with the mdev host device
("parent" device here).

PS: isn't "parent" a somewhat too generic name?

> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	ret = parent->ops->create(mdev, mdev_params);
> +	if (ret)
> +		return ret;
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);

Ditto: dev->groups.

> +	if (ret)
> +		parent->ops->destroy(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	ret = parent->ops->destroy(mdev);
> +	if (ret && !force)
> +		return -EBUSY;
> +
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	list_del(&mdev->next);
> +
> +	/*
> +	 * This unlock pairs with mutex held by mdev_put_device() through
> +	 * kref_put_mutex()
> +	 */
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	struct parent_device *parent;
> +
> +	if (!mdev)
> +		return;
> +
> +	parent = mdev->parent;
> +	kref_put_mutex(&mdev->ref, mdev_release_device,
> +		       &parent->mdev_list_lock);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* check for mandatory ops */
> +	if (!ops->create || !ops->destroy)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_list_lock);
> +
> +	/* Check for duplicate */
> +	parent = __find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_list);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +	mutex_unlock(&parent_list_lock);
> +
> +	ret = parent_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);

parent_create_sysfs_files and mdev_add_attribute_group are kind of doing
the same thing; do you mind merging them into one?

> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev = NULL;
> +	int ret;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = __find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove "mdev_create" and
> +	 * "mdev_destroy" sysfs files so that no new mediated device could be
> +	 * created for this parent
> +	 */
> +	list_del(&parent->next);
> +	parent_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +
> +	while (!list_empty(&parent->mdev_list)) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		if (!list_empty(&parent->mdev_list)) {
> +			mdev = list_first_entry(&parent->mdev_list,
> +						struct mdev_device, next);
> +			mdev_device_destroy_ops(mdev, true);
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			mdev_put_device(mdev);
> +	}
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev_sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_from_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	/* Check for duplicate */
> +	mdev = __find_mdev_device(parent, uuid);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->parent = parent;
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	ret = mdev_create_sysfs_files(&mdev->dev);
> +	if (ret)
> +		goto create_sysfs_error;
> +
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_sysfs_error:
> +	mdev_device_destroy_ops(mdev, true);
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_from_dev(dev);
> +	if (!parent)
> +		return -ENODEV;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	mdev = __find_mdev_device(parent, uuid);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev_remove_sysfs_files(&mdev->dev);
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_device(mdev);
> +
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +destroy_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_from_dev(dev);
> +
> +	if (parent) {
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_set_online_status(struct device *dev, bool online)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_device(to_mdev_device(dev));
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->set_online_status)
> +		ret = parent->ops->set_online_status(mdev, online);
> +
> +	if (ret)
> +		pr_err("mdev online failed  %d\n", ret);
> +	else {
> +		if (online)
> +			kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +		else
> +			kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +	}
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_get_online_status(struct device *dev, bool *online)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_device(to_mdev_device(dev));
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->get_online_status)
> +		ret = parent->ops->get_online_status(mdev, online);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}

The driver core has a perfect 'online' file for a device, with both
'show' and 'store' support; you don't need to write another one.

Please have a look at online_show and online_store in drivers/base/core.c.
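
If I remember correctly, the core only creates that 'online' file when the bus
provides .online/.offline callbacks (see device_supports_offline()), so the
idea would be roughly the sketch below (illustration only, not compile-tested;
it just forwards to the existing set_online_status callback):

static int mdev_online(struct device *dev)
{
	struct mdev_device *mdev = to_mdev_device(dev);
	const struct parent_ops *ops = mdev->parent->ops;

	return ops->set_online_status ? ops->set_online_status(mdev, true) : 0;
}

static int mdev_offline(struct device *dev)
{
	struct mdev_device *mdev = to_mdev_device(dev);
	const struct parent_ops *ops = mdev->parent->ops;

	return ops->set_online_status ? ops->set_online_status(mdev, false) : 0;
}

struct bus_type mdev_bus_type = {
	.name    = "mdev",
	.probe   = mdev_probe,
	.remove  = mdev_remove,
	.online  = mdev_online,		/* device_online() ends up here */
	.offline = mdev_offline,	/* device_offline() ends up here */
};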

> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..8afc2d8e5c04
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,131 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	mdev->group = NULL;
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	if (ret)
> +		mdev_detach_iommu(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..07ad1b381370
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,36 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];

This is useless?

> +
> +int  parent_create_sysfs_files(struct device *dev);
> +void parent_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +
> +int mdev_device_set_online_status(struct device *dev, bool online);
> +int mdev_device_get_online_status(struct device *dev, bool *online);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..ed55cd5d6595
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,240 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +static ssize_t online_store(struct device *dev, struct device_attribute *attr,
> +			    const char *buf, size_t count);
> +static ssize_t online_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf);
> +static DEVICE_ATTR_RW(online);
> +
> +/* Static functions */
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);

pstr is not used.

> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);

Ditto.

> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t online_store(struct device *dev, struct device_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *str;
> +	int ret;
> +	uint32_t online_status;
> +	bool online;
> +
> +	str = kstrndup(buf, count, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ret = kstrtouint(str, 0, &online_status);
> +	kfree(str);
> +
> +	if (ret) {
> +		pr_err("online: parsing error %s\n", buf);
> +		return ret;
> +	}
> +
> +	online = online_status > 0 ? true : false;
> +
> +	ret = mdev_device_set_online_status(dev, online);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +
> +static ssize_t online_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf)
> +{
> +	int ret;
> +	bool online = false;
> +
> +	ret = mdev_device_get_online_status(dev, &online);
> +	if (ret)
> +		return ret;
> +
> +	ret = sprintf(buf, "%d\n", online);
> +	return ret;
> +}

online_show and online_store are unnecessary, see comment on mdev_device_get_online_status.

> +
> +int parent_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	} else
> +		return ret;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void parent_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}

The 2 functions above are also unnecessary: you can always group the attributes
into one array and create them with a single call to sysfs_create_files().
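
Something like this (sketch of the suggestion only, not compile-tested)
would do:

static const struct attribute *parent_attrs[] = {
	&dev_attr_mdev_supported_types.attr,
	&dev_attr_mdev_create.attr,
	&dev_attr_mdev_destroy.attr,
	NULL,
};

int parent_create_sysfs_files(struct device *dev)
{
	/* creates all attributes, and unwinds on failure */
	return sysfs_create_files(&dev->kobj, parent_attrs);
}

void parent_remove_sysfs_files(struct device *dev)
{
	sysfs_remove_files(&dev->kobj, parent_attrs);
}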

> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_online.attr);
> +	if (ret)
> +		pr_err("Failed to create 'online' entry\n");
> +
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_online.attr);
> +}

As said above, "online" attr is unnecessary.

> +
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..babcb7293199
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,212 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +#include <uapi/linux/vfio.h>
> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	uuid_le			uuid;
> +	void			*driver_data;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of parent device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device. It is
> + *			mandatory to provide create ops.
> + *			@mdev: mdev_device structure of the mediated device
> + *			      that is being created
> + *			@mdev_params: extra parameters required by parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in parent device's driver for
> + *			a mediated device. It is mandatory to provide destroy
> + *			ops.
> + *			@mdev: mdev_device device structure which is being
> + *			       destroyed
> + *			Returns integer: success (0) or error (< 0)
> + *			If VMM is running and destroy() is called that means the
> + *			mdev is being hot-unplugged. Return error if VMM is
> + *			running and driver doesn't support mediated device
> + *			hotplug.
> + * @reset:		Called to reset mediated device.
> + *			@mdev: mdev_device device structure.
> + *			Returns integer: success (0) or error (< 0)
> + * @set_online_status:	Called to change the status of the mediated device.
> + *			@mdev: mediated device.
> + *			@online: set true or false to make mdev device online or
> + *			offline.
> + *			Returns integer: success (0) or error (< 0)
> + * @get_online_status:	Called to get online/offline status of mediated device
> + *			@mdev: mediated device.
> + *			@online: Returns status of mediated device.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@pos: address.
> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@pos: address.
> + *			Returns number of bytes written on success or error.
> + * @get_irq_info:	Called to retrieve information about mediated device IRQ
> + *			@mdev: mediated device structure
> + *			@irq_info: VFIO IRQ flags and count.
> + *			Returns integer: success (0) or error (< 0)
> + * @set_irqs:		Called to send interrupt configuration
> + *			information that the VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_device_info:	Called to get VFIO device information for a mediated
> + *			device.
> + *			@vfio_device_info: VFIO device info.
> + *			Returns integer: success (0) or error (< 0)
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			@cap_type_id: returns id of capability.
> + *			@cap_type: returns pointer to capability structure
> + *			corresponding to capability id.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A parent device that supports mediated devices should be registered with the
> + * mdev module with a parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct mdev_device *mdev, char *mdev_params);
> +	int     (*destroy)(struct mdev_device *mdev);
> +	int     (*reset)(struct mdev_device *mdev);
> +	int     (*set_online_status)(struct mdev_device *mdev, bool online);
> +	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
> +	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
> +			loff_t pos);
> +	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
> +			 loff_t pos);
> +	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +	int	(*get_irq_info)(struct mdev_device *mdev,
> +				struct vfio_irq_info *irq_info);
> +	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_device_info)(struct mdev_device *mdev,
> +				   struct vfio_device_info *dev_info);
> +	int	(*get_region_info)(struct mdev_device *mdev,
> +				   struct vfio_region_info *region_info,
> +				   u16 *cap_type_id, void **cap_type);
> +};

I have a strong objection here to such low-level interfaces: the interfaces
between vfio-mdev and vendor drivers should be as thin as possible, not imposing
any limitations on vendor drivers.

I saw that validate_map_request was removed from the ops and mmap was added. 
That is pretty nice. Furthermore, if you add an ioctl here, you can also remove
get_device_info, get_irq_info, set_irqs, and get_region_info (and even "reset").
There are several benefits to doing this; a rough sketch of what I mean follows the list:

	-	Balanced interfaces.
		Like I replied in another mail, you won't have unbalanced interfaces.
		You already have read, write and mmap in the ops, why not ioctl?

	-	Scalability.
		You are intercepting vfio optional capabilities in the framework, but
		what if vfio.ko or vfio-pci.ko adds a few new capabilities in the future?

	-	Abstraction.
		Even if placing common code here avoids some duplication, you still
		have duplicated code with vfio-pci.  Better to move the common logic out of
		vfio-pci and call it from mdev vendor drivers.

	-	Maintainability.
		This is pretty obvious :)
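
To make the ioctl suggestion concrete, the callback set I have in mind would be
roughly the following. This is a hypothetical sketch, not a concrete proposal
for the exact signature:

struct parent_ops {
	struct module   *owner;
	const struct attribute_group **dev_attr_groups;
	const struct attribute_group **mdev_attr_groups;

	int	(*supported_config)(struct device *dev, char *config);
	int     (*create)(struct mdev_device *mdev, char *mdev_params);
	int     (*destroy)(struct mdev_device *mdev);
	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
			loff_t pos);
	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
			 loff_t pos);
	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
	/* single pass-through entry point replacing get_device_info,
	 * get_irq_info, set_irqs, get_region_info and reset */
	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
			 unsigned long arg);
};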

> +
> +/*
> + * Parent Device
> + */
> +
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}

These can be macros, like pci ones.
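
E.g. (sketch only), mirroring to_pci_dev()/to_pci_driver():

#define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
#define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)

Note that the macros, unlike the inline helpers above, would not tolerate a
NULL pointer, which is also how the PCI ones behave.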

> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return mdev->driver_data;
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	mdev->driver_data = data;
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-08  8:09     ` [Qemu-devel] " Jike Song
@ 2016-09-08  9:38       ` Neo Jia
  -1 siblings, 0 replies; 162+ messages in thread
From: Neo Jia @ 2016-09-08  9:38 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, bjsdjshi

On Thu, Sep 08, 2016 at 04:09:39PM +0800, Jike Song wrote:
> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
> > +
> > +/**
> > + * struct parent_ops - Structure to be registered for each parent device to
> > + * register the device to mdev module.
> > + *
> > + * @owner:		The module owner.
> > + * @dev_attr_groups:	Default attributes of the parent device.
> > + * @mdev_attr_groups:	Default attributes of the mediated device.
> > + * @supported_config:	Called to get information about supported types.
> > + *			@dev : device structure of parent device.
> > + *			@config: should return string listing supported config
> > + *			Returns integer: success (0) or error (< 0)
> > + * @create:		Called to allocate basic resources in parent device's
> > + *			driver for a particular mediated device. It is
> > + *			mandatory to provide create ops.
> > + *			@mdev: mdev_device structure on of mediated device
> > + *			      that is being created
> > + *			@mdev_params: extra parameters required by parent
> > + *			device's driver.
> > + *			Returns integer: success (0) or error (< 0)
> > + * @destroy:		Called to free resources in parent device's driver for a
> > + *			a mediated device. It is mandatory to provide destroy
> > + *			ops.
> > + *			@mdev: mdev_device device structure which is being
> > + *			       destroyed
> > + *			Returns integer: success (0) or error (< 0)
> > + *			If VMM is running and destroy() is called that means the
> > + *			mdev is being hotunpluged. Return error if VMM is
> > + *			running and driver doesn't support mediated device
> > + *			hotplug.
> > + * @reset:		Called to reset mediated device.
> > + *			@mdev: mdev_device device structure.
> > + *			Returns integer: success (0) or error (< 0)
> > + * @set_online_status:	Called to change to status of mediated device.
> > + *			@mdev: mediated device.
> > + *			@online: set true or false to make mdev device online or
> > + *			offline.
> > + *			Returns integer: success (0) or error (< 0)
> > + * @get_online_status:	Called to get online/offline status of  mediated device
> > + *			@mdev: mediated device.
> > + *			@online: Returns status of mediated device.
> > + *			Returns integer: success (0) or error (< 0)
> > + * @read:		Read emulation callback
> > + *			@mdev: mediated device structure
> > + *			@buf: read buffer
> > + *			@count: number of bytes to read
> > + *			@pos: address.
> > + *			Retuns number on bytes read on success or error.
> > + * @write:		Write emulation callback
> > + *			@mdev: mediated device structure
> > + *			@buf: write buffer
> > + *			@count: number of bytes to be written
> > + *			@pos: address.
> > + *			Retuns number on bytes written on success or error.
> > + * @get_irq_info:	Called to retrieve information about mediated device IRQ
> > + *			@mdev: mediated device structure
> > + *			@irq_info: VFIO IRQ flags and count.
> > + *			Returns integer: success (0) or error (< 0)
> > + * @set_irqs:		Called to send about interrupts configuration
> > + *			information that VMM sets.
> > + *			@mdev: mediated device structure
> > + *			@flags, index, start, count and *data : same as that of
> > + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> > + * @get_device_info:	Called to get VFIO device information for a mediated
> > + *			device.
> > + *			@vfio_device_info: VFIO device info.
> > + *			Returns integer: success (0) or error (< 0)
> > + * @get_region_info:	Called to get VFIO region size and flags of mediated
> > + *			device.
> > + *			@mdev: mediated device structure
> > + *			@region_info: output, returns size and flags of
> > + *				      requested region.
> > + *			@cap_type_id: returns id of capability.
> > + *			@cap_type: returns pointer to capability structure
> > + *			corresponding to capability id.
> > + *			Returns integer: success (0) or error (< 0)
> > + *
> > + * Parent device that support mediated device should be registered with mdev
> > + * module with parent_ops structure.
> > + */
> > +
> > +struct parent_ops {
> > +	struct module   *owner;
> > +	const struct attribute_group **dev_attr_groups;
> > +	const struct attribute_group **mdev_attr_groups;
> > +
> > +	int	(*supported_config)(struct device *dev, char *config);
> > +	int     (*create)(struct mdev_device *mdev, char *mdev_params);
> > +	int     (*destroy)(struct mdev_device *mdev);
> > +	int     (*reset)(struct mdev_device *mdev);
> > +	int     (*set_online_status)(struct mdev_device *mdev, bool online);
> > +	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
> > +	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
> > +			loff_t pos);
> > +	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
> > +			 loff_t pos);
> > +	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> > +	int	(*get_irq_info)(struct mdev_device *mdev,
> > +				struct vfio_irq_info *irq_info);
> > +	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
> > +			    unsigned int index, unsigned int start,
> > +			    unsigned int count, void *data);
> > +	int	(*get_device_info)(struct mdev_device *mdev,
> > +				   struct vfio_device_info *dev_info);
> > +	int	(*get_region_info)(struct mdev_device *mdev,
> > +				   struct vfio_region_info *region_info,
> > +				   u16 *cap_type_id, void **cap_type);
> > +};
> 
> I have a strong objection here to such low-level interfaces, the interfaces
> between vfio-mdev and vendor drivers should be as thin as possible, not imposing
> any limitation to vendor drivers.

Hi Jike,

Welcome! :-)

Unfortunately, this is something I definitely can't agree with you.

We would like to capture as much of the common code as possible without losing
flexibility, so that vendor driver writers won't have to duplicate it and we
have something that can be maintained publicly.

If you are running into specific limitation with above callback interfaces,
please show us the scenarios and we are very happy to look into that.

> 
> I saw that validate_map_request was removed from the ops and mmap was added. 
> That is pretty nice. Furthermore, if you add an ioctl here, you can also remove
> get_device_info, get_irq_info, set_irqs, and get_region_info (and even "reset").
> There are several benefits by doing this:

The decision to move validate_map_request is mainly because we are adding a lot of
advanced logic which most vendor drivers don't require; since we are the only consumer
of such logic, there is no need to put it in the public/shared module.

> 
> 	-	Balanced interfaces.
> 		Like I replied in another mail, you won't have unbalanced interfaces.
> 		You already have read, write and mmap in the ops, why not ioctl?

Sorry, I don't think a "balanced" interface is a design criterion, especially when
pursuing a "balanced or full-set" interface for its own sake ends up creating lots of
duplicated code for vendor driver writers.

> 
> 	-	Scalability.
> 		You are intercepting vfio optional capabilities in the framework, but
> 		how if vfio.ko, or vfio-pci.ko add a few new capabilities in the future?

Exactly my point about the code sharing.

If a new cap is added inside vfio.ko or vfio-pci.ko, we can just add it into
vfio_mdev.ko.

Adding the code in one place is better than duplicating it in multiple vendor drivers.

> 
> 	-	Abstraction.
> 		Even placing common codes here can avoid code duplication, you still
> 		have code duplicate with vfio-pci.  Better to move common logic out of
> 		vfio-pci and call them from mdev vendor drivers.

Are you saying we should avoid the code duplication between vfio-pci and vfio-mdev?

> 
> 	-	Maintainability.
> 		This is pretty obvious :)

Definitely not; the burden is moving to the vendor driver side.

Again, Jike, I really want to enable you with the mediated framework we have been
building here. So it is probably easier for us to accommodate your needs if you could
follow the interfaces we have introduced and let us know if you have any specific
issues.

Thanks,
Neo

> 
> > +
> > +/*
> > + * Parent Device
> > + */
> > +
> > +struct parent_device {
> > +	struct device		*dev;
> > +	const struct parent_ops	*ops;
> > +
> > +	/* internal */
> > +	struct kref		ref;
> > +	struct list_head	next;
> > +	struct list_head	mdev_list;
> > +	struct mutex		mdev_list_lock;
> > +	wait_queue_head_t	release_done;
> > +};
> > +
> > +/**
> > + * struct mdev_driver - Mediated device driver
> > + * @name: driver name
> > + * @probe: called when new device created
> > + * @remove: called when device removed
> > + * @driver: device driver structure
> > + *
> > + **/
> > +struct mdev_driver {
> > +	const char *name;
> > +	int  (*probe)(struct device *dev);
> > +	void (*remove)(struct device *dev);
> > +	struct device_driver driver;
> > +};
> > +
> > +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> > +{
> > +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> > +}
> > +
> > +static inline struct mdev_device *to_mdev_device(struct device *dev)
> > +{
> > +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> > +}
> 
> These can be macros, like pci ones.
> 
> > +
> > +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> > +{
> > +	return mdev->driver_data;
> > +}
> > +
> > +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> > +{
> > +	mdev->driver_data = data;
> > +}
> > +
> > +extern struct bus_type mdev_bus_type;
> > +
> > +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> > +
> > +extern int  mdev_register_device(struct device *dev,
> > +				 const struct parent_ops *ops);
> > +extern void mdev_unregister_device(struct device *dev);
> > +
> > +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> > +extern void mdev_unregister_driver(struct mdev_driver *drv);
> > +
> > +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> > +extern void mdev_put_device(struct mdev_device *mdev);
> > +
> > +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> > +
> > +#endif /* MDEV_H */
> > 
> 
> --
> Thanks,
> Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-07 22:13                                     ` Alex Williamson
@ 2016-09-08 18:48                                       ` Kirti Wankhede
  2016-09-08 20:51                                         ` Alex Williamson
  0 siblings, 1 reply; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-08 18:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi



On 9/8/2016 3:43 AM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 23:36:28 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/7/2016 10:14 PM, Alex Williamson wrote:
>>> On Wed, 7 Sep 2016 21:45:31 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 9/7/2016 2:58 AM, Alex Williamson wrote:  
>>>>> On Wed, 7 Sep 2016 01:05:11 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 9/6/2016 11:10 PM, Alex Williamson wrote:    
>>>>>>> On Sat, 3 Sep 2016 22:04:56 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:      
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:        

...

> 
> Philosophically, mdev devices should be entirely independent of one
> another.  A user can set the same iommu context for multiple mdevs
> by placing them in the same container.  A user should be able to
> stop using an mdev in one place and start using it somewhere else.
> It should be a fungible $TYPE device.  It's an NVIDIA-only requirement
> that imposes this association of mdev devices into groups and I don't
> particularly see it as beneficial to the mdev architecture.  So why
> make it a standard part of the interface?
> 

Yes, I agree. This might not be each vendor's requirement.


> We could do keying at the layer you suggest, assuming we can find
> something that doesn't restrict the user, but we could make that
> optional.  

We can key on the 'container'. Devices should be in the same VFIO 'container';
the open() call should fail if they are found to be in different containers.

> For instance, say we did key on pid, there could be an
> attribute in the supported types hierarchy to indicate this type
> supports(requires) pid-sets.  Each mdev device with this attribute
> would create a pid-group file in sysfs where libvirt could associate
> the device.  Only for those mdev devices requiring it.
> 

We are OK with this suggestion if this works for libvirt integration.
We can have a file named 'requires_group' in each type's directory under the
supported types.
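
Just as an illustration (nothing here is final, and the attribute name and its
placement are only the proposal above), such a read-only attribute on the
parent/vendor side could be as small as:

    /* hypothetical: advertise that devices of this type must be grouped */
    static ssize_t requires_group_show(struct device *dev,
                                       struct device_attribute *attr, char *buf)
    {
            return sprintf(buf, "%d\n", 1);
    }
    static DEVICE_ATTR_RO(requires_group);

libvirt would then only need to check for the presence of that file.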

Thanks,
Kirti

> The alternative is that we need to find some mechanism for this
> association that doesn't impose arbitrary requirements, and potentially
> usage restrictions on vendors that don't have this need.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support
  2016-09-08 18:48                                       ` Kirti Wankhede
@ 2016-09-08 20:51                                         ` Alex Williamson
  0 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-08 20:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Paolo Bonzini, Michal Privoznik, Song, Jike, cjia, kvm,
	libvir-list, Tian, Kevin, qemu-devel, kraxel, Laine Stump,
	bjsdjshi

On Fri, 9 Sep 2016 00:18:10 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/8/2016 3:43 AM, Alex Williamson wrote:
> > On Wed, 7 Sep 2016 23:36:28 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/7/2016 10:14 PM, Alex Williamson wrote:  
> >>> On Wed, 7 Sep 2016 21:45:31 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 9/7/2016 2:58 AM, Alex Williamson wrote:    
> >>>>> On Wed, 7 Sep 2016 01:05:11 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>>> On 9/6/2016 11:10 PM, Alex Williamson wrote:      
> >>>>>>> On Sat, 3 Sep 2016 22:04:56 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>         
> >>>>>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:        
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:          
> 
> ...
> 
> > 
> > Philosophically, mdev devices should be entirely independent of one
> > another.  A user can set the same iommu context for multiple mdevs
> > by placing them in the same container.  A user should be able to
> > stop using an mdev in one place and start using it somewhere else.
> > It should be a fungible $TYPE device.  It's an NVIDIA-only requirement
> > that imposes this association of mdev devices into groups and I don't
> > particularly see it as beneficial to the mdev architecture.  So why
> > make it a standard part of the interface?
> >   
> 
> Yes, I agree. This might not be each vendor's requirement.
> 
> 
> > We could do keying at the layer you suggest, assuming we can find
> > something that doesn't restrict the user, but we could make that
> > optional.    
> 
> We can key on the 'container'. Devices should be in the same VFIO 'container';
> the open() call should fail if they are found to be in different containers.

If we're operating with a vIOMMU then each vfio-group needs to be in
its own address space and will therefore be in separate containers.
Even without that, it would be entirely valid for a user to put groups
in separate containers, QEMU just chooses to use the same container for
efficiency and to avoid accounting issues with multiple containers.
There's also no interface for the vfio bus driver to get at the
container currently.

> > For instance, say we did key on pid, there could be an
> > attribute in the supported types hierarchy to indicate this type
> > supports(requires) pid-sets.  Each mdev device with this attribute
> > would create a pid-group file in sysfs where libvirt could associate
> > the device.  Only for those mdev devices requiring it.
> >   
> 
> We are OK with this suggestion if this works for libvirt integration.
> We can have a file named 'requires_group' in each type's directory under the
> supported types.

Ok, I wish there was a better way, we'll see what libvirt folks think.
If we can't make it transparent for mdev vendors that don't require it,
at least we can define an API extension within mdev that libvirt can
use to discover the requirement and support it.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-08  9:38       ` [Qemu-devel] " Neo Jia
@ 2016-09-09  6:26         ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-09  6:26 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, bjsdjshi

On 09/08/2016 05:38 PM, Neo Jia wrote:
> On Thu, Sep 08, 2016 at 04:09:39PM +0800, Jike Song wrote:
>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
>>> +
>>> +/**
>>> + * struct parent_ops - Structure to be registered for each parent device to
>>> + * register the device to mdev module.
>>> + *
>>> + * @owner:		The module owner.
>>> + * @dev_attr_groups:	Default attributes of the parent device.
>>> + * @mdev_attr_groups:	Default attributes of the mediated device.
>>> + * @supported_config:	Called to get information about supported types.
>>> + *			@dev : device structure of parent device.
>>> + *			@config: should return string listing supported config
>>> + *			Returns integer: success (0) or error (< 0)
>>> + * @create:		Called to allocate basic resources in parent device's
>>> + *			driver for a particular mediated device. It is
>>> + *			mandatory to provide create ops.
>>> + *			@mdev: mdev_device structure on of mediated device
>>> + *			      that is being created
>>> + *			@mdev_params: extra parameters required by parent
>>> + *			device's driver.
>>> + *			Returns integer: success (0) or error (< 0)
>>> + * @destroy:		Called to free resources in parent device's driver for a
>>> + *			a mediated device. It is mandatory to provide destroy
>>> + *			ops.
>>> + *			@mdev: mdev_device device structure which is being
>>> + *			       destroyed
>>> + *			Returns integer: success (0) or error (< 0)
>>> + *			If VMM is running and destroy() is called that means the
>>> + *			mdev is being hotunpluged. Return error if VMM is
>>> + *			running and driver doesn't support mediated device
>>> + *			hotplug.
>>> + * @reset:		Called to reset mediated device.
>>> + *			@mdev: mdev_device device structure.
>>> + *			Returns integer: success (0) or error (< 0)
>>> + * @set_online_status:	Called to change to status of mediated device.
>>> + *			@mdev: mediated device.
>>> + *			@online: set true or false to make mdev device online or
>>> + *			offline.
>>> + *			Returns integer: success (0) or error (< 0)
>>> + * @get_online_status:	Called to get online/offline status of  mediated device
>>> + *			@mdev: mediated device.
>>> + *			@online: Returns status of mediated device.
>>> + *			Returns integer: success (0) or error (< 0)
>>> + * @read:		Read emulation callback
>>> + *			@mdev: mediated device structure
>>> + *			@buf: read buffer
>>> + *			@count: number of bytes to read
>>> + *			@pos: address.
>>> + *			Retuns number on bytes read on success or error.
>>> + * @write:		Write emulation callback
>>> + *			@mdev: mediated device structure
>>> + *			@buf: write buffer
>>> + *			@count: number of bytes to be written
>>> + *			@pos: address.
>>> + *			Retuns number on bytes written on success or error.
>>> + * @get_irq_info:	Called to retrieve information about mediated device IRQ
>>> + *			@mdev: mediated device structure
>>> + *			@irq_info: VFIO IRQ flags and count.
>>> + *			Returns integer: success (0) or error (< 0)
>>> + * @set_irqs:		Called to send about interrupts configuration
>>> + *			information that VMM sets.
>>> + *			@mdev: mediated device structure
>>> + *			@flags, index, start, count and *data : same as that of
>>> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
>>> + * @get_device_info:	Called to get VFIO device information for a mediated
>>> + *			device.
>>> + *			@vfio_device_info: VFIO device info.
>>> + *			Returns integer: success (0) or error (< 0)
>>> + * @get_region_info:	Called to get VFIO region size and flags of mediated
>>> + *			device.
>>> + *			@mdev: mediated device structure
>>> + *			@region_info: output, returns size and flags of
>>> + *				      requested region.
>>> + *			@cap_type_id: returns id of capability.
>>> + *			@cap_type: returns pointer to capability structure
>>> + *			corresponding to capability id.
>>> + *			Returns integer: success (0) or error (< 0)
>>> + *
>>> + * Parent device that support mediated device should be registered with mdev
>>> + * module with parent_ops structure.
>>> + */
>>> +
>>> +struct parent_ops {
>>> +	struct module   *owner;
>>> +	const struct attribute_group **dev_attr_groups;
>>> +	const struct attribute_group **mdev_attr_groups;
>>> +
>>> +	int	(*supported_config)(struct device *dev, char *config);
>>> +	int     (*create)(struct mdev_device *mdev, char *mdev_params);
>>> +	int     (*destroy)(struct mdev_device *mdev);
>>> +	int     (*reset)(struct mdev_device *mdev);
>>> +	int     (*set_online_status)(struct mdev_device *mdev, bool online);
>>> +	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
>>> +	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
>>> +			loff_t pos);
>>> +	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
>>> +			 loff_t pos);
>>> +	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
>>> +	int	(*get_irq_info)(struct mdev_device *mdev,
>>> +				struct vfio_irq_info *irq_info);
>>> +	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
>>> +			    unsigned int index, unsigned int start,
>>> +			    unsigned int count, void *data);
>>> +	int	(*get_device_info)(struct mdev_device *mdev,
>>> +				   struct vfio_device_info *dev_info);
>>> +	int	(*get_region_info)(struct mdev_device *mdev,
>>> +				   struct vfio_region_info *region_info,
>>> +				   u16 *cap_type_id, void **cap_type);
>>> +};
>>
>> I have a strong objection here to such low-level interfaces, the interfaces
>> between vfio-mdev and vendor drivers should be as thin as possible, not imposing
>> any limitation to vendor drivers.
> 
> Hi Jike,
> 
> Welcome! :-)

Aha, thanks! :)

>
> Unfortunately, this is something I definitely can't agree with you on.
>

Glad to see your opinion!


> We would like to capture as much of the common code as possible without losing
> flexibility, so vendor driver writers won't have to duplicate it and we
> have something that can be maintained publicly.
> 

Yeah, it is good to reduce the duplication among different vendor drivers,
but what do you think about the duplication between here and other bus drivers
like vfio-pci?

> If you are running into a specific limitation with the above callback interfaces,
> please show us the scenario and we will be very happy to look into it.
>

Though we haven't actually tested it against this series (we use high-level
implementations instead), I personally don't think there is a problem. However,
that doesn't necessarily mean it's sufficient.

>>
>> I saw that validate_map_request was removed from the ops and mmap was added. 
>> That is pretty nice. Furthermore, if you add an ioctl here, you can also remove
>> get_device_info, get_irq_info, set_irqs, and get_region_info (and even "reset").
>> There are several benefits by doing this:
> 
> The decision to move validate_map_request is mainly because we are adding a lot of
> advanced logic which most vendor drivers don't require; since we are the only consumer
> of such logic, there is no need to put it in the public/shared module.
>
>>
>> 	-	Balanced interfaces.
>> 		Like I replied in another mail, you won't have unbalanced interfaces.
>> 		You already have read, write and mmap in the ops, why not ioctl?
> 
> Sorry, I don't think a "balanced" interface is a design criterion, especially when
> pursuing a "balanced or full-set" interface for its own sake ends up with lots of
> duplicated code for vendor driver writers.
> 

Please kindly have a look at my comment on patch 2/4, about how to check the
validity of "count".

>>
>> 	-	Scalability.
>> 		You are intercepting vfio optional capabilities in the framework, but
>> 		how if vfio.ko, or vfio-pci.ko add a few new capabilities in the future?
> 
> Exactly my point about the code sharing.
> 
> If new cap is added inside vfio.ko or vfio-pci.ko, we can just add it into
> vfio_mdev.ko.
> 
> Adding the code in one place is better than duplicating it in multiple vendor drivers.

So after adding that, how many places will you have?

>>
>> 	-	Abstraction.
>> 		Even placing common codes here can avoid code duplication, you still
>> 		have code duplicate with vfio-pci.  Better to move common logic out of
>> 		vfio-pci and call them from mdev vendor drivers.
> 
> Are you saying we should avoid the code duplication between vfio-pci and vfio-mdev?
> 

Exactly. I haven't checked other bus drivers like vfio-platform, but even
considering only vfio-pci, there will be duplication.

>>
>> 	-	Maintainability.
>> 		This is pretty obvious :)
> 
> Definitely not, the burden is moving to the vendor driver side.
>

Moving it to the vendor side is not the goal; as said above, this will probably
require more abstraction and refactoring of the existing vfio code.

> Again, Jike, I really want to enable you with the mediated framework we have been
> doing here. So it is probably easier for us to accommodate your need if you could
> follow the interfaces we have introduced and let us know if you have any specific
> issues.

I won't read this as saying that one is not welcome to comment unless he has
hit an actual issue :)

--
Thanks,
Jike

>>> +
>>> +/*
>>> + * Parent Device
>>> + */
>>> +
>>> +struct parent_device {
>>> +	struct device		*dev;
>>> +	const struct parent_ops	*ops;
>>> +
>>> +	/* internal */
>>> +	struct kref		ref;
>>> +	struct list_head	next;
>>> +	struct list_head	mdev_list;
>>> +	struct mutex		mdev_list_lock;
>>> +	wait_queue_head_t	release_done;
>>> +};
>>> +
>>> +/**
>>> + * struct mdev_driver - Mediated device driver
>>> + * @name: driver name
>>> + * @probe: called when new device created
>>> + * @remove: called when device removed
>>> + * @driver: device driver structure
>>> + *
>>> + **/
>>> +struct mdev_driver {
>>> +	const char *name;
>>> +	int  (*probe)(struct device *dev);
>>> +	void (*remove)(struct device *dev);
>>> +	struct device_driver driver;
>>> +};
>>> +
>>> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
>>> +{
>>> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
>>> +}
>>> +
>>> +static inline struct mdev_device *to_mdev_device(struct device *dev)
>>> +{
>>> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
>>> +}
>>
>> These can be macros, like pci ones.
>>
>>> +
>>> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
>>> +{
>>> +	return mdev->driver_data;
>>> +}
>>> +
>>> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
>>> +{
>>> +	mdev->driver_data = data;
>>> +}
>>> +
>>> +extern struct bus_type mdev_bus_type;
>>> +
>>> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
>>> +
>>> +extern int  mdev_register_device(struct device *dev,
>>> +				 const struct parent_ops *ops);
>>> +extern void mdev_unregister_device(struct device *dev);
>>> +
>>> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
>>> +extern void mdev_unregister_driver(struct mdev_driver *drv);
>>> +
>>> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
>>> +extern void mdev_put_device(struct mdev_device *mdev);
>>> +
>>> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
>>> +
>>> +#endif /* MDEV_H */
>>>
>>
>> --
>> Thanks,
>> Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-08  8:09     ` [Qemu-devel] " Jike Song
@ 2016-09-09 17:48       ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-09 17:48 UTC (permalink / raw)
  To: Jike Song
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi



On 9/8/2016 1:39 PM, Jike Song wrote:
> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:

>>  +---------------+
>>  |               |
>>  | +-----------+ |  mdev_register_driver() +--------------+
>>  | |           | +<------------------------+ __init()     |
>>  | |  mdev     | |                         |              |
>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>  | |           | |                         |              |
>>  | +-----------+ |                         +--------------+
>>  |               |
> 
> This aimed to have only one single vfio bus driver for all mediated devices,
> right?
>

Yes. That's correct.


>> +
>> +static int mdev_add_attribute_group(struct device *dev,
>> +				    const struct attribute_group **groups)
>> +{
>> +	return sysfs_create_groups(&dev->kobj, groups);
>> +}
>> +
>> +static void mdev_remove_attribute_group(struct device *dev,
>> +					const struct attribute_group **groups)
>> +{
>> +	sysfs_remove_groups(&dev->kobj, groups);
>> +}
> 
> These functions are not necessary. You can always specify the attribute groups
> to dev->groups before registering a new device.
> 

At the time of mdev device create, I specifically didn't use
dev->groups because we call back into the vendor driver before that (see the
code snippet below), and those attributes should only be added if the create()
callback returns success.

        ret = parent->ops->create(mdev, mdev_params);
        if (ret)
                return ret;

        ret = mdev_add_attribute_group(&mdev->dev,
                                        parent->ops->mdev_attr_groups);
        if (ret)
                parent->ops->destroy(mdev);



>> +
>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>> +{
>> +	struct parent_device *parent;
>> +
>> +	mutex_lock(&parent_list_lock);
>> +	parent = mdev_get_parent(__find_parent_device(dev));
>> +	mutex_unlock(&parent_list_lock);
>> +
>> +	return parent;
>> +}
> 
> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
> as long as you have an independent device associated with the mdev host device
> ("parent" device here).
>

I don't think every lock will go away with that. This also changes how
mdev device entries are created in sysfs. It adds an extra directory.


> PS, "parent" is somehow a name too generic?
>

This is the term used in the Linux kernel for such cases. See 'struct
device' in include/linux/device.h. I would prefer 'parent'.
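
For reference, an abridged excerpt of that structure (only the relevant member
shown):

    struct device {
            struct device           *parent;
            ...
    };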

>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
>> +{
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret;
>> +
>> +	ret = parent->ops->create(mdev, mdev_params);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = mdev_add_attribute_group(&mdev->dev,
>> +					parent->ops->mdev_attr_groups);
> 
> Ditto: dev->groups.
> 

See my above response for why this is intended to be so.


>> +	ret = parent_create_sysfs_files(dev);
>> +	if (ret)
>> +		goto add_sysfs_error;
>> +
>> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> 
> parent_create_sysfs_files and mdev_add_attribute_group are kind of doing
> the same thing, do you mind to merge them into one?
> 

Ok. I'll see if I can do that.


>> +int mdev_device_get_online_status(struct device *dev, bool *online)
>> +{
>> +	int ret = 0;
>> +	struct mdev_device *mdev;
>> +	struct parent_device *parent;
>> +
>> +	mdev = mdev_get_device(to_mdev_device(dev));
>> +	if (!mdev)
>> +		return -EINVAL;
>> +
>> +	parent = mdev->parent;
>> +
>> +	if (parent->ops->get_online_status)
>> +		ret = parent->ops->get_online_status(mdev, online);
>> +
>> +	mdev_put_device(mdev);
>> +
>> +	return ret;
>> +}
> 
> The driver core has a perfect 'online' file for a device, with both
> 'show' and 'store' support, you don't need to write another one.
> 
> Please have a look at online_show and online_store in drivers/base/core.c.
> 

This is going to be removed as per the latest discussion.


> +
>> +extern struct class_attribute mdev_class_attrs[];
> 
> This is useless?
>

Oh, I forgot to remove it. Thanks for pointing that out.


>> +static ssize_t mdev_create_store(struct device *dev,
>> +				 struct device_attribute *attr,
>> +				 const char *buf, size_t count)
>> +{
>> +	char *str, *pstr;
>> +	char *uuid_str, *mdev_params = NULL, *params = NULL;
>> +	uuid_le uuid;
>> +	int ret;
>> +
>> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> 
> pstr is not used.
>

It is used below to free the duplicated memory. As you can see in the code
below, the 'str' pointer gets advanced by strsep(), so it can't be used to
free the memory. pstr points to the start of the allocation, which is what we
want to free when returning from this function.
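
A minimal illustration of that behaviour, using the same variables as the code
below:

    pstr = str = kstrndup(buf, count, GFP_KERNEL); /* str == pstr here */
    uuid_str = strsep(&str, ":");  /* str now points past the ':' (or is NULL) */
    /* kfree(str) here would be wrong: str no longer points at the allocation */
    kfree(pstr);                   /* frees the original kstrndup() buffer */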

>> +
>> +	if (!str)
>> +		return -ENOMEM;
>> +
>> +	uuid_str = strsep(&str, ":");
>> +	if (!uuid_str) {
>> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
>> +		ret = -EINVAL;
>> +		goto create_error;
>> +	}
>> +
>> +	if (str)
>> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
>> +
>> +	ret = uuid_le_to_bin(uuid_str, &uuid);
>> +	if (ret) {
>> +		pr_err("mdev_create: UUID parse error %s\n", buf);
>> +		goto create_error;
>> +	}
>> +
>> +	ret = mdev_device_create(dev, uuid, mdev_params);
>> +	if (ret)
>> +		pr_err("mdev_create: Failed to create mdev device\n");
>> +	else
>> +		ret = count;
>> +
>> +create_error:
>> +	kfree(params);
>> +	kfree(pstr);
>> +	return ret;
>> +}
>> +
>> +static ssize_t mdev_destroy_store(struct device *dev,
>> +				  struct device_attribute *attr,
>> +				  const char *buf, size_t count)
>> +{
>> +	char *uuid_str, *str, *pstr;
>> +	uuid_le uuid;
>> +	int ret;
>> +
>> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> 
> Ditto.
> 

Same as above.

>> +
>> +	if (!str)
>> +		return -ENOMEM;
>> +
>> +	uuid_str = strsep(&str, ":");
>> +	if (!uuid_str) {
>> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
>> +		ret = -EINVAL;
>> +		goto destroy_error;
>> +	}
>> +
>> +	ret = uuid_le_to_bin(uuid_str, &uuid);
>> +	if (ret) {
>> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
>> +		goto destroy_error;
>> +	}
>> +
>> +	ret = mdev_device_destroy(dev, uuid);
>> +	if (ret == 0)
>> +		ret = count;
>> +
>> +destroy_error:
>> +	kfree(pstr);
>> +	return ret;
>> +}
>> +
>> +
>> +int parent_create_sysfs_files(struct device *dev)
>> +{
>> +	int ret;
>> +
>> +	ret = sysfs_create_file(&dev->kobj,
>> +				&dev_attr_mdev_supported_types.attr);
>> +	if (ret) {
>> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
>> +		return ret;
>> +	}
>> +
>> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
>> +	if (ret) {
>> +		pr_err("Failed to create mdev_create sysfs entry\n");
>> +		goto create_sysfs_failed;
>> +	}
>> +
>> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
>> +	if (ret) {
>> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
>> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
>> +	} else
>> +		return ret;
>> +
>> +create_sysfs_failed:
>> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
>> +	return ret;
>> +}
>> +
>> +void parent_remove_sysfs_files(struct device *dev)
>> +{
>> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
>> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
>> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
>> +}
> 
> The 2 functions above are also unnecessary: you can always group it with a single
> function call of sysfs_create_files.
>

Ok. 'supported types' and 'create' are going to change as per the ongoing
discussion about libvirt integration. These functions will be removed with
that change.


>> +
>> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
>> +{
>> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
>> +}
>> +
>> +static inline struct mdev_device *to_mdev_device(struct device *dev)
>> +{
>> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
>> +}
> 
> These can be macros, like pci ones.
>

These also check the argument for NULL, which a macro wouldn't.
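
For comparison, PCI does it with a plain macro (include/linux/pci.h):

    #define to_pci_dev(n) container_of(n, struct pci_dev, dev)

A hypothetical macro variant for mdev would offset the pointer unconditionally,
so a NULL dev would yield a bogus non-NULL result instead of NULL:

    #define to_mdev_device(d) container_of(d, struct mdev_device, dev)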

Kirti.

>> +
>> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
>> +{
>> +	return mdev->driver_data;
>> +}
>> +
>> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
>> +{
>> +	mdev->driver_data = data;
>> +}
>> +
>> +extern struct bus_type mdev_bus_type;
>> +
>> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
>> +
>> +extern int  mdev_register_device(struct device *dev,
>> +				 const struct parent_ops *ops);
>> +extern void mdev_unregister_device(struct device *dev);
>> +
>> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
>> +extern void mdev_unregister_driver(struct mdev_driver *drv);
>> +
>> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
>> +extern void mdev_put_device(struct mdev_device *mdev);
>> +
>> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
>> +
>> +#endif /* MDEV_H */
>>
> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-09 17:48       ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-09 18:42         ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-09 18:42 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Jike Song, cjia, kvm, qemu-devel, kevin.tian, kraxel, pbonzini, bjsdjshi

On Fri, 9 Sep 2016 23:18:45 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/8/2016 1:39 PM, Jike Song wrote:
> > On 08/25/2016 11:53 AM, Kirti Wankhede wrote:  
> 
> >>  +---------------+
> >>  |               |
> >>  | +-----------+ |  mdev_register_driver() +--------------+
> >>  | |           | +<------------------------+ __init()     |
> >>  | |  mdev     | |                         |              |
> >>  | |  bus      | +------------------------>+              |<-> VFIO user
> >>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
> >>  | |           | |                         |              |
> >>  | +-----------+ |                         +--------------+
> >>  |               |  
> > 
> > This aimed to have only one single vfio bus driver for all mediated devices,
> > right?
> >  
> 
> Yes. That's correct.
> 
> 
> >> +
> >> +static int mdev_add_attribute_group(struct device *dev,
> >> +				    const struct attribute_group **groups)
> >> +{
> >> +	return sysfs_create_groups(&dev->kobj, groups);
> >> +}
> >> +
> >> +static void mdev_remove_attribute_group(struct device *dev,
> >> +					const struct attribute_group **groups)
> >> +{
> >> +	sysfs_remove_groups(&dev->kobj, groups);
> >> +}  
> > 
> > These functions are not necessary. You can always specify the attribute groups
> > to dev->groups before registering a new device.
> >   
> 
> At the time of mdev device create, I specifically didn't use
> dev->groups because we call back into the vendor driver before that (see the
> code snippet below), and those attributes should only be added if the create()
> callback returns success.
> 
>         ret = parent->ops->create(mdev, mdev_params);
>         if (ret)
>                 return ret;
> 
>         ret = mdev_add_attribute_group(&mdev->dev,
>                                         parent->ops->mdev_attr_groups);
>         if (ret)
>                 parent->ops->destroy(mdev);
> 
> 
> 
> >> +
> >> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
> >> +{
> >> +	struct parent_device *parent;
> >> +
> >> +	mutex_lock(&parent_list_lock);
> >> +	parent = mdev_get_parent(__find_parent_device(dev));
> >> +	mutex_unlock(&parent_list_lock);
> >> +
> >> +	return parent;
> >> +}  
> > 
> > As we have demonstrated, all these refs and locks and release workqueue are not necessary,
> > as long as you have an independent device associated with the mdev host device
> > ("parent" device here).
> >  
> 
> I don't think every lock will go away with that. This also changes how
> mdev device entries are created in sysfs. It adds an extra directory.

Exposing the parent-child relationship through sysfs is a desirable
feature, so I'm not sure how this is a negative.  This part of Jike's
conversion was a big improvement, I thought.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-09 18:42         ` [Qemu-devel] " Alex Williamson
@ 2016-09-09 19:55           ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-09 19:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jike Song, pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, bjsdjshi



On 9/10/2016 12:12 AM, Alex Williamson wrote:
> On Fri, 9 Sep 2016 23:18:45 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/8/2016 1:39 PM, Jike Song wrote:
>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:  
>>
>>>>  +---------------+
>>>>  |               |
>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>  | |           | +<------------------------+ __init()     |
>>>>  | |  mdev     | |                         |              |
>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>  | |           | |                         |              |
>>>>  | +-----------+ |                         +--------------+
>>>>  |               |  
>>>
>>> This aimed to have only one single vfio bus driver for all mediated devices,
>>> right?
>>>  
>>
>> Yes. That's correct.
>>
>>
>>>> +
>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>> +				    const struct attribute_group **groups)
>>>> +{
>>>> +	return sysfs_create_groups(&dev->kobj, groups);
>>>> +}
>>>> +
>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>> +					const struct attribute_group **groups)
>>>> +{
>>>> +	sysfs_remove_groups(&dev->kobj, groups);
>>>> +}  
>>>
>>> These functions are not necessary. You can always specify the attribute groups
>>> to dev->groups before registering a new device.
>>>   
>>
>> At the time of mdev device create, I specifically didn't used
>> dev->groups because we callback in vendor driver before that, see below
>> code snippet, and those attributes should only be added if create()
>> callback returns success.
>>
>>         ret = parent->ops->create(mdev, mdev_params);
>>         if (ret)
>>                 return ret;
>>
>>         ret = mdev_add_attribute_group(&mdev->dev,
>>                                         parent->ops->mdev_attr_groups);
>>         if (ret)
>>                 parent->ops->destroy(mdev);
>>
>>
>>
>>>> +
>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>> +{
>>>> +	struct parent_device *parent;
>>>> +
>>>> +	mutex_lock(&parent_list_lock);
>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
>>>> +	mutex_unlock(&parent_list_lock);
>>>> +
>>>> +	return parent;
>>>> +}  
>>>
>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
>>> as long as you have an independent device associated with the mdev host device
>>> ("parent" device here).
>>>  
>>
>> I don't think every lock will go away with that. This also changes how
>> mdev devices entries are created in sysfs. It adds an extra directory.
> 
> Exposing the parent-child relationship through sysfs is a desirable
> feature, so I'm not sure how this is a negative.  This part of Jike's
> conversion was a big improvement, I thought.  Thanks,
> 

Jike's suggestion is to introduce a fake device over the parent device,
i.e. mdev-host, and then all mdev devices are children of 'mdev-host',
not children of the real parent.

For example, the directory structure we have now is:
/sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>

mdev devices are in the real parent's directory.

By introducing a fake device it would become:
/sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>

mdev devices are in the fake device's directory.
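
As a purely illustrative sketch (not code from either proposal; it only
assumes that mdev_device embeds a struct device, as in this series),
the difference between the two layouts comes down to which struct
device is passed as the parent when the mdev device is registered:

#include <linux/device.h>

/*
 * Sketch only: the sysfs location of an mdev device is decided purely
 * by what is set as dev.parent before device_register().  Name, bus
 * and release handling are assumed to be set up as in this series.
 */
static int sketch_register_mdev(struct mdev_device *mdev,
				struct device *owner)
{
	/*
	 * owner == &pdev->dev:
	 *     /sys/bus/pci/devices/0000:85:00.0/<mdev_device>
	 * owner == the mdev-host device:
	 *     /sys/bus/pci/devices/0000:85:00.0/mdev-host/<mdev_device>
	 */
	mdev->dev.parent = owner;
	return device_register(&mdev->dev);
}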

A lock would still be required to handle race conditions such as
'mdev_create' being in progress while the parent device is unregistered
by the vendor driver, or while the parent device is unbound from the
vendor driver.

With the new changes/discussion, we believe the locking will be
simplified without having a fake parent device.

The fake-device suggestion also removes the pointer to the parent
device from the mdev_device structure. When the create(struct
mdev_device *mdev) callback reaches the vendor driver, how would the
vendor driver know which physical device this mdev create call is
intended for? After all, 'parent' would then be the newly introduced
fake device, not the real parent.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-09 19:55           ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-12  5:10             ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-12  5:10 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alex Williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi

On 09/10/2016 03:55 AM, Kirti Wankhede wrote:
> On 9/10/2016 12:12 AM, Alex Williamson wrote:
>> On Fri, 9 Sep 2016 23:18:45 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>> On 9/8/2016 1:39 PM, Jike Song wrote:
>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:  
>>>
>>>>>  +---------------+
>>>>>  |               |
>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>>  | |           | +<------------------------+ __init()     |
>>>>>  | |  mdev     | |                         |              |
>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>>  | |           | |                         |              |
>>>>>  | +-----------+ |                         +--------------+
>>>>>  |               |  
>>>>
>>>> This aimed to have only one single vfio bus driver for all mediated devices,
>>>> right?
>>>>  
>>>
>>> Yes. That's correct.
>>>
>>>
>>>>> +
>>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>>> +				    const struct attribute_group **groups)
>>>>> +{
>>>>> +	return sysfs_create_groups(&dev->kobj, groups);
>>>>> +}
>>>>> +
>>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>>> +					const struct attribute_group **groups)
>>>>> +{
>>>>> +	sysfs_remove_groups(&dev->kobj, groups);
>>>>> +}  
>>>>
>>>> These functions are not necessary. You can always specify the attribute groups
>>>> to dev->groups before registering a new device.
>>>>   
>>>
>>> At the time of mdev device create, I specifically didn't used
>>> dev->groups because we callback in vendor driver before that, see below
>>> code snippet, and those attributes should only be added if create()
>>> callback returns success.
>>>
>>>         ret = parent->ops->create(mdev, mdev_params);
>>>         if (ret)
>>>                 return ret;
>>>
>>>         ret = mdev_add_attribute_group(&mdev->dev,
>>>                                         parent->ops->mdev_attr_groups);
>>>         if (ret)
>>>                 parent->ops->destroy(mdev);
>>>
>>>
>>>
>>>>> +
>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>>> +{
>>>>> +	struct parent_device *parent;
>>>>> +
>>>>> +	mutex_lock(&parent_list_lock);
>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
>>>>> +	mutex_unlock(&parent_list_lock);
>>>>> +
>>>>> +	return parent;
>>>>> +}  
>>>>
>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
>>>> as long as you have an independent device associated with the mdev host device
>>>> ("parent" device here).
>>>>  
>>>
>>> I don't think every lock will go away with that. This also changes how
>>> mdev devices entries are created in sysfs. It adds an extra directory.
>>
>> Exposing the parent-child relationship through sysfs is a desirable
>> feature, so I'm not sure how this is a negative.  This part of Jike's
>> conversion was a big improvement, I thought.  Thanks,
>>
> 
> Jike's suggestion is to introduced a fake device over parent device i.e.
> mdev-host, and then all mdev devices are children of 'mdev-host' not
> children of real parent.
>

It really depends on how you define 'real parent' :)

With a physical-host-mdev hierarchy, the parent of the mdev devices is
the host device, and the parent of the host device is the physical
device, e.g.:

        pdev            mdev_host       mdev_device
        dev<------------dev<------------dev
              parent          parent

        Figure 1: device hierarchy

> For example, directory structure we have now is:
> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
> 
> mdev devices are in real parents directory.
> 
> By introducing fake device it would be:
> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
> 
> mdev devices are in fake device's directory.
>

Yes, this is the wanted directory.

> Lock would be still required, to handle the race conditions like
> 'mdev_create' is still in process and parent device is unregistered by
> vendor driver/ parent device is unbind from vendor driver.
>

Locks are provided to protect resources; would you elaborate on what
exact resource you want to protect with a lock in mdev_create?

> With the new changes/discussion, we believe the locking will be
> simplified without having fake parent device.
>
> With fake device suggestion, removed pointer to parent device from
> mdev_device structure. When a create(struct mdev_device *mdev) callback
> comes to vendor driver, how would vendor driver know for which physical
> device this mdev device create call is intended to? because then
> 'parent' would be newly introduced fake device, not the real parent.

Please have a look at "Figure 1".
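
To make Figure 1 concrete, here is a minimal sketch (illustrative only,
not a posted patch; it only assumes that mdev_device embeds a struct
device whose parent is the mdev host device) of how a vendor driver's
create() callback could still reach the physical device:

#include <linux/device.h>

/*
 * Walk the device-core parent links of Figure 1:
 * mdev_device -> mdev_host -> physical device (pdev).
 */
static struct device *sketch_mdev_to_physical_dev(struct mdev_device *mdev)
{
	struct device *host = mdev->dev.parent;	/* the mdev host node */

	return host ? host->parent : NULL;	/* the real parent device */
}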

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-12  5:10             ` [Qemu-devel] " Jike Song
@ 2016-09-12  7:49               ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-12  7:49 UTC (permalink / raw)
  To: Jike Song
  Cc: Alex Williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi



On 9/12/2016 10:40 AM, Jike Song wrote:
> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:
>> On 9/10/2016 12:12 AM, Alex Williamson wrote:
>>> On Fri, 9 Sep 2016 23:18:45 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>
>>>> On 9/8/2016 1:39 PM, Jike Song wrote:
>>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:  
>>>>
>>>>>>  +---------------+
>>>>>>  |               |
>>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>>>  | |           | +<------------------------+ __init()     |
>>>>>>  | |  mdev     | |                         |              |
>>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>>>  | |           | |                         |              |
>>>>>>  | +-----------+ |                         +--------------+
>>>>>>  |               |  
>>>>>
>>>>> This aimed to have only one single vfio bus driver for all mediated devices,
>>>>> right?
>>>>>  
>>>>
>>>> Yes. That's correct.
>>>>
>>>>
>>>>>> +
>>>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>>>> +				    const struct attribute_group **groups)
>>>>>> +{
>>>>>> +	return sysfs_create_groups(&dev->kobj, groups);
>>>>>> +}
>>>>>> +
>>>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>>>> +					const struct attribute_group **groups)
>>>>>> +{
>>>>>> +	sysfs_remove_groups(&dev->kobj, groups);
>>>>>> +}  
>>>>>
>>>>> These functions are not necessary. You can always specify the attribute groups
>>>>> to dev->groups before registering a new device.
>>>>>   
>>>>
>>>> At the time of mdev device create, I specifically didn't used
>>>> dev->groups because we callback in vendor driver before that, see below
>>>> code snippet, and those attributes should only be added if create()
>>>> callback returns success.
>>>>
>>>>         ret = parent->ops->create(mdev, mdev_params);
>>>>         if (ret)
>>>>                 return ret;
>>>>
>>>>         ret = mdev_add_attribute_group(&mdev->dev,
>>>>                                         parent->ops->mdev_attr_groups);
>>>>         if (ret)
>>>>                 parent->ops->destroy(mdev);
>>>>
>>>>
>>>>
>>>>>> +
>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>>>> +{
>>>>>> +	struct parent_device *parent;
>>>>>> +
>>>>>> +	mutex_lock(&parent_list_lock);
>>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
>>>>>> +	mutex_unlock(&parent_list_lock);
>>>>>> +
>>>>>> +	return parent;
>>>>>> +}  
>>>>>
>>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
>>>>> as long as you have an independent device associated with the mdev host device
>>>>> ("parent" device here).
>>>>>  
>>>>
>>>> I don't think every lock will go away with that. This also changes how
>>>> mdev devices entries are created in sysfs. It adds an extra directory.
>>>
>>> Exposing the parent-child relationship through sysfs is a desirable
>>> feature, so I'm not sure how this is a negative.  This part of Jike's
>>> conversion was a big improvement, I thought.  Thanks,
>>>
>>
>> Jike's suggestion is to introduced a fake device over parent device i.e.
>> mdev-host, and then all mdev devices are children of 'mdev-host' not
>> children of real parent.
>>
> 
> It really depends on how you define 'real parent' :)
> 
> With a physical-host-mdev hierarchy, the parent of mdev devices is the host
> device, the parent of host device is the physical device. e.g.
> 
>         pdev            mdev_host       mdev_device
>         dev<------------dev<------------dev
>               parent          parent
> 
>         Figure 1: device hierarchy
> 

Right, the mdev-host device represents neither the physical device nor
any mdev device. Then what is the need for such a device?

>> For example, directory structure we have now is:
>> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
>>
>> mdev devices are in real parents directory.
>>
>> By introducing fake device it would be:
>> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
>>
>> mdev devices are in fake device's directory.
>>
> 
> Yes, this is the wanted directory.
> 

I don't think so.


>> Lock would be still required, to handle the race conditions like
>> 'mdev_create' is still in process and parent device is unregistered by
>> vendor driver/ parent device is unbind from vendor driver.
>>
> 
> locks are provided to protect resources, would you elaborate more on
> what is the exact resource you want to protect by a lock in mdev_create?
> 

Simple: in your suggestion, the mdev-host device. The fake device will
go away if the vendor driver unregisters the device from the mdev
module, right?

Thanks,
Kirti.

>> With the new changes/discussion, we believe the locking will be
>> simplified without having fake parent device.
>>
>> With fake device suggestion, removed pointer to parent device from
>> mdev_device structure. When a create(struct mdev_device *mdev) callback
>> comes to vendor driver, how would vendor driver know for which physical
>> device this mdev device create call is intended to? because then
>> 'parent' would be newly introduced fake device, not the real parent.
> 
> Please have a look at "Figure 1".
> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-12  7:49               ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-12 15:53                 ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-12 15:53 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Jike Song, pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, bjsdjshi

On Mon, 12 Sep 2016 13:19:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/12/2016 10:40 AM, Jike Song wrote:
> > On 09/10/2016 03:55 AM, Kirti Wankhede wrote:  
> >> On 9/10/2016 12:12 AM, Alex Williamson wrote:  
> >>> On Fri, 9 Sep 2016 23:18:45 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>  
> >>>> On 9/8/2016 1:39 PM, Jike Song wrote:  
> >>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:    
> >>>>  
> >>>>>>  +---------------+
> >>>>>>  |               |
> >>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
> >>>>>>  | |           | +<------------------------+ __init()     |
> >>>>>>  | |  mdev     | |                         |              |
> >>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
> >>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
> >>>>>>  | |           | |                         |              |
> >>>>>>  | +-----------+ |                         +--------------+
> >>>>>>  |               |    
> >>>>>
> >>>>> This aimed to have only one single vfio bus driver for all mediated devices,
> >>>>> right?
> >>>>>    
> >>>>
> >>>> Yes. That's correct.
> >>>>
> >>>>  
> >>>>>> +
> >>>>>> +static int mdev_add_attribute_group(struct device *dev,
> >>>>>> +				    const struct attribute_group **groups)
> >>>>>> +{
> >>>>>> +	return sysfs_create_groups(&dev->kobj, groups);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void mdev_remove_attribute_group(struct device *dev,
> >>>>>> +					const struct attribute_group **groups)
> >>>>>> +{
> >>>>>> +	sysfs_remove_groups(&dev->kobj, groups);
> >>>>>> +}    
> >>>>>
> >>>>> These functions are not necessary. You can always specify the attribute groups
> >>>>> to dev->groups before registering a new device.
> >>>>>     
> >>>>
> >>>> At the time of mdev device create, I specifically didn't used
> >>>> dev->groups because we callback in vendor driver before that, see below
> >>>> code snippet, and those attributes should only be added if create()
> >>>> callback returns success.
> >>>>
> >>>>         ret = parent->ops->create(mdev, mdev_params);
> >>>>         if (ret)
> >>>>                 return ret;
> >>>>
> >>>>         ret = mdev_add_attribute_group(&mdev->dev,
> >>>>                                         parent->ops->mdev_attr_groups);
> >>>>         if (ret)
> >>>>                 parent->ops->destroy(mdev);
> >>>>
> >>>>
> >>>>  
> >>>>>> +
> >>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
> >>>>>> +{
> >>>>>> +	struct parent_device *parent;
> >>>>>> +
> >>>>>> +	mutex_lock(&parent_list_lock);
> >>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
> >>>>>> +	mutex_unlock(&parent_list_lock);
> >>>>>> +
> >>>>>> +	return parent;
> >>>>>> +}    
> >>>>>
> >>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
> >>>>> as long as you have an independent device associated with the mdev host device
> >>>>> ("parent" device here).
> >>>>>    
> >>>>
> >>>> I don't think every lock will go away with that. This also changes how
> >>>> mdev devices entries are created in sysfs. It adds an extra directory.  
> >>>
> >>> Exposing the parent-child relationship through sysfs is a desirable
> >>> feature, so I'm not sure how this is a negative.  This part of Jike's
> >>> conversion was a big improvement, I thought.  Thanks,
> >>>  
> >>
> >> Jike's suggestion is to introduced a fake device over parent device i.e.
> >> mdev-host, and then all mdev devices are children of 'mdev-host' not
> >> children of real parent.
> >>  
> > 
> > It really depends on how you define 'real parent' :)
> > 
> > With a physical-host-mdev hierarchy, the parent of mdev devices is the host
> > device, the parent of host device is the physical device. e.g.
> > 
> >         pdev            mdev_host       mdev_device
> >         dev<------------dev<------------dev
> >               parent          parent
> > 
> >         Figure 1: device hierarchy
> >   
> 
> Right, mdev-host device doesn't represent physical device nor any mdev
> device. Then what is the need of such device?

Is there anything implicitly wrong with using a device node to host the
mdev child devices?  Is the argument against it only that it's
unnecessary?  Can we make use of the device-core parent/child
dependencies as Jike has done w/o that extra node?
 
> >> For example, directory structure we have now is:
> >> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
> >>
> >> mdev devices are in real parents directory.
> >>
> >> By introducing fake device it would be:
> >> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
> >>
> >> mdev devices are in fake device's directory.
> >>  
> > 
> > Yes, this is the wanted directory.
> >   
> 
> I don't think so.

Why?

> >> Lock would be still required, to handle the race conditions like
> >> 'mdev_create' is still in process and parent device is unregistered by
> >> vendor driver/ parent device is unbind from vendor driver.
> >>  
> > 
> > locks are provided to protect resources, would you elaborate more on
> > what is the exact resource you want to protect by a lock in mdev_create?
> >   
> 
> Simple, in your suggestion mdev-host device. Fake device will go away if
> vendor driver unregisters the device from mdev module, right.

I don't follow the reply here, but as I understand it there's ordering
implicit in the device core that Jike is trying to take advantage of,
which simplifies the mdev layer significantly.  In the case of an
mdev_create, the device core needs to take a reference to the parent
object (the mdev-host, I'd guess, in Jike's version); the created mdev
device would also hold a reference to that object, so the physical host
device could not be removed so long as there are outstanding
references.  It's just a matter of managing references and acquiring
and releasing objects.  Thanks,

Alex
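
A hedged sketch of the reference flow described above (not code from
this series; it assumes the Figure 1 hierarchy and that mdev_device
embeds a struct device): device_register() itself pins the parent, so
the host, and transitively the physical device, stays around as long as
any mdev child exists.

#include <linux/device.h>

static int sketch_mdev_add(struct mdev_device *mdev, struct device *host)
{
	mdev->dev.parent = host;
	/* device_add() takes a reference on dev->parent, i.e. the host */
	return device_register(&mdev->dev);
}

static void sketch_mdev_del(struct mdev_device *mdev)
{
	/* drops the reference on the host once the child is gone */
	device_unregister(&mdev->dev);
}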

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-08  2:45       ` [Qemu-devel] " Jike Song
@ 2016-09-13  2:35         ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-13  2:35 UTC (permalink / raw)
  To: Dong Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia,
	qemu-devel, kvm, kevin.tian

On 09/08/2016 10:45 AM, Jike Song wrote:
> On 08/25/2016 05:22 PM, Dong Jia wrote:
>> On Thu, 25 Aug 2016 09:23:53 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>> [...]
>>
>> Dear Kirti,
>>
>> I just rebased my vfio-ccw patches to this series.
>> With a little fix, which was pointed it out in my reply to the #3
>> patch, it works fine.
>>
> 
> Hi Jia,
> 
> Sorry I didn't follow a lot in previous discussion, but since
> vfio-mdev in v7 patchset is at least PCI-agnostic, would you share
> with us why you still need a vfio-ccw?

Kind ping :)


Hi Dong Jia,

Since Kirti has confirmed that v7 is designed to have only one
vfio-mdev driver for all mdev devices, would you please tell us the
reason for your vfio-ccw? It could point to an architectural gap, and
the earlier we discuss it the better :)

--
Thanks,
Jike

>>> +static long vfio_mdev_unlocked_ioctl(void *device_data,
>>> +				     unsigned int cmd, unsigned long arg)
>>> +{
>>> +	int ret = 0;
>>> +	struct vfio_mdev *vmdev = device_data;
>>> +	struct parent_device *parent = vmdev->mdev->parent;
>>> +	unsigned long minsz;
>>> +
>>> +	switch (cmd) {
>>> +	case VFIO_DEVICE_GET_INFO:
>>> +	{
>>> +		struct vfio_device_info info;
>>> +
>>> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
>>> +
>>> +		if (copy_from_user(&info, (void __user *)arg, minsz))
>>> +			return -EFAULT;
>>> +
>>> +		if (info.argsz < minsz)
>>> +			return -EINVAL;
>>> +
>>> +		if (parent->ops->get_device_info)
>>> +			ret = parent->ops->get_device_info(vmdev->mdev, &info);
>>> +		else
>>> +			return -EINVAL;
>>> +
>>> +		if (ret)
>>> +			return ret;
>>> +
>>> +		if (parent->ops->reset)
>>> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
>> Shouldn't this be done inside the get_device_info callback?
>>
>>> +
>>> +		memcpy(&vmdev->dev_info, &info, sizeof(info));
>>> +
>>> +		return copy_to_user((void __user *)arg, &info, minsz);
>>> +	}
>> [...]
>>
>>> +
>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>>> +			      size_t count, loff_t *ppos)
>>> +{
>>> +	struct vfio_mdev *vmdev = device_data;
>>> +	struct mdev_device *mdev = vmdev->mdev;
>>> +	struct parent_device *parent = mdev->parent;
>>> +	unsigned int done = 0;
>>> +	int ret;
>>> +
>>> +	if (!parent->ops->read)
>>> +		return -EINVAL;
>>> +
>>> +	while (count) {
>> Here, I have to say sorry to you guys for that I didn't notice the
>> bad impact of this change to my patches during the v6 discussion.
>>
>> For vfio-ccw, I introduced an I/O region to input/output I/O
>> instruction parameters and results for Qemu. The @count of these data
>> currently is 140. So supporting arbitrary lengths in one shot here, and
>> also in vfio_mdev_write, seems the better option for this case.
>>
>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
>> can do that in the parent read/write callbacks instead.
>>
>> What do you think?
>>
>>> +		size_t filled;
>>> +
>>> +		if (count >= 4 && !(*ppos % 4)) {
>>> +			u32 val;
>>> +
>>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>>> +						*ppos);
>>> +			if (ret <= 0)
>>> +				goto read_err;
>>> +
>>> +			if (copy_to_user(buf, &val, sizeof(val)))
>>> +				goto read_err;
>>> +
>>> +			filled = 4;
>>> +		} else if (count >= 2 && !(*ppos % 2)) {
>>> +			u16 val;
>>> +
>>> +			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>>> +						*ppos);
>>> +			if (ret <= 0)
>>> +				goto read_err;
>>> +
>>> +			if (copy_to_user(buf, &val, sizeof(val)))
>>> +				goto read_err;
>>> +
>>> +			filled = 2;
>>> +		} else {
>>> +			u8 val;
>>> +
>>> +			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
>>> +			if (ret <= 0)
>>> +				goto read_err;
>>> +
>>> +			if (copy_to_user(buf, &val, sizeof(val)))
>>> +				goto read_err;
>>> +
>>> +			filled = 1;
>>> +		}
>>> +
>>> +		count -= filled;
>>> +		done += filled;
>>> +		*ppos += filled;
>>> +		buf += filled;
>>> +	}
>>> +
>>> +	return done;
>>> +
>>> +read_err:
>>> +	return -EFAULT;
>>> +}
>> [...]
>>
>> --------
>> Dong Jia
>>
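
For reference, a hedged sketch of the alternative Dong Jia describes
above (illustrative only, not a posted patch; it reuses the
vfio_mdev/parent_device names from the code quoted above): forward the
whole request to the vendor driver in one call and leave any
word-by-word splitting to the parent drivers that want it.

/* sketch for vfio_mdev.c; assumes <linux/slab.h> and <linux/uaccess.h> */
static ssize_t vfio_mdev_read_oneshot(void *device_data, char __user *buf,
				      size_t count, loff_t *ppos)
{
	struct vfio_mdev *vmdev = device_data;
	struct mdev_device *mdev = vmdev->mdev;
	struct parent_device *parent = mdev->parent;
	char *kbuf;
	int ret;

	if (!parent->ops->read)
		return -EINVAL;

	/* a real implementation would bound 'count' against the region size */
	kbuf = kmalloc(count, GFP_KERNEL);
	if (!kbuf)
		return -ENOMEM;

	/* one call covering the whole request, e.g. the 140-byte ccw region */
	ret = parent->ops->read(mdev, kbuf, count, *ppos);
	if (ret > 0) {
		if (copy_to_user(buf, kbuf, ret))
			ret = -EFAULT;
		else
			*ppos += ret;
	}

	kfree(kbuf);
	return ret;
}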

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-12 15:53                 ` [Qemu-devel] " Alex Williamson
@ 2016-09-19  7:08                   ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-19  7:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi

On 09/12/2016 11:53 PM, Alex Williamson wrote:
> On Mon, 12 Sep 2016 13:19:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/12/2016 10:40 AM, Jike Song wrote:
>>> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:  
>>>> On 9/10/2016 12:12 AM, Alex Williamson wrote:  
>>>>> On Fri, 9 Sep 2016 23:18:45 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>  
>>>>>> On 9/8/2016 1:39 PM, Jike Song wrote:  
>>>>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:    
>>>>>>  
>>>>>>>>  +---------------+
>>>>>>>>  |               |
>>>>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>>>>>  | |           | +<------------------------+ __init()     |
>>>>>>>>  | |  mdev     | |                         |              |
>>>>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>>>>>  | |           | |                         |              |
>>>>>>>>  | +-----------+ |                         +--------------+
>>>>>>>>  |               |    
>>>>>>>
>>>>>>> This aimed to have only one single vfio bus driver for all mediated devices,
>>>>>>> right?
>>>>>>>    
>>>>>>
>>>>>> Yes. That's correct.
>>>>>>
>>>>>>  
>>>>>>>> +
>>>>>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>>>>>> +				    const struct attribute_group **groups)
>>>>>>>> +{
>>>>>>>> +	return sysfs_create_groups(&dev->kobj, groups);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>>>>>> +					const struct attribute_group **groups)
>>>>>>>> +{
>>>>>>>> +	sysfs_remove_groups(&dev->kobj, groups);
>>>>>>>> +}    
>>>>>>>
>>>>>>> These functions are not necessary. You can always specify the attribute groups
>>>>>>> to dev->groups before registering a new device.
>>>>>>>     
>>>>>>
>>>>>> At the time of mdev device create, I specifically didn't used
>>>>>> dev->groups because we callback in vendor driver before that, see below
>>>>>> code snippet, and those attributes should only be added if create()
>>>>>> callback returns success.
>>>>>>
>>>>>>         ret = parent->ops->create(mdev, mdev_params);
>>>>>>         if (ret)
>>>>>>                 return ret;
>>>>>>
>>>>>>         ret = mdev_add_attribute_group(&mdev->dev,
>>>>>>                                         parent->ops->mdev_attr_groups);
>>>>>>         if (ret)
>>>>>>                 parent->ops->destroy(mdev);
>>>>>>
>>>>>>
>>>>>>  
>>>>>>>> +
>>>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>>>>>> +{
>>>>>>>> +	struct parent_device *parent;
>>>>>>>> +
>>>>>>>> +	mutex_lock(&parent_list_lock);
>>>>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
>>>>>>>> +	mutex_unlock(&parent_list_lock);
>>>>>>>> +
>>>>>>>> +	return parent;
>>>>>>>> +}    
>>>>>>>
>>>>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
>>>>>>> as long as you have an independent device associated with the mdev host device
>>>>>>> ("parent" device here).
>>>>>>>    
>>>>>>
>>>>>> I don't think every lock will go away with that. This also changes how
>>>>>> mdev devices entries are created in sysfs. It adds an extra directory.  
>>>>>
>>>>> Exposing the parent-child relationship through sysfs is a desirable
>>>>> feature, so I'm not sure how this is a negative.  This part of Jike's
>>>>> conversion was a big improvement, I thought.  Thanks,
>>>>>  
>>>>
>>>> Jike's suggestion is to introduced a fake device over parent device i.e.
>>>> mdev-host, and then all mdev devices are children of 'mdev-host' not
>>>> children of real parent.
>>>>  
>>>
>>> It really depends on how you define 'real parent' :)
>>>
>>> With a physical-host-mdev hierarchy, the parent of mdev devices is the host
>>> device, the parent of host device is the physical device. e.g.
>>>
>>>         pdev            mdev_host       mdev_device
>>>         dev<------------dev<------------dev
>>>               parent          parent
>>>
>>>         Figure 1: device hierarchy
>>>   
>>
>> Right, mdev-host device doesn't represent physical device nor any mdev
>> device. Then what is the need of such device?
> 
> Is there anything implicitly wrong with using a device node to host the
> mdev child devices?  Is the argument against it only that it's
> unnecessary?  Can we make use of the device-core parent/child
> dependencies as Jike has done w/o that extra node?
>  
>>>> For example, directory structure we have now is:
>>>> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
>>>>
>>>> mdev devices are in real parents directory.
>>>>
>>>> By introducing fake device it would be:
>>>> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
>>>>
>>>> mdev devices are in fake device's directory.
>>>>  
>>>
>>> Yes, this is the wanted directory.
>>>   
>>
>> I don't think so.
> 
> Why?
> 
>>>> Lock would be still required, to handle the race conditions like
>>>> 'mdev_create' is still in process and parent device is unregistered by
>>>> vendor driver/ parent device is unbind from vendor driver.
>>>>  
>>>
>>> locks are provided to protect resources, would you elaborate more on
>>> what is the exact resource you want to protect by a lock in mdev_create?
>>>   
>>
>> Simple, in your suggestion mdev-host device. Fake device will go away if
>> vendor driver unregisters the device from mdev module, right.
> 
> I don't follow the reply here, but aiui there's ordering implicit in
> the device core that Jike is trying to take advantage of that
> simplifies the mdev layer significantly.  In the case of an
> mdev_create, the device core needs to take a reference to the parent
> object, the mdev-host I'd guess in Jike's version, the created mdev
> device would also have a reference to that object, so the physical host
> device could not be removed so long as there are outstanding
> references.  It's just a matter of managing references and acquiring
> and releasing objects.  Thanks,

Hi Kirti/Neo,

Any thoughts on this part? Could we have more discussion in case
further concerns are raised?


--
Thanks,
Jike
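
[Editor's note: for reference, a minimal sketch of the alternative discussed
in the quotes above, i.e. calling the vendor create() first and then handing
the attribute groups to the device core via dev->groups so they are created
and removed together with the device. This assumes mdev_device embeds a
struct device named "dev", as in the series; the function name is made up
for illustration and this is not the posted patch.]

/* sketch only; a real module needs <linux/device.h> etc. */
static int mdev_device_create_sketch(struct mdev_device *mdev,
				     struct parent_device *parent,
				     char *mdev_params)
{
	int ret;

	/* add vendor attributes only if create() succeeds ... */
	ret = parent->ops->create(mdev, mdev_params);
	if (ret)
		return ret;

	/* ... by letting device_add() create (and device_del() remove) them */
	mdev->dev.groups = parent->ops->mdev_attr_groups;

	ret = device_register(&mdev->dev);
	if (ret)
		parent->ops->destroy(mdev);

	return ret;
}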


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-12 15:53                 ` [Qemu-devel] " Alex Williamson
@ 2016-09-19 17:29                   ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-19 17:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jike Song, pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, bjsdjshi



On 9/12/2016 9:23 PM, Alex Williamson wrote:
> On Mon, 12 Sep 2016 13:19:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/12/2016 10:40 AM, Jike Song wrote:
>>> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:  
>>>> On 9/10/2016 12:12 AM, Alex Williamson wrote:  
>>>>> On Fri, 9 Sep 2016 23:18:45 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>  
>>>>>> On 9/8/2016 1:39 PM, Jike Song wrote:  
>>>>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:    
>>>>>>  
>>>>>>>>  +---------------+
>>>>>>>>  |               |
>>>>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>>>>>  | |           | +<------------------------+ __init()     |
>>>>>>>>  | |  mdev     | |                         |              |
>>>>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>>>>>  | |           | |                         |              |
>>>>>>>>  | +-----------+ |                         +--------------+
>>>>>>>>  |               |    
>>>>>>>
>>>>>>> This aimed to have only one single vfio bus driver for all mediated devices,
>>>>>>> right?
>>>>>>>    
>>>>>>
>>>>>> Yes. That's correct.
>>>>>>
>>>>>>  
>>>>>>>> +
>>>>>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>>>>>> +				    const struct attribute_group **groups)
>>>>>>>> +{
>>>>>>>> +	return sysfs_create_groups(&dev->kobj, groups);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>>>>>> +					const struct attribute_group **groups)
>>>>>>>> +{
>>>>>>>> +	sysfs_remove_groups(&dev->kobj, groups);
>>>>>>>> +}    
>>>>>>>
>>>>>>> These functions are not necessary. You can always specify the attribute groups
>>>>>>> to dev->groups before registering a new device.
>>>>>>>     
>>>>>>
>>>>>> At the time of mdev device create, I specifically didn't used
>>>>>> dev->groups because we callback in vendor driver before that, see below
>>>>>> code snippet, and those attributes should only be added if create()
>>>>>> callback returns success.
>>>>>>
>>>>>>         ret = parent->ops->create(mdev, mdev_params);
>>>>>>         if (ret)
>>>>>>                 return ret;
>>>>>>
>>>>>>         ret = mdev_add_attribute_group(&mdev->dev,
>>>>>>                                         parent->ops->mdev_attr_groups);
>>>>>>         if (ret)
>>>>>>                 parent->ops->destroy(mdev);
>>>>>>
>>>>>>
>>>>>>  
>>>>>>>> +
>>>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>>>>>> +{
>>>>>>>> +	struct parent_device *parent;
>>>>>>>> +
>>>>>>>> +	mutex_lock(&parent_list_lock);
>>>>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
>>>>>>>> +	mutex_unlock(&parent_list_lock);
>>>>>>>> +
>>>>>>>> +	return parent;
>>>>>>>> +}    
>>>>>>>
>>>>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
>>>>>>> as long as you have an independent device associated with the mdev host device
>>>>>>> ("parent" device here).
>>>>>>>    
>>>>>>
>>>>>> I don't think every lock will go away with that. This also changes how
>>>>>> mdev devices entries are created in sysfs. It adds an extra directory.  
>>>>>
>>>>> Exposing the parent-child relationship through sysfs is a desirable
>>>>> feature, so I'm not sure how this is a negative.  This part of Jike's
>>>>> conversion was a big improvement, I thought.  Thanks,
>>>>>  
>>>>
>>>> Jike's suggestion is to introduced a fake device over parent device i.e.
>>>> mdev-host, and then all mdev devices are children of 'mdev-host' not
>>>> children of real parent.
>>>>  
>>>
>>> It really depends on how you define 'real parent' :)
>>>
>>> With a physical-host-mdev hierarchy, the parent of mdev devices is the host
>>> device, the parent of host device is the physical device. e.g.
>>>
>>>         pdev            mdev_host       mdev_device
>>>         dev<------------dev<------------dev
>>>               parent          parent
>>>
>>>         Figure 1: device hierarchy
>>>   
>>
>> Right, mdev-host device doesn't represent physical device nor any mdev
>> device. Then what is the need of such device?
> 
> Is there anything implicitly wrong with using a device node to host the
> mdev child devices?  Is the argument against it only that it's
> unnecessary?  Can we make use of the device-core parent/child
> dependencies as Jike has done w/o that extra node?
>

I do feel that the mdev core module would get simplified with the new
sysfs interface, without having the extra node.


>>>> For example, directory structure we have now is:
>>>> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
>>>>
>>>> mdev devices are in real parents directory.
>>>>
>>>> By introducing fake device it would be:
>>>> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
>>>>
>>>> mdev devices are in fake device's directory.
>>>>  
>>>
>>> Yes, this is the wanted directory.
>>>   
>>
>> I don't think so.
> 
> Why?
> 

This directory is not mandatory, right?

>>>> Lock would be still required, to handle the race conditions like
>>>> 'mdev_create' is still in process and parent device is unregistered by
>>>> vendor driver/ parent device is unbind from vendor driver.
>>>>  
>>>
>>> locks are provided to protect resources, would you elaborate more on
>>> what is the exact resource you want to protect by a lock in mdev_create?
>>>   
>>
>> Simple, in your suggestion mdev-host device. Fake device will go away if
>> vendor driver unregisters the device from mdev module, right.
> 
> I don't follow the reply here, but aiui there's ordering implicit in
> the device core that Jike is trying to take advantage of that
> simplifies the mdev layer significantly.  In the case of an
> mdev_create, the device core needs to take a reference to the parent
> object, the mdev-host I'd guess in Jike's version, the created mdev
> device would also have a reference to that object, so the physical host
> device could not be removed so long as there are outstanding
> references.  It's just a matter of managing references and acquiring
> and releasing objects.  Thanks,
>

I do think this could be simplified without having the extra node.

> the created mdev
> device would also have a reference to that object, so the physical host
> device could not be removed so long as there are outstanding
> references.

Yes, this is also true when the physical device is the direct parent of the
mdev device. The mdev device keeps a reference to its parent, so the
physical host device cannot be removed as long as mdev devices are
present. That is why, from mdev_unregister_device(), a chance is given to
free all child mdev devices.
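
For illustration, a sketch of the reference pattern described above when it
leans on the driver core: the children are removed from
mdev_unregister_device() before the parent's reference is dropped. The
helper names here are invented for the sketch; only device_for_each_child(),
device_unregister() and put_device() are standard driver-core calls, and
this is not the code from the patch.

static int mdev_remove_child_cb(struct device *dev, void *data)
{
	device_unregister(dev);	/* releases this child's hold on its parent */
	return 0;
}

static void mdev_unregister_device_sketch(struct device *parent_dev)
{
	/* give every child mdev device a chance to be destroyed first */
	device_for_each_child(parent_dev, NULL, mdev_remove_child_cb);

	/* drop the reference taken when the parent was registered with mdev */
	put_device(parent_dev);
}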

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-19 17:29                   ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-19 18:11                     ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-19 18:11 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Jike Song, pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, bjsdjshi

On Mon, 19 Sep 2016 22:59:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/12/2016 9:23 PM, Alex Williamson wrote:
> > On Mon, 12 Sep 2016 13:19:11 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/12/2016 10:40 AM, Jike Song wrote:  
> >>> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:    
> >>>> On 9/10/2016 12:12 AM, Alex Williamson wrote:    
> >>>>> On Fri, 9 Sep 2016 23:18:45 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>    
> >>>>>> On 9/8/2016 1:39 PM, Jike Song wrote:    
> >>>>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:      
> >>>>>>    
> >>>>>>>>  +---------------+
> >>>>>>>>  |               |
> >>>>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
> >>>>>>>>  | |           | +<------------------------+ __init()     |
> >>>>>>>>  | |  mdev     | |                         |              |
> >>>>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
> >>>>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
> >>>>>>>>  | |           | |                         |              |
> >>>>>>>>  | +-----------+ |                         +--------------+
> >>>>>>>>  |               |      
> >>>>>>>
> >>>>>>> This aimed to have only one single vfio bus driver for all mediated devices,
> >>>>>>> right?
> >>>>>>>      
> >>>>>>
> >>>>>> Yes. That's correct.
> >>>>>>
> >>>>>>    
> >>>>>>>> +
> >>>>>>>> +static int mdev_add_attribute_group(struct device *dev,
> >>>>>>>> +				    const struct attribute_group **groups)
> >>>>>>>> +{
> >>>>>>>> +	return sysfs_create_groups(&dev->kobj, groups);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static void mdev_remove_attribute_group(struct device *dev,
> >>>>>>>> +					const struct attribute_group **groups)
> >>>>>>>> +{
> >>>>>>>> +	sysfs_remove_groups(&dev->kobj, groups);
> >>>>>>>> +}      
> >>>>>>>
> >>>>>>> These functions are not necessary. You can always specify the attribute groups
> >>>>>>> to dev->groups before registering a new device.
> >>>>>>>       
> >>>>>>
> >>>>>> At the time of mdev device create, I specifically didn't used
> >>>>>> dev->groups because we callback in vendor driver before that, see below
> >>>>>> code snippet, and those attributes should only be added if create()
> >>>>>> callback returns success.
> >>>>>>
> >>>>>>         ret = parent->ops->create(mdev, mdev_params);
> >>>>>>         if (ret)
> >>>>>>                 return ret;
> >>>>>>
> >>>>>>         ret = mdev_add_attribute_group(&mdev->dev,
> >>>>>>                                         parent->ops->mdev_attr_groups);
> >>>>>>         if (ret)
> >>>>>>                 parent->ops->destroy(mdev);
> >>>>>>
> >>>>>>
> >>>>>>    
> >>>>>>>> +
> >>>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
> >>>>>>>> +{
> >>>>>>>> +	struct parent_device *parent;
> >>>>>>>> +
> >>>>>>>> +	mutex_lock(&parent_list_lock);
> >>>>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
> >>>>>>>> +	mutex_unlock(&parent_list_lock);
> >>>>>>>> +
> >>>>>>>> +	return parent;
> >>>>>>>> +}      
> >>>>>>>
> >>>>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
> >>>>>>> as long as you have an independent device associated with the mdev host device
> >>>>>>> ("parent" device here).
> >>>>>>>      
> >>>>>>
> >>>>>> I don't think every lock will go away with that. This also changes how
> >>>>>> mdev devices entries are created in sysfs. It adds an extra directory.    
> >>>>>
> >>>>> Exposing the parent-child relationship through sysfs is a desirable
> >>>>> feature, so I'm not sure how this is a negative.  This part of Jike's
> >>>>> conversion was a big improvement, I thought.  Thanks,
> >>>>>    
> >>>>
> >>>> Jike's suggestion is to introduced a fake device over parent device i.e.
> >>>> mdev-host, and then all mdev devices are children of 'mdev-host' not
> >>>> children of real parent.
> >>>>    
> >>>
> >>> It really depends on how you define 'real parent' :)
> >>>
> >>> With a physical-host-mdev hierarchy, the parent of mdev devices is the host
> >>> device, the parent of host device is the physical device. e.g.
> >>>
> >>>         pdev            mdev_host       mdev_device
> >>>         dev<------------dev<------------dev
> >>>               parent          parent
> >>>
> >>>         Figure 1: device hierarchy
> >>>     
> >>
> >> Right, mdev-host device doesn't represent physical device nor any mdev
> >> device. Then what is the need of such device?  
> > 
> > Is there anything implicitly wrong with using a device node to host the
> > mdev child devices?  Is the argument against it only that it's
> > unnecessary?  Can we make use of the device-core parent/child
> > dependencies as Jike has done w/o that extra node?
> >  
> 
> I do feel that the mdev core module would get simplified with the new
> sysfs interface, without having the extra node.

Can you provide an example of why that is?
 
> >>>> For example, directory structure we have now is:
> >>>> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
> >>>>
> >>>> mdev devices are in real parents directory.
> >>>>
> >>>> By introducing fake device it would be:
> >>>> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
> >>>>
> >>>> mdev devices are in fake device's directory.
> >>>>    
> >>>
> >>> Yes, this is the wanted directory.
> >>>     
> >>
> >> I don't think so.  
> > 
> > Why?
> >   
> 
> This directory is not mandatory, right?

Clearly you've done an implementation without it, so it's not
functionally mandatory, but Jike has made significant code reduction
and simplification with it.  Those are desirable things.

> >>>> Lock would be still required, to handle the race conditions like
> >>>> 'mdev_create' is still in process and parent device is unregistered by
> >>>> vendor driver/ parent device is unbind from vendor driver.
> >>>>    
> >>>
> >>> locks are provided to protect resources, would you elaborate more on
> >>> what is the exact resource you want to protect by a lock in mdev_create?
> >>>     
> >>
> >> Simple, in your suggestion mdev-host device. Fake device will go away if
> >> vendor driver unregisters the device from mdev module, right.  
> > 
> > I don't follow the reply here, but aiui there's ordering implicit in
> > the device core that Jike is trying to take advantage of that
> > simplifies the mdev layer significantly.  In the case of an
> > mdev_create, the device core needs to take a reference to the parent
> > object, the mdev-host I'd guess in Jike's version, the created mdev
> > device would also have a reference to that object, so the physical host
> > device could not be removed so long as there are outstanding
> > references.  It's just a matter of managing references and acquiring
> > and releasing objects.  Thanks,
> >  
> 
> I do think this could be simplified without having the extra node.

The simplification is really what I'm after, whether or not it includes
an extra device node is not something I'm sure I have an opinion on
yet.  Aren't we really just talking about an extra sysfs directory
node?  Doesn't it make it easier for userspace to efficiently identify
all the mdev children when they're segregated from the other attributes
and sub-nodes of a parent device?
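
As a rough illustration of that point, a userspace sketch that enumerates
mediated devices when they sit under a dedicated mdev-host node. The path
follows the example used earlier in the thread and is hypothetical; a real
tool would also skip the node's own attribute files, this only shows the
directory walk.

#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/bus/pci/devices/0000:85:00.0/mdev-host";
	struct dirent *ent;
	DIR *dir = opendir(path);

	if (!dir)
		return 1;

	while ((ent = readdir(dir)) != NULL) {
		if (ent->d_name[0] == '.')
			continue;
		printf("mdev child: %s\n", ent->d_name);
	}

	closedir(dir);
	return 0;
}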
 
> > the created mdev
> > device would also have a reference to that object, so the physical host
> > device could not be removed so long as there are outstanding
> > references.  
> 
> Yes, this is also true when the physical device is the direct parent of the
> mdev device. The mdev device keeps a reference to its parent, so the
> physical host device cannot be removed as long as mdev devices are
> present. That is why, from mdev_unregister_device(), a chance is given to
> free all child mdev devices.

But why aren't we using the device core to do that?  It seems like
we're getting hung up on this device node, which is more of a stylistic
and sysfs layout issue when the primary comment is to simplify the mdev
infrastructure by taking more advantage of the parent/child
dependencies of the device core.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-08-26 14:13       ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-19 18:22         ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-19 18:22 UTC (permalink / raw)
  To: Dong Jia, alex.williamson
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini



On 8/26/2016 7:43 PM, Kirti Wankhede wrote:
> * PGP Signed: 08/26/2016 at 07:15:44 AM, Decrypted
> On 8/25/2016 2:52 PM, Dong Jia wrote:
>> On Thu, 25 Aug 2016 09:23:53 +0530

>>> +
>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>>> +			      size_t count, loff_t *ppos)
>>> +{
>>> +	struct vfio_mdev *vmdev = device_data;
>>> +	struct mdev_device *mdev = vmdev->mdev;
>>> +	struct parent_device *parent = mdev->parent;
>>> +	unsigned int done = 0;
>>> +	int ret;
>>> +
>>> +	if (!parent->ops->read)
>>> +		return -EINVAL;
>>> +
>>> +	while (count) {
>> Here, I have to say sorry to you guys for that I didn't notice the
>> bad impact of this change to my patches during the v6 discussion.
>>
>> For vfio-ccw, I introduced an I/O region to input/output I/O
>> instruction parameters and results for Qemu. The @count of these data
>> currently is 140. So supporting arbitrary lengths in one shot here, and
>> also in vfio_mdev_write, seems the better option for this case.
>>
>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
>> can do that in the parent read/write callbacks instead.
>>
>> What do you think?
>>
> 
> I would like to know Alex's thought on this. He raised concern with this
> approach in v6 reviews:
> "But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer."
> 

Read/write callbacks are for the slow path, emulation of MMIO regions
which are mainly device registers. I do feel they shouldn't support
arbitrary lengths.
Alex, I would like to know your thoughts.

Thanks,
Kirti
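
For comparison, a sketch of the bounded-iteration style described above,
where the mdev layer stages at most 8 bytes on the stack per step so
userspace cannot make it allocate an arbitrarily large kernel buffer. The
vendor ->read() taking a kernel pointer and advancing *ppos itself are
assumptions of this sketch, not the signature of the posted code.

/* sketch only; needs <linux/uaccess.h>, <linux/minmax.h> in a real module */
static ssize_t vfio_mdev_read_bounded(void *device_data, char __user *buf,
				      size_t count, loff_t *ppos)
{
	struct vfio_mdev *vmdev = device_data;
	struct mdev_device *mdev = vmdev->mdev;
	struct parent_device *parent = mdev->parent;
	ssize_t done = 0;

	if (!parent->ops->read)
		return -EINVAL;

	while (count) {
		u64 val;
		size_t step = min_t(size_t, count, sizeof(val));
		ssize_t ret;

		/* vendor callback assumed to fill a kernel buffer, <= 8 bytes */
		ret = parent->ops->read(mdev, (char *)&val, step, ppos);
		if (ret <= 0)
			return done ? done : ret;

		if (copy_to_user(buf, &val, ret))
			return -EFAULT;

		buf += ret;
		count -= ret;
		done += ret;
	}

	return done;
}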

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-19 18:22         ` Kirti Wankhede
@ 2016-09-19 18:36           ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-19 18:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Dong Jia, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini

On Mon, 19 Sep 2016 23:52:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/26/2016 7:43 PM, Kirti Wankhede wrote:
> > * PGP Signed: 08/26/2016 at 07:15:44 AM, Decrypted
> > On 8/25/2016 2:52 PM, Dong Jia wrote:  
> >> On Thu, 25 Aug 2016 09:23:53 +0530  
> 
> >>> +
> >>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> >>> +			      size_t count, loff_t *ppos)
> >>> +{
> >>> +	struct vfio_mdev *vmdev = device_data;
> >>> +	struct mdev_device *mdev = vmdev->mdev;
> >>> +	struct parent_device *parent = mdev->parent;
> >>> +	unsigned int done = 0;
> >>> +	int ret;
> >>> +
> >>> +	if (!parent->ops->read)
> >>> +		return -EINVAL;
> >>> +
> >>> +	while (count) {  
> >> Here, I have to say sorry to you guys for that I didn't notice the
> >> bad impact of this change to my patches during the v6 discussion.
> >>
> >> For vfio-ccw, I introduced an I/O region to input/output I/O
> >> instruction parameters and results for Qemu. The @count of these data
> >> currently is 140. So supporting arbitrary lengths in one shot here, and
> >> also in vfio_mdev_write, seems the better option for this case.
> >>
> >> I believe that if the pci drivers want to iterate in a 4 bytes step, you
> >> can do that in the parent read/write callbacks instead.
> >>
> >> What do you think?
> >>  
> > 
> > I would like to know Alex's thought on this. He raised concern with this
> > approach in v6 reviews:
> > "But I think this is exploitable, it lets the user make the kernel
> > allocate an arbitrarily sized buffer."
> >   
> 
> Read/write callbacks are for the slow path, emulation of MMIO regions
> which are mainly device registers. I do feel they shouldn't support
> arbitrary lengths.
> Alex, I would like to know your thoughts.

The exploit was that the mdev layer allocated a buffer and copied the
entire user buffer into kernel space before passing it to the vendor
driver.  The solution is to pass the __user *buf to the vendor driver
and let them sanitize and split it however makes sense for their
device.  We shouldn't be assuming naturally aligned, up to dword
accesses in the generic mdev layers.  Those sorts of semantics are
defined by the device type.  This is another case where making
the mdev layer as thin as possible is actually the best thing to
do to make it really device type agnostic.  Thanks,

Alex
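
A minimal sketch of the approach described above, with the mdev layer simply
forwarding the user pointer to the vendor driver. The vendor ->read()
accepting a __user pointer is an assumption of this sketch, not the callback
signature from the posted patch; the surrounding types mirror the quoted
vfio_mdev_read() code.

static ssize_t vfio_mdev_read_passthrough(void *device_data, char __user *buf,
					  size_t count, loff_t *ppos)
{
	struct vfio_mdev *vmdev = device_data;
	struct mdev_device *mdev = vmdev->mdev;
	struct parent_device *parent = mdev->parent;

	if (!parent->ops->read)
		return -EINVAL;

	/*
	 * No kernel-side staging buffer: the vendor driver receives the
	 * __user pointer and applies whatever alignment and size rules
	 * its device type defines.
	 */
	return parent->ops->read(mdev, buf, count, ppos);
}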

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-19 18:36           ` Alex Williamson
@ 2016-09-19 19:13             ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-19 19:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Dong Jia, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini



On 9/20/2016 12:06 AM, Alex Williamson wrote:
> On Mon, 19 Sep 2016 23:52:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 8/26/2016 7:43 PM, Kirti Wankhede wrote:
>>> * PGP Signed: 08/26/2016 at 07:15:44 AM, Decrypted
>>> On 8/25/2016 2:52 PM, Dong Jia wrote:  
>>>> On Thu, 25 Aug 2016 09:23:53 +0530  
>>
>>>>> +
>>>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>>>>> +			      size_t count, loff_t *ppos)
>>>>> +{
>>>>> +	struct vfio_mdev *vmdev = device_data;
>>>>> +	struct mdev_device *mdev = vmdev->mdev;
>>>>> +	struct parent_device *parent = mdev->parent;
>>>>> +	unsigned int done = 0;
>>>>> +	int ret;
>>>>> +
>>>>> +	if (!parent->ops->read)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	while (count) {  
>>>> Here, I have to say sorry to you guys for that I didn't notice the
>>>> bad impact of this change to my patches during the v6 discussion.
>>>>
>>>> For vfio-ccw, I introduced an I/O region to input/output I/O
>>>> instruction parameters and results for Qemu. The @count of these data
>>>> currently is 140. So supporting arbitrary lengths in one shot here, and
>>>> also in vfio_mdev_write, seems the better option for this case.
>>>>
>>>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
>>>> can do that in the parent read/write callbacks instead.
>>>>
>>>> What do you think?
>>>>  
>>>
>>> I would like to know Alex's thought on this. He raised concern with this
>>> approach in v6 reviews:
>>> "But I think this is exploitable, it lets the user make the kernel
>>> allocate an arbitrarily sized buffer."
>>>   
>>
>> Read/write callbacks are for the slow path, emulation of MMIO regions
>> which are mainly device registers. I do feel they shouldn't support
>> arbitrary lengths.
>> Alex, I would like to know your thoughts.
> 
> The exploit was that the mdev layer allocated a buffer and copied the
> entire user buffer into kernel space before passing it to the vendor
> driver.  The solution is to pass the __user *buf to the vendor driver
> and let them sanitize and split it however makes sense for their
> device.  We shouldn't be assuming naturally aligned, up to dword
> accesses in the generic mdev layers.  Those sorts of semantics are
> defined by the device type.  This is another case where making
> the mdev layer as thin as possible is actually the best thing to
> do to make it really device type agnostic.  Thanks,
> 


Alex,

These were the comments on the v6 patch:

>>> Do we really need to support arbitrary lengths in one shot?  Seems
>>> like
>>> we could just use a 4 or 8 byte variable on the stack and iterate
>>> until
>>> done.
>>>
>>
>> We just want to pass the arguments to vendor driver as is here.Vendor
>> driver could take care of that.

> But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer.

As per the above discussion on the v7 version, this module doesn't
allocate memory from the heap.

If the vendor driver allocates arbitrary memory in kernel space through
the mdev module interface, wouldn't that be an exploit?

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
@ 2016-09-19 19:13             ` Kirti Wankhede
  0 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-19 19:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Dong Jia, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini



On 9/20/2016 12:06 AM, Alex Williamson wrote:
> On Mon, 19 Sep 2016 23:52:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 8/26/2016 7:43 PM, Kirti Wankhede wrote:
>>> * PGP Signed: 08/26/2016 at 07:15:44 AM, Decrypted
>>> On 8/25/2016 2:52 PM, Dong Jia wrote:  
>>>> On Thu, 25 Aug 2016 09:23:53 +0530  
>>
>>>>> +
>>>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>>>>> +			      size_t count, loff_t *ppos)
>>>>> +{
>>>>> +	struct vfio_mdev *vmdev = device_data;
>>>>> +	struct mdev_device *mdev = vmdev->mdev;
>>>>> +	struct parent_device *parent = mdev->parent;
>>>>> +	unsigned int done = 0;
>>>>> +	int ret;
>>>>> +
>>>>> +	if (!parent->ops->read)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	while (count) {  
>>>> Here, I have to say sorry to you guys for that I didn't notice the
>>>> bad impact of this change to my patches during the v6 discussion.
>>>>
>>>> For vfio-ccw, I introduced an I/O region to input/output I/O
>>>> instruction parameters and results for Qemu. The @count of these data
>>>> currently is 140. So supporting arbitrary lengths in one shot here, and
>>>> also in vfio_mdev_write, seems the better option for this case.
>>>>
>>>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
>>>> can do that in the parent read/write callbacks instead.
>>>>
>>>> What do you think?
>>>>  
>>>
>>> I would like to know Alex's thought on this. He raised concern with this
>>> approach in v6 reviews:
>>> "But I think this is exploitable, it lets the user make the kernel
>>> allocate an arbitrarily sized buffer."
>>>   
>>
>> Read/write callbacks are for slow path, emulation of mmio region which
>> are mainly device registers. I do feel it shouldn't support arbitrary
>> lengths.
>> Alex, I would like to know your thoughts.
> 
> The exploit was that the mdev layer allocated a buffer and copied the
> entire user buffer into kernel space before passing it to the vendor
> driver.  The solution is to pass the __user *buf to the vendor driver
> and let them sanitize and split it however makes sense for their
> device.  We shouldn't be assuming naturally aligned, up to dword
> accesses in the generic mdev layers.  Those sorts of semantics are
> defined by the device type.  This is another case where making
> the mdev layer as thin as possible is actually the best thing to
> do to make it really device type agnostic.  Thanks,
> 


Alex,

These were the comments on v6 patch:

>>> Do we really need to support arbitrary lengths in one shot?  Seems
>>> like
>>> we could just use a 4 or 8 byte variable on the stack and iterate
>>> until
>>> done.
>>>
>>
>> We just want to pass the arguments to vendor driver as is here.Vendor
>> driver could take care of that.

> But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer.

As per the above discussion, in the v7 version this module doesn't
allocate memory from the heap.

If the vendor driver allocates arbitrary memory in kernel space through
the mdev module interface, wouldn't that be an exploit?

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-19 19:13             ` Kirti Wankhede
@ 2016-09-19 20:03               ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-19 20:03 UTC (permalink / raw)
  To: Kirti Wankhede, kraxel
  Cc: Dong Jia, kevin.tian, cjia, kvm, qemu-devel, jike.song, pbonzini

On Tue, 20 Sep 2016 00:43:15 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/20/2016 12:06 AM, Alex Williamson wrote:
> > On Mon, 19 Sep 2016 23:52:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 8/26/2016 7:43 PM, Kirti Wankhede wrote:  
> >>> * PGP Signed: 08/26/2016 at 07:15:44 AM, Decrypted
> >>> On 8/25/2016 2:52 PM, Dong Jia wrote:    
> >>>> On Thu, 25 Aug 2016 09:23:53 +0530    
> >>  
> >>>>> +
> >>>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> >>>>> +			      size_t count, loff_t *ppos)
> >>>>> +{
> >>>>> +	struct vfio_mdev *vmdev = device_data;
> >>>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>>> +	struct parent_device *parent = mdev->parent;
> >>>>> +	unsigned int done = 0;
> >>>>> +	int ret;
> >>>>> +
> >>>>> +	if (!parent->ops->read)
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>>> +	while (count) {    
> >>>> Here, I have to say sorry to you guys for that I didn't notice the
> >>>> bad impact of this change to my patches during the v6 discussion.
> >>>>
> >>>> For vfio-ccw, I introduced an I/O region to input/output I/O
> >>>> instruction parameters and results for Qemu. The @count of these data
> >>>> currently is 140. So supporting arbitrary lengths in one shot here, and
> >>>> also in vfio_mdev_write, seems the better option for this case.
> >>>>
> >>>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
> >>>> can do that in the parent read/write callbacks instead.
> >>>>
> >>>> What do you think?
> >>>>    
> >>>
> >>> I would like to know Alex's thought on this. He raised concern with this
> >>> approach in v6 reviews:
> >>> "But I think this is exploitable, it lets the user make the kernel
> >>> allocate an arbitrarily sized buffer."
> >>>     
> >>
> >> Read/write callbacks are for slow path, emulation of mmio region which
> >> are mainly device registers. I do feel it shouldn't support arbitrary
> >> lengths.
> >> Alex, I would like to know your thoughts.  
> > 
> > The exploit was that the mdev layer allocated a buffer and copied the
> > entire user buffer into kernel space before passing it to the vendor
> > driver.  The solution is to pass the __user *buf to the vendor driver
> > and let them sanitize and split it however makes sense for their
> > device.  We shouldn't be assuming naturally aligned, up to dword
> > accesses in the generic mdev layers.  Those sorts of semantics are
> > defined by the device type.  This is another case where making
> > the mdev layer as thin as possible is actually the best thing to
> > do to make it really device type agnostic.  Thanks,
> >   
> 
> 
> Alex,
> 
> These were the comments on v6 patch:
> 
> >>> Do we really need to support arbitrary lengths in one shot?  Seems
> >>> like
> >>> we could just use a 4 or 8 byte variable on the stack and iterate
> >>> until
> >>> done.
> >>>  
> >>
> >> We just want to pass the arguments to vendor driver as is here.Vendor
> >> driver could take care of that.  
> 
> > But I think this is exploitable, it lets the user make the kernel
> > allocate an arbitrarily sized buffer.  
> 
> As per above discussion in v7 version, this module don't allocated
> memory from heap.
> 
> If vendor driver allocates arbitrary memory in kernel space through mdev
> module interface, isn't that would be exploit?

Yep, my 4/8-byte chunking idea was too PCI-specific.  If a vendor
driver introduces an exploit, that's a bug in the vendor driver.  I'm
not sure if we can or should attempt to guard against that.  Ultimately
the vendor driver is either open source and we can inspect it for such
exploits or it's closed source, taints the kernel, and we hope for the
best.  It might make a good unit test to perform substantially sized
reads/writes to the mdev device.  Perhaps the only sanity test we can
make in the core is to verify the access doesn't exceed the size of
the region as advertised by the vendor driver.  Thanks,

Alex
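
A minimal sketch of what such a core-level check could look like
(illustrative only, not part of the posted series: it assumes the parent
read callback takes the __user pointer directly, as suggested above, and
that vfio_mdev caches a region table filled in from the vendor driver's
region-info callback; the regions/nr_regions members are made up for
this example):

/*
 * Illustrative sketch: bounds-check the access against the regions
 * advertised by the vendor driver, then pass the __user buffer through
 * untouched.  The core allocates no kernel buffer and does no chunking.
 */
static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
			      size_t count, loff_t *ppos)
{
	struct vfio_mdev *vmdev = device_data;
	struct mdev_device *mdev = vmdev->mdev;
	struct parent_device *parent = mdev->parent;
	u64 off = *ppos;
	int i;

	if (!parent->ops->read)
		return -EINVAL;

	/* Only sanity test: the access must fit within a single region. */
	for (i = 0; i < vmdev->nr_regions; i++) {
		struct vfio_region_info *info = &vmdev->regions[i];

		if (off >= info->offset &&
		    off - info->offset < info->size &&
		    count <= info->size - (off - info->offset))
			return parent->ops->read(mdev, buf, count, ppos);
	}

	return -EINVAL;
}

Note that even this check only works if the core remembers the region
layout advertised by the vendor driver, which is exactly the trade-off
Jike raises further down in the thread.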

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-19 18:11                     ` [Qemu-devel] " Alex Williamson
@ 2016-09-19 20:09                       ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-19 20:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jike Song, pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, bjsdjshi



On 9/19/2016 11:41 PM, Alex Williamson wrote:
> On Mon, 19 Sep 2016 22:59:34 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/12/2016 9:23 PM, Alex Williamson wrote:
>>> On Mon, 12 Sep 2016 13:19:11 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 9/12/2016 10:40 AM, Jike Song wrote:  
>>>>> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:    
>>>>>> On 9/10/2016 12:12 AM, Alex Williamson wrote:    
>>>>>>> On Fri, 9 Sep 2016 23:18:45 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>    
>>>>>>>>>> +
>>>>>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>>>>>>>> +{
>>>>>>>>>> +	struct parent_device *parent;
>>>>>>>>>> +
>>>>>>>>>> +	mutex_lock(&parent_list_lock);
>>>>>>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
>>>>>>>>>> +	mutex_unlock(&parent_list_lock);
>>>>>>>>>> +
>>>>>>>>>> +	return parent;
>>>>>>>>>> +}      
>>>>>>>>>
>>>>>>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
>>>>>>>>> as long as you have an independent device associated with the mdev host device
>>>>>>>>> ("parent" device here).
>>>>>>>>>      
>>>>>>>>
>>>>>>>> I don't think every lock will go away with that. This also changes how
>>>>>>>> mdev devices entries are created in sysfs. It adds an extra directory.    
>>>>>>>
>>>>>>> Exposing the parent-child relationship through sysfs is a desirable
>>>>>>> feature, so I'm not sure how this is a negative.  This part of Jike's
>>>>>>> conversion was a big improvement, I thought.  Thanks,
>>>>>>>    
>>>>>>
>>>>>> Jike's suggestion is to introduced a fake device over parent device i.e.
>>>>>> mdev-host, and then all mdev devices are children of 'mdev-host' not
>>>>>> children of real parent.
>>>>>>    
>>>>>
>>>>> It really depends on how you define 'real parent' :)
>>>>>
>>>>> With a physical-host-mdev hierarchy, the parent of mdev devices is the host
>>>>> device, the parent of host device is the physical device. e.g.
>>>>>
>>>>>         pdev            mdev_host       mdev_device
>>>>>         dev<------------dev<------------dev
>>>>>               parent          parent
>>>>>
>>>>>         Figure 1: device hierarchy
>>>>>     
>>>>
>>>> Right, mdev-host device doesn't represent physical device nor any mdev
>>>> device. Then what is the need of such device?  
>>>
>>> Is there anything implicitly wrong with using a device node to host the
>>> mdev child devices?  Is the argument against it only that it's
>>> unnecessary?  Can we make use of the device-core parent/child
>>> dependencies as Jike has done w/o that extra node?
>>>  
>>
>> I do feel that mdev core module would get simplified with the new sysfs
>> interface without having extra node.
> 
> Can you provide an example of why that is?
>  

The 'online' (earlier 'start'/'stop') interface is removed, and open()
and close() callbacks would be added instead. The ref count in struct
mdev_device and mdev_get_device()/mdev_put_device() were only added for
that interface, so they would go away.
Using the device-core parent/child dependency between the physical
device and the mdev device, other functions would get simplified as well.
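
A rough sketch of that direction (illustrative only; the open()/close()
members and the trimmed member list are assumptions for this example,
not a quote of the v7 interface):

/*
 * Illustrative sketch: a parent_ops where the per-mdev 'online' sysfs
 * attribute is replaced by open()/close() callbacks driven from the
 * VFIO open/release path, so the core no longer needs its own mdev
 * reference counting for that purpose.
 */
#include <linux/compiler.h>
#include <linux/types.h>

struct mdev_device;
struct vm_area_struct;

struct parent_ops {
	int	(*create)(struct mdev_device *mdev, char *mdev_params);
	int	(*destroy)(struct mdev_device *mdev);

	/* would replace the 'online' (earlier start/stop) interface */
	int	(*open)(struct mdev_device *mdev);
	void	(*close)(struct mdev_device *mdev);

	ssize_t	(*read)(struct mdev_device *mdev, char __user *buf,
			size_t count, loff_t *ppos);
	ssize_t	(*write)(struct mdev_device *mdev, const char __user *buf,
			 size_t count, loff_t *ppos);
	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
};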

>>>>>> For example, directory structure we have now is:
>>>>>> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
>>>>>>
>>>>>> mdev devices are in real parents directory.
>>>>>>
>>>>>> By introducing fake device it would be:
>>>>>> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
>>>>>>
>>>>>> mdev devices are in fake device's directory.
>>>>>>    
>>>>>
>>>>> Yes, this is the wanted directory.
>>>>>     
>>>>
>>>> I don't think so.  
>>>
>>> Why?
>>>   
>>
>> This directory is not mandatory. right?
> 
> Clearly you've done an implementation without it, so it's not
> functionally mandatory, but Jike has made significant code reduction
> and simplification with it.  Those are desirable things.
> 
>>>>>> Lock would be still required, to handle the race conditions like
>>>>>> 'mdev_create' is still in process and parent device is unregistered by
>>>>>> vendor driver/ parent device is unbind from vendor driver.
>>>>>>    
>>>>>
>>>>> locks are provided to protect resources, would you elaborate more on
>>>>> what is the exact resource you want to protect by a lock in mdev_create?
>>>>>     
>>>>
>>>> Simple, in your suggestion mdev-host device. Fake device will go away if
>>>> vendor driver unregisters the device from mdev module, right.  
>>>
>>> I don't follow the reply here, but aiui there's ordering implicit in
>>> the device core that Jike is trying to take advantage of that
>>> simplifies the mdev layer significantly.  In the case of an
>>> mdev_create, the device core needs to take a reference to the parent
>>> object, the mdev-host I'd guess in Jike's version, the created mdev
>>> device would also have a reference to that object, so the physical host
>>> device could not be removed so long as there are outstanding
>>> references.  It's just a matter of managing references and acquiring
>>> and releasing objects.  Thanks,
>>>  
>>
>> I do think this could be simplified without having extra node.
> 
> The simplification is really what I'm after, whether or not it includes
> an extra device node is not something I'm sure I have an opinion on
> yet.  Aren't we really just talking about an extra sysfs directory
> node?

No, not just an extra sysfs directory. I'm trying to keep the
parent/child relationship between the physical device and the mdev
device direct, without an extra device node.

> Doesn't it make it easier for userspace to efficiently identify
> all the mdev children when they're segregated from the other attributes
> and sub-nodes of a parent device?
>  

Physical devices are generally leaf nodes. I think it's easier to find
all mdev children directly in the physical device's own directory.

>>> the created mdev
>>> device would also have a reference to that object, so the physical host
>>> device could not be removed so long as there are outstanding
>>> references.  
>>
>> Yes, this is also true when physical device is direct parent of mdev
>> device. mdev device keeps reference of parent, so physical host device
>> could not be removed as long as mdev devices are present. That is why
>> from mdev_unregister_device() a chance is given to free all child mdev
>> devices.
> 
> But why aren't we using the device core do do that?  It seems like
> we're getting hung up on this device node, which is more of a stylistic
> and sysfs layout issue when the primary comment is to simplify the mdev
> infrastructure by taking more advantage of the parent/child
> dependencies of the device core.  Thanks,
> 

Definitely, we would use the device-core parent/child dependency and
simplify the mdev framework without the extra device node.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-09-19 20:09                       ` [Qemu-devel] " Kirti Wankhede
  (?)
@ 2016-09-19 20:59                       ` Alex Williamson
  -1 siblings, 0 replies; 162+ messages in thread
From: Alex Williamson @ 2016-09-19 20:59 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Jike Song, cjia, kvm, qemu-devel, kevin.tian, kraxel, pbonzini, bjsdjshi

On Tue, 20 Sep 2016 01:39:19 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/19/2016 11:41 PM, Alex Williamson wrote:
> > On Mon, 19 Sep 2016 22:59:34 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/12/2016 9:23 PM, Alex Williamson wrote:  
> >>> On Mon, 12 Sep 2016 13:19:11 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 9/12/2016 10:40 AM, Jike Song wrote:    
> >>>>> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:      
> >>>>>> On 9/10/2016 12:12 AM, Alex Williamson wrote:      
> >>>>>>> On Fri, 9 Sep 2016 23:18:45 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>      
> >>>>>>>>>> +
> >>>>>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
> >>>>>>>>>> +{
> >>>>>>>>>> +	struct parent_device *parent;
> >>>>>>>>>> +
> >>>>>>>>>> +	mutex_lock(&parent_list_lock);
> >>>>>>>>>> +	parent = mdev_get_parent(__find_parent_device(dev));
> >>>>>>>>>> +	mutex_unlock(&parent_list_lock);
> >>>>>>>>>> +
> >>>>>>>>>> +	return parent;
> >>>>>>>>>> +}        
> >>>>>>>>>
> >>>>>>>>> As we have demonstrated, all these refs and locks and release workqueue are not necessary,
> >>>>>>>>> as long as you have an independent device associated with the mdev host device
> >>>>>>>>> ("parent" device here).
> >>>>>>>>>        
> >>>>>>>>
> >>>>>>>> I don't think every lock will go away with that. This also changes how
> >>>>>>>> mdev devices entries are created in sysfs. It adds an extra directory.      
> >>>>>>>
> >>>>>>> Exposing the parent-child relationship through sysfs is a desirable
> >>>>>>> feature, so I'm not sure how this is a negative.  This part of Jike's
> >>>>>>> conversion was a big improvement, I thought.  Thanks,
> >>>>>>>      
> >>>>>>
> >>>>>> Jike's suggestion is to introduced a fake device over parent device i.e.
> >>>>>> mdev-host, and then all mdev devices are children of 'mdev-host' not
> >>>>>> children of real parent.
> >>>>>>      
> >>>>>
> >>>>> It really depends on how you define 'real parent' :)
> >>>>>
> >>>>> With a physical-host-mdev hierarchy, the parent of mdev devices is the host
> >>>>> device, the parent of host device is the physical device. e.g.
> >>>>>
> >>>>>         pdev            mdev_host       mdev_device
> >>>>>         dev<------------dev<------------dev
> >>>>>               parent          parent
> >>>>>
> >>>>>         Figure 1: device hierarchy
> >>>>>       
> >>>>
> >>>> Right, mdev-host device doesn't represent physical device nor any mdev
> >>>> device. Then what is the need of such device?    
> >>>
> >>> Is there anything implicitly wrong with using a device node to host the
> >>> mdev child devices?  Is the argument against it only that it's
> >>> unnecessary?  Can we make use of the device-core parent/child
> >>> dependencies as Jike has done w/o that extra node?
> >>>    
> >>
> >> I do feel that mdev core module would get simplified with the new sysfs
> >> interface without having extra node.  
> > 
> > Can you provide an example of why that is?
> >    
> 
> 'online' or earlier 'start'/'stop' interface is removed and would add
> open() and close() callbacks. ref count from struct mdev_device and
> mdev_get_device()/mdev_put_device() were added for this interface, these
> would go away.
> Using device-core parent/child dependencies between physical device and
> mdev device, other functions would get simplified.

Yes, we've refined the interface over time and I'm glad that you're
incorporating the device-core simplifications, but this doesn't really
argue for or against the extra device node.
 
> >>>>>> For example, directory structure we have now is:
> >>>>>> /sys/bus/pci/devices/0000\:85\:00.0/<mdev_device>
> >>>>>>
> >>>>>> mdev devices are in real parents directory.
> >>>>>>
> >>>>>> By introducing fake device it would be:
> >>>>>> /sys/bus/pci/devices/0000\:85\:00.0/mdev-host/<mdev_device>
> >>>>>>
> >>>>>> mdev devices are in fake device's directory.
> >>>>>>      
> >>>>>
> >>>>> Yes, this is the wanted directory.
> >>>>>       
> >>>>
> >>>> I don't think so.    
> >>>
> >>> Why?
> >>>     
> >>
> >> This directory is not mandatory. right?  
> > 
> > Clearly you've done an implementation without it, so it's not
> > functionally mandatory, but Jike has made significant code reduction
> > and simplification with it.  Those are desirable things.
> >   
> >>>>>> Lock would be still required, to handle the race conditions like
> >>>>>> 'mdev_create' is still in process and parent device is unregistered by
> >>>>>> vendor driver/ parent device is unbind from vendor driver.
> >>>>>>      
> >>>>>
> >>>>> locks are provided to protect resources, would you elaborate more on
> >>>>> what is the exact resource you want to protect by a lock in mdev_create?
> >>>>>       
> >>>>
> >>>> Simple, in your suggestion mdev-host device. Fake device will go away if
> >>>> vendor driver unregisters the device from mdev module, right.    
> >>>
> >>> I don't follow the reply here, but aiui there's ordering implicit in
> >>> the device core that Jike is trying to take advantage of that
> >>> simplifies the mdev layer significantly.  In the case of an
> >>> mdev_create, the device core needs to take a reference to the parent
> >>> object, the mdev-host I'd guess in Jike's version, the created mdev
> >>> device would also have a reference to that object, so the physical host
> >>> device could not be removed so long as there are outstanding
> >>> references.  It's just a matter of managing references and acquiring
> >>> and releasing objects.  Thanks,
> >>>    
> >>
> >> I do think this could be simplified without having extra node.  
> > 
> > The simplification is really what I'm after, whether or not it includes
> > an extra device node is not something I'm sure I have an opinion on
> > yet.  Aren't we really just talking about an extra sysfs directory
> > node?  
> 
> No, not just extra sysfs directory. I'm trying to keep parent/child
> relationship between physical device and mdev device direct without
> having extra device node.

But it's just a difference of whether the mdev devices are rooted at
the parent itself or a child of the parent.  It's just a slight
organizational change, isn't it?

> > Doesn't it make it easier for userspace to efficiently identify
> > all the mdev children when they're segregated from the other attributes
> > and sub-nodes of a parent device?
> >    
> 
> Physical devices are generally leaf nodes. I think its easier to find
> all mdev children in its own directory.

So we're just going to drop a bunch of UUID links in the parent device
sysfs directory, and userspace needs to follow each link to figure out
what they are?  Doesn't that seem messy?  SR-IOV VFs are prefixed with
"virtfn" in the parent directory; putting them under an mdev-host
device is just another way to implement that organization.

> >>> the created mdev
> >>> device would also have a reference to that object, so the physical host
> >>> device could not be removed so long as there are outstanding
> >>> references.    
> >>
> >> Yes, this is also true when physical device is direct parent of mdev
> >> device. mdev device keeps reference of parent, so physical host device
> >> could not be removed as long as mdev devices are present. That is why
> >> from mdev_unregister_device() a chance is given to free all child mdev
> >> devices.  
> > 
> > But why aren't we using the device core do do that?  It seems like
> > we're getting hung up on this device node, which is more of a stylistic
> > and sysfs layout issue when the primary comment is to simplify the mdev
> > infrastructure by taking more advantage of the parent/child
> > dependencies of the device core.  Thanks,
> >   
> 
> Definitely we would use device core parent/child dependency and simplify
> mdev framework without having extra device node.

That's great, I wasn't sure if you were rejecting the whole
parent/child aspect or just the extra node.  Still, I don't see the
mdev-host node as an outright bad idea, it seems like just a way to
classify a set of child devices.  In the SR-IOV case the VFs are actual
PCI devices spawned from the parent device, so making them direct
children of the PF might make sense.  Here we're going through this
mdev core interface to create mdev devices and an mdev-host device also
seems like a realistic representation of that.  In any case, it seems
like a fairly minor tweak to add or remove it, we can argue one way or
the other on the next draft.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-19 20:03               ` Alex Williamson
  (?)
@ 2016-09-20  2:50               ` Jike Song
  2016-09-20 16:24                 ` Alex Williamson
  -1 siblings, 1 reply; 162+ messages in thread
From: Jike Song @ 2016-09-20  2:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, kraxel, Dong Jia, kevin.tian, cjia, kvm,
	qemu-devel, pbonzini

On 09/20/2016 04:03 AM, Alex Williamson wrote:
> On Tue, 20 Sep 2016 00:43:15 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/20/2016 12:06 AM, Alex Williamson wrote:
>>> On Mon, 19 Sep 2016 23:52:36 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 8/26/2016 7:43 PM, Kirti Wankhede wrote:  
>>>>> * PGP Signed: 08/26/2016 at 07:15:44 AM, Decrypted
>>>>> On 8/25/2016 2:52 PM, Dong Jia wrote:    
>>>>>> On Thu, 25 Aug 2016 09:23:53 +0530    
>>>>  
>>>>>>> +
>>>>>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>>>>>>> +			      size_t count, loff_t *ppos)
>>>>>>> +{
>>>>>>> +	struct vfio_mdev *vmdev = device_data;
>>>>>>> +	struct mdev_device *mdev = vmdev->mdev;
>>>>>>> +	struct parent_device *parent = mdev->parent;
>>>>>>> +	unsigned int done = 0;
>>>>>>> +	int ret;
>>>>>>> +
>>>>>>> +	if (!parent->ops->read)
>>>>>>> +		return -EINVAL;
>>>>>>> +
>>>>>>> +	while (count) {    
>>>>>> Here, I have to say sorry to you guys for that I didn't notice the
>>>>>> bad impact of this change to my patches during the v6 discussion.
>>>>>>
>>>>>> For vfio-ccw, I introduced an I/O region to input/output I/O
>>>>>> instruction parameters and results for Qemu. The @count of these data
>>>>>> currently is 140. So supporting arbitrary lengths in one shot here, and
>>>>>> also in vfio_mdev_write, seems the better option for this case.
>>>>>>
>>>>>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
>>>>>> can do that in the parent read/write callbacks instead.
>>>>>>
>>>>>> What do you think?
>>>>>>    
>>>>>
>>>>> I would like to know Alex's thought on this. He raised concern with this
>>>>> approach in v6 reviews:
>>>>> "But I think this is exploitable, it lets the user make the kernel
>>>>> allocate an arbitrarily sized buffer."
>>>>>     
>>>>
>>>> Read/write callbacks are for slow path, emulation of mmio region which
>>>> are mainly device registers. I do feel it shouldn't support arbitrary
>>>> lengths.
>>>> Alex, I would like to know your thoughts.  
>>>
>>> The exploit was that the mdev layer allocated a buffer and copied the
>>> entire user buffer into kernel space before passing it to the vendor
>>> driver.  The solution is to pass the __user *buf to the vendor driver
>>> and let them sanitize and split it however makes sense for their
>>> device.  We shouldn't be assuming naturally aligned, up to dword
>>> accesses in the generic mdev layers.  Those sorts of semantics are
>>> defined by the device type.  This is another case where making
>>> the mdev layer as thin as possible is actually the best thing to
>>> do to make it really device type agnostic.  Thanks,
>>>   
>>
>>
>> Alex,
>>
>> These were the comments on v6 patch:
>>
>>>>> Do we really need to support arbitrary lengths in one shot?  Seems
>>>>> like
>>>>> we could just use a 4 or 8 byte variable on the stack and iterate
>>>>> until
>>>>> done.
>>>>>  
>>>>
>>>> We just want to pass the arguments to vendor driver as is here.Vendor
>>>> driver could take care of that.  
>>
>>> But I think this is exploitable, it lets the user make the kernel
>>> allocate an arbitrarily sized buffer.  
>>
>> As per above discussion in v7 version, this module don't allocated
>> memory from heap.
>>
>> If vendor driver allocates arbitrary memory in kernel space through mdev
>> module interface, isn't that would be exploit?
> 
> Yep, my 4-8/byte chunking idea was too PCI specific.  If a vendor
> driver introduces an exploit, that's a bug in the vendor driver.  I'm
> not sure if we can or should attempt to guard against that.  Ultimately
> the vendor driver is either open source and we can inspect it for such
> exploits or it's closed source, taints the kernel, and we hope for the
> best.  It might make a good unit test to perform substantially sized
> reads/writes to the mdev device.

Can't agree more! :-)

> Perhaps the only sanity test we can
> make in the core is to verify the access doesn't exceed the size of
> the region as advertised by the vendor driver.  Thanks,

Even performing a lightweight sanity check would require vfio-mdev
to be able to decode the ppos into a particular region, which means
information about all the regions would have to be stored in the
framework. I guess that is not your preferred way :)

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-13  2:35         ` [Qemu-devel] " Jike Song
  (?)
@ 2016-09-20  5:48         ` Dong Jia Shi
  -1 siblings, 0 replies; 162+ messages in thread
From: Dong Jia Shi @ 2016-09-20  5:48 UTC (permalink / raw)
  To: Jike Song
  Cc: kevin.tian, cjia, kvm, Kirti Wankhede, qemu-devel,
	alex.williamson, kraxel, pbonzini, Dong Jia

* Jike Song <jike.song@intel.com> [2016-09-13 10:35:11 +0800]:

> On 09/08/2016 10:45 AM, Jike Song wrote:
> > On 08/25/2016 05:22 PM, Dong Jia wrote:
> >> On Thu, 25 Aug 2016 09:23:53 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>
> >> [...]
> >>
> >> Dear Kirti,
> >>
> >> I just rebased my vfio-ccw patches to this series.
> >> With a little fix, which was pointed it out in my reply to the #3
> >> patch, it works fine.
> >>
> > 
> > Hi Jia,
> > 
> > Sorry I didn't follow a lot in previous discussion, but since
> > vfio-mdev in v7 patchset is at least PCI-agnostic, would you share
> > with us why you still need a vfio-ccw?
> 
> Kind ping :)
> 
> 
> Hi Dong Jia,
> 
> Since Kirti has confirmed that in v7 it is designed to have only one
> vfio-mdev driver for all mdev devices, would you please tell us the
> reason of your vfio-ccw? It could possibly be an architectural gap and
> the earlier we discuss it the better :)

Hi Jike,

Sorry for the late response.

I think you may be mixing up vfio-ccw with vfio-mccw, which is at the
same level as vfio-mpci in v6 (or vfio-mdev in v7). :>

The term vfio-ccw is at the same semantic level as vfio-pci. You can
think of vfio-ccw as the parent device driver in the ccw case.

As you mentioned above, v7 provides a universal mdev driver (vfio-mdev)
for the mediated devices. So I don't need to provide a vfio-mccw (a
vendor-specific mediated device driver) anymore, but I'd still need my
vfio-ccw (the parent device driver).
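
To make the split concrete, here is a sketch (illustrative only; the
ccw-side function names and the probe/remove wiring are assumptions, not
the actual vfio-ccw code) of how a parent device driver registers the
physical device with the mdev core, while the generic vfio-mdev driver
takes care of the mediated devices created under it:

/*
 * Illustrative sketch: vfio-ccw acts as the parent device driver and
 * registers the physical device with the mdev core; vfio-mdev then
 * binds to whatever mdev devices get created for it.
 */
static int ccw_mdev_create(struct mdev_device *mdev, char *mdev_params)
{
	/* allocate per-mdev state and wire it up to the subchannel */
	return 0;
}

static int ccw_mdev_destroy(struct mdev_device *mdev)
{
	/* release per-mdev state */
	return 0;
}

static const struct parent_ops ccw_parent_ops = {
	.create	 = ccw_mdev_create,
	.destroy = ccw_mdev_destroy,
	/* read/write would relay the I/O-region accesses to channel I/O */
};

static int vfio_ccw_probe(struct device *dev)
{
	/* the parent driver registers the physical device with mdev core */
	return mdev_register_device(dev, &ccw_parent_ops);
}

static void vfio_ccw_remove(struct device *dev)
{
	mdev_unregister_device(dev);
}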

> 
> --
> Thanks,
> Jike
> 

...snip...

-- 
Dong Jia

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-20  5:48         ` [Qemu-devel] " Dong Jia Shi
@ 2016-09-20  6:37             ` Jike Song
  0 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-20  6:37 UTC (permalink / raw)
  To: Dong Jia, Kirti Wankhede, alex.williamson, pbonzini, kraxel,
	cjia, qemu-devel, kvm, kevin.tian


On 09/20/2016 01:48 PM, Dong Jia Shi wrote:
> * Jike Song <jike.song@intel.com> [2016-09-13 10:35:11 +0800]:
> 
>> On 09/08/2016 10:45 AM, Jike Song wrote:
>>> On 08/25/2016 05:22 PM, Dong Jia wrote:
>>>> On Thu, 25 Aug 2016 09:23:53 +0530
>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>
>>>> [...]
>>>>
>>>> Dear Kirti,
>>>>
>>>> I just rebased my vfio-ccw patches to this series.
>>>> With a little fix, which was pointed it out in my reply to the #3
>>>> patch, it works fine.
>>>>
>>>
>>> Hi Jia,
>>>
>>> Sorry I didn't follow a lot in previous discussion, but since
>>> vfio-mdev in v7 patchset is at least PCI-agnostic, would you share
>>> with us why you still need a vfio-ccw?
>>
>> Kind ping :)
>>
>>
>> Hi Dong Jia,
>>
>> Since Kirti has confirmed that in v7 it is designed to have only one
>> vfio-mdev driver for all mdev devices, would you please tell us the
>> reason of your vfio-ccw? It could possibly be an architectural gap and
>> the earlier we discuss it the better :)
> 
> Hi Jike,
> 
> Sorry for the late response.
> 

NP, thanks for the info :)

> I think you may mix up vfio-ccw with vfio-mccw, which is in the same
> level with vfio-mpci in v6 (or vfio-mdev in v7). :>
> 
> The term of vfio-ccw is in the same semantic level with vfio-pci. You
> can understand vfio-ccw as the parent device driver in the ccw case.
> 

So 'vfio-ccw' is actually the driver of the physical device.

> As you mentioned above, v7 provides an universal mdev driver (vfio-mdev)
> for the mediated devices. So I don't need to provide a vfio-mccw (a
> vendor specific mediated device driver) anymore, but I'd still need my
> vfio-ccw (the parent device driver).

Glad to know that vfio-mdev works for all mdev devices :)

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 1/4] vfio: Mediated device Core driver
  2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-20 12:48     ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-20 12:48 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi

On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
/* {snip} */

To show a straightforward way of introducing an independent struct
device for the middle layer between the physical device and the mdev
devices (the parent device in Kirti's patch; we changed it to mdev_host
since 'parent' is too generic and potentially misleading), and how it
makes the whole thing simpler, here is the incremental patch against
Kirti's version 7. It is exactly the same as the standalone version
sent out on Sep 02.

This is only for demonstration. The sysfs interface changes are kept,
although there have been lots of discussions about them since.

--
Thanks,
Jike




diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 4a23c13..7c70753 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,4 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
-obj-$(CONFIG_VFIO_MDEV) += mdev/
+obj-$(CONFIG_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 703abd0..d25439f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -1,5 +1,5 @@
 
-config VFIO_MDEV
+config MDEV
     tristate "Mediated device driver framework"
     depends on VFIO
     default n
@@ -8,11 +8,3 @@ config VFIO_MDEV
 	See Documentation/vfio-mediated-device.txt for more details.
 
         If you don't know what do here, say N.
-
-config VFIO_MDEV_DEVICE
-    tristate "VFIO support for Mediated devices"
-    depends on VFIO && VFIO_MDEV
-    default n
-    help
-        VFIO based driver for mediated devices.
-
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index e5087ed..8bd78b5 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -1,6 +1,4 @@
 
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
-obj-$(CONFIG_VFIO_MDEV) += mdev.o
-obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
-
+obj-$(CONFIG_MDEV) += mdev.o
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 9f278c7..cb27ccf 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -5,6 +5,11 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ * Author:
+ *	Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ *	Jike Song <jike.song@intel.com>
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -23,316 +28,144 @@
 
 #include "mdev_private.h"
 
-#define DRIVER_VERSION		"0.1"
+#define DRIVER_VERSION		"0.2"
 #define DRIVER_AUTHOR		"NVIDIA Corporation"
-#define DRIVER_DESC		"Mediated device Core Driver"
-
-static LIST_HEAD(parent_list);
-static DEFINE_MUTEX(parent_list_lock);
-
-static int mdev_add_attribute_group(struct device *dev,
-				    const struct attribute_group **groups)
-{
-	return sysfs_create_groups(&dev->kobj, groups);
-}
-
-static void mdev_remove_attribute_group(struct device *dev,
-					const struct attribute_group **groups)
-{
-	sysfs_remove_groups(&dev->kobj, groups);
-}
-
-/* Should be called holding parent->mdev_list_lock */
-static struct mdev_device *__find_mdev_device(struct parent_device *parent,
-					      uuid_le uuid)
-{
-	struct mdev_device *mdev;
-
-	list_for_each_entry(mdev, &parent->mdev_list, next) {
-		if (uuid_le_cmp(mdev->uuid, uuid) == 0)
-			return mdev;
-	}
-	return NULL;
-}
-
-/* Should be called holding parent_list_lock */
-static struct parent_device *__find_parent_device(struct device *dev)
-{
-	struct parent_device *parent;
-
-	list_for_each_entry(parent, &parent_list, next) {
-		if (parent->dev == dev)
-			return parent;
-	}
-	return NULL;
-}
+#define DRIVER_DESC		"Mediated Device Core Driver"
 
-static void mdev_release_parent(struct kref *kref)
-{
-	struct parent_device *parent = container_of(kref, struct parent_device,
-						    ref);
-	kfree(parent);
-}
 
-static
-inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+static int __find_mdev_device(struct device *dev, void *data)
 {
-	if (parent)
-		kref_get(&parent->ref);
-
-	return parent;
-}
+	struct mdev_device *mdev = dev_to_mdev(dev);
 
-static inline void mdev_put_parent(struct parent_device *parent)
-{
-	if (parent)
-		kref_put(&parent->ref, mdev_release_parent);
+	return (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0);
 }
 
-static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
+static struct mdev_device *find_mdev_device(struct mdev_host *host,
+					    uuid_le uuid)
 {
-	struct parent_device *parent;
+	struct device *dev;
 
-	mutex_lock(&parent_list_lock);
-	parent = mdev_get_parent(__find_parent_device(dev));
-	mutex_unlock(&parent_list_lock);
+	dev = device_find_child(&host->dev, &uuid, __find_mdev_device);
+	if (!dev)
+		return NULL;
 
-	return parent;
+	return dev_to_mdev(dev);
 }
 
 static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
 {
-	struct parent_device *parent = mdev->parent;
-	int ret;
-
-	ret = parent->ops->create(mdev, mdev_params);
-	if (ret)
-		return ret;
-
-	ret = mdev_add_attribute_group(&mdev->dev,
-					parent->ops->mdev_attr_groups);
-	if (ret)
-		parent->ops->destroy(mdev);
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
 
-	return ret;
+	return host->ops->create(mdev, mdev_params);
 }
 
-static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+static void mdev_device_destroy_ops(struct mdev_device *mdev)
 {
-	struct parent_device *parent = mdev->parent;
-	int ret = 0;
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
 
-	/*
-	 * If vendor driver doesn't return success that means vendor
-	 * driver doesn't support hot-unplug
-	 */
-	ret = parent->ops->destroy(mdev);
-	if (ret && !force)
-		return -EBUSY;
-
-	mdev_remove_attribute_group(&mdev->dev,
-				    parent->ops->mdev_attr_groups);
-
-	return ret;
-}
-
-static void mdev_release_device(struct kref *kref)
-{
-	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
-	struct parent_device *parent = mdev->parent;
-
-	list_del(&mdev->next);
-
-	/*
-	 * This unlock pairs with mutex held by mdev_put_device() through
-	 * kref_put_mutex()
-	 */
-	mutex_unlock(&parent->mdev_list_lock);
-
-	device_unregister(&mdev->dev);
-	wake_up(&parent->release_done);
-	mdev_put_parent(parent);
-}
-
-struct mdev_device *mdev_get_device(struct mdev_device *mdev)
-{
-	if (mdev)
-		kref_get(&mdev->ref);
-	return mdev;
+	host->ops->destroy(mdev);
 }
-EXPORT_SYMBOL(mdev_get_device);
-
-void mdev_put_device(struct mdev_device *mdev)
-{
-	struct parent_device *parent;
-
-	if (!mdev)
-		return;
-
-	parent = mdev->parent;
-	kref_put_mutex(&mdev->ref, mdev_release_device,
-		       &parent->mdev_list_lock);
-}
-EXPORT_SYMBOL(mdev_put_device);
 
 /*
- * mdev_register_device : Register a device
- * @dev: device structure representing parent device.
+ * mdev_register_host_device : register a mdev host device
+ * @dev: device structure of the physical device under which the created
+ *       host device will be.
  * @ops: Parent device operation structure to be registered.
  *
- * Add device to list of registered parent devices.
- * Returns a negative value on error, otherwise 0.
+ * Register a mdev host device as the mediator of mdev devices.
+ * Returns the pointer of mdev host device structure for success, NULL
+ * for errors.
  */
-int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+struct mdev_host *mdev_register_host_device(struct device *pdev,
+				const struct mdev_host_ops *ops)
 {
-	int ret = 0;
-	struct parent_device *parent;
+	int rc = 0;
+	struct mdev_host *host;
 
-	if (!dev || !ops)
-		return -EINVAL;
+	if (!pdev || !ops) {
+		dev_warn(pdev, "dev or ops is NULL\n");
+		return NULL;
+	}
 
 	/* check for mandatory ops */
-	if (!ops->create || !ops->destroy)
-		return -EINVAL;
-
-	mutex_lock(&parent_list_lock);
-
-	/* Check for duplicate */
-	parent = __find_parent_device(dev);
-	if (parent) {
-		ret = -EEXIST;
-		goto add_dev_err;
+	if (!ops->create || !ops->destroy) {
+		dev_warn(pdev, "create and destroy methods are necessary\n");
+		return NULL;
 	}
 
-	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
-	if (!parent) {
-		ret = -ENOMEM;
-		goto add_dev_err;
-	}
+	host = kzalloc(sizeof(*host), GFP_KERNEL);
+	if (!host)
+		return NULL;
 
-	kref_init(&parent->ref);
-	list_add(&parent->next, &parent_list);
+	host->dev.parent = pdev;
+	host->ops = ops;
+	dev_set_name(&host->dev, "mdev-host");
 
-	parent->dev = dev;
-	parent->ops = ops;
-	mutex_init(&parent->mdev_list_lock);
-	INIT_LIST_HEAD(&parent->mdev_list);
-	init_waitqueue_head(&parent->release_done);
-	mutex_unlock(&parent_list_lock);
+	rc = device_register(&host->dev);
+	if (rc)
+		goto register_error;
 
-	ret = parent_create_sysfs_files(dev);
-	if (ret)
+	rc = mdev_create_sysfs_files(&host->dev);
+	if (rc)
 		goto add_sysfs_error;
 
-	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
-	if (ret)
+	rc = sysfs_create_groups(&host->dev.kobj, ops->hdev_attr_groups);
+	if (rc)
 		goto add_group_error;
 
-	dev_info(dev, "MDEV: Registered\n");
-	return 0;
+	dev_info(&host->dev, "mdev host device registered\n");
+	return host;
 
 add_group_error:
-	mdev_remove_sysfs_files(dev);
+	mdev_remove_sysfs_files(&host->dev);
+
 add_sysfs_error:
-	mutex_lock(&parent_list_lock);
-	list_del(&parent->next);
-	mutex_unlock(&parent_list_lock);
-	mdev_put_parent(parent);
-	return ret;
+	device_unregister(&host->dev);
 
-add_dev_err:
-	mutex_unlock(&parent_list_lock);
-	return ret;
+register_error:
+	kfree(host);
+	return NULL;
 }
-EXPORT_SYMBOL(mdev_register_device);
-
-/*
- * mdev_unregister_device : Unregister a parent device
- * @dev: device structure representing parent device.
- *
- * Remove device from list of registered parent devices. Give a chance to free
- * existing mediated devices for given device.
- */
+EXPORT_SYMBOL(mdev_register_host_device);
 
-void mdev_unregister_device(struct device *dev)
+static int __mdev_device_destroy(struct device *dev, void *data)
 {
-	struct parent_device *parent;
-	struct mdev_device *mdev = NULL;
-	int ret;
+	struct mdev_device *mdev = dev_to_mdev(dev);
 
-	mutex_lock(&parent_list_lock);
-	parent = __find_parent_device(dev);
-
-	if (!parent) {
-		mutex_unlock(&parent_list_lock);
-		return;
-	}
-	dev_info(dev, "MDEV: Unregistering\n");
-
-	/*
-	 * Remove parent from the list and remove "mdev_create" and
-	 * "mdev_destroy" sysfs files so that no new mediated device could be
-	 * created for this parent
-	 */
-	list_del(&parent->next);
-	parent_remove_sysfs_files(dev);
-	mutex_unlock(&parent_list_lock);
-
-	mdev_remove_attribute_group(dev,
-				    parent->ops->dev_attr_groups);
-
-	while (!list_empty(&parent->mdev_list)) {
-		mutex_lock(&parent->mdev_list_lock);
-		if (!list_empty(&parent->mdev_list)) {
-			mdev = list_first_entry(&parent->mdev_list,
-						struct mdev_device, next);
-			mdev_device_destroy_ops(mdev, true);
-		}
-		mutex_unlock(&parent->mdev_list_lock);
-
-		if (mdev)
-			mdev_put_device(mdev);
-	}
+	mdev_device_destroy_ops(mdev);
+	device_unregister(&mdev->dev);
 
-	do {
-		ret = wait_event_interruptible_timeout(parent->release_done,
-				list_empty(&parent->mdev_list), HZ * 10);
-		if (ret == -ERESTARTSYS) {
-			dev_warn(dev, "Mediated devices are in use, task"
-				      " \"%s\" (%d) "
-				      "blocked until all are released",
-				      current->comm, task_pid_nr(current));
-		}
-	} while (ret <= 0);
-
-	mdev_put_parent(parent);
+	return 0;
 }
-EXPORT_SYMBOL(mdev_unregister_device);
 
 /*
- * Functions required for mdev_sysfs
+ * mdev_unregister_host_device : unregister a mdev host device
+ * @host: the mdev host device structure
+ *
+ * Unregister a mdev host device as the mediator
  */
-static void mdev_device_release(struct device *dev)
+void mdev_unregister_host_device(struct mdev_host *host)
 {
-	struct mdev_device *mdev = to_mdev_device(dev);
+	if (!host)
+		return;
 
-	dev_dbg(&mdev->dev, "MDEV: destroying\n");
-	kfree(mdev);
+	dev_info(&host->dev, "mdev host device unregistered\n");
+
+	mdev_remove_sysfs_files(&host->dev);
+	sysfs_remove_groups(&host->dev.kobj, host->ops->hdev_attr_groups);
+	device_for_each_child(&host->dev, NULL, __mdev_device_destroy);
+	device_unregister(&host->dev);
 }
+EXPORT_SYMBOL(mdev_unregister_host_device);
 
 int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
 {
 	int ret;
 	struct mdev_device *mdev;
-	struct parent_device *parent;
-
-	parent = mdev_get_parent_from_dev(dev);
-	if (!parent)
-		return -EINVAL;
+	struct mdev_host *host = dev_to_host(dev);
 
-	mutex_lock(&parent->mdev_list_lock);
 	/* Check for duplicate */
-	mdev = __find_mdev_device(parent, uuid);
+	mdev = find_mdev_device(host, uuid);
 	if (mdev) {
 		ret = -EEXIST;
 		goto create_err;
@@ -345,12 +178,10 @@ int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
 	}
 
 	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
-	mdev->parent = parent;
-	kref_init(&mdev->ref);
 
-	mdev->dev.parent  = dev;
-	mdev->dev.bus     = &mdev_bus_type;
-	mdev->dev.release = mdev_device_release;
+	mdev->dev.parent = dev;
+	mdev->dev.bus = &mdev_bus_type;
+	mdev->dev.groups = host->ops->mdev_attr_groups;
 	dev_set_name(&mdev->dev, "%pUl", uuid.b);
 
 	ret = device_register(&mdev->dev);
@@ -363,123 +194,35 @@ int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
 	if (ret)
 		goto create_failed;
 
-	ret = mdev_create_sysfs_files(&mdev->dev);
-	if (ret)
-		goto create_sysfs_error;
-
-	list_add(&mdev->next, &parent->mdev_list);
-	mutex_unlock(&parent->mdev_list_lock);
-
 	dev_dbg(&mdev->dev, "MDEV: created\n");
 
 	return ret;
 
-create_sysfs_error:
-	mdev_device_destroy_ops(mdev, true);
-
 create_failed:
 	device_unregister(&mdev->dev);
 
 create_err:
-	mutex_unlock(&parent->mdev_list_lock);
-	mdev_put_parent(parent);
 	return ret;
 }
 
 int mdev_device_destroy(struct device *dev, uuid_le uuid)
 {
 	struct mdev_device *mdev;
-	struct parent_device *parent;
-	int ret;
+	struct mdev_host *host = dev_to_host(dev);
 
-	parent = mdev_get_parent_from_dev(dev);
-	if (!parent)
+	mdev = find_mdev_device(host, uuid);
+	if (!mdev)
 		return -ENODEV;
 
-	mutex_lock(&parent->mdev_list_lock);
-	mdev = __find_mdev_device(parent, uuid);
-	if (!mdev) {
-		ret = -EINVAL;
-		goto destroy_err;
-	}
-
-	mdev_remove_sysfs_files(&mdev->dev);
-	ret = mdev_device_destroy_ops(mdev, false);
-	if (ret)
-		goto destroy_err;
-
-	mutex_unlock(&parent->mdev_list_lock);
-	mdev_put_device(mdev);
-
-	mdev_put_parent(parent);
-	return ret;
-
-destroy_err:
-	mutex_unlock(&parent->mdev_list_lock);
-	mdev_put_parent(parent);
-	return ret;
+	return __mdev_device_destroy(&mdev->dev, NULL);
 }
 
 void mdev_device_supported_config(struct device *dev, char *str)
 {
-	struct parent_device *parent;
-
-	parent = mdev_get_parent_from_dev(dev);
+	struct mdev_host *host = dev_to_host(dev);
 
-	if (parent) {
-		if (parent->ops->supported_config)
-			parent->ops->supported_config(parent->dev, str);
-		mdev_put_parent(parent);
-	}
-}
-
-int mdev_device_set_online_status(struct device *dev, bool online)
-{
-	int ret = 0;
-	struct mdev_device *mdev;
-	struct parent_device *parent;
-
-	mdev = mdev_get_device(to_mdev_device(dev));
-	if (!mdev)
-		return -EINVAL;
-
-	parent = mdev->parent;
-
-	if (parent->ops->set_online_status)
-		ret = parent->ops->set_online_status(mdev, online);
-
-	if (ret)
-		pr_err("mdev online failed  %d\n", ret);
-	else {
-		if (online)
-			kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
-		else
-			kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
-	}
-
-	mdev_put_device(mdev);
-
-	return ret;
-}
-
-int mdev_device_get_online_status(struct device *dev, bool *online)
-{
-	int ret = 0;
-	struct mdev_device *mdev;
-	struct parent_device *parent;
-
-	mdev = mdev_get_device(to_mdev_device(dev));
-	if (!mdev)
-		return -EINVAL;
-
-	parent = mdev->parent;
-
-	if (parent->ops->get_online_status)
-		ret = parent->ops->get_online_status(mdev, online);
-
-	mdev_put_device(mdev);
-
-	return ret;
+	if (host->ops->supported_config)
+		host->ops->supported_config(&host->dev, str);
 }
 
 static int __init mdev_init(void)
@@ -487,10 +230,8 @@ static int __init mdev_init(void)
 	int ret;
 
 	ret = mdev_bus_register();
-	if (ret) {
-		pr_err("Failed to register mdev bus\n");
-		return ret;
-	}
+	if (ret)
+		pr_err("failed to register mdev bus: %d\n", ret);
 
 	return ret;
 }
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
index 8afc2d8..d298aaf 100644
--- a/drivers/vfio/mdev/mdev_driver.c
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -5,6 +5,8 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -52,7 +54,7 @@ static void mdev_detach_iommu(struct mdev_device *mdev)
 static int mdev_probe(struct device *dev)
 {
 	struct mdev_driver *drv = to_mdev_driver(dev->driver);
-	struct mdev_device *mdev = to_mdev_device(dev);
+	struct mdev_device *mdev = dev_to_mdev(dev);
 	int ret;
 
 	ret = mdev_attach_iommu(mdev);
@@ -73,7 +75,7 @@ static int mdev_probe(struct device *dev)
 static int mdev_remove(struct device *dev)
 {
 	struct mdev_driver *drv = to_mdev_driver(dev->driver);
-	struct mdev_device *mdev = to_mdev_device(dev);
+	struct mdev_device *mdev = dev_to_mdev(dev);
 
 	if (drv && drv->remove)
 		drv->remove(dev);
@@ -83,10 +85,32 @@ static int mdev_remove(struct device *dev)
 	return 0;
 }
 
+static int mdev_online(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+
+	if (drv && drv->online)
+		return drv->online(dev);
+
+	return 0;
+}
+
+static int mdev_offline(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+
+	if (drv && drv->offline)
+		return drv->offline(dev);
+
+	return 0;
+}
+
 struct bus_type mdev_bus_type = {
 	.name		= "mdev",
 	.probe		= mdev_probe,
 	.remove		= mdev_remove,
+	.online		= mdev_online,
+	.offline	= mdev_offline,
 };
 EXPORT_SYMBOL_GPL(mdev_bus_type);
 
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 07ad1b3..f153292 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -5,6 +5,8 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -17,12 +19,6 @@ int  mdev_bus_register(void);
 void mdev_bus_unregister(void);
 
 /* Function prototypes for mdev_sysfs */
-
-extern struct class_attribute mdev_class_attrs[];
-
-int  parent_create_sysfs_files(struct device *dev);
-void parent_remove_sysfs_files(struct device *dev);
-
 int  mdev_create_sysfs_files(struct device *dev);
 void mdev_remove_sysfs_files(struct device *dev);
 
@@ -30,7 +26,4 @@ int  mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params);
 int  mdev_device_destroy(struct device *dev, uuid_le uuid);
 void mdev_device_supported_config(struct device *dev, char *str);
 
-int mdev_device_set_online_status(struct device *dev, bool online);
-int mdev_device_get_online_status(struct device *dev, bool *online);
-
 #endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index ed55cd5..7d55188 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -5,6 +5,10 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ * Author:
+ *	Jike Song <jike.song@intel.com>
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -35,17 +39,17 @@ static ssize_t mdev_destroy_store(struct device *dev,
 				  const char *buf, size_t count);
 static DEVICE_ATTR_WO(mdev_destroy);
 
-static ssize_t online_store(struct device *dev, struct device_attribute *attr,
-			    const char *buf, size_t count);
-static ssize_t online_show(struct device *dev, struct device_attribute *attr,
-			   char *buf);
-static DEVICE_ATTR_RW(online);
+static const struct attribute *mdev_host_attrs[] = {
+	&dev_attr_mdev_supported_types.attr,
+	&dev_attr_mdev_create.attr,
+	&dev_attr_mdev_destroy.attr,
+	NULL,
+};
 
-/* Static functions */
 
 #define SUPPORTED_TYPE_BUFFER_LENGTH	4096
 
-/* mdev sysfs Functions */
+/* mdev host sysfs functions */
 static ssize_t mdev_supported_types_show(struct device *dev,
 					 struct device_attribute *attr,
 					 char *buf)
@@ -70,25 +74,24 @@ static ssize_t mdev_create_store(struct device *dev,
 				 struct device_attribute *attr,
 				 const char *buf, size_t count)
 {
-	char *str, *pstr;
-	char *uuid_str, *mdev_params = NULL, *params = NULL;
+	char *str;
+	char *uuid_str, *params = NULL;
 	uuid_le uuid;
 	int ret;
 
-	pstr = str = kstrndup(buf, count, GFP_KERNEL);
-
+	str = kstrndup(buf, count, GFP_KERNEL);
 	if (!str)
 		return -ENOMEM;
 
 	uuid_str = strsep(&str, ":");
 	if (!uuid_str) {
-		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		pr_err("mdev_create: empty UUID string %s\n", buf);
 		ret = -EINVAL;
 		goto create_error;
 	}
 
 	if (str)
-		params = mdev_params = kstrdup(str, GFP_KERNEL);
+		params = kstrdup(str, GFP_KERNEL);
 
 	ret = uuid_le_to_bin(uuid_str, &uuid);
 	if (ret) {
@@ -96,7 +99,7 @@ static ssize_t mdev_create_store(struct device *dev,
 		goto create_error;
 	}
 
-	ret = mdev_device_create(dev, uuid, mdev_params);
+	ret = mdev_device_create(dev, uuid, params);
 	if (ret)
 		pr_err("mdev_create: Failed to create mdev device\n");
 	else
@@ -104,7 +107,7 @@ static ssize_t mdev_create_store(struct device *dev,
 
 create_error:
 	kfree(params);
-	kfree(pstr);
+	kfree(str);
 	return ret;
 }
 
@@ -112,23 +115,15 @@ static ssize_t mdev_destroy_store(struct device *dev,
 				  struct device_attribute *attr,
 				  const char *buf, size_t count)
 {
-	char *uuid_str, *str, *pstr;
+	char *str;
 	uuid_le uuid;
 	int ret;
 
-	str = pstr = kstrndup(buf, count, GFP_KERNEL);
-
+	str = kstrndup(buf, count, GFP_KERNEL);
 	if (!str)
 		return -ENOMEM;
 
-	uuid_str = strsep(&str, ":");
-	if (!uuid_str) {
-		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
-		ret = -EINVAL;
-		goto destroy_error;
-	}
-
-	ret = uuid_le_to_bin(uuid_str, &uuid);
+	ret = uuid_le_to_bin(str, &uuid);
 	if (ret) {
 		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
 		goto destroy_error;
@@ -139,102 +134,22 @@ static ssize_t mdev_destroy_store(struct device *dev,
 		ret = count;
 
 destroy_error:
-	kfree(pstr);
-	return ret;
-}
-
-static ssize_t online_store(struct device *dev, struct device_attribute *attr,
-			    const char *buf, size_t count)
-{
-	char *str;
-	int ret;
-	uint32_t online_status;
-	bool online;
-
-	str = kstrndup(buf, count, GFP_KERNEL);
-	if (!str)
-		return -ENOMEM;
-
-	ret = kstrtouint(str, 0, &online_status);
 	kfree(str);
-
-	if (ret) {
-		pr_err("online: parsing error %s\n", buf);
-		return ret;
-	}
-
-	online = online_status > 0 ? true : false;
-
-	ret = mdev_device_set_online_status(dev, online);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t online_show(struct device *dev, struct device_attribute *attr,
-			   char *buf)
-{
-	int ret;
-	bool online = false;
-
-	ret = mdev_device_get_online_status(dev, &online);
-	if (ret)
-		return ret;
-
-	ret = sprintf(buf, "%d\n", online);
 	return ret;
 }
 
-int parent_create_sysfs_files(struct device *dev)
-{
-	int ret;
-
-	ret = sysfs_create_file(&dev->kobj,
-				&dev_attr_mdev_supported_types.attr);
-	if (ret) {
-		pr_err("Failed to create mdev_supported_types sysfs entry\n");
-		return ret;
-	}
-
-	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
-	if (ret) {
-		pr_err("Failed to create mdev_create sysfs entry\n");
-		goto create_sysfs_failed;
-	}
-
-	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
-	if (ret) {
-		pr_err("Failed to create mdev_destroy sysfs entry\n");
-		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
-	} else
-		return ret;
-
-create_sysfs_failed:
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
-	return ret;
-}
-
-void parent_remove_sysfs_files(struct device *dev)
-{
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
-}
-
 int mdev_create_sysfs_files(struct device *dev)
 {
 	int ret;
 
-	ret = sysfs_create_file(&dev->kobj, &dev_attr_online.attr);
+	ret = sysfs_create_files(&dev->kobj, mdev_host_attrs);
 	if (ret)
-		pr_err("Failed to create 'online' entry\n");
+		pr_err("sysfs_create_files failed: %d\n", ret);
 
 	return ret;
 }
 
 void mdev_remove_sysfs_files(struct device *dev)
 {
-	sysfs_remove_file(&dev->kobj, &dev_attr_online.attr);
+	sysfs_remove_files(&dev->kobj, mdev_host_attrs);
 }
-
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index babcb72..1236200 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -5,6 +5,8 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -15,65 +17,50 @@
 
 #include <uapi/linux/vfio.h>
 
-struct parent_device;
-
-/*
- * Mediated device
- */
 
+/* mediated device */
 struct mdev_device {
 	struct device		dev;
-	struct parent_device	*parent;
 	struct iommu_group	*group;
 	uuid_le			uuid;
 	void			*driver_data;
-
-	/* internal only */
-	struct kref		ref;
-	struct list_head	next;
 };
 
-
 /**
- * struct parent_ops - Structure to be registered for each parent device to
- * register the device to mdev module.
+ * struct mdev_host_ops - Structure to be registered with the mdev module
+ * for each host device.
  *
  * @owner:		The module owner.
- * @dev_attr_groups:	Default attributes of the parent device.
- * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @hdev_attr_groups:	Default attributes of the host device.
+ * @mdev_attr_groups:	Default attributes of the mdev device.
  * @supported_config:	Called to get information about supported types.
- *			@dev : device structure of parent device.
+ *			@dev : device structure of host device.
  *			@config: should return string listing supported config
  *			Returns integer: success (0) or error (< 0)
- * @create:		Called to allocate basic resources in parent device's
+ * @create:		Called to allocate basic resources in host device's
  *			driver for a particular mediated device. It is
  *			mandatory to provide create ops.
  *			@mdev: mdev_device structure on of mediated device
  *			      that is being created
- *			@mdev_params: extra parameters required by parent
+ *			@mdev_params: extra parameters required by host
  *			device's driver.
  *			Returns integer: success (0) or error (< 0)
- * @destroy:		Called to free resources in parent device's driver for a
- *			a mediated device. It is mandatory to provide destroy
- *			ops.
+ * @destroy:		Called to free resources in host device's driver for a
+ *			mediated device instance. It is mandatory to provide
+ *			destroy ops.
  *			@mdev: mdev_device device structure which is being
- *			       destroyed
+ *				destroyed
  *			Returns integer: success (0) or error (< 0)
  *			If VMM is running and destroy() is called that means the
  *			mdev is being hotunpluged. Return error if VMM is
  *			running and driver doesn't support mediated device
  *			hotplug.
- * @reset:		Called to reset mediated device.
- *			@mdev: mdev_device device structure.
- *			Returns integer: success (0) or error (< 0)
- * @set_online_status:	Called to change to status of mediated device.
- *			@mdev: mediated device.
- *			@online: set true or false to make mdev device online or
- *			offline.
+ * @start:		Called to initiate mediated device initialization
+ *			process in host device's driver before VMM starts.
+ *			@mdev: mediated device structure
  *			Returns integer: success (0) or error (< 0)
- * @get_online_status:	Called to get online/offline status of  mediated device
- *			@mdev: mediated device.
- *			@online: Returns status of mediated device.
+ * @stop:		Called to teardown mediated device related resources
+ *			@mdev: mediated device structure
  *			Returns integer: success (0) or error (< 0)
  * @read:		Read emulation callback
  *			@mdev: mediated device structure
@@ -87,75 +74,47 @@ struct mdev_device {
  *			@count: number of bytes to be written
  *			@pos: address.
  *			Retuns number on bytes written on success or error.
- * @get_irq_info:	Called to retrieve information about mediated device IRQ
- *			@mdev: mediated device structure
- *			@irq_info: VFIO IRQ flags and count.
- *			Returns integer: success (0) or error (< 0)
- * @set_irqs:		Called to send about interrupts configuration
- *			information that VMM sets.
+ * @mmap:		Memory Map
  *			@mdev: mediated device structure
- *			@flags, index, start, count and *data : same as that of
- *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
- * @get_device_info:	Called to get VFIO device information for a mediated
- *			device.
- *			@vfio_device_info: VFIO device info.
- *			Returns integer: success (0) or error (< 0)
- * @get_region_info:	Called to get VFIO region size and flags of mediated
- *			device.
- *			@mdev: mediated device structure
- *			@region_info: output, returns size and flags of
- *				      requested region.
- *			@cap_type_id: returns id of capability.
- *			@cap_type: returns pointer to capability structure
- *			corresponding to capability id.
+ *			@pos: address
+ *			@virtaddr: target user address to start at. Vendor
+ *			driver can change if required.
+ *			@pfn: host address of kernel memory, vendor driver
+ *			can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			driver can change, if required.
  *			Returns integer: success (0) or error (< 0)
  *
- * Parent device that support mediated device should be registered with mdev
- * module with parent_ops structure.
+ * A host device that supports mediated devices should be registered with the
+ * mdev module, along with an mdev_host_ops structure.
  */
-
-struct parent_ops {
-	struct module   *owner;
-	const struct attribute_group **dev_attr_groups;
+struct mdev_host_ops {
+	struct module *owner;
+	const struct attribute_group **hdev_attr_groups;
 	const struct attribute_group **mdev_attr_groups;
 
-	int	(*supported_config)(struct device *dev, char *config);
-	int     (*create)(struct mdev_device *mdev, char *mdev_params);
-	int     (*destroy)(struct mdev_device *mdev);
-	int     (*reset)(struct mdev_device *mdev);
-	int     (*set_online_status)(struct mdev_device *mdev, bool online);
-	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
-	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
-			loff_t pos);
-	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
-			 loff_t pos);
-	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
-	int	(*get_irq_info)(struct mdev_device *mdev,
-				struct vfio_irq_info *irq_info);
-	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
-			    unsigned int index, unsigned int start,
-			    unsigned int count, void *data);
-	int	(*get_device_info)(struct mdev_device *mdev,
-				   struct vfio_device_info *dev_info);
-	int	(*get_region_info)(struct mdev_device *mdev,
-				   struct vfio_region_info *region_info,
-				   u16 *cap_type_id, void **cap_type);
-};
+	int (*supported_config)(struct device *dev, char *config);
+	int (*create)(struct mdev_device *mdev, char *mdev_params);
+	void (*destroy)(struct mdev_device *mdev);
 
-/*
- * Parent Device
- */
+	int (*start)(struct mdev_device *mdev);
+	int (*stop)(struct mdev_device *mdev);
 
-struct parent_device {
-	struct device		*dev;
-	const struct parent_ops	*ops;
+	ssize_t (*read)(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *pos);
+	ssize_t (*write)(struct mdev_device *mdev, const char __user *buf,
+			size_t count, loff_t *pos);
+	int (*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+	long (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
+			unsigned long arg);
+};
 
-	/* internal */
-	struct kref		ref;
-	struct list_head	next;
-	struct list_head	mdev_list;
-	struct mutex		mdev_list_lock;
-	wait_queue_head_t	release_done;
+/* mdev host device */
+struct mdev_host {
+	struct device dev;
+	const struct mdev_host_ops *ops;
 };
 
 /**
@@ -164,25 +123,16 @@ struct parent_device {
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
- *
  **/
 struct mdev_driver {
 	const char *name;
-	int  (*probe)(struct device *dev);
+	int (*probe)(struct device *dev);
 	void (*remove)(struct device *dev);
+	int (*online)(struct device *dev);
+	int (*offline)(struct device *dev);
 	struct device_driver driver;
 };
 
-static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
-{
-	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
-}
-
-static inline struct mdev_device *to_mdev_device(struct device *dev)
-{
-	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
-}
-
 static inline void *mdev_get_drvdata(struct mdev_device *mdev)
 {
 	return mdev->driver_data;
@@ -195,18 +145,15 @@ static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
 
 extern struct bus_type mdev_bus_type;
 
-#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
-
-extern int  mdev_register_device(struct device *dev,
-				 const struct parent_ops *ops);
-extern void mdev_unregister_device(struct device *dev);
-
-extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
-extern void mdev_unregister_driver(struct mdev_driver *drv);
+#define to_mdev_driver(drv) container_of(drv, struct mdev_driver, driver)
+#define dev_to_host(_dev) container_of((_dev), struct mdev_host, dev)
+#define dev_to_mdev(_dev) container_of((_dev), struct mdev_device, dev)
 
-extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
-extern void mdev_put_device(struct mdev_device *mdev);
+struct mdev_host *mdev_register_host_device(struct device *dev,
+				 const struct mdev_host_ops *ops);
+void mdev_unregister_host_device(struct mdev_host *host);
 
-extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+void mdev_unregister_driver(struct mdev_driver *drv);
 
 #endif /* MDEV_H */


* Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver
@ 2016-09-20 12:48     ` Jike Song
  0 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-20 12:48 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi

On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
/* {snip} */

To show a straightforward way of introducing an independent
struct device for the middle layer between the physical device and the
mdev devices (the parent device in Kirti's patch; we renamed it to
mdev_host, since 'parent' is too generic and misleading), and how it
makes the whole thing simpler, here is the incremental patch against
Kirti's version 7. It is exactly the same as the standalone version
sent out on Sep 02.

This is only for demonstration. The sysfs interface changes are kept
as-is, although there has been a lot of discussion about them since.
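
Purely as an illustration (this is not part of the patch), here is a
minimal sketch of how a physical device driver could plug into the new
interface: register an mdev host from its probe path and supply the
mandatory create/destroy callbacks. The my_*() names and the
probe/remove context are made up for the example; only
mdev_register_host_device(), mdev_unregister_host_device() and
struct mdev_host_ops are taken from the patch below.

#include <linux/device.h>
#include <linux/module.h>
#include <linux/mdev.h>

/* Hypothetical per-mdev setup; allocate vendor resources here. */
static int my_mdev_create(struct mdev_device *mdev, char *mdev_params)
{
	return 0;
}

/* Hypothetical teardown; release whatever my_mdev_create() allocated. */
static void my_mdev_destroy(struct mdev_device *mdev)
{
}

static const struct mdev_host_ops my_host_ops = {
	.owner   = THIS_MODULE,
	.create  = my_mdev_create,	/* mandatory */
	.destroy = my_mdev_destroy,	/* mandatory */
};

static struct mdev_host *my_host;

/* Called from the physical device driver's probe routine. */
static int my_register_host(struct device *physdev)
{
	/* creates the "mdev-host" child device and its sysfs files */
	my_host = mdev_register_host_device(physdev, &my_host_ops);
	return my_host ? 0 : -ENODEV;
}

/* Called from the physical device driver's remove routine. */
static void my_unregister_host(void)
{
	/* destroys any remaining mdev children, then removes the host */
	mdev_unregister_host_device(my_host);
}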

--
Thanks,
Jike




diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 4a23c13..7c70753 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,4 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
-obj-$(CONFIG_VFIO_MDEV) += mdev/
+obj-$(CONFIG_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 703abd0..d25439f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -1,5 +1,5 @@
 
-config VFIO_MDEV
+config MDEV
     tristate "Mediated device driver framework"
     depends on VFIO
     default n
@@ -8,11 +8,3 @@ config VFIO_MDEV
 	See Documentation/vfio-mediated-device.txt for more details.
 
         If you don't know what do here, say N.
-
-config VFIO_MDEV_DEVICE
-    tristate "VFIO support for Mediated devices"
-    depends on VFIO && VFIO_MDEV
-    default n
-    help
-        VFIO based driver for mediated devices.
-
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index e5087ed..8bd78b5 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -1,6 +1,4 @@
 
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
-obj-$(CONFIG_VFIO_MDEV) += mdev.o
-obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
-
+obj-$(CONFIG_MDEV) += mdev.o
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 9f278c7..cb27ccf 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -5,6 +5,11 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ * Author:
+ *	Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ *	Jike Song <jike.song@intel.com>
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -23,316 +28,144 @@
 
 #include "mdev_private.h"
 
-#define DRIVER_VERSION		"0.1"
+#define DRIVER_VERSION		"0.2"
 #define DRIVER_AUTHOR		"NVIDIA Corporation"
-#define DRIVER_DESC		"Mediated device Core Driver"
-
-static LIST_HEAD(parent_list);
-static DEFINE_MUTEX(parent_list_lock);
-
-static int mdev_add_attribute_group(struct device *dev,
-				    const struct attribute_group **groups)
-{
-	return sysfs_create_groups(&dev->kobj, groups);
-}
-
-static void mdev_remove_attribute_group(struct device *dev,
-					const struct attribute_group **groups)
-{
-	sysfs_remove_groups(&dev->kobj, groups);
-}
-
-/* Should be called holding parent->mdev_list_lock */
-static struct mdev_device *__find_mdev_device(struct parent_device *parent,
-					      uuid_le uuid)
-{
-	struct mdev_device *mdev;
-
-	list_for_each_entry(mdev, &parent->mdev_list, next) {
-		if (uuid_le_cmp(mdev->uuid, uuid) == 0)
-			return mdev;
-	}
-	return NULL;
-}
-
-/* Should be called holding parent_list_lock */
-static struct parent_device *__find_parent_device(struct device *dev)
-{
-	struct parent_device *parent;
-
-	list_for_each_entry(parent, &parent_list, next) {
-		if (parent->dev == dev)
-			return parent;
-	}
-	return NULL;
-}
+#define DRIVER_DESC		"Mediated Device Core Driver"
 
-static void mdev_release_parent(struct kref *kref)
-{
-	struct parent_device *parent = container_of(kref, struct parent_device,
-						    ref);
-	kfree(parent);
-}
 
-static
-inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+static int __find_mdev_device(struct device *dev, void *data)
 {
-	if (parent)
-		kref_get(&parent->ref);
-
-	return parent;
-}
+	struct mdev_device *mdev = dev_to_mdev(dev);
 
-static inline void mdev_put_parent(struct parent_device *parent)
-{
-	if (parent)
-		kref_put(&parent->ref, mdev_release_parent);
+	return (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0);
 }
 
-static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
+static struct mdev_device *find_mdev_device(struct mdev_host *host,
+					    uuid_le uuid)
 {
-	struct parent_device *parent;
+	struct device *dev;
 
-	mutex_lock(&parent_list_lock);
-	parent = mdev_get_parent(__find_parent_device(dev));
-	mutex_unlock(&parent_list_lock);
+	dev = device_find_child(&host->dev, &uuid, __find_mdev_device);
+	if (!dev)
+		return NULL;
 
-	return parent;
+	return dev_to_mdev(dev);
 }
 
 static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
 {
-	struct parent_device *parent = mdev->parent;
-	int ret;
-
-	ret = parent->ops->create(mdev, mdev_params);
-	if (ret)
-		return ret;
-
-	ret = mdev_add_attribute_group(&mdev->dev,
-					parent->ops->mdev_attr_groups);
-	if (ret)
-		parent->ops->destroy(mdev);
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
 
-	return ret;
+	return host->ops->create(mdev, mdev_params);
 }
 
-static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+static void mdev_device_destroy_ops(struct mdev_device *mdev)
 {
-	struct parent_device *parent = mdev->parent;
-	int ret = 0;
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
 
-	/*
-	 * If vendor driver doesn't return success that means vendor
-	 * driver doesn't support hot-unplug
-	 */
-	ret = parent->ops->destroy(mdev);
-	if (ret && !force)
-		return -EBUSY;
-
-	mdev_remove_attribute_group(&mdev->dev,
-				    parent->ops->mdev_attr_groups);
-
-	return ret;
-}
-
-static void mdev_release_device(struct kref *kref)
-{
-	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
-	struct parent_device *parent = mdev->parent;
-
-	list_del(&mdev->next);
-
-	/*
-	 * This unlock pairs with mutex held by mdev_put_device() through
-	 * kref_put_mutex()
-	 */
-	mutex_unlock(&parent->mdev_list_lock);
-
-	device_unregister(&mdev->dev);
-	wake_up(&parent->release_done);
-	mdev_put_parent(parent);
-}
-
-struct mdev_device *mdev_get_device(struct mdev_device *mdev)
-{
-	if (mdev)
-		kref_get(&mdev->ref);
-	return mdev;
+	host->ops->destroy(mdev);
 }
-EXPORT_SYMBOL(mdev_get_device);
-
-void mdev_put_device(struct mdev_device *mdev)
-{
-	struct parent_device *parent;
-
-	if (!mdev)
-		return;
-
-	parent = mdev->parent;
-	kref_put_mutex(&mdev->ref, mdev_release_device,
-		       &parent->mdev_list_lock);
-}
-EXPORT_SYMBOL(mdev_put_device);
 
 /*
- * mdev_register_device : Register a device
- * @dev: device structure representing parent device.
+ * mdev_register_host_device : register a mdev host device
+ * @dev: device structure of the physical device under which the created
+ *       host device will be.
  * @ops: Parent device operation structure to be registered.
  *
- * Add device to list of registered parent devices.
- * Returns a negative value on error, otherwise 0.
+ * Register a mdev host device as the mediator of mdev devices.
+ * Returns a pointer to the mdev host device structure on success, NULL
+ * on error.
  */
-int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+struct mdev_host *mdev_register_host_device(struct device *pdev,
+				const struct mdev_host_ops *ops)
 {
-	int ret = 0;
-	struct parent_device *parent;
+	int rc = 0;
+	struct mdev_host *host;
 
-	if (!dev || !ops)
-		return -EINVAL;
+	if (!pdev || !ops) {
+		pr_warn("mdev: dev or ops is NULL\n");
+		return NULL;
+	}
 
 	/* check for mandatory ops */
-	if (!ops->create || !ops->destroy)
-		return -EINVAL;
-
-	mutex_lock(&parent_list_lock);
-
-	/* Check for duplicate */
-	parent = __find_parent_device(dev);
-	if (parent) {
-		ret = -EEXIST;
-		goto add_dev_err;
+	if (!ops->create || !ops->destroy) {
+		dev_warn(pdev, "create and destroy methods are necessary\n");
+		return NULL;
 	}
 
-	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
-	if (!parent) {
-		ret = -ENOMEM;
-		goto add_dev_err;
-	}
+	host = kzalloc(sizeof(*host), GFP_KERNEL);
+	if (!host)
+		return NULL;
 
-	kref_init(&parent->ref);
-	list_add(&parent->next, &parent_list);
+	host->dev.parent = pdev;
+	host->ops = ops;
+	dev_set_name(&host->dev, "mdev-host");
 
-	parent->dev = dev;
-	parent->ops = ops;
-	mutex_init(&parent->mdev_list_lock);
-	INIT_LIST_HEAD(&parent->mdev_list);
-	init_waitqueue_head(&parent->release_done);
-	mutex_unlock(&parent_list_lock);
+	rc = device_register(&host->dev);
+	if (rc)
+		goto register_error;
 
-	ret = parent_create_sysfs_files(dev);
-	if (ret)
+	rc = mdev_create_sysfs_files(&host->dev);
+	if (rc)
 		goto add_sysfs_error;
 
-	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
-	if (ret)
+	rc = sysfs_create_groups(&host->dev.kobj, ops->hdev_attr_groups);
+	if (rc)
 		goto add_group_error;
 
-	dev_info(dev, "MDEV: Registered\n");
-	return 0;
+	dev_info(&host->dev, "mdev host device registered\n");
+	return host;
 
 add_group_error:
-	mdev_remove_sysfs_files(dev);
+	mdev_remove_sysfs_files(&host->dev);
+
 add_sysfs_error:
-	mutex_lock(&parent_list_lock);
-	list_del(&parent->next);
-	mutex_unlock(&parent_list_lock);
-	mdev_put_parent(parent);
-	return ret;
+	device_unregister(&host->dev);
 
-add_dev_err:
-	mutex_unlock(&parent_list_lock);
-	return ret;
+register_error:
+	kfree(host);
+	return NULL;
 }
-EXPORT_SYMBOL(mdev_register_device);
-
-/*
- * mdev_unregister_device : Unregister a parent device
- * @dev: device structure representing parent device.
- *
- * Remove device from list of registered parent devices. Give a chance to free
- * existing mediated devices for given device.
- */
+EXPORT_SYMBOL(mdev_register_host_device);
 
-void mdev_unregister_device(struct device *dev)
+static int __mdev_device_destroy(struct device *dev, void *data)
 {
-	struct parent_device *parent;
-	struct mdev_device *mdev = NULL;
-	int ret;
+	struct mdev_device *mdev = dev_to_mdev(dev);
 
-	mutex_lock(&parent_list_lock);
-	parent = __find_parent_device(dev);
-
-	if (!parent) {
-		mutex_unlock(&parent_list_lock);
-		return;
-	}
-	dev_info(dev, "MDEV: Unregistering\n");
-
-	/*
-	 * Remove parent from the list and remove "mdev_create" and
-	 * "mdev_destroy" sysfs files so that no new mediated device could be
-	 * created for this parent
-	 */
-	list_del(&parent->next);
-	parent_remove_sysfs_files(dev);
-	mutex_unlock(&parent_list_lock);
-
-	mdev_remove_attribute_group(dev,
-				    parent->ops->dev_attr_groups);
-
-	while (!list_empty(&parent->mdev_list)) {
-		mutex_lock(&parent->mdev_list_lock);
-		if (!list_empty(&parent->mdev_list)) {
-			mdev = list_first_entry(&parent->mdev_list,
-						struct mdev_device, next);
-			mdev_device_destroy_ops(mdev, true);
-		}
-		mutex_unlock(&parent->mdev_list_lock);
-
-		if (mdev)
-			mdev_put_device(mdev);
-	}
+	mdev_device_destroy_ops(mdev);
+	device_unregister(&mdev->dev);
 
-	do {
-		ret = wait_event_interruptible_timeout(parent->release_done,
-				list_empty(&parent->mdev_list), HZ * 10);
-		if (ret == -ERESTARTSYS) {
-			dev_warn(dev, "Mediated devices are in use, task"
-				      " \"%s\" (%d) "
-				      "blocked until all are released",
-				      current->comm, task_pid_nr(current));
-		}
-	} while (ret <= 0);
-
-	mdev_put_parent(parent);
+	return 0;
 }
-EXPORT_SYMBOL(mdev_unregister_device);
 
 /*
- * Functions required for mdev_sysfs
+ * mdev_unregister_host_device : unregister a mdev host device
+ * @host: the mdev host device structure
+ *
+ * Unregister a mdev host device as the mediator
  */
-static void mdev_device_release(struct device *dev)
+void mdev_unregister_host_device(struct mdev_host *host)
 {
-	struct mdev_device *mdev = to_mdev_device(dev);
+	if (!host)
+		return;
 
-	dev_dbg(&mdev->dev, "MDEV: destroying\n");
-	kfree(mdev);
+	dev_info(&host->dev, "mdev host device unregistered\n");
+
+	mdev_remove_sysfs_files(&host->dev);
+	sysfs_remove_groups(&host->dev.kobj, host->ops->hdev_attr_groups);
+	device_for_each_child(&host->dev, NULL, __mdev_device_destroy);
+	device_unregister(&host->dev);
 }
+EXPORT_SYMBOL(mdev_unregister_host_device);
 
 int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
 {
 	int ret;
 	struct mdev_device *mdev;
-	struct parent_device *parent;
-
-	parent = mdev_get_parent_from_dev(dev);
-	if (!parent)
-		return -EINVAL;
+	struct mdev_host *host = dev_to_host(dev);
 
-	mutex_lock(&parent->mdev_list_lock);
 	/* Check for duplicate */
-	mdev = __find_mdev_device(parent, uuid);
+	mdev = find_mdev_device(host, uuid);
 	if (mdev) {
 		ret = -EEXIST;
 		goto create_err;
@@ -345,12 +178,10 @@ int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
 	}
 
 	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
-	mdev->parent = parent;
-	kref_init(&mdev->ref);
 
-	mdev->dev.parent  = dev;
-	mdev->dev.bus     = &mdev_bus_type;
-	mdev->dev.release = mdev_device_release;
+	mdev->dev.parent = dev;
+	mdev->dev.bus = &mdev_bus_type;
+	mdev->dev.groups = host->ops->mdev_attr_groups;
 	dev_set_name(&mdev->dev, "%pUl", uuid.b);
 
 	ret = device_register(&mdev->dev);
@@ -363,123 +194,35 @@ int mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params)
 	if (ret)
 		goto create_failed;
 
-	ret = mdev_create_sysfs_files(&mdev->dev);
-	if (ret)
-		goto create_sysfs_error;
-
-	list_add(&mdev->next, &parent->mdev_list);
-	mutex_unlock(&parent->mdev_list_lock);
-
 	dev_dbg(&mdev->dev, "MDEV: created\n");
 
 	return ret;
 
-create_sysfs_error:
-	mdev_device_destroy_ops(mdev, true);
-
 create_failed:
 	device_unregister(&mdev->dev);
 
 create_err:
-	mutex_unlock(&parent->mdev_list_lock);
-	mdev_put_parent(parent);
 	return ret;
 }
 
 int mdev_device_destroy(struct device *dev, uuid_le uuid)
 {
 	struct mdev_device *mdev;
-	struct parent_device *parent;
-	int ret;
+	struct mdev_host *host = dev_to_host(dev);
 
-	parent = mdev_get_parent_from_dev(dev);
-	if (!parent)
+	mdev = find_mdev_device(host, uuid);
+	if (!mdev)
 		return -ENODEV;
 
-	mutex_lock(&parent->mdev_list_lock);
-	mdev = __find_mdev_device(parent, uuid);
-	if (!mdev) {
-		ret = -EINVAL;
-		goto destroy_err;
-	}
-
-	mdev_remove_sysfs_files(&mdev->dev);
-	ret = mdev_device_destroy_ops(mdev, false);
-	if (ret)
-		goto destroy_err;
-
-	mutex_unlock(&parent->mdev_list_lock);
-	mdev_put_device(mdev);
-
-	mdev_put_parent(parent);
-	return ret;
-
-destroy_err:
-	mutex_unlock(&parent->mdev_list_lock);
-	mdev_put_parent(parent);
-	return ret;
+	return __mdev_device_destroy(&mdev->dev, NULL);
 }
 
 void mdev_device_supported_config(struct device *dev, char *str)
 {
-	struct parent_device *parent;
-
-	parent = mdev_get_parent_from_dev(dev);
+	struct mdev_host *host = dev_to_host(dev);
 
-	if (parent) {
-		if (parent->ops->supported_config)
-			parent->ops->supported_config(parent->dev, str);
-		mdev_put_parent(parent);
-	}
-}
-
-int mdev_device_set_online_status(struct device *dev, bool online)
-{
-	int ret = 0;
-	struct mdev_device *mdev;
-	struct parent_device *parent;
-
-	mdev = mdev_get_device(to_mdev_device(dev));
-	if (!mdev)
-		return -EINVAL;
-
-	parent = mdev->parent;
-
-	if (parent->ops->set_online_status)
-		ret = parent->ops->set_online_status(mdev, online);
-
-	if (ret)
-		pr_err("mdev online failed  %d\n", ret);
-	else {
-		if (online)
-			kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
-		else
-			kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
-	}
-
-	mdev_put_device(mdev);
-
-	return ret;
-}
-
-int mdev_device_get_online_status(struct device *dev, bool *online)
-{
-	int ret = 0;
-	struct mdev_device *mdev;
-	struct parent_device *parent;
-
-	mdev = mdev_get_device(to_mdev_device(dev));
-	if (!mdev)
-		return -EINVAL;
-
-	parent = mdev->parent;
-
-	if (parent->ops->get_online_status)
-		ret = parent->ops->get_online_status(mdev, online);
-
-	mdev_put_device(mdev);
-
-	return ret;
+	if (host->ops->supported_config)
+		host->ops->supported_config(&host->dev, str);
 }
 
 static int __init mdev_init(void)
@@ -487,10 +230,8 @@ static int __init mdev_init(void)
 	int ret;
 
 	ret = mdev_bus_register();
-	if (ret) {
-		pr_err("Failed to register mdev bus\n");
-		return ret;
-	}
+	if (ret)
+		pr_err("failed to register mdev bus: %d\n", ret);
 
 	return ret;
 }
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
index 8afc2d8..d298aaf 100644
--- a/drivers/vfio/mdev/mdev_driver.c
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -5,6 +5,8 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -52,7 +54,7 @@ static void mdev_detach_iommu(struct mdev_device *mdev)
 static int mdev_probe(struct device *dev)
 {
 	struct mdev_driver *drv = to_mdev_driver(dev->driver);
-	struct mdev_device *mdev = to_mdev_device(dev);
+	struct mdev_device *mdev = dev_to_mdev(dev);
 	int ret;
 
 	ret = mdev_attach_iommu(mdev);
@@ -73,7 +75,7 @@ static int mdev_probe(struct device *dev)
 static int mdev_remove(struct device *dev)
 {
 	struct mdev_driver *drv = to_mdev_driver(dev->driver);
-	struct mdev_device *mdev = to_mdev_device(dev);
+	struct mdev_device *mdev = dev_to_mdev(dev);
 
 	if (drv && drv->remove)
 		drv->remove(dev);
@@ -83,10 +85,32 @@ static int mdev_remove(struct device *dev)
 	return 0;
 }
 
+static int mdev_online(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+
+	if (drv && drv->online)
+		return drv->online(dev);
+
+	return 0;
+}
+
+static int mdev_offline(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+
+	if (drv && drv->offline)
+		return drv->offline(dev);
+
+	return 0;
+}
+
 struct bus_type mdev_bus_type = {
 	.name		= "mdev",
 	.probe		= mdev_probe,
 	.remove		= mdev_remove,
+	.online		= mdev_online,
+	.offline	= mdev_offline,
 };
 EXPORT_SYMBOL_GPL(mdev_bus_type);
 
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 07ad1b3..f153292 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -5,6 +5,8 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -17,12 +19,6 @@ int  mdev_bus_register(void);
 void mdev_bus_unregister(void);
 
 /* Function prototypes for mdev_sysfs */
-
-extern struct class_attribute mdev_class_attrs[];
-
-int  parent_create_sysfs_files(struct device *dev);
-void parent_remove_sysfs_files(struct device *dev);
-
 int  mdev_create_sysfs_files(struct device *dev);
 void mdev_remove_sysfs_files(struct device *dev);
 
@@ -30,7 +26,4 @@ int  mdev_device_create(struct device *dev, uuid_le uuid, char *mdev_params);
 int  mdev_device_destroy(struct device *dev, uuid_le uuid);
 void mdev_device_supported_config(struct device *dev, char *str);
 
-int mdev_device_set_online_status(struct device *dev, bool online);
-int mdev_device_get_online_status(struct device *dev, bool *online);
-
 #endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index ed55cd5..7d55188 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -5,6 +5,10 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ * Author:
+ *	Jike Song <jike.song@intel.com>
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -35,17 +39,17 @@ static ssize_t mdev_destroy_store(struct device *dev,
 				  const char *buf, size_t count);
 static DEVICE_ATTR_WO(mdev_destroy);
 
-static ssize_t online_store(struct device *dev, struct device_attribute *attr,
-			    const char *buf, size_t count);
-static ssize_t online_show(struct device *dev, struct device_attribute *attr,
-			   char *buf);
-static DEVICE_ATTR_RW(online);
+static const struct attribute *mdev_host_attrs[] = {
+	&dev_attr_mdev_supported_types.attr,
+	&dev_attr_mdev_create.attr,
+	&dev_attr_mdev_destroy.attr,
+	NULL,
+};
 
-/* Static functions */
 
 #define SUPPORTED_TYPE_BUFFER_LENGTH	4096
 
-/* mdev sysfs Functions */
+/* mdev host sysfs functions */
 static ssize_t mdev_supported_types_show(struct device *dev,
 					 struct device_attribute *attr,
 					 char *buf)
@@ -70,25 +74,24 @@ static ssize_t mdev_create_store(struct device *dev,
 				 struct device_attribute *attr,
 				 const char *buf, size_t count)
 {
-	char *str, *pstr;
-	char *uuid_str, *mdev_params = NULL, *params = NULL;
+	char *str;
+	char *uuid_str, *params = NULL;
 	uuid_le uuid;
 	int ret;
 
-	pstr = str = kstrndup(buf, count, GFP_KERNEL);
-
+	str = kstrndup(buf, count, GFP_KERNEL);
 	if (!str)
 		return -ENOMEM;
 
 	uuid_str = strsep(&str, ":");
 	if (!uuid_str) {
-		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		pr_err("mdev_create: empty UUID string %s\n", buf);
 		ret = -EINVAL;
 		goto create_error;
 	}
 
 	if (str)
-		params = mdev_params = kstrdup(str, GFP_KERNEL);
+		params = kstrdup(str, GFP_KERNEL);
 
 	ret = uuid_le_to_bin(uuid_str, &uuid);
 	if (ret) {
@@ -96,7 +99,7 @@ static ssize_t mdev_create_store(struct device *dev,
 		goto create_error;
 	}
 
-	ret = mdev_device_create(dev, uuid, mdev_params);
+	ret = mdev_device_create(dev, uuid, params);
 	if (ret)
 		pr_err("mdev_create: Failed to create mdev device\n");
 	else
@@ -104,7 +107,7 @@ static ssize_t mdev_create_store(struct device *dev,
 
 create_error:
 	kfree(params);
-	kfree(pstr);
+	kfree(str);
 	return ret;
 }
 
@@ -112,23 +115,15 @@ static ssize_t mdev_destroy_store(struct device *dev,
 				  struct device_attribute *attr,
 				  const char *buf, size_t count)
 {
-	char *uuid_str, *str, *pstr;
+	char *str;
 	uuid_le uuid;
 	int ret;
 
-	str = pstr = kstrndup(buf, count, GFP_KERNEL);
-
+	str = kstrndup(buf, count, GFP_KERNEL);
 	if (!str)
 		return -ENOMEM;
 
-	uuid_str = strsep(&str, ":");
-	if (!uuid_str) {
-		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
-		ret = -EINVAL;
-		goto destroy_error;
-	}
-
-	ret = uuid_le_to_bin(uuid_str, &uuid);
+	ret = uuid_le_to_bin(str, &uuid);
 	if (ret) {
 		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
 		goto destroy_error;
@@ -139,102 +134,22 @@ static ssize_t mdev_destroy_store(struct device *dev,
 		ret = count;
 
 destroy_error:
-	kfree(pstr);
-	return ret;
-}
-
-static ssize_t online_store(struct device *dev, struct device_attribute *attr,
-			    const char *buf, size_t count)
-{
-	char *str;
-	int ret;
-	uint32_t online_status;
-	bool online;
-
-	str = kstrndup(buf, count, GFP_KERNEL);
-	if (!str)
-		return -ENOMEM;
-
-	ret = kstrtouint(str, 0, &online_status);
 	kfree(str);
-
-	if (ret) {
-		pr_err("online: parsing error %s\n", buf);
-		return ret;
-	}
-
-	online = online_status > 0 ? true : false;
-
-	ret = mdev_device_set_online_status(dev, online);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t online_show(struct device *dev, struct device_attribute *attr,
-			   char *buf)
-{
-	int ret;
-	bool online = false;
-
-	ret = mdev_device_get_online_status(dev, &online);
-	if (ret)
-		return ret;
-
-	ret = sprintf(buf, "%d\n", online);
 	return ret;
 }
 
-int parent_create_sysfs_files(struct device *dev)
-{
-	int ret;
-
-	ret = sysfs_create_file(&dev->kobj,
-				&dev_attr_mdev_supported_types.attr);
-	if (ret) {
-		pr_err("Failed to create mdev_supported_types sysfs entry\n");
-		return ret;
-	}
-
-	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
-	if (ret) {
-		pr_err("Failed to create mdev_create sysfs entry\n");
-		goto create_sysfs_failed;
-	}
-
-	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
-	if (ret) {
-		pr_err("Failed to create mdev_destroy sysfs entry\n");
-		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
-	} else
-		return ret;
-
-create_sysfs_failed:
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
-	return ret;
-}
-
-void parent_remove_sysfs_files(struct device *dev)
-{
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
-	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
-}
-
 int mdev_create_sysfs_files(struct device *dev)
 {
 	int ret;
 
-	ret = sysfs_create_file(&dev->kobj, &dev_attr_online.attr);
+	ret = sysfs_create_files(&dev->kobj, mdev_host_attrs);
 	if (ret)
-		pr_err("Failed to create 'online' entry\n");
+		pr_err("sysfs_create_files failed: %d\n", ret);
 
 	return ret;
 }
 
 void mdev_remove_sysfs_files(struct device *dev)
 {
-	sysfs_remove_file(&dev->kobj, &dev_attr_online.attr);
+	sysfs_remove_files(&dev->kobj, mdev_host_attrs);
 }
-
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index babcb72..1236200 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -5,6 +5,8 @@
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -15,65 +17,50 @@
 
 #include <uapi/linux/vfio.h>
 
-struct parent_device;
-
-/*
- * Mediated device
- */
 
+/* mediated device */
 struct mdev_device {
 	struct device		dev;
-	struct parent_device	*parent;
 	struct iommu_group	*group;
 	uuid_le			uuid;
 	void			*driver_data;
-
-	/* internal only */
-	struct kref		ref;
-	struct list_head	next;
 };
 
-
 /**
- * struct parent_ops - Structure to be registered for each parent device to
- * register the device to mdev module.
+ * struct mdev_host_ops - Structure to be registered with the mdev module
+ * for each host device.
  *
  * @owner:		The module owner.
- * @dev_attr_groups:	Default attributes of the parent device.
- * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @hdev_attr_groups:	Default attributes of the host device.
+ * @mdev_attr_groups:	Default attributes of the mdev device.
  * @supported_config:	Called to get information about supported types.
- *			@dev : device structure of parent device.
+ *			@dev : device structure of host device.
  *			@config: should return string listing supported config
  *			Returns integer: success (0) or error (< 0)
- * @create:		Called to allocate basic resources in parent device's
+ * @create:		Called to allocate basic resources in host device's
  *			driver for a particular mediated device. It is
  *			mandatory to provide create ops.
  *			@mdev: mdev_device structure on of mediated device
  *			      that is being created
- *			@mdev_params: extra parameters required by parent
+ *			@mdev_params: extra parameters required by host
  *			device's driver.
  *			Returns integer: success (0) or error (< 0)
- * @destroy:		Called to free resources in parent device's driver for a
- *			a mediated device. It is mandatory to provide destroy
- *			ops.
+ * @destroy:		Called to free resources in host device's driver for a
+ *			mediated device instance. It is mandatory to provide
+ *			destroy ops.
  *			@mdev: mdev_device device structure which is being
- *			       destroyed
+ *				destroyed
  *			Returns integer: success (0) or error (< 0)
  *			If VMM is running and destroy() is called that means the
  *			mdev is being hotunpluged. Return error if VMM is
  *			running and driver doesn't support mediated device
  *			hotplug.
- * @reset:		Called to reset mediated device.
- *			@mdev: mdev_device device structure.
- *			Returns integer: success (0) or error (< 0)
- * @set_online_status:	Called to change to status of mediated device.
- *			@mdev: mediated device.
- *			@online: set true or false to make mdev device online or
- *			offline.
+ * @start:		Called to initiate mediated device initialization
+ *			process in host device's driver before VMM starts.
+ *			@mdev: mediated device structure
  *			Returns integer: success (0) or error (< 0)
- * @get_online_status:	Called to get online/offline status of  mediated device
- *			@mdev: mediated device.
- *			@online: Returns status of mediated device.
+ * @stop:		Called to teardown mediated device related resources
+ *			@mdev: mediated device structure
  *			Returns integer: success (0) or error (< 0)
  * @read:		Read emulation callback
  *			@mdev: mediated device structure
@@ -87,75 +74,47 @@ struct mdev_device {
  *			@count: number of bytes to be written
  *			@pos: address.
  *			Retuns number on bytes written on success or error.
- * @get_irq_info:	Called to retrieve information about mediated device IRQ
- *			@mdev: mediated device structure
- *			@irq_info: VFIO IRQ flags and count.
- *			Returns integer: success (0) or error (< 0)
- * @set_irqs:		Called to send about interrupts configuration
- *			information that VMM sets.
+ * @mmap:		Memory Map
  *			@mdev: mediated device structure
- *			@flags, index, start, count and *data : same as that of
- *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
- * @get_device_info:	Called to get VFIO device information for a mediated
- *			device.
- *			@vfio_device_info: VFIO device info.
- *			Returns integer: success (0) or error (< 0)
- * @get_region_info:	Called to get VFIO region size and flags of mediated
- *			device.
- *			@mdev: mediated device structure
- *			@region_info: output, returns size and flags of
- *				      requested region.
- *			@cap_type_id: returns id of capability.
- *			@cap_type: returns pointer to capability structure
- *			corresponding to capability id.
+ *			@pos: address
+ *			@virtaddr: target user address to start at. Vendor
+ *			driver can change if required.
+ *			@pfn: host address of kernel memory, vendor driver
+ *			can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			driver can change, if required.
  *			Returns integer: success (0) or error (< 0)
  *
- * Parent device that support mediated device should be registered with mdev
- * module with parent_ops structure.
+ * A host device that supports mediated devices should be registered with the
+ * mdev module, along with an mdev_host_ops structure.
  */
-
-struct parent_ops {
-	struct module   *owner;
-	const struct attribute_group **dev_attr_groups;
+struct mdev_host_ops {
+	struct module *owner;
+	const struct attribute_group **hdev_attr_groups;
 	const struct attribute_group **mdev_attr_groups;
 
-	int	(*supported_config)(struct device *dev, char *config);
-	int     (*create)(struct mdev_device *mdev, char *mdev_params);
-	int     (*destroy)(struct mdev_device *mdev);
-	int     (*reset)(struct mdev_device *mdev);
-	int     (*set_online_status)(struct mdev_device *mdev, bool online);
-	int     (*get_online_status)(struct mdev_device *mdev, bool *online);
-	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
-			loff_t pos);
-	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
-			 loff_t pos);
-	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
-	int	(*get_irq_info)(struct mdev_device *mdev,
-				struct vfio_irq_info *irq_info);
-	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
-			    unsigned int index, unsigned int start,
-			    unsigned int count, void *data);
-	int	(*get_device_info)(struct mdev_device *mdev,
-				   struct vfio_device_info *dev_info);
-	int	(*get_region_info)(struct mdev_device *mdev,
-				   struct vfio_region_info *region_info,
-				   u16 *cap_type_id, void **cap_type);
-};
+	int (*supported_config)(struct device *dev, char *config);
+	int (*create)(struct mdev_device *mdev, char *mdev_params);
+	void (*destroy)(struct mdev_device *mdev);
 
-/*
- * Parent Device
- */
+	int (*start)(struct mdev_device *mdev);
+	int (*stop)(struct mdev_device *mdev);
 
-struct parent_device {
-	struct device		*dev;
-	const struct parent_ops	*ops;
+	ssize_t (*read)(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *pos);
+	ssize_t (*write)(struct mdev_device *mdev, const char __user *buf,
+			size_t count, loff_t *pos);
+	int (*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+	long (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
+			unsigned long arg);
+};
 
-	/* internal */
-	struct kref		ref;
-	struct list_head	next;
-	struct list_head	mdev_list;
-	struct mutex		mdev_list_lock;
-	wait_queue_head_t	release_done;
+/* mdev host device */
+struct mdev_host {
+	struct device dev;
+	const struct mdev_host_ops *ops;
 };
 
 /**
@@ -164,25 +123,16 @@ struct parent_device {
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
- *
  **/
 struct mdev_driver {
 	const char *name;
-	int  (*probe)(struct device *dev);
+	int (*probe)(struct device *dev);
 	void (*remove)(struct device *dev);
+	int (*online)(struct device *dev);
+	int (*offline)(struct device *dev);
 	struct device_driver driver;
 };
 
-static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
-{
-	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
-}
-
-static inline struct mdev_device *to_mdev_device(struct device *dev)
-{
-	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
-}
-
 static inline void *mdev_get_drvdata(struct mdev_device *mdev)
 {
 	return mdev->driver_data;
@@ -195,18 +145,15 @@ static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
 
 extern struct bus_type mdev_bus_type;
 
-#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
-
-extern int  mdev_register_device(struct device *dev,
-				 const struct parent_ops *ops);
-extern void mdev_unregister_device(struct device *dev);
-
-extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
-extern void mdev_unregister_driver(struct mdev_driver *drv);
+#define to_mdev_driver(drv) container_of(drv, struct mdev_driver, driver)
+#define dev_to_host(_dev) container_of((_dev), struct mdev_host, dev)
+#define dev_to_mdev(_dev) container_of((_dev), struct mdev_device, dev)
 
-extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
-extern void mdev_put_device(struct mdev_device *mdev);
+struct mdev_host *mdev_register_host_device(struct device *dev,
+				 const struct mdev_host_ops *ops);
+void mdev_unregister_host_device(struct mdev_host *host);
 
-extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+void mdev_unregister_driver(struct mdev_driver *drv);
 
 #endif /* MDEV_H */

^ permalink raw reply related	[flat|nested] 162+ messages in thread
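
For illustration, a minimal sketch of how a vendor driver might hook into the
interface proposed above. The mdev_host_ops layout, mdev_register_host_device()
and the drvdata helpers are taken from the diff; everything prefixed my_ is
hypothetical, as is the error convention assumed for the registration call.

#include <linux/module.h>
#include <linux/device.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/mdev.h>			/* the proposed header above */

struct my_mdev_state {
	u32 regs[64];			/* emulated register file, made up */
};

static int my_mdev_create(struct mdev_device *mdev, char *mdev_params)
{
	struct my_mdev_state *state = kzalloc(sizeof(*state), GFP_KERNEL);

	if (!state)
		return -ENOMEM;
	mdev_set_drvdata(mdev, state);
	return 0;
}

static void my_mdev_destroy(struct mdev_device *mdev)
{
	kfree(mdev_get_drvdata(mdev));
}

static const struct mdev_host_ops my_host_ops = {
	.owner		= THIS_MODULE,
	.create		= my_mdev_create,
	.destroy	= my_mdev_destroy,
	/* .read/.write/.mmap/.ioctl/.start/.stop filled in as needed */
};

static struct mdev_host *my_host;

/* called from the physical device's probe path */
static int my_register_with_mdev(struct device *dev)
{
	my_host = mdev_register_host_device(dev, &my_host_ops);

	/* assuming ERR_PTR-or-NULL style failure reporting */
	return IS_ERR_OR_NULL(my_host) ? -ENODEV : 0;
}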

* Re: [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-20 12:53     ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-20 12:53 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi

On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
/* {snip} */

This shows another possible implementation of vfio-mdev, one that
provides the thinnest possible framework and lets the vendor physical
drivers do whatever they want.

Again, it is diff-ed against Kirti's version 7, for demonstration only.

--
Thanks,
Jike


diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index d25439f..b2fe0c6 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -8,3 +8,11 @@ config MDEV
 	See Documentation/vfio-mediated-device.txt for more details.
 
         If you don't know what do here, say N.
+
+config VFIO_MDEV
+    tristate "VFIO Bus driver for Mediated devices"
+    depends on VFIO && MDEV
+    default n
+    help
+        VFIO Bus driver for mediated devices.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 8bd78b5..ee9f89f 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,3 +2,4 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MDEV) += vfio_mdev.o
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
index 28f13ae..c22ebd8 100644
--- a/drivers/vfio/mdev/vfio_mdev.c
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -1,10 +1,15 @@
 /*
- * VFIO based Mediated PCI device driver
+ * VFIO Bus driver for Mediated device
  *
  * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
  *     Author: Neo Jia <cjia@nvidia.com>
  *	       Kirti Wankhede <kwankhede@nvidia.com>
  *
+ * Copyright (c) 2016 Intel Corporation.
+ * Author:
+ *	Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ *	Jike Song <jike.song@intel.com>
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
@@ -22,24 +27,21 @@
 
 #include "mdev_private.h"
 
-#define DRIVER_VERSION  "0.1"
+#define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "NVIDIA Corporation"
-#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
+#define DRIVER_DESC     "VFIO Bus driver for Mediated device"
 
 struct vfio_mdev {
 	struct iommu_group *group;
 	struct mdev_device *mdev;
-	struct vfio_device_info dev_info;
 };
 
 static int vfio_mdev_open(void *device_data)
 {
-	int ret = 0;
-
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
 
-	return ret;
+	return 0;
 }
 
 static void vfio_mdev_close(void *device_data)
@@ -47,220 +49,17 @@ static void vfio_mdev_close(void *device_data)
 	module_put(THIS_MODULE);
 }
 
-static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
-{
-	struct vfio_info_cap_header *header;
-	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
-	size_t size;
-
-	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
-	header = vfio_info_cap_add(caps, size,
-				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
-	if (IS_ERR(header))
-		return PTR_ERR(header);
-
-	sparse_cap = container_of(header,
-			struct vfio_region_info_cap_sparse_mmap, header);
-	sparse_cap->nr_areas = sparse->nr_areas;
-	memcpy(sparse_cap->areas, sparse->areas,
-	       sparse->nr_areas * sizeof(*sparse->areas));
-	return 0;
-}
-
-static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
-{
-	struct vfio_info_cap_header *header;
-	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
-
-	header = vfio_info_cap_add(caps, sizeof(*cap),
-				   VFIO_REGION_INFO_CAP_TYPE, 1);
-	if (IS_ERR(header))
-		return PTR_ERR(header);
-
-	type_cap = container_of(header, struct vfio_region_info_cap_type,
-				header);
-	type_cap->type = cap->type;
-	type_cap->subtype = cap->type;
-	return 0;
-}
-
 static long vfio_mdev_unlocked_ioctl(void *device_data,
 				     unsigned int cmd, unsigned long arg)
 {
-	int ret = 0;
 	struct vfio_mdev *vmdev = device_data;
-	struct parent_device *parent = vmdev->mdev->parent;
-	unsigned long minsz;
-
-	switch (cmd) {
-	case VFIO_DEVICE_GET_INFO:
-	{
-		struct vfio_device_info info;
-
-		minsz = offsetofend(struct vfio_device_info, num_irqs);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		if (parent->ops->get_device_info)
-			ret = parent->ops->get_device_info(vmdev->mdev, &info);
-		else
-			return -EINVAL;
-
-		if (ret)
-			return ret;
-
-		if (parent->ops->reset)
-			info.flags |= VFIO_DEVICE_FLAGS_RESET;
-
-		memcpy(&vmdev->dev_info, &info, sizeof(info));
-
-		return copy_to_user((void __user *)arg, &info, minsz);
-	}
-	case VFIO_DEVICE_GET_REGION_INFO:
-	{
-		struct vfio_region_info info;
-		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
-		u16 cap_type_id = 0;
-		void *cap_type = NULL;
-
-		minsz = offsetofend(struct vfio_region_info, offset);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (info.argsz < minsz)
-			return -EINVAL;
-
-		if (parent->ops->get_region_info)
-			ret = parent->ops->get_region_info(vmdev->mdev, &info,
-						       &cap_type_id, &cap_type);
-		else
-			return -EINVAL;
-
-		if (ret)
-			return ret;
-
-		if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && cap_type) {
-			switch (cap_type_id) {
-			case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
-				ret = sparse_mmap_cap(&caps, cap_type);
-				if (ret)
-					return ret;
-				break;
-
-			case VFIO_REGION_INFO_CAP_TYPE:
-				ret = region_type_cap(&caps, cap_type);
-				if (ret)
-					return ret;
-				break;
-			default:
-				return -EINVAL;
-			}
-		}
-
-		if (caps.size) {
-			if (info.argsz < sizeof(info) + caps.size) {
-				info.argsz = sizeof(info) + caps.size;
-				info.cap_offset = 0;
-			} else {
-				vfio_info_cap_shift(&caps, sizeof(info));
-				if (copy_to_user((void __user *)arg +
-							sizeof(info), caps.buf,
-							caps.size)) {
-					kfree(caps.buf);
-					return -EFAULT;
-				}
-				info.cap_offset = sizeof(info);
-			}
-			kfree(caps.buf);
-		}
-
-		return copy_to_user((void __user *)arg, &info, minsz);
-	}
-	case VFIO_DEVICE_GET_IRQ_INFO:
-	{
-		struct vfio_irq_info info;
-
-		minsz = offsetofend(struct vfio_irq_info, count);
-
-		if (copy_from_user(&info, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if ((info.argsz < minsz) ||
-		    (info.index >= vmdev->dev_info.num_irqs))
-			return -EINVAL;
-
-		if (parent->ops->get_irq_info)
-			ret = parent->ops->get_irq_info(vmdev->mdev, &info);
-		else
-			return -EINVAL;
-
-		if (ret)
-			return ret;
-
-		if (info.count == -1)
-			return -EINVAL;
-
-		return copy_to_user((void __user *)arg, &info, minsz);
-	}
-	case VFIO_DEVICE_SET_IRQS:
-	{
-		struct vfio_irq_set hdr;
-		u8 *data = NULL, *ptr = NULL;
-
-		minsz = offsetofend(struct vfio_irq_set, count);
-
-		if (copy_from_user(&hdr, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if ((hdr.argsz < minsz) ||
-		    (hdr.index >= vmdev->dev_info.num_irqs) ||
-		    (hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
-				  VFIO_IRQ_SET_ACTION_TYPE_MASK)))
-			return -EINVAL;
-
-		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
-			size_t size;
-
-			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
-				size = sizeof(uint8_t);
-			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
-				size = sizeof(int32_t);
-			else
-				return -EINVAL;
-
-			if (hdr.argsz - minsz < hdr.count * size)
-				return -EINVAL;
-
-			ptr = data = memdup_user((void __user *)(arg + minsz),
-						 hdr.count * size);
-			if (IS_ERR(data))
-				return PTR_ERR(data);
-		}
-
-		if (parent->ops->set_irqs)
-			ret = parent->ops->set_irqs(vmdev->mdev, hdr.flags,
-						    hdr.index, hdr.start,
-						    hdr.count, data);
-		else
-			ret = -EINVAL;
-
-		kfree(ptr);
-		return ret;
-	}
-	case VFIO_DEVICE_RESET:
-	{
-		if (parent->ops->reset)
-			return parent->ops->reset(vmdev->mdev);
-
-		return -EINVAL;
-	}
-	}
-	return -ENOTTY;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
+
+	if (host->ops->ioctl)
+		return host->ops->ioctl(mdev, cmd, arg);
+
+	return -ENODEV;
 }
 
 static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
@@ -268,63 +67,12 @@ static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
 {
 	struct vfio_mdev *vmdev = device_data;
 	struct mdev_device *mdev = vmdev->mdev;
-	struct parent_device *parent = mdev->parent;
-	unsigned int done = 0;
-	int ret;
-
-	if (!parent->ops->read)
-		return -EINVAL;
-
-	while (count) {
-		size_t filled;
-
-		if (count >= 4 && !(*ppos % 4)) {
-			u32 val;
-
-			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
-						*ppos);
-			if (ret <= 0)
-				goto read_err;
-
-			if (copy_to_user(buf, &val, sizeof(val)))
-				goto read_err;
-
-			filled = 4;
-		} else if (count >= 2 && !(*ppos % 2)) {
-			u16 val;
-
-			ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
-						*ppos);
-			if (ret <= 0)
-				goto read_err;
-
-			if (copy_to_user(buf, &val, sizeof(val)))
-				goto read_err;
-
-			filled = 2;
-		} else {
-			u8 val;
-
-			ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
-			if (ret <= 0)
-				goto read_err;
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
 
-			if (copy_to_user(buf, &val, sizeof(val)))
-				goto read_err;
+	if (host->ops->read)
+		return host->ops->read(mdev, buf, count, ppos);
 
-			filled = 1;
-		}
-
-		count -= filled;
-		done += filled;
-		*ppos += filled;
-		buf += filled;
-	}
-
-	return done;
-
-read_err:
-	return -EFAULT;
+	return -ENODEV;
 }
 
 static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
@@ -332,75 +80,24 @@ static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
 {
 	struct vfio_mdev *vmdev = device_data;
 	struct mdev_device *mdev = vmdev->mdev;
-	struct parent_device *parent = mdev->parent;
-	unsigned int done = 0;
-	int ret;
-
-	if (!parent->ops->write)
-		return -EINVAL;
-
-	while (count) {
-		size_t filled;
-
-		if (count >= 4 && !(*ppos % 4)) {
-			u32 val;
-
-			if (copy_from_user(&val, buf, sizeof(val)))
-				goto write_err;
-
-			ret = parent->ops->write(mdev, (char *)&val,
-						 sizeof(val), *ppos);
-			if (ret <= 0)
-				goto write_err;
-
-			filled = 4;
-		} else if (count >= 2 && !(*ppos % 2)) {
-			u16 val;
-
-			if (copy_from_user(&val, buf, sizeof(val)))
-				goto write_err;
-
-			ret = parent->ops->write(mdev, (char *)&val,
-						 sizeof(val), *ppos);
-			if (ret <= 0)
-				goto write_err;
-
-			filled = 2;
-		} else {
-			u8 val;
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
 
-			if (copy_from_user(&val, buf, sizeof(val)))
-				goto write_err;
+	if (host->ops->write)
+		return host->ops->write(mdev, buf, count, ppos);
 
-			ret = parent->ops->write(mdev, &val, sizeof(val),
-						 *ppos);
-			if (ret <= 0)
-				goto write_err;
-
-			filled = 1;
-		}
-
-		count -= filled;
-		done += filled;
-		*ppos += filled;
-		buf += filled;
-	}
-
-	return done;
-write_err:
-	return -EFAULT;
+	return -ENODEV;
 }
 
 static int vfio_mdev_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_mdev *vmdev = device_data;
 	struct mdev_device *mdev = vmdev->mdev;
-	struct parent_device *parent = mdev->parent;
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
 
-	if (parent->ops->mmap)
-		return parent->ops->mmap(mdev, vma);
+	if (host->ops->mmap)
+		return host->ops->mmap(mdev, vma);
 
-	return -EINVAL;
+	return -ENODEV;
 }
 
 static const struct vfio_device_ops vfio_mdev_dev_ops = {
@@ -413,28 +110,27 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = {
 	.mmap		= vfio_mdev_mmap,
 };
 
-int vfio_mdev_probe(struct device *dev)
+static int vfio_mdev_probe(struct device *dev)
 {
 	struct vfio_mdev *vmdev;
-	struct mdev_device *mdev = to_mdev_device(dev);
+	struct mdev_device *mdev = dev_to_mdev(dev);
 	int ret;
 
 	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
 	if (IS_ERR(vmdev))
 		return PTR_ERR(vmdev);
 
-	vmdev->mdev = mdev_get_device(mdev);
+	vmdev->mdev = mdev;
 	vmdev->group = mdev->group;
 
 	ret = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, vmdev);
 	if (ret)
 		kfree(vmdev);
 
-	mdev_put_device(mdev);
 	return ret;
 }
 
-void vfio_mdev_remove(struct device *dev)
+static void vfio_mdev_remove(struct device *dev)
 {
 	struct vfio_mdev *vmdev;
 
@@ -442,10 +138,34 @@ void vfio_mdev_remove(struct device *dev)
 	kfree(vmdev);
 }
 
-struct mdev_driver vfio_mdev_driver = {
-	.name	= "vfio_mdev",
-	.probe	= vfio_mdev_probe,
-	.remove	= vfio_mdev_remove,
+static int vfio_mdev_online(struct device *dev)
+{
+	struct mdev_device *mdev = dev_to_mdev(dev);
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
+
+	if (host->ops->start)
+		return host->ops->start(mdev);
+
+	return -ENOTSUPP;
+}
+
+static int vfio_mdev_offline(struct device *dev)
+{
+	struct mdev_device *mdev = dev_to_mdev(dev);
+	struct mdev_host *host = dev_to_host(mdev->dev.parent);
+
+	if (host->ops->stop)
+		return host->ops->stop(mdev);
+
+	return -ENOTSUPP;
+}
+
+static struct mdev_driver vfio_mdev_driver = {
+	.name		= "vfio_mdev",
+	.probe		= vfio_mdev_probe,
+	.remove		= vfio_mdev_remove,
+	.online		= vfio_mdev_online,
+	.offline	= vfio_mdev_offline,
 };
 
 static int __init vfio_mdev_init(void)

^ permalink raw reply related	[flat|nested] 162+ messages in thread
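
As a concrete illustration of the "thinnest framework" idea, a rough sketch of
what a vendor-side read callback could look like when the raw __user buffer is
forwarded straight through, as vfio_mdev_read() does in the diff above. The
region offset/size constants and the my_ names are invented for the example; a
real driver would decode *pos against its own region layout.

#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/uaccess.h>
#include <linux/mdev.h>			/* the proposed header */

#define MY_CFG_OFFSET	0x0		/* hypothetical emulated region */
#define MY_CFG_SIZE	0x100

static ssize_t my_mdev_read(struct mdev_device *mdev, char __user *buf,
			    size_t count, loff_t *pos)
{
	u8 tmp[16];
	size_t done = 0;

	/* the vendor driver does its own bounds checking of the access */
	if (*pos < MY_CFG_OFFSET || *pos >= MY_CFG_OFFSET + MY_CFG_SIZE)
		return -EINVAL;

	count = min_t(size_t, count, MY_CFG_OFFSET + MY_CFG_SIZE - *pos);

	while (done < count) {
		size_t chunk = min_t(size_t, count - done, sizeof(tmp));

		/* fill tmp from the emulated device state at *pos + done */
		memset(tmp, 0, chunk);

		if (copy_to_user(buf + done, tmp, chunk))
			return -EFAULT;

		done += chunk;
	}

	*pos += done;
	return done;
}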

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-20  2:50               ` Jike Song
@ 2016-09-20 16:24                 ` Alex Williamson
  2016-09-21  3:19                   ` Jike Song
  0 siblings, 1 reply; 162+ messages in thread
From: Alex Williamson @ 2016-09-20 16:24 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, kraxel, Dong Jia, kevin.tian, cjia, kvm,
	qemu-devel, pbonzini

On Tue, 20 Sep 2016 10:50:47 +0800
Jike Song <jike.song@intel.com> wrote:

> On 09/20/2016 04:03 AM, Alex Williamson wrote:
> > On Tue, 20 Sep 2016 00:43:15 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/20/2016 12:06 AM, Alex Williamson wrote:  
> >>> On Mon, 19 Sep 2016 23:52:36 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 8/26/2016 7:43 PM, Kirti Wankhede wrote:    
> >>>>> * PGP Signed: 08/26/2016 at 07:15:44 AM, Decrypted
> >>>>> On 8/25/2016 2:52 PM, Dong Jia wrote:      
> >>>>>> On Thu, 25 Aug 2016 09:23:53 +0530      
> >>>>    
> >>>>>>> +
> >>>>>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> >>>>>>> +			      size_t count, loff_t *ppos)
> >>>>>>> +{
> >>>>>>> +	struct vfio_mdev *vmdev = device_data;
> >>>>>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>>>>> +	struct parent_device *parent = mdev->parent;
> >>>>>>> +	unsigned int done = 0;
> >>>>>>> +	int ret;
> >>>>>>> +
> >>>>>>> +	if (!parent->ops->read)
> >>>>>>> +		return -EINVAL;
> >>>>>>> +
> >>>>>>> +	while (count) {      
> >>>>>> Here, I have to say sorry to you guys for that I didn't notice the
> >>>>>> bad impact of this change to my patches during the v6 discussion.
> >>>>>>
> >>>>>> For vfio-ccw, I introduced an I/O region to input/output I/O
> >>>>>> instruction parameters and results for Qemu. The @count of these data
> >>>>>> currently is 140. So supporting arbitrary lengths in one shot here, and
> >>>>>> also in vfio_mdev_write, seems the better option for this case.
> >>>>>>
> >>>>>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
> >>>>>> can do that in the parent read/write callbacks instead.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>      
> >>>>>
> >>>>> I would like to know Alex's thought on this. He raised concern with this
> >>>>> approach in v6 reviews:
> >>>>> "But I think this is exploitable, it lets the user make the kernel
> >>>>> allocate an arbitrarily sized buffer."
> >>>>>       
> >>>>
> >>>> Read/write callbacks are for slow path, emulation of mmio region which
> >>>> are mainly device registers. I do feel it shouldn't support arbitrary
> >>>> lengths.
> >>>> Alex, I would like to know your thoughts.    
> >>>
> >>> The exploit was that the mdev layer allocated a buffer and copied the
> >>> entire user buffer into kernel space before passing it to the vendor
> >>> driver.  The solution is to pass the __user *buf to the vendor driver
> >>> and let them sanitize and split it however makes sense for their
> >>> device.  We shouldn't be assuming naturally aligned, up to dword
> >>> accesses in the generic mdev layers.  Those sorts of semantics are
> >>> defined by the device type.  This is another case where making
> >>> the mdev layer as thin as possible is actually the best thing to
> >>> do to make it really device type agnostic.  Thanks,
> >>>     
> >>
> >>
> >> Alex,
> >>
> >> These were the comments on v6 patch:
> >>  
> >>>>> Do we really need to support arbitrary lengths in one shot?  Seems
> >>>>> like
> >>>>> we could just use a 4 or 8 byte variable on the stack and iterate
> >>>>> until
> >>>>> done.
> >>>>>    
> >>>>
> >>>> We just want to pass the arguments to vendor driver as is here.Vendor
> >>>> driver could take care of that.    
> >>  
> >>> But I think this is exploitable, it lets the user make the kernel
> >>> allocate an arbitrarily sized buffer.    
> >>
> >> As per above discussion in v7 version, this module don't allocated
> >> memory from heap.
> >>
> >> If vendor driver allocates arbitrary memory in kernel space through mdev
> >> module interface, isn't that would be exploit?  
> > 
> > Yep, my 4-8/byte chunking idea was too PCI specific.  If a vendor
> > driver introduces an exploit, that's a bug in the vendor driver.  I'm
> > not sure if we can or should attempt to guard against that.  Ultimately
> > the vendor driver is either open source and we can inspect it for such
> > exploits or it's closed source, taints the kernel, and we hope for the
> > best.  It might make a good unit test to perform substantially sized
> > reads/writes to the mdev device.  
> 
> Can't agree more! :-)
> 
> > Perhaps the only sanity test we can
> > make in the core is to verify the access doesn't exceed the size of
> > the region as advertised by the vendor driver.  Thanks,  
> 
> Even performing a lightweight sanity check, would require vfio-mdev
> to be able to decode the ppos into a particular region, that means
> information of all regions should be stored in the framework. I guess
> it is not your preferred way :)

There's certainly a trade-off there: we don't support dynamic regions,
the user expects them to be stable, and the mdev-core code can expect
that also.  It might simplify the vendor drivers slightly if the core
could perform such a basic sanity test, but the cost to do so would be
that the core needs to have an understanding of the region layout of
the device.  That seems like non-trivial overhead to consolidate
testing that the vendor driver itself can do much more efficiently.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread
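
A quick sketch of the unit-test idea mentioned above: from userspace, issue a
deliberately oversized read against an mdev region and check that the vendor
driver clamps or rejects it rather than the kernel buffering the whole request.
device_fd, region_offset and region_size are assumed to have been obtained via
the usual VFIO_GROUP_GET_DEVICE_FD and VFIO_DEVICE_GET_REGION_INFO calls; the
function name is invented.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int stress_region_read(int device_fd, off_t region_offset,
			      size_t region_size)
{
	size_t big = region_size * 4 + 4096;	/* deliberately oversized */
	char *buf = malloc(big);
	ssize_t n;

	if (!buf)
		return -1;

	n = pread(device_fd, buf, big, region_offset);
	if (n < 0)
		fprintf(stderr, "oversized read rejected: %s\n",
			strerror(errno));
	else
		printf("oversized read returned %zd of %zu bytes\n", n, big);

	free(buf);
	return 0;
}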

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-20 16:24                 ` Alex Williamson
@ 2016-09-21  3:19                   ` Jike Song
  2016-09-21  4:51                     ` Alex Williamson
  0 siblings, 1 reply; 162+ messages in thread
From: Jike Song @ 2016-09-21  3:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, kraxel, Dong Jia, kevin.tian, cjia, kvm,
	qemu-devel, pbonzini

On 09/21/2016 12:24 AM, Alex Williamson wrote:
> On Tue, 20 Sep 2016 10:50:47 +0800
> Jike Song <jike.song@intel.com> wrote:

/* trim the quotations */

>> Even performing a lightweight sanity check, would require vfio-mdev
>> to be able to decode the ppos into a particular region, that means
>> information of all regions should be stored in the framework. I guess
>> it is not your preferred way :)
> 
> There's certainly a trade-off there, we don't support dynamic regions,
> the user expects them to be stable and the mdev-core code can expect
> that also. It might simplify the vendor drivers slightly if the core
> could perform such a basic sanity test, but the cost to do so would be
> that the core needs to have an understanding of the region layout of
> the device.

I understand why the requirement exists, but I suspect that if we
assume the regions are stable and try to encode/decode them within
the mdev-core framework - instead of in the vendor drivers - it is
because we want mdev to be API compatible with vfio-pci?

Being API compatible with vfio-pci is (IMHO) the most beautiful thing
in the current mdev design, but is it necessary to make it mandatory?
How about letting the underlying vendor drivers decide whether to be
API compatible with vfio-pci, or to have a different set of
userspace APIs?


> That seems like non-trivial overhead to consolidate
> testing that the vendor driver itself can do much more efficiently.

Yes, this is also a trade-off if adopted :(


--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-21  3:19                   ` Jike Song
@ 2016-09-21  4:51                     ` Alex Williamson
  2016-09-21  5:02                       ` Jike Song
  0 siblings, 1 reply; 162+ messages in thread
From: Alex Williamson @ 2016-09-21  4:51 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, kraxel, Dong Jia, kevin.tian, cjia, kvm,
	qemu-devel, pbonzini

On Wed, 21 Sep 2016 11:19:17 +0800
Jike Song <jike.song@intel.com> wrote:

> On 09/21/2016 12:24 AM, Alex Williamson wrote:
> > On Tue, 20 Sep 2016 10:50:47 +0800
> > Jike Song <jike.song@intel.com> wrote:  
> 
> /* trim the quotations */
> 
> >> Even performing a lightweight sanity check, would require vfio-mdev
> >> to be able to decode the ppos into a particular region, that means
> >> information of all regions should be stored in the framework. I guess
> >> it is not your preferred way :)  
> > 
> > There's certainly a trade-off there, we don't support dynamic regions,
> > the user expects them to be stable and the mdev-core code can expect
> > that also. It might simplify the vendor drivers slightly if the core
> > could perform such a basic sanity test, but the cost to do so would be
> > that the core needs to have an understanding of the region layout of
> > the device.  
> 
> I agree with why the requirement is, but I am suspicious that,
> if we assume the regions are stable, try to encode/decode that within
> the mdev-core framework - instead of vendor drivers - that is because
> we want mdev to be API compatible with vfio-pci?
> 
> Being API compatible with vfio-pci is (IMHO) the most beautiful thing
> in current mdev design, but is it necessary to make it mandatory? 
> How about letting the underlining vendor drivers to decide whether
> it is API compatible with vfio-pci, or will have a different set of
> userspace API?

Are you assuming that I'm suggesting using VFIO_PCI_OFFSET_TO_INDEX in
the mdev core?  We've been through that, I've rejected it, that's not
at all what I'm describing.  The vfio bus driver defines the region
layout, but once defined it is fixed for a given device instance.  A
user does not need to call ioctl(VFIO_DEVICE_GET_REGION_INFO) prior to
every region access to make sure the region offsets haven't changed
dynamically.  If it's fixed to the user then it's also fixed to the
mdev core for a given device instance, so nothing prevents the core
code from doing its own enumeration of the region offsets and sizes and
caching them into data structures.  That has nothing whatsoever to do
with vfio-pci and makes no assumptions about the layout of regions
within device fd. Thanks,

Alex

^ permalink raw reply	[flat|nested] 162+ messages in thread
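
A rough sketch of the region cache Alex describes: the core enumerates the
(fixed) region layout once, caches offset and size per region, and can then
cheaply check that an access stays within a single region. None of this is
from a posted patch; the structure and function names are invented, and how
the layout is obtained from the vendor driver is left open here.

#include <linux/kernel.h>
#include <linux/types.h>

struct mdev_region_cache {
	u64	offset;
	u64	size;
};

struct mdev_region_layout {
	unsigned int			nr_regions;
	struct mdev_region_cache	*regions;	/* indexed by region */
};

/* true if [pos, pos + count) lies entirely within one cached region */
static bool mdev_access_ok(const struct mdev_region_layout *layout,
			   loff_t pos, size_t count)
{
	unsigned int i;

	if (pos < 0)
		return false;

	for (i = 0; i < layout->nr_regions; i++) {
		const struct mdev_region_cache *r = &layout->regions[i];

		if ((u64)pos >= r->offset && count <= r->size &&
		    (u64)pos - r->offset <= r->size - count)
			return true;
	}
	return false;
}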

* Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices
  2016-09-21  4:51                     ` Alex Williamson
@ 2016-09-21  5:02                       ` Jike Song
  0 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-21  5:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, kraxel, Dong Jia, kevin.tian, cjia, kvm,
	qemu-devel, pbonzini

On 09/21/2016 12:51 PM, Alex Williamson wrote:
> On Wed, 21 Sep 2016 11:19:17 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 09/21/2016 12:24 AM, Alex Williamson wrote:
>>> On Tue, 20 Sep 2016 10:50:47 +0800
>>> Jike Song <jike.song@intel.com> wrote:  
>>
>> /* trim the quotations */
>>
>>>> Even performing a lightweight sanity check, would require vfio-mdev
>>>> to be able to decode the ppos into a particular region, that means
>>>> information of all regions should be stored in the framework. I guess
>>>> it is not your preferred way :)  
>>>
>>> There's certainly a trade-off there, we don't support dynamic regions,
>>> the user expects them to be stable and the mdev-core code can expect
>>> that also. It might simplify the vendor drivers slightly if the core
>>> could perform such a basic sanity test, but the cost to do so would be
>>> that the core needs to have an understanding of the region layout of
>>> the device.  
>>
>> I agree with why the requirement is, but I am suspicious that,
>> if we assume the regions are stable, try to encode/decode that within
>> the mdev-core framework - instead of vendor drivers - that is because
>> we want mdev to be API compatible with vfio-pci?
>>
>> Being API compatible with vfio-pci is (IMHO) the most beautiful thing
>> in current mdev design, but is it necessary to make it mandatory? 
>> How about letting the underlining vendor drivers to decide whether
>> it is API compatible with vfio-pci, or will have a different set of
>> userspace API?
> 
> Are you assuming that I'm suggesting using VFIO_PCI_OFFSET_TO_INDEX in
> the mdev core?  We've been through that, I've rejected it, that's not
> at all what I'm describing.  The vfio bus driver defines the region
> layout, but once defined it is fixed for a given device instance.  A
> user does not need to call ioctl(VFIO_DEVICE_GET_REGION_INFO) prior to
> every region access to make sure the region offsets haven't changed
> dynamically.  If it's fixed to the user then it's also fixed to the
> mdev core for a given device instance, so nothing prevents the core
> code from doing its own enumeration of the region offsets and sizes and
> caching them into data structures.  That has nothing whatsoever to do
> with vfio-pci and makes no assumptions about the layout of regions
> within device fd. Thanks,
> 

I misunderstood that previously and I understand the whole idea now.
Thanks for the kind explanation! :)

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-29  2:17     ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-29  2:17 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, Xiao, Guangrong

+Guangrong

On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
> Mediated device only uses IOMMU APIs, the underlying hardware can be
> managed by an IOMMU domain.
> 
> Aim of this change is:
> - To use most of the code of TYPE1 IOMMU driver for mediated devices
> - To support direct assigned device and mediated device in single module
> 
> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
> IOMMU module that supports pinning and unpinning pages for mdev devices
> should provide these functions.
> Added APIs for pinning and unpinning pages to the VFIO module. These call
> back into the backend iommu module to actually pin and unpin pages.
> 
> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
> backend module. More details:
> - When iommu_group of mediated devices is attached, task structure is
>   cached which is used later to pin pages and page accounting.
> - It keeps track of pinned pages for mediated domain. This data is used to
>   verify unpinning request and to unpin remaining pages while detaching, if
>   there are any.
> - Used existing mechanism for page accounting. If an iommu capable domain
>   exists in the container then all pages are already pinned and accounted.
>   Accounting for the mdev device is only done if there is no iommu capable
>   domain in the container.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> Reviewed-on: http://git-master/r/1175707
> Reviewed-by: Automatic_Commit_Validation_User
> ---
>  drivers/vfio/vfio.c             | 117 ++++++++++
>  drivers/vfio/vfio_iommu_type1.c | 498 ++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h            |  13 +-
>  3 files changed, 580 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6fd6fa5469de..e3e342861e04 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1782,6 +1782,123 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
> +{
> +	struct vfio_device *device;
> +	struct vfio_group *group;
> +	int ret;
> +
> +	device = vfio_device_get_from_dev(dev);
> +	if (!device)
> +		return ERR_PTR(-EINVAL);
> +
> +	group = device->group;
> +	if (!atomic_inc_not_zero(&group->container_users)) {
> +		ret = -EINVAL;
> +		goto err_ret;
> +	}
> +
> +	if (group->noiommu) {
> +		atomic_dec(&group->container_users);
> +		ret = -EPERM;
> +		goto err_ret;
> +	}
> +
> +	if (!group->container->iommu_driver ||
> +	    !vfio_group_viable(group)) {
> +		atomic_dec(&group->container_users);
> +		ret = -EINVAL;
> +		goto err_ret;
> +	}
> +
> +	vfio_device_put(device);
> +	return group;
> +
> +err_ret:
> +	vfio_device_put(device);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for local
> + * domain only.
> + * @dev [in] : device
> + * @user_pfn [in]: array of user/guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @phys_pfn[out] : array of host PFNs
> + */
> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +		    long npage, int prot, unsigned long *phys_pfn)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!dev || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);
> +
> +	container = group->container;
> +	if (IS_ERR(container))
> +		return PTR_ERR(container);
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->pin_pages))
> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> +					     npage, prot, phys_pfn);
> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);
> +
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +/*
> + * Unpin set of host PFNs for local domain only.
> + * @dev [in] : device
> + * @pfn [in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + */
> +long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!dev || !pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);
> +
> +	container = group->container;
> +	if (IS_ERR(container))
> +		return PTR_ERR(container);
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->unpin_pages))
> +		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
> +					       npage);
> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_unpin_pages);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ba19424e4a1..d52d75fd0f04 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
>  
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*local_domain;
>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	bool			v2;
>  	bool			nesting;
>  };
>  
> +struct local_addr_space {
> +	struct task_struct	*task;
> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
> +};
> +
>  struct vfio_domain {
>  	struct iommu_domain	*domain;
>  	struct list_head	next;
>  	struct list_head	group_list;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
> +	struct local_addr_space	*local_addr_space;
>  };
>  
>  struct vfio_dma {
> @@ -83,6 +91,22 @@ struct vfio_group {
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_pfn {
> +	struct rb_node		node;
> +	unsigned long		vaddr;		/* virtual addr */
> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		pfn;		/* Host pfn */
> +	size_t			prot;
> +	atomic_t		ref_count;
> +};
> +
> +
> +#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
> +			 (list_empty(&iommu->domain_list) ? false : true)
> +
> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +154,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
> +				      unsigned long pfn)
> +{
> +	struct rb_node *node;
> +	struct vfio_pfn *vpfn, *ret = NULL;
> +
> +	node = domain->local_addr_space->pfn_list.rb_node;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (pfn < vpfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn > vpfn->pfn)
> +			node = node->rb_right;
> +		else {
> +			ret = vpfn;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_pfn *vpfn;
> +
> +	link = &domain->local_addr_space->pfn_list.rb_node;
> +	while (*link) {
> +		parent = *link;
> +		vpfn = rb_entry(parent, struct vfio_pfn, node);
> +
> +		if (new->pfn < vpfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
> +				dma_addr_t iova, unsigned long pfn, size_t prot)
> +{
> +	struct vfio_pfn *vpfn;
> +
> +	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
> +	if (!vpfn)
> +		return -ENOMEM;
> +
> +	vpfn->vaddr = vaddr;
> +	vpfn->iova = iova;
> +	vpfn->pfn = pfn;
> +	vpfn->prot = prot;
> +	atomic_set(&vpfn->ref_count, 1);
> +	vfio_link_pfn(domain, vpfn);
> +	return 0;
> +}
> +
> +static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
> +				      struct vfio_pfn *vpfn)
> +{
> +	vfio_unlink_pfn(domain, vpfn);
> +	kfree(vpfn);
> +}
> +
>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -150,17 +252,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>  	kfree(vwork);
>  }
>  
> -static void vfio_lock_acct(long npage)
> +static void vfio_lock_acct(struct task_struct *task, long npage)
>  {
>  	struct vwork *vwork;
>  	struct mm_struct *mm;
>  
> -	if (!current->mm || !npage)
> +	if (!task->mm || !npage)
>  		return; /* process exited or nothing to do */
>  
> -	if (down_write_trylock(&current->mm->mmap_sem)) {
> -		current->mm->locked_vm += npage;
> -		up_write(&current->mm->mmap_sem);
> +	if (down_write_trylock(&task->mm->mmap_sem)) {
> +		task->mm->locked_vm += npage;
> +		up_write(&task->mm->mmap_sem);
>  		return;
>  	}
>  
> @@ -172,7 +274,7 @@ static void vfio_lock_acct(long npage)
>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>  	if (!vwork)
>  		return;
> -	mm = get_task_mm(current);
> +	mm = get_task_mm(task);
>  	if (!mm) {
>  		kfree(vwork);
>  		return;
> @@ -228,20 +330,31 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> +
> +	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&local_mm->mmap_sem);
>  
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +362,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }
> @@ -259,8 +372,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long __vfio_pin_pages_remote(unsigned long vaddr, long npage,
> +				    int prot, unsigned long *pfn_base)
>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
> @@ -270,7 +383,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	if (!current->mm)
>  		return -ENODEV;
>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> @@ -285,7 +398,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  
>  	if (unlikely(disable_hugepages)) {
>  		if (!rsvd)
> -			vfio_lock_acct(1);
> +			vfio_lock_acct(current, 1);
>  		return 1;
>  	}
>  
> @@ -293,7 +406,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -313,13 +426,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	}
>  
>  	if (!rsvd)
> -		vfio_lock_acct(i);
> +		vfio_lock_acct(current, i);
>  
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long __vfio_unpin_pages_remote(unsigned long pfn, long npage, int prot,
> +				      bool do_accounting)
>  {
>  	unsigned long unlocked = 0;
>  	long i;
> @@ -328,7 +441,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
> -		vfio_lock_acct(-unlocked);
> +		vfio_lock_acct(current, -unlocked);
> +	return unlocked;
> +}
> +
> +static long __vfio_pin_pages_local(struct vfio_domain *domain,
> +				   unsigned long vaddr, int prot,
> +				   unsigned long *pfn_base,
> +				   bool do_accounting)
> +{
> +	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	bool lock_cap = capable(CAP_IPC_LOCK);
> +	long ret;
> +	bool rsvd;
> +	struct task_struct *task = domain->local_addr_space->task;
> +
> +	if (!task->mm)
> +		return -ENODEV;
> +
> +	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
> +	if (ret)
> +		return ret;
> +
> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> +
> +	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
> +		put_pfn(*pfn_base, prot);
> +		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> +			limit << PAGE_SHIFT);
> +		return -ENOMEM;
> +	}
> +
> +	if (!rsvd && do_accounting)
> +		vfio_lock_acct(task, 1);
> +
> +	return 1;
> +}
> +
> +static void __vfio_unpin_pages_local(struct vfio_domain *domain,
> +				     unsigned long pfn, int prot,
> +				     bool do_accounting)
> +{
> +	put_pfn(pfn, prot);
> +
> +	if (do_accounting)
> +		vfio_lock_acct(domain->local_addr_space->task, -1);
> +}
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
> +				 do_accounting);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count))
> +		vfio_remove_from_pfn_list(domain, vpfn);
> +
> +	return 1;
> +}
> +
> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
> +				       unsigned long *user_pfn,
> +				       long npage, int prot,
> +				       unsigned long *phys_pfn)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
> +	int i, j, ret;
> +	long retpage;
> +	unsigned long remote_vaddr;
> +	unsigned long *pfn = phys_pfn;
> +	struct vfio_dma *dma;
> +	bool do_accounting = false;
> +
> +	if (!iommu || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->local_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->local_domain;
> +
> +	/*
> +	 * If an iommu capable domain exists in the container then all pages are
> +	 * already pinned and accounted. Accounting should be done if there is no
> +	 * iommu capable domain in the container.
> +	 */
> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +		dma_addr_t iova;
> +
> +		iova = user_pfn[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_unwind;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
> +						 &pfn[i], do_accounting);

Hi Kirti,

Here you call __vfio_pin_pages_local() -> vaddr_get_pfn() -> GUP regardless
of whether the vaddr is already pinned. That probably means that if the
caller calls vfio_pin_pages() with the same GPA multiple times, you get
memory leaks.

GUP always increases the page refcnt.

FWIW, I would like to have the pfn_list implemented with key == iova, so
you can always try to find the PFN for a given iova, and pin it only if it
is not found.
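
Roughly what I have in mind -- just a sketch, where vfio_find_pfn_by_iova()
is a made-up helper that looks the entry up by iova in the same rb-tree:

	mutex_lock(&domain->local_addr_space->pfn_list_lock);
	p = vfio_find_pfn_by_iova(domain, iova);	/* hypothetical helper */
	if (p) {
		/* already pinned: take another reference, no extra GUP */
		atomic_inc(&p->ref_count);
		pfn[i] = p->pfn;
		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
		continue;
	}
	mutex_unlock(&domain->local_addr_space->pfn_list_lock);

	/* first pin of this iova: GUP and add it to the pfn_list */
	retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
					 &pfn[i], do_accounting);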

--
Thanks,
Jike


> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_unwind;
> +		}
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +			continue;
> +		}
> +
> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
> +					   pfn[i], prot);
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +		if (ret) {
> +			__vfio_unpin_pages_local(domain, pfn[i], prot,
> +						 do_accounting);
> +			goto pin_unwind;
> +		}
> +	}
> +
> +	ret = i;
> +	goto pin_done;
> +
> +pin_unwind:
> +	pfn[i] = 0;
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	for (j = 0; j < i; j++) {
> +		struct vfio_pfn *p;
> +
> +		p = vfio_find_pfn(domain, pfn[j]);
> +		if (p)
> +			vfio_unpin_pfn(domain, p, do_accounting);
> +
> +		pfn[j] = 0;
> +	}
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> +					 long npage)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +
> +	domain = iommu->local_domain;
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p)
> +			unlocked += vfio_unpin_pfn(domain, p, true);
> +
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +	}
>  
>  	return unlocked;
>  }
> @@ -341,6 +635,9 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		return;
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,15 +679,15 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += __vfio_unpin_pages_remote(phys >> PAGE_SHIFT,
> +						      unmapped >> PAGE_SHIFT,
> +						      dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
>  	}
>  
> -	vfio_lock_acct(-unlocked);
> +	vfio_lock_acct(current, -unlocked);
>  }
>  
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> @@ -611,10 +908,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)) {
> +		dma->size = size;
> +		goto map_done;
> +	}
> +
>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> +		npage = __vfio_pin_pages_remote(vaddr + dma->size,
> +						size >> PAGE_SHIFT, prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
>  			ret = (int)npage;
> @@ -624,7 +927,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>  		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> +			__vfio_unpin_pages_remote(pfn, npage, prot, true);
>  			break;
>  		}
>  
> @@ -635,6 +938,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (ret)
>  		vfio_remove_dma(iommu, dma);
>  
> +map_done:
>  	mutex_unlock(&iommu->lock);
>  	return ret;
>  }
> @@ -734,11 +1038,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)
> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group == iommu_group)
> +			return g;
> +	}
> +
> +	return NULL;
> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,10 +1063,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (find_iommu_group(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +	if (iommu->local_domain) {
> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
> @@ -769,6 +1090,33 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> +	    (bus == &mdev_bus_type)) {
> +		if (iommu->local_domain) {
> +			list_add(&group->next,
> +				 &iommu->local_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +
> +		domain->local_addr_space = kzalloc(sizeof(*domain->local_addr_space),
> +						   GFP_KERNEL);
> +		if (!domain->local_addr_space) {
> +			ret = -ENOMEM;
> +			goto out_free;
> +		}
> +
> +		domain->local_addr_space->task = current;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->local_addr_space->pfn_list = RB_ROOT;
> +		mutex_init(&domain->local_addr_space->pfn_list_lock);
> +		iommu->local_domain = domain;
> +		mutex_unlock(&iommu->lock);
> +		return 0;
> +	}
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1207,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +static void vfio_local_unpin_all(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	while ((node = rb_first(&domain->local_addr_space->pfn_list))) {
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +	}
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1228,52 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> +	if (iommu->local_domain) {
> +		domain = iommu->local_domain;
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
> +			list_del(&group->next);
> +			kfree(group);
>  
> +			if (list_empty(&domain->group_list)) {
> +				vfio_local_unpin_all(domain);
> +				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				kfree(domain);
> +				iommu->local_domain = NULL;
> +			}
> +			goto detach_group_done;
> +		}
> +	}
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto detach_group_done;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
>  			iommu_detach_group(domain->domain, iommu_group);
>  			list_del(&group->next);
>  			kfree(group);
>  			/*
>  			 * Group ownership provides privilege, if the group
>  			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> +			 * last domain with an iommu and the local domain doesn't
> +			 * exist, then all the mappings go away too.
>  			 */
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				if (list_is_singular(&iommu->domain_list) &&
> +				   (!iommu->local_domain))
>  					vfio_iommu_unmap_unpin_all(iommu);
>  				iommu_domain_free(domain->domain);
>  				list_del(&domain->next);
>  				kfree(domain);
>  			}
> -			goto done;
> +			break;
>  		}
>  	}
>  
> -done:
> +detach_group_done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -924,27 +1305,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	return iommu;
>  }
>  
> +static void vfio_release_domain(struct vfio_domain *domain)
> +{
> +	struct vfio_group *group, *group_tmp;
> +
> +	list_for_each_entry_safe(group, group_tmp,
> +				 &domain->group_list, next) {
> +		if (!domain->local_addr_space)
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +	}
> +
> +	if (domain->local_addr_space)
> +		vfio_local_unpin_all(domain);
> +	else
> +		iommu_domain_free(domain->domain);
> +}
> +
>  static void vfio_iommu_type1_release(void *iommu_data)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
>  	struct vfio_domain *domain, *domain_tmp;
> -	struct vfio_group *group, *group_tmp;
> +
> +	if (iommu->local_domain) {
> +		vfio_release_domain(iommu->local_domain);
> +		kfree(iommu->local_domain);
> +		iommu->local_domain = NULL;
> +	}
>  
>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto release_exit;
> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
> -		list_for_each_entry_safe(group, group_tmp,
> -					 &domain->group_list, next) {
> -			iommu_detach_group(domain->domain, group->iommu_group);
> -			list_del(&group->next);
> -			kfree(group);
> -		}
> -		iommu_domain_free(domain->domain);
> +		vfio_release_domain(domain);
>  		list_del(&domain->next);
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> @@ -1048,6 +1450,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.ioctl		= vfio_iommu_type1_ioctl,
>  	.attach_group	= vfio_iommu_type1_attach_group,
>  	.detach_group	= vfio_iommu_type1_detach_group,
> +	.pin_pages	= vfio_iommu_type1_pin_pages,
> +	.unpin_pages	= vfio_iommu_type1_unpin_pages,
>  };
>  
>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b1cd34..0bd25ba6223d 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -17,6 +17,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> +#include <linux/mdev.h>
>  
>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
> @@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
>  					struct iommu_group *group);
>  	void		(*detach_group)(void *iommu_data,
>  					struct iommu_group *group);
> -
> +	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
> +				     long npage, int prot,
> +				     unsigned long *phys_pfn);
> +	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
> +				       long npage);
>  };
>  
>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +			   long npage, int prot, unsigned long *phys_pfn);
> +
> +extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
> +			     long npage);
> +
>  /*
>   * IRQfd - generic
>   */
> 


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-09-29  2:17     ` [Qemu-devel] " Jike Song
@ 2016-09-29 15:06       ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-29 15:06 UTC (permalink / raw)
  To: Jike Song
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, Xiao, Guangrong



On 9/29/2016 7:47 AM, Jike Song wrote:
> +Guangrong
> 
> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:

...

>> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
>> +				       unsigned long *user_pfn,
>> +				       long npage, int prot,
>> +				       unsigned long *phys_pfn)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain;
>> +	int i, j, ret;
>> +	long retpage;
>> +	unsigned long remote_vaddr;
>> +	unsigned long *pfn = phys_pfn;
>> +	struct vfio_dma *dma;
>> +	bool do_accounting = false;
>> +
>> +	if (!iommu || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	if (!iommu->local_domain) {
>> +		ret = -EINVAL;
>> +		goto pin_done;
>> +	}
>> +
>> +	domain = iommu->local_domain;
>> +
>> +	/*
>> +	 * If iommu capable domain exist in the container then all pages are
>> +	 * already pinned and accounted. Accouting should be done if there is no
>> +	 * iommu capable domain in the container.
>> +	 */
>> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +		dma_addr_t iova;
>> +
>> +		iova = user_pfn[i] << PAGE_SHIFT;
>> +
>> +		dma = vfio_find_dma(iommu, iova, 0);
>> +		if (!dma) {
>> +			ret = -EINVAL;
>> +			goto pin_unwind;
>> +		}
>> +
>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
>> +
>> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
>> +						 &pfn[i], do_accounting);
> 
> Hi Kirti,
> 
> Here you call __vfio_pin_pages_local() -> vaddr_get_pfn() -> GUP regardless
> of whether the vaddr is already pinned. That probably means that if the
> caller calls vfio_pin_pages() with the same GPA multiple times, you get
> memory leaks.
> 
> GUP always increases the page refcnt.
> 
> FWIW, I would like to have the pfn_list implemented with key == iova, so
> you can always try to find the PFN for a given iova, and pin it only if it
> is not found.
> 

I didn't get how there would be a memory leak.

Right, GUP increases the refcnt, so if vfio_pin_pages() is called
multiple times for the same GPA, the refcnt would be incremented. In
vfio_iommu_type1_pin_pages() the pinned pages list is maintained with a
ref_count. If the pfn is already in the list, ref_count is incremented and
the same ref_count is used while unpinning pages.
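
Stripped of locking and error handling, the bookkeeping in
vfio_iommu_type1_pin_pages() roughly amounts to:

	/* every pin request does a GUP first */
	retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
					 &pfn[i], do_accounting);

	p = vfio_find_pfn(domain, pfn[i]);
	if (p)
		atomic_inc(&p->ref_count);	/* same pfn pinned again */
	else
		vfio_add_to_pfn_list(domain, remote_vaddr, iova,
				     pfn[i], prot);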

Kirti




^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-09-29 15:06       ` [Qemu-devel] " Kirti Wankhede
@ 2016-09-30  2:58         ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-30  2:58 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, Xiao, Guangrong, kvm, qemu-devel,
	alex.williamson, kraxel, pbonzini, bjsdjshi

On 09/29/2016 11:06 PM, Kirti Wankhede wrote:
> 
> 
> On 9/29/2016 7:47 AM, Jike Song wrote:
>> +Guangrong
>>
>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
> 
> ...
> 
>>> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
>>> +				       unsigned long *user_pfn,
>>> +				       long npage, int prot,
>>> +				       unsigned long *phys_pfn)
>>> +{
>>> +	struct vfio_iommu *iommu = iommu_data;
>>> +	struct vfio_domain *domain;
>>> +	int i, j, ret;
>>> +	long retpage;
>>> +	unsigned long remote_vaddr;
>>> +	unsigned long *pfn = phys_pfn;
>>> +	struct vfio_dma *dma;
>>> +	bool do_accounting = false;
>>> +
>>> +	if (!iommu || !user_pfn || !phys_pfn)
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&iommu->lock);
>>> +
>>> +	if (!iommu->local_domain) {
>>> +		ret = -EINVAL;
>>> +		goto pin_done;
>>> +	}
>>> +
>>> +	domain = iommu->local_domain;
>>> +
>>> +	/*
>>> +	 * If iommu capable domain exist in the container then all pages are
>>> +	 * already pinned and accounted. Accouting should be done if there is no
>>> +	 * iommu capable domain in the container.
>>> +	 */
>>> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
>>> +
>>> +	for (i = 0; i < npage; i++) {
>>> +		struct vfio_pfn *p;
>>> +		dma_addr_t iova;
>>> +
>>> +		iova = user_pfn[i] << PAGE_SHIFT;
>>> +
>>> +		dma = vfio_find_dma(iommu, iova, 0);
>>> +		if (!dma) {
>>> +			ret = -EINVAL;
>>> +			goto pin_unwind;
>>> +		}
>>> +
>>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
>>> +
>>> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
>>> +						 &pfn[i], do_accounting);
>>
>> Hi Kirti,
>>
>> Here you call __vfio_pin_pages_local() -> vaddr_get_pfn() -> GUP regardless
>> of whether the vaddr is already pinned. That probably means that if the
>> caller calls vfio_pin_pages() with the same GPA multiple times, you get
>> memory leaks.
>>
>> GUP always increases the page refcnt.
>>
>> FWIW, I would like to have the pfn_list implemented with key == iova, so
>> you can always try to find the PFN for a given iova, and pin it only if it
>> is not found.
>>
> 
> I didn't get how there would be a memory leak.
> 
> Right, GUP increases the refcnt, so if vfio_pin_pages() is called
> multiple times for the same GPA, the refcnt would be incremented. In
> vfio_iommu_type1_pin_pages() the pinned pages list is maintained with a
> ref_count. If the pfn is already in the list, ref_count is incremented and
> the same ref_count is used while unpinning pages.
> 

Let's have a close look at vfio_unpin_pfn:

	static int vfio_unpin_pfn(struct vfio_domain *domain,
				  struct vfio_pfn *vpfn, bool do_accounting)
	{
		__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
					 do_accounting);

		if (atomic_dec_and_test(&vpfn->ref_count))
			vfio_remove_from_pfn_list(domain, vpfn);

		return 1;
	}

Here you didn't call __vfio_unpin_pages_local() -- and thereby put_page() --
vpfn->ref_count times. If page->_refcount was increased by GUP (N) times,
here you only bring it back down to (N-1).
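
If a single teardown call is expected to undo all N pins, something along
these lines would be needed (untested sketch, ignoring locking):

	static int vfio_unpin_pfn_all(struct vfio_domain *domain,
				      struct vfio_pfn *vpfn, bool do_accounting)
	{
		int refs = atomic_read(&vpfn->ref_count);
		int i;

		/* drop one page reference per earlier GUP */
		for (i = 0; i < refs; i++)
			__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
						 do_accounting);

		vfio_remove_from_pfn_list(domain, vpfn);
		return refs;
	}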

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-09-30  2:58         ` [Qemu-devel] " Jike Song
@ 2016-09-30  3:10           ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-09-30  3:10 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, Xiao, Guangrong

On 09/30/2016 10:58 AM, Jike Song wrote:
> On 09/29/2016 11:06 PM, Kirti Wankhede wrote:
>>
>>
>> On 9/29/2016 7:47 AM, Jike Song wrote:
>>> +Guangrong
>>>
>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
>>
>> ...
>>
>>>> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
>>>> +				       unsigned long *user_pfn,
>>>> +				       long npage, int prot,
>>>> +				       unsigned long *phys_pfn)
>>>> +{
>>>> +	struct vfio_iommu *iommu = iommu_data;
>>>> +	struct vfio_domain *domain;
>>>> +	int i, j, ret;
>>>> +	long retpage;
>>>> +	unsigned long remote_vaddr;
>>>> +	unsigned long *pfn = phys_pfn;
>>>> +	struct vfio_dma *dma;
>>>> +	bool do_accounting = false;
>>>> +
>>>> +	if (!iommu || !user_pfn || !phys_pfn)
>>>> +		return -EINVAL;
>>>> +
>>>> +	mutex_lock(&iommu->lock);
>>>> +
>>>> +	if (!iommu->local_domain) {
>>>> +		ret = -EINVAL;
>>>> +		goto pin_done;
>>>> +	}
>>>> +
>>>> +	domain = iommu->local_domain;
>>>> +
>>>> +	/*
>>>> +	 * If iommu capable domain exist in the container then all pages are
>>>> +	 * already pinned and accounted. Accouting should be done if there is no
>>>> +	 * iommu capable domain in the container.
>>>> +	 */
>>>> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
>>>> +
>>>> +	for (i = 0; i < npage; i++) {
>>>> +		struct vfio_pfn *p;
>>>> +		dma_addr_t iova;
>>>> +
>>>> +		iova = user_pfn[i] << PAGE_SHIFT;
>>>> +
>>>> +		dma = vfio_find_dma(iommu, iova, 0);
>>>> +		if (!dma) {
>>>> +			ret = -EINVAL;
>>>> +			goto pin_unwind;
>>>> +		}
>>>> +
>>>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
>>>> +
>>>> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
>>>> +						 &pfn[i], do_accounting);
>>>
>>> Hi Kirti,
>>>
>>> Here you call __vfio_pin_pages_local() > vaddr_get_pfn() > GUP regardless
>>> whether the vaddr already pinned or not. That probably means, if the caller 
>>> calls vfio_pin_pages() with a GPA for multiple times, you get memory leaks.
>>>
>>> GUP always increases the page refcnt.
>>>
>>> FWIW, I would like to have the pfn_list_lock implemented with key == iova,
>>> so you can always try to find the PFN for a given iova, and pin it only if
>>> not found.
>>>
>>
>> I didn't get how there would be a memory leak.
>>
>> Right, GUP increases refcnt, so if vfio_pin_pages() is called for
>> multiple types for same GPA, refcnt would be incremented. In
>> vfio_iommu_type1_pin_pages() pinned pages list is maintained with
>> ref_count. If pfn is already in list, ref_count is incremented and same
>> is used while unpining pages.
>>
> 
> Let's have a close look at vfio_unpin_pfn:
> 
> 	static int vfio_unpin_pfn(struct vfio_domain *domain,
> 				  struct vfio_pfn *vpfn, bool do_accounting)
> 	{
> 		__vfio_unpin_pages_for_mdev(domain, vpfn->pfn, vpfn->prot,
> 					    do_accounting);
> 
> 		if (atomic_dec_and_test(&vpfn->ref_count))
> 			vfio_remove_from_pfn_list(domain, vpfn);
> 
> 		return 1;
> 	}
> 
> Here you didn't call __vfio_unpin_pages_for_mdev -- thereby put_page -- for
> vpfn->ref_count times. If page->_refcount increased by GUP for (N) times, here
> you only set it back to (N-1).
> 

What's more, since all pinned {iova, pfn} pairs are already saved, it's
better to consult that list before calling GUP, which does get_page()
unconditionally.
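
Something along these lines -- a rough sketch only, where
vfio_find_vpfn_by_iova() is a made-up helper and the other names follow the
patch:

	static long vfio_pin_one_iova(struct vfio_domain *domain, dma_addr_t iova,
				      unsigned long vaddr, int prot,
				      unsigned long *pfn, bool do_accounting)
	{
		struct vfio_pfn *vpfn = vfio_find_vpfn_by_iova(domain, iova);

		if (vpfn) {
			/* already pinned: reuse the saved pfn, no extra get_page() */
			atomic_inc(&vpfn->ref_count);
			*pfn = vpfn->pfn;
			return 1;
		}

		/* first pin for this iova: GUP once, then record {iova, pfn} */
		return __vfio_pin_pages_local(domain, vaddr, prot, pfn, do_accounting);
	}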

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-09-30  3:10           ` [Qemu-devel] " Jike Song
@ 2016-09-30 11:44             ` Kirti Wankhede
  -1 siblings, 0 replies; 162+ messages in thread
From: Kirti Wankhede @ 2016-09-30 11:44 UTC (permalink / raw)
  To: Jike Song
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, Xiao, Guangrong



On 9/30/2016 8:40 AM, Jike Song wrote:
> On 09/30/2016 10:58 AM, Jike Song wrote:
>> On 09/29/2016 11:06 PM, Kirti Wankhede wrote:
>>>
>>>
>>> On 9/29/2016 7:47 AM, Jike Song wrote:
>>>> +Guangrong
>>>>
>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
>>>
>>> ...
>>>
>>>>> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>> +				       unsigned long *user_pfn,
>>>>> +				       long npage, int prot,
>>>>> +				       unsigned long *phys_pfn)
>>>>> +{
>>>>> +	struct vfio_iommu *iommu = iommu_data;
>>>>> +	struct vfio_domain *domain;
>>>>> +	int i, j, ret;
>>>>> +	long retpage;
>>>>> +	unsigned long remote_vaddr;
>>>>> +	unsigned long *pfn = phys_pfn;
>>>>> +	struct vfio_dma *dma;
>>>>> +	bool do_accounting = false;
>>>>> +
>>>>> +	if (!iommu || !user_pfn || !phys_pfn)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	mutex_lock(&iommu->lock);
>>>>> +
>>>>> +	if (!iommu->local_domain) {
>>>>> +		ret = -EINVAL;
>>>>> +		goto pin_done;
>>>>> +	}
>>>>> +
>>>>> +	domain = iommu->local_domain;
>>>>> +
>>>>> +	/*
>>>>> +	 * If iommu capable domain exist in the container then all pages are
>>>>> +	 * already pinned and accounted. Accouting should be done if there is no
>>>>> +	 * iommu capable domain in the container.
>>>>> +	 */
>>>>> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
>>>>> +
>>>>> +	for (i = 0; i < npage; i++) {
>>>>> +		struct vfio_pfn *p;
>>>>> +		dma_addr_t iova;
>>>>> +
>>>>> +		iova = user_pfn[i] << PAGE_SHIFT;
>>>>> +
>>>>> +		dma = vfio_find_dma(iommu, iova, 0);
>>>>> +		if (!dma) {
>>>>> +			ret = -EINVAL;
>>>>> +			goto pin_unwind;
>>>>> +		}
>>>>> +
>>>>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
>>>>> +
>>>>> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
>>>>> +						 &pfn[i], do_accounting);
>>>>
>>>> Hi Kirti,
>>>>
>>>> Here you call __vfio_pin_pages_local() > vaddr_get_pfn() > GUP regardless
>>>> whether the vaddr already pinned or not. That probably means, if the caller 
>>>> calls vfio_pin_pages() with a GPA for multiple times, you get memory leaks.
>>>>
>>>> GUP always increases the page refcnt.
>>>>
>>>> FWIW, I would like to have the pfn_list_lock implemented with key == iova,
>>>> so you can always try to find the PFN for a given iova, and pin it only if
>>>> not found.
>>>>
>>>
>>> I didn't get how there would be a memory leak.
>>>
>>> Right, GUP increases refcnt, so if vfio_pin_pages() is called for
>>> multiple types for same GPA, refcnt would be incremented. In
>>> vfio_iommu_type1_pin_pages() pinned pages list is maintained with
>>> ref_count. If pfn is already in list, ref_count is incremented and same
>>> is used while unpining pages.
>>>
>>
>> Let's have a close look at vfio_unpin_pfn:
>>
>> 	static int vfio_unpin_pfn(struct vfio_domain *domain,
>> 				  struct vfio_pfn *vpfn, bool do_accounting)
>> 	{
>> 		__vfio_unpin_pages_for_mdev(domain, vpfn->pfn, vpfn->prot,
>> 					    do_accounting);
>>
>> 		if (atomic_dec_and_test(&vpfn->ref_count))
>> 			vfio_remove_from_pfn_list(domain, vpfn);
>>
>> 		return 1;
>> 	}
>>
>> Here you didn't call __vfio_unpin_pages_for_mdev -- thereby put_page -- for
>> vpfn->ref_count times. If page->_refcount increased by GUP for (N) times, here
>> you only set it back to (N-1).
>>

A user of vfio_pin_pages() should also call vfio_unpin_pages(), so here
we unpin it once. If vfio_pin_pages() is called twice for the same page,
we should get vfio_unpin_pages() twice for the same page.

If users of these APIs don't follow this, then
vfio_release_domain() -> vfio_local_unpin_all() takes care of unpinning,
decrementing ref_count and deleting the node on (ref_count == 0) for all
remaining pfns.

> 
> What's more, since all pinned {iova, pfn} pairs are already saved, it's
> better to consult that list before calling GUP, which does get_page()
> unconditionally.

The pfn is required to unpin a page, so we use the pfn as the key for the
rbtree. vfio_pin_pages() is called with user_pfn, i.e. an iova, which can't
be used to search the rbtree in an optimized way. The raw way would be to
visit each node of the rbtree and check its iova, which would hamper
performance if this is called in a performance-critical path.
So the optimized way here is to first pin the page, get its pfn and check
whether it already exists in the rbtree. If it exists, increment ref_count;
else add it to the rbtree.
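
In code, the flow above looks roughly like this (condensed sketch;
vfio_find_pfn() and vfio_add_to_pfn_list() stand in for the pfn-keyed rbtree
helpers, exact names and signatures may differ):

	/* pin first to learn the pfn, then use the pfn as the rbtree key */
	retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot, &pfn[i],
					 do_accounting);
	if (retpage <= 0) {
		ret = retpage;
		goto pin_unwind;
	}

	p = vfio_find_pfn(domain, pfn[i]);
	if (p)
		atomic_inc(&p->ref_count);	/* already tracked: just bump */
	else
		vfio_add_to_pfn_list(domain, iova, pfn[i], prot);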

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v7 3/4] vfio iommu: Add support for mediated devices
  2016-09-30 11:44             ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-08  7:09               ` Jike Song
  -1 siblings, 0 replies; 162+ messages in thread
From: Jike Song @ 2016-10-08  7:09 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, Xiao, Guangrong

On 09/30/2016 07:44 PM, Kirti Wankhede wrote:
> On 9/30/2016 8:40 AM, Jike Song wrote:
>> On 09/30/2016 10:58 AM, Jike Song wrote:
>>> On 09/29/2016 11:06 PM, Kirti Wankhede wrote:
>>>>
>>>>
>>>> On 9/29/2016 7:47 AM, Jike Song wrote:
>>>>> +Guangrong
>>>>>
>>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
>>>>
>>>> ...
>>>>
>>>>>> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>>> +				       unsigned long *user_pfn,
>>>>>> +				       long npage, int prot,
>>>>>> +				       unsigned long *phys_pfn)
>>>>>> +{
>>>>>> +	struct vfio_iommu *iommu = iommu_data;
>>>>>> +	struct vfio_domain *domain;
>>>>>> +	int i, j, ret;
>>>>>> +	long retpage;
>>>>>> +	unsigned long remote_vaddr;
>>>>>> +	unsigned long *pfn = phys_pfn;
>>>>>> +	struct vfio_dma *dma;
>>>>>> +	bool do_accounting = false;
>>>>>> +
>>>>>> +	if (!iommu || !user_pfn || !phys_pfn)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	mutex_lock(&iommu->lock);
>>>>>> +
>>>>>> +	if (!iommu->local_domain) {
>>>>>> +		ret = -EINVAL;
>>>>>> +		goto pin_done;
>>>>>> +	}
>>>>>> +
>>>>>> +	domain = iommu->local_domain;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * If iommu capable domain exist in the container then all pages are
>>>>>> +	 * already pinned and accounted. Accouting should be done if there is no
>>>>>> +	 * iommu capable domain in the container.
>>>>>> +	 */
>>>>>> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
>>>>>> +
>>>>>> +	for (i = 0; i < npage; i++) {
>>>>>> +		struct vfio_pfn *p;
>>>>>> +		dma_addr_t iova;
>>>>>> +
>>>>>> +		iova = user_pfn[i] << PAGE_SHIFT;
>>>>>> +
>>>>>> +		dma = vfio_find_dma(iommu, iova, 0);
>>>>>> +		if (!dma) {
>>>>>> +			ret = -EINVAL;
>>>>>> +			goto pin_unwind;
>>>>>> +		}
>>>>>> +
>>>>>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
>>>>>> +
>>>>>> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
>>>>>> +						 &pfn[i], do_accounting);
>>>>>
>>>>> Hi Kirti,
>>>>>
>>>>> Here you call __vfio_pin_pages_local() > vaddr_get_pfn() > GUP regardless
>>>>> whether the vaddr already pinned or not. That probably means, if the caller 
>>>>> calls vfio_pin_pages() with a GPA for multiple times, you get memory leaks.
>>>>>
>>>>> GUP always increases the page refcnt.
>>>>>
>>>>> FWIW, I would like to have the pfn_list_lock implemented with key == iova,
>>>>> so you can always try to find the PFN for a given iova, and pin it only if
>>>>> not found.
>>>>>
>>>>
>>>> I didn't get how there would be a memory leak.
>>>>
>>>> Right, GUP increases refcnt, so if vfio_pin_pages() is called for
>>>> multiple types for same GPA, refcnt would be incremented. In
>>>> vfio_iommu_type1_pin_pages() pinned pages list is maintained with
>>>> ref_count. If pfn is already in list, ref_count is incremented and same
>>>> is used while unpining pages.
>>>>
>>>
>>> Let's have a close look at vfio_unpin_pfn:
>>>
>>> 	static int vfio_unpin_pfn(struct vfio_domain *domain,
>>> 				  struct vfio_pfn *vpfn, bool do_accounting)
>>> 	{
>>> 		__vfio_unpin_pages_for_mdev(domain, vpfn->pfn, vpfn->prot,
>>> 					    do_accounting);
>>>
>>> 		if (atomic_dec_and_test(&vpfn->ref_count))
>>> 			vfio_remove_from_pfn_list(domain, vpfn);
>>>
>>> 		return 1;
>>> 	}
>>>
>>> Here you didn't call __vfio_unpin_pages_for_mdev -- thereby put_page -- for
>>> vpfn->ref_count times. If page->_refcount increased by GUP for (N) times, here
>>> you only set it back to (N-1).
>>>
> 
> A user of vfio_pin_pages() should also call vfio_unpin_pages(), so here
> we unpin it once. If vfio_pin_pages() is called twice for the same page,
> we should get vfio_unpin_pages() twice for the same page.
>

If this is the deliberate design, why do you need a 'ref_count' at all? You
could simply drop the 'ref_count' and blame the caller for pinning and
unpinning a different number of times.

> If users of these APIs don't follow this, then
> vfio_release_domain() -> vfio_local_unpin_all() takes care of unpinning,
> decrementing ref_count and deleting the node on (ref_count == 0) for all
> remaining pfns.
>

Here you did pay attention to the "caller doesn't follow this" situation.
However, dealing with 'ref_count' in vfio-iommu alone is not enough: memory
is still leaked.

>>
>> What's more, since all pinned {iova, pfn} pairs are already saved, it's
>> better to consult that list before calling GUP, which does get_page()
>> unconditionally.
>
> The pfn is required to unpin a page, so we use the pfn as the key for the
> rbtree. vfio_pin_pages() is called with user_pfn, i.e. an iova, which can't
> be used to search the rbtree in an optimized way. The raw way would be to
> visit each node of the rbtree and check its iova, which would hamper
> performance if this is called in a performance-critical path.
> So the optimized way here is to first pin the page, get its pfn and check
> whether it already exists in the rbtree. If it exists, increment ref_count;
> else add it to the rbtree.
> 

Of course the pfn is required to unpin a page, I 100% agree. But that
doesn't change the argument: keyed by iova instead, you can still store
the pfn along with it.

By the way, calling GUP unconditionally hurts more than searching the rbtree.
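
To make that concrete, a sketch of such a tracking node -- field layout made
up for illustration, not the patch's actual struct:

	struct vfio_pfn {
		struct rb_node	node;		/* rbtree keyed by iova */
		dma_addr_t	iova;		/* lookup key */
		unsigned long	pfn;		/* kept for put_page() at unpin time */
		int		prot;
		atomic_t	ref_count;
	};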


--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 162+ messages in thread

end of thread, other threads:[~2016-10-08  7:12 UTC | newest]

Thread overview: 162+ messages
2016-08-25  3:53 [PATCH v7 0/4] Add Mediated device support Kirti Wankhede
2016-08-25  3:53 ` [Qemu-devel] " Kirti Wankhede
2016-08-25  3:53 ` [PATCH v7 1/4] vfio: Mediated device Core driver Kirti Wankhede
2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
2016-09-08  8:09   ` Jike Song
2016-09-08  8:09     ` [Qemu-devel] " Jike Song
2016-09-08  9:38     ` Neo Jia
2016-09-08  9:38       ` [Qemu-devel] " Neo Jia
2016-09-09  6:26       ` Jike Song
2016-09-09  6:26         ` [Qemu-devel] " Jike Song
2016-09-09 17:48     ` Kirti Wankhede
2016-09-09 17:48       ` [Qemu-devel] " Kirti Wankhede
2016-09-09 18:42       ` Alex Williamson
2016-09-09 18:42         ` [Qemu-devel] " Alex Williamson
2016-09-09 19:55         ` Kirti Wankhede
2016-09-09 19:55           ` [Qemu-devel] " Kirti Wankhede
2016-09-12  5:10           ` Jike Song
2016-09-12  5:10             ` [Qemu-devel] " Jike Song
2016-09-12  7:49             ` Kirti Wankhede
2016-09-12  7:49               ` [Qemu-devel] " Kirti Wankhede
2016-09-12 15:53               ` Alex Williamson
2016-09-12 15:53                 ` [Qemu-devel] " Alex Williamson
2016-09-19  7:08                 ` Jike Song
2016-09-19  7:08                   ` [Qemu-devel] " Jike Song
2016-09-19 17:29                 ` Kirti Wankhede
2016-09-19 17:29                   ` [Qemu-devel] " Kirti Wankhede
2016-09-19 18:11                   ` Alex Williamson
2016-09-19 18:11                     ` [Qemu-devel] " Alex Williamson
2016-09-19 20:09                     ` Kirti Wankhede
2016-09-19 20:09                       ` [Qemu-devel] " Kirti Wankhede
2016-09-19 20:59                       ` Alex Williamson
2016-09-20 12:48   ` Jike Song
2016-09-20 12:48     ` [Qemu-devel] " Jike Song
2016-08-25  3:53 ` [PATCH v7 2/4] vfio: VFIO driver for mediated devices Kirti Wankhede
2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
2016-08-25  9:22   ` Dong Jia
2016-08-25  9:22     ` [Qemu-devel] " Dong Jia
2016-08-26 14:13     ` Kirti Wankhede
2016-08-26 14:13       ` [Qemu-devel] " Kirti Wankhede
2016-09-08  2:38       ` Jike Song
2016-09-08  2:38         ` [Qemu-devel] " Jike Song
2016-09-19 18:22       ` Kirti Wankhede
2016-09-19 18:22         ` Kirti Wankhede
2016-09-19 18:36         ` Alex Williamson
2016-09-19 18:36           ` Alex Williamson
2016-09-19 19:13           ` Kirti Wankhede
2016-09-19 19:13             ` Kirti Wankhede
2016-09-19 20:03             ` Alex Williamson
2016-09-19 20:03               ` Alex Williamson
2016-09-20  2:50               ` Jike Song
2016-09-20 16:24                 ` Alex Williamson
2016-09-21  3:19                   ` Jike Song
2016-09-21  4:51                     ` Alex Williamson
2016-09-21  5:02                       ` Jike Song
2016-09-08  2:45     ` Jike Song
2016-09-08  2:45       ` [Qemu-devel] " Jike Song
2016-09-13  2:35       ` Jike Song
2016-09-13  2:35         ` [Qemu-devel] " Jike Song
2016-09-20  5:48         ` Dong Jia Shi
2016-09-20  5:48         ` [Qemu-devel] " Dong Jia Shi
2016-09-20  6:37           ` Jike Song
2016-09-20  6:37             ` [Qemu-devel] " Jike Song
2016-09-20 12:53   ` Jike Song
2016-09-20 12:53     ` [Qemu-devel] " Jike Song
2016-08-25  3:53 ` [PATCH v7 3/4] vfio iommu: Add support " Kirti Wankhede
2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
2016-08-25  7:29   ` Dong Jia
2016-08-25  7:29     ` [Qemu-devel] " Dong Jia
2016-08-26 13:50     ` Kirti Wankhede
2016-08-26 13:50       ` [Qemu-devel] " Kirti Wankhede
2016-09-29  2:17   ` Jike Song
2016-09-29  2:17     ` [Qemu-devel] " Jike Song
2016-09-29 15:06     ` Kirti Wankhede
2016-09-29 15:06       ` [Qemu-devel] " Kirti Wankhede
2016-09-30  2:58       ` Jike Song
2016-09-30  2:58         ` [Qemu-devel] " Jike Song
2016-09-30  3:10         ` Jike Song
2016-09-30  3:10           ` [Qemu-devel] " Jike Song
2016-09-30 11:44           ` Kirti Wankhede
2016-09-30 11:44             ` [Qemu-devel] " Kirti Wankhede
2016-10-08  7:09             ` Jike Song
2016-10-08  7:09               ` [Qemu-devel] " Jike Song
2016-08-25  3:53 ` [PATCH v7 4/4] docs: Add Documentation for Mediated devices Kirti Wankhede
2016-08-25  3:53   ` [Qemu-devel] " Kirti Wankhede
2016-09-03 16:40   ` Kirti Wankhede
2016-09-03 16:40     ` [Qemu-devel] " Kirti Wankhede
2016-08-30 16:16 ` [PATCH v7 0/4] Add Mediated device support Alex Williamson
2016-08-30 16:16   ` [Qemu-devel] " Alex Williamson
2016-08-31  6:12   ` Tian, Kevin
2016-08-31  6:12     ` [Qemu-devel] " Tian, Kevin
2016-08-31  7:04     ` Jike Song
2016-08-31  7:04       ` [Qemu-devel] " Jike Song
2016-08-31 15:48       ` Alex Williamson
2016-08-31 15:48         ` [Qemu-devel] " Alex Williamson
2016-09-01  4:09         ` Tian, Kevin
2016-09-01  4:09           ` [Qemu-devel] " Tian, Kevin
2016-09-01  4:10         ` Tian, Kevin
2016-09-01  4:10           ` [Qemu-devel] " Tian, Kevin
2016-09-01 18:22         ` Kirti Wankhede
2016-09-01 18:22           ` [Qemu-devel] " Kirti Wankhede
2016-09-01 20:01           ` Alex Williamson
2016-09-01 20:01             ` [Qemu-devel] " Alex Williamson
2016-09-02  6:17             ` Kirti Wankhede
2016-09-02  6:17               ` [Qemu-devel] " Kirti Wankhede
2016-09-01 16:47     ` Michal Privoznik
2016-09-01 16:59       ` Alex Williamson
2016-09-01 16:59         ` [Qemu-devel] " Alex Williamson
2016-09-02  4:48         ` Michal Privoznik
2016-09-02  5:21           ` Kirti Wankhede
2016-09-02 10:05             ` Paolo Bonzini
2016-09-02 17:15               ` Kirti Wankhede
2016-09-02 17:25                 ` Paolo Bonzini
2016-09-02 18:33                   ` Kirti Wankhede
2016-09-02 20:29                     ` [libvirt] " John Ferlan
2016-09-02 20:29                       ` [Qemu-devel] [libvirt] " John Ferlan
2016-09-03 16:31                       ` Kirti Wankhede
2016-09-03 16:31                         ` [Qemu-devel] " Kirti Wankhede
2016-09-06 17:54                         ` [libvirt] [Qemu-devel] " Alex Williamson
2016-09-06 17:54                           ` [Qemu-devel] [libvirt] " Alex Williamson
2016-09-02 21:48                     ` [Qemu-devel] " Paolo Bonzini
2016-09-03 11:56                       ` [libvirt] " John Ferlan
2016-09-03 11:56                         ` [Qemu-devel] [libvirt] " John Ferlan
2016-09-03 13:07                         ` [libvirt] [Qemu-devel] " Paolo Bonzini
2016-09-03 13:07                           ` [Qemu-devel] [libvirt] " Paolo Bonzini
2016-09-03 17:47                           ` Kirti Wankhede
2016-09-03 17:47                             ` [Qemu-devel] " Kirti Wankhede
2016-09-03 16:34                       ` [Qemu-devel] " Kirti Wankhede
2016-09-06 17:40                         ` Alex Williamson
2016-09-06 19:35                           ` Kirti Wankhede
2016-09-06 21:28                             ` Alex Williamson
2016-09-07  8:22                               ` Tian, Kevin
2016-09-07  8:22                                 ` Tian, Kevin
2016-09-07 16:00                                 ` Alex Williamson
2016-09-07 16:15                               ` Kirti Wankhede
2016-09-07 16:44                                 ` Alex Williamson
2016-09-07 18:06                                   ` Kirti Wankhede
2016-09-07 22:13                                     ` Alex Williamson
2016-09-08 18:48                                       ` Kirti Wankhede
2016-09-08 20:51                                         ` Alex Williamson
2016-09-07 18:17                                   ` Neo Jia
2016-09-07 18:27                                     ` Daniel P. Berrange
2016-09-07 18:32                                       ` Neo Jia
2016-09-07  6:48                           ` Tian, Kevin
2016-09-07  6:48                             ` Tian, Kevin
2016-09-02 20:19               ` [libvirt] " John Ferlan
2016-09-02 20:19                 ` [Qemu-devel] [libvirt] " John Ferlan
2016-09-02 21:44                 ` [libvirt] [Qemu-devel] " Paolo Bonzini
2016-09-02 21:44                   ` [Qemu-devel] [libvirt] " Paolo Bonzini
2016-09-02 23:57                   ` [libvirt] [Qemu-devel] " Laine Stump
2016-09-02 23:57                     ` [Qemu-devel] [libvirt] " Laine Stump
2016-09-03 16:49                     ` [libvirt] [Qemu-devel] " Kirti Wankhede
2016-09-03 16:49                       ` [Qemu-devel] [libvirt] " Kirti Wankhede
2016-09-05  7:52                     ` [libvirt] [Qemu-devel] " Paolo Bonzini
2016-09-05  7:52                       ` [Qemu-devel] [libvirt] " Paolo Bonzini
2016-09-03 11:57                   ` [libvirt] [Qemu-devel] " John Ferlan
2016-09-03 11:57                     ` [Qemu-devel] [libvirt] " John Ferlan
2016-09-05  7:54                     ` [libvirt] [Qemu-devel] " Paolo Bonzini
2016-09-05  7:54                       ` [Qemu-devel] [libvirt] " Paolo Bonzini
2016-09-02 17:55         ` [libvirt] [Qemu-devel] " Laine Stump
2016-09-02 17:55           ` [Qemu-devel] [libvirt] " Laine Stump
2016-09-02 19:15           ` Alex Williamson
2016-09-02 19:15             ` [Qemu-devel] " Alex Williamson
